views:

69

answers:

3

A legacy app program has a huge String Buffer (size sometimes upto an Mb) and it is processed sequentially for modifying the contents. I have to implement a change wherein I need to update the string buffer to remove some lines starting with certain specific words. What are the possible ways to implement this ?

Ex:

ABC:djfk kdjf kdsjfk#
ABC:jfue eijf iefe# 
DEL:kdjfi efe eei # 
DEL:ieeif dfddf dfdf#
HJU:heuir fwer ouier# 
ABC:dsf erereree ererre #

I need to delete lines starting with DEL. Splitting the string buffer to string, processing the lines and again joining the strings to create a string buffer would be a bit costly. Pls let me know the possible solutions.

Thanks

+1  A: 

Splitting the string buffer to string, processing the lines and again joining the strings to create a string buffer would be a bit costly.

Removing the lines would actually be much more costly, because you'd end up copying the rest of the buffer for each line you remove.

The fastest way would probably be java.util.matcher.replaceAll() to get a copy of the buffer without all the lines you don't want.

Michael Borgwardt
+2  A: 

It is possible to do this in-place efficiently. You'd have to overwrite the characters in the buffer at the proper intervals, and you'd then logically shorten the buffer with setLength. It's going to be quite complex, but it would be in-place and O(N).

The reason why you'd want to overwrite instead of using delete/insert is because that would be O(N^2) instead, because things need to be shifted around unnecessarily.

Doing this out-of-place is quite trivial and O(N) but would require a secondary buffer, doubling the space requirement.


Proof-of-concept

Here's a simple proof-of-concept. removeIntervals takes a StringBuffer and an int[][] intervals. Each int[] is assumed to be a pair of { start, end } values (half-open range, exclusive upper bound). In linear time and in-place, these intervals are removed from the StringBuffer by a simple overwrite. This works when intervals are sorted and non-overlapping, and processed left-to-right.

The buffer is then shortened with setLength, cutting off as many characters that were removed.

static void overwrite(StringBuffer sb, int dst, int srcFrom, int srcTo) {
    for (int i = srcFrom; i < srcTo; i++) {
        sb.setCharAt(dst++, sb.charAt(i));
    }
}
static int safeGet(int[][] arr, int index, int defaultValue) {
    return (index < arr.length) ? arr[index][0] : defaultValue;
}
static void removeIntervals(StringBuffer sb, int[][] intervals) {
    int dst = safeGet(intervals, 0, 0);
    int removed = 0;
    for (int i = 0; i < intervals.length; i++) {
        int start = intervals[i][0];
        int end   = intervals[i][1];
        int nextStart = safeGet(intervals, i+1, sb.length());
        overwrite(sb, dst, end, nextStart);
        removed += end - start;
        dst += nextStart - end;
    }
    sb.setLength(sb.length() - removed);
}

Then we can test it as follows:

    String text = "01234567890123456789";
    int[][][] tests = {
        { { 0, 5, },
        }, // simple test, removing prefix
        { { 1, 2, }, { 3, 4, }, { 5, 6, }
        }, // multiple infix removals
        { { 3, 7, }, { 18, 20, },
        }, // suffix removal
        {
        }, // no-op
        { { 0, 20 },
        }, // remove whole thing
        { { 7, 10 }, { 10, 13 }, {15, 15 }, 
        }, // adjacent intervals, empty intervals
    };

    for (int[][] test : tests) {
        StringBuffer sb = new StringBuffer(text);
        System.out.printf("> '%s'%n", sb);
        System.out.printf("- %s%n", java.util.Arrays.deepToString(test));
        removeIntervals(sb, test);
        System.out.printf("= '%s'%n%n", sb);
    }

This prints (as seen on ideone.com):

> '01234567890123456789'
- [[0, 5]]
= '567890123456789'

> '01234567890123456789'
- [[1, 2], [3, 4], [5, 6]]
= '02467890123456789'

> '01234567890123456789'
- [[3, 7], [18, 20]]
= '01278901234567'

> '01234567890123456789'
- []
= '01234567890123456789'

> '01234567890123456789'
- [[0, 20]]
= ''

> '01234567890123456789'
- [[7, 10], [10, 13], [15, 15]]
= '01234563456789'

Getting the intervals

In this specific case, the intervals can either be built in a preliminary pass (using indexOf), or the whole process can be done in one pass if absolutely required. The point is that this can definitely be done in-place in linear time (and if absolutely necessary, in a single-pass).


An out-of-place solution

This is out-of-place using a secondary buffer and regex. It's offered for consideration due to its simplicity. Unless further optimization is provably required (after evidentiary profiling results), this should be good enough:

    String text =
        "DEL: line1\n" +
        "KEP: line2\r\n" +
        "DEL: line3\n" +
        "KEP: line4\r" +
        "DEL: line5\r" +
        "DEL: line6\r" +
        "KEP: line7\n" +
        "DEL: line8";
    StringBuffer sb = new StringBuffer(text);
    Pattern delLine = Pattern.compile("(?m)^DEL:.*$");
    String cleanedUp = delLine.matcher(sb).replaceAll("<deleted!>");
    System.out.println(cleanedUp);

This prints (as seen on ideone.com):

<deleted!>
KEP: line2
<deleted!>
KEP: line4
<deleted!>
<deleted!>
KEP: line7
<deleted!>

References

polygenelubricants
+1  A: 

If the lines in the stringbuffer are separated by newlines, you could read it in and create a new buffer. For a 1 meg buffer this completes in tens of milliseconds and is faster than a Regex. You could create a custom version of StringReader to directly read the StringBuffer rather than convert to string to save a little bit more time.


final String NEWLINE = System.getProperty("line.separator");
StringBuffer nuBuffer = new StringBuffer();
BufferedReader br = new BufferedReader(new StringReader(sbData.toString()));
String line;
while ( (line = br.readLine()) != null) {
    if (!line.startsWith("DEL:")) {  // don't copy lines starting with DEL:
        nuBuffer.append(line).append(NEWLINE);
    }
}
br.close();