views:

108

answers:

3

What is the most efficient way to split a string by a very simple separator?

Some background:

I am porting a function I wrote in C with a bunch of pointer arithmetic to java and it is incredibly slow(After some optimisation still 5* slower). Having profiled it, it turns out a lot of that overhead is in String.split

The function in question takes a host name or ip address and makes it generic:

123.123.123.123->*.123.123.123

a.b.c.example.com->*.example.com

This can be run over several million items on a regular basis, so performance is an issue.

Edit: the rules for converting are thus:

  • If it's an ip address, replace the first part
  • Otherwise, find the main domain name, and make the preceding part generic.

foo.bar.com-> *.bar.com foo.bar.co.uk-> *.bar.co.uk

I have now rewritten using lastIndexOf and substring to work myself in from the back and the performance has improved by leaps and bounds.

I'll leave the question open for another 24 hours before settling on the best answer for future reference

Here's what I've come up with now(the ip part is an insignificant check before calling this function)

private static String hostConvert(String in) {
    final String [] subs = { "ac", "co", "com", "or", "org", "ne", "net", "ad", "gov", "ed" };

    int dotPos = in.lastIndexOf('.');
    if(dotPos == -1)
        return in;
    int prevDotPos = in.lastIndexOf('.', dotPos-1);
    if(prevDotPos == -1)
        return in;
    CharSequence cs = in.subSequence(prevDotPos+1, dotPos);
    for(String cur : subs) {
        if(cur.contentEquals(cs)) {
            int start = in.lastIndexOf('.', prevDotPos-1);
            if(start == -1 || start == 0)
                return in;
            return "*" + in.substring(start);
        }
    }

    return "*" + in.substring(prevDotPos);
}

If there's any space for further improvement it would be good to hear.

+2  A: 

I'd try using .indexOf("."), and .substring(index)

You didn't elaborate on the exact pattern you wanted to match but if you can avoid split(), it should cut down on the number of new strings it allocates (1 instead of several).

colithium
+2  A: 

It's unclear from your question exactly what the code is supposed to do. Does it find the first '.' and replace everything up to it with a '*'? Or is there some fancier logic behind it? Maybe everything up to the nth '.' gets replaced by '*'?

If you're trying to find an instance of a particular string, use something like the Boyer-Moore algorithm. It should be able to find the match for you and you can then replace what you want.

Keep in mind that String in Java is immutable. It might be faster to change the sequence in-place. Check out other CharSequence implementations to see what you can do, e.g. StringBuffer and CharBuffer. If concurrency is not needed, StringBuilder might be an option.

By using a mutable CharSequence instead of the methods on String, you avoid a bunch of object churn. If all you're doing is replacing some slice of the underlying character array with a shorter array (i.e. {'*'}), this is likely to yield a speedup since such array copies are fairly optimized. You'll still be doing an array copy at the end of the day, but it may be faster than new String allocations.

UPDATE

All the above is pretty much hogwash. Sure, maybe you can implement your own CharSequence that gives you better slicing and lazily resizes the array (aka doesn't actually truncate anything until it absolutely must), returning Strings based on offsets and whatnot. But StringBuffer and StringBuilder, at least directly, do not perform as well as the solution poly posted. CharBuffer is entirely inapplicable; I didn't realize it was an nio class earlier: it's meant for other things entirely.

There are some interesting things about poly's code, which I wonder whether he/she knew before posting it, namely that changing the "*" on the final lines of the methods to a '*' results in a significant slowdown.

Nevertheless, here is my benchmark. I found one small optimization: declaring the '.' and "*" expressions as constants adds a bit of a speedup as well as using a locally-scoped StringBuilder instead of the binary infix string concatenation operator.

I know the gc() is at best advisory and at worst a no-op, but I figured adding it with a bit of sleep time might let the VM do some cleanup after creating 1M Strings. Someone may correct me if this is totally naïve.

Simple Benchmark

import java.util.ArrayList;
import java.util.Arrays;

public class StringSplitters {
    private static final String PREFIX = "*";
    private static final char DOT = '.';

    public static String starOutFirst(String s) {
        final int k = s.indexOf(DOT);
        return PREFIX + s.substring(k);
    }

    public static String starOutFirstSb(String s) {
        StringBuilder sb = new StringBuilder();
        final int k = s.indexOf(DOT);
        return sb.append(PREFIX).append(s.substring(k)).toString();
    }

    public static void main(String[] args) throws InterruptedException {
        double[] firstRates = new double[10];
        double[] firstSbRates = new double[10];
        double firstAvg = 0;
        double firstSbAvg = 0;
        double firstMin = Double.POSITIVE_INFINITY;
        double firstMax = Double.NEGATIVE_INFINITY;

        double firstSbMin = Double.POSITIVE_INFINITY;
        double firstSbMax = Double.NEGATIVE_INFINITY;

        for (int i = 0; i < 10; i++) {
            firstRates[i] = testFirst();

            firstAvg += firstRates[i];

            if (firstRates[i] < firstMin)
                firstMin = firstRates[i];
            if (firstRates[i] > firstMax)
                firstMax = firstRates[i];

            Thread.sleep(100);
            System.gc();
            Thread.sleep(100);
        }

        firstAvg /= 10.0d;

        for (int i = 0; i < 10; i++) {
            firstSbRates[i] = testFirstSb();

            firstSbAvg += firstSbRates[i];

            if (firstSbRates[i] < firstSbMin)
                firstSbMin = firstSbRates[i];
            if (firstSbRates[i] > firstSbMax)
                firstSbMax = firstSbRates[i];

            Thread.sleep(100);
            System.gc();
            Thread.sleep(100);
        }

        firstSbAvg /= 10.0d;

        System.out.printf("First:\n\tMin:\t%07.3f\tMax:\t%07.3f\tAvg:\t%07.3f\n\tRates:\t%s\n\n", firstMin, firstMax,
                firstAvg, Arrays.toString(firstRates));
        System.out.printf("FirstSb:\n\tMin:\t%07.3f\tMax:\t%07.3f\tAvg:\t%07.3f\n\tRates:\t%s\n\n", firstSbMin,
                firstSbMax, firstSbAvg, Arrays.toString(firstSbRates));

    }

    private static double testFirst() {
        ArrayList<String> strings = new ArrayList<String>(1000000);

        for (int i = 0; i < 1000000; i++) {
            int first = (int) (Math.random() * 128);
            int second = (int) (Math.random() * 128);
            int third = (int) (Math.random() * 128);
            int fourth = (int) (Math.random() * 128);

            strings.add(String.format("%d.%d.%d.%d", first, second, third, fourth));
        }

        long before = System.currentTimeMillis();
        for (String s : strings) {
            starOutFirst(s);
        }
        long after = System.currentTimeMillis();

        return 1000000000.0d / (after - before);
    }

    private static double testFirstSb() {
        ArrayList<String> strings = new ArrayList<String>(1000000);

        for (int i = 0; i < 1000000; i++) {
            int first = (int) (Math.random() * 128);
            int second = (int) (Math.random() * 128);
            int third = (int) (Math.random() * 128);
            int fourth = (int) (Math.random() * 128);

            strings.add(String.format("%d.%d.%d.%d", first, second, third, fourth));
        }

        long before = System.currentTimeMillis();
        for (String s : strings) {
            starOutFirstSb(s);
        }
        long after = System.currentTimeMillis();

        return 1000000000.0d / (after - before);
    }
}

Output

First:
    Min:    3802281.369 Max:    5434782.609 Avg:    5185796.131
    Rates:  [3802281.3688212926, 5181347.150259067, 5291005.291005291, 5376344.086021505, 5291005.291005291, 5235602.094240838, 5434782.608695652, 5405405.405405405, 5434782.608695652, 5405405.405405405]

FirstSb:
    Min:    4587155.963 Max:    5747126.437 Avg:    5462087.511
    Rates:  [4587155.963302752, 5747126.436781609, 5617977.528089887, 5208333.333333333, 5681818.181818182, 5586592.17877095, 5586592.17877095, 5524861.878453039, 5524861.878453039, 5555555.555555556]
jasonmp85
I made the targets of my question a bit clearer.I got rid of all array usage, just extracting any necessary CharSequences for comparison.
juhanic
Oh and in case my inscrutable code is unclear, the point is the `starOutFirstSb` method which uses the local `StringBuilder` is fastest (probably because it's using a `StringBuilder` vs the `StringBuffer` the `+` uses). So if you can make sense of them, apply these findings to your other method as well.
jasonmp85
[Ropes for Java](http://ahmadsoft.org/ropes/) would be such a optimized `CharSequence` implementation.
Joachim Sauer
Yeah, my first naive implementation was using string.format for the outputs. Replacing thos with stringbuilders halved the time the program took. The other performance killer was the splits, which when replaced with lastIndexOf and substring cut down runtime by a ton.Now the routine works very near my original C implementation, though beating it would likely not be possible (due to all the pointer arithmetic and direct memory manipulations).
juhanic
+5  A: 

Something like this is about as fast as you can make it:

static String starOutFirst(String s) {
    final int K = s.indexOf('.');
    return "*" + s.substring(K);
}
static String starOutButLastTwo(String s) {
    final int K = s.lastIndexOf('.', s.lastIndexOf('.') - 1);
    return "*" + s.substring(K);
}

Then you can do:

    System.out.println(starOutFirst("123.123.123.123"));
    // prints "*.123.123.123"

    System.out.println(starOutButLastTwo("a.b.c.example.com"));
    // prints "*.example.com"

You may need to use regex to see which of the two method is applicable for any given string.

polygenelubricants
would a regex not lose any potential performance gains? As is I'm checking whether the last part is numeric(as no top level domain is) and if it isn't, pass it the function I added in above.Perhaps a regex would be faster than my comparison against the possible secondary top-level domains? something like "co[m]|or[g]|ne[t]|ed"I can't help but feel that it would be slower than a few trivial comparisons. I'll have to try and pre-compile and do some performance tests later.
juhanic
@juhanic: I have to admit that I'm not familiar with the specification. Perhaps checking just the last character and seeing if it's a digit is enough? I don't actually know what the legal "main domain names" are. I will have to look up the spec.
polygenelubricants
@juhanic: So is it the case that there are special suffixes where you only need one more part (e.g. `whatever.com`), and there are others where you need two (e.g. `something.co.jp`)? Can you basically decide if it's `.com`, take one more part, and if it's `.jp`, take two? And if so, how big are these sets? Is it basically, `.gov`, `.com`, `.org`, `.net`, and a handful others, take one, and everything else take two?
polygenelubricants
@polygenelubricants: unfortunately there are domains matching both something.co.jp and something.jp. Abstracting down to the final jp would give pretty useless data in the form of *.co.jp which tells nothing about the host. The purpose is for batch analysis of logs and storing "generic" results. 100 entries from *.host1.co.uk carries actually meaning while 100,000 entries from *.co.uk is pretty meaningless.I've tried to poke around regarding the secondary names that come after TLD's, but it seems to be that it is up to the place responsible for each one what they will allow.
juhanic
As expected, using a (precompiled) regex to check the TLD's was significantly slower
juhanic
@juhanic: I think a Patricia trie dictionary of suffixes may be fastest (http://en.wikipedia.org/wiki/Patricia_trie), but it'd be overly complicated for what shouldn't be such a demanding string manipulation routine. The dictionary would have the suffixes reversed, and given a url, you read one letter at a time from the end, i.e. instead of `.co.jp`, the trie contains `pj.oc.`
polygenelubricants