ansaurus

Question

Performance of Occurrences of Substring in String

Answer 1

A:

As usual, it depends.

In theory, the best approach is probably to use suffix trees, but they only start making sense on very large strings. Suffix arrays are slightly harder to use, but make sense for smaller strings. Suffix sorting is what Burrows-Wheeler compressors such as bzip2 rely on; zlib's deflate, by contrast, finds repeated substrings with hash chains rather than suffix arrays. In either case, the algorithms are not straightforward, and need quite a bit of study to understand and to implement efficiently.
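To make the suffix-array idea a bit more concrete, here is a minimal sketch, not a production implementation. It assumes a deliberately naive construction (sorting suffix start positions by comparing substrings), and the class and method names (`SuffixArrayCounter`, `count`, `bound`) are made up for illustration. Once the array is built, every occurrence of a pattern sits in one contiguous block of the sorted suffixes, so counting comes down to two binary searches.

```java
import java.util.Arrays;

// Hypothetical helper: builds a suffix array once, then answers count queries.
class SuffixArrayCounter {
    private final String text;
    private final Integer[] suffixes; // suffix start positions in lexicographic order

    SuffixArrayCounter(String text) {
        this.text = text;
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Naive construction: O(n^2 log n) in the worst case.
        // Real implementations use specialised linear or O(n log n) algorithms.
        Arrays.sort(suffixes, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    // Compares the first pattern.length() characters of the suffix at 'start'
    // with the pattern; 0 means the suffix starts with the pattern.
    private int comparePrefix(int start, String pattern) {
        int end = Math.min(text.length(), start + pattern.length());
        return text.substring(start, end).compareTo(pattern);
    }

    // First index whose suffix compares >= pattern (or > pattern when strict).
    private int bound(String pattern, boolean strict) {
        int lo = 0;
        int hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = comparePrefix(suffixes[mid], pattern);
            if (cmp < 0 || (strict && cmp == 0)) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Counts all occurrences, including overlapping ones.
    int count(String pattern) {
        if (pattern.isEmpty()) {
            return 0;
        }
        return bound(pattern, true) - bound(pattern, false);
    }
}
```

The construction cost only pays off when the same text is queried many times, e.g. `new SuffixArrayCounter(text)` once followed by many `count(...)` calls; for a single search, a plain scan is simpler and usually faster.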

If you're just worried about programmer productivity and easily understood code, I guess it's hard to beat what you've written. Assuming a reasonably intelligent regex engine, it might be fast enough for normal use.

Hari 2010-08-27 09:57:28

Answer 2

+2 A:

There are some quite impressive substring search algorithms. The Boyer-Moore algorithm (http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm) is mentioned most often, but there are alternatives such as Knuth-Morris-Pratt (http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) and Rabin-Karp (http://en.wikipedia.org/wiki/Rabin-karp).
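As an illustration of one of these, here is a rough sketch of counting occurrences with Knuth-Morris-Pratt. It was picked only because it is the shortest of the three to write down, not because it is the fastest in practice. The class and method names (`KmpCount`, `buildFailureTable`, `count`) are invented for this example.

```java
// Hypothetical example class: counts occurrences using Knuth-Morris-Pratt.
class KmpCount {

    // failure[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] buildFailureTable(String pattern) {
        int[] failure = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = failure[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            failure[i] = k;
        }
        return failure;
    }

    // Counts all occurrences of pattern in text, including overlapping ones.
    static int count(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] failure = buildFailureTable(pattern);
        int count = 0;
        int k = 0; // number of pattern characters currently matched
        for (int i = 0; i < text.length(); i++) {
            // On a mismatch, fall back in the pattern instead of re-reading text.
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = failure[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                count++;
                k = failure[k - 1]; // continue so overlapping matches are found
            }
        }
        return count;
    }
}
```

The text index `i` never moves backwards, which is what gives KMP its linear running time in the combined length of text and pattern.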

Patrick 2010-08-27 10:02:42

+1 for Boyer-Moore. BTW, there was some buzz on the internets (Reddit perhaps) about BM, can't find the links. But google for it and you should see some animated discussions about it. Very useful.

Mikos 2010-08-27 12:04:06

Answer 3

+1 A:

Without the overhead of regular expressions:

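public class SubstringCount {

    public static void main(String[] args) {

        int count = 0;
        String s = "The cat sat on the mat";
        String substring = "at";

        // Find the first match, then keep searching one position past the last hit.
        // Advancing by 1 also counts overlapping matches;
        // use pos + substring.length() instead to skip them.
        int pos = s.indexOf(substring);
        while (pos > -1) {
            count++;
            pos = s.indexOf(substring, pos + 1);
        }

        System.out.println("Pattern: " + substring + " Count: " + count);
    }
}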

I did a quick test searching for "at" in the text of the Boyer–Moore string search algorithm article on Wikipedia. Both approaches find the same number of matches, but running each 10,000 times on my machine took the regex version 1702 milliseconds and this one just 192!
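For anyone who wants to repeat that kind of comparison, something along these lines should do. This is only a sketch: the regex version is an assumption about what the question's code looks like (it is not shown here), the sample text is a stand-in for the Wikipedia article, and the names `CountBenchmark`, `countWithIndexOf` and `countWithRegex` are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness comparing an indexOf loop with a regex-based count.
class CountBenchmark {

    // Counts occurrences (overlapping ones included) with String.indexOf.
    static int countWithIndexOf(String text, String substring) {
        if (substring.isEmpty()) {
            return 0; // an empty needle would otherwise loop forever
        }
        int count = 0;
        int pos = text.indexOf(substring);
        while (pos > -1) {
            count++;
            pos = text.indexOf(substring, pos + 1);
        }
        return count;
    }

    // Assumed regex variant: counts non-overlapping matches of the literal.
    static int countWithRegex(String text, String substring) {
        if (substring.isEmpty()) {
            return 0;
        }
        Matcher matcher = Pattern.compile(Pattern.quote(substring)).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The cat sat on the mat"; // stand-in for a longer document
        String substring = "at";
        int iterations = 10_000;

        long start = System.nanoTime();
        int total = 0;
        for (int i = 0; i < iterations; i++) {
            total += countWithRegex(text, substring);
        }
        long regexMillis = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            total += countWithIndexOf(text, substring);
        }
        long indexOfMillis = (System.nanoTime() - start) / 1_000_000;

        // 'total' is printed so the loops cannot be optimised away entirely.
        System.out.println("regex: " + regexMillis + " ms, indexOf: "
                + indexOfMillis + " ms (checksum " + total + ")");
    }
}
```

Two caveats: the two methods only agree when matches cannot overlap (for "aa" in "aaaa" the indexOf loop reports 3 and the regex loop 2), and a loop like this is a crude microbenchmark, so JIT warm-up and the order of the two runs will influence the numbers.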

Kwebble 2010-08-28 15:37:33

Hey great! Thanks a lot!

evermean 2010-08-29 09:36:26
