views:

59

answers:

3

Hey,

I came across the task to find all occurences of a substring in another string and was wondering what will be the best algorithm to solve this problem.

For demonstration purposes I used the string "The cat sat on the mat" and search for all occurences of the substring "at". This should ultimately result in an occurence count of 3. Since I'm programming in java at the moment the first thing that popped into my mind was this:

    public static void main(String[] args) {

      int count=0;
      String s = "The cat sat on the mat";

      Pattern pattern = Pattern.compile("at");
      Matcher matcher = pattern.matcher(s);
      while(matcher.find()){
          count++;
      }

      System.out.println("Pattern: "+pattern+" Count: "+count);
    }

Somehow I doubt that this is the optimal solution for this problem. So if someone knows how an optimal (or at least pretty good) solution should look ... please answer! You can post your answer in any language not necessarily java (altough that would be great :)).

Thanks a lot!

A: 

As usual, it depends.

The theoretic best approach is to probably to use suffix trees -- but they only start making sense on very large strings. Suffix arrays are slightly harder to use, but make sense for smaller strings. IIRC, the zlib deflate algorithm uses suffix arrays to find repeated substrings. In either case, the algorithms are not straightforward, and need quite a bit of study to understand and to implement efficiently.

If you're just worried about programmer productivity and easily understood code, I guess it's hard to beat what you've written. Assuming a reasonably intelligent regexp parser, it might be fast enough for normal use.

Hari
+2  A: 

There are quite some impressive substring algorithms. Often the Boyer-Moore algorithm (http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm) is mentioned, but there are other alternatives, like http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and http://en.wikipedia.org/wiki/Rabin-karp.

Patrick
+1 for Boyer-Moore. BTW, there was some buzz on the internets (Reddit perhaps) about BM, can't find the links. But google for it and you should see some animated discussions about it. Very useful.
Mikos
+1  A: 

Without the overhead of regular expressions:

public static void main(String[] args) {

    int count = 0;
    String s = "The cat sat on the mat";
    String substring = "at";

    int pos = s.indexOf(substring);
    while (pos > -1) {
        count++;
        pos = s.indexOf(substring, pos + 1);
    }

    System.out.println("Pattern: "+pattern+" Count: "+count);
}

I did a quick test searching for "at" in the text of the Boyer–Moore string search algorithm article on Wikipedia. They both find the same number of matches, but doing this 10.000 times on my machine took the regex algorithm 1702 milliseconds and this just 192!

Kwebble
Hey great! Thanks a lot!
evermean