ansaurus

Question

Count Occurence of Needle String in Haystack String, most optimally?

Answer 1

+3 A:

How about:

boolean found = haystack.indexOf("ABC") >= 0;

**Edit - The question asks for number of occurences, so here's a modified version of the above:

public static void main(String[] args)
{
    String needle = "ABC";
    String haystack = "ABCDSGDABCSAGAABCCCCAAABAABC";

    int numberOfOccurences = 0;
    int index = haystack.indexOf(needle);
    while (index != -1)
    {
        numberOfOccurences++;
        haystack = haystack.substring(index+needle.length());
        index = haystack.indexOf(needle);
    }

    System.out.println("" + numberOfOccurences);
}

Chris Knight 2010-03-25 13:28:51

Should be done iteratively, each time starting at 1 beyond the last found pos (or "ABC".length() beyond index) to count the amount of occurrences (which is the question).

extraneon 2010-03-25 13:34:52

nah, that only says yes or no. I asked for count. and yes don't use String.Split("ABC")

Taranfx 2010-03-25 13:37:30

OK, that wasn't particularly clear from your question which simply asked to find the string "ABC" in a longer string.

Chris Knight 2010-03-25 13:45:24

Answer 2

A:

What about considering regex, Pattern and Matcher classes? Check this tutorial

Dan 2010-03-25 13:28:53

Answer 3

+2 A:

If you're looking for an algorithm, google for "Boyer-Moore". You can do this in sub-linear time.

edit to clarify and hopefully make all the purists happy: the time bound on Boyer-Moore is, formally speaking, linear. However the effective performance is often such that you do many fewer comparisons than you would with a simpler approach, and in particular you can often skip through the "haystack" string without having to check each character.

Pointy 2010-03-25 13:30:26

Boyer-Moore is O(N), i.e. linear.

Håvard S 2010-03-25 13:33:28

Well when you do it right, you can skip through the text (the "haystack", in terms of this question) such that you don't have to look at every character. It depends on the search string, of course; if the search string is one character long, then it's always just linear. And yes, formally the worst-case performance is linear (I think it's 3n right?)

Pointy 2010-03-25 13:37:37

Nice article on wikipedia about this!

extraneon 2010-03-25 13:37:39

@Pointy Of course, but that does not change the fact that it is linear. Complexity analysis does not care about constants, so it's O(N). There is no general guarantee that Boyer-Moore will be faster than a naive approach, it all depends on the string being searched.

Håvard S 2010-03-25 13:42:51

@Havard complexity analysis may not care about constants, but people who need the performance do. Even if you don't get a guarantee it's faster you at least get a chance.

extraneon 2010-03-25 13:49:46

@Pointy: even with skipping, it is still O(N) and lineair time, the constant is only smaller than with a more naive search method.

Ritsaert Hornstra 2010-03-25 13:51:53

Fine. However, if you're ever working for **me**, and you pick an algorithm with a bigger constant performance factor than another, well, you won't be working for me for much longer.

Pointy 2010-03-25 13:53:19

@extraneon The question asks for a generally "fast" solution, and it is important to understand that there is no such thing. You need to analyze your data; applying Boyer-Moore because it "may be faster" is not a smart thing to do.

Håvard S 2010-03-25 13:54:18

The naive method is theta(N), where as Boyer-Moore is O(N).

NickLarsen 2010-03-25 14:14:08

@Pointy Although I agree with you when performance is critical I could just as easily pick a suboptimal but easier/trivial to understand solution in other cases where development time and maintenance complexity are more limited. So I'm afraid I don't get to work at your place then :)

extraneon 2010-03-25 15:04:26

@extraneon naah I'm much nicer than that :-) This whole discussion is pretty silly given that the original question is probably from an "Intro to Algorithms" class, and all these fine points of performance analysis are waaay more sophisticated.

Pointy 2010-03-25 15:17:24

Answer 4

+1 A:

You say your challenge is to find ABC within a string. If all you need is to know if ABC exists within the string, a simple indexOf() test will suffice.

If you need to know the number of occurrences, as your posted code tries to find, a simple approach would be to use a regex:

public static int countOccurrences(string haystack, string regexToFind) {
   Pattern p = Pattern.compile(regexToFind);
   Matcher m = p.matcher(haystack); // get a matcher object
   int count = 0;
   while(m.find()) {
       count++;
   }
   return count;
}

Håvard S 2010-03-25 13:38:30

Thanks Havard, RegEx is a good option but I'm only interested in learning the "Best algorithm" that consumes least CPU time

Taranfx 2010-03-25 13:46:42

Like I've stated in two comments, that all depends on the string you're searching in. There is no general "best algorithm" answer, sadly. Prefer the simple solution, premature optimization is the root of all evil. (Knuth)

Håvard S 2010-03-25 13:49:02

Answer 5

A:

Asked by others, better in what sense? A regexp based solution will be the most concise and readable (:-) ). Boyer-Moore (http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm) will be the most efficient in terms of time (O(N)).

David Soroko 2010-03-25 13:41:58

Again, there is no guarantee that Boyer-Moore will be more efficient than a naive search (it depends on the string being searched).

Håvard S 2010-03-25 13:44:41

Performance. Memory is not an issue.

Taranfx 2010-03-25 13:45:08

Also consider Horspool's variation which is simpler to implement: http://en.wikipedia.org/wiki/Boyer-Moore-Horspool_algorithm

binil 2010-03-25 17:58:32

Answer 6

A:

If you don't mind implementing a new datastructure as replacement for strings, have a look at Tries: http://c2.com/cgi/wiki?StringTrie or http://en.wikipedia.org/wiki/Trie

If you don't look for a regular expression but an exact match they should provide the fastest solution (proportional to length of search string).

Searles 2010-03-25 13:54:40

Answer 7

+1 A:

Have a look at http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

Thorbjørn Ravn Andersen 2010-03-25 14:16:57

ansaurus

tags:

views:

answers:

Count Occurence of Needle String in Haystack String, most optimally?

related questions