views:

430

answers:

8

I recently had an interview question that went something like this:

Given a large string (haystack), find a substring (needle)?

I was a little stumped to come up with a decent solution.

What is the best way to approach this that doesn't have a poor time complexity?

+9  A: 

You can use the Knuth-Morris-Pratt algorithm, which is O(n+m) where n is the length of the "haystack" string and m is the length of the search string.

James McNellis
Got halfway done typing an answer when I saw this, read the article, and realized it was the same thing. Never heard of it before, but I used something like it for a binary pattern match state machine back in '94. Oh, well.
Mike DeSimone
+3  A: 

I do not believe that there is anything better then looking at the haystack one character at a time and try to match them against the needle.

This will have linear time complexity (against the length of the haystack). It could degrade if the needle is near the same length as the haystack and shares a long common and repeated prefix.

Thilo
The Knuth-Morris-Pratt algorithm @James McNellis mentions is an improvement on this that allows to skip a few characters after an unsuccessful match. Still O(n), though.
Thilo
Remember this is a search, not a comparison. Consider searching for a string with a lot of repeated characters, in a string with a lot of repeated characters, and you'll realise your algorithm is actually O(nm) worst case where n and m are each the length of one string. The product of two variables isn't classed as linear (it's not the same as an unknown constant times a variable), and certainly isn't as good as e.g. the sum of those two variables, which is achievable.
Steve314
@Steve314: That's what I meant by "degrade if the needle is near the same length as the haystack and shares a long common and repeated prefix" (prefix should have been substring, though).
Thilo
Mind you, a naive string search is usually pretty good in practice, because most text doesn't have the kinds of patterns that cause it problems. You're basically doing a compare at each offset in the "haystack", and most comparison mismatches are detected on the first or second character test.
Steve314
@Thilo - OK - I should have realised.
Steve314
So is this algorithm O(N*M) (Quadratic) or O(N+M) (Linear)?
Mithrax
@Mithrax: If you consider M to be a needle, hence negligible, it is O(N). If the needle is long enough, it is a different problem.
Thilo
It's not that bad for a large needle. Since, in that case, your haystack is reduced accordingly. If your haystack is 1000 chars and your needle in 999 chars, you only have to check positions 0 and 1 within the haystack, even though for both of those, you need to check up to 998 characters. The worst case is when your needle is exactly half the size of your haystack (the largest area of a quadrilateral with given circumference is a square - 500*500 > 501*499 > 502*498 > ... > 1000*1).
paxdiablo
+1  A: 

This problem is discussed in Hacking a Google Interview Practice Questions – Person A. Their sample solution:

bool hasSubstring(const char *str, const char *find) {
    if (str[0] == '\0' && find[0] == '\0')
        return true;

    for(int i = 0; str[i] != '\0'; i++) {
        bool foundNonMatch = false;
        for(int j = 0; find[j] != '\0'; j++) {
            if (str[i + j] != find[j]) {
                foundNonMatch = true;
                break;
            }
        }
        if (!foundNonMatch)
            return true;
    }
    return false;
}
Brian McKenna
would be good to return the start index of the match. In the traditional haystack scenario, you already know that the needle is in there. somewhere. :-)
Thilo
That is a good link! http://courses.csail.mit.edu/iap/interview/materials.php
Courtney85
+10  A: 

I like the Boyer-Moore algorithm. It's especially fun to implement when you have a lot of needles to find in a haystack (for example, probable-spam patterns in an email corpus).

Ether
+1; Very cool; I wasn't familiar with this one. For multiple needles, you might also consider the Aho-Corasick algorithm.
James McNellis
+1 I learnt about Boyer-Moore's many years ago in the "Computer Language" magazine ... a gem (both the algorithm and the magazine)
belisarius
+2  A: 

A typical algorithm would be as follows, with string indexes ranging from 0 to M-1.

It returns either the position of the match or -1 if not found.

foreach hpos in range 0 thru len(haystack)-len(needle):
    found = true
    foreach npos in range 0 thru len(needle):
        if haystack[hpos+npos] != needle[npos]:
            found = false
            break
    if found:
        return hpos
return -1

It has a reasonable performance since it only checks as many characters in each haystack position as needed to discover there's no chance of a match.

It's not necessarily the most efficient algorithm but if your interviewer expects you to know all the high performance algorithms off the top of your head, they're being unrealistic (i.e., fools). A good developer knows the basics well, and how to find out the advanced stuff where necessary (e.g., when there's a performance problem, not before).

The performance ranges between O(a) and O(ab) depending on the actual data in the strings, where a and b are the haystack and needle lengths respectively.

One possible improvement is to store, within the npos loop, the first location greater than hpos, where the character matches the first character of the needle.

That way you can skip hpos forward in the next iteration since you know there can be no possible match before that point. But again, that may not be necessary, depending on your performance requirements. You should work out the balance between speed and maintainability yourself.

paxdiablo
One of those cases where the ability to `continue` an outer loop from an inner loop would be awesome.
Mike DeSimone
@Mike: "goto considered awesome"? ;-)
Steve Jessop
I think there is break/contine with a label, too. No need to goto extremes.
Thilo
@Steve: I thought of that, but `goto` actually might not cut it. In C++ terms, if `goto` fails to properly clean up things that go out of scope, then it can cause some nasty bugs. So a `setjmp`/`longjmp`-style `goto` would be bad, but a thrown and caught exception would be fine were it not for its overhead. I don't know if `goto` cleans up right in C++ since I never use it (too hard to defend in a code review). It doesn't in C, but that's because it's moot due to C's lack of destructors and hence RAII.
Mike DeSimone
Some languages have a `continue 2` or `continue hpos` construct that would work best.
Mike DeSimone
C++ does clean up the scope on a `goto`, but jumping over an initializer is either undefined or ill-formed, can't remember which. In practice, this means you can jump out of a scope, but not in general into one. Not that there would be anything to clean up in this example. I agree that an improved `continue` is best in this case - it expresses the intent exactly. I only mentioned `goto` because it's a constant source of amusement to me (a) how rarely good programmers even want to use it, and yet (b) how terrified many good programmers are of it, to the extent of refusing to solve problems ;-)
Steve Jessop
+5  A: 

The general problem is string searching; there are many algorithms and approaches depending on the nature of the application.

Some advanced index data structures are also used in other applications. Suffix trees are used a lot in bioinformatics; here you have one long reference text, and then you have many arbitrary strings that you want to find all occurrences for. Once the index (i.e. the tree) is built, finding patterns can be done quite efficiently.

For an interview answer, I think it's better to show breadth as well. Knowing about all these different algorithms and what specific purposes they serve best is probably better than knowing just one algorithm by heart.

polygenelubricants
"For an interview answer, I think it's better to show breadth as well." I don't know. Are you supposed to know these off the top of your head? In a real-life situation, you would look up the algorithm on the Internet. In an interview, I would expect someone to come up with an implementation like Brian or paxdiablo show, and being able to explain when those would perform badly.
Thilo
Keep in mind my background is in academia; we do a lot of reading of survey papers. I'd guess that people would expect candidates to know *ABOUT* these off the top of their head, at least to the level of detail shown in my post (i.e. not a lot). You're right, the details of the algorithms can be looked up on the net, but they need to know *ABOUT* them, so they know what to look for and that it exists. Also, I suppose it depends on what type of position you're interviewing for. Sometimes they just need things done. Other times they need people who know what's really going on out there.
polygenelubricants
Other example: candidate is asked to solve TSP, without giving away the problem name. One candidate quickly comes up with an `O(N!)` brute force algorithm, correctly pointing out that it's slow etc. Another immediately recognizes it's TSP, what that means in terms of NP-completeness, what approaches have been attempted, existence of inexact but good approximation algorithms (even without knowing their details), etc. I'd think that people would agree that the second candidate is better than the first, at least for certain positions.
polygenelubricants
Indeed breadth is important. You can't look-up things you have never heard of. The main difficulty for look-up is generally to know the name of the problem in the literature, because there are many ways to describe it ;)
Matthieu M.
A: 

Here is a presentation of a few algorithm (shamelessly linked from Humboldt University). contains a few good algorithms such as Boyer More and Z-box.

I did use the Z-Box algorithm, found it to work well and was more efficient than Boyer-More, however needs some time to wrap your head around it.

Newtopian
A: 

the easiest way I think is "haystack".Contains("needle") ;) just fun,dont take it seriously.I think already you have got your answer.

Raihan Alam