views:

173

answers:

7

Hi,

Given a string s and an array of smaller strings, T, design a method to search s for each small string in T.

Thanks.

A: 

This sounds like a simple for loop:

for(string t : T)
{
    if (t.equals(s)) {
     /* do stuff with t */
    }
}

From How to use For Each loop

Nick Bedford
== will compare memory addresses; you meant to say .equals(). If she's searching, however, .indexOf() would be more appropriate.
Matt Boehm
@Matt Yeah it depends if she wants to do an enumeration through the matching elements or just wants to find the elements. Question is a little vague I guess. Strings in Java aren't operator overloaded? (Sorry I've only used JavaScript).
Nick Bedford
@Nick: "Strings in Java aren't operator overloaded?" No they are not. Don't assume that anything you learned about Javascript also applies to Java. They are completely different languages.
Stephen C
@Matt my mistake, should have checked the String docs first.
Nick Bedford
A: 

You can't pick a "best" algorithm without knowing more details about the data set.

  • Are these statistically random strings?
  • Is there a lot or a little repetition in the small strings?
  • Do you want to optimize for speed of execution or low memory consumption?
  • Will you perform this search multiple times with the same sub strings (T) or the same main string (s)?

Without this information, the "best" solution is the simplest one.

static IEnumerable<string> FindIn(this IEnumerable<string> T, string s) {
    return T.Where(t => s.Contains(t));
}
Frank Krueger
+6  A: 

Assuming you have a significant number of smaller strings, Rabin-Karp is the standard way to search for multiple small strings in a very very large string. if You only have a few smaller strings, simply repeating Boyer-Moore for each one might be a better alternative.

Falaina
A: 

Could you please clarify a bit?

**The algorithm would STRONGLY depend on what you mean by "Search for". **

  • Do you wish to find whether every string in T is a proper sub-string of S? Or Any string?

  • Do you need Yes/No answer or indexes?

  • Do you care if the answers overlap (e.g. "ABCDE" contains both "ABC" and "CDE" but ONLY if you don't care about overlapping).

A simplistic method (assuming search strings all start out fairly differently) is to:

  • Have a map of "first character" => map_of_first_2_characters__to__list_of_strings.

  • Loop over each position in S, find the character as a key in above map.

    • The value will be another map, mapping 2-character strings to a list of substrings starting with these 2 characters.

    • Find the character and its right neighbour in the submap the value will be a list of strings starting with these 2 values.

    • Assuming fairly equal distribution of starting characters in T and T not being too big (if it's too big, merely build the data structure one level more, by mapping 3 characters) - we just found a very short list of plausible matches starting with the current position. String-compare them all. Mark off the ones (if any) that are substrings of S starting with current position. If the goal is not to find ALL matches of ALL strings, eliminate the ones you found as matches from the data structure.

You may wish to read this for advanced stuff

DVK
I just want to see if every string in T is a subset of S or not. About second question I need index if it matches and I do not take into account overlapping.
Rachel
OK, edited with my own home-brew fairly efficient algorithm (at least, if your language has decent hashmaps - e.g. Perl) as well as a link to something more fancy/scholarly if my rough cut is not as good as you require..
DVK
A: 

Let's make it into a Java solution

boolean isSubset(String[] t, String s) {
    for (String sample: t)
        if (!sample.equals(s))
            return false;
    return true;
}

You can make it faster using Falaina's recommendations, but do you really need it?

tulskiy
A: 

If you have room for a table of pointers (Pointer Size * NumCharsInSource), you can sort each string in the source (the string starting at character) using something like QSort. You can then BSearch the smaller strings into the pointer table. Assuming N characters and M substrings, the sort will have O(N lg N) performance and the lookups will have O(M lg N) performance. The overall performance should be O((N+M) lg N).

However, there can be degenerate cases in which the strings in the source are highly repetitive (i.e. 100,000 a's followed by a b). That will make the compare for the sort portion very slow :-( to get around this, you can special case long runs of characters but that gets much more complicated.

The algorithm to choose really depends on your source data and how much spare memory you have to work with.

Adisak
A: 

The fastest way that I know of that solves this problem is the Aho-Corasick algorithm. For large strings and large numbers of patterns to be searched, it is faster than applying a linear time search (e.g. KMP, Rabin-Karp, Boyer-Moore) for each pattern.

But are you sure you need something like this and that your strings are too long for the straightforward method of string matching?

MAK