Hi,
Given a string s and an array of smaller strings, T, design a method to search s for each small string in T.
Thanks.
Hi,
Given a string s and an array of smaller strings, T, design a method to search s for each small string in T.
Thanks.
This sounds like a simple for loop:
for(string t : T)
{
if (t.equals(s)) {
/* do stuff with t */
}
}
You can't pick a "best" algorithm without knowing more details about the data set.
Without this information, the "best" solution is the simplest one.
static IEnumerable<string> FindIn(this IEnumerable<string> T, string s) {
return T.Where(t => s.Contains(t));
}
Assuming you have a significant number of smaller strings, Rabin-Karp is the standard way to search for multiple small strings in a very very large string. if You only have a few smaller strings, simply repeating Boyer-Moore for each one might be a better alternative.
Could you please clarify a bit?
**The algorithm would STRONGLY depend on what you mean by "Search for". **
Do you wish to find whether every string in T is a proper sub-string of S? Or Any string?
Do you need Yes/No answer or indexes?
Do you care if the answers overlap (e.g. "ABCDE" contains both "ABC" and "CDE" but ONLY if you don't care about overlapping).
A simplistic method (assuming search strings all start out fairly differently) is to:
Have a map of "first character" => map_of_first_2_characters__to__list_of_strings.
Loop over each position in S, find the character as a key in above map.
The value will be another map, mapping 2-character strings to a list of substrings starting with these 2 characters.
Find the character and its right neighbour in the submap the value will be a list of strings starting with these 2 values.
Assuming fairly equal distribution of starting characters in T and T not being too big (if it's too big, merely build the data structure one level more, by mapping 3 characters) - we just found a very short list of plausible matches starting with the current position. String-compare them all. Mark off the ones (if any) that are substrings of S starting with current position. If the goal is not to find ALL matches of ALL strings, eliminate the ones you found as matches from the data structure.
You may wish to read this for advanced stuff
Let's make it into a Java solution
boolean isSubset(String[] t, String s) {
for (String sample: t)
if (!sample.equals(s))
return false;
return true;
}
You can make it faster using Falaina's recommendations, but do you really need it?
If you have room for a table of pointers (Pointer Size * NumCharsInSource), you can sort each string in the source (the string starting at character) using something like QSort. You can then BSearch the smaller strings into the pointer table. Assuming N characters and M substrings, the sort will have O(N lg N) performance and the lookups will have O(M lg N) performance. The overall performance should be O((N+M) lg N).
However, there can be degenerate cases in which the strings in the source are highly repetitive (i.e. 100,000 a's followed by a b). That will make the compare for the sort portion very slow :-( to get around this, you can special case long runs of characters but that gets much more complicated.
The algorithm to choose really depends on your source data and how much spare memory you have to work with.
The fastest way that I know of that solves this problem is the Aho-Corasick algorithm. For large strings and large numbers of patterns to be searched, it is faster than applying a linear time search (e.g. KMP, Rabin-Karp, Boyer-Moore) for each pattern.
But are you sure you need something like this and that your strings are too long for the straightforward method of string matching?