views:

412

answers:

6

Do you know of a method for quickly filtering a list of strings to obtain the subset that contain a specified string? The obvious implementation is to just iterate through the list, checking each string for whether it contains the search string. Is there a way to index the string list so that the search can be done faster?

A: 

Not really anything that's viable, no, unless you have additional a priori knowledge of your data and/or search term - for instance, if you're only searching for matches at the beginning of your strings, then you could sort the strings and only look at the ones within the bounds of your search term (or even store them in a binary tree and only look at the branches that could possibly match). Likewise, if your potential search terms are limited, you could run all the possible searchs against a string when it's initially input and then just store a table of which terms match and which don't.

Aside from that kind of thing, just iterating through is basically it.

Amber
A: 

That depends on if the substring is at the beginning of the string or can be anywhere in the string.

If it's anywhere then you pretty much need to iterate over the entire list unless your list is so large and the query happens sufficiently often that it's worth building a more sophisticated indexing solution.

If the substring is at the beginning of the string then it's easy. Sort the list, find the start/end by biseciton search and take that subset.

cletus
+2  A: 

Yes, you could for example create an index for all character combinations in the strings. A string like "hello" would be added in the indexes for "he", "el", "ll" and "lo". To search for the string "hell" you would get the index of all strings that exist in all of the "he", "el" and "ll" indexes, then loop through those to check for the actual content in the strings.

Guffa
Of course, it depends on the data as to how effective this will actually be at optimizing things.
Amber
What is this algorithm called? I implemented this recently and it was fairly straight-forward and resulted in a significant speed improvement for my specific use case.
ChrisInEdmonton
Ah, this seems to be an inverted index of all bigrams (n-gram, where n = 2).
ChrisInEdmonton
+6  A: 

Wikipedia article lists a few ways to index substrings. You've got:

Mark Rushakoff
+1  A: 

If you can preprocess the collection then you can do a lot of different things.

For example, you could build a trie including all your strings' suffixes, then use that to do very fast matching.

Strilanc
+1  A: 

If you're going to be repeatedly searching the same text, then a suffix tree is probably worthwhile. If carefully applied, you can achieve linear time processing for most string problems. If not, then in practice you won't be able to do much better than Rabin-Karp, which is based on hashing, and is linear in expected time.

There are many freely available implementations of suffix trees. See for example, this C implementation, or for Java, check out the Biojava framework.

ire_and_curses