tags:
views: 335
answers: 6

In my application I have up to millions of short strings (mostly shorter than 32 characters). I want to implement a search box with an attached list that contains only the elements that contain the whole string entered in the search box. How can I prebuild an index to find such strings fast? All sorted STL containers compare the whole string.

For an entered search string "str" I need to find all strings containing "str": "main street", "struve", "ustr", etc.

+3  A: 

You might start by looking at tries. Although they are mostly used as prefix trees, the data structure itself can be adapted for faster general search.
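A minimal sketch of the trie idea, assuming lookups over plain ASCII; the names `TrieNode`, `trie_insert`, and `trie_has_prefix` are illustrative, not from any particular library:

```cpp
#include <array>
#include <memory>
#include <string>

// Minimal prefix trie over single-byte characters.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> child{};
    bool isWord = false;
};

void trie_insert(TrieNode& root, const std::string& s) {
    TrieNode* n = &root;
    for (unsigned char c : s) {
        if (!n->child[c]) n->child[c] = std::make_unique<TrieNode>();
        n = n->child[c].get();
    }
    n->isWord = true;
}

// True if any inserted word starts with `p`.
bool trie_has_prefix(const TrieNode& root, const std::string& p) {
    const TrieNode* n = &root;
    for (unsigned char c : p) {
        if (!n->child[c]) return false;
        n = n->child[c].get();
    }
    return true;
}
```

As written this only answers prefix queries; adapting it for infix search is exactly what the permuterm/suffix approaches in the other answers do.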

Kornel Kisielewicz
+3  A: 

You can build a permuterm index.

For "struve" you would insert into a Radix tree (or a general purpose search tree):

struve$
truve$s
ruve$st
uve$str
ve$stru
e$struv
$struve

To search for an infix you would search from the root node for matching prefix strings.
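A sketch of this approach using a sorted `std::set` in place of a radix tree (the rotation and lookup logic is the same; `permuterm_rotations` and `prefix_matches` are illustrative names):

```cpp
#include <set>
#include <string>
#include <vector>

// All rotations of word + '$' -- the permuterm index entries shown above.
std::vector<std::string> permuterm_rotations(const std::string& word) {
    std::string t = word + '$';
    std::vector<std::string> out;
    for (std::size_t i = 0; i < t.size(); ++i)
        out.push_back(t.substr(i) + t.substr(0, i));
    return out;
}

// Collect every indexed rotation that starts with `prefix`; a rotation
// starting with "str" means the original word contains "str".
std::vector<std::string> prefix_matches(const std::set<std::string>& idx,
                                        const std::string& prefix) {
    std::vector<std::string> out;
    for (auto it = idx.lower_bound(prefix);
         it != idx.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        out.push_back(*it);
    return out;
}
```

In a real index each rotation would also carry the id of the original string; a radix tree just makes the prefix walk cheaper than `std::set`'s comparisons.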

Thomas Jung
That's an interesting solution, but it would blow up the stored data by several multiples.
RED SOFT ADAIR
I don't think you'll get a better answer than this, if I'm understanding the problem correctly.
Matt Joiner
You can save significant amounts of memory if you concatenate all your searchable text into a single string and only insert pairs of pointers into your tree (one for each suffix). If n = text length and a char* is 4 bytes, then it's 8n + n + 1 vs. (n+1)^2 above. The pseudo-code is something like: content.append("struve$"); tree_insert([[0,7],[1,7],[2,7],[3,7],[4,7],[5,7],[6,7]], record_id);
Mark Renouf
A: 

I would use SQLite, perhaps with an in-memory database if you load everything into RAM anyway and need extreme performance.

Tristram Gräbener
Will SQLite (or other SQL DBMS) really help with infix search? Does it have special kind of index for that?
WildWezyr
yes, see http://www.sqlite.org/cvstrac/wiki?p=FtsOne
Tristram Gräbener
@Tristram Gräbener: FTS is not infix search! Even in Lucene infix searches are disabled by default because they are too expensive to perform. So having FTS in SQLite will not help for infix search performance.
WildWezyr
I'm a bit confused. What I understand by infix search would be to find every row that matches /.*my_string.*/. In a general DBMS, searching "WHERE string LIKE '%my_string%'" is very ineffective. However, if you have a full-text search index, then it's extremely fast. In benchmarks on MySQL (which also uses this match-against syntax), searching millions of strings takes a few ms. I didn't benchmark SQLite, but given its usual performance, I'm quite confident. Obviously, you need to add the FTS index on your column!
Tristram Gräbener
+2  A: 

If the strings are of arbitrary length and arbitrary count, you could try the Aho-Corasick algorithm, which is simple to implement, scales as O(n) in the length of the search text, and searches for all strings simultaneously.

Alternatively, if the number of strings you're looking for is small, try the Horspool algorithm, which is exceptionally easy to implement and takes less than O(n) on average per string.
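Since C++17 a Horspool searcher ships in `<functional>`, so a brute-force "scan every stored string" variant is a few lines; a sketch, assuming the dictionary fits in a `std::vector` (function name is illustrative):

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Return the indices of all strings containing `needle`, using the C++17
// Boyer-Moore-Horspool searcher; the skip table is built once and reused
// across all strings.
std::vector<std::size_t> find_containing(const std::vector<std::string>& strings,
                                         const std::string& needle) {
    std::boyer_moore_horspool_searcher searcher(needle.begin(), needle.end());
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < strings.size(); ++i)
        if (std::search(strings[i].begin(), strings[i].end(), searcher)
                != strings[i].end())
            hits.push_back(i);
    return hits;
}
```

This is O(total text length) per query with no index at all, which may already be fast enough for a few million 32-character strings.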

Matt Joiner
Thanks for your answer, but Aho-Corasick does not do what I want: it looks for complete occurrences of dictionary strings in a search text. I need to find all entries in the dictionary in which the complete searched text occurs (somewhere).
RED SOFT ADAIR
I don't quite follow. You want to find all occurrences of a string in a search text?
Matt Joiner
+1  A: 

You say you have millions of short strings, so I assume you can't store them in RAM and keep them in a database instead. Let's assume you keep your "short strings" in a table named my_string (id, string). Create another table, let's name it my_substring (id, substring [unique]), containing every substring of every string in my_string. Also create a joining table for the two tables above: my_substring_to_string (id, substring_id, string_id); its contents are obvious, I suppose.

Now searching is straightforward and fast: search for your substring in my_substring (remember to create an index on my_substring.substring) and join it with my_string via my_substring_to_string.

Adding and removing a short string will require updates to my_substring and my_substring_to_string, but these are quite straightforward.

If this solution produces a my_substring table of unacceptably large size, it can be optimized. Instead of keeping every substring, keep every suffix and search for 'substring%' with ILIKE. For example, if the word is 'blues', you have to store the suffixes 'blues', 'lues', 'ues', 'es', 's' (each joined with 'blues'). Then a search for 'lu' (ILIKE 'lu%') will match 'lues'. This way the database will still be able to use the index created on the my_substring.substring column, so searching will still be fast.
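The same suffix idea works in memory: store (suffix, original-string-index) pairs in a sorted vector and binary-search for the query as a prefix. A sketch under that assumption (`build_suffix_index` and `search` are illustrative names):

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// In-memory analogue of the my_substring table, holding every suffix of
// every string, paired with the index of the string it came from.
using SuffixIndex = std::vector<std::pair<std::string, std::size_t>>;

SuffixIndex build_suffix_index(const std::vector<std::string>& strings) {
    SuffixIndex idx;
    for (std::size_t i = 0; i < strings.size(); ++i)
        for (std::size_t j = 0; j < strings[i].size(); ++j)
            idx.emplace_back(strings[i].substr(j), i);
    std::sort(idx.begin(), idx.end());
    return idx;
}

// Indices of strings containing `query`: every suffix that starts with
// `query` lies in one contiguous run of the sorted vector.
std::vector<std::size_t> search(const SuffixIndex& idx, const std::string& query) {
    std::vector<std::size_t> out;
    auto it = std::lower_bound(idx.begin(), idx.end(),
                               std::make_pair(query, std::size_t{0}));
    for (; it != idx.end() && it->first.compare(0, query.size(), query) == 0; ++it)
        out.push_back(it->second);
    std::sort(out.begin(), out.end());
    out.erase(std::unique(out.begin(), out.end()), out.end());
    return out;
}
```

Storing full suffix copies costs O(n^2) characters per string of length n; the pointer-pair trick from the comments above cuts that back down.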

witzar
I was asking for an algorithm in C++. Your answer is SQL.
RED SOFT ADAIR
RAM is cheap. Why not keep it in RAM? 1 million strings of avg. length 32 chars is only 32 MB of RAM. Using an efficient indexing scheme on a dataset of that size may only consume on the order of a gigabyte.
Mark Renouf
A: 

I'd probably start with an inverted index -- i.e. a list of letters, and attached to each a list of the words containing that letter. If you're only working with letters (especially if you restrict it to English, or at least Western European languages), you can also pretty easily create inverted indices for digraphs (i.e. each pair of letters), trigraphs, and so on -- though much beyond trigraphs may not gain a lot, since by then you've usually reduced the list to the point that you can do normal string searches within the lists pretty easily.

Note that I don't intend "list" to mean "linked list", but just "some sort of sequential data structure", which usually means a vector...
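A sketch of the trigraph variant of this idea, assuming queries of at least three letters; posting lists prune the candidates, and a plain substring check verifies them (all names here are illustrative):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Inverted index: each three-letter window maps to the ids of the
// strings that contain it.
using TrigramIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

TrigramIndex build_trigram_index(const std::vector<std::string>& strings) {
    TrigramIndex idx;
    for (std::size_t i = 0; i < strings.size(); ++i)
        for (std::size_t j = 0; j + 3 <= strings[i].size(); ++j)
            idx[strings[i].substr(j, 3)].push_back(i);
    return idx;
}

// Look up the posting list of the query's first trigram, then do a normal
// substring search within those (usually few) candidates.
std::vector<std::size_t> trigram_search(const TrigramIndex& idx,
                                        const std::vector<std::string>& strings,
                                        const std::string& query) {
    std::vector<std::size_t> out;
    auto it = idx.find(query.substr(0, 3));
    if (it == idx.end()) return out;
    for (std::size_t id : it->second)          // ids appear in increasing order
        if (strings[id].find(query) != std::string::npos &&
            (out.empty() || out.back() != id)) // skip duplicate ids
            out.push_back(id);
    return out;
}
```

Intersecting the posting lists of all of the query's trigrams, rather than using just the first, would shrink the candidate set further at the cost of a merge step.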

Jerry Coffin