tags:
views: 335
answers: 6

In my application I have up to millions of short strings (mostly shorter than 32 characters). I want to implement a search box with an attached list that contains only the elements that contain the whole string entered in the search box. How can I prebuild an index to find such strings fast? All sorted STL containers compare the whole string.

For an entered search string "str" I need to find all strings containing "str": "main street", "struve", "ustr", etc.

+3  A: 

You might start by looking at tries. Although they are mostly used as prefix trees, the data structure itself can be adapted for faster general search.
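A minimal sketch of the trie idea, assuming lookups over plain ASCII; the names `TrieNode`, `trie_insert`, and `trie_has_prefix` are illustrative, not from any particular library:

```cpp
#include <array>
#include <memory>
#include <string>

// Minimal prefix trie over single-byte characters.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> child{};
    bool isWord = false;
};

void trie_insert(TrieNode& root, const std::string& s) {
    TrieNode* n = &root;
    for (unsigned char c : s) {
        if (!n->child[c]) n->child[c] = std::make_unique<TrieNode>();
        n = n->child[c].get();
    }
    n->isWord = true;
}

// True if any inserted word starts with `p`.
bool trie_has_prefix(const TrieNode& root, const std::string& p) {
    const TrieNode* n = &root;
    for (unsigned char c : p) {
        if (!n->child[c]) return false;
        n = n->child[c].get();
    }
    return true;
}
```

As written this only answers prefix queries; adapting it for infix search is exactly what the permuterm/suffix approaches in the other answers do.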

Kornel Kisielewicz
+3  A: 

You can build a permuterm index.

For "struve" you would insert into a Radix tree (or a general purpose search tree):

struve$
truve$s
ruve$st
uve$str
ve$stru
e$struv
$struve

To search for an infix you would search from the root node for matching prefix strings.
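A sketch of this approach using a sorted `std::set` in place of a radix tree (the rotation and lookup logic is the same; `permuterm_rotations` and `prefix_matches` are illustrative names):

```cpp
#include <set>
#include <string>
#include <vector>

// All rotations of word + '$' -- the permuterm index entries shown above.
std::vector<std::string> permuterm_rotations(const std::string& word) {
    std::string t = word + '$';
    std::vector<std::string> out;
    for (std::size_t i = 0; i < t.size(); ++i)
        out.push_back(t.substr(i) + t.substr(0, i));
    return out;
}

// Collect every indexed rotation that starts with `prefix`; a rotation
// starting with "str" means the original word contains "str".
std::vector<std::string> prefix_matches(const std::set<std::string>& idx,
                                        const std::string& prefix) {
    std::vector<std::string> out;
    for (auto it = idx.lower_bound(prefix);
         it != idx.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        out.push_back(*it);
    return out;
}
```

In a real index each rotation would also carry the id of the original string; a radix tree just makes the prefix walk cheaper than `std::set`'s comparisons.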

Thomas Jung
That's an interesting solution, but it would blow up the stored data by several multiples.
RED SOFT ADAIR
I don't think you'll get a better answer than this, if I'm understanding the problem correctly.
Matt Joiner
You can save significant amounts of memory if you concatenate all your searchable text into a single string and only insert pairs of pointers into your tree (one for each suffix). If n = text length and a char* is 4 bytes, then it's 8n + n + 1 vs. (n+1)^2 above. The pseudo-code is something like: content.append("struve$"); tree_insert([[0,7],[1,7],[2,7],[3,7],[4,7],[5,7],[6,7]], record_id);
Mark Renouf
A: 

I would use SQLite, perhaps with an in-memory database if you load everything into RAM anyway and need extreme performance.

Tristram Gräbener
Will SQLite (or other SQL DBMS) really help with infix search? Does it have special kind of index for that?
WildWezyr
yes, see http://www.sqlite.org/cvstrac/wiki?p=FtsOne
Tristram Gräbener
@Tristram Gräbener: FTS is not infix search! Even in Lucene infix searches are disabled by default because they are too expensive to perform. So having FTS in SQLite will not help for infix search performance.
WildWezyr
I'm a bit confused. What I understand by infix search would be to find every row that matches /.*my_string.*/. In a general DBMS, searching "WHERE string LIKE '%my_string%'" is very ineffective. However, if you have a full-text search index, then it's extremely fast. In benchmarks on MySQL (which also uses this match-against syntax), searching millions of strings takes a few ms. I didn't benchmark SQLite, but given its usual performance, I'm quite confident. Obviously, you need to add the FTS index on your column!
Tristram Gräbener
+2  A: 

If the strings are of arbitrary length and arbitrary count, you could try the Aho-Corasick algorithm, which is simple to implement, scales as O(n) in the length of the search text, and searches for all strings simultaneously.

Alternatively, if the number of strings you're looking for is small, try the Horspool algorithm, which is exceptionally easy to implement and takes less than O(n) on average per string.
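Since C++17 a Horspool searcher ships in `<functional>`, so a brute-force "scan every stored string" variant is a few lines; a sketch, assuming the dictionary fits in a `std::vector` (function name is illustrative):

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Return the indices of all strings containing `needle`, using the C++17
// Boyer-Moore-Horspool searcher; the skip table is built once and reused
// across all strings.
std::vector<std::size_t> find_containing(const std::vector<std::string>& strings,
                                         const std::string& needle) {
    std::boyer_moore_horspool_searcher searcher(needle.begin(), needle.end());
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < strings.size(); ++i)
        if (std::search(strings[i].begin(), strings[i].end(), searcher)
                != strings[i].end())
            hits.push_back(i);
    return hits;
}
```

This is O(total text length) per query with no index at all, which may already be fast enough for a few million 32-character strings.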

Matt Joiner
Thanks for your answer, but Aho-Corasick does not do what I want: it looks for complete occurrences of dictionary strings in a search text. I need to find all entries in the dictionary in which the complete searched text occurs (somewhere).
RED SOFT ADAIR
I don't quite follow. You want to find all occurrences of a string in a search text?
Matt Joiner
+1  A: 

You say you have millions of short strings, so I assume you can't store them in RAM and keep them in a database instead. Let's assume you keep your "short strings" in a table named my_string (id, string). Create another table, let's name it my_substring (id, substring [unique]), containing every substring of every string in my_string. Also create a joining table for the two tables above: my_substring_to_string (id, substring_id, string_id); its contents are obvious, I suppose.

Now searching is straightforward and fast: search for your substring in my_substring (remember to create an index on my_substring.substring) and join it with my_string via my_substring_to_string.

Adding and removing a short string will require updates to my_substring and my_substring_to_string, but these are quite straightforward.

If this solution produces a my_substring table of unacceptably large size, it can be optimized. Instead of keeping every substring, keep every suffix and search for 'substring%' with ILIKE. For example, if the word is 'blues', you have to store the suffixes 'blues', 'lues', 'ues', 'es', 's' (each joined with 'blues'). Then a search for 'lu' (ILIKE 'lu%') will match 'lues'. This way the database will still be able to use the index created on the my_substring.substring column, so searching will still be fast.
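The same suffix idea works in memory: store (suffix, original-string-index) pairs in a sorted vector and binary-search for the query as a prefix. A sketch under that assumption (`build_suffix_index` and `search` are illustrative names):

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// In-memory analogue of the my_substring table, holding every suffix of
// every string, paired with the index of the string it came from.
using SuffixIndex = std::vector<std::pair<std::string, std::size_t>>;

SuffixIndex build_suffix_index(const std::vector<std::string>& strings) {
    SuffixIndex idx;
    for (std::size_t i = 0; i < strings.size(); ++i)
        for (std::size_t j = 0; j < strings[i].size(); ++j)
            idx.emplace_back(strings[i].substr(j), i);
    std::sort(idx.begin(), idx.end());
    return idx;
}

// Indices of strings containing `query`: every suffix that starts with
// `query` lies in one contiguous run of the sorted vector.
std::vector<std::size_t> search(const SuffixIndex& idx, const std::string& query) {
    std::vector<std::size_t> out;
    auto it = std::lower_bound(idx.begin(), idx.end(),
                               std::make_pair(query, std::size_t{0}));
    for (; it != idx.end() && it->first.compare(0, query.size(), query) == 0; ++it)
        out.push_back(it->second);
    std::sort(out.begin(), out.end());
    out.erase(std::unique(out.begin(), out.end()), out.end());
    return out;
}
```

Storing full suffix copies costs O(n^2) characters per string of length n; the pointer-pair trick from the comments above cuts that back down.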

witzar
I was asking for an algorithm in C++. Your answer is SQL.
RED SOFT ADAIR
RAM is cheap. Why not keep it in RAM? 1 million strings of avg. length 32 chars is only 32 MB of RAM. Using an efficient indexing scheme on a dataset of that size may only consume on the order of a gigabyte.
Mark Renouf
A: 

I'd probably start with an inverted index -- i.e. a list of letters, and attached to each a list of the words containing that letter. If you're only working with letters (especially if you restrict it to English, or at least Western European languages), you can also pretty easily create inverted indices for digraphs (i.e. each pair of letters), trigraphs, and so on -- though much beyond trigraphs may not gain a lot, since by then you've usually reduced the list to the point that you can do normal string searches within the lists pretty easily.

Note that I don't intend "list" to mean "linked list", but just "some sort of sequential data structure", which usually means a vector...
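A sketch of the trigraph variant of this idea, assuming queries of at least three letters; posting lists prune the candidates, and a plain substring check verifies them (all names here are illustrative):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Inverted index: each three-letter window maps to the ids of the
// strings that contain it.
using TrigramIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

TrigramIndex build_trigram_index(const std::vector<std::string>& strings) {
    TrigramIndex idx;
    for (std::size_t i = 0; i < strings.size(); ++i)
        for (std::size_t j = 0; j + 3 <= strings[i].size(); ++j)
            idx[strings[i].substr(j, 3)].push_back(i);
    return idx;
}

// Look up the posting list of the query's first trigram, then do a normal
// substring search within those (usually few) candidates.
std::vector<std::size_t> trigram_search(const TrigramIndex& idx,
                                        const std::vector<std::string>& strings,
                                        const std::string& query) {
    std::vector<std::size_t> out;
    auto it = idx.find(query.substr(0, 3));
    if (it == idx.end()) return out;
    for (std::size_t id : it->second)          // ids appear in increasing order
        if (strings[id].find(query) != std::string::npos &&
            (out.empty() || out.back() != id)) // skip duplicate ids
            out.push_back(id);
    return out;
}
```

Intersecting the posting lists of all of the query's trigrams, rather than using just the first, would shrink the candidate set further at the cost of a merge step.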

Jerry Coffin