Search engines and databases allow you to use consecutive string search (such as "this is a test"
), which matches this is a test that will match
, but won't match test this is a
.
I know that some databases have built-in features that allow you to use the same functionality without writing a single line of code (e.g. MySQL's full text search). That's not the kind of answer I am looking for.
What I want to know is what kind of algorithm and database structures are used to enable this fast searching of strings.
What would the indexed table look like given the above example? Would it be something similar to this?
// IndexedItemID | Position | Word
1 | 0 | this
1 | 1 | is
1 | 2 | a
1 | 3 | test
1 | 4 | that
1 | 5 | will
1 | 6 | match
2 | 0 | test
2 | 1 | this
2 | 2 | is
2 | 3 | a
Now that there are indexed items, how do you efficiently create a SQL statement that matches those items?
Here's one example I can think of:
select IndexedItemID form
(select IndexedItemID, Position from indexedWords where Word = "this") as word1Position
where
exists(select * from indexedWords where IndexedItemID = word1Position.IndexedItemID AND Word = "is" AND Position = word1Position.Position + 1)
AND exists(select * from indexedWords where IndexedItemID = word1Position.IndexedItemID AND Word = "a" AND Position = word1Position.Position + 2)
AND exists(select * from indexedWords where IndexedItemID = word1Position.IndexedItemID AND Word = "test" AND Position = word1Position.Position + 3)
I'm sure there is probably a more standardized way that's more efficient.