views: 297

answers: 3
I want to know the string matching algorithms used by Apache Lucene. I have been going through the index file format used by Lucene, given here. It seems that Lucene stores all words occurring in the text as-is, along with their frequency of occurrence in each document. But as far as I know, efficient string matching would require preprocessing the words occurring in the documents.

Example: search for "iamrohitbanga is a user of stackoverflow" (using fuzzy matching) in some documents.

It is possible that there is a document containing the string "rohit banga". To find that the substrings "rohit" and "banga" are present in the search string, it would need some efficient substring matching.

I want to know which algorithm it is. Also, if it does some preprocessing, which function call in the Java API triggers it?

A: 

As you pointed out, Lucene stores only the list of terms that occurred in documents. How Lucene extracts these words is up to you. The default Lucene analyzer simply breaks text into words separated by spaces. You could write your own implementation that, for example, for the source string 'iamrohitbanga' yields 5 tokens: 'iamrohitbanga', 'i', 'am', 'rohit', 'banga'.

Please look at the Lucene API docs for the TokenFilter class.
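To make the idea concrete, here is a minimal sketch of a tokenizer that emits the whole token plus dictionary-based subwords, as in the 'iamrohitbanga' example above. The word list is hypothetical, and this is plain Java, not an actual Lucene TokenFilter:

```java
import java.util.*;

public class SubwordTokenizer {
    // Hypothetical word list; a real analyzer would use a proper dictionary.
    private static final Set<String> DICTIONARY =
            new HashSet<>(Arrays.asList("i", "am", "rohit", "banga"));

    /** Emits the original token plus any dictionary words found by greedy scanning. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        tokens.add(input); // keep the whole token, as in the example above
        int pos = 0;
        while (pos < input.length()) {
            int matchEnd = -1;
            // longest-match greedy scan from the current position
            for (int end = input.length(); end > pos; end--) {
                if (DICTIONARY.contains(input.substring(pos, end))) {
                    matchEnd = end;
                    break;
                }
            }
            if (matchEnd == -1) {
                pos++; // no dictionary word starts here; skip one character
            } else {
                tokens.add(input.substring(pos, matchEnd));
                pos = matchEnd;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("iamrohitbanga"));
        // yields the 5 tokens: iamrohitbanga, i, am, rohit, banga
    }
}
```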

Yaroslav
I am more concerned with string matching algorithms like KMP, suffix trees, etc.
iamrohitbanga
Neither Lucene nor its closest competitor Xapian supports KMP or suffix trees, to the best of my knowledge. There is one open-source project attempting to do searching using suffix trees, but it is rather unstable (it's called libdoodle).
Yaroslav
+1  A: 

The basic design of Lucene uses exact string matches, or defines equivalent strings using an Analyzer. An analyzer breaks text into indexable tokens. During this process, it may collate equivalent strings (e.g. upper and lower case, stemmed strings, removed diacritics, etc.). The resulting tokens are stored in the index as a dictionary plus a posting list of the tokens in documents. Therefore, you can build and use a Lucene index without ever using a string-matching algorithm such as KMP. However, FuzzyQuery and WildCardQuery use something similar, first searching for matching terms and then using them for the full match. Please see Robert Muir's blog post about AutomatonQuery for a new, efficient approach to this problem.
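The "collating equivalent strings" step can be sketched in plain Java. This is only an illustration of what an analyzer chain achieves (lowercasing plus diacritic stripping), not Lucene's actual filter code:

```java
import java.text.Normalizer;
import java.util.Locale;

public class TokenNormalizer {
    /**
     * Collates equivalent surface forms into one index token by
     * lowercasing and stripping diacritics. A sketch of what an
     * analyzer chain achieves; not actual Lucene code.
     */
    public static String normalize(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        // NFD decomposition separates base letters from combining marks
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", ""); // drop combining marks
    }

    public static void main(String[] args) {
        // "Résumé", "RESUME" and "resume" all collate to the same token.
        System.out.println(normalize("Résumé"));
    }
}
```

Because both index-time and query-time text pass through the same normalization, exact matching on the stored tokens suffices.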

Yuval F
+1  A: 

As Yuval explained, in general Lucene is geared at exact matches (by normalizing terms with analyzers at both index and query time).

In the Lucene trunk code (not any released version yet) there is in fact suffix tree usage for inexact matches such as Regex, Wildcard, and Fuzzy.

The way this works is that a Lucene term dictionary itself is really a form of a suffix tree. You can see this in the file formats that you mentioned in a few places:

Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
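The shared-prefix encoding described in that file-format excerpt can be sketched as follows. This is illustrative only, not the actual on-disk format:

```java
public class PrefixDelta {
    /** One delta-encoded dictionary entry: shared prefix length plus the suffix. */
    static final class Entry {
        final int prefixLength;
        final String suffix;
        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes a term relative to the previous term in sorted order. */
    public static Entry encode(String previous, String term) {
        int n = Math.min(previous.length(), term.length());
        int shared = 0;
        while (shared < n && previous.charAt(shared) == term.charAt(shared)) {
            shared++;
        }
        return new Entry(shared, term.substring(shared));
    }

    public static void main(String[] args) {
        Entry e = encode("bone", "boy");
        // As in the file-format example: PrefixLength 2, suffix "y"
        System.out.println(e.prefixLength + " " + e.suffix);
    }
}
```

Because the terms are stored in sorted order, consecutive terms tend to share long prefixes, which is what makes this delta encoding effective.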

The term info index gives us "random access" by indexing this tree at certain intervals (every 128th term by default).
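The term-info index idea can be sketched like this: keep every 128th term in memory, binary-search that sample to locate the right block, then scan at most one interval. The interval of 128 matches the default mentioned above; the code itself is an in-memory illustration, not Lucene's implementation:

```java
import java.util.*;

public class TermIndex {
    static final int INTERVAL = 128; // default indexing interval from the answer

    /** Returns the position of target in sortedTerms, or -1 if absent. */
    public static int locate(List<String> sortedTerms, String target) {
        // Build the in-memory sample (every INTERVAL-th term position).
        List<Integer> sample = new ArrayList<>();
        for (int i = 0; i < sortedTerms.size(); i += INTERVAL) sample.add(i);
        // Binary search the sample for the last sampled term <= target.
        int lo = 0, hi = sample.size() - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedTerms.get(sample.get(mid)).compareTo(target) <= 0) {
                start = sample.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Linear scan within one block of at most INTERVAL entries.
        int end = Math.min(start + INTERVAL, sortedTerms.size());
        for (int i = start; i < end; i++) {
            if (sortedTerms.get(i).equals(target)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 1000; i++) terms.add(String.format("term%04d", i));
        System.out.println(locate(terms, "term0500"));
    }
}
```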

So at a low level it is a suffix tree, but at a higher level we exploit these properties (mainly the ones specified in IndexReader.terms) to treat the term dictionary as a deterministic finite state automaton (DFA):

Returns an enumeration of all terms starting at a given term. If the given term does not exist, the enumeration is positioned at the first term greater than the supplied term. The enumeration is ordered by Term.compareTo(). Each term is greater than all that precede it in the enumeration.
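The seek contract quoted above can be mimicked with a sorted set. This is a sketch of the contract, not of Lucene's implementation:

```java
import java.util.*;

public class TermSeek {
    /**
     * Returns all terms >= the given term, in sorted order - the same
     * contract as the quoted enumeration: position at the first term
     * not less than the supplied one, then enumerate in order.
     */
    public static SortedSet<String> termsFrom(SortedSet<String> dictionary, String seek) {
        return dictionary.tailSet(seek); // first element >= seek
    }

    public static void main(String[] args) {
        SortedSet<String> dict = new TreeSet<>(Arrays.asList("bone", "boy", "cat", "dog"));
        // Seeking a non-existent term positions at the first greater term.
        System.out.println(termsFrom(dict, "bz"));
    }
}
```

This ordered-seek ability is exactly what lets an inexact query skip efficiently through the dictionary instead of testing every term.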

Inexact queries such as Regex, Wildcard, and Fuzzy are themselves also defined as DFAs, and the "matching" is simply DFA intersection.
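A toy version of that intersection: compile the query into a small DFA and run the dictionary's terms through it. The real AutomatonQuery seeks through the sorted dictionary using the automaton rather than scanning every term as this sketch does:

```java
import java.util.*;

public class DfaIntersection {
    /** A toy DFA: integer states, a transition function, and accept states. */
    interface Dfa {
        int step(int state, char c); // -1 means the dead state
        boolean accepts(int state);
    }

    /** DFA for the wildcard pattern "bo*": 0 -b-> 1 -o-> 2, and 2 loops on anything. */
    static final Dfa BO_STAR = new Dfa() {
        public int step(int state, char c) {
            if (state == 0) return c == 'b' ? 1 : -1;
            if (state == 1) return c == 'o' ? 2 : -1;
            if (state == 2) return 2; // "*" matches any remaining characters
            return -1;
        }
        public boolean accepts(int state) { return state == 2; }
    };

    /** Intersects the DFA's language with the term dictionary. */
    public static List<String> intersect(Dfa dfa, SortedSet<String> terms) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            int state = 0;
            for (int i = 0; i < term.length() && state != -1; i++) {
                state = dfa.step(state, term.charAt(i));
            }
            if (state != -1 && dfa.accepts(state)) matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> dict = new TreeSet<>(Arrays.asList("bone", "boy", "cat"));
        System.out.println(intersect(BO_STAR, dict));
    }
}
```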

Robert Muir