I'm building a search function for a PHP website using Zend Lucene and I'm having a problem. My website is a shop directory (or something like that).

For example, I have a shop named "FooBar", but my visitors search for "Foo Bar" and get zero results. Likewise, if a shop is named "Foo Bar" and a visitor searches for "FooBar", nothing is found.

I tried searching for "foobar~" (fuzzy search), but it did not find articles named "Foo Bar".

Is there a special way to build the index or to write the query?

A: 

Did you try "*foo* AND *bar*" or "*foo* OR *bar*"? It works in Ferret, which I read is based on Lucene.

klew
It works if the query is FOO BAR and the database has FOOBAR, but if you search for FOOBAR and the database has FOO BAR, it doesn't work.
Daniel
Right, my mistake... I have a crazy idea: try putting '*' between every character ("f*o*o*b*a*r") and set a string length limit (e.g. only if str_len > 5). Or you can insert a space wherever a lowercase letter is followed by an uppercase one, which separates "FooBar" into "Foo Bar", but then the user has to type the string in camel case.
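
A minimal sketch of the camel-case idea, written in Python only for illustration (the regex carries over to PHP's preg_replace almost unchanged). The function name `split_camel_case` is hypothetical, and the sketch assumes plain ASCII letters:

```python
import re

def split_camel_case(text):
    # Insert a space wherever a lowercase letter is directly
    # followed by an uppercase letter.
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", text)

print(split_camel_case("FooBar"))
```

As noted, this does nothing for an all-lowercase query such as "foobar", so it is probably more useful when building the index than when parsing the query.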
klew
+1  A: 

Option 1: Break the input query string into two parts at various points and search for both parts. E.g., in this case the query would be (+fo +obar) OR (+foo +bar) OR (+foob +ar). The problem is that this assumes the input query string contains exactly two tokens. Also, you may get extra, possibly irrelevant, results, such as the matches for (+foob +ar).
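
A rough sketch of how such a query string could be generated, in Python for illustration only. The function name `split_queries` and the `min_len` cutoff (which skips one-letter fragments) are assumptions, not part of any Lucene API:

```python
def split_queries(term, min_len=2):
    # Build one "(+left +right)" clause per split point, keeping
    # each half at least min_len characters long.
    clauses = []
    for i in range(min_len, len(term) - min_len + 1):
        clauses.append("(+%s +%s)" % (term[:i], term[i:]))
    return " OR ".join(clauses)

print(split_queries("foobar"))
```

In practice you would probably OR the result together with the original, unsplit term so that exact matches are still found.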

Option 2: Use n-gram tokenization both when indexing and when querying. When indexing, the tokens for "foo bar" would be fo, oo, ba, ar. When searching for foobar, the tokens would be fo, oo, ob, ba, ar. Searching with OR as the operator will put the documents with the most n-gram matches at the top. This can be achieved with NGramTokenizer.
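
To illustrate the idea outside Lucene, here is a small Python sketch of bigram tokenization with ranking by the number of shared bigrams. The helper names `bigrams` and `rank` are hypothetical, and counting shared bigrams is only a stand-in for Lucene's real scoring:

```python
def bigrams(text, n=2):
    # Lowercase, split on whitespace, then emit every n-character
    # window of each word ("foo bar" -> fo, oo, ba, ar).
    grams = []
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

def rank(query, names):
    # Score each name by how many distinct query bigrams it shares,
    # drop names with no overlap, and return the best matches first.
    q = set(bigrams(query))
    scored = [(len(q & set(bigrams(name))), name) for name in names]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: -s[0])
    return [name for _, name in scored]

print(bigrams("foo bar"))
print(rank("foobar", ["Foo Bar", "Bazaar", "FooBar"]))
```

Note that partial matches such as "Bazaar" still show up, just lower down, so you may want a minimum-overlap threshold.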

Shashikant Kore
Option 2 sounds good. Any idea how to use n-gram tokenization? Thanks.
Daniel
A: 

If you don't care about performance, use WildcardQuery (it is significantly slower than a plain term query):

new WildcardQuery( new Term( "propertyName", "Foo?Bar" ) );

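For zero or more characters, use '*'; for exactly one character, use '?'. So "Foo?Bar" needs some character between the two parts, and "Foo*Bar" is the pattern that also matches "FooBar".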

If performance is important, try using BooleanQuery.

Cambium
If the user searches for "foobar" and the database has "foo bar", there is no way for the script to know where to put that "?" or "*".
Daniel
+1  A: 

Manually add index entries for most common name confusions. Get your customers to type them in on a special form.

Aaron Watters