I have a relatively small index containing around 4,000 locations. Among other things, I'm using it to populate an autocomplete field on a search form.

My index contains documents with a Location field containing values like

  • Ohio
  • Dayton, Ohio
  • Dublin, Ohio
  • Columbus, Ohio

I want to be able to type in "ohi" and have all of these results appear. Right now, nothing shows up until I type the full word "ohio".

I'm using Lucene.NET v2.3.2.1, and the relevant portion of my code for setting up the query is as follows:

BooleanQuery keywords = new BooleanQuery();
QueryParser parser = new QueryParser("location", new StandardAnalyzer());
parser.SetAllowLeadingWildcard(true);
keywords.Add(parser.Parse("\"*" + location + "*\""), BooleanClause.Occur.SHOULD);
luceneQuery.Add(keywords, BooleanClause.Occur.MUST);

In short, I'd like to get this working like a LIKE clause similar to

SELECT * from Location where Name LIKE '%ohi%'

Can I do this with Lucene?

A: 

It's more a matter of populating your index with partial words in the first place. Your analyzer needs to put the partial keywords into the index as it analyzes (and hopefully weight them lower than full keywords as it does).

Lucene index lookup trees work from left to right. If you want to search in the middle of a keyword, you have to break it up as you analyze. The problem is that partial keywords will usually explode your index size.

People usually use really creative analyzers that break words up into root words (stripping off prefixes and suffixes).

Dig deep into understanding Lucene. It's good stuff. :-)

Zac Bowling
A: 

Yes, this can be done. However, leading wildcards can result in slow queries; check the documentation. Also, if you are indexing the entire string (e.g. "Dayton, Ohio") as a single token, most queries will degenerate into leading-wildcard queries. Using a tokenizing analyzer like StandardAnalyzer (which I suppose you are already doing) will lessen the need for a leading wildcard.

If you don't want leading wildcards for performance reasons, you can try indexing n-grams. That way, there will not be any leading-wildcard queries. An n-gram tokenizer (assuming grams of length 4 only) will create tokens for "Dayton Ohio" such as "dayt", "ayto", "yton" and so on.
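
As a rough illustration of the n-gram idea, here is a plain C# sketch that produces fixed-length character grams for each word. It is not the Lucene.NET n-gram tokenizer; the `NGramSketch` class and `CharNGrams` method are hypothetical names made up for this example, and splitting on spaces and commas is an assumption about how the location strings are broken into words.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not part of Lucene.NET).
public static class NGramSketch
{
    // Lowercases the text, splits it on spaces and commas, and returns
    // every substring of exactly n characters from each word, in order.
    public static List<string> CharNGrams(string text, int n)
    {
        List<string> grams = new List<string>();
        string[] words = text.ToLowerInvariant().Split(
            new char[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            // Words shorter than n produce no grams at all.
            for (int i = 0; i + n <= word.Length; i++)
            {
                grams.Add(word.Substring(i, n));
            }
        }
        return grams;
    }
}
```

Calling `NGramSketch.CharNGrams("Dayton Ohio", 4)` yields "dayt", "ayto", "yton" and "ohio". Note that with 4-grams only, a three-letter input such as "ohi" has no matching token, so an index meant for autocomplete would probably need a range of gram lengths.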

Shashikant Kore
Thanks for the response. I'm not too worried about slow queries yet, as I'd like to see it work first before I decide whether it's too slow. My location list should stay steady at around 4,000 documents, so I'm not too worried about it getting any bigger. When you say, "Yes, this can be done," could you elaborate a little more? I thought that the code I displayed above should be doing what I'm expecting, but it's not. Any ideas on what I'm doing wrong?
thinkzig
+2  A: 

Try this:

parser.Parse(query.Keywords.ToLower() + "*")

:)

j3fft
That did the trick! You had just what I needed./GBT: werd!!!
thinkzig