I need to run a FuzzyQuery against an index that contains around 8 million lines. That kind of query is pretty slow, needing about 20 seconds for every match. The thing is, I can narrow the results down to about 5000 hits using another field before doing the fuzzy search. For this to work, I need to search by the "narrower" field first, and then run the fuzzy search within those results.

According to the Lucene FAQ, the only thing I have to do is build a BooleanQuery in which the "narrower" clause is required (BooleanClause.Occur.MUST in Lucene 3).

Now I have tried two different approaches:

a) Using the Query Parser, with an input like: narrower:+narrowing_text fuzzy:fuzzy_text~0.9

b) Constructing a BooleanQuery with a TermQuery and a FuzzyQuery

Neither worked; I'm getting about the same times as when the narrower is not used.

Also, just to confirm that the times would be much better if the narrower were working, I reindexed only the 5000 items that match the narrower, and the search went fast as hell.

In case anyone wonders, I'm using PyLucene 3.0.2.

+1  A: 

Doppelganger, you can probably use a Filter, specifically a QueryWrapperFilter. Follow the example from Lucene in Action. You may have to make some modifications for use in Python, but otherwise it should be simple:

  1. Create the query that narrows this down to 5000 hits.
  2. Use it to build a QueryWrapperFilter.
  3. Use the filter in a search involving the fuzzy query.
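
To make the three steps concrete, here is a rough sketch of the narrow-then-fuzzy idea in plain Python. It is not PyLucene code: the in-memory `DOCS` list, the `narrow` and `fuzzy_search` helpers, and the field values are all made up for illustration, and `difflib.SequenceMatcher` is only a stand-in for Lucene's fuzzy similarity, so the scores will not match what a FuzzyQuery produces.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory "index": one dict per document, with a cheap
# exact-match field ("narrower") and the field to fuzzy-match ("fuzzy").
DOCS = [
    {"id": 1, "narrower": "books", "fuzzy": "lucene"},
    {"id": 2, "narrower": "books", "fuzzy": "lucine"},
    {"id": 3, "narrower": "music", "fuzzy": "lucene"},
    {"id": 4, "narrower": "books", "fuzzy": "python"},
]


def narrow(docs, value):
    """Steps 1 and 2: cheap exact-match filter on the narrowing field."""
    return [d for d in docs if d["narrower"] == value]


def fuzzy_search(docs, text, min_similarity):
    """Step 3: expensive similarity scoring, run only on the docs passed in."""
    hits = []
    for d in docs:
        # SequenceMatcher.ratio() is a stand-in similarity in [0, 1].
        score = SequenceMatcher(None, d["fuzzy"], text).ratio()
        if score >= min_similarity:
            hits.append((d["id"], score))
    # Best matches first.
    return sorted(hits, key=lambda h: h[1], reverse=True)


candidates = narrow(DOCS, "books")                 # small candidate set
results = fuzzy_search(candidates, "lucene", 0.8)  # fuzzy work on candidates only
```

The point of the sketch is only the ordering: the expensive comparison never sees documents that fail the cheap filter. Whether a Lucene filter actually gives you that saving depends on how the fuzzy query is evaluated internally, so measure it on your index.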
Yuval F
I thought about that solution too, but if you check the Lucene FAQ link that I gave in the question, it says that using a QueryFilter is not the recommended solution, so I'm trying to find out why the "correct" solution isn't working for me.
Doppelganger
Sounds like you get bad performance for the "correct" solution, so I suggest you try this one as well...
Yuval F