I have a DB that stores text file attributes together with a primary key ID for each text file. I have also indexed around 1 million text files along with their IDs (the primary keys in the DB).

Now I am searching at two levels. The first is a straightforward DB search, which returns primary keys as its result (roughly 2 or 3 million IDs).

Then I build a Boolean query, for instance the following:

+Text:"test*" +(pkID:1 pkID:4 pkID:100 pkID:115 pkID:1041 .... )

and run it against my index.

The problem is that such a query (with 2 million clauses) takes far too much time to return results and consumes far too much memory.

Is there any way to optimize this?

+1  A: 

Assuming you can reuse the DB ID (pkID) part of your queries:

  1. Split the query into two parts: one part (the text query) will become the query and the other part (the pkID query) will become the filter
  2. Make both parts into queries
  3. Convert the pkID query to a filter (by using QueryWrapperFilter)
  4. Convert the filter into a cached filter (using CachingWrapperFilter)
  5. Hang onto the filter, perhaps via some kind of dictionary
  6. Next time you do a search, use the overload that allows you to use a query and filter
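
The idea behind steps 3 to 6 can be sketched in plain Java. This is only an illustration of the caching principle, not Lucene code: java.util.BitSet stands in for the bit set that a cached filter holds, and the class name CachedIdFilterSketch, its methods and the cache key are hypothetical. It also assumes the pkIDs are non-negative ints.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class CachedIdFilterSketch {
    // Step 5: hang onto each filter in a dictionary, keyed by a caller-chosen name.
    private final Map<String, BitSet> filterCache = new HashMap<>();

    // Steps 3 and 4: turn the ID list into a bit set once, then reuse it.
    // If the key is already cached, allowedIds is ignored.
    public BitSet filterFor(String key, int[] allowedIds) {
        return filterCache.computeIfAbsent(key, k -> {
            BitSet bits = new BitSet();
            for (int id : allowedIds) {
                bits.set(id); // assumes non-negative IDs
            }
            return bits;
        });
    }

    // Step 6: combine the text-query hits with the filter.
    // The inputs are left untouched; a new bit set is returned.
    public static BitSet applyFilter(BitSet textHits, BitSet filter) {
        BitSet result = (BitSet) textHits.clone();
        result.and(filter);
        return result;
    }
}
```

The expensive part (building the set from millions of IDs) happens only on the first call for a given key; later searches pay only for the bitwise AND.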

As long as the pkID search can be reused, you should see quite a large improvement. As long as you don't optimise your index, the caching should even keep working across commit points (as I understand it, the bit sets are calculated on a per-segment basis).

HTH


p.s.

I think it would be remiss of me not to note that I think you're putting your index through all sorts of abuse by using it like this!

Moleski
Sorry for the late reply, but you are quite right. I have now moved all the DB records into my Lucene index (as one big flat table, just like the DB), so I no longer have to pass millions of IDs as input.
Umer
+1  A: 

The best optimization is NOT to use the query with 2 million clauses. Any Lucene query with 2 million clauses will run slowly no matter how you optimize it.

In your particular case, I think it will be much more practical to search your index first with the +Text:"test*" query and then limit the results by running a DB query on the Lucene hits.
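
One possible way to do the second step is to send the pkIDs of the Lucene hits to the DB in batches, using a parameterised IN clause. The sketch below only shows the batching and the SQL text. The class name HitBatcherSketch, the table name and the column name are hypothetical, and the DB's own attribute conditions would still have to be added to the WHERE clause.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class HitBatcherSketch {
    // Split the hit IDs into batches of at most batchSize elements.
    public static List<List<Integer>> batches(List<Integer> ids, int batchSize) {
        List<List<Integer>> out = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            out.add(new ArrayList<>(ids.subList(start, end)));
        }
        return out;
    }

    // Build a SELECT with one "?" placeholder per ID in the batch (n must be >= 1).
    // Table and column names are illustrative assumptions.
    public static String inClauseSql(String table, String idColumn, int n) {
        String placeholders = String.join(", ", Collections.nCopies(n, "?"));
        return "SELECT " + idColumn + " FROM " + table
                + " WHERE " + idColumn + " IN (" + placeholders + ")";
    }
}
```

Each batch would then be bound to a prepared statement, so the DB sees only the IDs that Lucene actually matched rather than all 2 million.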

buru
Thanks, that was quite a definite answer, but unfortunately I can't go to the DB after getting the results from Lucene.
Umer