I was developing an application using Symfony 1.4 and Doctrine when I realized there is a major drawback in my Zend Lucene implementation.

I have a model called Publication that is related (via foreign keys) to a few other models (subjects, genres, languages, authors, etc.). When I add a new document to the index (the way the Jobeet tutorial does it), I include the names of those related records so that I can search for publications with a given subject, genre, language, author, and so on. The problem is that if I later change the name of one of those related records, the Zend Lucene index does not get updated.

The only two solutions I could come up with were:

  1. Re-index all the publications regularly to guarantee that any changes made to the related models are reflected in the index (however, this solution doesn't allow the index to be updated in real time).

  2. Fetch all the publications related to a given model and re-index them after it is updated (using save(), postSave(), postUpdate() or whichever Doctrine hook fits). This solution seemed great, since it only rebuilds the index for the publications linked to the updated model. However, with something like a thousand (1000) publications linked to it, the update takes a few minutes (yes, I tested it). On a user form the request times out because it takes over 30 seconds, and even if it didn't, it would be bad to leave a user staring at the screen for a few minutes waiting for the page to finish loading.

So what I want to know is whether there's another solution. Is there a way to update an index on the fly, based on a change to a related model, without hanging the whole page? Maybe by running the task in the background or something similar?

If there's no way to do this with Lucene, is there a way to do full-text search with MySQL (on InnoDB tables), without Zend Lucene, that doesn't have this drawback? If there's such a tool, I'd gladly refactor my code to accommodate a different library.

Could you please help me with this? Thanks in advance!

A: 

A Lucene Document cannot be updated in place. You can only delete the matching hits and then add the documents again. For that reason, my original solution is not valid.
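To make the "delete, then re-add" idea concrete, here is a minimal sketch in Python. It uses a toy in-memory index, not the Zend_Search_Lucene API; `ToyIndex`, `find_by_pk` and `update_document` are hypothetical names made up for illustration. In Zend_Search_Lucene the same pattern would roughly correspond to `find()`, `delete()` and `addDocument()`, as in the Jobeet tutorial.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index (hypothetical); stored documents are never modified in place."""

    def __init__(self):
        self._docs = {}                     # internal doc id -> (pk, text)
        self._postings = defaultdict(set)   # term -> set of internal doc ids
        self._next_id = 0

    def add_document(self, pk, text):
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = (pk, text)
        for term in text.lower().split():
            self._postings[term].add(doc_id)
        return doc_id

    def find_by_pk(self, pk):
        # Return the internal ids of every document stored for this primary key.
        return [doc_id for doc_id, (doc_pk, _) in self._docs.items() if doc_pk == pk]

    def delete(self, doc_id):
        _, text = self._docs.pop(doc_id)
        for term in text.lower().split():
            self._postings[term].discard(doc_id)

    def search(self, term):
        # Return the primary keys of the documents containing the term.
        ids = self._postings.get(term.lower(), ())
        return sorted(self._docs[doc_id][0] for doc_id in ids)


def update_document(index, pk, new_text):
    """'Update' = delete every existing hit for the pk, then add a fresh document."""
    for doc_id in index.find_by_pk(pk):
        index.delete(doc_id)
    index.add_document(pk, new_text)


index = ToyIndex()
index.add_document(1, "Dune science fiction")
index.add_document(2, "Emma romance")
update_document(index, 1, "Dune sci-fi")   # the subject was renamed
```

The point of the sketch is that changing a single field still means rebuilding the whole document, which is why renaming one subject forces a re-index of every publication that mentions it.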

I was looking at alternatives for you, and there is one that caught my eye: http://www.sphinxsearch.com/

It seems that Sphinx is very fast at indexing, but slower to perform searches. Might be something worth taking a look at.

From what I have been reading, the PHP implementation of Lucene is not very fast, and that is expected behavior. There are ways to speed up indexing of large quantities of data, which mainly involve giving Lucene more RAM so that it can buffer more documents in memory before flushing them to disk.

Jon
And how can I do that? As far as I know, I can't update just one term of a "document" indexed in Zend Lucene. If such a thing is possible, I'd gladly try it. Any idea how to do it?
petersaints
You are right. I was just thinking about it, but did not check whether it was possible. A Lucene Document cannot be updated.
Jon
Well, I ended up with the solution that I wrote below. It's not perfect, but I find it "good enough". However, it would be really cool if Lucene could improve a bit in terms of updating only a single term of a document, and maybe even provide some kind of real-time indexing.
petersaints
As referenced above, Sphinx is much faster to search than Zend Lucene. It is orders of magnitude faster at both indexing and searching.
benlumley
A: 

Well... I'm answering my own question. After thinking about it for some time, I ended up with a compromise solution.

My model already has a one-to-one relationship with a table that is used only for storing meta-information about a publication, so I added a new column to it called reindex (a "boolean"). Every time I update an entity related to a publication (something that will happen very seldom in production, but I want to be prepared for it), every publication related to it is marked as needing reindexing. Then I have a task, which can be run from a cron job or the Task Scheduler, that reindexes only the publications marked as needing it. This way I can set the task to run a few times a week at late hours to keep the index consistent.

It's not a perfect solution, but it's the best I could come up with using only PHP and Zend Lucene.

petersaints