I am setting up a Solr search engine that will index multiple languages. I created a custom UpdateProcessorFactory that detects which sections of the input text are in which language, and then copies those sections of the document into language-specific fields. For example, with this text:

"Hello World, Bonjour le Monde, Hallo Welt."

It copies "Hello World" into the en-text field, "Bonjour le Monde" into the fr-text field, and "Hallo Welt" into the de-text field. Each field has the appropriate language analyzers to tokenize and stem the words.

In the end I would like to have one box for a user to enter search terms that would search across all languages. The search terms don't need to be translated, but they should be stemmed appropriately. What is the best way to accomplish this? I'm also very concerned about the performance of the searches.

+2  A: 

The best way is to use the DisMaxRequestHandler. It runs the query text through each field's own analyzer (as defined in schema.xml), so the search terms are tokenized and stemmed appropriately for each language.

So, if your query looks like /solr/select?qt=dismax&qf=en-text%20fr-text%20de-text&q=hello%20world, Solr will do the right thing.

(This assumes you have configured dismax as a solr.DisMaxRequestHandler in a requestHandler block in solrconfig.xml.)
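
For what it's worth, here is a minimal sketch of building that query URL from a client so the spaces get encoded correctly. The function name build_dismax_url is made up for illustration, and the /solr/select path and the three field names are simply the ones from this thread; adjust them to your own setup.

```python
from urllib.parse import urlencode, quote


def build_dismax_url(base_url, user_query, fields):
    """Build a dismax select URL that searches every language field.

    build_dismax_url is a hypothetical helper, not part of Solr or any
    client library; it only assembles the query string.
    """
    params = {
        "qt": "dismax",           # handler name registered in solrconfig.xml
        "qf": " ".join(fields),   # space-separated list of fields to query
        "q": user_query,          # raw user input, not translated
    }
    # quote_via=quote encodes spaces as %20 instead of '+'
    return base_url + "?" + urlencode(params, quote_via=quote)


url = build_dismax_url(
    "/solr/select",
    "hello world",
    ["en-text", "fr-text", "de-text"],
)
print(url)
```

The user's text goes in unchanged; the per-language stemming happens on the Solr side, in the analyzers attached to each field listed in qf.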

Most analysis is fast. Performance is bounded mostly by your index size, total term count, and so on. Be sure to tune everything according to the Solr performance guide on the Solr wiki. I'm currently running a 60GB index and continue to get searches in the sub-100ms range on hardware that isn't all that fancy.

Trey