I've created a custom Tokenizer in Solr that looks for named entities. I would like to be able to use this information to populate separate fields within the lucene/solr document.

As an example, I want to populate a multivalued field called "locations" with all the location names extracted from the text. To extract locations, the text is first tokenized to separate the words and to determine which tokens are locations. After this process, I would like to emit the tokens from the tokenizer as usual, but also populate the "locations" field with the extracted location names.

From the research I've done, there is no way to access the SolrDocument object from the Tokenizer or the TokenizerFactory, so there is no way to populate fields from there.

The solution I've come up with so far is to create a custom UpdateRequestProcessorFactory that processes the text and extracts the fields; the Tokenizer then processes the text AGAIN to get the tokens. I would like to find a way to do this work while only processing the text once.
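For reference, a minimal sketch of that workaround, assuming the body text lives in a field called "text" and that a hypothetical extractLocations helper wraps the same NER logic my custom Tokenizer uses:

    import java.io.IOException;
    import java.util.Collections;
    import java.util.List;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class LocationExtractorFactory extends UpdateRequestProcessorFactory {
        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                SolrQueryResponse rsp, UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    String text = (String) doc.getFieldValue("text");
                    if (text != null) {
                        // First pass over the text: pull out the locations.
                        // The custom Tokenizer will analyze the same text
                        // again at index time, which is the duplication
                        // I'm trying to avoid.
                        for (String loc : extractLocations(text)) {
                            doc.addField("locations", loc);
                        }
                    }
                    super.processAdd(cmd);
                }
            };
        }

        // Hypothetical stand-in for the named-entity recognition logic.
        private List<String> extractLocations(String text) {
            return Collections.emptyList();
        }
    }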

A: 

Here's an idea I think would work in Lucene, but I have no idea if it's possible in Solr. You could tokenize the string outside the typical TokenStream chain, as you suggest, and then manually add the tokens to the document using the NOT_ANALYZED option. You add each token separately with document.add(...), and Lucene will treat the repeated values as a single field for searching.
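A rough sketch of that, using the older Lucene 3.x Field.Index.NOT_ANALYZED API (later versions replaced it with StringField); fullText and extractedLocations stand in for your own document text and entity-recognizer output:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Each extracted location becomes its own NOT_ANALYZED value.
    // Repeated fields with the same name act as one multivalued
    // field when searching.
    Document doc = new Document();
    doc.add(new Field("text", fullText, Field.Store.YES, Field.Index.ANALYZED));
    for (String location : extractedLocations) {
        doc.add(new Field("locations", location,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
    }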

jshen
+1  A: 

The way I do it is less elegant than what it looks like you are shooting for:

I preprocess the documents using a named entity recognizer and save all of the entities in a separate file. Then, when I am publishing to Solr, I just read the entities from this file and populate the entity fields (different fields for people, locations, and organizations). This could be simplified, but since I had already done the parsing for other work, it was easier to just reuse what already existed.
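Condensed, the publish step looks roughly like the SolrJ sketch below; the tab-separated entity file format, the field names, and the Solr URL are all stand-ins for whatever your own preprocessing pass produces:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class EntityPublisher {
        // Assumes the NER pass wrote one "TYPE\tENTITY" line per
        // entity found in this document.
        public static void publish(String docId, String text, String entityFile)
                throws IOException, SolrServerException {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docId);
            doc.addField("text", text);

            BufferedReader in = new BufferedReader(new FileReader(entityFile));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                // Route each entity into the matching multivalued field.
                if ("LOCATION".equals(parts[0])) {
                    doc.addField("locations", parts[1]);
                } else if ("PERSON".equals(parts[0])) {
                    doc.addField("people", parts[1]);
                } else if ("ORGANIZATION".equals(parts[0])) {
                    doc.addField("organizations", parts[1]);
                }
            }
            in.close();

            solr.add(doc);
            solr.commit();
        }
    }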

David