I've created a custom Tokenizer in Solr that looks for named entities. I would like to be able to use this information to populate separate fields within the lucene/solr document.
As an example, I want to populate a multivalued field called "locations" with all the location names extracted from the text. To extract the locations, the text is first tokenized to separate the words and to determine which tokens are locations. After this process, I would like to emit the tokens from the tokenizer as usual, but also populate the "locations" field with the extracted location names.
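For context, here is a rough sketch of the kind of Tokenizer I mean. The real NER logic is replaced by a hard-coded gazetteer, and the class and type names are just placeholders; the point is that the tokenizer already knows which tokens are locations (here marked via a TypeAttribute):

```java
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Sketch only: emits whitespace-separated tokens and tags the ones it
// recognizes as locations with the type "LOCATION".
public final class NamedEntityTokenizer extends Tokenizer {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  // Stand-in for the real NER model.
  private static final Set<String> KNOWN_LOCATIONS = Set.of("paris", "london", "berlin");

  private String[] tokens;
  private int pos;

  @Override
  public boolean incrementToken() throws IOException {
    if (tokens == null) {
      tokens = readAllInput().split("\\s+");
      pos = 0;
    }
    while (pos < tokens.length && tokens[pos].isEmpty()) {
      pos++;
    }
    if (pos >= tokens.length) {
      return false;
    }
    clearAttributes();
    String token = tokens[pos++];
    termAtt.append(token);
    typeAtt.setType(KNOWN_LOCATIONS.contains(token.toLowerCase()) ? "LOCATION" : "word");
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    tokens = null;
    pos = 0;
  }

  private String readAllInput() throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[1024];
    int n;
    while ((n = input.read(buf)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }
}
```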
From the research I've done, there is no way to access the SolrDocument object from the Tokenizer or the TokenizerFactory, so there is no way to populate fields from there.
The solution I've come up with so far is to create a custom UpdateRequestProcessorFactory that processes the text and extracts the fields, after which the Tokenizer processes the text AGAIN to get the tokens. I would like to find a way to do this work while only processing the text once.
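This is roughly what my workaround looks like (the field names "text" and "locations" and the extractLocations() helper are placeholders, and the extraction itself duplicates what the Tokenizer later does):

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Sketch of the workaround: extract locations in an update processor,
// where the SolrInputDocument is still accessible, and copy them into
// the multivalued "locations" field before indexing continues.
public class LocationExtractionProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object text = doc.getFieldValue("text");          // source field (assumed name)
        if (text != null) {
          for (String location : extractLocations(text.toString())) {
            doc.addField("locations", location);          // multivalued target field
          }
        }
        super.processAdd(cmd);                            // continue the processor chain
      }
    };
  }

  // Placeholder for the NER step; this is the same analysis the Tokenizer
  // performs later, which is why the text ends up being processed twice.
  private List<String> extractLocations(String text) {
    return Collections.emptyList();
  }
}
```

The factory is registered in an updateRequestProcessorChain in solrconfig.xml ahead of RunUpdateProcessorFactory, so the extraction runs before the document reaches the analysis chain where my Tokenizer tokenizes the same text a second time.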