Identifying strings in documents, with nutch+solr?

tags:

nutch
solr


Hi, I'm looking into a search solution that will identify strings (company names) and use these strings for search and facets in Solr.

I'm new to Nutch and Solr, so I wonder whether this is best done in Nutch or in Solr. One solution would be to write a parser in Nutch that identifies the strings in question and then indexes the name of the company, which is later mapped to a Solr field. I'm not sure how, but I guess this could also be done inside Solr, directly from the text?

Does it make sense to do this string identification in Nutch or in Solr and is there some functionality in Solr or Nutch that could help me here?

Thanks.

+1 A:

Nutch works with Solr by sending the crawled data to Solr for indexing via the Solr HTTP API. You trigger the indexing by calling the solrindex command. See this page for details on how to set this up.

To extract the company names, I would add the necessary code in Solr, using an UpdateRequestProcessor. It lets you add an extra step to the indexing process that adds extra fields to the document being indexed. Your UpdateRequestProcessor would examine the document sent to Solr by Nutch, extract the company names from the text and add them as new fields in the document. Solr would then index the document together with the fields that you added.
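As a rough illustration of the extraction step described above, here is a minimal sketch in plain Java. It assumes a fixed list of known company names and models the document as a simple map; the class name `CompanyNameExtractor` and the field names `content` and `company` are made up for the example. In Solr, logic like this would typically be called from the `processAdd` method of a class extending `UpdateRequestProcessor`, reading from and writing to the `SolrInputDocument` instead of a map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class CompanyNameExtractor {
    // Hypothetical field names; a real schema would define its own.
    static final String TEXT_FIELD = "content";
    static final String COMPANY_FIELD = "company";

    /** Returns the known company names that occur in the text (case-insensitive substring match). */
    static List<String> extract(String text, List<String> knownCompanies) {
        List<String> found = new ArrayList<>();
        if (text == null) {
            return found;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        for (String name : knownCompanies) {
            if (haystack.contains(name.toLowerCase(Locale.ROOT))) {
                found.add(name);
            }
        }
        return found;
    }

    /** Adds a multi-valued company field to a document modeled as a plain map. */
    static void enrich(Map<String, Object> doc, List<String> knownCompanies) {
        Object text = doc.get(TEXT_FIELD);
        List<String> companies = extract(text == null ? null : text.toString(), knownCompanies);
        if (!companies.isEmpty()) {
            doc.put(COMPANY_FIELD, companies);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(TEXT_FIELD, "Acme Corp announced a partnership with Globex today.");
        enrich(doc, List.of("Acme Corp", "Globex", "Initech"));
        System.out.println(doc.get(COMPANY_FIELD));
    }
}
```

Because the company names end up in their own field, that field can be used for faceting like any other. Substring matching is only a placeholder here; a real setup would probably want token-aware matching or a proper entity recognizer.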

Pascal Dimassimo 2010-08-18 13:42:52

I gave this a try yesterday and it seems like a more flexible approach than doing it in Nutch. I find this part of Solr a little under-documented, but I now have something running. Thank you Pascal!

grm 2010-08-25 05:43:00

+1 A:

You could embed a NER library (see OpenNLP, LingPipe, GATE) into a custom parser, generate new fields and create an IndexingFilter accordingly. This is not particularly difficult, and the advantage compared to doing this on the Solr side is that you'd gain from the scalability of MapReduce (NLP tasks are often CPU-hungry). See Behemoth for an example of how to embed GATE in MapReduce.
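The two stages mentioned above (entities found at parse time, then copied into index fields) could look roughly like the sketch below. This is plain Java, not the Nutch plugin API: `EntityRecognizer` is a hypothetical stand-in for whichever NER library is embedded, `SuffixRecognizer` is a toy regex that only spots capitalized words followed by a company suffix, and the metadata key and field name are invented for the example. In Nutch the two methods would roughly correspond to a parse filter and an IndexingFilter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a real NER library. */
interface EntityRecognizer {
    List<String> organizations(String text);
}

/** Toy recognizer: capitalized words followed by a company suffix. Not a real NER model. */
class SuffixRecognizer implements EntityRecognizer {
    private static final Pattern ORG =
        Pattern.compile("\\b((?:[A-Z][A-Za-z]+ )+(?:Inc|Ltd|Corp|GmbH))\\b");

    @Override
    public List<String> organizations(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = ORG.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }
}

class NerPipelineSketch {
    /** Parse stage: run the recognizer and keep the result as per-document metadata. */
    static Map<String, List<String>> parse(String text, EntityRecognizer ner) {
        Map<String, List<String>> metadata = new HashMap<>();
        metadata.put("organizations", ner.organizations(text));
        return metadata;
    }

    /** Index stage: copy parse metadata into the fields that get sent to the index. */
    static Map<String, List<String>> toIndexFields(Map<String, List<String>> parseMeta) {
        Map<String, List<String>> fields = new HashMap<>();
        List<String> orgs = parseMeta.get("organizations");
        if (orgs != null && !orgs.isEmpty()) {
            fields.put("company", orgs);
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "The board of Initech Corp hired staff from Umbrella Ltd.";
        Map<String, List<String>> meta = parse(text, new SuffixRecognizer());
        System.out.println(toIndexFields(meta));
    }
}
```

Keeping the recognizer behind an interface means the toy regex can be swapped for a real library without touching the two stages.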

Julien Nioche 2010-08-27 11:06:45

Very interesting project...

Pascal Dimassimo 2010-09-27 17:21:15
