+1  Q: 

solr + Heritrix

How can I integrate Solr with Heritrix? I want to archive a site using Heritrix and then index and search the resulting archive locally using Solr.

thanks

A: 

According to this message, yes:

It is pretty easy to add custom writers to Heritrix. We write our crawls to MySQL and then ingest into Solr from there. It would not be hard to write a Heritrix writer that writes directly to Solr however.

-- Sean Timm

Or you might want to use Nutch instead; more work has been done towards integrating it with Solr.

Mauricio Scheffer
+2  A: 

There is a section in the Solr 1.4 Enterprise Search book about using Heritrix and Solr together. Basically, use Heritrix to crawl, and then in a separate process parse the archive files and add them to Solr. While you lose out on things like the page rank scores that Nutch provides, it does simplify things because your crawler and your search engine are separate tools.

This is basically the approach quoted in Mauricio's answer, which stores data in MySQL as an intermediate step. We published all the source for the book on an Amazon EC2 AMI; look for "solrbook". Also, the support site at Packt (http://www.packtpub.com/solr-1-4-enterprise-search-server) will let you download the sample.

Eric Pugh
+2  A: 

The problem with using Solr to index is that it is a straight text index (which may be fine if you are only crawling an internal website and don't care about 'pagerank').

Using Nutch will give you a much better index, however, as it does use pagerank.

NutchWAX

If, however, you are dead set on using Heritrix and would like pagerank-based search results, you could use NutchWAX (Nutch Web Archive eXtensions) to index Heritrix's output (that's what the makers of Heritrix are doing).

NutchWAX is intended for web archives but can also be used to create a search engine of the live web (in fact that is easier, as you aren't dragging years' worth of data along during each rebuild of the index).

Solr

If you do want to use Heritrix+Solr to create a search website, you should probably replace the "ARCWriter" processor in Heritrix with a custom processor that submits the contents of the page to Solr.

The Solr end is just an XML file posted via HTTP and is dead simple.
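
To make the "XML file posted via HTTP" concrete, here is a minimal sketch in Python of what the Solr side might look like. The field names (id, content), the helper names, and the update URL are all assumptions for illustration; your Solr schema and host will differ, and a real Heritrix processor would do the equivalent in Java.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field_name: value} dicts into a Solr <add> message.

    The field names are whatever your Solr schema defines; none are assumed here.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes <, > and & on output
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(xml_bytes,
                 update_url="http://localhost:8983/solr/update?commit=true"):
    """POST an XML update message to Solr and return the HTTP status code.

    update_url is an assumed default for a local Solr install; adjust it
    to match your own host, port and core.
    """
    request = urllib.request.Request(
        update_url,
        data=xml_bytes,  # supplying data makes this a POST request
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage would be along the lines of post_to_solr(build_add_xml([{"id": url, "content": page_text}])), called once per crawled page or, better, once per batch of pages so that Solr is not asked to commit after every document.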

The Heritrix end is a little more complicated, but the Developer's Manual will get you started on writing a Processor for Heritrix 1.x. If you are using the as-yet unstable 3.x or the discontinued 2.x, you'll need to do a little more legwork, as the documentation isn't there yet.

Kris