views:

70

answers:

3

I'm supposed to write a web crawler in Java. The crawling part is easy, but the indexing part is difficult. I need to be able to query the indexer and have it return matches (multiple word queries). What would be the best data structure for doing such a thing?

A: 

If you're building this from scratch, you should look at the inverted index data structure. If you can use one off the shelf, then look at the Nutch project.
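
To make the idea concrete, here is a rough sketch of an in-memory inverted index that handles multi-word queries by intersecting the posting sets of each term (AND semantics). The class name `InvertedIndex`, the methods `add` and `search`, the integer document ids, and the very naive tokenizer are all my own assumptions for illustration; a real indexer would also need stemming, stop words, ranking and on-disk storage.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative in-memory inverted index: term -> sorted set of document ids. */
class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Naive tokenizer (an assumption): lowercase, split on non-alphanumeric runs. */
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    /** Records every term of the text as occurring in the given document. */
    public void add(int docId, String text) {
        for (String term : tokenize(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents that contain every term of the query. */
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy so the index is not mutated
            } else {
                result.retainAll(docs);         // intersect with the next term
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "Java web crawler tutorial");
        index.add(2, "Indexing pages with an inverted index");
        index.add(3, "A web crawler needs an inverted index");
        System.out.println(index.search("web crawler"));            // prints [1, 3]
        System.out.println(index.search("inverted index crawler")); // prints [3]
    }
}
```

Swapping `retainAll` for `addAll` would give OR semantics instead, and storing term positions alongside the document ids would be one way to support phrase queries.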

teabot
+1  A: 

Use an indexing tool such as Lucene, Solr or Compass.

skaffman
+1  A: 

The solution to the index & search step is to use an inverted index data structure, and the best available open source package that implements this for indexing & search is Lucene.

There are also open source projects that provide a composite solution to the crawling, indexing & searching steps, which may be of interest, e.g. Nutch.

This free online book on information retrieval may help you (see chapter on constructing an inverted index).

Joel