I'm supposed to write a web crawler in Java. The crawling part is easy, but the indexing part is difficult. I need to be able to query the indexer and have it return matches (multiple word queries). What would be the best data structure for doing such a thing?
A:
If you're building this from scratch, you should look at the inverted index data structure. If you can use one off the shelf, then look at the Nutch project.
teabot
2009-12-02 14:10:57
+1
A:
The solution to the index & search step is to use an inverted index data structure, and the best available open source package that implements this for indexing & search is Lucene.
There are also open source projects that provide a composite solution to the crawling, indexing & searching steps which may be of interest, e.g. Nutch.
This free online book on information retrieval may help you (see chapter on constructing an inverted index).
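To make the idea concrete, here is a minimal sketch of an in-memory inverted index in Java. It is not how Lucene or Nutch do it internally; the class name `InvertedIndex`, the methods `add` and `search`, and the naive tokenizer (lowercase, split on non-alphanumeric characters) are all assumptions made for illustration. Each term maps to the set of document IDs containing it, and a multi-word query is answered by intersecting those sets (an AND query).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndex {
    // term -> sorted set of IDs of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Naive tokenizer (an assumption): lowercase, split on anything
    // that is not an ASCII letter or digit, drop empty tokens.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Index one document: record its ID under every term it contains.
    public void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // AND query: return IDs of documents containing every query term.
    public Set<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs); // copy so the index is not modified
            } else {
                result.retainAll(docs);       // set intersection
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "Java web crawler");
        index.add(2, "web search engine index");
        index.add(3, "Java search index");
        System.out.println(index.search("java index")); // prints [3]
        System.out.println(index.search("web"));        // prints [1, 2]
    }
}
```

In a crawler the document ID would typically stand for a URL, and a real index would also need stemming, stop words, ranking and on-disk storage, which is the part Lucene handles for you.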
Joel
2009-12-02 14:14:52