Is a parallel or distributed system a good fit for a website crawler and web indexer developed in Java? If so, which frameworks are available?

+6  A: 

One of the best crawler/indexer combos you'll ever find for Java is Nutch, which is an Apache project now (see Wiki) and thus open source.

Features:

  1. Fetching, parsing and indexing, in parallel and/or distributed
  2. Plugins: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags)
  3. Ontology
  4. Clustering
  5. MapReduce
  6. Distributed filesystem (via Hadoop)
  7. Link-graph database
  8. NTLM authentication (Windows/Exchange/etc)
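
To give a feel for what "fetching in parallel" means on a single machine, here is a minimal sketch using only the JDK's `ExecutorService`. It is not Nutch code and does not use any Nutch API; the class name `ParallelCrawlSketch`, the method `fetchAll` and the pluggable `fetcher` function are hypothetical names made up for this illustration. The fetcher is passed in as a parameter so the example runs without network access; a real crawler would put an HTTP GET there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical illustration only: not part of Nutch or any other framework.
public class ParallelCrawlSketch {

    /**
     * Fetches every URL on a fixed-size thread pool and returns url -> content.
     * The fetcher must not return null (ConcurrentHashMap rejects null values).
     * A URL whose fetch throws is simply missing from the result.
     */
    public static Map<String, String> fetchAll(List<String> urls,
                                               Function<String, String> fetcher,
                                               int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, String> results = new ConcurrentHashMap<>();
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> {
                    results.put(url, fetcher.apply(url));
                    return null;
                });
            }
            // invokeAll blocks until every task has finished (or failed).
            pool.invokeAll(tasks);
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in fetcher (an assumption for this demo): no network access.
        Function<String, String> fakeFetcher = url -> "<html>" + url + "</html>";
        List<String> seeds = List.of("http://example.com/a", "http://example.com/b");

        Map<String, String> pages = fetchAll(seeds, fakeFetcher, 4);
        pages.forEach((url, body) ->
                System.out.println(url + " -> " + body.length() + " chars"));
    }
}
```

A thread pool like this only scales to one machine. The distributed side of the list above (MapReduce, the Hadoop filesystem) is what lets the same kind of work be spread across several machines, and that part is better left to a framework than written by hand.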
Nikolaos
+1  A: 

Nutch is unbeatable. Another, simpler library that I have used successfully in projects is https://crawler.dev.java.net/. You can find examples at https://crawler.dev.java.net/samples.html.

Christian Ullenboom