Is a parallel or distributed system a good fit for a website crawler and web indexer developed in Java? If so, which frameworks are available?
+6
A:
One of the best crawler/indexer combos you'll ever find for Java is Nutch, which is an Apache project now (see Wiki) and thus open source.
Features:
- Fetching, parsing, and indexing, in parallel and/or distributed
- Plugins: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags)
- Ontology
- Clustering
- MapReduce
- Distributed filesystem (via Hadoop)
- Link-graph database
- NTLM authentication (Windows/Exchange/etc)
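To give a feel for what "fetching in parallel" means, here is a minimal sketch using only the JDK (Java 9+). It is not Nutch's API and not how Nutch is implemented. The class name ParallelCrawlSketch, the crawl method, and the fetchLinks function are all made up for illustration, and the "web" is a small in-memory map standing in for real HTTP fetching and HTML parsing.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: a breadth-first crawl that fetches each level in parallel.
class ParallelCrawlSketch {

    // fetchLinks is an assumed stand-in for "download the page and extract its links".
    static Set<String> crawl(String seed,
                             Function<String, List<String>> fetchLinks,
                             int threads)
            throws InterruptedException, ExecutionException {
        Set<String> seen = new HashSet<>();   // only touched by the calling thread
        seen.add(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> frontier = List.of(seed);
            while (!frontier.isEmpty()) {
                // One task per URL in the current level.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> fetchLinks.apply(url));
                }
                // invokeAll runs the tasks on the pool and waits for all of them.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (seen.add(link)) {   // true only for URLs not seen before
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // Toy link graph instead of real pages (assumption for the example).
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "c", List.of("a"),
                "d", List.of());
        Set<String> pages = crawl("a", url -> web.getOrDefault(url, List.of()), 4);
        System.out.println(new TreeSet<>(pages)); // prints [a, b, c, d]
    }
}
```

This only parallelizes within one JVM. Spreading the work across machines, as in the MapReduce and Hadoop items above, is where a framework like Nutch saves you from writing the coordination yourself.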
Nikolaos
2010-08-01 10:13:38
+1
A:
Nutch is unbeatable. Another, simpler library that I have used successfully in projects is https://crawler.dev.java.net/. You can find examples at https://crawler.dev.java.net/samples.html.
Christian Ullenboom
2010-08-01 10:37:40