views: 204

answers: 1

I'm doing work on information extraction, and I need a tool to crawl data from web pages. Is there a popular one for Windows?

+8  A: 

From: http://en.wikipedia.org/wiki/Web_crawler:

  • Aspseek is a crawler, indexer, and search engine written in C and licensed under the GPL.
  • arachnode.net is a .NET web crawler written in C# using SQL 2008 and Lucene.
  • DataparkSearch is a crawler and search engine released under the GNU General Public License.
  • GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
  • GRUB is an open source distributed search crawler that Wikia Search ( http://wikiasearch.com ) uses to crawl the web.
  • Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
  • ht://Dig includes a Web crawler in its indexing engine.
  • HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
  • ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl Web sites based on Web-site Parse Templates using the computer's free CPU resources only.
  • mnoGoSearch is a crawler, indexer, and search engine written in C and licensed under the GPL.
  • Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text-indexing package.
  • Pavuk is a command-line Web mirror tool with an optional X11 GUI crawler, released under the GPL. It has a bunch of advanced features compared to Wget and HTTrack, e.g., regular-expression-based filtering and file creation rules.
  • YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).
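If you only need something small and scriptable rather than one of the full tools above, the basic idea (fetch a page, pull out its links, follow them breadth-first within the same site) can be sketched with just the Python standard library. This is an illustrative sketch only, not a substitute for the crawlers listed; the start URL and page limit are placeholders:

    # Minimal same-host crawler sketch using only the Python standard library.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen


    class LinkParser(HTMLParser):
        """Collects the href values of all <a> tags on a page."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def crawl(start_url, max_pages=10):
        """Breadth-first crawl restricted to the start URL's host."""
        host = urlparse(start_url).netloc
        queue = [start_url]
        seen = set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception as exc:
                print(f"skip {url}: {exc}")
                continue
            print(f"fetched {url} ({len(html)} bytes)")
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host:
                    queue.append(absolute)
        return seen


    if __name__ == "__main__":
        crawl("http://example.com")  # hypothetical start URL

Real crawlers add the parts this sketch omits: robots.txt handling, politeness delays, deduplication of near-identical URLs, and persistent storage, which is exactly what the tools above provide.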

And some reading: Spidering Hacks: 100 Industrial-Strength Tips & Tools:

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content.

The MYYN