ansaurus

Question

best library to do web-scraping

Answer 1

A:

What language do you want to use?

curl with awk might be all you need.

Silas 2008-09-15 21:20:09

Answer 2

+1 A:

Check out this question for all your answers.

Dillie-O 2008-09-15 21:21:56

Answer 3

A:

You can use tidy to convert it to XHTML, and then use whatever XML processing facilities your language of choice has available.
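
A minimal Python sketch of the second half of that pipeline, assuming tidy has already produced well-formed XHTML. The `XHTML` sample and the `extract_links` helper are made up for illustration. Once the markup is well-formed, the standard library's XML parser can query it directly:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: roughly what tidy's XHTML output might look like.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Menu</title></head>
  <body>
    <p><a href="/starters">Starters</a></p>
    <p><a href="/mains">Mains</a> and <a name="top">an anchor without href</a></p>
  </body>
</html>"""

# XHTML elements live in this namespace, so queries must be qualified with it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_links(xhtml_text):
    """Return (href, link text) pairs for every <a> element that has an href."""
    root = ET.fromstring(xhtml_text)
    links = []
    for a in root.iterfind(".//x:a[@href]", NS):
        links.append((a.get("href"), "".join(a.itertext()).strip()))
    return links


if __name__ == "__main__":
    for href, text in extract_links(XHTML):
        print(href, text)
```

The namespace mapping is the part that is easy to miss: without it, a query for plain `a` finds nothing in XHTML.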

Jim 2008-09-15 21:22:30

Answer 4

+1 A:

The Perl WWW::Mechanize library is excellent for doing the donkey work of interacting with a website to get to the actual page you need.

Paul Dixon 2008-09-15 21:22:49

Answer 5

A:

I'd recommend BeautifulSoup. It isn't the fastest, but it copes really well with the malformed (X)HTML that most parsers choke on.
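
To illustrate what lenient parsing buys you, here is a small sketch that uses Python's standard-library `html.parser` as a stand-in. It is not BeautifulSoup's API, and the `MESSY_HTML` sample and `collect_links` helper are invented for the example. An event-based parser like this reports tags as it sees them, so unbalanced markup does not stop it:

```python
from html.parser import HTMLParser

# Hypothetical sample with the kind of damage real pages have:
# unclosed <p> and <li> tags, an unquoted attribute, a stray </div>.
MESSY_HTML = """<html><body>
<p>Lunch menus
<ul>
  <li><a href=cafe-one.html>Cafe One</a>
  <li><a href="/cafe-two">Cafe Two
</ul>
</div>
</body>"""


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, whether or not tags are balanced."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def collect_links(html_text):
    """Feed the whole document to the parser and return the hrefs it saw."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


if __name__ == "__main__":
    print(collect_links(MESSY_HTML))
```

A strict XML parser would reject this input at the first unclosed tag.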

klingon_programmer 2008-09-15 21:22:58

Answer 6

+1 A:

I would use LWP (Libwww for Perl). Here's a good little guide: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html

WWW::Scraper has docs here: http://cpan.uwinnipeg.ca/htdocs/Scraper/WWW/Scraper.html It can be useful as a base, though you'd probably want to create your own module that fits your restaurant mining needs.

LWP would give you a basic crawler for you to build on.

Hugh Buchanan 2008-09-15 21:24:55

Answer 7

+1 A:

I think the general answer here is to use any language + HTTP library + HTML/XPath parser. I find that using Ruby + Hpricot gives a nice clean solution:

require 'rubygems'
require 'hpricot'
require 'open-uri'

sites = %w(http://www.google.com http://www.stackoverflow.com)

sites.each do |site|
  doc = Hpricot(open(site))

  # iterate over each div in the document (or use xpath to grab whatever you want)
  (doc/"div").each do |div|
    # do something with divs here
  end
end

For more on Hpricot see http://code.whytheluckystiff.net/hpricot/
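
The same "HTTP library + HTML parser" split can be sketched in Python using only the standard library. This is an illustration of the general pattern, not a port of Hpricot; `count_divs`, `fetch`, and `scrape` are hypothetical names, and counting divs stands in for "do something with divs here":

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class DivCounter(HTMLParser):
    """Count <div> start tags as they are encountered."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "DIV" and "div" both match.
        if tag == "div":
            self.count += 1


def count_divs(html_text):
    """Parsing step: return the number of divs in an HTML string."""
    parser = DivCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


def fetch(url, timeout=10):
    """HTTP step: download a page and decode it (UTF-8 assumed for simplicity)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(urls, fetcher=fetch):
    """Map each URL to the number of divs on its page."""
    return {url: count_divs(fetcher(url)) for url in urls}


if __name__ == "__main__":
    print(scrape(["http://www.google.com", "http://www.stackoverflow.com"]))
```

Passing the fetcher in as a parameter keeps the parsing logic testable without network access.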

Drew Olson 2008-09-15 21:28:17

Answer 8

+5 A:

The HTML Agility Pack is awesome for .NET programmers. It turns web pages into XML docs that can be queried with XPath.

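HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // FixLink is your own helper that returns the rewritten URL.
    att.Value = FixLink(att);
}
doc.Save("file.htm");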

You can find it here: http://www.codeplex.com/htmlagilitypack

Mike 2008-09-15 21:30:23

Answer 9

A:

Is there something similar to these libraries in .NET?

gyurisc 2008-09-15 21:30:29

You can use Tidy with .NET, just look on the Tidy homepage I linked to for the link.

Jim 2008-09-15 21:42:59

Answer 10

+2 A:

I personally like the WWW::Mechanize Perl module for these kinds of tasks. It gives you an object that is modeled after a typical web browser: you can follow links, fill out forms, or use the "back button" by calling methods on it.

For the extraction of the actual content, you could then hook it up to HTML::TreeBuilder to transform the website you're currently visiting into a tree of HTML::Element objects, and extract the data you want (the look_down() method of HTML::Element is especially useful).
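
The tree-then-search idea can be sketched in Python with the standard library. This is only a loose analogue: the `Element` class, its `look_down` method, and `parse` below are written for this example and do not reproduce the Perl modules' actual interfaces.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical sample page fragment.
SAMPLE = ('<div id="menu"><p>Soup<br>Bread</p><p class="price">4.50</p></div>'
          '<p>footer</p>')


class Element:
    """A tiny tree node holding a tag, its attributes, and ordered children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Element objects and text strings

    def text(self):
        """Concatenate all text below this node."""
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)

    def look_down(self, tag=None, **attrs):
        """Return matching nodes (self included) in document order."""
        found = []
        if (tag is None or self.tag == tag) and all(
                self.attrs.get(k) == v for k, v in attrs.items()):
            found.append(self)
        for child in self.children:
            if isinstance(child, Element):
                found.extend(child.look_down(tag, **attrs))
        return found


class TreeBuilder(HTMLParser):
    """Build an Element tree from parser events using a stack of open nodes."""

    def __init__(self):
        super().__init__()
        self.root = Element("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Element(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this tag; ignore strays.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html_text):
    """Parse an HTML string and return the synthetic root Element."""
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


if __name__ == "__main__":
    menu = parse(SAMPLE).look_down("div", id="menu")[0]
    print([p.text() for p in menu.look_down("p")])
```

Because `class` is a Python keyword, a class filter has to be passed as `look_down(**{"class": "price"})`.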

8jean 2008-09-15 21:32:38

Answer 11

+5 A:

If using python, take a good look at Beautiful Soup (http://crummy.com/software/BeautifulSoup).

An extremely capable library, makes scraping a breeze.

2008-09-15 21:41:48

Answer 12

+1 A:

There have been a number of answers recommending Perl Mechanize, but I think that Ruby Mechanize (very similar to Perl's version) is even better. It handles some things like forms in a much cleaner way syntactically. Also, there are a few frontends which run on top of Ruby Mechanize which make things even easier.

Daniel Spiewak 2008-09-15 21:43:26

Answer 13

+1 A:

I personally find http://github.com/shuber/curl/tree/master and http://simplehtmldom.sourceforge.net/ awesome for use in my PHP spidering/scraping projects.

hamstar 2009-02-26 10:09:13

Many websites won't allow curl; it gives a permission denied error.
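
One common cause, though not the only one, is that the server rejects requests whose User-Agent header is missing or names a command-line tool. A hedged Python sketch of sending an explicit, descriptive User-Agent with the standard library follows; the agent string and contact URL are placeholders, and whether this helps depends entirely on the site:

```python
from urllib.request import Request, urlopen

# Placeholder identity; replace with something that describes your own client.
USER_AGENT = "example-scraper/0.1 (+http://example.com/contact)"


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Send the request; urlopen raises urllib.error.HTTPError on 403 and similar."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```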

zengr 2010-10-24 20:17:00

Answer 14

A:

What others have said: use any language.

As long as you have a good parser library and a good HTTP library, you are set.

Building a full tree is slower than just using a good parser library.

2009-03-05 23:56:03
