views: 1748
answers: 14

I would like to get data from different webpages, such as addresses of restaurants or dates of events for a given location. What is the best library I can use for extracting this data from a given set of sites?

A: 

What language do you want to use?

curl with awk might be all you need.

Silas
+1  A: 

Check out this question for all your answers.

Dillie-O
A: 

You can use HTML Tidy to convert the page to XHTML, and then use whatever XML processing facilities your language of choice has available.
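
A minimal sketch of that approach in Perl, assuming the external tidy program plus the LWP::Simple, File::Temp, and XML::LibXML modules are installed; the URL, the tidy flags, and the module choices are my own illustration, not part of the answer:

use strict;
use warnings;
use LWP::Simple qw(get);
use File::Temp qw(tempfile);
use XML::LibXML;

# fetch the page (placeholder URL) and write it to a temporary file
my $html = get('http://www.example.com/') or die "fetch failed";
my ($fh, $tmpfile) = tempfile(SUFFIX => '.html');
print {$fh} $html;
close $fh;

# -asxhtml converts the page to XHTML; -q suppresses tidy's commentary
my $xhtml = `tidy -asxhtml -q --show-warnings no $tmpfile`;

# parse as XML; local-name() sidesteps the XHTML namespace issue
my $doc = XML::LibXML->load_xml(string => $xhtml, recover => 1);
for my $link ($doc->findnodes('//*[local-name()="a"]')) {
    print $link->getAttribute('href') || '', "\n";
}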

Jim
+1  A: 

The Perl WWW::Mechanize library is excellent for doing the donkey work of interacting with a website to get to the actual page you need.
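
A minimal sketch of that donkey work, assuming WWW::Mechanize is installed; the URL and the link text are placeholders:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.example.com/restaurants');   # placeholder URL

# follow a link by its visible text (placeholder text)
$mech->follow_link(text => 'Listings');

# the fetched page is now available for whatever extraction you like
print $mech->uri(), "\n";
print $_->url(), "\n" for $mech->links();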

Paul Dixon
A: 

I'd recommend BeautifulSoup. It isn't the fastest, but it copes very well with the malformed (X)HTML pages that most parsers choke on.

klingon_programmer
+1  A: 

I would use LWP (libwww-perl). Here's a good little guide: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html

WWW::Scraper has docs at http://cpan.uwinnipeg.ca/htdocs/Scraper/WWW/Scraper.html and can be useful as a base; you'd probably want to create your own module that fits your restaurant-mining needs.

LWP gives you a basic crawler to build on.
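
A minimal LWP starting point, assuming LWP::UserAgent is installed; the URL and the agent string are placeholders:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(agent => 'MyScraper/0.1');   # placeholder agent string
my $response = $ua->get('http://www.example.com/');       # placeholder URL

if ($response->is_success) {
    my $html = $response->decoded_content;
    # hand $html to your parser of choice here
    print length($html), " bytes fetched\n";
}
else {
    die 'Fetch failed: ' . $response->status_line;
}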

Hugh Buchanan
+1  A: 

I think the general answer here is to use any language + an HTTP library + an HTML/XPath parser. I find that Ruby + Hpricot gives a nice clean solution:

require 'rubygems'
require 'hpricot'
require 'open-uri'

sites = %w(http://www.google.com http://www.stackoverflow.com)

sites.each do |site|
  doc = Hpricot(open(site))

  # iterate over each div in the document (or use xpath to grab whatever you want)
  (doc/"div").each do |div|
    # do something with divs here
  end
end

For more on Hpricot see http://code.whytheluckystiff.net/hpricot/

Drew Olson
+5  A: 

The HTML Agility Pack is awesome for .NET programmers. It parses web pages into XML-like documents that can be queried with XPath.

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// select every <a> element that has an href attribute
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);   // FixLink is your own link-rewriting method
}
doc.Save("file.htm");

You can find it here: http://www.codeplex.com/htmlagilitypack

Mike
A: 

Is there something similar to these libraries in .NET?

gyurisc
You can use Tidy with .NET; just look on the Tidy homepage I linked to for the link.
Jim
+2  A: 

I personally like the WWW::Mechanize Perl module for these kinds of tasks. It gives you an object that is modeled after a typical web browser (i.e. you can follow links, fill out forms, or use the "back button" by calling methods on it).

For the extraction of the actual content, you could then hook it up to HTML::TreeBuilder to transform the website you're currently visiting into a tree of HTML::Element objects, and extract the data you want (the look_down() method of HTML::Element is especially useful).
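
A minimal sketch of that combination, assuming both modules are installed; the URL and the "address" div class are purely illustrative placeholders:

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.example.com/restaurants');   # placeholder URL

# build a tree of HTML::Element objects from the fetched page
my $tree = HTML::TreeBuilder->new_from_content($mech->content());

# look_down() finds elements matching the given criteria;
# the 'address' class is a made-up example
for my $node ($tree->look_down(_tag => 'div', class => 'address')) {
    print $node->as_text(), "\n";
}

$tree->delete();   # free the parse tree when done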

8jean
+5  A: 

If you're using Python, take a good look at Beautiful Soup (http://crummy.com/software/BeautifulSoup).

It's an extremely capable library and makes scraping a breeze.

+1  A: 

There have been a number of answers recommending Perl's Mechanize, but I think that Ruby Mechanize (very similar to Perl's version) is even better. It handles some things, like forms, in a much cleaner way syntactically. Also, there are a few frontends that run on top of Ruby Mechanize and make things even easier.

Daniel Spiewak
+1  A: 

I personally find http://github.com/shuber/curl/tree/master and http://simplehtmldom.sourceforge.net/ awesome for use in my PHP spidering/scraping projects.

hamstar
Many websites won't allow curl; it gives a permission denied error.
zengr
A: 

What others have said: use any language.

As long as you have a good parser library and an HTTP library, you are set.

The tree-building approaches are slower than just using a good parsing library.