I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.

The story so far:

Update: If you want to see a scraper app, check out Grant's SO user page monitor. Nifty!

+3  A: 

I would first find out whether the site(s) in question provide an API or RSS feeds for accessing the data you require.

GateKiller
+19  A: 

The Ruby world's equivalent to Beautiful Soup is why_the_lucky_stiff's Hpricot.

Joey deVilla
This link is dead since why_the_lucky_stiff's disappearance from the internet.
Oliver N.
here it is: http://wiki.github.com/hpricot/hpricot/
Sney
+1  A: 

Regular expressions work pretty well for HTML scraping as well ;-) Though after looking at Beautiful Soup, I can see why this would be a valuable tool.

pix0r
Regular expressions? [The center cannot hold it is too late](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)
Andrew Grimm
+1  A: 

You probably have something like this already, but I think this is what you are trying to do:

from __future__ import with_statement
import re, os

# Fetch the profile page, sending the session cookie so we get the logged-in view
os.system('wget --no-cookies --header "Cookie: soba=(SeCreTCODe)" http://stackoverflow.com/users/30/myProfile.html')

profile = ""
with open("myProfile.html") as f:  # 'with' closes the file for us
    for line in f:
        profile += line

p = re.compile(r'summarycount">(\d+)</div>')  # Rep is found here
m = p.search(profile)
print m.group(1)

# Speak the result aloud, then clean up the downloaded file
os.system("espeak \"Rep is at " + m.group(1) + " points\"")
os.remove("myProfile.html")
Grant
+4  A: 

For Perl, there's WWW::Mechanize.

superjoe30
+3  A: 

I use Hpricot on Ruby. As an example, this is a snippet of code that I use to retrieve all book titles from the six pages of my HireThings account (as they don't seem to provide a single page with this information):

pagerange = 1..6
proxy = Net::HTTP::Proxy(proxy, port, user, pwd)
proxy.start('www.hirethings.co.nz') do |http|
  pagerange.each do |page|
    resp, data = http.get "/perth_dotnet?page=#{page}" 
    if resp.class == Net::HTTPOK
      (Hpricot(data)/"h3 a").each { |a| puts a.innerText }
    end
  end
end 

It's pretty much complete. All that comes before this are library imports and the settings for my proxy.

Wolfbyte
+15  A: 

BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping, and I wish I had known about BeautifulSoup when I started. It's like the DOM with a lot more useful options, and it's a lot more Pythonic. If you want to try Ruby, they ported BeautifulSoup, calling it RubyfulSoup, but it hasn't been updated in a while.
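
For instance, here's a minimal sketch of that DOM-style navigation. The markup is invented, and the import assumes the BeautifulSoup 3 package that was current at the time:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; bs4 spells it 'from bs4 import BeautifulSoup'

html = '''<html><body>
<div class="post"><a href="/questions/1">First question</a></div>
<div class="post"><a href="/questions/2">Second question</a></div>
</body></html>'''

soup = BeautifulSoup(html)
for div in soup.findAll('div', {'class': 'post'}):  # search by tag name and attributes
    link = div.find('a')
    print link['href'], link.string                 # attribute access and the tag's text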

Other useful tools are HTMLParser or sgmllib.SGMLParser, which are part of the standard Python library. These work by calling methods every time you enter/exit a tag and encounter HTML text. They're like Expat if you're familiar with that. These libraries are especially useful if you are going to parse very large files, where building a DOM tree would be slow and expensive.
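
Here's a small sketch of that event-driven style, using Python 2's HTMLParser (the input markup is invented):

from HTMLParser import HTMLParser  # 'html.parser' in Python 3

class LinkCollector(HTMLParser):
    # Collects href attributes; no tree is ever built in memory.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):  # called once per opening tag
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

parser = LinkCollector()
parser.feed('<p><a href="/one">one</a> <a href="/two">two</a></p>')
print parser.links  # ['/one', '/two']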

Regular expressions aren't usually necessary on their own. BeautifulSoup accepts regular expressions, so if you need their power you can use them there. I say go with BeautifulSoup unless you need speed and a smaller memory footprint. If you find a better HTML parser for Python, let me know.

Cristian
+8  A: 

I found HTMLSQL to be a ridiculously simple way to screenscrape. It takes literally minutes to get results with it.

The queries are super-intuitive - like:

SELECT title from img WHERE $class == 'userpic'

There are now some other alternatives that take the same approach.

deadprogrammer
FYI, this is a PHP library
Tristan Havelick
+18  A: 

In the .NET world, I recommend the HTML Agility Pack. Not nearly as simple as some of the above options (like HTMLSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.

http://www.codeplex.com/htmlagilitypack

Jon Galloway
Combine LINQ with it and it seems more like HTMLSQL, no?
Bless Yahu
+2  A: 

I have used LWP and TreeBuilder with perl, and found them very useful.

LWP (short for libwww-perl) lets you connect to websites and scrape the HTML; you can get the module here, and the O'Reilly book seems to be online here.

TreeBuilder allows you to construct a tree from the HTML, and documentation and source are here.

There might still be too much heavy lifting to do with this kind of approach, though. I have not looked at the Mechanize module suggested by another answer, so I may well do that.

kaybenleroll
+4  A: 

Scraping Stack Overflow is especially easy with Shoes and Hpricot.

require 'hpricot'

Shoes.app :title => "Ask Stack Overflow", :width => 370 do
  SO_URL = "http://stackoverflow.com"
  stack do
    stack do
      caption "What is your question?"
      flow do
        @lookup = edit_line "stackoverflow", :width => "-115px"
        button "Ask", :width => "90px" do
          download SO_URL + "/search?s=" + @lookup.text do |s|
            doc = Hpricot(s.response.body)
            @rez.clear()
            (doc/:a).each do |l|
              href = l["href"]
              if href.to_s =~ /\/questions\/[0-9]+/ then
                @rez.append do
                  para(link(l.inner_text) { visit(SO_URL + href) })
                end
              end
            end
            @rez.show()
          end
        end
      end
    end
    stack :margin => 25 do
      background white, :radius => 20
      @rez = stack do
      end
    end
    @rez.hide()
  end
end
Frank Krueger
+2  A: 

I've used BeautifulSoup a lot with Python. It is much better than regular-expression checking, because it works like using the DOM even if the HTML is poorly formatted. You can quickly find HTML tags and text with simpler syntax than regular expressions. Once you find an element, you can iterate over it and its children, which is more useful for understanding the contents in code than it is with regular expressions. I wish BeautifulSoup had existed years ago when I had to do a lot of screen scraping. It would have saved me a lot of time and headache, since HTML structure was so poor before people started validating it.

Acuminate
+1  A: 

In Java, you can use TagSoup.

Peter Hilton
+1  A: 

http://scrubyt.org/ uses Ruby and Hpricot to do nice, easy web scraping. I wrote a scraper for my university's library service using this in about 30 minutes.

robintw
+3  A: 

Another option for Perl would be Web::Scraper, which is based on Ruby's Scrapi. In a nutshell, with nice, concise syntax, you can get robust scrapers that load data directly into data structures.

dpavlin
+1  A: 
JonnyGold
+2  A: 

I've had mixed results in .NET using SgmlReader which was originally started by Chris Lovett and appears to have been updated by MindTouch.

Shawn Miller
+2  A: 

I've had some success with HtmlUnit, in Java. It's a simple framework for writing unit tests on web UIs, but it's equally useful for HTML scraping.

Henry
+7  A: 

The Python lxml library acts as a Pythonic binding for the libxml2 and libxslt libraries. I particularly like its XPath support and the pretty-printing of the in-memory XML structure. It also supports parsing broken HTML.
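
As a small sketch of both features (the broken markup is invented):

from lxml import html, etree

broken = '<ul><li>one<li>two</ul>'  # unclosed <li> tags
doc = html.fromstring(broken)       # lxml repairs the markup while parsing

print doc.xpath('//li/text()')                # ['one', 'two']
print etree.tostring(doc, pretty_print=True)  # the repaired in-memory tree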

akaihola
+3  A: 

you would be a fool not to use perl.. here come the flames..

bone up on the following modules and ginsu any scrape around.

use LWP;
use HTML::TableExtract;
use HTML::TreeBuilder;
use HTML::Form;
use Data::Dumper;

-1 Sentences begin with a capital letter.
Andrew Grimm
@Andrew: That's a lousy reason for a downvote.
Charles Stewart
+5  A: 

I've had great success with the combination of HTML Agility Pack + Regex + XDocument (LINQ-to-XML stuff).

It's extremely powerful. Here's a blog post by Vijay Santhanam that got me hooked on it:

http://vijay.screamingpens.com/archive/2008/05/26/linq-amp-lambda-part-3-html-agility-pack-to-linq.aspx

scotta
link not working :(
bortao
+4  A: 

The templatemaker utility from Adrian Holovaty (of Django fame) uses a very interesting approach: You feed it variations of the same page and it "learns" where the "holes" for variable data are. It's not HTML specific, so it would be good for scraping any other plaintext content as well. I've used it also for PDFs and HTML converted to plaintext (with pdftotext and lynx, respectively).
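
Here's a sketch of that idea, following the example from the project's own announcement (treat the exact method names as an assumption on my part):

from templatemaker import Template

t = Template()
t.learn('<b>this and that</b>')  # feed variations of the same page...
t.learn('<b>alex and sue</b>')   # ...and it infers where the data varies

print t.as_text('!')  # '<b>! and !</b>' -- each '!' marks a learned hole
print t.extract('<b>larry and curly</b>')  # ('larry', 'curly')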

akaihola
How did you get templatemaker working for large HTML pages? I found it crashes when I give it anything non-trivial.
Plumo
I suppose I've had no large HTML pages. No filed issues seem to exist for that problem at http://code.google.com/p/templatemaker/issues/list, so it's probably appropriate to send a test case there. It doesn't look like Adrian is maintaining the library, though. I wonder what he uses nowadays at EveryBlock, since they surely do a lot of scraping.
akaihola
+2  A: 

Implementations of the HTML5 parsing algorithm: html5lib (Python, Ruby), Validator.nu HTML Parser (Java, JavaScript; C++ in development), Hubbub (C), Twintsam (C#; upcoming).
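
As a tiny sketch with the Python html5lib: it parses the page the way a browser would, repairing the markup as it goes (the input is invented):

import html5lib

# Returns an xml.etree element tree by default; <html>, <head> and <body>
# are filled in per the HTML5 algorithm even though the input omits them.
doc = html5lib.parse('<p>unclosed paragraph')
print doc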

hsivonen
+2  A: 

I've also had great success using Aptana's Jaxer + jQuery to parse pages. It's not as fast or 'script-like' in nature, but jQuery selectors + real JS/DOM is a lifesaver on more complicated (or malformed) pages.

kkubasik
+2  A: 

Another tool for .NET is MhtBuilder

GeekyMonkey
+5  A: 

'Simple HTML DOM Parser' is a good option for PHP; if you're familiar with jQuery or JavaScript selectors, you will find yourself at home.

Find it here

There is also a blog post about it here.

Orange Box
I second this one. You don't need to install mod_python, etc. into the web server just to make it work.
Brock Woolf
+2  A: 

I know and love Screen-Scraper.

screen-scraper is a tool for extracting data from websites. screen-scraper automates:

* Clicking links on websites
* Entering data into forms and submitting
* Iterating through search result pages
* Downloading files (PDF, MS Word, images, etc.)

Common uses:

* Download all products, records from a website
* Build a shopping comparison site
* Perform market research
* Integrate or migrate data

Technical:

* Graphical interface--easy automation
* Cross platform (Linux, Mac, Windows, etc.)
* Integrates with most programming languages (Java, PHP, .NET, ASP, Ruby, etc.)
* Runs on workstations or servers

Three editions of screen-scraper:

* Enterprise: The most feature-rich edition of screen-scraper. All capabilities are enabled.
* Professional: Designed to be capable of handling most common scraping projects.
* Basic: Works great for simple projects, but not nearly as many features as its two older brothers.
raiglstorfer
+5  A: 

Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:

  • mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages.
  • lxml: Python binding to libxml2. Supports various options to traverse and select elements (e.g. XPath and CSS selection).
  • scrapemark: high-level library using templates to extract information from HTML.
  • pyquery: allows you to make jQuery-like queries on XML documents (see the sketch after this list).
  • scrapy: a high-level scraping and web-crawling framework. It can be used to write spiders, for data mining, and for monitoring and automated testing.
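
As a quick sketch of the pyquery style mentioned above (the markup is invented):

from pyquery import PyQuery as pq

d = pq('<div><a href="/a">first</a><a href="/b">second</a></div>')
print d('a').eq(0).text()  # 'first'
for a in d('a'):           # iterating yields plain lxml elements
    print a.get('href'), a.text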
filippo
A: 

For more complex scraping applications, I would recommend the IRobotSoft web scraper. It is dedicated free software for screen scraping. It has a strong query language for HTML pages, and it provides a very simple web recording interface that frees you from much programming effort.

seagulf
A: 

I like Google Spreadsheets' ImportXML(url, xpath) function.

It will repeat cells down the column if your XPath returns more than one value.

You can have up to 50 ImportXML() functions on one spreadsheet.
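
For example, pasted into a cell (the URL and XPath are placeholders):

=ImportXML("http://stackoverflow.com/questions", "//h3/a")

This fills the column below the cell with the text of every matching node.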

RapidMiner's Web Plugin is also pretty easy to use. It can do POSTs, accepts cookies, and can set the user agent.

el chief