views: 432

answers: 5

Hello all,

I have a couple of websites that I want to extract data from, and based on previous experience, this isn't as easy as it sounds. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tags, etc.).

Considering that I have no constraints regarding the technology, language, or tool I can use, what are your suggestions for easily parsing and extracting data from HTML pages? I have tried HTML Agility Pack and BeautifulSoup, and even these tools aren't perfect (HTML Agility Pack is buggy, and BeautifulSoup's parsing engine doesn't work with the pages I am passing to it).

Thanks!

+2  A: 

You can use pretty much any language you like; just don't try to parse HTML with regular expressions.

So let me rephrase that: you can use any language you like that has an HTML parser, which is pretty much everything invented in the last 15-20 years.
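
For instance, a minimal sketch in Python using lxml (my addition, not part of the original answer; assumes lxml is installed), whose HTML parser recovers from missing closing tags:

import lxml.html

# lxml is lenient: it builds a repaired tree even from damaged markup.
broken = "<html><body><p>First paragraph<p>Second, never closed<div>stray"
doc = lxml.html.fromstring(broken)

# Query the repaired tree with XPath as if the markup had been valid.
for p in doc.xpath('//p'):
    print(p.text_content())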

If you're having issues with particular pages, I suggest you look into repairing them with HTML Tidy.
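
A hedged sketch of that repair step (again my addition, not from the answer; it assumes the pytidylib bindings and the underlying Tidy library are installed):

from tidylib import tidy_document

raw_html = "<html><body><p>Unclosed paragraph<div>Bad nesting</body>"

# tidy_document returns the repaired markup plus a report of what was fixed.
fixed_html, errors = tidy_document(raw_html, options={"output-xhtml": 1})

print(fixed_html)
print(errors)  # e.g. warnings about the missing </p> and </div>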

cletus
A: 

hpricot may be what you are looking for.

Colin Pickard
A: 

You may try PHP's DOMDocument class. It has a couple of methods for loading HTML content, and I usually make use of this class. My advice is to prepend a DOCTYPE declaration to the HTML in case it doesn't have one, and to inspect the HTML that results after parsing in Firebug. In some cases, where invalid markup is encountered, DOMDocument does a bit of rearrangement of the HTML elements. Also, if there's a meta tag specifying the charset inside the source, be aware that libxml will use it internally when parsing the markup. Here's a little example:

// Fetch the page and parse it, silencing libxml's warnings about
// malformed markup (DOMDocument still builds a repaired tree).
$html = file_get_contents('http://example.com');

$dom = new DOMDocument;
$oldValue = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($oldValue);

// Output the cleaned-up markup that DOMDocument produced.
echo $dom->saveHTML();
Ionuț G. Stan
+2  A: 

I think hpricot (linked by Colin Pickard) is ace. Add scrubyt to the mix and you get a great HTML scraping and browsing interface with the text-matching power of Ruby: http://scrubyt.org/

Here is some example code from http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb:

require 'rubygems'
require 'scrubyt'

# Simple example of scraping basic
# information from a public Twitter
# account.

# Scrubyt.logger = Scrubyt::Logger.new

twitter_data = Scrubyt::Extractor.define do
  fetch 'http://www.twitter.com/scobleizer'

  profile_info '//ul[@class="about vcard entry-author"]' do
    full_name "//li//span[@class='fn']"
    location "//li//span[@class='adr']"
    website "//li//a[@class='url']/@href"
    bio "//li//span[@class='bio']"
  end
end

puts twitter_data.to_xml
Stewart Robinson
I followed the instructions on their website and I'm not able to install scrubyt. Any idea?

C:\Windows\system32>gem install mechanize
Install required dependency hoe? [Yn] Y
ERROR: While executing gem ... (Gem::GemNotFoundException)
    Could not find hoe (>= 1.9.0) in any repository
Martin
You might want to try installing a previous version. Scrubyt has in the past been very specific about the versions of its dependencies. I have it working on Mac, not Windows, so I can't help too much there. As an aside, my compiler had to be on my PATH to install it.
Stewart Robinson
+1  A: 

I use Python and IE. On a lot of the websites I have had to scrape, there was JavaScript that hid the data from a regular HTTP request; the JavaScript had to run first. So I used IE to do all the work and just grabbed the data from IE.

Here is an example of what I did.

This script uses ishybrowser, which you can find here and here.
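
The original script isn't reproduced here, but the general technique looks something like this minimal sketch (my addition, not Erin's code): drive Internet Explorer through COM so the page's JavaScript runs, then read the rendered DOM. It assumes Windows with the pywin32 package installed.

import time
import win32com.client

# Launch IE via COM; the browser executes the page's JavaScript for us.
ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = False
ie.Navigate("http://example.com")

# Wait until IE has finished loading and running the page's scripts.
while ie.Busy or ie.ReadyState != 4:  # 4 == READYSTATE_COMPLETE
    time.sleep(0.5)

# The DOM now contains any JavaScript-generated content.
print(ie.Document.body.innerHTML)

ie.Quit()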

Erin