I'm currently trying to scrape a website that has fairly poorly formatted HTML: closing tags are often missing, and there are no classes or ids, so it's incredibly difficult to go straight to the element you want. I've been using BeautifulSoup with some success so far, but every once in a while (though quite rarely) I run into a page where BeautifulSoup builds the HTML tree a bit differently from, for example, Firefox or WebKit. This is understandable, since the HTML itself is ambiguous, but if I could get the same parse tree that Firefox or WebKit produces, I would be able to parse things much more easily. A typical problem: the site opens a <b> tag twice, and when BeautifulSoup sees the second <b> tag it immediately closes the first, while Firefox and WebKit nest the <b> tags.

Is there a web scraping library for Python (or any other language; I'm getting desperate) that can reproduce the parse tree generated by Firefox or WebKit, or at least get closer than BeautifulSoup in cases of ambiguity?
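
To make the ambiguity concrete, here is a minimal sketch using only the standard library's html.parser. It is a toy and not BeautifulSoup's or any browser's actual algorithm: the OutlineBuilder class, the outline helper, and the close_on_repeat flag are hypothetical names made up for illustration. It records the nesting depth of each start tag under the two strategies described above (nest the repeated tag, or treat it as an implicit close).

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Toy tree builder: records each start tag with its nesting depth."""

    def __init__(self, close_on_repeat):
        super().__init__()
        self.close_on_repeat = close_on_repeat
        self.stack = []    # tags that are currently open
        self.outline = []  # (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Strategy A: a repeated tag implicitly closes the open one.
        if self.close_on_repeat and self.stack and self.stack[-1] == tag:
            self.stack.pop()
        # Strategy B (close_on_repeat=False): simply nest the new tag.
        self.outline.append((len(self.stack), tag))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the most recent matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def outline(markup, close_on_repeat):
    builder = OutlineBuilder(close_on_repeat)
    builder.feed(markup)
    builder.close()
    return builder.outline


markup = "a<b>b<b>c"
nested = outline(markup, close_on_repeat=False)  # browser-like nesting
flat = outline(markup, close_on_repeat=True)     # second <b> closes the first
print(nested)
print(flat)
```

In the first result the second <b> sits one level deeper than the first; in the second result both sit at the same depth, which is the discrepancy I keep running into.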

+1  A: 

Well, WebKit is open source, so if any language is acceptable you could use its own parser (in the WebCore component).

Eli Bendersky
+1  A: 

You can drive a browser of your choice with Selenium RC.

Alex Martelli
+1  A: 

You may want to take a look at the mechanize module:

http://wwwsearch.sourceforge.net/mechanize/

Simone
+1  A: 

Ian Bicking once wrote that, surprisingly, lxml can be better at parsing tag soup than BeautifulSoup: http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/ (Just mentioning it for reference; I haven't tried it personally.)

Tomasz Zielinski
+1  A: 

pyWebKitGTK looks like it might be of some help.

Also, here is someone who had to do the same thing, but needed to export the content after JavaScript had run: execute JavaScript from Python using pyWebKitGTK.

pyWebKitGTK at the Cheese Shop.

You can also do this with PyQt.

Ryan Christensen
+5  A: 

Use BeautifulSoup as a tree builder for html5lib:

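# Needs an html5lib release that still ships the "beautifulsoup" tree builder.
from html5lib import HTMLParser, treebuilders

# Parse with the HTML5 algorithm, but build a BeautifulSoup tree.
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

text = "a<b>b<b>c"  # the second <b> is opened before the first is closed
soup = parser.parse(text)
print(soup.prettify())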

Output:

<html>
 <head>
 </head>
 <body>
  a
  <b>
   b
   <b>
    c
   </b>
  </b>
 </body>
</html>
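
If the "beautifulsoup" tree builder is not available in your html5lib install, the same nesting can be checked with html5lib's ElementTree tree builder. This is only a sketch, assuming html5lib is installed; the variable names are illustrative.

```python
import html5lib

markup = "a<b>b<b>c"

# Parse with the HTML5 algorithm into a standard-library ElementTree element.
# namespaceHTMLElements=False keeps tag names plain ("b", not "{...}b").
document = html5lib.parse(markup, treebuilder="etree",
                          namespaceHTMLElements=False)

body = document.find("body")
outer_b = body.find("b")     # first <b>, a direct child of <body>
inner_b = outer_b.find("b")  # second <b>, nested inside the first

print(body.text, outer_b.text, inner_b.text)
```

The second <b> is found as a child of the first rather than as its sibling, matching the nested output above.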
J.F. Sebastian
+1  A: 

Have you tried Scrapy?

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

schrodinger's code
A: 

From the documentation, it seems that the ICantBelieveItsBeautifulSoup parser is what you want:

ICantBelieveItsBeautifulSoup is also a subclass of BeautifulSoup. It has HTML heuristics that conform more closely to the HTML standard, but ignore how HTML is used in the real world. For instance, it's valid HTML to nest <B> tags, but in the real world a nested <B> tag almost always means that the author forgot to close the first <B> tag. If you run into someone who actually nests <B> tags, then you can use ICantBelieveItsBeautifulSoup.

brofield