ansaurus

Question

Web scraping with Python

Answer 1

+1 A:

Well, WebKit is open source, so you could use its own parser (in the WebCore component), if any language is acceptable.

Eli Bendersky 2010-03-07 18:12:48

Answer 2

+1 A:

You can drive a browser of your choice with Selenium RC.

Alex Martelli 2010-03-07 18:18:06

Answer 3

+1 A:

You may want to take a look at the mechanize module:

http://wwwsearch.sourceforge.net/mechanize/

Simone 2010-03-07 19:14:11

Answer 4

+1 A:

Ian Bicking once wrote that, surprisingly, lxml can be better at parsing tag soup than BeautifulSoup: http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/ (Mentioned for reference only; I haven't tried it personally.)
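A minimal sketch of what scraping sloppy markup with lxml might look like. The sample markup and the helper name `extract_links` are made up for illustration, and lxml is assumed to be installed; this is not taken from the linked article.

```python
import lxml.html

# Hypothetical sample: the <p>, <li> and <ul> tags are never closed.
SLOPPY_HTML = """
<html><body>
<p>Unclosed paragraph
<ul>
<li><a href="/first">First</a>
<li><a href="/second">Second</a>
</body>
"""


def extract_links(markup):
    """Return (href, link text) pairs for every <a href=...> in the markup."""
    doc = lxml.html.fromstring(markup)  # tolerant HTML parser, not strict XML
    return [
        (a.get("href"), a.text_content().strip())
        for a in doc.xpath("//a[@href]")
    ]


for href, label in extract_links(SLOPPY_HTML):
    print(href, label)
```

The XPath query runs against whatever tree the parser recovers, so the missing close tags do not need to be repaired by hand first.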

Tomasz Zielinski 2010-03-07 19:22:25

Answer 5

+1 A:

pyWebKitGTK looks like it might be of some help.

Also, here is someone who had to do the same thing but needed to export the content after the JavaScript had run: execute JavaScript from Python using pyWebKitGTK.

pyWebKitGTK at the Cheese Shop.

You can also do this with PyQt.

Ryan Christensen 2010-03-07 19:47:34

Answer 6

+5 A:

Use BeautifulSoup as a tree builder for html5lib:

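from html5lib import HTMLParser, treebuilders

# Have html5lib parse like a browser, but build a BeautifulSoup tree
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

text = "a<b>b<b>c"  # two <b> tags, neither of them closed
soup = parser.parse(text)
print(soup.prettify())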

Output:

<html>
 <head>
 </head>
 <body>
  a
  <b>
   b
   <b>
    c
   </b>
  </b>
 </body>
</html>
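For setups where the "beautifulsoup" tree builder is not available, the same nesting can probably be inspected through html5lib's etree tree builder instead. This is a sketch under that assumption; `parse_to_etree` is a hypothetical helper name, not part of html5lib.

```python
import html5lib


def parse_to_etree(markup):
    """Parse markup with html5lib and return the <html> element."""
    # namespaceHTMLElements=False keeps tag names plain ("b", not "{ns}b"),
    # which makes find() calls shorter.
    return html5lib.parse(
        markup, treebuilder="etree", namespaceHTMLElements=False
    )


root = parse_to_etree("a<b>b<b>c")
body = root.find("body")
outer_b = body.find("b")      # first <b>, direct child of <body>
inner_b = outer_b.find("b")   # second <b>, nested inside the first
print(body.text, outer_b.text, inner_b.text)
```

The second `<b>` ends up as a child of the first, matching the nested structure in the output above.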

J.F. Sebastian 2010-03-07 23:23:04

Answer 7

+1 A:

Have you tried Scrapy?

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

schrodinger's code 2010-03-28 10:56:28

Answer 8

A:

From the documentation, it seems that the ICantBelieveItsBeautifulSoup parser is what you want:

ICantBelieveItsBeautifulSoup is also a subclass of BeautifulSoup. It has HTML heuristics that conform more closely to the HTML standard, but ignore how HTML is used in the real world. For instance, it's valid HTML to nest <B> tags, but in the real world a nested <B> tag almost always means that the author forgot to close the first <B> tag. If you run into someone who actually nests <B> tags, then you can use ICantBelieveItsBeautifulSoup.
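To make the two readings of a nested <B> concrete, here is a toy sketch using only the standard library (Python 3). It is not how BeautifulSoup is implemented; `BoldDepthParser` and `max_bold_depth` are hypothetical names, and the "real-world" heuristic is simplified to "a second <b> replaces the one already open".

```python
from html.parser import HTMLParser


class BoldDepthParser(HTMLParser):
    """Track how deeply <b> tags nest under one of two interpretations."""

    def __init__(self, allow_nesting):
        super().__init__()
        self.allow_nesting = allow_nesting
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag != "b":  # html.parser lowercases tag names
            return
        if self.allow_nesting or self.depth == 0:
            self.depth += 1
        # Otherwise the new <b> is treated as implicitly closing the open
        # one, so the depth stays at 1.
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1


def max_bold_depth(markup, allow_nesting):
    """Return the deepest <b> nesting seen under the chosen interpretation."""
    parser = BoldDepthParser(allow_nesting)
    parser.feed(markup)
    parser.close()
    return parser.max_depth


print(max_bold_depth("a<b>b<b>c", allow_nesting=True))   # standard-conforming reading
print(max_bold_depth("a<b>b<b>c", allow_nesting=False))  # "forgot to close" reading
```

With `allow_nesting=True` the second tag counts as a genuine nested element, which is the behaviour the quoted documentation attributes to ICantBelieveItsBeautifulSoup.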

brofield 2010-04-19 05:14:03
