views: 632
answers: 7

Does Python have screen scraping libraries that offer JavaScript support?

I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests requiring JavaScript support.

Ideally I would like to be able to do everything from Python, but I haven't come across any libraries that would allow me to do it. Do they exist?

A: 

I have not found anything for this. I use a combination of BeautifulSoup and custom routines...

Art
A: 

You can try spidermonkey, Mozilla's JavaScript engine; there are Python bindings for it.
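
A minimal sketch of evaluating JavaScript from Python with the python-spidermonkey bindings (assuming that package is installed; note it only executes scripts, it won't give you a browser DOM):

import spidermonkey

rt = spidermonkey.Runtime()
cx = rt.new_context()

# Run a JavaScript snippet lifted from a scraped page
result = cx.execute("var x = 19; x + 23;")
print result  # 42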

ghostdog74
+6  A: 

Beautiful Soup is still probably your best bet.

If you need "JavaScript support" for the purpose of intercepting Ajax requests then you should use some sort of capture too (such as YATT) to monitor what those requests are, and then emulating / parsing them.

If you need "JavaScript support" in order to be able to see what the end result of a page with static JavaScript is, then my first choice would be to try and figure out what the JavaScript is doing on a case-by-case basis (e.g. if the JavaScript is doing something based on some Xml, then just parse the Xml directly instead)

If you really want "JavaScript support" (as in you want to see what the HTML is after scripts have run on a page), then I think you will probably need to create an instance of some browser control, read the resulting HTML/DOM back from the browser control once it's finished loading, and parse it normally with Beautiful Soup. That would be my last resort, however.

Kragen
+2  A: 

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Here you go: http://scrapy.org/
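
A minimal spider is only a few lines (a sketch against the Scrapy API; the spider name, start URL and selectors here are illustrative):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Pull structured records out of the page with CSS selectors
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

Save it as quotes_spider.py and run it with scrapy runspider quotes_spider.py -o quotes.json.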

Lakshman Prasad
A: 

BeautifulSoup, Scrapy and lxml are the most common options. I personally use lxml for all my XML/HTML needs: it wraps C libraries (libxml2 and libxslt), so it's fast and has full XPath support, but that same C dependency can make it harder to install. It's mostly a matter of taste.
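
For example (a sketch; the URL is illustrative):

import urllib2
from lxml import html

tree = html.fromstring(urllib2.urlopen('http://example.com/').read())

# XPath queries run directly against the parsed tree
for href in tree.xpath('//a/@href'):
    print href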

Lennart Regebro
A: 

I've had success before with BeautifulSoup (a great HTML parsing library, particularly if you're parsing poorly-written HTML) used in combination with Mechanize (which really simplifies HTTP requests).
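
A sketch of that combination (the URL and form field name are hypothetical; the import below is the old BeautifulSoup 3 style of the time, on newer setups use from bs4 import BeautifulSoup):

import mechanize
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

br = mechanize.Browser()
br.set_handle_robots(False)
br.open('http://example.com/search')

# Fill in and submit the first form on the page
br.select_form(nr=0)
br['q'] = 'screen scraping'  # hypothetical field name
result = br.submit()

soup = BeautifulSoup(result.read())
for link in soup.findAll('a'):
    print link.get('href')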

Donald Harvey
+4  A: 

There are many options when dealing with static HTML, which the other responses cover. However, if you need JavaScript support and want to stay in Python, I recommend using WebKit to render the webpage (including the JavaScript) and then examining the resulting HTML. For example:

import sys
import signal
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL in WebKit, run its JavaScript, and capture the final HTML."""

    def __init__(self, url):
        # A QApplication must exist before any other Qt objects are created
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.html = None
        # Let Ctrl-C interrupt the Qt event loop
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.connect(self, SIGNAL('loadFinished(bool)'), self._finished_loading)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _finished_loading() calls quit()

    def _finished_loading(self, result):
        # The DOM as serialised by WebKit after all scripts have run
        self.html = self.mainFrame().toHtml()
        self.app.quit()


if __name__ == '__main__':
    try:
        url = sys.argv[1]
    except IndexError:
        print 'Usage: %s url' % sys.argv[0]
    else:
        javascript_html = Render(url).html
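
The rendered page is now in javascript_html, ready to be parsed with Beautiful Soup or lxml like any other static document.
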
Plumo