Hey all,

Can anyone point me to a good Python screen-scraping library that handles JavaScript (ideally one with good documentation and tutorials)? I'd like to know what options are out there, but above all I'm after whichever is easiest to learn and gives the fastest results, so I'm wondering if anyone has experience with this. I've heard a bit about SpiderMonkey, but maybe there are better ones?

Specifically, I use BeautifulSoup and Mechanize to get as far as the link below, but I need a way to open the JavaScript popup, submit data, and download and parse the results shown in the popup.

<a href="javascript:openFindItem(12510109)" onclick="s_objectID=&quot;javascript:openFindItem(12510109)_1&quot;;return this.s_oc?this.s_oc(e):true">Find Item</a>

I'd like to implement this with Google App Engine and Django. Thanks!

+1  A: 

What I usually do is automate an actual browser in these cases, and grab the processed HTML from there.

Edit:

Here's an example of automating Internet Explorer to navigate to a URL and print the page title and location once the page has loaded.

import time
from ctypes import byref, windll
from ctypes.wintypes import MSG  # correctly sized on both 32-bit and 64-bit Windows

import win32con
from win32com.client import Dispatch

READYSTATE_COMPLETE = 4  # the browser has finished loading the document


def wait_until_ready(ie):
    """Pump pending window messages until the browser finishes loading."""
    msg = MSG()
    user32 = windll.user32

    while True:
        # Dispatch queued messages so the browser keeps making progress while we wait.
        while user32.PeekMessageW(byref(msg), None, 0, 0, win32con.PM_REMOVE) != 0:
            user32.TranslateMessage(byref(msg))
            user32.DispatchMessageW(byref(msg))

        if ie.ReadyState == READYSTATE_COMPLETE:
            break

        time.sleep(0.05)  # avoid spinning the CPU between checks


ie = Dispatch("InternetExplorer.Application")
ie.Visible = True
ie.Navigate("http://google.com/")
wait_until_ready(ie)

print("title:", ie.Document.Title)
print("location:", ie.Document.location)
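
Once you have the processed HTML, you still need to find the links to follow. Here is a minimal sketch of pulling the item IDs out of `javascript:openFindItem(...)` links like the one in the question. It uses only the standard library; the names `FindItemLinkParser` and `find_item_ids` are made up for this example, and it assumes every link of interest has exactly that `href` shape with a numeric argument.

```python
import re
from html.parser import HTMLParser

# Matches hrefs such as "javascript:openFindItem(12510109)" and captures the ID.
FIND_ITEM_RE = re.compile(r"^javascript:openFindItem\((\d+)\)$")


class FindItemLinkParser(HTMLParser):
    """Collects the numeric argument of every openFindItem(...) link."""

    def __init__(self):
        super().__init__()
        self.item_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # An href attribute with no value comes through as None.
        href = dict(attrs).get("href") or ""
        match = FIND_ITEM_RE.match(href.strip())
        if match:
            self.item_ids.append(match.group(1))


def find_item_ids(html):
    """Return the item IDs found in the given HTML, in document order."""
    parser = FindItemLinkParser()
    parser.feed(html)
    parser.close()
    return parser.item_ids


sample = '<a href="javascript:openFindItem(12510109)">Find Item</a>'
print(find_item_ids(sample))
```

With the IDs in hand, you can loop over them and drive the browser to each popup in turn. How the popup URL is built from an ID depends on what `openFindItem` does on the site, so that part is left out here.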
Ryan Ginstrom
Is this similar to Selenium? I've tried automating that way, but I'm having some trouble with the generated Python source code. I'd need to follow every JavaScript link of this type and download and parse the data from each one.
Diego
I just automate the browser directly. On Windows, you can do this with Internet Explorer, or in a cross-platform way with WebKit.
Ryan Ginstrom
+1  A: 

I use the Python bindings to WebKit to render basic JavaScript, and Chickenfoot for more advanced interactions. See this WebKit example for more info.

Plumo