Hi,

The bus company I use runs an awful website (Hebrew, English) which makes a simple "From A to B timetable today" query a nightmare. I suspect they are trying to encourage use of the costly SMS query system.

I'm trying to harvest the entire timetable from the site, by submitting the query for every possible point to every possible point, which would sum to about 10k queries. The query result appears in a popup window. I'm quite new to web programming, but familiar with the basic aspects of python.

  1. What's the most elegant way to parse the page, select a value from a drop-down menu, and press "submit" using a script?
  2. How do I give the program the contents of the new pop-up as input?

Thanks!

+6  A: 

Twill is a simple scripting language for Web browsing. It happens to sport a Python API.

twill is essentially a thin shell around the mechanize package. All twill commands are implemented in the commands.py file, and pyparsing does the work of parsing the input and converting it into Python commands (see parse.py). Interactive shell work and readline support is implemented via the cmd module (from the standard Python library).

An example of "pressing" submit from the above linked doc:

from twill.commands import go, showforms, formclear, fv, submit

go('http://issola.caltech.edu/~t/qwsgi/qwsgi-demo.cgi/')
go('./widgets')
showforms()

formclear('1')
fv("1", "name", "test")
fv("1", "password", "testpass")
fv("1", "confirm", "yes")
showforms()

submit('0')
gimel
+7  A: 

I would suggest you use mechanize. Here's a code snippet from their page that shows how to submit a form:


import re
from mechanize import Browser

br = Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print(br.title())
print(response1.geturl())
print(response1.info())  # headers
print(response1.read())  # body
response1.close()  # (shown for clarity; in fact Browser does this for you)

br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm (from ClientForm).
br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
response2 = br.submit()  # submit current form

# print currently selected form (don't call .submit() on this, use br.submit())
print(br.form)

Geo
+7  A: 

You very rarely want to actually "press the submit button"; it is usually simpler to make the GET or POST request to the handler resource directly. Look at the HTML where the form is, and see what parameters it's submitting to what URL, and whether it uses the GET or POST method. You can form these requests with urllib(2) easily enough.
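
As a rough sketch of that idea in Python 3 (where urllib2 lives on as urllib.request): the handler URL and the field names "from" and "to" below are made up for illustration, so substitute whatever the form's action and input names turn out to be. The code only builds the requests, one per ordered pair of stops, and does not send anything.

```python
from itertools import permutations
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical handler URL; read the real one from the form's "action" attribute.
FORM_ACTION = "http://www.example.com/timetable"


def build_query(origin, destination, method="GET"):
    """Build (but do not send) the request the form would submit.

    The field names "from" and "to" are assumptions; use the "name"
    attributes of the form's actual inputs.
    """
    params = urlencode({"from": origin, "to": destination})
    if method.upper() == "POST":
        # POST carries the encoded parameters in the request body.
        return Request(FORM_ACTION, data=params.encode("ascii"), method="POST")
    # GET carries them in the query string.
    return Request(FORM_ACTION + "?" + params, method="GET")


def all_queries(stops, method="GET"):
    """Return one request per ordered (origin, destination) pair of distinct stops."""
    return [build_query(a, b, method) for a, b in permutations(stops, 2)]
```

Passing one of these requests to urllib.request.urlopen() and calling .read() on the result gives you the response body. If the popup is simply the page the form submits to, that body is the popup's contents. With around 10k queries, a pause between requests (time.sleep) is probably a good idea.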

ironfroggy
The mechanize package saves you from much of the boring detail of "... see what parameters are submitting ...". Twill takes mechanize and provides a higher level of abstraction.
gimel