Hi, suppose I need to perform a set of procedures on a particular website: say, fill in some forms, click the submit button, send the data back to the server, receive the response, then do something based on the response and send data back to the website's server again. I know there is a webbrowser module in Python, but I want to do this without invoking any web browser. It has to be a pure script.

Is there a module available in Python that can help me do that?
Thanks

A: 

You likely want urllib2. It can handle things like HTTPS, cookies, and authentication. You will probably also want BeautifulSoup to help parse the HTML pages.
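
As a rough sketch (the URL and form field names below are made up; adjust them to the real site), posting a form with urllib2 while keeping cookies across requests looks something like this:

import urllib
import urllib2
import cookielib

# Keep cookies between requests so the server can track the session
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

# Encode the form fields and POST them (field names are hypothetical)
data = urllib.urlencode({'username': 'me', 'password': 'secret'})
response = opener.open('http://example.com/login', data)
html = response.read()

# Reuse the same opener for the next request; cookies are sent automatically
next_page = opener.open('http://example.com/account').read()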

Steven Huwig
A: 

http://twill.idyll.org/
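
For what it's worth, a minimal sketch of driving a form with twill's Python API (the URL, form number, and field names are placeholders):

from twill.commands import go, fv, submit, show

go('http://example.com/login')   # fetch the page
fv('1', 'username', 'me')        # fill fields in the first form
fv('1', 'password', 'secret')
submit()                         # press the submit button
show()                           # print the resulting page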

What about JavaScript handling? Twill doesn't do that. If the form validation is done using JavaScript, it sends error code 400 back. Twill is good, no doubt about it; I tried to find a workaround, but it doesn't support the JavaScript part. So again I'm kind of stuck, because whatever I fill in the forms, when I press the submit button it's taken by another form (using JavaScript); that form is hidden, and it submits itself using JavaScript. Might be a silly doubt, but I am stuck here. Thanks
kush87
A: 

There are plenty of built-in Python modules that would help with this, for example urllib and htmllib.

The problem will be simpler if you change the way you're approaching it. You say you want to "fill some forms, click submit button, send the data back to server, receive the response", which sounds like a four-stage process.

In fact, what you need to do is post some data to a webserver and get a response.

This is as simple as:

>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
>>> print f.read()

(example taken from the urllib docs).

What you do with the response depends on how complex the HTML is and what you want to do with it. You might get away with parsing it using a regular expression or two, or you can use the htmllib.HTMLParser class, or a higher-level, more flexible parser like Beautiful Soup.
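
For example, a small sketch of picking the response apart with Beautiful Soup (using the BeautifulSoup 3 import style; which tags you look for depends entirely on the actual HTML):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(f.read())

# Pull out whatever you need from the response, e.g. the title and all links
print soup.find('title').string
for link in soup.findAll('a', href=True):
    print link['href']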

roomaroo
+2  A: 

You can also take a look at mechanize. It's meant to handle "stateful programmatic web browsing" (as per their site).
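
A short sketch of a stateful form submission with mechanize (the URL and field names are invented for the example):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)      # don't let robots.txt abort the run
br.open('http://example.com/login')

br.select_form(nr=0)             # pick the first form on the page
br['username'] = 'me'            # form controls are set like dict entries
br['password'] = 'secret'
response = br.submit()

print response.read()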

arcanum
mechanize, in my experience, is pretty slow, but once HTTPS, cookies, and logins are involved, it's *much* easier than urllib2.
Gregg Lind
Selenium provides a lot more than mechanize. Mechanize is good for basic stuff, but it will cause issues if you are trying to do real browser emulation: it doesn't do things like automatically download images, CSS files, etc., and it seems to always be detectable by the strictest sites as an automated tool.
Rick
A: 

You may have a look at these slides from the last Italian PyCon (PDF): the author lists most of the libraries for doing scraping and automated browsing in Python, so you may want to have a look at them.

I very much like twill (which has already been suggested); it was developed by one of the authors of nose and is specifically aimed at testing web sites.

dalloliogm
A: 

Internet Explorer specific, but rather good:

http://pamie.sourceforge.net/

The advantage compared to urllib/BeautifulSoup is that it executes JavaScript as well, since it uses IE.

fraca7
+2  A: 

Selenium will do exactly what you want, and it handles JavaScript.
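
For instance, a rough sketch with Selenium's WebDriver bindings (the URL and element names are placeholders, and the exact API depends on your Selenium version):

from selenium import webdriver

driver = webdriver.Firefox()     # drives a real browser, so JavaScript runs
driver.get('http://example.com/login')

driver.find_element_by_name('username').send_keys('me')
driver.find_element_by_name('password').send_keys('secret')
driver.find_element_by_name('login').click()

print driver.page_source         # the page after any JavaScript has run
driver.quit()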

adaptive
Although I don't think this can be done headless (which is what is often implied by "pure script"), this will emulate a real browser experience as closely as possible... since it's using a real browser. Most sites today are completely broken without JavaScript, which makes mechanize obsolete.
Chris S
A: 

httplib2 + BeautifulSoup

Use Firefox + Firebug + httpreplay to see what the JavaScript passes to and from the browser and the website. Using httplib2 you can essentially do the same via POST and GET.
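
Roughly, replaying a form POST with httplib2 might look like this (the URL, headers, and fields are placeholders; copy the real ones from what Firebug shows you):

import urllib
import httplib2

http = httplib2.Http()

# POST the same fields the browser would send (names are hypothetical)
body = urllib.urlencode({'username': 'me', 'password': 'secret'})
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
response, content = http.request('http://example.com/login', 'POST',
                                 body=body, headers=headers)

print response.status
print content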