Hi,

I want to scrape the contents of a webpage. The contents are produced after a form on that site has been filled in and submitted.

I've read about how to scrape the resulting content/webpage, but how do I programmatically submit the form?

I'm using Python and have read that I might need to get the original webpage with the form, parse it, get the form parameters and then do X?

Can anyone point me in the right direction?

A: 

You can do it with JavaScript. If the form is something like:

<form name='myform' ...

Then you can do this in javascript:

<script language="JavaScript">
function submitform()
{
    document.myform.submit();
}
</script>

You can use the "onClick" attribute of links or buttons to invoke this code. To invoke it automatically when the page is loaded, use the "onLoad" attribute of the <body> element:

<body onLoad="submitform()" ...>
João da Silva
+3  A: 

Read about urllib2.

Also, see http://stackoverflow.com/questions/6936/using-what-ive-learned-from-stackoverflow-html-scraper

http://stackoverflow.com/questions/301924/python-urlliburllib2httplib-confusion

http://stackoverflow.com/questions/120061/fetch-a-wikipedia-article-with-python

Indeed, almost every question with urllib or urllib2 has an example you can use.
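For instance, here is a minimal sketch of submitting a POST form with urllib and urllib2 (Python 2). The URL and field names are hypothetical; take the real ones from the form you are scraping:

import urllib
import urllib2

# Hypothetical form action URL and input names.
url = 'http://www.example.com/submit.php'
form_data = {'itemnumber': '5234', 'otherinput': 'othervalue'}

# urlencode turns the dict into 'itemnumber=5234&otherinput=othervalue'.
encoded = urllib.urlencode(form_data)

# Passing a data argument to urlopen makes the request a POST.
response = urllib2.urlopen(url, encoded)
html = response.read()  # the page produced by the form submission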

S.Lott
Can I recommend http://www.voidspace.org.uk/python/articles/urllib2.shtml if you go down the urllib2 route? The basic urllib also has enough to handle the naive case quite easily.
Ali A
A: 

You'll need to generate an HTTP request containing the data for the form.

The form will look something like:

<form action="submit.php" method="POST"> ... </form>

This tells you that the URL to request is www.example.com/submit.php and that your request should be a POST.

Inside the form will be several input elements, e.g.:

<input type="text" name="itemnumber">

You need to collect the name=value pair for each input and URL-encode them, joined with "&". For a GET request you append that string to the requested URL as a query string, so it becomes www.example.com/submit.php?itemnumber=5234&otherinput=othervalue and so on. For a POST request the same encoded string goes in the body of the request instead, which is a little trickier to build by hand.
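As a rough sketch of that encoding step in Python 2 (the field names are hypothetical):

import urllib
import urllib2

pairs = {'itemnumber': '5234', 'otherinput': 'othervalue'}
query = urllib.urlencode(pairs)  # 'itemnumber=5234&otherinput=othervalue'

# GET: the pairs ride along in the URL.
page = urllib2.urlopen('http://www.example.com/submit.php?' + query).read()

# POST: the same string is sent as the request body instead.
page = urllib2.urlopen('http://www.example.com/submit.php', query).read()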


Just follow S.Lott's links for some much easier-to-use library support :P

Cogsy
+1  A: 

Using Python, I think it takes the following steps:

  1. Parse the web page that contains the form; find out the form's submit address and the submit method ("post" or "get"). This explains form elements in an HTML file. A sketch of this step follows the list.

  2. Use urllib2 to submit the form. You may need functions like "urlencode" and "quote" from urllib to generate the URL and the data for the post method. Read the library doc for details.
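As a rough illustration of step 1, here is one way to pull each form's action and method out of a page with Python 2's standard-library HTMLParser (a sketch with a hypothetical URL, not tailored to any particular site):

import urllib2
from HTMLParser import HTMLParser

class FormFinder(HTMLParser):
    # Records the action and method of every <form> tag it sees.
    def __init__(self):
        HTMLParser.__init__(self)
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            attrs = dict(attrs)
            self.forms.append((attrs.get('action'), attrs.get('method', 'get')))

html = urllib2.urlopen('http://www.example.com/page-with-form.html').read()
finder = FormFinder()
finder.feed(html)
print finder.forms  # e.g. [('submit.php', 'POST')]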
A: 

You don't need to parse the original page; all you need to know is which form parameters are submitted. You can find the form information manually with a tool like Firebug.

+1  A: 

From a similar question, options-for-html-scraping, you can learn that with Python you can use Beautiful Soup.

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  1. Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
  3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
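As a small illustration of using it for this question (Python 2, Beautiful Soup 3; the URL and markup are hypothetical), you could extract a form's target and default field values like this:

import urllib2
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3

html = urllib2.urlopen('http://www.example.com/page-with-form.html').read()
soup = BeautifulSoup(html)

form = soup.find('form')  # assumes the page has at least one form
print form.get('action'), form.get('method', 'get')

# Collect the default name/value pairs from the form's inputs.
fields = {}
for inp in form.findAll('input'):
    if inp.get('name'):
        fields[inp['name']] = inp.get('value', '')
print fields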

The unusual name caught the attention of our host, November 12, 2008.

gimel
A: 

Brilliant! Thank you.