I need to scrape query results from an .aspx web page.

http://legistar.council.nyc.gov/Legislation.aspx

The url is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.

Somebody out there must know how to do this.

A: 

"Assume we need to select "all years" and "all types" from the respective dropdown menus."

What do these options do to the URL that is ultimately submitted?

After all, it amounts to an HTTP request sent via urllib2.

To know how to do '"all years" and "all types" from the respective dropdown menus', you do the following.

  1. Select '"all years" and "all types" from the respective dropdown menus'

  2. Note the URL which is actually submitted.

  3. Use this URL in urllib2.
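In other words, if the form submitted its criteria as URL query parameters (a plain GET), a minimal sketch would look like the following; the parameter names here are hypothetical and would have to be copied from the URL observed in step 2:

import urllib
import urllib2

# Hypothetical query parameters: the real names and values must be taken
# from the URL the browser actually submits.
params = urllib.urlencode({'years': 'All Years', 'types': 'All Types'})
url = 'http://legistar.council.nyc.gov/Legislation.aspx?' + params
html = urllib2.urlopen(url).read()   # fetch the results page

(As the comments below point out, this particular site turns out to submit its form via POST, so the same idea has to be applied to the POST body rather than the URL.)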

S.Lott
Apparently the page is a form requiring POST, but the idea is the same: take note of the form field names and of the values associated with 'All years' and with 'all types', and use urllib2.Request to get to the data.
mjv
I'm using the Charles web debugging proxy to watch all the HTTP traffic when I surf this site and submit queries, and the URL is completely static. It contains no parameters at all. There is form data that gets passed somehow (AJAX, I guess), but I don't know how to submit that form data to the server. It all looks unintelligible to me. The fact that I can't submit a query by manipulating the URL is what's confusing me.
twneale
Once you get the results from this page, if you wish to scrape it, you may use the Python module HTMLParser or BeautifulSoup to parse the HTML page. Scraping will also likely involve more urllib2 calls to navigate to the next pages of results.
mjv
+2  A: 

Most ASP.NET sites (the one you referenced included) will actually post their queries back to themselves using the HTTP POST verb, not the GET verb. That is why the URL does not change, as you noted.

What you will need to do is look at the generated HTML and capture all of its form values. Be sure to capture all of them, as some are used for page validation and without them your POST request will be denied.

Other than the validation, an ASPX page is, with regard to scraping and posting, no different from pages built with other web technologies.
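As a rough sketch of that capture step, assuming Beautiful Soup (mentioned elsewhere in this thread) is available; apart from the standard ASP.NET hidden-field names, everything here is illustrative:

import urllib2
from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3; the import path differs in bs4

url = 'http://legistar.council.nyc.gov/Legislation.aspx'
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Collect every hidden input (e.g. __VIEWSTATE, __EVENTVALIDATION) so that
# the server-side validation accepts the subsequent POST.
hidden = dict((tag.get('name'), tag.get('value', ''))
              for tag in soup.findAll('input', {'type': 'hidden'}))

These name/value pairs would then be merged with the visible search criteria and URL-encoded into the body of the POST request, as the longer answer below demonstrates.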

Jason Whitehorn
+2  A: 

As an overview, you will need to perform four main tasks:

  • to submit request(s) to the web site,
  • to retrieve the response(s) from the site,
  • to parse these responses,
  • to have some logic to iterate through the tasks above, with parameters associated with navigating to the "next" pages in the results list.

The HTTP request and response handling is done with methods and classes from Python's standard-library modules urllib and urllib2. The parsing of the HTML pages can be done with the standard library's HTMLParser or with third-party modules such as Beautiful Soup.

The following snippet demonstrates requesting and receiving a search at the site indicated in the question. The site is ASP.NET-driven, and as a result we need to make sure we send several form fields, some of them with 'horrible' values, as these are used by the ASP.NET logic to maintain state and to authenticate the request to some extent. The requests have to be sent with the HTTP POST method, as that is what this ASP.NET application expects. The main difficulty is identifying the form fields and associated values which ASP.NET expects (getting pages with Python is the easy part).

This code is functional, or more precisely, was functional, until I removed most of the VSTATE value, and possibly introduced a typo or two by adding comments.

import urllib
import urllib2

uri = 'http://legistar.council.nyc.gov/Legislation.aspx'

#the http headers are useful to simulate a particular browser (some sites deny
#access to non-browsers (bots, etc.))
#we also need to pass the content type for a form POST.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded'
}

# we group the form fields and their values in a list (any
# iterable, actually) of name-value tuples.  This helps
# with clarity and also makes it easy to encode them later.

formFields = (
   # the viewstate is actually 800+ characters in length! I truncated it
   # for this sample code.  It can be lifted from the first page
   # obtained from the site.  It may be ok to hardcode this value, or
   # it may have to be refreshed each time / each day, by essentially
   # running an extra page request and parse, for this specific value.
   (r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),

   # following are more of these ASP form fields
   (r'__VIEWSTATE', r''),
   (r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
   (r'ctl00_RadScriptManager1_HiddenField', ''), 
   (r'ctl00_tabTop_ClientState', ''), 
   (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
   (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),

   #but then we come to fields of interest: the search
   #criteria, the collections to search from, etc.
                                                       # Check boxes  
   (r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'),  # file number
   (r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'),  # Legislative text
   (r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'),  # attachment
                                                       # etc. (not all listed)
   (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),   # Search text
   (r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'),  # Years to include
   (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),  #types to include
   (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # Search button itself
)

# these have to be encoded    
encodedFields = urllib.urlencode(formFields)

req = urllib2.Request(uri, encodedFields, headers)
f = urllib2.urlopen(req)    # that's the actual call to the http site.

# *** here would normally be the in-memory parsing of f 
#     contents, but instead I store this to file
#     this is useful during design, allowing us to keep a
#     sample of what is to be parsed in a text editor, for analysis.

try:
  fout = open('tmp.htm', 'w')
except IOError:
  print('Could not open output file\n')
else:
  fout.writelines(f.readlines())
  fout.close()

That's about it for getting the initial page. As said above, one would then need to parse the page, i.e. find the parts of interest, gather them as appropriate, and store them to file/database/wherever. This job can be done in very many ways: using HTML parsers, or XSLT-type technologies (after first parsing the HTML to XML), or even, for crude jobs, simple regular expressions. Also, one of the items one typically extracts is the "next" info, i.e. a link of sorts that can be used in a new request to the server to get subsequent pages.
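For instance, a minimal parsing sketch, assuming Beautiful Soup and the tmp.htm file saved above; the table selector is purely illustrative and has to be adapted to the actual markup of the saved page:

from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3; the import path differs in bs4

soup = BeautifulSoup(open('tmp.htm').read())

# Illustrative extraction: print the text of each row of the results grid.
# The real id/class of the grid must be read off the saved page.
table = soup.find('table', {'class': 'rgMasterTable'})   # hypothetical selector
if table is not None:
    for row in table.findAll('tr'):
        cells = [''.join(td.findAll(text=True)).strip() for td in row.findAll('td')]
        print(cells)

The link to the "next" page of results would be located in much the same way and fed into another POST request.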

This should give you a rough flavor of what "long hand" HTML scraping is about. There are many other approaches, such as dedicated utilities, scripts in Mozilla's (Firefox) GreaseMonkey plug-in, XSLT...

mjv