I'm trying to scrape a page (my router's admin page), but the device seems to be serving a different page to urllib2 than to my browser. Has anyone run into this before? How can I get around it?

This is the code I'm using:

>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen("http://192.168.1.254/index.cgi?active_page=9133&active_page_str=page_bt_home&req_mode=0&mimic_button_field=btn_tab_goto:+9133..&request_id=36590071&button_value=9133")
>>> soup = BeautifulSoup(page)
>>> soup.prettify()

(html output is removed by markdown)

A: 

Use Wireshark to see what your browser's request looks like, and add the missing parts so that your request looks the same.

To tweak urllib2 headers, try this.
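As a rough sketch of what tweaking the headers could look like: the header values below are placeholders I made up, not what the router actually expects, so copy the real ones from the request your browser sends. The import fallback is only there so the same snippet also runs on Python 3, where urllib2 became urllib.request.

```python
try:
    import urllib2  # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent module

# Hypothetical header values; replace them with what your browser really sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/3.6",
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "http://192.168.1.254/",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Return a Request carrying browser-like headers. Nothing is sent yet."""
    return urllib2.Request(url, headers=headers)


request = build_request("http://192.168.1.254/index.cgi")
# page = urllib2.urlopen(request)  # uncomment to actually send the request
```

Note that urllib2 normalises header names with `str.capitalize()`, so `User-Agent` is stored as `User-agent`.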

Nicolas Raoul
+1  A: 

It may be simpler than Wireshark to use Firebug to inspect the request being made, and then emulate the same request in your code.

Sanjay
+4  A: 

With Firebug, watch which headers and cookies are sent to the server. Then emulate the same request with urllib2.Request and cookielib.

EDIT: You can also use mechanize.
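A minimal sketch of the cookielib part, assuming the router hands out an ordinary session cookie: an opener built with `HTTPCookieProcessor` stores any cookies the device sets and sends them back on later requests. The User-Agent string is a placeholder, and the fallback imports are only there so the snippet also runs on Python 3.

```python
try:
    import urllib2
    import cookielib
except ImportError:
    import urllib.request as urllib2      # Python 3 name for urllib2
    import http.cookiejar as cookielib    # Python 3 name for cookielib

# The jar collects cookies from responses and replays them on later requests.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Placeholder value; use the User-Agent your browser sends.
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# first = opener.open("http://192.168.1.254/")    # router may set a cookie here
# second = opener.open("http://192.168.1.254/index.cgi")  # cookie is sent back
# for cookie in cookie_jar:
#     print("%s=%s" % (cookie.name, cookie.value))
```

Comparing the cookies in the jar with the ones Firebug shows is a quick way to check whether the two requests really match.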

Mykola Kharechko
Mechanize is a lovely library!
Zolomon
A: 

This probably isn't working because you haven't supplied credentials for the admin page.

Use mechanize to load the login page and fill out the username/password.

Then you should have a cookie set to allow you to continue to the admin page.

It is much harder using just urllib2. You will need to manage the cookies yourself if you choose to stick to that route.
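If you do stick with urllib2, the login step might look roughly like the sketch below. The login URL and the form field names (`user_name`, `password`) are guesses for illustration only; look at the router's login form or a captured request to find the real ones. The fallback imports are just so it also runs on Python 3.

```python
try:
    import urllib2
    import cookielib
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from urllib.parse import urlencode

# Hypothetical login endpoint; the real one depends on the router.
LOGIN_URL = "http://192.168.1.254/index.cgi"


def build_login_request(username, password):
    """Build a POST request with the credentials. Field names are assumed."""
    fields = {"user_name": username, "password": password}
    data = urlencode(fields).encode("ascii")
    # Supplying data makes urllib2 issue a POST instead of a GET.
    return urllib2.Request(LOGIN_URL, data)


def make_session():
    """Return an opener that keeps cookies between requests, plus its jar."""
    jar = cookielib.CookieJar()
    return urllib2.build_opener(urllib2.HTTPCookieProcessor(jar)), jar


login_request = build_login_request("admin", "secret")
# opener, jar = make_session()
# opener.open(login_request)                 # login; session cookie lands in jar
# admin_page = opener.open(LOGIN_URL + "?active_page=9133").read()
```

The important part is reusing the same opener for the login and for the admin page, so the cookie set at login is sent along with the second request.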

gnibbler