The HTML I receive from urllib2 is missing dozens of data fields that I can see when I view the page source in Firefox. Any advice would be much appreciated. Here is what it looks like:

From Firefox's view source:

# ...<td class=td6>as</td></tr></thead>|ManyFields|<br></div><div id="c1">...

From the HTML returned by urllib2:

# ...<td class=td6>as</td></tr></thead>|</table>|<br></div><div id="c1">...
A: 

From a cursory check, the page you're fetching contains a lot of JavaScript; that JavaScript probably helps build the content you end up seeing in Firefox (at least some of it actively alters the page's contents). urllib2 only downloads the raw HTML and never runs scripts. If you need to scrape JS-rich pages, your best bet is to automate an actual browser via Selenium.
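
As a rough illustration of the Selenium route, here is a minimal sketch. It assumes Python 3, the `selenium` package (version 4 or later), and a local Firefox with a matching driver available. The function name `fetch_rendered_html` and its parameters are hypothetical, chosen only for this example.

```python
def fetch_rendered_html(url, wait_css=None, timeout=10):
    """Load `url` in a real Firefox instance and return the HTML after scripts ran.

    `wait_css` is an optional CSS selector (an assumption of this sketch) for an
    element that only exists once the page's JavaScript has filled in the data.
    """
    # Imported lazily so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        if wait_css is not None:
            # Block until the script-generated element appears, or time out.
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
            )
        # page_source reflects the DOM as the browser currently sees it.
        return driver.page_source
    finally:
        # Always close the browser, even if loading or waiting failed.
        driver.quit()
```

You would call it as something like `fetch_rendered_html(page_url, wait_css="table tbody tr")`, where the selector is a placeholder for whatever element the scripts add on the page in question. The returned string can then be parsed the same way as the urllib2 output.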

Alex Martelli
A: 

The extra content you're seeing is generated by JavaScript. It is not part of the raw HTML document, so it won't be present when you use a plain HTTP fetcher such as urllib2, which does not execute scripts.
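
One quick way to check whether this applies to a given page is to fetch the raw HTML and count its `<script>` tags. The sketch below assumes Python 3, where `urllib.request` replaces urllib2; the helper names `count_script_tags` and `fetch_raw_html` are made up for this example. A high count is only a hint, not proof, that the missing fields are built client-side.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ScriptCounter(HTMLParser):
    """Counts <script> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "script":
            self.count += 1


def count_script_tags(html):
    """Return the number of <script> tags found in the given HTML string."""
    parser = ScriptCounter()
    parser.feed(html)
    parser.close()
    return parser.count


def fetch_raw_html(url):
    """Download the document exactly as the server sends it; no scripts are run."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For example, `count_script_tags(fetch_raw_html(page_url))` gives the number of script tags in what the server actually sent, which is the same document urllib2 returns.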

Yang Zhao