I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.

Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.

A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT

I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.

+3  A: 

The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.

Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.

RichieHindle
Have a look at http://pamie.sourceforge.net/
RichieHindle
I'll take a look at Pamie
foosion
A: 

This happens because the page makes AJAX calls after it loads. You will need to track down those URLs and scrape their content as well.
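One way to hunt for those URLs without a browser tool is to scan the page's inline scripts for quoted, site-relative paths. The sketch below is only a rough heuristic using the Python 3 standard library; the names `ScriptCollector`, `PATH_RE`, and `candidate_urls` are made up for illustration, and the assumption that the interesting URLs appear as quoted strings starting with "/" may not hold for every site.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every inline <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


# Heuristic (an assumption): a quoted string that starts with "/" and
# contains no whitespace or quotes is probably a site-relative URL.
PATH_RE = re.compile(r"""["'](/[^"'\s]+)["']""")


def candidate_urls(html):
    """Return site-relative paths found in inline scripts, in order, without duplicates."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    found = []
    for script in collector.scripts:
        for path in PATH_RE.findall(script):
            if path not in found:
                found.append(path)
    return found
```

Calling `candidate_urls(page_html)` on the downloaded HTML gives a short list of paths to try by hand. Anything the scripts build up dynamically will not show up, so a network inspector in the browser is still the more reliable route.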

Agent_9191
I don't see anything helpful in the source. Any suggestions for figuring out the URLs?
foosion
A: 

As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.

The class gives you full access to the DOM tree, so you can do whatever you want with it.

http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser%28loband%29.aspx

i-g
A: 

I've tinkered only a little bit, but it seems the professional version of screen-scraper can do it.

Jason Bellows
A: 

Try iMacros. I'm fairly confident it will solve your problem.

http://www.iopus.com/imacros/firefox/?ref=fxmoz

Legend
+1  A: 

The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542

See the corresponding javascript code on the original page:

<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
 populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
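
In Python, that could look roughly like the sketch below: pull the `populatorUrl` value out of the page source, resolve it against the page URL, and download the fragment directly. This uses the Python 3 standard library (`urllib.request`, `urllib.parse`). The helper names `populator_urls` and `fetch` and the User-Agent header are assumptions for illustration, and the site may also want cookies or other headers that are not handled here.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# The sample page from the question.
PAGE_URL = "https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT"

# Matches populatorUrl:"..." as it appears in the inline script above.
POPULATOR_RE = re.compile(r'populatorUrl\s*:\s*"([^"]+)"')


def populator_urls(page_html, page_url=PAGE_URL):
    """Return absolute URLs for every populatorUrl found in the page source."""
    return [urljoin(page_url, path) for path in POPULATOR_RE.findall(page_html)]


def fetch(url, timeout=30):
    """Download a URL and return its body as text."""
    # Assumption: some servers reject the default Python User-Agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    page_html = fetch(PAGE_URL)
    for url in populator_urls(page_html):
        fragment = fetch(url)
        print(url, len(fragment))
```

The downloaded fragment is ordinary HTML, so whatever parsing already works for the rest of the page should work on it too.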
Peter Hoffmann
This does it. While some other answers are good general answers, this allows me to do what I want nicely and simply.
foosion