ansaurus

Question

Answer 1

+3 A:

The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.

Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.

RichieHindle 2009-10-09 21:21:44

Have a look at http://pamie.sourceforge.net/

RichieHindle 2009-10-09 21:27:22

I'll take a look at Pamie

foosion 2009-10-09 21:27:56

Answer 2

A:

The reason is that the page performs AJAX calls after it loads. You will need to find those URLs and scrape their content as well.

Agent_9191 2009-10-09 21:23:21

I don't see anything helpful in the source. Any suggestions for figuring out the URLs?

foosion 2009-10-09 21:26:45

Answer 3

A:

As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.

The class gives you full access to the DOM tree, so you can do whatever you want with it.

http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser%28loband%29.aspx

i-g 2009-10-09 21:29:55

Answer 4

A:

I've tinkered only a little bit, but it seems the professional version of screen-scraper can do it.

Jason Bellows 2009-10-09 21:35:15

Answer 5

A:

Try iMacros. I'm quite sure it will solve your problem.

http://www.iopus.com/imacros/firefox/?ref=fxmoz

Legend 2009-10-09 21:45:18

Answer 6

+1 A:

The website loads the data via AJAX. Firebug shows the AJAX calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542

See the corresponding javascript code on the original page:

<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
 populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
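
If you'd rather not read the URL out of Firebug by hand, something like the following might work as a rough sketch. It assumes the page always embeds the address as populatorUrl:"..." inside an inline script, as in the snippet above. The function name find_populator_urls and the base page URL are made up for illustration; only the populatorUrl value comes from the actual page.

```python
import re
from urllib.parse import urljoin

# Matches populatorUrl:"..." inside the inline Populator({...}) script.
POPULATOR_URL_RE = re.compile(r'populatorUrl\s*:\s*"([^"]+)"')


def find_populator_urls(page_html, page_url):
    """Return absolute URLs for every populatorUrl found in page_html.

    Hypothetical helper: page_url is only used as the base for
    resolving the relative paths found in the script.
    """
    return [urljoin(page_url, rel) for rel in POPULATOR_URL_RE.findall(page_html)]


# The inline script quoted above, used here as sample input.
sample_html = '''<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
 populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>'''

# Assumption: a placeholder page address on the same host; only the
# scheme and host matter because populatorUrl is an absolute path.
base = "https://personal.vanguard.com/us/funds/snapshot"

urls = find_populator_urls(sample_html, base)
print(urls[0])
```

The resulting URL can then be fetched like any other page, for example with urllib.request.urlopen. Whether the server also expects cookies or particular headers is not something this sketch covers.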

Peter Hoffmann 2009-10-09 22:00:26

This does it. While some other answers are good general answers, this allows me to do what I want nicely and simply.

foosion 2009-10-10 01:40:28


web scraping a problem site
