Hi everyone,

I want to save a web page. I use Python's urllib to fetch it, but some content is missing from the saved file. The missing part is a block from the source page, such as <div style="display: block;" id="GeneInts">...</div>. I don't know how to save the whole page, including blocks like this one. Could you help me figure it out? Thank you!

This is my program

import urllib  # Python 2; in Python 3, urlretrieve lives in urllib.request

url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor'
f = urllib.urlretrieve(url, 'test.html')
A: 

That page generates a great deal of its content with JavaScript executed at load-time, including, I think, the part you're trying to extract. You need a screen-scraper that can run JavaScript and then save out the modified DOM. I don't know where you get one of those.
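
One quick way to confirm this diagnosis is to look at the HTML that urllib actually downloads and see whether the block is present but empty. The following is a minimal sketch using only the standard library, written for Python 3. The names `IdTextCollector` and `static_text_for_id` are hypothetical, and the void-tag list is a simplification, so treat it as an illustration rather than a robust HTML checker.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdTextCollector(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # > 0 while we are inside the target element
        self.found = False    # True once the target element has been seen
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0 and dict(attrs).get("id") == self.target_id:
            self.found = True
            if tag not in VOID_TAGS:
                self.depth = 1
        elif self.depth and tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.text.append(data.strip())


def static_text_for_id(html, target_id):
    """Return (element_present, text_inside_it) for the static HTML."""
    parser = IdTextCollector(target_id)
    parser.feed(html)
    parser.close()
    return parser.found, " ".join(parser.text)
```

Calling `static_text_for_id(open('test.html').read(), 'GeneInts')` on the saved file would show whether the element is in the download at all and whether it holds any text. If the element is there but empty, or missing entirely, the content is most likely filled in by JavaScript after the page loads.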

Zack
+4  A: 

Whenever I need to let JavaScript operate on a page before I can scrape it, the first thing I turn to is SeleniumRC. It is mainly designed for testing, but I've never found a better tool for this challenging task. For the "using it from Python" part, see here and links therefrom.

Alex Martelli
I mean that when I use urllib to save the web page, there is a part behind a '+' icon that has to be clicked to show its details, and I can't scrape that part. How can I use Python to let JavaScript open this part of the page and then scrape the whole page? The hidden part looks like this: <img onclick="void dispFrames('GeneInts')" src="../Icons/open.png" class="paragNode" id="GeneIntsNode"><div style="display: block;" id="GeneInts"><div><table><tbody><tr><td><span class="IntRow">...</span></td></tr></tbody></table><input type="hidden" value=", R 3572 IL6ST" id="intslist"></div></div>
Herta
@Herta: Alex is telling you that you need to use SeleniumRC to do this. It will let you drive the JavaScript so that you can scrape the full page. Read the links. Selenium is good stuff.
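
For a rough idea of what this looks like from Python, here is a sketch. Note that it uses Selenium WebDriver, which superseded SeleniumRC, so it is not the RC API referred to above. It assumes the third-party `selenium` package and a Firefox driver are installed. `save_rendered_page` is a hypothetical helper name, and the `GeneIntsNode` id is taken from the HTML snippet in the comment above. Whether a click is really needed on this page is an assumption.

```python
def save_rendered_page(url, out_path, toggle_id=None, driver=None):
    """Load a page in a real browser, optionally click one element,
    then save the DOM as the browser sees it after JavaScript has run.

    If no driver is passed in, a Firefox WebDriver is started and
    closed again when the function finishes.
    """
    own_driver = driver is None
    if own_driver:
        from selenium import webdriver  # third-party: pip install selenium
        driver = webdriver.Firefox()
    try:
        driver.get(url)
        if toggle_id is not None:
            # Click the '+' icon so the page's own JavaScript expands the block.
            from selenium.webdriver.common.by import By
            driver.find_element(By.ID, toggle_id).click()
        html = driver.page_source  # the current, script-modified DOM
    finally:
        if own_driver:
            driver.quit()
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return html
```

A call such as `save_rendered_page(url, 'test.html', toggle_id='GeneIntsNode')` would then replace the `urlretrieve` line. If the content arrives some time after the page loads, an explicit wait before reading `page_source` may also be needed.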
bstpierre