ansaurus

Question

Python Web Crawlers and "getting" html source code

Answer 1

+1 A:

The first thing you need to do is read the HTTP spec, which explains what you can expect to receive over the wire. The data returned in the response body is the "rendered" web page, not the source. The source could be a JSP, a servlet, a CGI script, in short just about anything, and you have no access to it. You only get the HTML that the server sent you. For a static HTML page, then yes, you are seeing the "source". For anything else you see the generated HTML, not the source.

When you say "modify the page and return the modified page", what do you mean?
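To make the point about generated HTML concrete, here is a minimal sketch in Python 3 using only the standard library. It is an illustration, not part of the original question: `GeneratedPageHandler` and `fetch_generated_page` are hypothetical names, and the handler is a toy stand-in for whatever server-side code (JSP, servlet, CGI script) actually builds a page. The client only ever receives the HTML string the handler writes, never the handler's own code.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class GeneratedPageHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a JSP, servlet, or CGI script."""

    def do_GET(self):
        # The HTML is built per request; this method is the "source"
        # that a client never gets to see.
        body = "<html><body><p>You asked for {}</p></body></html>".format(self.path)
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # suppress per-request logging in this demo


def fetch_generated_page(path="/logo-test"):
    """Serve one page on localhost and return (status, content type, body)."""
    server = HTTPServer(("127.0.0.1", 0), GeneratedPageHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:{}{}".format(server.server_address[1], path)
        # An empty ProxyHandler keeps the request on localhost even if
        # proxy environment variables are set.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url) as response:
            html = response.read().decode("utf-8")
            return response.status, response.headers.get_content_type(), html
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch_generated_page("/anything")` returns the status code, the content type, and the HTML text that was sent over the wire. The Python code of `do_GET` does not appear anywhere in that text.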

Jim Garrison 2010-08-20 18:14:38

For all img files on a certain page, replace each one with a new one.

danutenshu 2010-08-20 18:59:52

For example, if you see a Google logo, replace it with a McDonald's logo.

danutenshu 2010-08-20 19:06:18

The link you sent me is very big. What is the minimum I should read?

danutenshu 2010-08-20 19:25:35

Google search for information about HTTP. This is the underlying protocol that carries the HTML from the server to your browser. I assume you already understand HTML and have a strategy for parsing it. If not, all the pieces are available but you will have some research and learning to do to put them together.

Jim Garrison 2010-08-20 20:36:42

Answer 2

+1 A:

Use Python 2.7; it has more third-party libraries at the moment.

I recommend using the stdlib module urllib2, which lets you fetch web resources comfortably. Example:

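import urllib2

# urlopen() returns a response object for the given URL
response = urllib2.urlopen("http://google.de")
# read() returns the body of the response, i.e. the HTML the server sent
page_source = response.read()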

For parsing the HTML, have a look at BeautifulSoup.
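As a rough sketch of the "replace every img" step, assuming Python 3: this version deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without installing anything; BeautifulSoup would likely be more comfortable for real pages. `ImgSrcReplacer` and `replace_img_sources` are hypothetical names made up for this illustration. The sketch only rewrites the `src` attribute of `<img>` tags and re-emits everything else; other image mechanisms such as `srcset` or CSS backgrounds are not handled.

```python
from html import escape
from html.parser import HTMLParser


class ImgSrcReplacer(HTMLParser):
    """Re-emit the HTML it is fed, swapping the src of every <img> tag."""

    def __init__(self, new_src):
        # convert_charrefs=False so entities like &amp; are passed through
        # to handle_entityref/handle_charref and can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _img_tag(self, attrs, closing):
        parts = ["<img"]
        for name, value in attrs:
            if name == "src":
                value = self.new_src
            if value is None:  # attribute without a value
                parts.append(" " + name)
            else:
                parts.append(' {}="{}"'.format(name, escape(value, quote=True)))
        parts.append(closing)
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._img_tag(attrs, ">"))
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._img_tag(attrs, " />"))
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))

    def handle_comment(self, data):
        self.out.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.out.append("<!{}>".format(decl))

    def handle_pi(self, data):
        self.out.append("<?{}>".format(data))


def replace_img_sources(html_text, new_src):
    """Return html_text with every <img> src attribute set to new_src."""
    parser = ImgSrcReplacer(new_src)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

A call such as `replace_img_sources(page_source, "mcdonalds.png")` would take the downloaded page text and return a copy in which every image points at the replacement file. The file name here is only a placeholder.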

BTW: what exactly do you want to do:

Just for background, I need to download a page and replace any img with ones I have

leoluk 2010-08-20 18:15:33

just to nitpick, what you get back from `urlopen` isn't a `request` object, it's a response object.

aaronasterling 2010-08-20 18:18:06

Oops. Thank you.

leoluk 2010-08-20 18:21:05

For example, if you see a Google logo, replace it with a McDonald's logo. Or if you go to Google Images, you see nothing but one certain image of your choice.

danutenshu 2010-08-20 19:07:07

So you want to manipulate content in the browser?

leoluk 2010-08-20 21:01:04
