When I view the source of a page in my browser (Firefox, View -> Page Source), copy it, and paste it into my HTML editor, I see almost the same page (in this example, www.google.com) as it appears in my browser. But when I get the HTML source through this code (running on Google App Engine)

from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
   print result.content

and then copy the output and paste it into my HTML editor, the page looks quite different. Why is that? Is there something wrong with the code?

++++++++++++++++++++++++++++++

Follow-up:

So far (Sunday, December 13th, 2009, 1:01 PM GMT, to be precise) I have received two comments with questions (from Aaron and Christian P.) and one answer from Alex Martelli.

Both Aaron and Christian P. are asking what actually differs between the source obtained from Firefox and the source obtained from Google App Engine when both are displayed in the same HTML editor.

Here I have uploaded two screenshots:

One shows the Firefox-obtained source

And the other one shows the Google-App-Engine-obtained source

when they are both displayed in the “MS Front Page” editor.

One difference, which is quite obvious, is the encoding: in the Firefox source everything is displayed in English, while in the Google App Engine source I get a lot of assorted symbols instead.

Another difference is some additional lines at the top of the page in the Google App Engine source. I think this is what Alex Martelli was talking about in his answer (“…the fetch-and-print approach is going to have metadata around it as well…”).

One more minor difference is that the box for the Google image is split into several boxes in one code, while it remains whole in the other one.

Alex Martelli suggested that I use this code (if I understood him correctly):

from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
   print "content-type: text/plain"
   print

I’ve tried it, but in this case nothing is displayed at all.

Thank you all for your responses and, please, continue responding – I really want to see this issue finally resolved.

++++++++++++++++++++++++++++++

Follow-up:

Okay, the issue has been resolved.

I did not pay full attention to Alex Martelli's instructions and, therefore, came up with the wrong code. Here is the right one:

from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
   print "content-type: text/plain"
   print
   print result.content

This code displays exactly what is needed - no additional lines at the top of the page.

Well, I still get the strange symbols, but I discovered that it's probably Google's problem. The thing is I am currently in Taiwan, and Google seems to be aware of that and automatically switches from www.google.com (which is in English) to www.google.com.tw (which is in Chinese), but this one, I guess, is already another topic.
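One plausible reading of the "strange symbols", offered as an assumption rather than a diagnosis: if the fetched page contains Chinese text and the editor decodes its bytes with a Western encoding, every multi-byte character turns into several unrelated symbols. A minimal Python 3 sketch, where the sample text and both encodings are stand-ins chosen for illustration:

```python
# Stand-in for Chinese text on the fetched page (an assumption for illustration).
chinese_text = "網頁"

# Bytes as a server might send them, assuming UTF-8 on the wire.
raw_bytes = chinese_text.encode("utf-8")

# Decoding with the wrong (Western) encoding maps every byte to its own symbol.
garbled = raw_bytes.decode("latin-1")

# Decoding with the encoding that was actually used recovers the text.
restored = raw_bytes.decode("utf-8")

print(repr(garbled))
print(restored)
```

If this is what is happening, the bytes themselves are intact and only the decoding step in the editor would need to change.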

Thanks to everyone who has responded here.

+1  A: 

You have not explicitly emitted a "content type" header, and an end-of-headers empty line, so the first few lines are probably going to be lost; try adding before the final print something like

   print "content-type: text/plain"
   print
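For readers on Python 3, where `print` is a function, the same idea can be sketched as follows. This is a minimal illustration assuming plain CGI-style output; `build_cgi_response` is a hypothetical helper name, not part of the App Engine API.

```python
def build_cgi_response(body, content_type="text/plain"):
    """Return CGI-style output: header line, empty line, then the body."""
    # Hypothetical helper for illustration; not part of the App Engine API.
    header = "Content-Type: " + content_type
    # The empty string between header and body becomes the blank
    # end-of-headers line once the parts are joined with newlines.
    return "\n".join([header, "", body])


if __name__ == "__main__":
    # The body here is a placeholder for the fetched page source.
    print(build_cgi_response("<html>fetched page source</html>"))
```

The body always comes last, after both the header and the empty line.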

Beyond this, what you're getting in either case is essentially a big <script> with a little extra HTML around it -- that's all that Firefox is going to give you in the "view source" page, while the fetch-and-print approach is going to have metadata around it as well, e.g., the "doctype" (depending on what HTML editor you're targeting, this may or may not be an issue).

Alex Martelli
(1) Hello, Alex Martelli! Thank you very much for your answer. There are a few things, though, that I still don’t understand: 1) First of all, you said “You have not explicitly emitted a "content type" header and an end-of-headers empty line”. Please tell me, what do you mean here by the word “emitted”? 2) Also, what is the “end-of-headers empty line”? Where is it located in my code, and what does it do? 3) I have tried your code and found that now it doesn’t display any HTML source at all!
brilliant
(2) Perhaps I misunderstood you and did something wrong. Here is the code that I’ve tried (if it’s hard for you to read it here in my comment, please refer to the main body of this question, where I have added details):

from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
   print "content-type: text/plain"
   print
brilliant
Heh -- @brilliant, read my answer again, and I quote with emphasis for clarity: "try *adding* **before** the final `print` something like" -- and you have *substituted* instead of *adding*, putting those two preliminary lines **instead** of "the final `print`", _**not**_ **before** it as I had recommended. As a result, it's entirely obvious that your new erroneous code will not emit `result.content` -- since you've destroyed the statement that did so! Put that statement back, **after** the two other prints (NOT **instead** again, you hear?!-).
Alex Martelli
To answer your other questions: (1) by "emitted" I mean "output", "printed", "written". (2) the end of the headers of any HTTP response (and many other kinds of messages) is indicated by an empty line; it's not "located in your code", it's located in your OUTPUT, between the end of the headers and the start of the body; what it does is tell every program that processes your responses (or other kinds of messages) that your headers are finished and your body's about to start -- that's what the second `print` I suggest is for (and then of course you removed the `print` of the body!-).
Alex Martelli
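The role of the empty line described in the comment above can be sketched from the receiving side. This is a simplified Python 3 illustration, not how any real HTTP client or server is implemented; `split_response` is a hypothetical name.

```python
def split_response(raw):
    """Split raw output into (headers dict, body) at the first empty line."""
    # Normalize CRLF so the same logic handles both line-ending styles.
    text = raw.replace("\r\n", "\n")
    # Everything before the first empty line is headers; the rest is the body.
    head, _, body = text.partition("\n\n")
    headers = {}
    for line in head.split("\n"):
        name, colon, value = line.partition(":")
        if colon:  # skip lines that do not look like "name: value"
            headers[name.strip().lower()] = value.strip()
    return headers, body


if __name__ == "__main__":
    # Output of the corrected script: header, empty line, then the page.
    print(split_response("content-type: text/plain\n\n<html>page</html>"))
    # Output that stops after the empty line: the body is empty.
    print(split_response("content-type: text/plain\n\n"))
```

In the second call the body comes back as an empty string, which matches the "nothing is displayed" result of the version that dropped the final `print`.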
I see, Alex. Now it doesn't show any additional lines. Thank you very much, and I am sorry for having been so inattentive toward your words.
brilliant
Thank you for answering the other questions. I've really learned a lot from your answers.
brilliant