urllib2.urlopen(theurl).read() downloads the file.

Does urllib2.urlopen(theurl).geturl() also download the file, and how long does it take?

+4  A: 

From the documentation:

The geturl() method returns the real URL of the page. In some cases, the HTTP server redirects a client to another URL. The urlopen() function handles this transparently, but in some cases the caller needs to know which URL the client was redirected to. The geturl() method can be used to get at this redirected URL.
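For example, a minimal sketch (whether http://python.org actually issues a redirect is an assumption here; any redirecting URL will do):

import urllib2

# If the server redirects, urlopen() follows it transparently and
# geturl() reports the URL the client finally ended up at.
u = urllib2.urlopen("http://python.org")
print u.geturl()  # e.g. 'http://www.python.org/'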

Lukáš Lalinský
That's all true, but it doesn't answer the question.
RichieHindle
+1  A: 

No. geturl() just returns the URL.

For example, urllib2.urlopen("http://www.python.org").geturl() returns the string 'http://www.python.org'.

You can find this sort of thing really easily in the Python interactive shell, e.g.:

$ python
Python 2.4.3 (#1, Jul 27 2009, 17:57:39)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> u = urllib2.urlopen("http://www.python.org")
>>> u.geturl()
'http://www.python.org'
>>>
Kimvais
+3  A: 

It does not. For me, a test on google.com:

import time, urllib2

x = time.time(); urllib2.urlopen("http://www.google.com").read(); print time.time() - x
0.166881084442

x = time.time(); urllib2.urlopen("http://www.google.com").geturl(); print time.time() - x
0.0772399902344
Roman Stolper
Why downvote this guy? The answer is great. It proves it!
TIMEX
Thanks alex. I was a little confused by the downvote.
Roman Stolper
This answer is arguably wrong, because `geturl()` *does* download (some of) the file. The way to test whether it downloads the file is to look at the network traffic with something like Wireshark, not to use a timer. If the question is "does `geturl()` download the entire file even if it's very big?" then the answer is "No", fair enough. But it's not as clear-cut as this answer makes out, and using a timer to infer what's happening on the network is unreliable.
RichieHindle
I see. My apologies for not delving deeply enough into this problem and providing a hasty answer. Another thing I should have done, in any case, is loop over the expressions and provide an averaged time estimate, along these lines:
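A sketch of that averaging using the standard-library timeit module (the URL and run count here are arbitrary choices, not measurements from this thread):

import timeit

for stmt in ('urllib2.urlopen("http://www.google.com").read()',
             'urllib2.urlopen("http://www.google.com").geturl()'):
    # Average over 5 runs to smooth out network jitter.
    t = timeit.Timer(stmt, setup="import urllib2")
    print stmt, t.timeit(number=5) / 5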
Roman Stolper
+4  A: 

Tested with Wireshark and Python 2.5: urllib2.urlopen(theurl).geturl() downloads some of the body. It issues a GET, reads the headers and a couple of KB of the body, and then stops.
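If you want to see the header/body split without Wireshark, one option is to drop down to httplib, where the boundary is explicit (a sketch, not the original test from this answer):

import httplib

conn = httplib.HTTPConnection("www.python.org")
conn.request("GET", "/")
resp = conn.getresponse()  # the status line and response headers are read here
print resp.status, resp.getheader("content-type")
body = resp.read()         # the response body is only read on demand
conn.close()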

RichieHindle
Yes, and this is because of redirects - I assume that urllib2 supports both proper (HTTP 301/302) redirects and `<meta refresh>` redirects. For real redirects, reading the headers would be enough...
Kimvais
`geturl()` does not download anything, `urllib2.urlopen(theurl)` does.
Lukáš Lalinský
@Lukáš Lalinský: And you tested that with a network monitoring tool, did you? When I did that, I looked at the URL http://entrian.com/source-search and it certainly *did* download some of the content.
RichieHindle
No, I looked at the source code. `geturl()` is defined as `return self.url`.
Lukáš Lalinský
@Lukáš Lalinský: Ah, sorry, I misunderstood your comment. I thought you were saying that `urllib2.urlopen(theurl).geturl()` doesn't download anything. I've clarified my answer.
RichieHindle
+1  A: 

urllib2.urlopen() returns a file-like object, so when you use urlopen() you actually download the document and it is loaded into your machine's memory; you can then use the usual file functions to read and write it, like so...

# Store http://www.python.org in the local file d:/python.org.html
from urllib2 import urlopen

doc = urlopen("http://www.python.org")
html = doc.read()
f = open("d:/python.org.html", "w+")
f.write(html)
f.close()

Or, more simply, use urllib:

import urllib
urllib.urlretrieve("http://www.python.org", "d:/python.org.html")
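If you want progress feedback, urllib.urlretrieve also accepts an optional reporthook callback; a minimal sketch (the hook name and output format are just illustrative):

import urllib

def report(block_count, block_size, total_size):
    # Called periodically during the download; total_size is -1 if the
    # server did not send a Content-Length header.
    print "%d bytes read so far" % (block_count * block_size)

urllib.urlretrieve("http://www.python.org", "d:/python.org.html", report)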

hope that helps ;)

Ahmad Dwaik