views: 160
answers: 4
I notice that sometimes audio files on the internet have a "fake" URL.

http://garagaeband.com/3252243

And this will 302 to the real URL:

http://garageband.com/michael_jackson4.mp3

My question is: when supplied with the fake URL, how can you get the REAL URL from the headers?

Currently, this is my code for reading the headers of a file. I don't know if it will get me what I want. How do I parse the "real" URL out of the response headers?

import httplib

# head is the host name and tail is the path, split from the URL being
# checked -- e.g. head = 'garagaeband.com', tail = '/3252243'
conn = httplib.HTTPConnection(head)
conn.request("HEAD", tail)
res = conn.getresponse()

This has a 302 redirect: http://www.garageband.com/mp3cat/.UZCMYiqF7Kum/01%5FNo%5Fpierdas%5Fla%5Ffuente%5Fdel%5Fgozo.mp3

A: 

You have to read the response, realize that you got a 302 (FOUND), and parse out the real URL from the response headers, then fetch the resource using the new URI.

Jim Garrison
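The steps in the answer above can be sketched in modern Python 3 (`http.client` is the successor to the Python 2 `httplib` in the question). Since garagaeband.com doesn't resolve, a throwaway local server stands in for the redirecting host; the URLs and paths are illustrative:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class RedirectHandler(BaseHTTPRequestHandler):
    """Stand-in for the redirecting host: answers every HEAD with a 302."""
    def do_HEAD(self):
        self.send_response(302)
        self.send_header("Location", "http://example.com/real_file.mp3")
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# 1) issue the request, 2) notice the 3xx status,
# 3) parse the real URL out of the Location response header.
conn = http.client.HTTPConnection(host, port)
conn.request("HEAD", "/3252243")
res = conn.getresponse()
real_url = res.getheader("Location") if 300 <= res.status < 400 else None
print(real_url)  # http://example.com/real_file.mp3
server.shutdown()
```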
How do I parse out the real URL from the response headers?
TIMEX
Python's urllib and urllib2 follow the redirects for you, and keep track of the new url, as Chris Lacasse implies in his solution.
Andrew Dalke
However, I can't seem to test it as I don't off-hand know of a server to test against and don't feel like setting up one of my own. ;)
Andrew Dalke
Try a tinyurl. I believe that's a similar case
Chris Lacasse
try tinyurl.com as something to test off of?
Jeffrey Berthiaume
@Jeffrey Berthiaume, yes, to test the redirect
Chris Lacasse
Bingo - that works. The first tinyurl I used pointed to Wikipedia, which doesn't like Python's default user agent (gave me a 403 Forbidden). Thanks Chris!
Andrew Dalke
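On the 403 Forbidden Andrew mentions: some sites reject Python's default `Python-urllib/x.y` User-Agent. A sketch of supplying a custom one, using Python 3's `urllib.request` rather than the Py2 libraries in this thread (the agent string is made up):

```python
import urllib.request

# Some servers (Wikipedia, per the comment above) return 403 Forbidden
# for Python's default User-Agent. Supplying your own header avoids that;
# the agent string here is purely illustrative.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "redirect-checker/0.1"},
)
# urllib.request.urlopen(req).geturl() would then follow redirects as usual.
print(req.get_header("User-agent"))  # redirect-checker/0.1
```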
+6  A: 

Use urllib.getUrl()

edit: Sorry, I haven't done this in a while:

import urllib
urllib.urlopen(url).geturl()

For example:

>>> f = urllib.urlopen("http://tinyurl.com/oex2e")
>>> f.geturl()
'http://www.amazon.com/All-Creatures-Great-Small-Collection/dp/B00006G8FI'
>>>
Chris Lacasse
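In Python 3 the same one-liner lives in `urllib.request`, and `geturl()` again reports the post-redirect URL. A sketch using a local stand-in server, since short links rot (the `/short` → `/real` redirect simulates a tinyurl):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Tiny stand-in for a URL shortener: /short 302-redirects to /real."""
    def do_GET(self):
        if self.path == "/short":
            self.send_response(302)
            self.send_header("Location", "/real")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# urlopen follows the 302 for you; geturl() is the final, real URL.
f = urllib.request.urlopen("http://%s:%s/short" % (host, port))
final_url = f.geturl()
print(final_url)  # http://127.0.0.1:<port>/real
server.shutdown()
```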
+2  A: 

Mark Pilgrim recommends httplib2 in "Dive Into Python 3", as it handles many things (including redirects) more intelligently.

>>> import httplib2
>>> h = httplib2.Http()
>>> response, content = h.request("http://garagaeband.com/3252243")
>>> response["content-location"]
"http://garageband.com/michael_jackson4.mp3"
tosh
While it looks like you did this interactively, you actually just wrote the expected result. Otherwise you wouldn't have "http" listed twice in your request URL and you would have seen that "garagaeband.com" (which was in the OP's description) does not actually exist and raises a "No address associated with nodename" error.
Andrew Dalke
I did use an interactive session, and substituted the url with the poster's example urls :) thank you for the pointer. I am going to correct the "http://" repetition ^_^
tosh
"curl http://garagaeband.com/3252243" - "curl: (6) Couldn't resolve host 'garagaeband.com'" . How could the interactive session work when the domain name in your request does not exist?
Andrew Dalke
As mentioned above I did use an interactive session, though with "http://deck.cc" which redirects to "http://www.deck.cc" and then substituted it with the poster's example urls because I thought it would be more illustrative.
tosh
somehow the comment system ate my "www." in the second deck.cc url =)
tosh
A: 

I found the answer.

import urllib2

# theurl is the scheme-less URL, e.g. 'garagaeband.com/3252243'
req = urllib2.Request('http://' + theurl)
opener = urllib2.build_opener()
f = opener.open(req)
print 'the real url is......' + f.url
TIMEX
There's no need for all those steps. Just do "urllib2.urlopen('http....').geturl()" as the simplest. If you want a Request object, then "urllib2.urlopen(req)" also works.
Andrew Dalke