views: 160
answers: 4
I notice that sometimes audio files on the internet have a "fake" URL.

http://garagaeband.com/3252243

And this will 302 to the real URL:

http://garageband.com/michael_jackson4.mp3

My question is: when supplied with the fake URL, how can you get the REAL URL from the headers?

Currently, this is my code for reading the headers of a file. I don't know if it will get me what I want. How do I parse the "real" URL out of the response headers?

import httplib

# head is the host name and tail is the path, split from the URL being
# checked -- e.g. head = 'garagaeband.com', tail = '/3252243'
conn = httplib.HTTPConnection(head)
conn.request("HEAD", tail)
res = conn.getresponse()

This has a 302 redirect: http://www.garageband.com/mp3cat/.UZCMYiqF7Kum/01%5FNo%5Fpierdas%5Fla%5Ffuente%5Fdel%5Fgozo.mp3

A: 

You have to read the response, realize that you got a 302 (FOUND), and parse out the real URL from the response headers, then fetch the resource using the new URI.

Jim Garrison
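The steps in the answer above can be sketched in modern Python 3 (`http.client` is the successor to the Python 2 `httplib` in the question). Since garagaeband.com doesn't resolve, a throwaway local server stands in for the redirecting host; the URLs and paths are illustrative:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class RedirectHandler(BaseHTTPRequestHandler):
    """Stand-in for the redirecting host: answers every HEAD with a 302."""
    def do_HEAD(self):
        self.send_response(302)
        self.send_header("Location", "http://example.com/real_file.mp3")
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# 1) issue the request, 2) notice the 3xx status,
# 3) parse the real URL out of the Location response header.
conn = http.client.HTTPConnection(host, port)
conn.request("HEAD", "/3252243")
res = conn.getresponse()
real_url = res.getheader("Location") if 300 <= res.status < 400 else None
print(real_url)  # http://example.com/real_file.mp3
server.shutdown()
```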
How do I parse out the real URL from the response headers?
TIMEX
Python's urllib and urllib2 follow the redirects for you, and keep track of the new url, as Chris Lacasse implies in his solution.
Andrew Dalke
However, I can't seem to test it as I don't off-hand know of a server to test against and don't feel like setting up one of my own. ;)
Andrew Dalke
Try a tinyurl. I believe that's a similar case
Chris Lacasse
try tinyurl.com as something to test off of?
Jeffrey Berthiaume
@Jeffrey Berthiaume, yes, to test the redirect
Chris Lacasse
Bingo - that works. The first tinyurl I used pointed to Wikipedia, which doesn't like Python's default user agent (gave me a 403 Forbidden). Thanks Chris!
Andrew Dalke
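On the 403 Forbidden Andrew mentions: some sites reject Python's default `Python-urllib/x.y` User-Agent. A sketch of supplying a custom one, using Python 3's `urllib.request` rather than the Py2 libraries in this thread (the agent string is made up):

```python
import urllib.request

# Some servers (Wikipedia, per the comment above) return 403 Forbidden
# for Python's default User-Agent. Supplying your own header avoids that;
# the agent string here is purely illustrative.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "redirect-checker/0.1"},
)
# urllib.request.urlopen(req).geturl() would then follow redirects as usual.
print(req.get_header("User-agent"))  # redirect-checker/0.1
```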
+6  A: 

Use urllib.getUrl()

edit: Sorry, I haven't done this in a while:

import urllib
urllib.urlopen(url).geturl()

For example:

>>> f = urllib.urlopen("http://tinyurl.com/oex2e")
>>> f.geturl()
'http://www.amazon.com/All-Creatures-Great-Small-Collection/dp/B00006G8FI'
>>>
Chris Lacasse
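In Python 3 the same one-liner lives in `urllib.request`, and `geturl()` again reports the post-redirect URL. A sketch using a local stand-in server, since short links rot (the `/short` → `/real` redirect simulates a tinyurl):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Tiny stand-in for a URL shortener: /short 302-redirects to /real."""
    def do_GET(self):
        if self.path == "/short":
            self.send_response(302)
            self.send_header("Location", "/real")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# urlopen follows the 302 for you; geturl() is the final, real URL.
f = urllib.request.urlopen("http://%s:%s/short" % (host, port))
final_url = f.geturl()
print(final_url)  # http://127.0.0.1:<port>/real
server.shutdown()
```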
+2  A: 

Mark Pilgrim recommends httplib2 in "Dive Into Python 3", as it handles many things (including redirects) more intelligently.

>>> import httplib2
>>> h = httplib2.Http()
>>> response, content = h.request("http://garagaeband.com/3252243")
>>> response["content-location"]
"http://garageband.com/michael_jackson4.mp3"
tosh
While it looks like you did this interactively, you actually just wrote the expected result. Otherwise you wouldn't have "http" listed twice in your request URL and you would have seen that "garagaeband.com" (which was in the OP's description) does not actually exist and raises a "No address associated with nodename" error.
Andrew Dalke
I did use an interactive session, and substituted the url with the poster's example urls :) thank you for the pointer. I am going to correct the "http://" repetition ^_^
tosh
"curl http://garagaeband.com/3252243" - "curl: (6) Couldn't resolve host 'garagaeband.com'" . How could the interactive session work when the domain name in your request does not exist?
Andrew Dalke
As mentioned above I did use an interactive session, though with "http://deck.cc" which redirects to "http://www.deck.cc" and then substituted it with the poster's example urls because I thought it would be more illustrative.
tosh
somehow the comment system ate my "www." in the second deck.cc url =)
tosh
A: 

I found the answer.

import urllib2

# theurl is the scheme-less URL, e.g. 'garagaeband.com/3252243'
req = urllib2.Request('http://' + theurl)
opener = urllib2.build_opener()
f = opener.open(req)
print 'the real url is......' + f.url
TIMEX
There's no need for all those steps. Just do "urllib2.urlopen('http....').geturl()" as the simplest. If you want a Request object, then "urllib2.urlopen(req)" also works.
Andrew Dalke