Python's urllib2 follows 3xx redirects to get the final content. Is there a way to make urllib2 (or some other library such as httplib2) also follow meta refreshes? Or do I need to parse the HTML manually for the refresh meta tags?
using a HTML parser to just extract the meta refresh tag is overkill, atleast for my purposes. Was hoping there was a Python HTTP library that did this automatically.
Plumo
2010-02-24 01:05:07
Well `meta` it *is* a html tag, so it is unlikely that you will find this functionality in an http library.
Otto Allmendinger
2010-02-24 12:14:07
A:
OK, seems no library supports it so I have been using this code:
import urllib2
import re
def get_hops(url):
hops = []
while url:
if url not in hops:
hops.insert(0, url)
response = urllib2.urlopen(url)
if response.geturl() != url:
hops.insert(0, response.geturl())
# check for redirect meta tag
match = re.search('<meta[^>]*?(http://[^>]*?)"', response.read())
if match:
url = match.groups()[0]
else:
url = None
return hops
Plumo
2010-02-27 01:27:26
+1
A:
Here is a solution using BeautifulSoup and httplib2 (and certificate based authentication):
import BeautifulSoup
import httplib2
def meta_redirect(content):
soup = BeautifulSoup.BeautifulSoup(content)
result=soup.find("meta",attrs={"http-equiv":"Refresh"})
if result:
wait,text=result["content"].split(";")
if text.lower().startswith("url="):
url=text[4:]
return url
return None
def get_content(url, key, cert):
h=httplib2.Http(".cache")
h.add_certificate(key,cert,"")
resp, content = h.request(url,"GET")
# follow the chain of redirects
while meta_redirect(content):
resp, content = h.request(meta_redirect(content),"GET")
return content
asmaier
2010-09-08 14:30:59