ansaurus

Question

Answer 1

A:

Use BeautifulSoup or lxml to parse the HTML.

Ignacio Vazquez-Abrams 2010-02-23 13:39:04

Using an HTML parser just to extract the meta refresh tag is overkill, at least for my purposes. I was hoping there was a Python HTTP library that did this automatically.

Plumo 2010-02-24 01:05:07

Well, `meta` *is* an HTML tag, so it is unlikely that you will find this functionality in an HTTP library.

Otto Allmendinger 2010-02-24 12:14:07
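As a rough illustration of the parsing approach suggested in this answer, here is a minimal sketch that swaps in Python 3's standard-library `html.parser` for BeautifulSoup or lxml, so it needs no third-party install. The names `MetaRefreshFinder` and `find_meta_refresh` are made up for this example and do not come from any library.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remember the content attribute of the first meta refresh tag seen."""

    def __init__(self):
        super().__init__()
        self.refresh_content = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta" or self.refresh_content is not None:
            return
        attr_map = dict(attrs)
        # Attribute values keep their case, so compare case-insensitively.
        if (attr_map.get("http-equiv") or "").lower() == "refresh":
            self.refresh_content = attr_map.get("content")


def find_meta_refresh(html):
    """Return the raw content attribute of the meta refresh tag, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.refresh_content
```

The returned string still has the `delay; url=...` form, so the caller has to split it before following the redirect.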

Answer 2

A:

OK, it seems no library supports this, so I have been using the following code:

import urllib2
import re

def get_hops(url):
    hops = []
    while url:
        if url in hops:
            break  # already visited: stop rather than loop forever
        hops.insert(0, url)
        response = urllib2.urlopen(url)
        if response.geturl() != url:
            hops.insert(0, response.geturl())
        # check for redirect meta tag
        match = re.search('<meta[^>]*?(http://[^>]*?)"', response.read())
        if match:
            url = match.groups()[0]
        else:
            url = None
    return hops

Plumo 2010-02-27 01:27:26
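For anyone on Python 3, where `urllib2` no longer exists, the same idea might look roughly like the sketch below. It is not the code from this answer. It assumes that `http-equiv` appears before `content` in the tag, it resolves relative targets with `urljoin`, and it returns the hops oldest first rather than newest first. The names `META_REFRESH`, `fetch_page`, `fetch` and `max_hops` are hypothetical and exist only so the loop can be tried without network access.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Simplified pattern: assumes http-equiv comes before content and that the
# URL inside the content attribute is not wrapped in its own quotes.
META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?[^"\'>]*?url=\s*([^"\'>\s]+)',
    re.IGNORECASE,
)


def fetch_page(url):
    """Default fetcher: return (final_url, html) after HTTP-level redirects."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.geturl(), response.read().decode(charset, errors="replace")


def get_hops(url, fetch=fetch_page, max_hops=10):
    """Return the URLs visited, oldest first, following meta refreshes."""
    hops = []
    # Stop on a missing target, a repeated URL, or once max_hops is reached.
    while url and url not in hops and len(hops) < max_hops:
        hops.append(url)
        final_url, html = fetch(url)
        if final_url != url and final_url not in hops:
            hops.append(final_url)  # record where HTTP redirects ended up
        match = META_REFRESH.search(html)
        # Resolve relative targets against the page that declared them.
        url = urljoin(final_url, match.group(1)) if match else None
    return hops
```

Passing a fake `fetch` that looks pages up in a dict is an easy way to check the loop handling before pointing it at real sites.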

Answer 3

+1 A:

Here is a solution using BeautifulSoup and httplib2 (with certificate-based authentication):

import BeautifulSoup
import httplib2

def meta_redirect(content):
    soup  = BeautifulSoup.BeautifulSoup(content)

    result=soup.find("meta",attrs={"http-equiv":"Refresh"})
    if result:
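        # split only on the first ";" and tolerate whitespace before "url="
        wait, _, text = result["content"].partition(";")
        text = text.strip()
        if text.lower().startswith("url="):
            url = text[4:]
            return url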
    return None

def get_content(url, key, cert):

    h=httplib2.Http(".cache")
    h.add_certificate(key,cert,"")

    resp, content = h.request(url,"GET")

    # follow the chain of meta redirects, parsing each page only once
    target = meta_redirect(content)
    while target:
        resp, content = h.request(target, "GET")
        target = meta_redirect(content)

    return content

asmaier 2010-09-08 14:30:59
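Splitting the `content` attribute is the fragile step here, since real pages write things like `0; URL='...'`, a bare `5` with no URL, or a URL that itself contains a semicolon. Below is a small Python 3 sketch of a more tolerant parser. `parse_refresh_content` is a hypothetical helper, not part of BeautifulSoup or httplib2. It only handles `;` as the separator and treats any text left after the delay as the URL, which is an assumption about how lenient you want to be.

```python
def parse_refresh_content(content):
    """Split a meta refresh content value into (delay_seconds, url_or_None).

    Returns None when the value does not start with a number.
    """
    # Split on the first ";" only, so URLs containing ";" stay intact.
    delay_part, _, rest = content.partition(";")
    try:
        delay = float(delay_part.strip())
    except ValueError:
        return None
    rest = rest.strip()
    if rest[:4].lower() == "url=":
        rest = rest[4:].strip()
    # Drop matching quotes around the URL, e.g. url='http://...'
    if len(rest) >= 2 and rest[0] == rest[-1] and rest[0] in "'\"":
        rest = rest[1:-1]
    return delay, (rest or None)
```

A caller would follow the redirect only when the result is not None and the second element is not None.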


how to follow meta refreshes in Python
