I'm writing a simple Python script so I can test my websites from a different IP address.

The URL of a page is given in the query string; the script fetches the page and displays it to the user. The code below rewrites the tags that contain URLs, but I don't think it's complete or totally correct.

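import re
import urllib
import urlparse

# loc and proxy_script_url are assumed to be defined earlier in the script.
def rel2abs(rel_url, base=loc):
    return urlparse.urljoin(base, rel_url)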

def is_proxy_else_abs(tag, attr):
    if tag in ('a',):
        return True
    if tag in ('form', 'img', 'link') and attr in ('href', 'src', 'action', 'background'):
        return False

def repl(matchobj):
    if is_proxy_else_abs(matchobj.group(1).lower(), matchobj.group(3).lower()):
        # Route the link through the proxy script, passing the absolute URL as 'loc'.
        return r'<%s %s %s="http://%s?%s" ' %(matchobj.group(1), matchobj.group(2), matchobj.group(3), proxy_script_url, urllib.urlencode({'loc':rel2abs(matchobj.group(5))}))
    else:
        # Keep the resource outside the proxy, but make its URL absolute.
        return r'<%s %s %s="%s" ' %(matchobj.group(1), matchobj.group(2), matchobj.group(3), rel2abs(matchobj.group(5)))

def fix_urls(page):
    get_link_re = re.compile(r"""<(a|form|img|link) ([^>]*?)(href|src|action|background)\s*=\s*("|'?)([^>]*?)\4""", re.I|re.DOTALL)
    page = get_link_re.sub(repl, page)
    return page

The idea is that the href attributes of 'a' tags should be routed through the proxy script, but CSS, JavaScript, images, forms, etc. should not be, so their URLs have to be made absolute if they are relative in the original page.
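To make the two target forms concrete, here is a minimal sketch of what each kind of URL should end up as. It uses Python 3, where urljoin and urlencode live in urllib.parse; the page URL and the proxy host below are made-up values for illustration only.

```python
from urllib.parse import urlencode, urljoin

# Both values are assumptions for this example, not taken from the real script.
page_url = "http://example.com/blog/post.html"   # where the page was fetched from
proxy_script_url = "proxy.example/fetch.py"      # no scheme, as repl() adds "http://"


def absolute(rel_url):
    """Resolve a possibly relative URL against the page it came from."""
    return urljoin(page_url, rel_url)


def proxied(rel_url):
    """Build a link that sends the absolute URL back through the proxy as 'loc'."""
    return "http://%s?%s" % (proxy_script_url, urlencode({"loc": absolute(rel_url)}))


# Images, CSS, forms: absolute only.
print(absolute("../img/logo.png"))
# http://example.com/img/logo.png

# Links in 'a' tags: absolute, then wrapped in the proxy URL.
print(proxied("next.html"))
# http://proxy.example/fetch.py?loc=http%3A%2F%2Fexample.com%2Fblog%2Fnext.html
```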

The problem is that the code doesn't always work: CSS, for example, can be written in a number of ways. Is there a more comprehensive regex I can use?

+2  A: 

Please read the other postings here about parsing HTML, for example http://stackoverflow.com/questions/55391/python-regular-expression-for-html-parsing-beautifulsoup and http://stackoverflow.com/questions/71151/html-parser-in-python.

Use Beautiful Soup, not regular expressions.
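As a rough illustration of the parser-based approach, the sketch below rewrites URL attributes while re-emitting the page. Note that it swaps in the standard library's html.parser (Python 3) rather than Beautiful Soup, so it runs without extra installs; the class name, the fix_urls signature, and the example URLs are all assumptions made for this example. It makes every href, src, action and background attribute absolute and additionally routes 'a' links through the proxy.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Attributes treated as URLs, taken from the regex in the question.
URL_ATTRS = {"href", "src", "action", "background"}


class UrlRewriter(HTMLParser):
    """Re-emit a page piece by piece, rewriting URL attributes on the way."""

    def __init__(self, base_url, proxy_url):
        # convert_charrefs=False leaves entities in text (e.g. &amp;) untouched.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url    # URL the page was fetched from
        self.proxy_url = proxy_url  # hypothetical full URL of the proxy script
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if value is None:        # bare attribute such as "disabled"
                parts.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
                if tag == "a":       # only links go through the proxy
                    value = self.proxy_url + "?" + urlencode({"loc": value})
            # The parser unescapes attribute values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing + ">")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def fix_urls(page, base_url, proxy_url):
    parser = UrlRewriter(base_url, proxy_url)
    parser.feed(page)
    parser.close()
    return "".join(parser.out)


page = '<a href="/about">About</a><img src="logo.png">'
print(fix_urls(page, "http://example.com/dir/page.html",
               "http://proxy.example/fetch.py"))
```

Because the parser hands over each tag with its attributes already split out, quoting style and attribute order stop mattering. The sketch still ignores URLs inside style attributes and stylesheets (url(...)) and drops processing instructions, so treat it as a starting point rather than a complete rewriter.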

S.Lott