ansaurus

Question

How to find a relative URL and translate it to an absolute URL in Python

Answer 1

+1 A:

I'm not sure about what you're trying to achieve but using the BASE tag in HTML may do this trick for you without having to resort to regular expressions when doing the processing.

Ionuț G. Stan 2009-02-26 09:42:12

Answer 2

A:

Something like this should do it:

"(?:[^/:"]+|/(?!/))(?:/[^/"]+)*"

Gumbo 2009-02-26 09:48:14

Answer 3

+3 A:

Don't use regular expressions to parse HTML. Use a real parser for that. For example BeautifulSoup.

unbeknown 2009-02-26 10:00:31

Thx~ I'll read the document of BeautifulSoup carefully and find the way to achieve the goal

lucas 2009-02-26 15:39:45

Answer 4

+1 A:

First, I'd recommend using a HTML parser, such as BeautifulSoup. HTML is not a regular language, and thus can't be parsed fully by regular expressions alone. Parts of HTML can be parsed though.

If you don't want to use a full HTML parser, you could use something like this to approximate the work:

import re, urlparse

find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)')

def fix_urls(document, base_url):
    ret = []
    last_end = 0
    for match in find_re.finditer(document):
        url = match.group(1)
        if url[0] in "\"'":
            url = url.strip(url[0])
        parsed = urlparse.urlparse(url)
        if parsed.scheme == parsed.netloc == '': #relative to domain
            url = urlparse.urljoin(base_url, url)
            ret.append(document[last_end:match.start(1)])
            ret.append('"%s"' % (url,))
            last_end = match.end(1)
    ret.append(document[last_end:])
    return ''.join(ret)

Example:

>>> document = '''<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="../pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href="/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br /></td></tr>'''
>>> fix_urls(document,"http://www.opensolaris.org/os/community/on/flag-days/all/")
'<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="http://www.opensolaris.org/os/community/on/flag-days/pages/2009022501/"&gt;Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/777/"&gt;PSARC/2008/777&lt;/a&gt;&lt;br /> CPU Deep Idle Keyword - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/663/"&gt;PSARC/2008/663&lt;/a&gt;&lt;br /></td></tr>'
>>>

MizardX 2009-02-26 10:16:50

Thank you for your help~Actually I used Beautiful Soup in my code, but only some functions to extract contents between tags and remove some tags. I'll read its document carefully and find the better way to achieve the goal~

lucas 2009-02-26 15:38:36

I have finished the work using Beautiful Soup, haha~

lucas 2009-02-26 16:50:29

ansaurus

tags:

views:

answers:

How to find a relative URL and translate it to an absolute URL in Python

related questions