tags:

views:

1241

answers:

4

I extract some code from a web page (http://www.opensolaris.org/os/community/on/flag-days/all/) like follows,

<tr class="build">
  <th colspan="0">Build 110</th>
</tr>
<tr class="arccase project flagday">
  <td>Feb-25</td>
  <td></td>
  <td></td>
  <td></td>
  <td>
    <a href="../pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />
    cpupm keyword mode extensions - <a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br />
    CPU Deep Idle Keyword - <a href="/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br />
  </td>
</tr>

and there are some relative url path in it, now I want to search it with regular expression and replace them with absolute url path. Since I know urljoin can do the replace work like that,

>>> urljoin("http://www.opensolaris.org/os/community/on/flag-days/all/",
...         "/os/community/arc/caselog/2008/777/")
'http://www.opensolaris.org/os/community/arc/caselog/2008/777/'

Now I want to know that how to search them using regular expressions, and finally tanslate the code to,

<tr class="build">
  <th colspan="0">Build 110</th>
</tr>
<tr class="arccase project flagday">
  <td>Feb-25</td>
  <td></td>
  <td></td>
  <td></td>
  <td>
    <a href="http://www.opensolaris.org/os/community/on/flag-days/all//pages/2009022501/"&gt;Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />
    cpupm keyword mode extensions - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/777/"&gt;PSARC/2008/777&lt;/a&gt;&lt;br />
    CPU Deep Idle Keyword - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/663/"&gt;PSARC/2008/663&lt;/a&gt;&lt;br />
  </td>
</tr>

My knowledge of regular expressions is so poor that I want to know how to do that. Thanks

I have finished the work using Beautiful Soup, haha~ Thx for everybody!

+1  A: 

I'm not sure about what you're trying to achieve but using the BASE tag in HTML may do this trick for you without having to resort to regular expressions when doing the processing.

Ionuț G. Stan
A: 

Something like this should do it:

"(?:[^/:"]+|/(?!/))(?:/[^/"]+)*"
Gumbo
+3  A: 

Don't use regular expressions to parse HTML. Use a real parser for that. For example BeautifulSoup.

unbeknown
Thx~ I'll read the document of BeautifulSoup carefully and find the way to achieve the goal
lucas
+1  A: 

First, I'd recommend using a HTML parser, such as BeautifulSoup. HTML is not a regular language, and thus can't be parsed fully by regular expressions alone. Parts of HTML can be parsed though.

If you don't want to use a full HTML parser, you could use something like this to approximate the work:

import re, urlparse

find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)')

def fix_urls(document, base_url):
    ret = []
    last_end = 0
    for match in find_re.finditer(document):
        url = match.group(1)
        if url[0] in "\"'":
            url = url.strip(url[0])
        parsed = urlparse.urlparse(url)
        if parsed.scheme == parsed.netloc == '': #relative to domain
            url = urlparse.urljoin(base_url, url)
            ret.append(document[last_end:match.start(1)])
            ret.append('"%s"' % (url,))
            last_end = match.end(1)
    ret.append(document[last_end:])
    return ''.join(ret)

Example:

>>> document = '''<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="../pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href="/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br /></td></tr>'''
>>> fix_urls(document,"http://www.opensolaris.org/os/community/on/flag-days/all/")
'<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="http://www.opensolaris.org/os/community/on/flag-days/pages/2009022501/"&gt;Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/777/"&gt;PSARC/2008/777&lt;/a&gt;&lt;br /> CPU Deep Idle Keyword - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/663/"&gt;PSARC/2008/663&lt;/a&gt;&lt;br /></td></tr>'
>>>
MizardX
Thank you for your help~Actually I used Beautiful Soup in my code, but only some functions to extract contents between tags and remove some tags. I'll read its document carefully and find the better way to achieve the goal~
lucas
I have finished the work using Beautiful Soup, haha~
lucas