views: 182
answers: 4

Hi all!

I am looking for a regular expression pattern that matches URLs in HTML that aren't already wrapped in an 'a' tag, so that I can wrap them in an 'a' tag afterwards (i.e. highlight all non-highlighted links).

The input is simple HTML with only 'a', 'b', 'i', 'br', 'p' and 'img' tags allowed. No other HTML tags should appear in the input, but the tags above may appear in any combination.

So the pattern should skip all URLs that are already part of existing 'a' tags, and match every other URL that is just plain text, not wrapped in an 'a' tag, and therefore not yet highlighted or a hyperlink. Ideally the pattern would match URLs beginning with http://, https:// or www., and URLs ending with .net, .com or .org when they don't begin with http://, https:// or www.

I've tried something like '(?!<[aA][^>]+>)http://[a-zA-Z0-9._-]+(?!)' to match a simpler case than the one described above, but it seems this task is not so straightforward.

Thanks a lot for any help.

+5  A: 

You could use BeautifulSoup or similar to exclude all urls that are already part of links.

Then you can match the remaining plain text with one of the many URL regular expressions already out there (Google "url regular expression"; which one you want depends on how fancy you want to get).
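A minimal sketch of that approach, using only the standard library's html.parser in place of BeautifulSoup (the class name and the deliberately simple URL pattern are illustrative, not a full solution):

```python
import re
from html.parser import HTMLParser

# A deliberately simple url pattern -- swap in a more thorough one as needed.
URL_RE = re.compile(r'(?:https?://|www\.)\S+')

class BareUrlFinder(HTMLParser):
    """Collect urls that appear in text nodes outside any <a>...</a>."""
    def __init__(self):
        super().__init__()
        self.in_a = 0   # nesting depth of <a> tags
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_a:
            self.in_a -= 1

    def handle_data(self, data):
        # Only text that is not inside an <a> tag is searched for urls.
        if not self.in_a:
            self.urls.extend(URL_RE.findall(data))

finder = BareUrlFinder()
finder.feed('<p>See <a href="http://a.com">http://a.com</a> and http://b.org</p>')
print(finder.urls)  # -> ['http://b.org']
```

The parser keeps the "am I inside a link?" state that a single regex cannot, which is the point of the suggestion above.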

Kiv
+1 for suggesting BeautifulSoup. This problem isn't well-suited to a pure regex solution.
Jim Lewis
+4  A: 

Parsing HTML with a single regex is practically impossible, since regular expressions can't keep track of nested structure.

Build or use a real parser instead, such as BeautifulSoup or html5lib.

The code below uses BeautifulSoup to extract all links from a page:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
from urllib2 import urlopen

url = 'http://stackoverflow.com/questions/1296778/'
stream = urlopen(url)
soup = BeautifulSoup(stream)
for link in soup.findAll('a'):
    if link.has_key('href'):  # skip anchors that have no href attribute
        print unicode(link.string), '->', link['href']

Similarly, you could find all the text nodes using soup.findAll(text=True) and search for urls there.

Searching for urls is also very complex - you wouldn't believe what's allowed in a url. A simple search shows thousands of example patterns, but none matches the specs exactly. You should try what works best for you.
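For the three cases the question asks about (http://, https://, www., plus bare domains ending in .net, .com or .org), an illustrative pattern might look like this - a rough sketch, not a spec-compliant url matcher:

```python
import re

# Illustrative pattern for the three cases in the question; real-world
# url grammars are far messier than this.
URL_PATTERN = re.compile(
    r'(?:https?://|www\.)[^\s<>"]+'                              # schemed or www. urls
    r'|\b[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.(?:net|com|org)\b'   # bare domains
)

print(URL_PATTERN.findall('visit example.com or https://foo.org/x or www.bar.net'))
# -> ['example.com', 'https://foo.org/x', 'www.bar.net']
```

Excluding whitespace, angle brackets and quotes from the schemed branch keeps the match from swallowing surrounding HTML, but plenty of legal url characters are still mishandled by a pattern this simple.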

nosklo
A: 

You can do it as a two step process. Think of something like this:

regex1 = a url wrapped in the tag of interest

regex2 = any url

Step 1:

  • Use regex1 to identify the offsets of the urls that are already wrapped in a tag.

Step 2:

  • Use regex2 to identify all the urls.
  • Reject those that are identified by regex1.
  • Do what you wanted to with the remaining urls.
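The two steps above can be sketched with re match offsets; the two patterns here are hypothetical stand-ins for regex1 and regex2, kept deliberately simple:

```python
import re

# regex1: urls already wrapped in an <a> tag (whole <a>...</a> span)
wrapped_re = re.compile(r'<a\b[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)
# regex2: any url (simplified for illustration)
url_re = re.compile(r'(?:https?://|www\.)[^\s<>"]+')

def bare_urls(html):
    """Step 1: collect offsets of wrapped spans; step 2: keep urls outside them."""
    wrapped = [m.span() for m in wrapped_re.finditer(html)]
    result = []
    for m in url_re.finditer(html):
        # Reject any url whose start falls inside a wrapped span.
        if not any(lo <= m.start() < hi for lo, hi in wrapped):
            result.append(m.group())
    return result

print(bare_urls('<a href="http://a.com">http://a.com</a> plus http://b.org'))
# -> ['http://b.org']
```

Comparing offsets rather than matched strings is what lets step 2 reject a url that also happens to appear inside an existing link.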
hashable
A: 

Thanks guys! Below is my solution:

import re
from django.utils.html import urlize # Yes, I am using Django's urlize to do all the dirty work :)

def urlize_html(value):
    """
    Urlizes text containing simple HTML tags.
    """
    A_IMG_REGEX = r'(<[aA][^>]+>[^<]+</[aA]>|<[iI][mM][gG][^>]+>)'
    a_img_re = re.compile(A_IMG_REGEX)

    TAG_REGEX = r'(<[a-zA-Z]+[^>]+>|</[a-zA-Z]>)'
    tag_re = re.compile(TAG_REGEX)

    def process(s, p, f):
        return "".join([c if p.match(c) else f(c) for c in p.split(s)])

    def process_urlize(s):
        return process(s, tag_re, urlize)

    return process(value, a_img_re, process_urlize)
Serge Tarkovski
I don't have to think too much to create html that would make this fail.
nosklo