views: 182
answers: 4

Hi all!

I am looking for a regular expression pattern that matches URLs in HTML that aren't already wrapped in an 'a' tag, so that I can wrap them in an 'a' tag afterwards (i.e. highlight all non-highlighted links).

The input is simple HTML with only 'a', 'b', 'i', 'br', 'p' and 'img' tags allowed. No other HTML tags should appear in the input, but the tags above may appear in any combination.

So the pattern should skip all URLs that are already part of existing 'a' tags, and match every other URL that is just plain text, not wrapped in an 'a' tag, and therefore not yet highlighted or a hyperlink. Ideally the pattern would match URLs beginning with http://, https:// or www., and URLs ending with .net, .com or .org when they don't begin with http://, https:// or www.

I've tried something like '(?!<[aA][^>]+>)http://[a-zA-Z0-9._-]+(?!)' to match a simpler case than the one described above, but it seems this task is not so straightforward.

Thanks a lot for any help.

+5  A: 

You could use BeautifulSoup or similar to exclude all urls that are already part of links.

Then you can match the remaining plain text with one of the many URL regular expressions already out there (Google "url regular expression"; which one you want depends on how fancy you want to get).
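A minimal sketch of that approach, using only the standard library's html.parser in place of BeautifulSoup (the class name and the deliberately simple URL pattern are illustrative, not a full solution):

```python
import re
from html.parser import HTMLParser

# A deliberately simple url pattern -- swap in a more thorough one as needed.
URL_RE = re.compile(r'(?:https?://|www\.)\S+')

class BareUrlFinder(HTMLParser):
    """Collect urls that appear in text nodes outside any <a>...</a>."""
    def __init__(self):
        super().__init__()
        self.in_a = 0   # nesting depth of <a> tags
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_a:
            self.in_a -= 1

    def handle_data(self, data):
        # Only text that is not inside an <a> tag is searched for urls.
        if not self.in_a:
            self.urls.extend(URL_RE.findall(data))

finder = BareUrlFinder()
finder.feed('<p>See <a href="http://a.com">http://a.com</a> and http://b.org</p>')
print(finder.urls)  # -> ['http://b.org']
```

The parser keeps the "am I inside a link?" state that a single regex cannot, which is the point of the suggestion above.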

Kiv
+1 for suggesting BeautifulSoup. This problem isn't well-suited to a pure regex solution.
Jim Lewis
+4  A: 

Parsing HTML with a single regex is practically impossible, since regular expressions can't keep track of nested structure.

Build or use a real parser instead, such as BeautifulSoup or html5lib.

The code below uses BeautifulSoup to extract all links from a page:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
from urllib2 import urlopen

url = 'http://stackoverflow.com/questions/1296778/'
stream = urlopen(url)
soup = BeautifulSoup(stream)
for link in soup.findAll('a'):
    if link.has_key('href'):  # skip anchors that have no href attribute
        print unicode(link.string), '->', link['href']

Similarly, you could find all the text nodes using soup.findAll(text=True) and search for urls there.

Searching for urls is also very complex - you wouldn't believe what's allowed in a url. A simple search shows thousands of example patterns, but none matches the specs exactly. You should try what works best for you.
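For the three cases the question asks about (http://, https://, www., plus bare domains ending in .net, .com or .org), an illustrative pattern might look like this - a rough sketch, not a spec-compliant url matcher:

```python
import re

# Illustrative pattern for the three cases in the question; real-world
# url grammars are far messier than this.
URL_PATTERN = re.compile(
    r'(?:https?://|www\.)[^\s<>"]+'                              # schemed or www. urls
    r'|\b[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.(?:net|com|org)\b'   # bare domains
)

print(URL_PATTERN.findall('visit example.com or https://foo.org/x or www.bar.net'))
# -> ['example.com', 'https://foo.org/x', 'www.bar.net']
```

Excluding whitespace, angle brackets and quotes from the schemed branch keeps the match from swallowing surrounding HTML, but plenty of legal url characters are still mishandled by a pattern this simple.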

nosklo
A: 

You can do it as a two step process. Think of something like this:

regex1 = a url wrapped in the tag of interest

regex2 = any url

Step 1:

  • Use regex1 to identify the offsets of the urls that are already wrapped in a tag.

Step 2:

  • Use regex2 to identify all the urls.
  • Reject those that are identified by regex1.
  • Do what you wanted to with the remaining urls.
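The two steps above can be sketched with re match offsets; the two patterns here are hypothetical stand-ins for regex1 and regex2, kept deliberately simple:

```python
import re

# regex1: urls already wrapped in an <a> tag (whole <a>...</a> span)
wrapped_re = re.compile(r'<a\b[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)
# regex2: any url (simplified for illustration)
url_re = re.compile(r'(?:https?://|www\.)[^\s<>"]+')

def bare_urls(html):
    """Step 1: collect offsets of wrapped spans; step 2: keep urls outside them."""
    wrapped = [m.span() for m in wrapped_re.finditer(html)]
    result = []
    for m in url_re.finditer(html):
        # Reject any url whose start falls inside a wrapped span.
        if not any(lo <= m.start() < hi for lo, hi in wrapped):
            result.append(m.group())
    return result

print(bare_urls('<a href="http://a.com">http://a.com</a> plus http://b.org'))
# -> ['http://b.org']
```

Comparing offsets rather than matched strings is what lets step 2 reject a url that also happens to appear inside an existing link.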
hashable
A: 

Thanks guys! Below is my solution:

import re
from django.utils.html import urlize # Yes, I am using Django's urlize to do all the dirty work :)

def urlize_html(value):
    """
    Urlizes text containing simple HTML tags.
    """
    A_IMG_REGEX = r'(<[aA][^>]+>[^<]+</[aA]>|<[iI][mM][gG][^>]+>)'
    a_img_re = re.compile(A_IMG_REGEX)

    TAG_REGEX = r'(<[a-zA-Z]+[^>]+>|</[a-zA-Z]>)'
    tag_re = re.compile(TAG_REGEX)

    def process(s, p, f):
        return "".join([c if p.match(c) else f(c) for c in p.split(s)])

    def process_urlize(s):
        return process(s, tag_re, urlize)

    return process(value, a_img_re, process_urlize)
Serge Tarkovski
I don't have to think too much to create html that would make this fail.
nosklo