Hi all

Although I know I could use some huge regex such as the one posted here, I'm wondering if there is some clever way to extract URLs from plain text, either with a standard module or perhaps a third-party add-on?

Simple question, but nothing jumped out on Google (or Stack Overflow).

Look forward to seeing how y'all do this!

Jamie

+3  A: 

Use a regular expression.

Reply to comment from the OP: I know this is not helpful. I am telling you that the correct way to solve the problem, as you stated it, is to use a regular expression.
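
For illustration only, here is a minimal sketch of what the regex approach might look like using the standard `re` module. The pattern and the name `find_urls` are assumptions made for this example; the pattern is deliberately loose and is not the large regex the question refers to.

```python
import re

# Deliberately loose pattern: an http/https scheme, then everything
# up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")

def find_urls(text):
    """Return every http(s) URL-like substring found in plain text."""
    return URL_PATTERN.findall(text)

print(find_urls("see http://example.com/a and https://example.org/b?x=1 for details"))
```

A pattern this simple also captures punctuation that touches the URL, such as a trailing full stop or closing bracket, so the matches may need cleaning up afterwards.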

ddaa
I *know* I could use a regex!!!! I'm asking if there is a module or other way to do this without coming up with the ultimate regex to extract URLs.
jkp
In the end I apologise to ddaa. The only way is to use a regex, and there is no module out there that exposes this in a ready-wrapped form. The suggestion to use BeautifulSoup does not work when extracting from plain text.
jkp
+1  A: 

If you know that the URL in the string is followed by a space, you can do something like this:

s is the string containing the URL

>>> t = s[s.find("http://"):]
>>> t = t[:t.find(" ")]

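Otherwise you need to check whether find returns -1: if "http://" is missing, the first slice starts at the last character of s, and if no space follows the URL, the second slice drops its final character.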
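
A hedged sketch of the same idea with the -1 checks added, also accepting https://. The function name `first_url` is made up for this example, and like the snippet above it only returns the first URL and assumes a URL ends at the next space.

```python
def first_url(s):
    """Return the first http:// or https:// URL in s, or None if there is none."""
    # Collect the start index of each scheme that actually occurs in s.
    starts = [i for i in (s.find("http://"), s.find("https://")) if i != -1]
    if not starts:
        return None
    start = min(starts)
    end = s.find(" ", start)
    # find() returns -1 when no space follows, so take the rest of the string.
    return s[start:] if end == -1 else s[start:end]

print(first_url("go to https://example.com now"))
```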

sinzi
what about https://?
Brandon H
>>> t = s[s.find("https://"):]
>>> t = t[:t.find(" ")]
islam
+4  A: 

You can use BeautifulSoup.

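from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

def extractlinks(html):
    """Return the href of every anchor tag in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href, which a['href'] would fail on
    return [a['href'] for a in soup.find_all('a', href=True)]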

Note that the solution with regexes is faster, although it will not be as accurate.

Sebastjan Trepča
Sebastjan: I know BeautifulSoup, but the problem is that it will only extract anchored URLs. I'm trying to search plain text for anything URL-like. Thanks for the suggestion though.
jkp
+1  A: 

How about you write your own module that implements that regex?

ck
Yup, of course this is an option, but again, I really wanted to know whether anyone had already done this! Usually in the Python world there is a module somewhere to do the job. I'd rather not reinvent the wheel, and this is a common problem.
jkp
+2  A: 

Look at Django's approach here: http://code.djangoproject.com/browser/django/trunk/django/utils/html.py#L77 Regexps are too limited for the job, and you have to use heuristics to get results that are mostly right.
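
To give a rough idea of what such heuristics can look like, here is a small sketch that splits the text on whitespace, trims punctuation wrapped around each word, and keeps the words that parse as http(s) URLs. This is not Django's code: the function name, the punctuation sets, and the choice of `urllib.parse.urlsplit` as the validity check are all assumptions made for this example.

```python
from urllib.parse import urlsplit

# Punctuation that often touches a URL in prose without being part of it
# (illustrative choices, not an exhaustive list).
LEADING = "([{<'\""
TRAILING = ".,;:!?)]}>'\""

def urls_in_text(text):
    """Return whitespace-delimited words that look like http(s) URLs."""
    found = []
    for word in text.split():
        candidate = word.lstrip(LEADING).rstrip(TRAILING)
        try:
            parts = urlsplit(candidate)
        except ValueError:
            # urlsplit rejects some malformed input, e.g. unbalanced brackets.
            continue
        if parts.scheme in ("http", "https") and parts.netloc:
            found.append(candidate)
    return found

print(urls_in_text("Read (http://example.com/docs), then https://example.org/x."))
```

The trade-off is the one described above: trimming trailing punctuation gets most prose right, but it will also cut a URL that genuinely ends in one of those characters.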

Björn Lindqvist