ansaurus

Question

What's the cleanest way to extract URLs from a string using Python?

Answer 1

+3 A:

Use a regular expression.
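A minimal sketch of what such a regular expression might look like. The pattern is deliberately simple and is an assumption, not a complete URL grammar; the names URL_RE and find_urls are hypothetical.

```python
import re

# Deliberately loose pattern: an http or https scheme followed by everything
# up to the next whitespace or angle bracket. Not a full RFC 3986 matcher.
URL_RE = re.compile(r"https?://[^\s<>]+")

def find_urls(text):
    """Return URL-like substrings of text, minus trailing punctuation."""
    # Sentence punctuation that directly follows a URL is usually not part of it.
    return [match.rstrip(".,;:!?)") for match in URL_RE.findall(text)]
```

Calling find_urls on a plain-text string returns a list of matches in order of appearance, or an empty list when nothing matches. URLs that legitimately end in one of the stripped characters would be truncated, which is the kind of accuracy trade-off a simple pattern makes.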

Reply to comment from the OP: I know this is not helpful. I am telling you that the correct way to solve the problem, as you stated it, is to use a regular expression.

ddaa 2009-02-06 11:54:44

I *know* I could use a regex!!!! I'm asking if there is a module or other way to do this without coming up with the ultimate regex to extract URLs.

jkp 2009-02-06 11:55:38

In the end I apologise to ddaa. The only way is to use a regex, and there is no module out there that exposes this in a ready-wrapped form. The suggestion to use BeautifulSoup does not work when extracting from plain text.

jkp 2009-02-06 13:02:07

Answer 2

+1 A:

If you know that the URL is followed by a space in the string, you can do something like this:

s is the string containing the URL

>>> t = s[s.find("http://"):]
>>> t = t[:t.find(" ")]

Otherwise you need to check whether find returns -1. For the first call that means no URL was found; for the second it means the URL runs to the end of the string.
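The slicing above, together with the -1 checks it mentions, could be wrapped up roughly as follows. This is only a sketch: the function name first_url and the schemes parameter are hypothetical, and it assumes a URL ends at the next space.

```python
def first_url(s, schemes=("http://", "https://")):
    """Return the first URL in s located with plain str.find, or None."""
    # Collect the positions of the schemes that actually occur in s.
    starts = [i for i in (s.find(prefix) for prefix in schemes) if i != -1]
    if not starts:
        return None
    start = min(starts)
    end = s.find(" ", start)
    # find returns -1 when no space follows, so take the rest of the string.
    return s[start:] if end == -1 else s[start:end]
```

Checking both prefixes in one pass also covers https:// without a second copy of the code.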

sinzi 2009-02-06 12:06:16

what about https://?

Brandon H 2009-11-24 15:27:03

>>> t = s[s.find("https://"):]
>>> t = t[:t.find(" ")]

islam 2010-06-09 16:09:25

Answer 3

+4 A:

You can use BeautifulSoup.

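from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

def extractlinks(html):
    # Parse the markup with the standard-library HTML parser.
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href, which would otherwise raise KeyError.
    anchors = soup.find_all('a', href=True)
    links = []
    for a in anchors:
        links.append(a['href'])
    return links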

Note that a regex solution is faster, although it will not be as accurate.

Sebastjan Trepča 2009-02-06 12:12:58

Sebastian: I know BeautifulSoup, but the problem is that it will only extract anchored URLs. I'm trying to search plain text for anything URL-like. Thanks for the suggestion though.

jkp 2009-02-06 12:18:08

Answer 4

+1 A:

How about you write your own module that implements that regex?

ck 2009-02-06 12:17:05

Yup, of course this is an option. Again, really I wanted to know if anyone had already done this! Usually in the Python world there is a module somewhere to do the job. I'd rather not reinvent the wheel, and this is a common problem.

jkp 2009-02-06 12:19:34

Answer 5

+2 A:

Look at Django's approach here: http://code.djangoproject.com/browser/django/trunk/django/utils/html.py#L77 Regexps are too limited for the job, and you have to use heuristics to get results that are mostly right.
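The linked Django code is not reproduced here. As a rough illustration of the kind of heuristic meant, one could go word by word, peel off surrounding punctuation, and then test for a URL-like prefix. The names below are hypothetical, the punctuation sets and prefixes are assumptions, and this is not Django's implementation.

```python
# Punctuation that often wraps a URL in running text (assumed sets).
LEADING = "([{<'\""
TRAILING = ".,;:!?)]}>'\""

def urlish_words(text):
    """Return whitespace-separated words that look like URLs."""
    found = []
    for word in text.split():
        # Peel wrapping punctuation, e.g. "(http://example.com)." or "<...>".
        core = word.lstrip(LEADING).rstrip(TRAILING)
        if core.startswith(("http://", "https://", "www.")):
            found.append(core)
    return found
```

Working on whole words rather than one large pattern makes it easy to add further checks later, such as known top-level domains, at the cost of results that are only mostly right.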

Björn Lindqvist 2009-12-06 17:03:32
