ansaurus

Question

Answer 1

+8 A:

Misunderstood question:

>>> from urllib.parse import urlparse
>>> urlparse('http://www.ggogle.com/test?t')
ParseResult(scheme='http', netloc='www.ggogle.com', path='/test',
        params='', query='t', fragment='')

Or, for the Python 2.* version:

>>> from urlparse import urlparse
>>> urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
        params='', query='', fragment='')

ETA: regexes are indeed the best option here:

>>> import re
>>> s = 'This is my tweet check it out http://tinyurl.com/blah and http://blabla.com'
>>> re.findall(r'(https?://\S+)', s)
['http://tinyurl.com/blah', 'http://blabla.com']
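
The two approaches above can be combined: find candidate URLs with the regex, then hand each one to urlparse to pull out the host. What follows is only a sketch under Python 3; the helper name extract_urls is hypothetical and not part of the original answer.

```python
import re
from urllib.parse import urlparse


def extract_urls(text):
    """Return (url, host) pairs for every http/https URL found in text."""
    # \S+ grabs everything up to the next whitespace character
    urls = re.findall(r'https?://\S+', text)
    # urlparse splits each URL; netloc is the host (and port, if any)
    return [(url, urlparse(url).netloc) for url in urls]


s = 'This is my tweet check it out http://tinyurl.com/blah and http://blabla.com'
for url, host in extract_urls(s):
    print(url, host)
```

Note that \S+ will also swallow trailing punctuation such as a full stop directly after a link, so real tweets may need extra cleanup.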

SilentGhost 2009-05-08 14:20:51

Answer 2

+2 A:

In response to the OP's edit I hijacked Find Hyperlinks in Text using Python (twitter related) and came up with this:

import re

myString = "This is my tweet check it out http://tinyurl.com/blah"

print re.search(r"(?P<url>https?://[^\s]+)", myString).group("url")

Andrew Hare 2009-05-08 14:39:32

I get an "invalid syntax" with the last line.

Kyle Hayes 2009-05-08 14:52:16

Ok, got it to work without the print statement for some reason

Kyle Hayes 2009-05-08 14:53:23

Keep in mind that regex won't catch https:// links

Chris Lawlor 2009-05-08 17:34:35

Good point - I simply copy/pasted the original regex. I fixed it to be a bit more robust and included your suggestion - thanks!

Andrew Hare 2009-05-08 17:51:16

If you get a syntax error on the print statement, you're probably using Python 3.0, which removes the print statement and provides a print("Hello, world.") function instead.
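
For reference, a sketch of how the snippet from the answer might look under Python 3, where print is a function. The variable names my_string, match and url are illustrative, and the None check is an addition: re.search returns None when nothing matches, so calling .group() directly would raise AttributeError on a string without a link.

```python
import re

my_string = "This is my tweet check it out http://tinyurl.com/blah"

# re.search returns a match object, or None if the pattern is not found
match = re.search(r"(?P<url>https?://[^\s]+)", my_string)

# Only call .group() when there actually is a match
url = match.group("url") if match else None

# Python 3: print is a function, so the parentheses are required
print(url)
```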

Brandon Craig Rhodes 2009-05-08 17:55:20

Answer 3

A:

Regarding this:

import re
myString = "This is my tweet check it out http://tinyurl.com/blah"
print re.search(r"(?P<url>https?://[^\s]+)", myString).group("url")

It won't work well if you have multiple urls in the string. If the string looks like:

myString = "This is my tweet check it out http://tinyurl.com/blah and http://blabla.com"

You may do something like this:

# Check each whitespace-separated word on its own
for item in myString.split(" "):
    match = re.search(r"(?P<url>https?://[^\s]+)", item)
    # re.search returns None for words that are not URLs
    if match:
        print match.group("url")

bogdan 2010-06-03 11:55:20

@bogdan: I fixed your post, please stop messing with it.

SilentGhost 2010-06-03 12:00:05

Or you could just do: print re.findall(r"(?P<url>https?://[^\s]+)", myString)
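
With a single group in the pattern, re.findall returns only the captured strings, so the group name plays no role there. If the named group is wanted, re.finditer yields match objects instead. A minimal Python 3 sketch; URL_PATTERN and find_urls are hypothetical names chosen for illustration.

```python
import re

# Same pattern as in the thread, compiled once for reuse
URL_PATTERN = re.compile(r"(?P<url>https?://[^\s]+)")


def find_urls(text):
    """Return every URL in text, read from the named group of each match."""
    return [m.group("url") for m in URL_PATTERN.finditer(text)]


my_string = "This is my tweet check it out http://tinyurl.com/blah and http://blabla.com"
print(find_urls(my_string))
```

Unlike the split-and-search loop, this needs no exception handling: a string without links simply produces an empty list.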

bogdan 2010-06-03 13:02:51

Extracting a URL in Python
