In regard to: http://stackoverflow.com/questions/720113/find-hyperlinks-in-text-using-python-twitter-related

How can I extract just the url so I can put it into a list/array?


Edit

Let me clarify: I don't want to parse the URL into pieces. I want to extract the URL from the text of the string and put it into an array. Thanks!

+8  A: 

Misunderstood question:

>>> from urllib.parse import urlparse
>>> urlparse('http://www.ggogle.com/test?t')
ParseResult(scheme='http', netloc='www.ggogle.com', path='/test',
        params='', query='t', fragment='')

Or the Python 2.x version:

>>> from urlparse import urlparse
>>> urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
        params='', query='', fragment='')

ETA: a regex is indeed the best option here:

>>> import re
>>> s = 'This is my tweet check it out http://tinyurl.com/blah and http://blabla.com'
>>> re.findall(r'(https?://\S+)', s)
['http://tinyurl.com/blah', 'http://blabla.com']
SilentGhost
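The two halves of the answer above can be combined: first pull the URLs out of the text, then parse each one. The following is only a sketch for Python 3; the variable names `urls` and `hosts` are illustrative and not part of the original answer.

```python
import re
from urllib.parse import urlparse

s = 'This is my tweet check it out http://tinyurl.com/blah and http://blabla.com'

# Step 1: extract every http/https URL from the free text.
urls = re.findall(r'https?://\S+', s)

# Step 2: parse each extracted URL, here keeping only the host part.
hosts = [urlparse(u).netloc for u in urls]

print(urls)
print(hosts)
```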
+2  A: 

In response to the OP's edit, I borrowed the regex from "Find Hyperlinks in Text using Python (twitter related)" and came up with this:

import re

myString = "This is my tweet check it out http://tinyurl.com/blah"

print re.search("(?P<url>https?://[^\s]+)", myString).group("url")
Andrew Hare
I get an "invalid syntax" with the last line.
Kyle Hayes
Ok, got it to work without the print statement for some reason
Kyle Hayes
Keep in mind that regex won't catch https:// links
Chris Lawlor
Good point - I simply copy/pasted the original regex. I fixed it to be a bit more robust and included your suggestion - thanks!
Andrew Hare
If you get a syntax error on the print statement, you're probably using Python 3.0, which removes the print statement and provides a print("Hello, world.") function instead.
Brandon Craig Rhodes
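For readers on Python 3, the search-based snippet in this answer might be written roughly as follows. This is a minimal sketch: the helper name `first_url` is hypothetical, and returning `None` when nothing matches is an assumption about the desired behavior (the original one-liner would raise an AttributeError instead).

```python
import re

# Same pattern as in the answer, written as a raw string.
URL_PATTERN = re.compile(r"(?P<url>https?://[^\s]+)")

def first_url(text):
    """Return the first http(s) URL found in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group("url") if match else None

my_string = "This is my tweet check it out http://tinyurl.com/blah"
print(first_url(my_string))
```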
A: 

Regarding this:

import re
myString = "This is my tweet check it out http://tinyurl.com/blah"
print re.search("(?P<url>https?://[^\s]+)", myString).group("url")

re.search only returns the first match, so it won't work well if you have multiple URLs in the string. If the string looks like:

myString = "This is my tweet check it out http://tinyurl.com/blah and http://blabla.com"

You may do something like this:

# Check each whitespace-separated word on its own
for item in myString.split():
    try:
        print re.search("(?P<url>https?://[^\s]+)", item).group("url")
    except AttributeError:
        # re.search returned None, so this word is not a URL
        pass
bogdan
@bogdan: I fixed your post; please stop messing it up.
SilentGhost
Or you could just do: print re.findall("(?P<url>https?://[^\s]+)", myString)
bogdan
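Since the question asks for the URLs in a list, the findall approach can be wrapped up as below for Python 3. Treat it as a sketch: the function name `extract_urls` is made up for illustration, and stripping trailing punctuation is an added assumption, because `[^\s]+` happily swallows a comma or full stop that directly follows a link.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s]+")

def extract_urls(text):
    """Return a list of all http(s) URLs found in text."""
    # Assumption: punctuation right after a URL belongs to the sentence,
    # not to the link, so it is stripped from the end of each match.
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(text)]

my_string = "This is my tweet check it out http://tinyurl.com/blah, and https://blabla.com."
print(extract_urls(my_string))
```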