ansaurus

Question

Answer 1

+1 A:

This is definitely a non-trivial task, assuming you want to remove any valid URL. I'd take a look at the RegExLib.com page on the topic.

theraccoonbear 2009-12-18 18:06:46

Answer 2

+1 A:

This previous question will get you off to a good start on matching the URL (i.e. RegExLib.com); then it's just a matter of removing the matches.
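As a rough sketch of the "match, then remove" idea, here is one way it could look with Python's `re` module. The pattern and the `remove_urls` helper are illustrative assumptions, not taken from RegExLib or the linked question; the pattern only catches `http`, `https` and `ftp` URLs that run up to the next whitespace, so it is nowhere near a full URL grammar.

```python
import re

# Deliberately simple pattern (an assumption, not a full URL grammar):
# an http/https/ftp scheme, then "://", then a run of non-whitespace.
URL_PATTERN = re.compile(r'(?:https?|ftp)://\S+', re.IGNORECASE)

def remove_urls(text):
    """Return text with every match of URL_PATTERN removed."""
    stripped = URL_PATTERN.sub('', text)
    # Collapse the doubled spaces left behind by a removed URL.
    return re.sub(r'[ \t]{2,}', ' ', stripped).strip()

print(remove_urls('See http://example.com/url/?x=data for details.'))
```

One known weakness: punctuation that directly follows a URL (a trailing period or closing parenthesis) is swallowed along with it, so a stricter pattern would be needed for prose-heavy input.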

curtisk 2009-12-18 18:07:16

Answer 3

A:

You are better off using urllib instead of regex, because all of the work is done for you. There are various splitting functions available:

>>> import urllib
>>> [x for x in dir(urllib) if 'split' in x]
['splitattr', 'splithost', 'splitnport', 'splitpasswd', 'splitport', 'splitquery', 'splittag', 'splittype', 'splituser', 'splitvalue']

In this case it looks like you want a combination of splithost() and splittype(), like so:

>>> url = 'http://example.com/url/?x=data'
>>> urllib.splittype(url)
('http', '//example.com/url/?x=data')
>>> urllib.splithost('//example.com/url/?x=data')
('example.com', '/url/?x=data')

Chain those together and you get the host and the path (with query string) as a tuple:

>>> urllib.splithost(urllib.splittype(url)[1])
('example.com', '/url/?x=data')
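The session above is Python 2. For anyone on Python 3, a similar result can probably be had from `urllib.parse.urlsplit`, which is the documented way to break a URL into parts. The sketch below is an assumption about how you might adapt it; `host_and_path` is a made-up helper name, and it drops any `#fragment` on purpose.

```python
from urllib.parse import urlsplit

def host_and_path(url):
    """Return (host, path-with-query) for a URL, ignoring any fragment."""
    parts = urlsplit(url)
    path = parts.path
    # Re-attach the query string so the result mirrors the tuple above.
    if parts.query:
        path += '?' + parts.query
    return parts.netloc, path

print(host_and_path('http://example.com/url/?x=data'))
```

Note that `netloc` also carries any `user:password@` prefix and `:port` suffix present in the URL; `parts.hostname` is the attribute to use if only the bare host is wanted.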

Now make it useful!

jathanism 2010-02-11 07:15:25

regex to remove URL from text
