I want to remove every URL (including its full path and query string) from a piece of text in Python. Any suggestions on how to do this? I am new to regex!

http://example.com/url/?x=data

This whole URL should be removed! Thanks

+1  A: 

This is definitely a non-trivial task assuming you want to remove any valid URL. I'd take a look at the Regex Lib page on the topic.

theraccoonbear
+1  A: 

This previous question (i.e. RegExLib.com) will get you off to a good start on matching the URL; then it's just a matter of removing the matches.
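As a rough illustration of the match-then-remove idea, here is a minimal sketch. The pattern is a deliberately simplified assumption (anything starting with http:// or https:// up to the next whitespace), not a pattern taken from RegExLib and not a full URL validator; the names URL_PATTERN and remove_urls are made up for this example.

```python
import re

# Simplified pattern (an assumption, not a complete URL grammar):
# "http://" or "https://" followed by a run of non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S+')

def remove_urls(text):
    """Return text with every match of URL_PATTERN removed."""
    return URL_PATTERN.sub('', text)

sample = 'Visit http://example.com/url/?x=data for details.'
print(remove_urls(sample))
```

Note that this pattern also swallows punctuation that directly follows a URL (such as a trailing period or closing parenthesis) and leaves the surrounding spaces in place, so it may need tightening for real text.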

curtisk
A: 

You are better off using urllib instead of regex for taking a URL apart, because the parsing work is already done for you. In Python 2, the urllib module provides various splitting functions:

>>> import urllib
>>> [x for x in dir(urllib) if 'split' in x]
['splitattr', 'splithost', 'splitnport', 'splitpasswd', 'splitport', 'splitquery', 'splittag', 'splittype', 'splituser', 'splitvalue']

In this case it looks like you want a combination of splithost() and splittype(), like so:

>>> url = 'http://example.com/url/?x=data'
>>> urllib.splittype(url)
('http', '//example.com/url/?x=data')
>>> urllib.splithost('//example.com/url/?x=data')
('example.com', '/url/?x=data')

Chain those together and you get the host and the path (with query string) as a tuple:

>>> urllib.splithost(urllib.splittype(url)[1])
('example.com', '/url/?x=data')
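The session above uses the Python 2 urllib module. In Python 3 the URL helpers live in urllib.parse, and the split* functions are deprecated there, so a roughly equivalent sketch, assuming Python 3, could use urlsplit instead. The variable names below are illustrative only.

```python
from urllib.parse import urlsplit

url = 'http://example.com/url/?x=data'
parts = urlsplit(url)

# netloc holds the host part; path and query are returned separately,
# so the query string is re-attached to mirror the tuple shown above.
host = parts.netloc
path_and_query = parts.path
if parts.query:
    path_and_query += '?' + parts.query

print((host, path_and_query))
```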

Now make it useful!

jathanism