I want to remove all occurrences of URL [full path, query string] from the text in Python. Any suggestions on how to do this? I am new to regex!
http://example.com/url/?x=data
This whole URL should be removed! Thanks
This is definitely a non-trivial task assuming you want to remove any valid URL. I'd take a look at the Regex Lib page on the topic.
This previous question (i.e. RegExLib.com) will get you off to a good start on matching the URL; then it's just a matter of removing the matches.
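As a rough sketch of the "match, then remove" idea: the pattern below is a deliberately simplified assumption (http or https scheme followed by any run of non-whitespace characters), not a validator for every legal URL, and the names `URL_PATTERN` and `remove_urls` are made up for illustration.

```python
import re

# Simplified pattern (an assumption, not a full URL grammar):
# "http://" or "https://" followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S+')


def remove_urls(text):
    """Return text with every substring matched by URL_PATTERN removed."""
    return URL_PATTERN.sub('', text)


sample = "See http://example.com/url/?x=data for details"
print(remove_urls(sample))
```

Because `\S+` is greedy, punctuation that directly follows a URL (a trailing comma or closing parenthesis, say) is removed along with it, and the whitespace on either side of the URL is left in place. Tighten the pattern if that matters for your text.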
You are better off using urllib instead of regex, because the parsing work is done for you. In Python 2 there are various splitting functions available (in Python 3 they live in urllib.parse and have been deprecated since 3.8):
>>> import urllib
>>> [x for x in dir(urllib) if 'split' in x]
['splitattr', 'splithost', 'splitnport', 'splitpasswd', 'splitport', 'splitquery', 'splittag', 'splittype', 'splituser', 'splitvalue']
In this case it looks like you want a combination of splithost() and splittype(), like so:
>>> url = 'http://example.com/url/?x=data'
>>> urllib.splittype(url)
('http', '//example.com/url/?x=data')
>>> urllib.splithost('//example.com/url/?x=data')
('example.com', '/url/?x=data')
Chain those together and you get the host and the path (with query string) as a tuple:
>>> urllib.splithost(urllib.splittype(url)[1])
('example.com', '/url/?x=data')
Now make it useful!
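One way to make it useful, sketched for Python 3 with `urllib.parse.urlsplit` in place of the split* helpers above. The function names are hypothetical, and treating "a whitespace-separated token with an http(s) scheme and a host" as a URL is an assumption about what should be removed, not a rule from urllib itself.

```python
from urllib.parse import urlsplit


def host_and_path(url):
    """Return (host, path-with-query), like the splithost/splittype chain."""
    parts = urlsplit(url)
    path = parts.path
    if parts.query:
        path += '?' + parts.query
    return parts.netloc, path


def is_url(token):
    """Assumption: a token is a URL if it has an http(s) scheme and a host."""
    try:
        parts = urlsplit(token)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a bad IPv6 literal.
        return False
    return parts.scheme in ('http', 'https') and bool(parts.netloc)


def strip_urls(text):
    """Drop URL tokens; the remaining words are rejoined with single spaces."""
    return ' '.join(tok for tok in text.split() if not is_url(tok))


print(host_and_path('http://example.com/url/?x=data'))
print(strip_urls('Visit http://example.com/url/?x=data now'))
```

Note that `strip_urls` splits on whitespace, so the original spacing and line breaks are not preserved, and a URL glued to other characters (for example wrapped in parentheses) will not be recognized.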