I'm using Python to parse web pages, and I want to split a full web address into two parts. Say I have the address http://www.stackoverflow.com/questions/ask. I need the protocol and domain (e.g. http://www.stackoverflow.com) and the path (e.g. /questions/ask). I figured a regex might solve this, but I'm not so handy with those. Any suggestions?
A:
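import re
url = "http://stackoverflow.com/questions/ask"
# Group 1 captures the protocol plus domain, group 2 captures the path.
# "https?" also accepts https URLs.
base, path = re.match(r"(https?://[^/]*)(.*)", url).groups()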
Cybis
2008-11-13 03:12:46
+7
A:
Use the Python urlparse module:
http://www.python.org/doc/2.5.2/lib/module-urlparse.html
For a well-defined and well-traveled problem like this, don't bother with writing your own code, let alone your own regular expressions. They cause too much trouble ;-).
Dan Fego
2008-11-13 03:13:00
+11
A:
Dan is right: urlparse is your friend:
>>> from urlparse import urlparse
>>>
>>> parts = urlparse("http://www.stackoverflow.com/questions/ask")
>>> parts.scheme + "://" + parts.netloc
'http://www.stackoverflow.com'
>>> parts.path
'/questions/ask'
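For anyone on Python 3, where the urlparse module lives at urllib.parse, the same split could be wrapped up roughly like this. This is only a sketch: split_url is a made-up helper name, not part of the standard library, and it assumes an absolute URL that includes a scheme.

```python
from urllib.parse import urlparse  # Python 3 home of the old urlparse module


def split_url(url):
    """Hypothetical helper: return (scheme://netloc, path) for an absolute URL."""
    parts = urlparse(url)
    base = parts.scheme + "://" + parts.netloc
    # parts.path excludes any query string or fragment
    return base, parts.path


base, path = split_url("http://www.stackoverflow.com/questions/ask")
print(base)  # http://www.stackoverflow.com
print(path)  # /questions/ask
```

A URL with no path at all, such as https://example.com, comes back with an empty string for the path.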
Ned Batchelder
2008-11-13 03:37:48
Gotta love that batteries-included philosophy. I thought regex at first b/c I didn't know that battery was included. Thanks.
Sam Corder
2008-11-13 18:22:03