url1 = http://xyz.com/abc
url2 = http://xyz.com//abc

I want to write a regex that validates both url1 and url2.

+4  A: 

Why not just use urlparse instead?
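
As a minimal sketch of that suggestion, assuming Python 3 (where the `urlparse` function lives in `urllib.parse`; in Python 2 it is in the `urlparse` module), the helper below accepts both URLs from the question. The name `is_expected_url` is hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def is_expected_url(url):
    """Return True for http://xyz.com/abc, tolerating extra leading slashes in the path."""
    parts = urlparse(url)
    # Collapse any run of leading slashes so "/abc" and "//abc" compare equal.
    path = "/" + parts.path.lstrip("/")
    return parts.scheme == "http" and parts.netloc == "xyz.com" and path == "/abc"


print(is_expected_url("http://xyz.com/abc"))
print(is_expected_url("http://xyz.com//abc"))
```

Note that this sketch also accepts three or more leading slashes; compare `parts.path` directly against the two allowed values if that is too lenient.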

Amber
Agreed; regular expressions aren't good for URIs, email addresses or markup.
Delan Azabani
@Delan: I'm pretty sure using regular expressions for URIs is totally fine. They even give you one to parse a URI in RFC 3986.
Felix Kling
Though most URIs are simple, there are some quirks and complexities, just like with email addresses, that lead to false positives and negatives. I can't remember who, but someone wrote a regular expression that validates email addresses exactly to the spec as a proof of concept, and it filled over a page.
Delan Azabani
@Delan: True, but nevertheless, I am sure that under the hood, `urlparse` also uses a regular expression. It might be complex, but that does not necessarily mean it is bad. Of course you don't want to write such an expression on your own every time ;) I once wrote a URI parser that was meant to validate against the RFC, and it was not too complex (it used several regular expressions rather than just one, which might indeed have been too complex).
Felix Kling
@Felix Kling, there's no need to guess about these things. Just have a look at urlparse.py and you'll see there is not a single regular expression there: urlparse.py doesn't `import re`; in fact, it doesn't import anything. What it does contain is a lot of complex domain knowledge about which features the different schemes support.
Duncan
A: 
http://\w+\.\w+//?\w+
splash
A: 

The answer depends on whether you want to parse URLs in general or just want to know how to handle the optional slash.

In the first case, I agree with Amber that you should use urlparse.

In the second case, use a ? after the slash in your expression:

http://xyz.com//?abc

A ? in a regular expression means that the previous element is optional (i.e. may appear zero times or once).
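
A small sketch of how this might look with Python's `re` module. The names `URL_PATTERN` and `matches` are made up for illustration, and the dots are escaped here (unlike in the expression above) so that they match only a literal ".".

```python
import re

# "/?" makes the second slash optional; "\." matches a literal dot.
URL_PATTERN = re.compile(r"http://xyz\.com//?abc")


def matches(url):
    # fullmatch requires the whole string to match, not just a prefix.
    return URL_PATTERN.fullmatch(url) is not None


print(matches("http://xyz.com/abc"))
print(matches("http://xyz.com//abc"))
```

With this pattern a third slash is rejected, since `?` allows at most one occurrence of the preceding element.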

Jonas Wagner