I'm building a Google App Engine app, and I have a class to represent an RSS Feed.

I have a method called setUrl which is part of the feed class. It accepts a url as an input.

I'm trying to use the Python re module to validate it against the regex in RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt).

Below is a snippet which should work, right? I'm incredibly new to Python and have been beating my head against this for the past three days.

p = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')
m = p.match(url)
if m:
  self.url = url
  return url
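Worth noting: the pattern above is the one from RFC 3986 Appendix B, and it is a splitter, not a validator. Every group is optional, so it matches any string at all, and the `if m:` test always succeeds. A quick sketch demonstrating this:

```python
import re

# RFC 3986 Appendix B pattern: every group is optional, so it
# matches *any* string -- it splits a URI, it does not validate one
p = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

for candidate in ['http://www.ietf.org/rfc/rfc3986.txt', 'not a url at all', '']:
    m = p.match(candidate)
    print(repr(candidate), '->', m is not None)  # True for all three
```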
+1  A: 

The regex provided should match any URL of the form http://www.ietf.org/rfc/rfc3986.txt; and it does when tested in the Python interpreter.

What format do the URLs you've been having trouble parsing have?

Jory
A: 
urlfinders = [
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]"),
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?"),
    re.compile("(~/|/|\\./)([-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]|\\\\)+"),
    re.compile("'\\<((mailto:)|)[-A-Za-z0-9\\.]+@[-A-Za-z0-9\\.]+"),
]

NOTE: as ugly as it looks in your browser, just copy-paste it and the formatting should be fine

Found at the python mailing lists and used for the gnome-terminal

source: http://mail.python.org/pipermail/python-list/2007-January/595436.html
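For what it's worth, the second (host-only) pattern in the list can be exercised on its own; a sketch, with the text string being a made-up example:

```python
import re

# the second (host-only) pattern from the list above
host_pattern = re.compile(
    "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}"
    "|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)"
    "[-A-Za-z0-9\\.]+)(:[0-9]*)?"
)

text = "see http://www.ietf.org/rfc/rfc3986.txt or 10.0.0.1:8080 for details"
for match in host_pattern.finditer(text):
    print(match.group(0))  # finds the host part and the IP:port
```

Note that these patterns *find* URL-like substrings in free text (as a terminal would); they do not validate a whole string as a URL.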

That "nttp" protocol seems anomalous, and I see it's in the original source too. I wonder how much this regex was tested?
Greg Hewgill
Ugh, please don't write code like this. If you just want plain vanilla urls use urlparse as suggested in other answers. Save custom regexes for when you are actually trying to match, say, a specific subset of URLs or something else that could be considered a special case. If you find an ugly regex later you will wonder if you were just trying to describe URLs in general or were specifically picking out the cases the rest of your code could handle because of some weird constraint.
rndmcnlly
@sth: Please look up the NNTP protocol.
Greg Hewgill
Um, the bit after a "mailto:" protocol has to be a compliant email address, right? Jeffrey Friedl in **Mastering Regular Expressions** worked out a regex that matches such an animal; it is 4724 bytes long all by itself....
RBerteig
+26  A: 

An easy way to parse (and validate) URLs is the urlparse module.

A regex is too much work.


There's no "validate" method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.

Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.

For example ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.

Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///", which is equivalent.

Something like "bad://///worse/////" is perfectly valid. Dumb but valid.
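These odd-but-valid cases are easy to check directly; a quick sketch using Python 3's urllib.parse (in Python 2 the same function lives in the urlparse module):

```python
from urllib.parse import urlparse

# ":::::" has no scheme and no netloc; the whole thing is the path
print(urlparse(':::::').path)    # ':::::'

# "/////" starts with "//", so the netloc is '' and the path is '///'
print(urlparse('/////').netloc)  # ''
print(urlparse('/////').path)    # '///'

# even "bad://///worse/////" parses cleanly
r = urlparse('bad://///worse/////')
print(r.scheme, repr(r.netloc), r.path)  # bad '' ///worse/////
```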

Bottom Line. Parse it, and look at the pieces to see if they're displeasing in some way.

Do you want the scheme to always be "http"? Do you want the netloc to always be "www.somename.somedomain"? Do you want the path to look unix-like? Or windows-like? Do you want to remove the query string? Or preserve it?

These are not RFC-specified validations. These are validations unique to your application.

S.Lott
where is the validate() method on urlparse?
wsorenson
The question "is it valid?" isn't easy to answer because almost any string is a valid URL. If the result of parsing gives you a netloc or path you don't like, you could call that "invalid".
S.Lott
agreed -- thanks for elaborating.
wsorenson
+4  A: 

I admit, I find your regular expression totally incomprehensible. I wonder if you could use urlparse instead? Something like:

import string
import urlparse  # Python 2; in Python 3 use urllib.parse

pieces = urlparse.urlparse(url)
assert all([pieces.scheme, pieces.netloc])
assert set(pieces.netloc) <= set(string.letters + string.digits + '-.')  # and others?
assert pieces.scheme in ['http', 'https', 'ftp']  # etc.

It might be slower, and maybe you'll miss conditions, but it seems (to me) a lot easier to read and debug than a regular expression for URLs.
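A sketch of the same approach as a self-contained function, in Python 3 (where urlparse moves to urllib.parse and string.letters becomes string.ascii_letters); the function name and the exact character set are illustrative choices, not anything standard:

```python
import string
from urllib.parse import urlparse

def looks_like_url(url):
    """Rough sanity check, not full RFC validation."""
    pieces = urlparse(url)
    if not (pieces.scheme and pieces.netloc):
        return False
    if pieces.scheme not in ('http', 'https', 'ftp'):
        return False
    # allow letters, digits, hyphens, dots, and ':' for a port
    allowed = set(string.ascii_letters + string.digits + '-.:')
    return set(pieces.netloc) <= allowed

print(looks_like_url('http://www.ietf.org/rfc/rfc3986.txt'))  # True
print(looks_like_url('not a url'))                            # False
```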

John Fouhy
+1 for the codinghorror article on Jeff's attempts to do this. I was going to quote the regex for validating a valid email address, but 4K+ characters don't fit in this box. These things are just hard to do, and the best answer probably is a dedicated parser once you manage to find some candidate text to feed it.
RBerteig
The urlparse module seems a little liberal for validation (for accepting input and normalising it, however, it would be perfect). It accepts things like "http://invalidurl--", which I'm almost certain is an invalid URL(?)
dbr
@dbr: That's why I added the assert statements to my code sample. Like I said, "maybe you'll miss conditions", but that can happen with regular expressions too, and at least this way you can easily tell what you are and aren't testing for.
John Fouhy
@dbr: I find it funny to see your comment as SO also accepted it :)
voyager
A: 

I've needed to do this many times over the years and always end up copying a regular expression from someone else who has thought about it way more than I want to.

Having said that, there is a regex in the Django forms code which should do the trick:

http://code.djangoproject.com/browser/django/trunk/django/forms/fields.py#L534

brianz
+7  A: 
nosklo
+1  A: 

urlparse quite happily takes invalid URLs; it is more a string-splitting library than any kind of validator. For example:

from urlparse import urlparse
urlparse('http://----')
# returns: ParseResult(scheme='http', netloc='----', path='', params='', query='', fragment='')

Depending on the situation, this might be fine.

If you mostly trust the data, and just want to verify the protocol is HTTP, then urlparse is perfect.

If you want to make sure the URL is actually a legal URL, use the ridiculous regex.

If you want to make sure it's a real web address,

import urllib
try:
    urllib.urlopen(url)
except IOError:
    print "Not a real URL"
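The Python 3 equivalent of both checks (urlparse moves to urllib.parse, urlopen to urllib.request, and the exception to catch becomes URLError), as a sketch; the function names are illustrative:

```python
from urllib.parse import urlparse
from urllib.request import urlopen
from urllib.error import URLError

def scheme_is_http(url):
    # cheap structural check: trust the string, verify the protocol
    return urlparse(url).scheme in ('http', 'https')

def is_reachable(url):
    # expensive check: actually fetch the URL (hits the network)
    try:
        urlopen(url)
        return True
    except (URLError, ValueError):
        return False

print(scheme_is_http('http://example.com'))  # True
print(scheme_is_http('ftp://example.com'))   # False
```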
dbr
+1  A: 

RFC 3696 defines "best practices" for URL validation - http://www.faqs.org/rfcs/rfc3696.html

The latest release of Lepl (a Python parser library) includes an implementation of RFC 3696. You would use it something like:

from lepl.apps.rfc3696 import Email

# compile the validator (do once at start of program)
validator = Email()

# use the validator (as often as you like)
if validator(some_email):
    pass  # email is ok
else:
    pass  # email is bad

Although the validator is defined in Lepl, which is a recursive descent parser, it is largely compiled internally to regular expressions. That combines the best of both worlds - a (relatively) easy-to-read definition that can be checked against RFC 3696 and an efficient implementation. There's a post on my blog showing how this simplifies the parser - http://www.acooke.org/cute/LEPLOptimi0.html

Lepl is available at http://www.acooke.org/lepl and the RFC 3696 module is documented at http://www.acooke.org/lepl/rfc3696.html

This is completely new in this release, so may contain bugs. Please contact me if you have any problems and I will fix them ASAP. Thanks.

andrew cooke