ansaurus

Question

Gruber’s URL Regular Expression in Python

Answer 1

+4 A:

I don't think Python has this expression:

[:punct:]

Wikipedia says [:punct:] is the same as

[-!"#$%&'()*+,./:;<=>?@\[\\\]^_`{|}~]

S.Mark 2009-12-31 16:48:20

Wikipedia is wrong. It's missing the caret, according to http://www.regular-expressions.info/posixbrackets.html.

Peter Hansen 2009-12-31 17:05:02

Okay, now it's right. Please update your answer.

Peter Hansen 2009-12-31 17:07:28

Yeah, updated my post, thanks. Somebody updated Wikipedia too. Great!

S.Mark 2009-12-31 17:09:46

Yeah, that was me too. :-)

Peter Hansen 2009-12-31 17:11:10

Answer 2

+2 A:

Python doesn't have the POSIX bracket expressions.

The [:punct:] bracket expression is equivalent in ASCII to

[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]
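A quick way to check that class against Python's own ASCII punctuation list is sketched below. The names `PUNCT_CLASS` and `punct_re` are only illustrative, and the pattern is written as a raw triple-quoted string because it contains both quote characters.

```python
import re
import string

# The class from above as a raw string; triple quotes are needed because
# the pattern contains both ' and ".
PUNCT_CLASS = r"""[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]"""

punct_re = re.compile(PUNCT_CLASS)

# Characters from string.punctuation that the class fails to match.
missing = [c for c in string.punctuation if not punct_re.fullmatch(c)]

# Letters, digits and whitespace that the class matches by mistake.
false_hits = punct_re.findall(string.ascii_letters + string.digits + " \t\n")

print(missing, false_hits)
```

If both lists come back empty, the class covers exactly the ASCII punctuation characters that `string.punctuation` lists.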

Vinko Vrsalovic 2009-12-31 16:52:43

Make sure you use a "raw" string (prefix with `r`) when using that, as the backslash escapes won't be correct otherwise.

Peter Hansen 2009-12-31 17:10:33

Also note that Python does not support those Unicode character properties: http://stackoverflow.com/questions/1832893

Peter Hansen 2009-12-31 17:56:31

Indeed, they compile fine but don't do what you expect.

Tobias 2009-12-31 18:00:08

Python's regex engine is a very strange beast. Fixed the answer.

Vinko Vrsalovic 2010-01-01 04:30:32

Answer 3

+4 A:

The original source for that pattern states "This pattern should work in most modern regex implementations" and specifically mentions Perl. Python's regex implementation is modern and similar to Perl's, but it is missing the [:punct:] character class. You can easily build an equivalent like this:

>>> import string, re
>>> pat = r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^%s\s]|/)))'
>>> pat = pat % re.sub(r'([-\\\]])', r'\\\1', string.punctuation)

The re.sub() call escapes certain characters inside the character set as required.

Edit: Using re.escape() works just as well, since it just sticks a backslash in front of everything. That felt crude to me at first, but certainly works fine for this case.

>>> pat = pat % re.escape(string.punctuation)
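Putting the pieces together, a minimal sketch of using the finished pattern might look like the following. The names `TEMPLATE`, `URL_RE` and `find_urls`, and the sample sentence, are made up for illustration and are not part of Gruber's post.

```python
import re
import string

# Gruber's pattern with %s standing in for [:punct:], as in the answer above.
TEMPLATE = r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^%s\s]|/)))'
URL_RE = re.compile(TEMPLATE % re.escape(string.punctuation))

def find_urls(text):
    """Return the full text of every URL-like match (hypothetical helper)."""
    # finditer is used instead of findall because the pattern has capture
    # groups, and findall would return tuples of groups rather than strings.
    return [m.group(0) for m in URL_RE.finditer(text)]

sample = "See http://example.com/path, or visit www.example.com."
print(find_urls(sample))
```

On this sample the trailing comma and full stop are left out of the matches, which is the job of the final `[^%s\s]` class: a URL may not end in punctuation unless that punctuation is a slash.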

Peter Hansen 2009-12-31 16:55:42

This passes all of Gruber's tests, as does pat = pat % re.escape(string.punctuation)

Tobias 2009-12-31 18:04:03

@vanity, updated to mention that. Note the obvious, that if your data source is Unicode a pure-ASCII solution like string.punctuation may give imperfect results.
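To illustrate the limitation, here is a small sketch comparing string.punctuation with the Unicode general categories reported by the standard unicodedata module. The helper name `is_unicode_punct` is hypothetical, and treating every "P" category as punctuation is an assumption, not something Gruber's pattern specifies.

```python
import string
import unicodedata

def is_unicode_punct(ch):
    """True if ch falls in a Unicode punctuation category (names starting with P)."""
    return unicodedata.category(ch).startswith("P")

left_quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK

# string.punctuation only lists ASCII characters, so the curly quote is absent.
print(left_quote in string.punctuation)

# Unicode does classify it as punctuation.
print(is_unicode_punct(left_quote))

# The two notions also differ inside ASCII: "$" is in string.punctuation,
# but Unicode files it under symbols rather than punctuation.
print("$" in string.punctuation, is_unicode_punct("$"))
```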

Peter Hansen 2009-12-31 18:11:15

It works with non-ASCII domains and paths. I don't have test data with non-English punctuation.

Tobias 2010-01-01 17:20:24
