How do I rewrite this new pattern for recognising web addresses (URLs) so that it works in Python?

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

+4  A: 

I don't think Python has this expression:

[:punct:]

Wikipedia says [:punct:] is the same as

[-!\"#$%&\'()*+,./:;<=>?@\\[\\\\]^_`{|}~]
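A quick way to see the problem (a sketch of my own, with a made-up sample string): Python's re module parses [:punct:] as an ordinary character set containing ':', 'p', 'u', 'n', 'c' and 't'. It compiles without error, but it matches those letters rather than punctuation.

```python
import re

# Python has no POSIX classes, so this is just the set {':', 'p', 'u', 'n', 'c', 't'}.
posix_like = re.compile(r'[:punct:]')

sample = 'cup: !?'  # made-up sample text
matches = posix_like.findall(sample)

# The letters and the colon match; '!' and '?' do not.
print(matches)
```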
S.Mark
Wikipedia is wrong. It's missing the caret, according to http://www.regular-expressions.info/posixbrackets.html.
Peter Hansen
Okay, now it's right. Please update your answer.
Peter Hansen
Yeah, updated my post, thanks. Somebody updated Wikipedia too. Great!
S.Mark
Yeah, that was me too. :-)
Peter Hansen
+2  A: 

Python doesn't have the POSIX bracket expressions.

The [:punct:] bracket expression is equivalent in ASCII to

[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~] 
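As a rough check, assuming ASCII input only, the class can be compiled and compared against string.punctuation. The names PUNCT_CLASS, matched and others are just for illustration. A raw, triple-quoted string is used here because the class contains both quote characters and several backslashes.

```python
import re
import string

# Raw triple-quoted string: the class contains both ' and ", and the
# backslash escapes must reach the regex engine unchanged.
PUNCT_CLASS = r'''[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]'''
punct = re.compile(PUNCT_CLASS)

# Every ASCII punctuation character should match the class ...
matched = [c for c in string.punctuation if punct.fullmatch(c)]

# ... while letters, digits and whitespace should not.
others = [c for c in string.ascii_letters + string.digits + string.whitespace
          if punct.fullmatch(c)]

print(len(matched), others)
```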
Vinko Vrsalovic
Make sure you use a "raw" string (prefix with `r`) when using that, as the backslash escapes won't be correct otherwise.
Peter Hansen
Also note that Python does not support those Unicode character properties: http://stackoverflow.com/questions/1832893
Peter Hansen
Indeed, they compile fine but don't do what you expect
Tobias
Python's regex engine is a very strange beast. Fixed the answer.
Vinko Vrsalovic
+4  A: 

The original source for that pattern states "This pattern should work in most modern regex implementations" and specifically mentions Perl. Python's regex implementation is modern and similar to Perl's, but it lacks the [:punct:] character class. You can easily build an equivalent like this:

>>> import string, re
>>> pat = r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^%s\s]|/)))'
>>> pat = pat % re.sub(r'([-\\\]])', r'\\\1', string.punctuation)

The re.sub() call escapes certain characters inside the character set as required.

Edit: Using re.escape() works just as well, since it just sticks a backslash in front of everything. That felt crude to me at first, but certainly works fine for this case.

>>> pat = pat % re.escape(string.punctuation)
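Putting both variants together, here is a minimal sketch of how the finished pattern might be used. The helper find_urls and the sample text are my own illustration, not part of the original pattern's source, and this assumes ASCII punctuation only.

```python
import re
import string

TEMPLATE = r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^%s\s]|/)))'

# Variant 1: escape only the characters that are special inside a set.
escaped = re.sub(r'([-\\\]])', r'\\\1', string.punctuation)
url_re = re.compile(TEMPLATE % escaped)

# Variant 2: let re.escape() do the escaping.
url_re_escape = re.compile(TEMPLATE % re.escape(string.punctuation))


def find_urls(text, pattern=url_re):
    """Return the full text of every match of `pattern` in `text`."""
    return [m.group(0) for m in pattern.finditer(text)]


# Made-up sample: trailing punctuation should be dropped from a match,
# but a parenthesised suffix at the end of a URL should be kept.
sample = ('See http://example.com/path, or www.example.org. Also '
          'http://en.wikipedia.org/wiki/Python_(programming_language)')

for url in find_urls(sample):
    print(url)
```

The pattern has three capturing groups, so re.findall() would return tuples; iterating with finditer() and taking group(0) gives the whole URL instead.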
Peter Hansen
This passes all of Gruber's tests, as does pat = pat % re.escape(string.punctuation)
Tobias
@vanity, updated to mention that. Note the obvious, that if your data source is Unicode a pure-ASCII solution like string.punctuation may give imperfect results.
Peter Hansen
It works with non-ASCII domains and paths. I don't have test data with non-English punctuation.
Tobias