Which features of regular expressions are standard, and which are idiosyncratic?
What should I do, and not do, if I want to use the same regex in different contexts, languages, and platforms?

+1  A: 

Here you can find a good reference, and here is the best book I have ever read on the subject. Then on this page, under language features (Part 1 & 2), you can see some of the differences.

microspino
+2  A: 

There is no standard, but if maximum portability is your goal you should stick to the features supported by JavaScript regexes. All of the other major flavors support everything JS does, with only minor variations here and there. For example, some only support the POSIX character-class notation ([:alpha:]), while others use the Unicode syntax (\p{Alpha}).
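As a rough illustration of how these notation differences bite, here is a small sketch using Python's built-in re module (Python is just my choice for the example, and the helper name `compiles` is made up). In the CPython versions I have used, re accepts neither notation: \p{Alpha} is rejected outright, while [[:alpha:]] compiles but is read as an ordinary set followed by a literal ], which is arguably worse because it fails silently.

```python
import re
import warnings


def compiles(pattern):
    """Hypothetical helper: True if this flavor accepts the pattern at all."""
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            re.compile(pattern)
        return True
    except re.error:
        return False


# Unicode property syntax is not part of Python's built-in re module.
unicode_property_ok = compiles(r"\p{Alpha}")

# The POSIX bracket expression compiles, but as the set "[:alph" plus a
# literal "]", so it does not match a plain letter.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the "possible nested set" warning
    posix = re.compile(r"[[:alpha:]]")
posix_matches_letter = posix.fullmatch("x") is not None

# An explicit range is understood everywhere, at the cost of being
# ASCII-only (it is NOT equivalent to \p{Alpha}).
ASCII_LETTERS = r"[A-Za-z]+"
words = re.findall(ASCII_LETTERS, "abc 123 Def")
```

The point is less about Python than about the habit: when a pattern has to travel, spell out the class rather than rely on a named one.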

Probably the most troublesome variations are those that affect the dot (.) and the anchors (^ and $). For example, JavaScript had no DOTALL (or "single-line") mode until the s flag was added in ES2018, so to match anything including a newline in older engines you have to use a hack like [\s\S]. Meanwhile, Ruby has a DOTALL mode but calls it multiline mode--what everyone else calls "multiline" (^ and $ as line anchors) is how Ruby always works.
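To make the dot and anchor behavior concrete, here is a minimal sketch in Python (chosen only for illustration; the variable names are mine). It compares the default dot, the [\s\S] workaround, Python's own DOTALL flag, and what most flavors call multiline mode.

```python
import re

text = "first line\nsecond line"

# Default mode: the dot stops at the newline.
default_match = re.match(r".*", text).group()

# Portable workaround: "whitespace or non-whitespace" is any character,
# so no mode flag is needed.
portable_match = re.match(r"[\s\S]*", text).group()

# Python's DOTALL flag gives the same result, but the flag's name and
# spelling vary between flavors.
dotall_match = re.match(r".*", text, re.DOTALL).group()

# "Multiline" in the usual sense: ^ and $ act as line anchors.
without_multiline = re.findall(r"^\w+", text)
with_multiline = re.findall(r"^\w+", text, re.MULTILINE)
```

If the same pattern has to run in several flavors, the [\s\S] form avoids depending on how each one names or implements its dot-matches-all mode.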

Be aware, too, of exactly what the dot doesn't match (in the default mode). Traditionally that was just the linefeed (\n), but more and more flavors are adopting (or at least approximating) the Unicode guidelines concerning line separators. For example, in Java the dot doesn't match any of [\r\n\u0085\u2028\u2029], while ^ and $ treat \r\n as a single separator and won't match between the two characters.
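A quick way to see this difference is to test the dot against each separator. The sketch below uses Python's re module, which as far as I know still follows the traditional rule (only \n is excluded); the constant name NOT_LINE_BREAK is made up for the example. Writing the exclusions out as an explicit negated class gives the Java-style behavior regardless of what the flavor's dot does.

```python
import re

separators = ["\n", "\r", "\u0085", "\u2028", "\u2029"]

# Which separators does a bare dot match in this flavor?
dot_matches = {s: bool(re.fullmatch(r".", s)) for s in separators}

# Explicit class: the same exclusions in any flavor that understands
# \uXXXX escapes inside a character class.
NOT_LINE_BREAK = r"[^\r\n\u0085\u2028\u2029]"
explicit_matches = {
    s: bool(re.fullmatch(NOT_LINE_BREAK, s)) for s in separators
}

# Practical consequence on Windows-style line endings:
first_line_dot = re.match(r".*", "abc\r\ndef").group()
first_line_explicit = re.match(NOT_LINE_BREAK + "*", "abc\r\ndef").group()
```

With the bare dot, the carriage return ends up inside the match, which is exactly the kind of surprise that shows up when a pattern moves between flavors.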

Note that I'm only talking about Perl-derived flavors, like Python, Ruby, PHP, JavaScript, etc. It wouldn't make sense to include GNU or POSIX-based flavors like grep, awk, and MySQL; they tend to have fewer features, but that's not what you would choose them for anyway.

I'm also not including the XML Schema flavor; it's much more limited than JavaScript, but it's a specialized application. For example, it doesn't support the anchors (^, $, \A, \Z, etc.) because matches are always anchored at both ends.
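If you need to approximate that implicit anchoring in a Perl-derived flavor, a whole-string match does roughly the same job. Here is a minimal sketch in Python (my choice for illustration, not anything XML Schema itself prescribes), using re.fullmatch and, for comparison, explicit \A and \Z around a non-capturing group.

```python
import re

pattern = r"\d{3}"

# An unanchored search finds the digits anywhere in the string.
search_hit = re.search(pattern, "ab123cd") is not None

# fullmatch requires the whole string to match, much like a pattern
# that is always anchored at both ends.
full_hit = re.fullmatch(pattern, "ab123cd") is not None
full_exact = re.fullmatch(pattern, "123") is not None

# The same effect with explicit anchors; the group keeps any top-level
# alternation in the pattern inside the anchors.
anchored = r"\A(?:" + pattern + r")\Z"
anchored_hit = re.search(anchored, "ab123cd") is not None
anchored_exact = re.search(anchored, "123") is not None
```

Note that \Z itself is not uniform across flavors (in Perl it also matches before a final newline, in Python it does not), so the wrapper has to be adapted per flavor.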

Alan Moore
Alan, your comment to my answer was a just one, therefore I removed it (and because yours is a better answer to the posed question!). +1
Bart Kiers