Features common to all regex flavors?

views:

382

answers:

+5 Q:

Features common to all regex flavors?

I've seen a lot of commonality in the regex capabilities of different regex-enabled tools/languages (e.g. Perl, sed, Java, vim, etc.), but I've also seen many differences.

Is there a standard subset of regex capabilities that all regex-enabled tools/languages will support? How do regex capabilities vary between tools/languages?

+10 A:

http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
Even more detailed: http://www.regular-expressions.info/refflavors.html

kokos 2008-08-27 13:07:45

+10 A:

Compare Regular Expression Flavors

http://www.regular-expressions.info/refflavors.html

Jeff Atwood 2008-08-27 13:08:30

+1 A:

If you stick to the grep regexp grammar (not the egrep one) or the sed regexp grammar, you should be using a subset that is safe across many platforms and tools.

About the only thing that may bite you then is switching between regexp implementations that use Finite State Automata (FSA) and ones that use backtracking; e.g. quantifier behaviour will vary from grep to Perl.

FSA-based implementations will find the longest match starting at the first possible position. Backtracking ones will find the left-biased first match, starting at the first possible position. That is, they will try each branch in the order given in the pattern until a match is found.

Consider the string "xyxyxyzz" and the pattern "(xy)*(xyz)?". FSA-based engines will match the longest possible substring, "xyxyxyz". Backtracking engines will match the left-biased first substring, "xyxyxy".
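
The difference can be checked with a short Python sketch. Python's `re` module is a backtracking engine, so it shows the left-biased result directly. The standard library has no FSA-based engine, so the longest-match result is only emulated here by testing every prefix of the string with `re.fullmatch`. The helper names `backtracking_match` and `longest_match_at_start` are made up for this illustration, and the emulation only covers a match anchored at position 0.

```python
import re

PATTERN = r"(xy)*(xyz)?"
TEXT = "xyxyxyzz"


def backtracking_match(pattern, text):
    """Return what Python's backtracking engine matches at position 0."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


def longest_match_at_start(pattern, text):
    """Emulate longest-match semantics at position 0.

    Tries the longest prefix first and returns the first prefix that the
    pattern can match in full. This is a slow stand-in for what an
    FSA-based engine computes, not how such an engine works internally.
    """
    for end in range(len(text), -1, -1):
        if re.fullmatch(pattern, text[:end]):
            return text[:end]
    return None


if __name__ == "__main__":
    print("backtracking:", backtracking_match(PATTERN, TEXT))
    print("longest:     ", longest_match_at_start(PATTERN, TEXT))
```

In the backtracking case the greedy `(xy)*` consumes "xyxyxy", the optional `(xyz)?` then matches nothing, and the engine stops because the overall pattern has already succeeded.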

Rob Wells 2008-08-27 13:14:23

"non-finite decision automata". My computer only has finite memory; how does it hold an infinite $THING? I think you might mean s/finite/deterministic/g.

Jonas Kölker 2009-05-18 13:44:26

+1 A:

Most regular expression tools/languages support these basic capabilities:

Character Classes/Sets and their Negation - []
Anchors - ^$
Alternation - |
Quantifiers - ?+*{n,m}
Metacharacters - \w, \s, \d, ...
Backreferences - \1, \2, ...
Dot - .
Simple modifiers like /g and /i for global and ignore case
Escaping Characters
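
As a rough illustration of the basic list, here is how each item looks in one particular flavor, Python's `re` module. This is only a sketch for one engine: the function name `basic_feature_demo` and the sample strings are invented for the example, and Python has no /g modifier, so `re.findall` stands in for "global" matching.

```python
import re


def basic_feature_demo():
    """Exercise each basic feature once and return the results by name."""
    return {
        # Character class and its negation
        "class": re.findall(r"[aeiou]", "regex"),
        "negated_class": re.findall(r"[^aeiou]", "regex"),
        # Anchors: the whole string must be exactly "abc"
        "anchors": bool(re.search(r"^abc$", "abc")),
        # Alternation
        "alternation": re.findall(r"cat|dog", "cat dog bird"),
        # Quantifiers: {n,m} and ?
        "quantifiers": re.findall(r"ab{2,3}c?", "abbc abbb ab"),
        # Metacharacters such as \d
        "metacharacters": re.findall(r"\d+", "a1 b22 c333"),
        # Backreference: the same word character twice in a row
        "backreference": re.search(r"(\w)\1", "hello").group(0),
        # Dot matches any single character except a newline
        "dot": re.findall(r"h.t", "hat hot hit"),
        # Ignore-case modifier (the /i equivalent)
        "ignore_case": re.findall(r"regex", "Regex REGEX", re.IGNORECASE),
        # Escaping a metacharacter to match it literally
        "escaping": re.findall(r"\.", "a.b.c"),
    }


if __name__ == "__main__":
    for name, result in basic_feature_demo().items():
        print(f"{name}: {result!r}")
```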

More advanced tools/languages support:

Lookaheads and lookbehinds
POSIX character classes
Word boundaries
Inline Switches like allowing case insensitivity for only a small section of the regex
Modifiers like /x to allow extra formatting and comments, /m for multiline
Named Captures
Unicode
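
Most of the advanced items can also be shown in Python's `re` module, again as a sketch for a single flavor rather than a statement about all of them. POSIX character classes such as `[[:alpha:]]` are left out because Python's `re` does not support them. The scoped inline switch `(?i:...)` needs Python 3.6 or later. The function name `advanced_feature_demo` and the sample strings are invented for the example.

```python
import re


def advanced_feature_demo():
    """Exercise several advanced features once and return results by name."""
    # /x equivalent: whitespace is ignored and comments are allowed
    verbose_date = re.compile(
        r"""
        (?P<year>\d{4})  -   # named capture: four-digit year
        (?P<month>\d{2})     # named capture: two-digit month
        """,
        re.VERBOSE,
    )
    return {
        # Lookahead: words that are followed by "!"
        "lookahead": re.findall(r"\w+(?=!)", "hey! you there!"),
        # Lookbehind: digits that are preceded by "$"
        "lookbehind": re.findall(r"(?<=\$)\d+", "cost $40, not 40"),
        # Word boundaries: "cat" only as a whole word
        "word_boundary": re.findall(r"\bcat\b", "cat concat cats cat"),
        # Inline switch: only "hello" is matched case-insensitively
        "inline_switch": re.findall(
            r"(?i:hello) world", "HELLO world, hello WORLD"
        ),
        # /m equivalent: ^ matches at the start of every line
        "multiline": re.findall(r"^\w+", "one\ntwo", re.MULTILINE),
        # Named captures, read back as a dict
        "named_capture": verbose_date.search("released 2008-08").groupdict(),
        # Unicode: \w matches non-ASCII letters in Python 3 str patterns
        "unicode": re.findall(r"\w+", "K\u00f6lker"),
    }


if __name__ == "__main__":
    for name, result in advanced_feature_demo().items():
        print(f"{name}: {result!r}")
```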

Joseph Pecoraro 2008-08-27 13:15:30

Some simple implementations (e.g. in Scintilla/SciTE) don't even support alternation or some quantifiers (? and {}).

PhiLho 2008-12-12 16:43:13

There's no standard engine. However, the POSIX Extended Regular Expression format is a valid subset of most engines and is probably as close as you'll get to a standardised subset.

2008-08-27 13:17:22

See emacs's regular expression syntax: http://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html#Regexps.

I recall reading that emacs's syntax is set in stone (for backwards compatibility reasons), so if you want to be compatible with everything, make everything compatible with this. Some tools might support it, others might not.

While you have a worthy goal, I think it'll be exceedingly difficult to reach, and I've also found emacs's regexps a pain to work with. Maybe 99% of everything is good enough if it makes you happier and more productive?

Jonas Kölker 2009-05-18 13:47:50
