I've seen a lot of commonality in the regex capabilities of different regex-enabled tools/languages (e.g. Perl, sed, Java, vim, etc.), but I've also seen many differences.

Is there a standard subset of regex capabilities that all regex-enabled tools/languages will support? How do regex capabilities vary between tools/languages?

+10  A: 

Compare Regular Expression Flavors

http://www.regular-expressions.info/refflavors.html

Jeff Atwood
+1  A: 

If you stick to the grep regexp grammar (not the egrep one) or the sed regexp grammar, you should be using a safe subset across many platforms and tools.

About the only thing that may bite you then is a shift between regexp implementations using finite state automata (FSA) and ones using backtracking; e.g. quantifier behaviour will vary from grep to Perl.

FSA-based implementations will find the longest match starting at the first possible position. Backtracking ones will find the left-biased first match, starting at the first possible position. That is, they will try each branch in the order it appears in the pattern until a match is found.

Consider the string "xyxyxyzz" and the pattern "(xy)*(xyz)?". FSA-based engines will match the longest possible substring, "xyxyxyz". Backtracking-based engines will match the left-biased first substring, "xyxyxy".
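To see the difference concretely, here is a minimal Python sketch. Python's `re` module is a backtracking engine, so `re.match` shows the left-biased result directly. The `longest_match` helper is a hypothetical stand-in for leftmost-longest semantics (it simply tries every prefix, longest first); it is not how an FSA-based engine is actually implemented.

```python
import re

PATTERN = r"(xy)*(xyz)?"
TEXT = "xyxyxyzz"

def backtracking_match(pattern, text):
    """Return what Python's backtracking engine matches at position 0."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

def longest_match(pattern, text):
    """Emulate leftmost-longest semantics by brute force (illustration only):
    return the longest prefix of `text` that the whole pattern can match."""
    for end in range(len(text), -1, -1):
        if re.fullmatch(pattern, text[:end]):
            return text[:end]
    return None

print(backtracking_match(PATTERN, TEXT))  # xyxyxy
print(longest_match(PATTERN, TEXT))       # xyxyxyz
```

The greedy `(xy)*` consumes all three "xy" pairs, after which `(xyz)?` can only match the empty string; a backtracking engine accepts that first success rather than looking for a longer one.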

Rob Wells
"non-finite decision automata". My computer only has finite memory; how does it hold an infinite $THING? I think you might mean s/finite/deterministic/g.
Jonas Kölker
+1  A: 

Most regular expression tools/languages support these basic capabilities:

  1. Character Classes/Sets and their Negation - []
  2. Anchors - ^$
  3. Alternation - |
  4. Quantifiers - ?+*{n,m}
  5. Metacharacters - \w, \s, \d, ...
  6. Backreferences - \1, \2, ...
  7. Dot - .
  8. Simple modifiers like /g and /i for global and ignore case
  9. Escaping Characters
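As a rough illustration of the basic features listed above, here is how they look in one particular flavor, Python's `re` module. The sample strings are made up for the example, and other tools may spell some of these differently (Python, for instance, has no `/g` or `/i` suffixes and uses `findall` and a flag argument instead).

```python
import re

# Character class, negated class, anchors, and quantifiers.
is_word = re.search(r"^[a-z]+[^0-9]?$", "regex!") is not None

# Alternation.
spellings = re.findall(r"gray|grey", "gray grey groy")

# Shorthand metacharacter with a bounded quantifier.
numbers = re.findall(r"\d{2,4}", "7 42 2024")

# Backreference: find a doubled word.
doubled = re.search(r"(\w+) \1", "it is is fine").group(0)

# Dot matches any character; escaping makes it a literal dot.
any_char = re.findall(r"a.c", "abc a.c")
literal_dot = re.findall(r"a\.c", "abc a.c")

# "Global" and "ignore case" are expressed as findall plus a flag here.
names = re.findall(r"perl", "Perl perl PERL", re.IGNORECASE)

print(is_word, spellings, numbers, doubled, any_char, literal_dot, names)
```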

More advanced tools/languages support:

  1. Lookaheads and lookbehinds
  2. POSIX character classes
  3. Word boundaries
  4. Inline switches, e.g. allowing case insensitivity for only a small section of the regex
  5. Modifiers like /x to allow extra formatting and comments, /m for multiline
  6. Named Captures
  7. Unicode
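The advanced features above are where flavors diverge most. The sketch below shows several of them in Python's `re` module only; the sample strings are invented for illustration. Note that the syntax itself varies: Python writes named captures as `(?P<name>...)`, whereas Java and .NET use `(?<name>...)`, and Python's `re` has no POSIX classes such as `[[:alpha:]]`, so that item is left out.

```python
import re

# Lookbehind and lookahead: digits preceded by "$" and followed by ".00".
whole_dollars = re.findall(r"(?<=\$)\d+(?=\.00)", "$15.00 $7.50 20.00")

# Word boundaries.
cats = re.findall(r"\bcat\b", "cat concat cats cat.")

# Inline switch scoped to part of the pattern (Python 3.6+).
scoped = re.fullmatch(r"(?i:perl) regex", "PERL regex") is not None
not_scoped = re.fullmatch(r"(?i:perl) regex", "PERL REGEX") is not None

# Extended formatting with comments (/x) plus named captures.
date_re = re.compile(r"""
    (?P<year>\d{4}) -   # four-digit year, then a literal hyphen
    (?P<month>\d{2})    # two-digit month
""", re.VERBOSE)
m = date_re.search("released 2008-10")
year, month = m.group("year"), m.group("month")

# Multiline mode (/m): ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# Unicode: \w matches non-ASCII letters for str patterns in Python 3.
unicode_ok = re.fullmatch(r"\w+", "Kölker") is not None

print(whole_dollars, cats, scoped, not_scoped, year, month, line_starts, unicode_ok)
```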
Joseph Pecoraro
Some simple implementations (e.g. in Scintilla/SciTE) don't even support alternation or some quantifiers (? and {}).
PhiLho
A: 

There's no standard engine. However, the POSIX Extended Regular Expression format is a valid subset of most engines and is probably as close as you'll get to a standardised subset.
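As a sketch of what staying inside that subset can look like, the pattern below uses only constructs that POSIX ERE defines: anchors, bracket expressions, grouping, alternation, `*`, and an interval `{0,5}`. Python's `re` is not a POSIX engine, and the `KEY_VALUE` pattern and `is_key_value` helper are hypothetical names for this example; the point is only that a pattern written this way avoids flavor-specific shorthand such as `\d` or `\w`.

```python
import re

# ERE-only constructs: ^ $ [...] ( ) | * {m,n}
KEY_VALUE = r"^[A-Za-z_][A-Za-z0-9_]*=(0|[1-9][0-9]{0,5})$"

def is_key_value(line):
    """Return True if `line` looks like name=number, with no leading zeros."""
    return re.search(KEY_VALUE, line) is not None

print(is_key_value("retries=3"))      # True
print(is_key_value("max_size=1024"))  # True
print(is_key_value("retries=03"))     # False
print(is_key_value("9lives=1"))       # False
```

Because the pattern is anchored at both ends, the leftmost-longest versus first-match difference between engines does not affect the result here.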

A: 

See emacs's regular expression syntax: http://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html#Regexps.

I recall reading that emacs's syntax is set in stone (for backwards compatibility reasons), so if you want to be compatible with everything, make everything compatible with this. Some tools might support it, others might not.

While you have a worthy goal, I think it'll be exceedingly difficult to reach, and I've also found emacs's regexps a pain to work with. Maybe 99% of everything is good enough if it makes you happier and more productive?

Jonas Kölker