I am tired of always having to guess whether I should escape special characters like '()[]{}|' when using the many different implementations of regexps.

The rules differ between, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on. Is there a rule set that tells me when I should, and when I should not, escape special characters? Does it depend on the regexp type, such as PCRE, POSIX or extended regexps?

+1  A: 

Really, there isn't. There are about a half-zillion different regex syntaxes; they seem to come down to Perl, Emacs/GNU, and AT&T in general, but I still keep getting surprised.

Charlie Martin
+3  A: 

Unfortunately, there really isn't a fixed set of escape codes, since it varies with the language you are using.

However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.

Dillie-O
+1  A: 

Unfortunately, the meanings of things like ( and \( are swapped between Emacs-style regular expressions and most other styles. So if you try to escape these, you may be doing the opposite of what you want.

So you really have to know what style you are trying to quote.
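
To make the swap concrete, here is a minimal sketch using Python's re module, chosen only as a convenient Perl-style engine. It shows one side of the comparison; the Emacs-style behaviour is described in the comments, not executed.

```python
import re

# Python's re module follows the Perl-style convention:
# a bare "(" opens a group, while "\(" matches a literal parenthesis.
grouped = re.fullmatch(r"(ab)+", "abab")
literal = re.fullmatch(r"\(ab\)", "(ab)")

print(grouped is not None)  # the bare parentheses acted as grouping
print(literal is not None)  # the escaped parentheses matched themselves

# In Emacs-style (and POSIX BRE) syntax the roles are reversed:
# "\(ab\)" would be the group and "(ab)" the literal text.
```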

Darron
+1  A: 

POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.

There isn't a simple rule for when to use which notation, or even which notation a given command uses.

Check out Jeff Friedl's Mastering Regular Expressions book.

Jonathan Leffler
A: 

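Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a parenthesis isn't going to work on the left-hand side of a substitution in sed, because sed's basic regular expressions treat \( as the start of a group rather than as a literal character. This does not do what you want: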

sed -e 's/foo\(bar/something_else/'

I tend to just use a simple character class definition instead, so the above expression becomes

sed -e 's/foo[(]bar/something_else/'

which I find works for most regexp implementations.

BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.
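
As a quick check that the character-class trick carries over to a Perl-style engine, here is a small sketch in Python. The input string is made up for illustration, and the loop covers only some of the characters from the question.

```python
import re

text = "foo(bar"

# "[(]" is a one-character class, so the parenthesis needs no backslash.
pattern = r"foo[(]bar"
result = re.sub(pattern, "something_else", text)
print(result)

# The same trick for a few other characters from the question:
# each one, wrapped in a class, matches itself literally.
for ch in "(){}|":
    assert re.fullmatch("[" + ch + "]", ch)
```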

Edit: After the comment below, I thought I'd mention that you also have to consider the difference between deterministic (DFA) and nondeterministic (NFA) finite automata when looking at the behaviour of regexp evaluation.

You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for the difference in regexp engine evaluation types.

Not all the world's a PCRE!

Anyway, regexps are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.

Ah, the joys of studying at UNSW in the late '70s! (-:

HTH

cheers,

Rob

Rob Wells
'sed' is a command for which plain '(' is not special but '\(' is special; in contrast, PCRE reverses the sense, so '(' is special, but '\(' is not. This is exactly what the OP is asking about.
Jonathan Leffler
sed is a *nix utility that uses one of the most primitive sets of regexp evaluation. PCRE doesn't enter into the situation I describe, as it involves a different class of automata in the way it evaluates regexps. I think my suggestion for the minimum set of regexp syntax still holds.
Rob Wells
On a POSIX-compliant system, sed uses POSIX BRE, which I cover in my answer. The GNU version on modern Linux systems uses POSIX BRE with a few extensions.
Jan Goyvaerts
A: 

This is not exactly an answer to your question (Dillie-O summed it up nicely, so no need for me to try and do the same thing). If you have to construct regexes in different languages/flavors frequently, I can wholeheartedly recommend RegexBuddy. It's a Windows app, and it's commercial (around 30 bucks), so it might not be what you're looking for. But it is aware of all the significant regex flavors out there and can convert regexes between flavors (even provide you with code snippets ready for insertion in your favorite language). This is a major timesaver for me and others (like Jeff Atwood, for example).

Coincidentally, Jan Goyvaerts (the author of RegexBuddy) has recently written a good blog entry about escaping metacharacters which you might also find interesting.

Tim Pietzcker
+3  A: 

Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.

For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:

.^$*+?()[{\|

and these inside character classes:

^-]\

For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):

.^$*+?()[{\|

Escaping any other characters is an error with POSIX ERE.

Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:

[]^-]
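
The placement rules can be sketched as a small helper. The function posix_bracket below is hypothetical, written in Python only to spell out the ordering. It ignores multi-character collating elements and character class names. Python's re is not a POSIX engine, but it happens to accept the same placement, so it is used here purely as a sanity check.

```python
import re

def posix_bracket(chars: str) -> str:
    """Build a bracket expression matching any one character in chars.

    Hypothetical helper: it relies on placement only, because a
    backslash is an ordinary character inside a POSIX bracket expression.
    """
    wanted = set(chars)
    if not wanted:
        raise ValueError("need at least one character")
    if wanted == {"^"}:
        # "[^]" would be read as the start of a negated class.
        raise ValueError("a lone '^' cannot be made literal by placement")

    lead = "]" if "]" in wanted else ""           # "]" must come first
    middle = "".join(sorted(wanted - set("]^-")))  # ordinary characters
    caret = "^" if "^" in wanted else ""           # "^" anywhere but first
    dash = "-" if "-" in wanted else ""            # "-" first or last

    if lead or middle:
        body = lead + middle + caret + dash
    else:
        # Only "^" and/or "-" remain: put "-" first so "^" is not first.
        body = dash + caret
    return "[" + body + "]"

expr = posix_bracket("-^]")
print(expr)
```

For the three characters discussed above, the helper produces the same expression as the example, and a class built from just - and ^ comes out with the hyphen first.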

In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:

.^$*[\

Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and \+. Escaping a character other than .^$*(){} is normally an error with BREs.

Inside character classes, BREs follow the same rule as EREs.
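
The lists above can be folded into a small lookup table. The following is a rough sketch in Python, not a complete escaping routine: the names META and escape_literal are made up for illustration, it covers only text outside character classes, and it treats [ and \ as special in BRE and | as special in PCRE and ERE, which is an assumption layered on top of the lists.

```python
import re

# Characters to backslash-escape outside character classes, per flavor.
# Assumptions: "|" is included for PCRE/ERE because it is the alternation
# operator, and "[" and "\" are included for BRE because both are special
# there too.
META = {
    "pcre": ".^$*+?()[{\\|",
    "ere": ".^$*+?()[{\\|",
    "bre": ".^$*[\\",
}

def escape_literal(text: str, flavor: str) -> str:
    """Escape text so it matches literally in the given flavor (hypothetical helper)."""
    special = META[flavor]
    return "".join("\\" + ch if ch in special else ch for ch in text)

sample = "f(x) = a+b*c"
as_ere = escape_literal(sample, "ere")
as_bre = escape_literal(sample, "bre")
print(as_ere)
print(as_bre)
```

Note how the parentheses and the plus sign are left alone for BRE: escaping them there would turn them into operators instead of literals.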

If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.

Jan Goyvaerts