I have the following regular expression for eliminating spaces, tabs, and new lines: [^ \n\t]

However, I want to expand this for certain additional characters, such as > and <.

I tried [^ \n\t<>], which works well for now, but I want the expression to not match if the < or > is preceded by a \.

I tried [^ \n\t[^\]<[^\]>], but this did not work.

Can anyone help?

A: 

Maybe you can use egrep and put your pattern string inside quotes. This should obviate the need for escaping.

Yuval F
+1  A: 

Can any one of the sequences below occur in your input?

\\>
\\\>
\\\\>
\blank
\tab
\newline
...

If so, how do you propose to treat them?

If not, then zero-width look-behind assertions will do the trick, provided that your regular expression engine supports them. This is the case in any engine that supports Perl-style regular expressions (Perl itself, PHP, etc.):

 (?<!\\)[ \n\t<>]

The above will match any unescaped space, newline, tab, or angle bracket. More generally (using \s to denote any whitespace character, including \r):

 (?<!\\)\s
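To see the look-behind in action, here is a minimal sketch in Python (my choice of language; the question does not name one). The sample string and the names UNESCAPED_DELIM and tokens are made up for illustration. It splits on unescaped delimiters and also shows the \\> caveat raised above.

```python
import re

# A space, newline, tab, '<' or '>' that is NOT directly preceded by a backslash.
UNESCAPED_DELIM = re.compile(r"(?<!\\)[ \n\t<>]")

# Hypothetical input: "\<" and "\ " are escaped, the other delimiters are not.
sample = r"foo\<bar <baz> qux\ quux"

# Split on unescaped delimiters and drop the empty strings that appear
# between two adjacent delimiters (for example " <").
tokens = [t for t in UNESCAPED_DELIM.split(sample) if t]
print(tokens)

# Caveat: the look-behind only inspects one character, so in "a\\>" the '>'
# counts as escaped even though the backslash before it is itself escaped.
double_backslash_hit = UNESCAPED_DELIM.search(r"a\\>")
print(double_backslash_hit)
```

If doubled backslashes can occur in your input, the single-character look-behind is not enough on its own.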

Alternatively, using complementary notation without the need for a zero-width look-behind assertion (but arguably less efficiently), match the characters you want to keep, trying the escaped form first:

 (?:\\[<>]|[^ \n\t<>])
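A quick Python sketch of the complementary approach, with a made-up sample string and made-up names (TOKEN, CLASS_FIRST). One thing to watch in a backtracking engine such as Python's re: alternatives are tried left to right, so the sketch lists the escaped pair before the character class. With the class first, the backslash is consumed on its own and the following < ends the token.

```python
import re

# One or more "kept" units: an escaped angle bracket, or any single
# character that is not a delimiter. The escaped pair is tried first.
TOKEN = re.compile(r"(?:\\[<>]|[^ \n\t<>])+")

# Same alternatives in the other order, for comparison only.
CLASS_FIRST = re.compile(r"(?:[^ \n\t<>]|\\[<>])+")

sample = r"foo\<bar <baz> qux"

tokens = TOKEN.findall(sample)
print(tokens)       # the escaped "\<" stays inside the first token

class_first_tokens = CLASS_FIRST.findall(sample)
print(class_first_tokens)   # the first token is cut right after the backslash
```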

You may also use a variation of the latter to handle the \\>, \\\>, \\\\> etc. cases as well, by consuming each backslash together with the character it escapes, so that only an odd number of preceding backslashes escapes the bracket:

 (?:\\.|[^ \n\t<>\\])
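For runs of backslashes, one option that does not depend on counting them is to consume each backslash together with whatever character follows it. This is a sketch under the assumption that a backslash escapes any next character (the question does not say so), and TOKEN and tokenize are hypothetical names.

```python
import re

# Either a backslash plus the character it escapes, or a single character
# that is neither a delimiter nor a backslash. Because backslashes are
# always consumed in pairs with their successor, "\\" is read as an escaped
# backslash and leaves a following '>' unescaped.
TOKEN = re.compile(r"(?:\\.|[^ \n\t<>\\])+")

def tokenize(text: str) -> list[str]:
    """Return the runs of non-delimiter text, keeping escaped characters."""
    return TOKEN.findall(text)

one = tokenize(r"a\>b")      # one backslash: '>' is escaped, single token
two = tokenize(r"a\\>b")     # two backslashes: '>' is a real delimiter
three = tokenize(r"a\\\>b")  # three backslashes: '>' is escaped again
print(one, two, three)
```

Note that . does not match a newline by default in Python, so a backslash directly before a newline is left unmatched here; pass re.DOTALL if that case matters to you.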

Cheers, V.

vladr
A: 

According to the grep man page:

A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list.

This means that a bracket expression cannot match a sequence of characters such as \< or \>, only single characters.

If your version of grep is built with Perl regex support (the -P option in GNU grep), you can use lookarounds, as one of the other posters mentioned. Not all versions of grep have this support, though.

Robert S. Barnes