Hi all -- I have two files I've been trying to compare with diff. The files are automatically generated and feature a number of lines that look like:

//!   Generated Date  : Mon, 14, Dec 2009

I'd like those differences to be ignored, and have set out to use the "-I REGEX" flag to make that happen.

However, the number of spaces between "Date" and the colon varies, and unfortunately the flavor of regular expressions used by diff seems to lack a number of basic regex features.

For instance, I cannot for the life of me get the "one or more" plus-sign to work. Same deal with the "\s" representation of whitespace.

diff -I '.*Generated Date\s+:.*' ....

and

diff -I '.*Generated Date +:.*' ....

both fail spectacularly.

Rather than continuing to blindly try things, can somebody out there point me to a good reference on the diff-specific subset of regular expressions?

Thanks!

===== EDIT =======

Thanks to FalseVinylShrub, I've established that I should be escaping my '+' and any similar characters. This fixes the problem somewhat. Diff successfully matches

.*Generated Date \+.*

and

.*Generated Date  *.*

(Note that there are two spaces between "Date" and "*".)

However, the second I try to add the ':' to that expression, like so:

.*Generated Date \+:.*

and

.*Generated Date \+\:.*

Both versions fail to match the string in question and cause diff to take significantly longer to run. Any thoughts there?

A: 

According to the specification, diff doesn't support regular expressions, nor does it have an -I switch.

You appear to be using a non-standard diff with non-standard extensions. How those extensions work should be described in the documentation of whatever non-standard diff you are using.

Jörg W Mittag
I am using GNU diff 2.8.1. That's non-standard?
Zack
+1  A: 

Very interesting... I couldn't find a documentation reference, but a little experimentation found that:

  • ␠* and .* worked if zero-or-more is OK for you
  • As you said, ␠+ doesn't work. Neither did ␠{1,}... but ␠\{1,\} did work
  • UPDATE: ␠\+ also works!

(␠ represents a space character, which didn't show up on its own.)
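For anyone who wants to repeat the experiment, here is a minimal sketch of the kind of test described above. It assumes GNU diff (the -I option is a GNU extension); the file names, file contents and the `check` helper are made up for illustration. The patterns deliberately stop before the colon, so this does not say anything about the colon problem raised in the question.

```shell
#!/bin/sh
# Sketch only: assumes GNU diff. File names and contents are hypothetical.
tmp=$(mktemp -d)
printf '%s\n' '//!   Generated Date  : Mon, 14, Dec 2009' 'int x = 1;' > "$tmp/a.txt"
printf '%s\n' '//!   Generated Date : Tue, 15, Dec 2009' 'int x = 1;' > "$tmp/b.txt"

# Hypothetical helper: prints "same" when diff exits 0, "differ" otherwise
# (a diff error would also print "differ").
check() {
  if diff "$@" > /dev/null; then echo same; else echo differ; fi
}

# No -I at all: the date lines differ.
plain=$(check "$tmp/a.txt" "$tmp/b.txt")

# Backslashed interval: one or more spaces after "Date".
interval=$(check -I 'Generated Date \{1,\}' "$tmp/a.txt" "$tmp/b.txt")

# Backslashed plus: same meaning, GNU-specific spelling.
bk_plus=$(check -I 'Generated Date \+' "$tmp/a.txt" "$tmp/b.txt")

# Unescaped plus is a literal "+" in basic syntax, so the lines do not match.
literal_plus=$(check -I 'Generated Date +' "$tmp/a.txt" "$tmp/b.txt")

echo "plain=$plain interval=$interval bk_plus=$bk_plus literal_plus=$literal_plus"
rm -rf "$tmp"
```

Note that -I only suppresses a hunk when every changed line in that hunk matches the expression, so a single unrelated change next to the date line will bring the whole hunk back.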

I'm using GNU diff from GNU diffutils 2.8.1.

man diff and info diff didn't explain the RE syntax.

Hope this helps.

UPDATE: I found a brief section in man grep:

Basic vs Extended Regular Expressions

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

So I guess it's using Basic regex syntax.

FalseVinylShrub
Hm! I'm using the exact same version of GNU diff, so this was a good sanity check. I changed my regex a bit, and lo and behold, you're right! The problem is, it seems to break horribly on the ":". I will edit my original post to describe the problem.
Zack
A: 

Ok, here's what the GNU diff source says.

re_set_syntax (RE_SYNTAX_GREP | RE_NO_POSIX_BACKTRACKING);

I think that means, "same as gnu grep -G" (Basic Regular Expression). According to the gnu grep man page:

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

Forget about \s, \S, etc.
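If the whitespace may include tabs, one possible stand-in for \s is the POSIX bracket expression [[:space:]], which basic regular expressions accept. The following is a rough sketch, assuming GNU diff; the file names, contents and the `check` helper are invented for the example, and the patterns again stop before the colon.

```shell
#!/bin/sh
# Sketch only: assumes GNU diff. File names and contents are hypothetical.
tmp=$(mktemp -d)
# One file has a tab after "Date", the other has two spaces.
printf '//!   Generated Date\t: Mon, 14, Dec 2009\nint x = 1;\n' > "$tmp/a.txt"
printf '//!   Generated Date  : Tue, 15, Dec 2009\nint x = 1;\n' > "$tmp/b.txt"

# Hypothetical helper: prints "same" when diff exits 0, "differ" otherwise
# (a diff error would also print "differ").
check() {
  if diff "$@" > /dev/null; then echo same; else echo differ; fi
}

# A literal-space pattern misses the line that uses a tab,
# so the hunk is not ignored.
spaces_only=$(check -I 'Generated Date \{1,\}' "$tmp/a.txt" "$tmp/b.txt")

# [[:space:]] covers both spaces and tabs.
posix_class=$(check -I 'Generated Date[[:space:]]\{1,\}' "$tmp/a.txt" "$tmp/b.txt")

echo "spaces_only=$spaces_only posix_class=$posix_class"
rm -rf "$tmp"
```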

Wayne Conrad