I'm trying to filter thousands of files, looking for those that contain string constants with mixed case. The string content may have leading or trailing whitespace, but may not contain whitespace internally. So the following (each containing at least one uppercase, or UC, character) are matches:

"  AString "   // leading and trailing spaces together allowed
"AString "     // trailing spaces allowed
"  AString"    // leading spaces allowed
"newString03"  // numeric chars allowed
"!stringBIG?"  // non-alphanumeric chars allowed
"R"            // Single UC is a match

but these are not:

"A String" // not a match because it contains an embedded space
"Foo bar baz" // does not match due to multiple whitespace interruptions
"a_string" // not a match because there are no UC chars

I still want to match on lines which contain both patterns:

"ABigString", "a sentence fragment" // need to catch so I find the first case...

I want to use Perl regexps, preferably driven by the ack command-line tool. Obviously, \w and \W are not going to work. It seems that \S should match the non-space chars. I can't seem to figure out how to embed the requirement of "at least one upper-case character per string"...

ack --match '\"\s*\S+\s*\"'

is the closest I've gotten. I need to replace the \S+ with something that captures the "at least one upper-case (ascii) character (in any position of the non-whitespace string)" requirement.

This is straightforward to program in C/C++ (and yes, Perl, procedurally, without resorting to regexps), I'm just trying to figure out if there is a regular expression which can do the same job.

A: 

You could add the requirement with a character class, like:

ack --match "\"\s*\S+[A-Z]\S+\s*\""

I'm assuming that ack matches one line at a time. The \S+\s*\" part can match multiple closing quotes in a row. It would match the entirety of "alfa"", instead of just "alfa".
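One way to probe where this pattern allows the capital to sit is to run it over a few samples. The following is a minimal Perl sketch, not part of the ack invocation itself; the sample strings are hypothetical and were picked only to put the capital at different positions.

```perl
use strict;
use warnings;

# The pattern from the ack command above, written as a Perl regex.
my $pattern = qr/"\s*\S+[A-Z]\S+\s*"/;

# Hypothetical samples: capitals at both ends, a lone capital at the
# end, and a single capital with nothing around it.
my @samples = ('"ABxyCD"', '"abcD"', '"R"');

for my $s (@samples) {
    printf "%-10s %s\n", $s, ($s =~ $pattern ? "match" : "no match");
}
```

Because `\S+` needs at least one character on each side of `[A-Z]`, only the first sample should be reported as a match.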

Andomar
ack, not awk ;^)~ It embeds in Perl or runs as a command line, and thus uses Perl regexps: http://betterthangrep.com/. But I can still consider awk, of course. Thanks.
Don Wakefield
Oh, and as written, doesn't yours require the upper case char as the last char in the non-space string? I need to require the UC char *anywhere* in the non-space string.
Don Wakefield
@Don Wakefield: Right, I kinda wondered about the new AWK `--match` option :) The regex should work in Perl tho
Andomar
@Don Wakefield: It has `\S+` both before and after `[A-Z]`, so it doesn't require the cap at end of string
Andomar
But I think it does require the capital *not* to be at the end of the string if there's only one of it, in other words, "abcD" won't match.
Dan
@Dan is right. The requirement is "at least one UC char, *anywhere* in the non-whitespace character string" and there can be no additional accidental constraint which amounts to "just at the beginning", "just at the end" or "at least *two* (three, four) UC chars"...
Don Wakefield
Won't that match ABCD? Or, is all caps an okay string to match?
coffeepac
Yes, all caps is okay to match. But the pattern really only matches strings with caps anchored to the ends of the non-whitespace string. So "ABCD" matches, and "ABxyCD", but not "abXYcd", which should.
Don Wakefield
+8  A: 

The following pattern passes all your tests:

qr/
  "      # leading double quote

  (?!    # filter out strings with internal spaces
     [^"]*   # zero or more non-quotes
     [^"\s]  # neither a quote nor whitespace
     \s+     # internal whitespace
     [^"\s]  # another non-quote, non-whitespace character
  )

  [^"]*  # zero or more non-quote characters
  [A-Z]  # at least one uppercase letter
  [^"]*  # followed by zero or more non-quotes
  "      # and finally the trailing quote
/x

Using this test program (which uses the above pattern without /x, and therefore without whitespace and comments) as input to ack-grep (as ack is called on Ubuntu)

#! /usr/bin/perl

my @tests = (
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"A String">     => 0 ],
  [ q<"a_string">     => 0 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
  [ q<"  a String  "> => 0 ],
  [ q<"Foo bar baz">  => 0 ],
);

my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
for (@tests) {
  my($str,$expectMatch) = @$_;
  my $matched = $str =~ /$pattern/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",
        ": $str\n";
}

produces the following output:

$ ack-grep '"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' try
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",

With the C shell and derivatives, you have to escape the bang:

% ack-grep '"(?\![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' ...

I wish I could preserve the highlighted matches, but that doesn't seem to be allowed.
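As a plain-text stand-in for the highlighting, a short Perl loop can print just the portion of each line that the pattern matched. This is a minimal sketch that assumes the same pattern as above; the two sample lines are taken from the question, and the `=>` output format is an arbitrary choice.

```perl
use strict;
use warnings;

my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;

# Two sample lines from the question.
my @lines = (
  q<"ABigString", "a sentence fragment">,
  q<"  AString ">,
);

for my $line (@lines) {
  # With no capture groups, /g in list context returns every
  # complete match found on the line.
  my @hits = $line =~ /$pattern/g;
  print "$line\n";
  print "    => $_\n" for @hits;
}
```

For the first line this should list only "ABigString", since the sentence fragment is rejected by the lookahead.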

Note that escaped double-quotes (\") will severely confuse this pattern.
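To make that caveat concrete, here is a minimal sketch with a hypothetical input line holding a single string constant whose contents are a b"C. The constant has an internal space, so it should be rejected, but the pattern treats the escaped quote as an opening quote.

```perl
use strict;
use warnings;

my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;

# Hypothetical input: one constant containing an escaped double quote.
my $line = q<"a b\"C">;

# Capture the whole match to see which part of the line triggered it.
if ($line =~ /($pattern)/) {
  print "matched: $1\n";   # prints: matched: "C"
}
```

The match starts at the escaped quote, so the line is reported even though the constant as a whole violates the no-internal-whitespace rule.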

Greg Bacon
Kinopiko
That is a thing of beauty, sort of. ;^)~ Now if I can figure out how to escape it for the shell, I can use it with ack!
Don Wakefield
Just use single quotes. See my revised answer.
Greg Bacon
Is ack-grep just an alias for ack? I have version 1.88 of ack. Also, with c-shell, the single quoted version fails: [: Event not found.
Don Wakefield
But Bourne shell seems to work. Okay, I have my answer! Thanks @gbacon!
Don Wakefield
See revised answer. Glad to help!
Greg Bacon
@gbacon, Actually, probably due to my poorly worded spec, it still incorrectly matches when there is more than one internal whitespace sequence: "Foo bar baz" matches, but should not as it is multiple words. I'll see if I can fix it now that I have some clues to the pattern...
Don Wakefield
What about `" a String "`? I assume it should not match either.
Greg Bacon
@gbacon: True. *Any* number of interruptions of non-whitespace chars with whitespace chars disqualifies the string. Only leading and trailing whitespace are allowed.
Don Wakefield
ack-grep is how some Linux distros package ack, because there is already a package out there called ack.
Andy Lester
@gbacon's revised answer (truly, this time) wins the prize! Thanks, sir!
Don Wakefield