ansaurus

Question

How can I find strings that have mixed case with Perl?

Answer 1

A:

You could add the requirement with a character class, like:

ack --match "\"\s*\S+[A-Z]\S+\s*\""

I'm assuming that ack matches one line at a time. The \S+\s*\" part can match multiple closing quotes in a row. It would match the entirety of "alfa"", instead of just "alfa".

Andomar 2009-12-08 15:14:49

ack, not awk ;^)~ It embeds in Perl or runs from the command line, and thus uses Perl regexps: http://betterthangrep.com/. But I can still consider awk, of course. Thanks.

Don Wakefield 2009-12-08 15:19:34

Oh, and as written, doesn't yours require the upper case char as the last char in the non-space string? I need to require the UC char *anywhere* in the non-space string.

Don Wakefield 2009-12-08 15:20:48

@Don Wakefield: Right, I kinda wondered about the new AWK `--match` option :) The regex should work in Perl tho

Andomar 2009-12-08 15:20:50

@Don Wakefield: It has `\S+` both before and after `[A-Z]`, so it doesn't require the cap at end of string

Andomar 2009-12-08 15:21:24

But I think it does require the capital *not* to be at the end of the string if there's only one of it, in other words, "abcD" won't match.

Dan 2009-12-08 15:31:52

@Dan is right. The requirement is "at least one UC char, *anywhere* in the non-whitespace character string" and there can be no additional accidental constraint which amounts to "just at the beginning", "just at the end" or "at least *two* (three, four) UC chars"...

Don Wakefield 2009-12-08 15:57:42

Won't that match ABCD? Or, is all caps an okay string to match?

coffeepac 2009-12-08 15:59:42

Yes, all caps is okay to match. But the pattern really only matches strings with caps anchored to the ends of the non-whitespace string. So "ABCD" matches, and "ABxyCD", but not "abXYcd", which should.

Don Wakefield 2009-12-08 16:09:32
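
For illustration, here is a minimal sketch of one way to meet the requirement stated in the comments above (at least one uppercase character anywhere in a single non-whitespace run, with optional leading and trailing whitespace). It is not Andomar's pattern: it replaces `\S+` on each side of `[A-Z]` with `[^"\s]*`, so the capital may sit at the start, the middle, or the end, and a quote cannot be swallowed as part of the word. The variable name and the sample strings are assumptions made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical alternative pattern (not from the thread): optional leading
# whitespace, one run of non-quote, non-space characters containing at
# least one uppercase ASCII letter, optional trailing whitespace.
my $single_word_with_cap = qr/"\s*[^"\s]*[A-Z][^"\s]*\s*"/;

# Sample strings chosen to cover the cases raised in the comments.
for my $str ('"abcD"', '"Abcd"', '"abXYcd"', '"ABCD"', '"abcd"', '"A String"') {
    my $verdict = $str =~ $single_word_with_cap ? 'match' : 'no match';
    print "$verdict: $str\n";
}
```

The first four samples should be reported as matches and the last two as non-matches. Like any line-based pattern, this can still pair the closing quote of one string with the opening quote of the next.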

Answer 2

+8 A:

The following pattern passes all your tests:

qr/
  "      # leading double quote

  (?!    # filter out strings with internal spaces
     [^"]*   # zero or more non-quotes
     [^"\s]  # neither a quote nor whitespace
     \s+     # internal whitespace
     [^"\s]  # another non-quote, non-whitespace character
  )

  [^"]*  # zero or more non-quote characters
  [A-Z]  # at least one uppercase letter
  [^"]*  # followed by zero or more non-quotes
  "      # and finally the trailing quote
/x

Using this test program, which uses the above pattern without /x and therefore without whitespace and comments, as input to ack-grep (as ack is called on Ubuntu)

#! /usr/bin/perl

use strict;
use warnings;

my @tests = (
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"A String">     => 0 ],
  [ q<"a_string">     => 0 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
  [ q<"  a String  "> => 0 ],
  [ q<"Foo bar baz">  => 0 ],
);

my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
for (@tests) {
  my($str,$expectMatch) = @$_;
  my $matched = $str =~ /$pattern/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",
        ": $str\n";
}

produces the following output:

$ ack-grep '"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' try
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",

With the C shell and derivatives, you have to escape the bang:

% ack-grep '"(?\![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' ...

I wish I could preserve the highlighted matches, but that doesn't seem to be allowed.

Note that escaped double-quotes (\") will severely confuse this pattern.
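
As a rough sketch of one way around the escaped-quote problem (not part of the pattern above), the work can be split in two: first extract each string literal with a pattern that treats a backslash plus the following character as a unit, then test the extracted body. The helper name `mixed_case_strings` and the sample line are hypothetical, and the sketch assumes backslash is the only escape mechanism in the files being searched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: strings use backslash escapes, so \" does not end a string.
# The capture group holds the body between the opening and closing quotes.
my $string_literal = qr/"((?:[^"\\]|\\.)*)"/;

# Hypothetical helper: return the bodies of all string literals on a line
# that consist of a single word (optionally padded with whitespace)
# containing at least one uppercase ASCII letter.
sub mixed_case_strings {
    my ($line) = @_;
    my @hits;
    while ($line =~ /$string_literal/g) {
        my $body = $1;
        push @hits, $body if $body =~ /\A\s*\S*[A-Z]\S*\s*\z/;
    }
    return @hits;
}

# Made-up input line: the first string contains escaped quotes and spaces.
print "$_\n"
    for mixed_case_strings(q<say "a \"Quoted\" word", "newString03", "Foo bar";>);
```

On the sample line only `newString03` should be printed. Note that an uppercase letter inside an escape sequence would also count as a capital in this sketch.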

Greg Bacon 2009-12-08 16:13:35

Kinopiko 2009-12-08 16:22:15

That is a thing of beauty, sort of. ;^)~ Now if I can figure out how to escape it for the shell, I can use it with ack!

Don Wakefield 2009-12-08 16:25:10

Just use single quotes. See my revised answer.

Greg Bacon 2009-12-08 16:26:41

Is ack-grep just an alias for ack? I have version 1.88 of ack. Also, with c-shell, the single quoted version fails: [: Event not found.

Don Wakefield 2009-12-08 16:31:11

But Bourne shell seems to work. Okay, I have my answer! Thanks @gbacon!

Don Wakefield 2009-12-08 16:33:37

See revised answer. Glad to help!

Greg Bacon 2009-12-08 16:36:52

@gbacon, Actually, probably due to my poorly worded spec, it still incorrectly matches when there is more than one internal whitespace sequence: "Foo bar baz" matches, but should not, as it is multiple words. I'll see if I can fix it now that I have some clues to the pattern...

Don Wakefield 2009-12-08 17:19:00

What about `" a String "`? I assume it should not match either.

Greg Bacon 2009-12-08 18:05:37

@gbacon: True. *Any* number of interruptions of non-whitespace chars with whitespace chars disqualifies the string. Only leading and trailing whitespace are allowed.

Don Wakefield 2009-12-08 18:19:28

ack-grep is how some Linux distros package ack, because there is already a package out there called ack.

Andy Lester 2009-12-10 16:05:16

@gbacon's revised answer (truly, this time) wins the prize! Thanks, sir!

Don Wakefield 2009-12-10 18:54:27
