I'm running Ubuntu 8.04 and my code looks like this...

 for (i=1;i<=n;i++)
 {
  if (arr[i] ~ /^[A-Z]{2,4}$/) printf(arr[i])
 }

I quickly discovered that the {n} interval expression won't work in gawk without the --posix switch. Once it is enabled the expression works, but it is case-insensitive, matching both AAAA and aaaa. What is going on here?

A: 

I only have mawk installed, but maybe this is what you're looking for?

for (i=1;i<=n;i++) { if (arr[i] ~ /[^A-Z]{2,4}$/) printf(arr[i]) }

Sorry, but I don't think this is what the OP asked for
jpalecek
+4  A: 

The expression itself works for me:

dfs:~# gawk --posix '/^[A-Z]{2,4}$/ {print "Yes"}'
AAAA
Yes
AA
Yes
TT
Yes
tt
YY
Yes
yy

Your problem may have one of two causes. Either you accidentally set the IGNORECASE awk variable or otherwise turned on case-insensitive operation (BTW, IGNORECASE doesn't work with --posix, but it does with --re-interval, which also enables the braces in regular expressions), or it is the classic problem of the locale's collating sequence: gawk does locale-aware character comparison, so in many locales the lowercase letters sort between the uppercase ones and fall inside the range [A-Z]. Quote from the relevant part of the manual:

Many locales sort characters in dictionary order, and in these locales, ‘[a-dx-z]’ is typically not equivalent to ‘[abcdxyz]’; instead it might be equivalent to ‘[aBbCcDdxXyYz]’, for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value ‘C’.

jpalecek
Perhaps you should specify that IGNORECASE is an environment variable for greater clarity.
dmckee
Yes, I probably should have, because it in fact isn't an environment variable. Edited to clarify that.
jpalecek
A: 

Otherwise, if you're using GNU awk, you could use the [[:upper:]] character class. Note the double brackets: [:upper:] is only recognized inside a bracket expression.

% awk '{print /[[:upper:]]/?"OK":"KO"}'
AA
OK
aa
KO
radoulov