I'm trying to identify and condense single (uppercase) characters in a string.

For example:

"test A B test" -> "test AB test"

"test A B C test" -> "test ABC test"

"test A B test C D E test" -> "test AB test CDE test"

I have it working for single occurrences (as in the first above example), but cannot figure out how to chain it for multiple occurrences.

$str =~ s/ ([A-Z]) ([A-Z]) / \1\2 /g;

I'll probably feel stupid when I see the solution, but I'm prepared for that. Thanks in advance.

+1  A: 
$str =~ s/\b([A-Z])\s+(?=[A-Z]\b)/$1/g;
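
As a quick check, here is a minimal sketch that runs this substitution on the three strings from the question. The `condense` helper name is only for illustration. The lookahead `(?=...)` tests the next letter without consuming it, so that letter is still available to start the following match:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution above.
sub condense {
    my ($str) = @_;
    # Drop the whitespace after a lone capital when another lone capital follows.
    # The lookahead leaves the second letter unconsumed, so runs of any length collapse.
    $str =~ s/\b([A-Z])\s+(?=[A-Z]\b)/$1/g;
    return $str;
}

print condense($_), "\n" for
    "test A B test",
    "test A B C test",
    "test A B test C D E test";
# test AB test
# test ABC test
# test AB test CDE test
```
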
KennyTM
I saw the "\1" was changed to "$1". Both versions appear to work...so what is the difference?
brydgesk
The word boundary assertion (`\b`) might not be what you want here. If the string `"A B C!"` should become `"AB C!"` you will need to use something else. Also, if `"A B C1"` should become `"ABC1"` then you will need to use something else.
Chas. Owens
@brydgesk read the output of `perl -Mdiagnostics -e '$" =~ s/(a)/\1/'` Basically it is a style and consistency issue (e.g. `\10` likely doesn't mean what you think it does, but `$10` does).
Chas. Owens
I think \b should be fine at the beginning, but at the end I'd like to look for either whitespace or end of string/line.
brydgesk
@bry: Oops. That was just because I was testing the regex in Python which doesn't accept the `$1`.
KennyTM
+1  A: 

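The reason it's not working is that your regex consumes the leading and trailing spaces. With /g, the next search starts where the previous match ended, so once " A B " has matched and become " AB ", the space in front of the C has already been used up and " B C " can no longer match.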

The simplest solution would be to take those out and check the second letter with a lookahead so that it is not consumed: s/([A-Z]) (?=[A-Z])/$1/g. That fulfills the stated requirements, but it would also turn all-caps phrases into a single block of letters (e.g., "THIS IS A TEST" -> "THISISATEST"), which may not be acceptable to you.
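
To see how an unanchored pattern behaves under /g, here is a small sketch comparing a version that consumes both letters with one that only looks ahead at the second letter. The variable names are illustrative:

```perl
use strict;
use warnings;

my $consuming = "test A B C test";
# Both letters are consumed, so the next search starts after "B"
# and the pair "B C" is never joined.
$consuming =~ s/([A-Z]) ([A-Z])/$1$2/g;
print "$consuming\n";   # test AB C test

my $lookahead = "test A B C test";
# The second letter is only inspected, so it can begin the next match.
$lookahead =~ s/([A-Z]) (?=[A-Z])/$1/g;
print "$lookahead\n";   # test ABC test

my $caps = "THIS IS A TEST";
# Without any check for lone letters, whole all-caps words are merged too.
$caps =~ s/([A-Z]) (?=[A-Z])/$1/g;
print "$caps\n";        # THISISATEST
```
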

If you need to collapse only single capital letters and not groups of them (e.g., "FOR I M A TEST" -> "FOR IMA TEST", not "FORIMATEST"), then I don't think that's possible with a single regex. You'd have to do it in two passes: one to mark which spaces to collapse and a second to remove the marks (e.g., "FOR I M A TEST" -> "FOR I^M^A TEST" -> "FOR IMA TEST"). Otherwise you can't distinguish between a pair of uppercase letters that were originally adjacent and a pair that was originally space-separated but has already been collapsed.
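
The two-pass idea could be sketched roughly as follows. This is only one way to do it: the `condense_two_pass` name is made up for illustration, and the code assumes that "^" never appears in the input, since it is used as the temporary mark:

```perl
use strict;
use warnings;

# Hypothetical helper; assumes "^" never occurs in the input.
sub condense_two_pass {
    my ($str) = @_;
    # Pass 1: mark the space between two lone capitals. A lone capital is
    # bounded by a space, an earlier mark, or the start/end of the string.
    # No /g here: the loop rescans from the start, so a letter that was
    # just marked can anchor the next pair.
    1 while $str =~ s/(\A|[ ^])([A-Z]) ([A-Z])( |\z)/$1$2^$3$4/;
    # Pass 2: delete the marks.
    $str =~ tr/^//d;
    return $str;
}

print condense_two_pass("FOR I M A TEST"), "\n";   # FOR IMA TEST
```

Each iteration of the loop turns one space into a mark, so the loop ends once no qualifying space is left.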

Dave Sherohman