I hate to ask stupid questions like this, but why doesn't my code do what I expect?

Java code in a main method:

String s = "\"The fat-dog [ruffus] @nd the stupid-cat [*mewzer*] don't like each other!\"";
String[] tokens = s.replaceAll("[\\x27]+", "").replaceAll("[^a-zA-z_\\x2D]+", " ").replaceAll("\\s+", " ").trim().split(" ");

System.out.println(s);
for (String t : tokens)
    System.out.println("Token: " + t);

This prints:

"The fat-dog [ruffus] @nd the stupid-cat [*mewzer*] don't like each other!"

Token: The
Token: fat-dog
Token: [ruffus]
Token: nd
Token: the
Token: stupid-cat
Token: [
Token: mewzer
Token: ]
Token: dont
Token: like
Token: each
Token: other

Which is mostly correct, except for those damn brackets! Shouldn't they be replaced by the "[^a-zA-z_\\x2D]+" expression? I even tried adding a replaceAll("\\[\\]"," ") and then a replaceAll("\\x5B\\x5D"," ") to no avail.


How can I get rid of these brackets? What is keeping them from being replaced by the three replaceAll statements I just mentioned?

+1  A: 

This:

replaceAll("\\[\\]"," ")

Should probably be:

replaceAll("(\\[|\\])"," ")

You were trying to replace instances of [] with a space, instead of replacing a [ or a ] with a space.
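
A minimal sketch of the difference, in case it helps. The class and method names are made up for illustration; only String.replaceAll comes from the question. The third method uses a character class, which should behave the same as the alternation.

```java
public class BracketReplaceDemo {
    // Regex \[\] matches only the two-character sequence "[]".
    static String replacePair(String s) {
        return s.replaceAll("\\[\\]", " ");
    }

    // Regex (\[|\]) matches a single '[' or a single ']'.
    static String replaceEither(String s) {
        return s.replaceAll("(\\[|\\])", " ");
    }

    // Regex [\[\]] is a character class containing '[' and ']'.
    static String replaceCharClass(String s) {
        return s.replaceAll("[\\[\\]]", " ");
    }

    public static void main(String[] args) {
        String s = "[ruffus] []";
        // Quotes are printed around each result so the spaces are visible.
        System.out.println("'" + replacePair(s) + "'");
        System.out.println("'" + replaceEither(s) + "'");
        System.out.println("'" + replaceCharClass(s) + "'");
    }
}
```

The first call leaves [ruffus] untouched because the brackets there are never adjacent; the other two replace every bracket individually.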

jjnguy
Thanks. That explains why the second two replaceAll expressions didn't work as expected, but what about the first? `[` and `]` are not in the set `a-zA-z_\\x2D`, correct?
Doug
[ and ] are in the set A-z, see my answer :)
Affe
+2  A: 

Your first try didn't work because of this

replaceAll("[^a-zA-z_\\x2D]+", " ")

The range A-z actually includes [ and ]. In ASCII (and therefore in Unicode, which Java regexes use), the six characters [ \ ] ^ _ ` sit between Z and a, so A-z matches all of them as well as the letters. That can look like a convenient shorthand for "all letters", but it is also a pitfall for you!

You probably meant A-Z
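
Here is a small sketch to check this yourself. The class and method names are hypothetical; tokenize is just the chain from the question with A-z changed to A-Z.

```java
public class RangeDemo {
    // True if c falls inside the accidental range A-z.
    static boolean inTypoRange(char c) {
        return String.valueOf(c).matches("[A-z]");
    }

    // True if c is an ASCII letter, which is what was probably intended.
    static boolean inFixedRange(char c) {
        return String.valueOf(c).matches("[A-Za-z]");
    }

    // The question's replaceAll chain with the range corrected to A-Z.
    static String[] tokenize(String s) {
        return s.replaceAll("[\\x27]+", "")
                .replaceAll("[^a-zA-Z_\\x2D]+", " ")
                .replaceAll("\\s+", " ")
                .trim()
                .split(" ");
    }

    public static void main(String[] args) {
        // The six ASCII characters between 'Z' and 'a'.
        for (char c : "[\\]^_`".toCharArray()) {
            System.out.println(c + " in [A-z]: " + inTypoRange(c)
                    + ", in [A-Za-z]: " + inFixedRange(c));
        }
        String s = "\"The fat-dog [ruffus] @nd the stupid-cat [*mewzer*] don't like each other!\"";
        for (String t : tokenize(s)) {
            System.out.println("Token: " + t);
        }
    }
}
```

With the corrected range, the brackets are treated like any other punctuation, so ruffus and mewzer should come out as plain tokens.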

Affe
Ahh, the subtle capitalization typo. Sometimes you just need a second pair of regex-comprehending eyes. Thanks.
Doug
A: 

It looks like there is a better way to do what you really seem to want, which is removing all non-word characters (except hyphens) from the string:

String[] tokens = s.replaceAll("[^\\w\\s-]+", "").replaceAll("\\s+", " ").trim().split(" ");

Note that \w also matches digits (and the underscore), so this will leave digits in your string alone. Is that a problem?
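
To illustrate the point about digits, here is a rough sketch. The class name and the sample input are invented for the example; the regex chain is the one above.

```java
public class WordCharDemo {
    // Drops everything that is not a word character, whitespace, or hyphen,
    // then collapses whitespace and splits on single spaces.
    static String[] tokenize(String s) {
        return s.replaceAll("[^\\w\\s-]+", "")
                .replaceAll("\\s+", " ")
                .trim()
                .split(" ");
    }

    public static void main(String[] args) {
        // Hypothetical input containing digits, brackets, and punctuation.
        for (String t : tokenize("route-66 [*mewzer*] c3po!")) {
            System.out.println("Token: " + t);
        }
    }
}
```

The brackets, asterisks, and exclamation mark are removed, while the digits in route-66 and c3po survive because \w covers them.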

Tim Pietzcker