ansaurus

Question

Replace Unicode Control Characters: is there an existing solution?

Answer 1

+1 A:

If you want to delete all characters in the Unicode Other/Control category (`Cc`), you can do something like this:

    System.out.println(
        "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
    ); // abcd

Note that this removes the actual control characters from the string (among others, the Unicode character '\u008f'); it does not touch an escaped form such as the literal text "%8F".
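To make the distinction concrete, here is a small sketch. The class and method names (`EscapedVersusRaw`, `stripControl`) are made up for illustration; only `String.replaceAll` and `\p{Cc}` come from the answer above.

```java
public class EscapedVersusRaw {
    // Illustrative helper: removes every character in the Unicode Cc (control) category.
    static String stripControl(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    public static void main(String[] args) {
        String raw = "x\u008fy";   // holds the real U+008F control character
        String escaped = "x%8Fy";  // holds three ordinary characters: '%', '8', 'F'

        System.out.println(stripControl(raw));     // xy
        System.out.println(stripControl(escaped)); // x%8Fy (nothing to remove)
    }
}
```

If your input arrives percent-encoded, it would have to be decoded first for the regex to see the control characters.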

If the blacklist is not neatly captured by a single Unicode block or category, Java's regex engine supports character class arithmetic (intersection, subtraction, and so on) that you can use to compose one. Alternatively, you can use a negated whitelist approach: instead of explicitly specifying which characters are illegal, you specify which are legal, and everything else becomes illegal.
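As one possible use of that arithmetic: `\p{Cc}` also covers tab, line feed and carriage return, which you may want to keep. The sketch below assumes exactly that requirement (it is not part of the original question), and the names `KeepWhitespaceControls` and `stripControlKeepWhitespace` are hypothetical.

```java
public class KeepWhitespaceControls {
    // Illustrative helper: removes Cc characters EXCEPT tab, line feed and carriage return.
    // [\p{Cc}&&[^\t\n\r]] reads as "control characters intersected with
    // anything that is not \t, \n or \r", i.e. a subtraction.
    static String stripControlKeepWhitespace(String s) {
        return s.replaceAll("[\\p{Cc}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        String input = "a\u0000b\tc\u0007d\ne";
        String cleaned = stripControlKeepWhitespace(input);

        // NUL and BEL are gone; the tab and the line feed survive.
        System.out.println(cleaned.equals("ab\tcd\ne")); // true
    }
}
```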

Examples

Here's a subtraction example:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[a-z&&[^aeiou]]", "_")
    );
    //   _e_u_a_ e___e__io__: _o_ _ou _a_e __o __o__e__!!

[…] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.

[a-z&&[^aeiou]] is the intersection of [a-z] with [^aeiou], i.e. [a-z] minus [aeiou]: all lowercase consonants.

The next example shows the negated whitelist approach:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[^a-z]", "_")
    );
    //   regular_expressions__now_you_have_two_problems__

Only lowercase letters a-z are legal; everything else is illegal.
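The same two approaches carry over to non-Latin text. The following sketch compares the `\p{Cc}` blacklist with a negated whitelist built from Unicode categories. The particular whitelist (letters, marks, numbers, punctuation, space separators) is only an assumption about what might count as "legal", and the class and method names are invented for the example.

```java
public class UnicodeWhitelistDemo {
    // Blacklist: remove only control characters (category Cc).
    static String stripControl(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    // Negated whitelist (assumed set): keep letters, combining marks, numbers,
    // punctuation and space separators; remove everything else.
    static String keepTextOnly(String s) {
        return s.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Zs}]", "");
    }

    public static void main(String[] args) {
        // A Chinese greeting, a comma, a space, an Arabic greeting, '!',
        // followed by two control characters (NUL and BEL).
        String text = "\u4f60\u597d, \u0645\u0631\u062d\u0628\u0627!";
        String dirty = text + "\u0000\u0007";

        System.out.println(stripControl(dirty).equals(text)); // true
        System.out.println(keepTextOnly(dirty).equals(text)); // true
    }
}
```

The whitelist is stricter: it would also drop symbols such as `+` or `$` (category S), which the `\p{Cc}` blacklist leaves alone.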

polygenelubricants 2010-08-09 10:39:51

The problem is that I am going to use Chinese, Arabic, every possible UTF-8 character :) I will try with `\p{Cc}`!

Scorpi0 2010-08-09 11:52:43

@Scorpi0: the above are just examples. Find whatever Unicode category/block you want to black/white-list and compose the regex as you wish using elements shown here.

polygenelubricants 2010-08-09 12:24:58

Oh, `\p{Cc}`, one more [undocumented](http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html) pattern expression. Nice one. Good to know.

BalusC 2010-08-09 15:16:53

@BalusC: I'm no Unicode expert, but I think it is documented: "Categories may be specified with the optional prefix `Is`: Both `\p{L}` and `\p{IsL}` denote the category of Unicode letters." Replace `L` with `Cc`, or any other category name.

polygenelubricants 2010-08-09 15:46:23

Oh, it works that way! Thank you, regex expert :)

BalusC 2010-08-09 15:49:11

It is exactly what I need, ty!

Scorpi0 2010-08-10 08:16:54
