ansaurus

Question

Answer 1

+1 A:

Look up the Unicode character values, and use literals of the form \uxxxx.

U+00E1 is a with an acute accent, e.g.

char aacute = '\u00e1';

The next question is where your string came from. Are you sure it has these characters? As composed characters? Better print some out in hex and have a look.

You might need to normalize (with java.text.Normalizer in Java 1.6 or later, or with icu4j).
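A minimal sketch of what normalizing buys you, assuming Java 1.6+ and `java.text.Normalizer`. The class and method names (`NormalizeDemo`, `toNfc`) are made up for illustration. It shows that a composed à and a decomposed a + combining grave are different strings until both are brought to the same form:

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Compose base letter + combining mark sequences into single code points (NFC).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00e0";     // a with grave as one code point
        String decomposed = "a\u0300";  // 'a' followed by COMBINING GRAVE ACCENT

        System.out.println(composed.equals(decomposed));        // false
        System.out.println(composed.equals(toNfc(decomposed))); // true
    }
}
```

If your input might be decomposed, normalize it once before searching it for accented characters.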

bmargulies 2009-12-21 20:24:24

Answer 2

+5 A:

For Unicode characters to work, you must be certain that javac reads the source file in the same encoding it was written in.

You will save yourself a lot of trouble by just using the \uXXXX notation.
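For example, a small sketch of an accented-character lookup written only with escapes, so the source file is plain ASCII and compiles the same under any `-encoding`. The class, field, and method names here are hypothetical:

```java
class AccentEscapes {
    // a with grave, acute, circumflex and diaeresis, written as Unicode escapes
    static final char[] ACCENTED_A = {'\u00e0', '\u00e1', '\u00e2', '\u00e4'};

    // True if the string contains any of the characters listed above.
    static boolean containsAccentedA(String s) {
        for (char c : ACCENTED_A) {
            if (s.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsAccentedA("voil\u00e0")); // true
        System.out.println(containsAccentedA("voila"));      // false
    }
}
```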

Thorbjørn Ravn Andersen 2009-12-21 20:29:06

So, what am I to do? Change the file encoding? If so, how?

OscarRyz 2009-12-21 21:56:03

Answer 3

+1 A:

This seemed to work for me in a quick test:

static char[] a = {'à', 'á', 'â', 'ä'};

public static boolean foundMatch(String s) {
    // Return true as soon as any character from the array is found in s.
    for (int i = 0; i < a.length; i++) {
        if (s.contains(String.valueOf(a[i]))) {
            return true;
        }
    }
    return false;
}

curtisk 2009-12-21 20:41:02

I guess it depends on using the right tools, in this case a Unicode-capable text editor. (AFAIR, Java works mostly in Unicode, but I'm not so sure about the compiler.)

Javier 2009-12-21 20:43:33

You can tell the compiler about the encoding used in your sources by passing the `-encoding` option, e.g. `javac -encoding utf8 ...`

Dirk 2009-12-21 21:02:48

@curtisk: Yep, that's what I thought. Actually, that's how it would compile on my old Windows machine. It turns out my new compiler wasn't using UTF-8.

OscarRyz 2009-12-21 22:14:01

Answer 4

A:

You don't mention what you need to accomplish (i.e. why you need to find accented characters in a string), so I'll hazard a guess that you need to do more than merely check whether accented characters are present in a piece of input. At the risk of telling you something you already know:

If you need to filter them out of a text string I recommend you use whitelisting instead of blacklisting.
If you need to sort them alphabetically regardless of accentuation, use java.text.Collator instead of a roll-your-own system.
If you need to replace the accented characters by their 'base' characters, the Collator should again be of help (the decomposition stuff inside it), but I haven't done this before, so I can't tell you how to do so exactly.
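A rough sketch of the last two points. The sorting part uses java.text.Collator as suggested. For the base-character replacement I'm swapping in java.text.Normalizer (Java 1.6+) rather than the Collator's internal decomposition, since that is the approach I know works. The names `AccentTools`, `sortIgnoringAccents` and `stripAccents` are made up for illustration:

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Locale;

class AccentTools {
    // Sort with locale-aware collation; PRIMARY strength ignores accent and case differences.
    static String[] sortIgnoringAccents(String[] words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        String[] copy = words.clone();
        Arrays.sort(copy, collator);
        return copy;
    }

    // Decompose to NFD, then drop the combining marks, leaving the base letters.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String[] words = {"zebra", "\u00e9clair", "apple"};
        System.out.println(Arrays.toString(sortIgnoringAccents(words, Locale.ENGLISH)));
        System.out.println(stripAccents("\u00e0\u00e1\u00e2\u00e4")); // aaaa
    }
}
```

Note that stripping marks only handles characters that have a decomposition. Letters such as ø or ß are left unchanged by this approach.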

Barend 2009-12-21 20:48:20

I'm posting an answer to someone new to programming. This is more for learning purposes. The original answer is here: http://stackoverflow.com/questions/1941899/i-need-a-function-to-convert-lower-case-to-upper-case-in-java/1942323#1942323 Thanks for the answer.

OscarRyz 2009-12-21 22:13:02

Answer 5

+4 A:

The code should be compiled with the correct encoding:

javac -encoding UTF-8 Foo.java

There'll be an encoding mismatch there somewhere.

public class Foo {
  char [] a = {'à', 'á', 'â', 'ä' };  
}

The above code saved as UTF-8 should become the hex dump:

70 75 62 6C 69 63 20 63 6C 61 73 73 20 46 6F 6F         public class Foo
20 7B 0D 0A 20 20 63 68 61 72 20 5B 5D 20 61 20          {__  char [] a
3D 20 7B 27 C3 A0 27 2C 20 27 C3 A1 27 2C 20 27         = {'__', '__', '
C3 A2 27 2C 20 27 C3 A4 27 20 7D 3B 20 20 0D 0A         __', '__' };  __
7D 0D 0A 0D 0A                                          }____

The UTF-8 value for code point U+00E0 (à) is C3 A0.
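If you want to check that from Java itself rather than from a hex editor, something like the following should do. It's a sketch that assumes Java 7+ for `StandardCharsets` (on older versions, `getBytes("UTF-8")` is the equivalent); `Utf8Bytes` and `utf8Hex` are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Encode the string as UTF-8 and render each byte as two uppercase hex digits.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00e0")); // C3 A0
        System.out.println(utf8Hex("\u00e4")); // C3 A4
    }
}
```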

There is an outside chance that à will be represented by the combining sequence U+0061 U+0300. This is the NFD form (I've never come across a text editor that used it as a default for text entry). As Thorbjørn Ravn Andersen points out, it is often better to always use \uXXXX escape sequences - it is less ambiguous.

You also need to check your input device (file/console/etc.)

As a last resort, you can dump your chars as hex with System.out.format("%04x", (int) c); and then look the values up in a character inspector to find out what they are.
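A small sketch of that kind of dump applied to a whole string (the `CharDump` class and `dump` method are made-up names). It makes the difference between a composed and a decomposed à visible at a glance:

```java
class CharDump {
    // Render every UTF-16 char of the string as four hex digits, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("\u00e0"));  // 00e0      (composed form)
        System.out.println(dump("a\u0300")); // 0061 0300 (decomposed form)
    }
}
```

If the dump shows something like 00c3 00a0 where you expected 00e0, the UTF-8 bytes were most likely decoded as a single-byte encoding somewhere along the way.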

McDowell 2009-12-21 21:45:40

Excellent explanation. In a nutshell: save (and compile) file as UTF-8.

BalusC 2009-12-21 21:48:16

Accentuated literals in Java