views:

237

answers:

5

I tried to type char literals for accented vowels in Java, but the compiler says something like: unclosed character literal

This is what I'm trying to do:

 char [] a = {'à', 'á', 'â', 'ä' };

I've tried using Unicode escapes such as '\u00E0', but for some reason they don't match in my code:

 for( char c : string.toCharArray() ) {
     for( char accented : a ) {
         if( c == accented ) {
             // I've found a funny letter
         }
     }
 }

The if never evaluates to true, no matter what I put in my string.

Here's the complete program I'm trying to code.

+1  A: 

Look up the Unicode character values, and use literals of the form \uxxxx.

U+00E1 is a with an acute accent, e.g.

char aacute = '\u00e1';

The next question is where your string came from. Are you sure it has these characters? As composed characters? Better print some out in hex and have a look.

You might need to normalize the string (with java.text.Normalizer in Java 1.6, or with icu4j).
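
A minimal sketch of the normalization step, assuming Java 1.6 or later and java.text.Normalizer. The class and method names (NormalizeDemo, compose) are made up for illustration. It shows a decomposed "a + combining grave" pair being folded into the single char U+00E0, which is what a comparison against '\u00e0' needs:

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Hypothetical helper: returns the canonical composed (NFC) form of s.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'a' followed by COMBINING GRAVE ACCENT: two chars that render as one letter
        String decomposed = "a\u0300";
        String composed = compose(decomposed);

        System.out.println(decomposed.length());            // 2
        System.out.println(composed.length());              // 1
        System.out.println(composed.charAt(0) == '\u00e0'); // true
    }
}
```

If the input string is already composed, the call leaves it unchanged, so it should be safe to apply before comparing.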

bmargulies
+5  A: 

For Unicode characters to work, you must be certain that javac reads the source file in the same encoding in which it was saved.

You will save yourself a lot of trouble by just using the \uXXXX notation.
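
As a rough sketch of what that looks like for the array in the question (the class and method names here are hypothetical), the source below contains only ASCII, so it should compile the same way whatever encoding javac assumes:

```java
class AccentFinder {
    // a-grave, a-acute, a-circumflex, a-diaeresis, written as Unicode escapes
    static final char[] ACCENTED = {'\u00e0', '\u00e1', '\u00e2', '\u00e4'};

    // Returns true if s contains any of the chars listed in ACCENTED.
    static boolean containsAccentedA(String s) {
        for (char c : s.toCharArray()) {
            for (char accented : ACCENTED) {
                if (c == accented) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsAccentedA("voil\u00e0")); // true
        System.out.println(containsAccentedA("voila"));      // false
    }
}
```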

Thorbjørn Ravn Andersen
So, what am I to do? Change the file encoding? If so, how?
OscarRyz
+1  A: 

This seemed to work for me in a quick test:

static char [] a = {'à', 'á', 'â', 'ä' };

public static boolean foundMatch(String s){
    // Return as soon as any of the accented chars is found in s
    for(int i=0;i < a.length;i++){
        String t = String.valueOf(a[i]);
        if (s.contains(t)) return true;
    }
    return false;
}
curtisk
I guess it depends on using the right tools; in this case, a Unicode-capable text editor. (AFAIR, Java works mostly in Unicode, but I'm not so sure about the compiler.)
Javier
You can tell the compiler about the encoding used in your sources by passing the `-encoding` option, e.g. `javac -encoding utf8 ...`
Dirk
@curtisk: Yeap, that's what I thought. Actually that's how it would compile on my old Windows machine. It turns out my new compiler wasn't using UTF-8
OscarRyz
A: 

You don't mention what you need to accomplish (i.e. why you need to find accented characters in a string), so I'll hazard a guess that you need to do more than merely check whether accented characters are present in a piece of input. At the risk of telling you something you already know:

  • If you need to filter them out of a text string I recommend you use whitelisting instead of blacklisting.
  • If you need to sort them alphabetically regardless of accentuation, use java.text.Collator instead of a roll-your-own system.
  • If you need to replace the accented characters by their 'base' characters, the Collator should again be of help (the decomposition stuff inside it), but I haven't done this before, so I can't tell you how to do so exactly.
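
A hedged sketch of the last two points. The sorting part uses java.text.Collator as suggested. For the 'base' characters it swaps in java.text.Normalizer (decompose, then drop the combining marks) rather than anything inside Collator, which is my substitution and not something this answer confirms. The class and method names are hypothetical, and Locale.ROOT is an arbitrary choice:

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Locale;

class AccentTools {
    // Sorts a copy of words so that accents do not affect the order.
    static String[] sortIgnoringAccents(String[] words) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.PRIMARY); // compare base letters only
        String[] copy = words.clone();
        Arrays.sort(copy, collator);
        return copy;
    }

    // Decomposes to NFD, then removes the combining marks (Unicode category M).
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String[] words = {"zebra", "\u00e9cole", "eau"};
        System.out.println(Arrays.toString(sortIgnoringAccents(words)));
        System.out.println(stripAccents("\u00e0\u00e1\u00e2\u00e4")); // aaaa
    }
}
```

A plain Arrays.sort on the same array would place the accented word after "zebra", because it compares raw char values.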
Barend
I'm posting an answer to someone new to programming. This is more for learning purposes. The original answer is here: http://stackoverflow.com/questions/1941899/i-need-a-function-to-convert-lower-case-to-upper-case-in-java/1942323#1942323 Thanks for the answer.
OscarRyz
+4  A: 

The code should be compiled with the correct encoding:

javac -encoding UTF-8 Foo.java


There is probably an encoding mismatch somewhere between the editor that saved the file and the compiler that reads it.

public class Foo {
  char [] a = {'à', 'á', 'â', 'ä' };  
}

The above code, saved as UTF-8, should produce this hex dump:

70 75 62 6C 69 63 20 63 6C 61 73 73 20 46 6F 6F         public class Foo
20 7B 0D 0A 20 20 63 68 61 72 20 5B 5D 20 61 20          {__  char [] a
3D 20 7B 27 C3 A0 27 2C 20 27 C3 A1 27 2C 20 27         = {'__', '__', '
C3 A2 27 2C 20 27 C3 A4 27 20 7D 3B 20 20 0D 0A         __', '__' };  __
7D 0D 0A 0D 0A                                          }____

The UTF-8 value for code point U+00E0 (à) is C3 A0.
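
A small sketch to check that value from Java itself, assuming Java 7 or later for StandardCharsets (the class and helper names are made up). It also decodes the same two bytes as ISO-8859-1 to show that a single-byte encoding sees two characters where UTF-8 sees one, which would plausibly explain an "unclosed character literal" error when the compiler assumes the wrong encoding:

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes s as UTF-8, then decodes those bytes with the wrong charset.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(hex("\u00e0".getBytes(StandardCharsets.UTF_8))); // C3 A0
        System.out.println(misread("\u00e0").length());                     // 2
    }
}
```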

There is an outside chance that à will be represented by the combining sequence U+0061 U+0300. This is the NFD form (I've never come across a text editor that used it as a default for text entry). As Thorbjørn Ravn Andersen points out, it is often better to always use \uXXXX escape sequences - it is less ambiguous.

You also need to check the encoding of your input source (file, console, etc.), since the string you are searching has to be decoded correctly too.

As a last resort, you can dump your chars as hex with System.out.format("%04x", (int) c); and try manually decoding them with a character inspector to find out what they are.
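
A minimal sketch of that hex dump wrapped in a helper (the CharDump and toHex names are hypothetical), so that a composed and a decomposed à can be told apart at a glance:

```java
class CharDump {
    // Returns each char of s as four hex digits, separated by spaces.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("\u00e0"));  // 00e0       (composed form)
        System.out.println(toHex("a\u0300")); // 0061 0300  (decomposed form)
    }
}
```

If the dump of your input shows something like 00c3 00a0 instead, the input was most likely UTF-8 decoded with a single-byte encoding.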

McDowell
Excellent explanation. In a nutshell: save (and compile) file as UTF-8.
BalusC