Hi, I am trying to read some sentences from a file that contains Unicode characters. It does print out a string, but for some reason the Unicode characters come out garbled.

This is the code I have:

public static String readSentence(String resourceName) {

    String sentence = null;
    try {
        InputStream refStream = ClassLoader
                .getSystemResourceAsStream(resourceName);
        BufferedReader br = new BufferedReader(new InputStreamReader(
                refStream, Charset.forName("UTF-8")));
        sentence = br.readLine();
    } catch (IOException e) {
        throw new RuntimeException("Cannot read sentence: " + resourceName);
    }
    return sentence.trim();
}
+1  A: 

First, you could create the InputStreamReader as

new InputStreamReader(refStream, "UTF-8")

Also, you should verify that the resource really contains UTF-8 content.
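
One way to check this is to decode the raw bytes with a strict decoder. This is only a minimal sketch, assuming Java 7 or later for StandardCharsets. The class name Utf8Check and the method isValidUtf8 are made up for illustration. An InputStreamReader silently replaces malformed bytes, while a CharsetDecoder set to REPORT throws on them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // Hypothetical helper: returns true only if the bytes are well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is not valid UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 bytes valid:   " + isValidUtf8(utf8));
        System.out.println("Latin-1 bytes valid: " + isValidUtf8(latin1));
    }
}
```

Note that a file can pass this check and still contain unexpected characters, since it only tests whether the bytes are well formed.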

mklhmnn
Hi, thanks for your help. I have checked using the getEncoding method, and it tells me that it's UTF8. I have changed the InputStreamReader and it is still the same.
Lezan
+1  A: 

One of the most annoying possible causes is your IDE settings.

If your IDE's default console encoding is something like latin1, you will struggle for a long time with different variations of the Java code, and nothing will help until you set the IDE options correctly.
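
To rule out the Java side of the output, you could print through a PrintStream with an explicit encoding instead of relying on the platform default. This is just a sketch. The class name Utf8Console and the helper utf8 are made up for illustration, and the console itself still has to be configured to display UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {

    // Hypothetical helper: wraps a stream so that text is always encoded as UTF-8.
    public static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("na\u00efve caf\u00e9");
    }
}
```

If the text still looks wrong with this in place, the console or font is the more likely culprit.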

Roman
Hiya. I have checked, because I have had a similar error before. I checked the properties again and it says Inherited from Container (UTF-8).
Lezan
+1  A: 

The problem is probably in the way that the string is being output.

I suggest that you confirm that you are correctly reading the Unicode characters by doing something like this:

// Print each char together with its numeric value
for (char ch : sentence.toCharArray()) {
    System.err.println("char '" + ch + "' is unicode codepoint " + ((int) ch));
}

and see if the Unicode codepoints are correct for the characters that are being messed up. If they are correct, then the problem is on the output side; if not, it is on the input side.

Stephen C
Thank you, I have just tried that, and so far they all have the correct Unicode codepoints. So what can I do if the problem is on the output side?
Lezan
Thank you... I kept going through different bits of code, and it turned out to be the annoying non-breaking space character! Thank you so much.
Lezan
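
For anyone hitting the same thing: String.trim() only strips characters up to U+0020, so a non-breaking space (U+00A0) survives the trim() call in readSentence. Below is a minimal sketch of one way to normalize it, assuming the non-breaking spaces should simply be treated as ordinary spaces. The class name NbspTrim and the method trimWithNbsp are made up for illustration.

```java
public class NbspTrim {

    private static final char NBSP = '\u00A0';

    // Hypothetical helper: turns non-breaking spaces into ordinary spaces, then trims.
    public static String trimWithNbsp(String s) {
        return s.replace(NBSP, ' ').trim();
    }

    public static void main(String[] args) {
        String raw = "\u00A0hello\u00A0world\u00A0";
        // Plain trim() leaves the non-breaking spaces in place.
        System.out.println("[" + raw.trim() + "]");
        // After normalizing, the result is "hello world" with ordinary spaces.
        System.out.println("[" + trimWithNbsp(raw) + "]");
    }
}
```

Character.isWhitespace also returns false for U+00A0, so checks based on it will miss the character as well.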