ansaurus

Question

Reading unicode character in java

Answer 1

A:

I think it's just "UTF8", not "UTF-8".

Here I saw it: Source

InsertNickHere 2010-09-02 19:48:56

UTF-8 vs UTF8 depends upon whether you use java.io or java.nio. In my experience it doesn't really matter either way.

Mondain 2010-09-02 19:55:09

When I gave you said downvote, I was looking at the docs for `java.nio.charset.Charset` (used in one of the other constructors for `InputStreamReader`), which lists it as `UTF-8`. However, since you've edited your answer, I was able to cancel said downvote.

R. Bemrose 2010-09-02 20:11:47

`UTF-8` is the display name for the encoding; `UTF8` is an alias. They are equivalent.
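A quick way to check the alias claim yourself is to resolve both names through `java.nio.charset.Charset`. This is only a minimal sketch; the class name `CharsetAliasDemo` and the helper `sameCharset` are made up for illustration.

```java
import java.nio.charset.Charset;

class CharsetAliasDemo {
    // Returns true when both names resolve to the same Charset object.
    static boolean sameCharset(String a, String b) {
        return Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        // The alias resolves to the charset whose canonical name is "UTF-8".
        System.out.println(Charset.forName("UTF8").name());  // prints UTF-8
        System.out.println(sameCharset("UTF8", "UTF-8"));    // prints true
    }
}
```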

Richard Fearn 2010-09-02 20:12:29

@R. Bemrose Ok! @Richard Got it :)

InsertNickHere 2010-09-03 06:47:05

Answer 2

A:

It sounds as though your file literally contains the text z\u0142o\u017Cy\u014, i.e. has Unicode escape sequences in it.

There's probably a library for decoding these, but you could do it yourself. According to the Java Language Specification, an escape sequence has the form \uxxxx (the specification also permits the u to be repeated), so you could take the 4-digit hex value xxxx, convert it to an integer with Integer.parseInt using radix 16, cast that to a char, and finally replace the whole \uxxxx sequence with the character.
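The steps above could be sketched roughly as follows. The class `UnicodeUnescaper` and its methods are hypothetical names, and the sketch makes simplifying assumptions: it handles only a single `u` followed by exactly four hex digits, does not treat a doubled backslash specially, and copies anything malformed or truncated through unchanged.

```java
class UnicodeUnescaper {
    // Replaces each backslash-u-plus-four-hex-digits sequence with the
    // UTF-16 code unit it names; everything else is copied as-is.
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            if (isEscapeAt(input, i)) {
                // Parse the four hex digits after the backslash and the 'u'.
                int codeUnit = Integer.parseInt(input.substring(i + 2, i + 6), 16);
                out.append((char) codeUnit);
                i += 6;
            } else {
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if a complete six-character escape starts at index i.
    private static boolean isEscapeAt(String s, int i) {
        if (i + 6 > s.length() || s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') {
            return false;
        }
        for (int k = i + 2; k < i + 6; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `UnicodeUnescaper.unescape(line)` on each line read from the file would turn the escaped text into the real characters, while a line with no escapes comes back unchanged.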

Richard Fearn 2010-09-02 19:54:59

Answer 3

+2 A:

Your code should be correct, but I guess that the file "a.txt" does not contain the Unicode characters encoded with UTF-8, but the escaped string "\u0142o\u017Cy\u0142".

Please check whether the text file is correct, using a UTF-8-aware editor such as recent versions of Notepad or Notepad++ on Windows. Or open it in your favorite hex editor - it should not contain backslashes.

I tried it with "€" as UTF-8-encoded content of the file and it gets printed correctly. Note that not all Unicode characters can be printed, depending on your terminal encoding (really a hassle on Windows) and font.
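If no hex editor is at hand, the difference between the two kinds of file content can be illustrated in a few lines of Java. This is just a sketch; `HexDumpDemo` and `toHex` are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

class HexDumpDemo {
    // Formats bytes as space-separated two-digit uppercase hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The real character U+0142 encoded as UTF-8: two bytes.
        System.out.println(toHex("\u0142".getBytes(StandardCharsets.UTF_8)));  // prints C5 82
        // The six-character escaped text: a backslash (5C) followed by ASCII.
        System.out.println(toHex("\\u0142".getBytes(StandardCharsets.UTF_8))); // prints 5C 75 30 31 34 32
    }
}
```

Seeing `5C` bytes in the dump of a.txt would indicate that the file holds escape sequences rather than UTF-8-encoded characters.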

AndiDog 2010-09-02 19:56:01

Answer 4

+2 A:

Java interprets Unicode escapes in the source code, such as your \u0142, as if you had actually typed that character (LATIN SMALL LETTER L WITH STROKE) into the source. Java does not interpret Unicode escapes that it reads from a file.

If you take your String str = "\u0142o\u017Cy\u0142"; and write it to a file a.txt from your Java program, then open the file in an editor, you'll see the characters themselves in the file, not the \uNNNN sequence.

If you then take your original posted program and read that a.txt file you should see what you expected.
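The experiment described above might look like this. It is a sketch only: it writes to a temporary file rather than a.txt, names the encoding explicitly on both sides, and the class and method names are invented.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RoundTripDemo {
    // Writes text to a temporary file as UTF-8 and reads the first line back.
    static String writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("a", ".txt");
            try {
                try (Writer w = new OutputStreamWriter(
                        new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
                    w.write(text);
                }
                try (BufferedReader r = new BufferedReader(new InputStreamReader(
                        new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
                    return r.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The compiler turns the escapes in this literal into real characters,
        // so the file receives the characters themselves, not backslashes.
        String str = "\u0142o\u017Cy\u0142";
        System.out.println(str.equals(writeAndReadBack(str)));  // prints true
    }
}
```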

Stephen P 2010-09-02 19:56:26

Rakesh 2010-09-02 20:17:39

@Rakesh - as BalusC mentioned in his answer, `java.util.Properties` has a `loadConvert()` method to do that conversion. My point is that simply reading from a file doesn't do that conversion.

Stephen P 2010-09-02 20:47:11

Answer 5

+1 A:

So, you want to unescape Unicode code points? There is no public API available for this. java.util.Properties has a loadConvert() method which does exactly this, but it's private. Check the Java source if you'd like to reuse it. It does the conversion by simple parsing. I wouldn't use regex for this, since that is error-prone in some very specific circumstances.

Alternatively, you should probably be using java.util.Properties or its i18n counterpart java.util.ResourceBundle with a .properties file instead of a plain .txt file.
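As a rough illustration of the Properties route: the public `load(Reader)` method decodes the escapes in keys and values while parsing. The key name `word` and the class `PropertiesUnescapeDemo` are assumptions made up for this sketch, and the text is fed from a `StringReader` instead of a real .properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesUnescapeDemo {
    // Parses properties-format text and returns the value stored under key.
    static String loadValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The double backslashes produce literal backslash characters,
        // as they would appear in an escaped .properties file.
        String text = "word=z\\u0142o\\u017Cy\\u0142";
        System.out.println(loadValue(text, "word").length());  // prints 5
    }
}
```

Note that `load` throws an `IllegalArgumentException` on a malformed escape, so a file with truncated sequences would need cleaning up first.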

Then you can use the line-oriented BufferedReader to read each line. FileInputStream is low-level I/O that you should avoid. You're writing characters to your file, not bytes, so the best approach is to use character streams for writing and reading, unless you need to write bytes or binary data.

Alex 2010-09-03 04:49:54

I forgot to mention: try viewing your a.txt file in hex and see what you get. You'll understand better, from a low-level perspective, how these things work.

Alex 2010-09-03 04:52:44

Say what? He **is** using a character stream: an InputStreamReader. And character streams have to be built on top of byte streams, so he uses a FileInputStream. The way he's doing it is exactly right. If anything, we should advise people *not* to use FileReader, because it relies on a platform-dependent system property (i.e., the default encoding).

Alan Moore 2010-09-03 17:41:57
