Avoid printing unicode replacement character in Java


In Java, why does Character.toString((char) 65533) print out this symbol: � ?

I have a Java program which prints these characters all over the place. It's a big program. Any ideas on what I can do to avoid this?

+3 A:

There is no Unicode character U+FFFD. Hence, the code is logically incorrect. The intended use of the Unicode Replacement Symbol is to be substituted for bad input (such as (char)65533).

How to fix it: don't put junk in strings. Strings are for text. Bytes are for random binary data.
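One way to act on this, as a rough sketch: decode incoming bytes strictly, so that malformed input raises an error at the point where it enters the program instead of quietly turning into U+FFFD. The class and method names below (StrictDecode, decodeStrict) are made up for illustration, and UTF-8 is only an assumed input encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class StrictDecode {

    // Decodes bytes with the given charset and fails loudly on bad input
    // rather than substituting the replacement character.
    static String decodeStrict(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid " + charset, e);
        }
    }

    public static void main(String[] args) {
        byte[] good = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeStrict(good, StandardCharsets.UTF_8));

        // 0xFF can never appear in well-formed UTF-8, so this call throws.
        byte[] bad = { 'a', (byte) 0xFF };
        try {
            decodeStrict(bad, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Failing early like this tends to point at the place where the wrong encoding is being assumed, which is usually easier to fix than filtering the characters out afterwards.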

MSalters 2009-12-02 11:26:36

This one says there is: http://www.fileformat.info/info/unicode/char/fffd/index.htm

manu1001 2009-12-02 11:49:08

One can argue whether the value representing an "invalid input character" is itself a valid character. It is not a letter, not a digit, not punctuation, not a mathematical symbol, etc.

MSalters 2009-12-02 13:23:53

+1 A:

Well, what do you want it to do? If you're getting these characters "all over the place" I suspect you have bad data... it should be pretty rare that you receive data which can't be represented in Unicode.

How are you getting the data to start with?

Jon Skeet 2009-12-02 11:27:13

Well, one place where I'm getting this data is from RSS feeds.

manu1001 2009-12-02 11:47:46

That suggests that you're using the wrong encoding.

Jon Skeet 2009-12-02 12:59:03

+1 A:

One of the most likely scenarios is that you are trying to read ISO-8859 data using the UTF-8 character set. If the decoder comes across a sequence of bytes that is not valid UTF-8, it replaces that sequence with the � symbol.

Check your input streams, and ensure that you read them using the correct character set.
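To illustrate the scenario, here is a minimal sketch that takes the ISO-8859-1 bytes for "café" and decodes them once as UTF-8 and once as ISO-8859-1. The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class WrongCharsetDemo {

    // Decodes the bytes as UTF-8; malformed sequences become U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes the bytes as ISO-8859-1; every byte maps to one character.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // In ISO-8859-1 the letter é is the single byte 0xE9,
        // which is not a complete UTF-8 sequence on its own.
        byte[] latin1Bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsUtf8(latin1Bytes);
        String right = decodeAsLatin1(latin1Bytes);

        System.out.println("Contains U+FFFD when read as UTF-8: "
                + (wrong.indexOf('\uFFFD') >= 0));
        System.out.println("Correct when read as ISO-8859-1: "
                + right.equals("caf\u00e9"));
    }
}
```

Once the replacement character is in the string, the original byte is gone, so the fix has to happen where the bytes are decoded.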

Paul Wagland 2009-12-02 11:31:39

+1 A:

In Java, why does Character.toString((char) 65533) print out this symbol: � ?

Because exactly this character IS associated with that code point. It does not display a random character as you seem to think.
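As a quick check, 65533 is 0xFFFD in hexadecimal, and the JDK can report the Unicode name of that code point. The helper below is a small sketch with an invented class name; Character.getName is a standard method (Java 7 and later).

```java
// Hypothetical helper, for illustration only.
final class ReplacementCharInfo {

    // Formats a char as "U+XXXX NAME" using the JDK's Unicode name table.
    static String describe(char c) {
        return String.format("U+%04X %s", (int) c, Character.getName(c));
    }

    public static void main(String[] args) {
        System.out.println(describe((char) 65533)); // U+FFFD REPLACEMENT CHARACTER
    }
}
```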

I have a Java program which prints these characters all over the place. It's a big program. Any ideas on what I can do to avoid this?

Your problem lies somewhere else. At the very least, it boils down to this: every step that involves a byte-to-char conversion (storing text in a file/DB, reading text from a file/DB, manipulating text, transferring text, displaying text, etcetera) should be set to use UTF-8.
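In plain Java I/O, that usually means never relying on the platform default charset and always passing one explicitly to readers and writers. A minimal sketch, using in-memory streams so it runs on its own (the class and method names are invented for the example; real code would wrap file, socket, or HTTP streams the same way):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ExplicitCharsetIo {

    // Char-to-byte step: the writer is told to use UTF-8 explicitly.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Byte-to-char step: the reader is told to use UTF-8 explicitly.
    static String readUtf8(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "na\u00efve caf\u00e9";
        String roundTripped = readUtf8(writeUtf8(original));
        System.out.println(original.equals(roundTripped)); // true
    }
}
```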

What catches my eye is that Java does nothing special with 0xFFFD when writing output: it just replaces characters the target encoding cannot represent with a question mark (?), yet you keep insisting that 0xFFFD comes from Java. I know that Firefox does exactly what you describe, so are you maybe confusing "Firefox" with "Java"?
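The question-mark behaviour on the encoding side can be seen with String.getBytes, which substitutes the charset's default replacement for anything it cannot map. A small sketch with an invented class name follows. Note that the decoding direction is different: new String(bytes, charset) does substitute U+FFFD for malformed input, so the character can originate inside Java as well.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class UnmappableDemo {

    // Encodes text as ISO-8859-1; characters outside that charset
    // are replaced by the charset's replacement byte, which is '?'.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] encoded = encodeLatin1("\uFFFD");
        System.out.println((char) encoded[0]); // ?
    }
}
```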

If this is true and you're actually talking about a Java web application, then you need to set at least the HTTP response encoding to UTF-8. You can do that by putting <%@ page pageEncoding="UTF-8" %> at the top of the JSP page in question. You may find this article useful for more background information and a detailed overview of all the steps and solutions you need to apply to solve this "Unicode problem".

BalusC 2009-12-02 12:23:54

+1 A:

Have a look at this primer on character encodings.

kem 2009-12-02 23:39:50
