When encoding a Java String to Latin-1 (i.e. charset ISO-8859-1), I currently convert the symbol β ('\u03B2') to the German ß ('\u00DF') before performing the encoding. I'm trying to avoid a question mark in the encoded output where possible.

Can anyone suggest other un-encodable characters which can be replaced with an encodable character? Or better yet, a Java library that does it for me?

Update: A bit of background: I have a Java program which exports its data to CSV files so they can be read into a third-party application. A customer has complained that some characters are not converted; he gave me the example of "straβe". Although technically β is the Greek symbol for beta, a quick Google search shows quite a few people use it to mean ß.

+2  A: 
erickson
+2  A: 

First, are you sure your input text is correctly entered or encoded?

U+03B2 is "GREEK SMALL LETTER BETA", not the German eszett.

U+00DF is the eszett, or "LATIN SMALL LETTER SHARP S".

Java can map the latter to ISO-8859-1 because it's defined in http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT .
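
To see the difference for yourself, here is a minimal sketch (the class and method names are made up for illustration). It encodes each character with `String.getBytes(Charset)`, which substitutes the charset's replacement byte for anything it cannot map; for ISO-8859-1 that is `?` (0x3F).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class Latin1EncodeDemo {

    // Encodes text as ISO-8859-1 and returns the resulting bytes as hex.
    static String toLatin1Hex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLatin1Hex("\u00DF")); // eszett: prints DF
        System.out.println(toLatin1Hex("\u03B2")); // Greek beta: prints 3F, i.e. '?'
    }
}
```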

There is no way to solve this problem generally: the whole point of Unicode is that it contains lots of characters that simply cannot be represented in ISO-8859-*.

I suggest producing a list of all Unicode characters in your data that are not listed in the http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT document. Then, for each unmapped character, you will have to choose an appropriate substitution from the ISO-8859-1 range by hand/eye.
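
A rough sketch of that approach, assuming you can get the text into a String. Rather than parsing the mapping file, it asks Java's own ISO-8859-1 encoder via `CharsetEncoder.canEncode`. The class name, method names, and the sample substitution table are all hypothetical; only the β to ß entry comes from the question, the en dash entry is just an illustration of a hand-picked replacement.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, not part of any library.
final class Latin1Audit {

    // Collects every code point in the text that ISO-8859-1 cannot encode.
    static Set<Integer> findUnmappable(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        Set<Integer> unmappable = new TreeSet<>();
        text.codePoints().forEach(cp -> {
            if (!encoder.canEncode(new String(Character.toChars(cp)))) {
                unmappable.add(cp);
            }
        });
        return unmappable;
    }

    // Applies a hand-built substitution table; other code points pass through.
    static String substitute(String text, Map<Integer, String> substitutions) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String replacement = substitutions.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String data = "stra\u03B2e \u2013 caf\u00E9";

        // Step 1: list what needs a decision (here U+03B2 and U+2013).
        for (int cp : findUnmappable(data)) {
            System.out.printf("U+%04X%n", cp);
        }

        // Step 2: substitutions chosen by hand (illustrative only).
        Map<Integer, String> table = new HashMap<>();
        table.put(0x03B2, "\u00DF"); // Greek beta -> eszett, as in the question
        table.put(0x2013, "-");      // en dash -> hyphen-minus, an assumed choice
        System.out.println(substitute(data, table));
    }
}
```

Anything the table does not cover will still come out as a question mark when encoded, so re-running `findUnmappable` on the substituted text is a cheap way to check what is left.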

Joe Koberg
Thanks. Unfortunately I don't have access to my users' data, so I'll just have to wait for users to complain, then suggest they use the correct character!
Mark