Hello, I am having some problems getting some French text to convert to UTF8 so that it can be displayed properly, either in a console, text file or in a GUI element.

The original string is

HANDICAP╔ES

which is supposed to be

HANDICAPÉES

Here is a code snippet that shows how I am using the jackcess Database driver to read in the Access MDB file in an Eclipse/Linux environment.

Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
Iterator<Map<String, Object>> rowIter = table.iterator();
while (rowIter.hasNext()) {
    Map<String, Object> row = rowIter.next();
    // convert fields to UTF
    Map<String, Object> rowUTF = new HashMap<String, Object>();
    try {
        for (String key : row.keySet()) {
            Object o = row.get(key);
            if (o != null) {
                String valueCP850 = o.toString();
                // String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
                String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
                String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
                rowUTF.put(key, valueUTF8);
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Encoding exception: " + e);
    }   
}

In the code you'll see where I want to convert directly to UTF8, which doesn't seem to work, so I have to do a double conversion. Also note that there doesn't seem to be a way to specify the encoding type when using the jackcess driver.

Thanks, Cam

+6  A: 
String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES

This shows the correct string value. This means that the text was originally encoded with ISO-8859-1 and then incorrectly decoded as CP850 (as pointed out in a comment, the original encoding may also have been CP1252, a.k.a. Windows ANSI, since É has the same codepoint there as in ISO-8859-1).

Align your environment and binary pipelines so that they all use one and the same character encoding. You can't and shouldn't convert between them. You would risk losing information in the non-ASCII range that way.

Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.


Update: you are apparently still struggling with the problem. I'll repeat the important parts of the answer:

  1. Align your environment and binary pipelines so that they all use one and the same character encoding.

  2. You cannot and should not convert between them. You would risk losing information in the non-ASCII range that way.

  3. Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.

To fix the problem you need to choose character encoding X which you'd like to use throughout the entire application. I suggest UTF-8. Update MS Access to use encoding X. Update your development environment to use encoding X. Update the java.io readers and writers in your code to use encoding X. Update your editor to read/write files with encoding X. Update the application's user interface to use encoding X. Do not use Y or Z or whatever at some step. If the characters are already corrupted in some datastore (MS Access, files, etc), then you need to fix it by manually replacing the characters right there in the datastore. Do not use Java for this.

If you're actually using the "command prompt" as user interface, then you're actually lost. It doesn't support UTF-8. As suggested in the comments and in the article linked in the comments, you need to create a Swing application instead of relying on the restricted command prompt environment.

BalusC
Thanks for this reply. The data I am receiving is in an Access database, so I don't have control over how it was originally encoded. I guess I need to read it in and convert it to the proper format before doing anything. Also, we are trying to standardize and use UTF-8 for everything in our application. Does UTF-8 not support these characters?
You would need to instruct the JDBC driver and/or the database to use the proper encoding (the one which the database itself is using!). UTF-8 certainly supports those characters, but with a different binary representation, if you understand what I mean. Characters are, like everything else, transferred as bytes, simply because computers don't understand anything else. [This article](http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html) may help more in understanding the problem under the hood.
BalusC
Thank you for the information and for the link, that is a great article!
You're welcome.
BalusC
I am back with another question ... should I not be able to convert directly from the original encoding to UTF8? <code>String name = "HANDICAP╔ES"; String nameISO = new String(name.getBytes("CP850"), "ISO-8859-1"); String nameUTF8 = new String(name.getBytes("CP850"), "UTF8"); String nameUTF8_2 = new String(nameISO.getBytes(), "UTF8"); System.out.println("nameISO=" + nameISO); // works System.out.println("nameUTF8=" + nameUTF8); // does not work System.out.println("nameUTF8=" + nameUTF8_2); // works</code> Obviously I still don't get what's "under the hood". I will re-read your article now.
Sorry folks, I tried numerous times to figure out how to put the code into proper code formatting...but failed miserably.
You should keep and use one and the same encoding throughout all layers to avoid encoding problems. You should not convert from one to another. If the database contains information in encoding X, then you should display it using encoding X, not Y. When you process user input, you should process it using encoding X, not Y. If you need to change the encoding, you should change it at all layers of the application, including the database.
BalusC
Also carefully read the "Development Environment" part of the article linked above. The Windows Command Console doesn't support Unicode. Use Swing or an IDE, or just write to a text file.
BalusC
Dear BalusC, Thank you for the updated response. We ARE using a single encoding throughout our application, which is UTF8. However, as I explained in a previous comment, we don't have control over the creation of the Access DB file -- we get it from a third-party source, and there is no way to get them to fix their encoding problem. That is why I have to convert it from the broken encoding in the Access DB to UTF8, which is what the rest of our application uses. This import from the Access DB is the initial step in our application pipeline.
Then you need to configure the JDBC/ODBC driver to use the DB-specified encoding to read and store data, and cross your fingers (keep using UTF-8 in the rest of the application). But if the data is *already* corrupted (view it using the MS Access program), then you're lost.
BalusC
+2  A: 

New analysis, based on new information.
It looks like your problem is with the encoding of the text before it was stored in the Access DB. It seems it had been encoded as ISO-8859-1 or windows-1252, but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.

Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. And you're accomplishing that with this line:

String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");

getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that byte according to ISO-8859-1, resulting in the character É. The next line:
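
To make the byte-level step concrete, here is a minimal sketch of that one line in isolation. The class and method names are made up for illustration, and it assumes the JRE provides the IBM850 (CP850) charset, which most full JDK builds do. The non-ASCII characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeBytes {
    // Assumption: IBM850 (a.k.a. CP850) is available in this JRE.
    static final Charset CP850 = Charset.forName("IBM850");

    /** Re-encodes the garbled text with CP850, then decodes the same bytes as ISO-8859-1. */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(CP850);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+2554 is the box-drawing character stored in the database.
        String garbled = "HANDICAP\u2554ES";
        byte[] raw = garbled.getBytes(CP850);

        // The ninth byte (index 8) is the one that carries the damaged character.
        System.out.printf("byte at index 8: 0x%02X%n", raw[8] & 0xFF);

        // U+00C9 is E with acute accent, the character that was intended.
        System.out.println(repair(garbled).equals("HANDICAP\u00C9ES"));
    }
}
```

This only illustrates why the line happens to recover the text for this particular string; it is not a general repair routine.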

String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");

...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes it with the same encoding. Delete that line and you should still get the same result.
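
A small sketch of why that line is a no-op: encoding a string and decoding the bytes with the same charset gives back an equal string. The original code uses the platform default via getBytes(); UTF-8 is named explicitly here so the behavior does not depend on the machine. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripNoOp {
    /** Encodes to UTF-8 bytes and immediately decodes them as UTF-8 again. */
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String repaired = "HANDICAP\u00C9ES"; // U+00C9 is E with acute accent
        // The round trip changes nothing, so the extra conversion can be dropped.
        System.out.println(roundTrip(repaired).equals(repaired));
    }
}
```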

More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings--they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.

And if my analysis is correct, your Access driver is decoding it correctly; the problem is at the other end, possibly before the DB even comes into the picture. That's what you need to fix, because that new String(getBytes()) hack can't be counted on to work in all cases.
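
As for the commented-out "direct" conversion in the question, the following sketch shows one way to see why it fails. The CP850 bytes contain 0xC9 followed by the letter E (0x45). In UTF-8, 0xC9 announces a two-byte sequence, but 0x45 is not a valid continuation byte, so the decoder substitutes the replacement character U+FFFD. The class name is hypothetical, and the IBM850 charset is again assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DirectDecodeFails {
    /** Mirrors new String(value.getBytes("CP850"), "UTF8") from the question. */
    static String decodeCp850BytesAsUtf8(String garbled) {
        byte[] raw = garbled.getBytes(Charset.forName("IBM850"));
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String result = decodeCp850BytesAsUtf8("HANDICAP\u2554ES");
        // The byte 0xC9 is malformed UTF-8 in this position and becomes U+FFFD.
        System.out.println(result.indexOf('\uFFFD') >= 0);
        // The intended character U+00C9 never appears.
        System.out.println(result.indexOf('\u00C9') < 0);
    }
}
```

The bytes were never UTF-8 to begin with, so decoding them as UTF-8 cannot produce the intended character.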


Original analysis, based on no information. :-/
If you're seeing HANDICAP╔ES on the console, there's probably no problem. Given this code:

System.out.println("HANDICAPÉES");

The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. Then the console decodes that using its own default encoding, which happens to be cp850. So the console displays it wrong, but that's normal. If you want it to display correctly, you can change the console's encoding with this command:

CHCP 1252
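
The mismatch described above can be reproduced without a Windows console. This is only a simulation under stated assumptions: the JVM side is modeled by encoding with windows-1252, the console side by decoding with IBM850 (assumed to be available in the JRE), and the class and method names are invented for the example.

```java
import java.nio.charset.Charset;

public class ConsoleMismatch {
    /** Encodes as the JVM would (windows-1252) and decodes as a cp850 console would. */
    static String asSeenOnCp850Console(String text) {
        byte[] sent = text.getBytes(Charset.forName("windows-1252"));
        return new String(sent, Charset.forName("IBM850"));
    }

    public static void main(String[] args) {
        String intended = "HANDICAP\u00C9ES"; // U+00C9 is E with acute accent
        String shown = asSeenOnCp850Console(intended);
        // U+2554 is the box-drawing character that shows up instead.
        System.out.println(shown.equals("HANDICAP\u2554ES"));
    }
}
```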

To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, but that shouldn't be a problem for French.

As for writing to a file, just specify the desired encoding when you create the Writer:

OutputStreamWriter osw = new OutputStreamWriter(
    new FileOutputStream("myFile.txt"), "UTF-8");
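
To see what the Writer actually puts on disk, here is a rough sketch that swaps the FileOutputStream for an in-memory ByteArrayOutputStream (an assumption made purely so the bytes can be inspected; the class and method names are hypothetical). With UTF-8, É (U+00C9) is written as the two bytes 0xC3 0x89.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8WriterBytes {
    /** Writes text through an OutputStreamWriter configured for UTF-8 and returns the bytes. */
    static byte[] encodeLikeFileWriter(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the file
        try (Writer osw = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            osw.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeLikeFileWriter("\u00C9"); // E with acute accent
        for (byte b : bytes) {
            System.out.printf("0x%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The Java string itself is unchanged; only the bytes produced at the output boundary depend on the chosen encoding.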
Alan Moore
I guess I should have been more clear about my development environment. For development, I am using Eclipse on an Ubuntu Linux machine. I get the same results whether I run it from the Eclipse console or through a regular terminal console. We are using the jackcess Java API to read the Access MDB database file. There seems to be no way to specify a default encoding for the jackcess driver, so I have to do the conversion as I described above. I tried outputting the string directly into a GUI element (JLabel, JTextField) but that didn't help either.
Yes, this seems to be quite an exotic problem, of which there was no hint in the original question. It might help if we could see the actual code you're using to retrieve the data. And don't try to put that in a comment--you've already seen how well that works. Edit the question and put it there.
Alan Moore
Ok, I have edited the question to show a sample of the code I'm using to retrieve the data. Thank you.