Hi!

I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows-1252 codepage. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.

To clarify: the strings I get from the Excel file look very much as if they are already in some kind of Unicode encoding.

WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
Cell[] row = s.getRow(4);
String contents = row[0].getContents();

This is where contents seems to contain something Unicode: the åäö are multibyte characters, while the ASCII ones are normal single-byte characters. It is most definitely not Latin-1. If I print the "contents" string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex (195 and 182 in decimal).

[edit]

I have tried the suggestions given below with different codepages, such as converting from Cp1252. Some kind of conversion did happen, because I got a different kind of gibberish instead. As a reference I always printed an "ö" string hand-coded into the source code, to verify that nothing was wrong with my terminal or typefaces. The manually typed "ö" always worked.

[edit]

I also tried WorkbookSettings as suggested in the comments, but I looked in the code for JXL and characterSet seems to be ignored by the parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.

A: 

"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
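A minimal sketch of both routes, assuming the raw bytes are available as a byte[]. The class and method names are made up for illustration, and the sample bytes are simply "åäö" as windows-1252 encodes them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class Cp1252Decode {
    // Route 1: let the String constructor decode the bytes.
    static String decodeWithString(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    // Route 2: wrap a byte stream in an InputStreamReader and read chars.
    static String decodeWithReader(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(raw), Charset.forName("windows-1252"))) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Check first, since the charset is not guaranteed on every JRE.
        System.out.println(Charset.isSupported("windows-1252"));

        byte[] raw = {(byte) 0xE5, (byte) 0xE4, (byte) 0xF6}; // "åäö" in windows-1252
        System.out.println(decodeWithString(raw).equals("\u00e5\u00e4\u00f6")); // true
        System.out.println(decodeWithReader(raw).equals("\u00e5\u00e4\u00f6")); // true
    }
}
```

The comparison uses \u escapes so the result does not depend on the source file's or the console's encoding.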

Tom Hawtin - tackline
ISO-8859-1 is quite passable as Windows codepage 1252
Thorbjørn Ravn Andersen
+1  A: 

When Java parses a file it uses some encoding to read the bytes on the disk and create characters in memory. The default encoding varies from platform to platform. Java's internal String representation is Unicode already, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.

If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (Little-Endian I think) but I'd expect a library like JXL should be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings. I imagine it auto-detects any encodings as it needs to.
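To illustrate what reading with the wrong encoding looks like, here is a small sketch. It assumes the bytes on disk are the UTF-8 form of "ö" (C3 B6, as reported in the question); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

final class WrongEncodingDemo {
    // Decode the same bytes under a caller-chosen charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] onDisk = {(byte) 0xC3, (byte) 0xB6}; // "ö" encoded as UTF-8

        String right = decode(onDisk, "UTF-8");
        String wrong = decode(onDisk, "windows-1252");

        System.out.println(right.equals("\u00f6"));       // true: one char, ö
        System.out.println(wrong.length());               // 2
        System.out.println(wrong.equals("\u00c3\u00b6")); // true: the two chars Ã and ¶
    }
}
```

Seeing "Ã¶" where "ö" was expected is a typical sign that UTF-8 data was decoded as a single-byte codepage.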

Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:

String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);

pw.print(text); // repeat as needed

pw.close(); // cleanup; closing the PrintWriter also closes the wrapped streams
osw.close();
fos.close();

If your problem is something else please edit your question and provide more details.

Mr. Shiny and New
A: 
FileInputStream fis = new FileInputStream(yourFile);
// "Cp1252" is Java's name for Windows-1252, the codepage the question asks about
BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "Cp1252"));

And do with reader whatever you'd do directly with file.

vartec
+1  A: 

You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
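One case where the damage really is permanent, as a sketch: decoding with a charset that cannot represent the bytes replaces them with U+FFFD, after which the original bytes are gone. US-ASCII is used here only because its behaviour is easy to show; it is an assumption for the demo, not what JXL does. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LossyDecode {
    // Decode as US-ASCII; every byte >= 0x80 becomes the replacement char U+FFFD.
    static String decodeAsAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] original = "\u00f6".getBytes(StandardCharsets.UTF_8); // C3 B6

        String damaged = decodeAsAscii(original);
        System.out.println(damaged.equals("\uFFFD\uFFFD")); // true

        // Re-encoding cannot bring the original bytes back.
        byte[] roundTrip = damaged.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(original, roundTrip)); // false
    }
}
```

With a charset that maps every byte to some character, such as ISO-8859-1, the wrong String can sometimes be re-encoded and decoded again, but it is safer not to rely on that.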

JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.

Michael Borgwardt
Thanks! I will try that and hopefully get back to this topic to let everybody see how it worked.
Jakob Eriksson
+2  A: 

WorkbookSettings ws = new WorkbookSettings();

ws.setEncoding("CP1250");

Worked for me.

A: 

Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.
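This is easy to check by encoding "ö" yourself and printing the bytes. A small sketch with a made-up helper name:

```java
import java.nio.charset.Charset;

final class EncodedBytes {
    // Encode text in the named charset and return the bytes as spaced hex.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00f6"; // ö, written as an escape to avoid source-encoding issues
        System.out.println(hex(oUmlaut, "UTF-8"));        // C3 B6
        System.out.println(hex(oUmlaut, "windows-1252")); // F6
    }
}
```

The two-byte result matches what the question reports in hello.txt, while windows-1252 would have produced the single byte F6.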

Seth
A: 

If none of the answers above solve the problem, this might do the trick:

String myOutput = new String (myInput, "UTF-8");

Here myInput must be a byte[]. This decodes the incoming bytes as UTF-8, so it only helps if the data really is UTF-8.

lxndr