ansaurus

Question

How to parse a string that is in a different encoding from Java

Answer 1

+1 A:

Conversion is generally done by something like this:

String properlyEncoded = 
    new String(original.getBytes(originalEncoding), newEncoding);

Note that it is not unlikely that some information is lost during the conversion.
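
To make the lossiness concrete, here is a minimal sketch (not from the original answer). The class name EncodingRepair, the helper redecode and the sample text are made up for illustration. It shows one wrong decode that can be reversed and one that cannot:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingRepair {
    // Turn the chars back into the bytes they were (wrongly) decoded from,
    // then decode those bytes with the charset that should have been used.
    static String redecode(String garbled, Charset wrongCharset, Charset rightCharset) {
        return new String(garbled.getBytes(wrongCharset), rightCharset);
    }

    public static void main(String[] args) {
        String original = "10 \u2013 12"; // contains an en dash
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Reversible mistake: ISO-8859-1 maps every byte value to exactly one char.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original)); // true

        // Lossy mistake: US-ASCII replaces every byte >= 0x80 with U+FFFD,
        // so the original bytes are gone before any repair is attempted.
        String lossy = new String(utf8, StandardCharsets.US_ASCII);
        String notRepaired = redecode(lossy, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println(notRepaired.equals(original)); // false
    }
}
```

Whether a given string can be repaired this way depends entirely on which wrong charset was used for the first decode.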

Bozho 2010-10-25 16:25:29

Ok, so I did String projDateString2 = new String(projDateString.getBytes("Cp1252"), "UTF-16"); and I still cannot get the replaceAll to work correctly

Derek 2010-10-25 16:33:39

@Bozho: That conversion can very easily be lossy though, because the original incorrect conversion can easily lose information.

Jon Skeet 2010-10-25 16:41:48

@Jon Skeet true. But you can't prevent the loss, I think.

Bozho 2010-10-25 16:43:30

Answer 2

+10 A:

Java strings are always in UTF-16, at least as far as the API is concerned... but you can generally just think of them as being "Unicode". The fact that they're UTF-16 is only really relevant when it comes to characters outside the Basic Multilingual Plane, i.e. with Unicode values above U+FFFF. They have to be represented as surrogate pairs in Java. But I don't think you need to worry about this in your case. So just think of the values in Strings as "Unicode text" without a specific encoding... in particular, definitely not in UTF-8 or CP1252. Those are the encodings used to convert binary data (e.g. a byte array) into text data (e.g. a string).
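
As a small illustration of the surrogate-pair point (a sketch with a made-up class name, CodeUnitsDemo), compare the number of UTF-16 code units with the number of Unicode code points for a character inside the Basic Multilingual Plane and one outside it:

```java
final class CodeUnitsDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String enDash = "\u2013";        // U+2013, inside the BMP
        String emoji = "\uD83D\uDE00";   // U+1F600, outside the BMP: a surrogate pair

        System.out.println(codeUnits(enDash) + " / " + codePoints(enDash)); // 1 / 1
        System.out.println(codeUnits(emoji) + " / " + codePoints(emoji));   // 2 / 1
        System.out.println(Integer.toHexString(emoji.codePointAt(0)));      // 1f600
    }
}
```

The en dash from the question is a single char, which is why surrogate pairs do not matter for this particular problem.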

You shouldn't be using String.getBytes() or new String(byte[]) without specifying the encoding - that's the problem. Those always use the platform default encoding - which is almost always the wrong choice.
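
A quick sketch of why the encoding has to be named explicitly: the same one-character string turns into different bytes depending on the charset. The class and method names are invented for the example, and it assumes the JRE ships the windows-1252 charset, which is common but not guaranteed by the specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitEncoding {
    // Number of bytes the text occupies in the given encoding.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String enDash = "\u2013";

        // Whatever this prints is what the no-argument getBytes() would use.
        System.out.println("platform default: " + Charset.defaultCharset());

        System.out.println(encodedLength(enDash, StandardCharsets.UTF_8));          // 3
        System.out.println(encodedLength(enDash, StandardCharsets.UTF_16BE));       // 2
        // Assumption: windows-1252 is available in this JRE.
        System.out.println(encodedLength(enDash, Charset.forName("windows-1252"))); // 1
    }
}
```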

You say you "have a string that I have read in from a Word document" - how did you read it in? How did it start off life?

If you have the bytes and you know the relevant encoding, you should use:

String text = new String(bytes, encoding);
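
For instance, assuming the raw bytes really are CP1252 (an assumption about the input, as is the availability of the windows-1252 charset in the JRE), the byte 0x96 decodes to the en dash U+2013. The class name ExplicitDecode and the sample bytes are illustrative only:

```java
import java.nio.charset.Charset;

final class ExplicitDecode {
    // Decode raw bytes using a named encoding instead of the platform default.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "10", space, byte 0x96, space, "12"
        byte[] bytes = { 0x31, 0x30, 0x20, (byte) 0x96, 0x20, 0x31, 0x32 };

        String right = decode(bytes, "windows-1252");
        System.out.println((int) right.charAt(3)); // 8211, i.e. U+2013 (en dash)

        // Same bytes, wrong charset: 0x96 becomes the control character U+0096.
        String wrong = decode(bytes, "ISO-8859-1");
        System.out.println((int) wrong.charAt(3)); // 150
    }
}
```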

You should never have to deal with a string which has been created using the wrong encoding - if you get to that stage, you're almost bound to be risking information loss. Tackle the problem as early as you possibly can, rather than trying to fix the data up later on.

The next thing to understand is that the String class in Java is immutable. Calling replaceAll on a string won't change the existing string. It will instead return a new string with the replacements made.

So this statement:

projDateString2.replaceAll("\0x96", "\u2013");

will never do what you want. Even if everything else is correct, you should be using:

projDateString2 = projDateString2.replaceAll("\0x96", "\u2013");

(or something similar). I don't think that actually will do what you want anyway, but you need to be aware of it for when everything else is sorted out.
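
The immutability point in isolation, as a small sketch (the class and method names are invented, and the pattern here is the real en dash rather than the one from the question):

```java
final class ImmutableReplace {
    // Calls replaceAll but throws the result away; the input comes back unchanged.
    static String discardResult(String s) {
        s.replaceAll("\u2013", "-");
        return s;
    }

    // Keeps the new string that replaceAll returns.
    static String keepResult(String s) {
        s = s.replaceAll("\u2013", "-");
        return s;
    }

    public static void main(String[] args) {
        String date = "10 \u2013 12";
        System.out.println(discardResult(date).equals(date)); // true: nothing changed
        System.out.println(keepResult(date));                 // 10 - 12
    }
}
```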

Jon Skeet 2010-10-25 16:27:10

I will re-think this comment since you edited while I was writing it.

Derek 2010-10-25 16:35:12

not specifying encoding for the mentioned methods simply means they use the default platform encoding. Which is UTF-8, if `-Dfile.encoding` isn't specified.

Bozho 2010-10-25 16:39:47

@Bozho: It's UTF-8 on *some* platforms, but not on all. Relying on it is basically a bad move. I'll edit this in.

Jon Skeet 2010-10-25 16:41:00

that you should not rely on the default is completely true

Bozho 2010-10-25 16:42:31

I am using docx4j to open the word document. It seems to be using a FileInputStream and the load method can be seen here: http://dev.plutext.org/trac/docx4j/browser/trunk/docx4j/src/main/java/org/docx4j/openpackaging/packages/OpcPackage.java

Derek 2010-10-25 16:46:29

Since I can't really control the input source and only have access to the String object, what is the best way for me to detect the "En Dash" character that I want to replace with a regular "-" hyphen character?

Derek 2010-10-25 16:53:45

@Derek: It may well have it correctly to start with. First find out the character it's actually using in Unicode form, then just replace it with `text = text.replaceAll("\uxxxx", "-")` where `\uxxxx` is the appropriate Unicode character. But you should find this out just using *characters*, not doing anything with bytes.

Jon Skeet 2010-10-25 17:01:21

I am not sure if that is the case here. When I do projDateString.getBytes() and print the (int) cast of the character array, I get three characters that are 65506, 65408 and 65427 - which shouldn't be that large, right? In this case, what will I need to put in the regex part of the replaceAll()?

Derek 2010-10-25 17:19:31

@Derek: *Don't use getBytes()*. Just use `charAt()` and cast each char to an int. That's the way to get the Unicode values. If you use `getBytes()` you're going back into encoding it in binary, which is going to make it harder to work out what's going on. Just don't do it.
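
A sketch of both approaches side by side (class and method names are made up). The second method is a guess at what produced 65506, 65408 and 65427: it assumes the default encoding was UTF-8 and that each byte was cast to char, which sign-extends negative bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class CharDump {
    // The UTF-16 code units of the string as ints; no encoding step involved.
    static List<Integer> codeUnits(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add((int) s.charAt(i));
        }
        return out;
    }

    // Encode to UTF-8, then cast each signed byte to char.
    // Negative bytes are sign-extended, so 0xE2 shows up as 0xFFE2 = 65506.
    static List<Integer> utf8BytesCastToChar(String s) {
        List<Integer> out = new ArrayList<>();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            out.add((int) (char) b);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(" \u2013 "));         // [32, 8211, 32]
        System.out.println(utf8BytesCastToChar("\u2013")); // [65506, 65408, 65427]
    }
}
```

If that assumption holds, the three large numbers are just the UTF-8 bytes E2 80 93 of a single en dash, and the string itself was fine all along.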

Jon Skeet 2010-10-25 18:01:51

Ah - I see. Thanks. So I think that worked, because I got back a unicode value of 8211 for the character in question, and a 32 on either side of it indicating the spaces. However, my replaceAll("\u8211", "-") still did not work. I am playing with that now..do I need to use a hex value instead perhaps?

Derek 2010-10-25 18:47:33

Got it! I had to do my regex using the hex value of the character, which is 2013! Thanks for all the help!
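
The reason the hex value is needed: Java's backslash-u escapes take four hexadecimal digits, while casting a char to int prints decimal. A small sketch (the class name EscapeRadix and the helper escapeFor are hypothetical):

```java
final class EscapeRadix {
    // Build the Java escape sequence text for a UTF-16 code unit given as an int.
    static String escapeFor(int codeUnit) {
        return String.format("\\u%04X", codeUnit);
    }

    public static void main(String[] args) {
        int decimal = 8211;                               // value printed by (int) charAt(i)
        System.out.println(Integer.toHexString(decimal)); // 2013
        System.out.println(escapeFor(decimal));           // backslash, u, 2013
        System.out.println((char) decimal == '\u2013');   // true: same character

        // Writing the decimal digits into the escape names a different character:
        // hex 8211 is decimal 33297, which is not the en dash.
        System.out.println((int) '\u8211');               // 33297
    }
}
```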

Derek 2010-10-25 18:53:14

@Derek: I would actually use `replace` rather than `replaceAll` - there's no need for a regex here or even strings, as you just want to replace one character with another.

Jon Skeet 2010-10-25 21:23:46

Answer 3

+1 A:

First you need to make sure that you properly convert from CP1252 bytes to Java's character representation (which is UTF-16). Since you're using a library for parsing .docx files, this has probably happened.

Now all you need to do is call projDateString.replace('\u2013', '-') and do something with the return value. No need for replaceAll(), since you're not working with regular expressions.
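
A minimal sketch of the difference (class and method names are invented for the example). replace treats its arguments literally, while replaceAll treats the first argument as a regular expression, which matters as soon as the text to replace contains a regex metacharacter:

```java
final class ReplaceDemo {
    // Replace every en dash with a plain hyphen; no regex involved.
    static String normalizeDash(String s) {
        return s.replace('\u2013', '-');
    }

    public static void main(String[] args) {
        System.out.println(normalizeDash("10 \u2013 12")); // 10 - 12

        // With a regex metacharacter the two methods behave very differently.
        System.out.println("a.b".replaceAll(".", "-"));    // ---
        System.out.println("a.b".replace(".", "-"));       // a-b
    }
}
```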

adietrich 2010-10-25 16:44:57

I am using docx4j to open the word document. It seems to be using a FileInputStream and the load method can be seen here: http://dev.plutext.org/trac/docx4j/browser/trunk/docx4j/src/main/java/org/docx4j/openpackaging/packages/OpcPackage.java

Derek 2010-10-25 16:47:31

Thanks for the tip about the return value - I had that typed up correctly in my code, it just didn't make it into my SO question.

Derek 2010-10-25 16:47:57

Updated my answer, you're trying to go from "En Dash" to "-", correct? Otherwise you would have to swap the replace() parameters.

adietrich 2010-10-25 22:43:25
