I have a string that I have read in from a Word document. I think it is in "Cp1252" encoding. Java uses UTF8.

How do I search that string for those special characters in Cp1252 and replace them with an appropriate UTF8 character?

Specifically, I want to replace the "En Dash" character with a plain "-".

The following code block takes projDateString, which comes from the Word document, and tries to do that:

    byte[] test = projDateString.getBytes("Cp1252");
    for (int i = 0; i < test.length; i++) {
        System.out.println("test[" + i + "] = " + Integer.toHexString(test[i]));
    }
    String projDateString2 = new String(test);
    projDateString2.replaceAll("\0x96", "\u2013");
    System.out.println("projDateString2: " + projDateString);

I am not sure I am setting up projDateString2 correctly. As you can see, the hex value of that dash is ffffff96 when I getBytes on the string using Cp1252 encoding. If I getBytes with UTF8 it comes in as 3 hex values instead of one.

This gives me the following output:

test[0] = 30
test[1] = 38
test[2] = 2f
test[3] = 32
test[4] = 30
test[5] = 31
test[6] = 30
test[7] = 20
test[8] = ffffff96
test[9] = 20
test[10] = 50
test[11] = 72
test[12] = 65
test[13] = 73
test[14] = 65
test[15] = 6e
test[16] = 74
projDateString2: 08/2010 ΓÇô Present

As you can see, the replace did nothing, and the println still gives me garbage characters instead of a plain-text "-".

+1  A: 

Conversion is generally done by something like this:

    String properlyEncoded =
        new String(original.getBytes(originalEncoding), newEncoding);

Note that some information may well be lost during the conversion.
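As a minimal sketch of that pattern, assume a string that was produced by decoding UTF-8 bytes as windows-1252. The class name `ReencodeSketch` and the helper `reinterpret` are illustrative and not part of the original post:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReencodeSketch {
    // Hypothetical helper: re-encodes with the charset that was wrongly used for
    // decoding, then decodes the resulting bytes with the charset they were really in.
    static String reinterpret(String garbled, Charset wronglyUsed, Charset actual) {
        return new String(garbled.getBytes(wronglyUsed), actual);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // Simulate the damage: the UTF-8 bytes of an en dash (E2 80 93) read as windows-1252.
        String garbled = new String("\u2013".getBytes(StandardCharsets.UTF_8), cp1252);
        String repaired = reinterpret(garbled, cp1252, StandardCharsets.UTF_8);
        System.out.println(repaired.equals("\u2013")); // prints true
    }
}
```

This round trip only works when every byte survived the wrong decode. If the wrong charset had no mapping for some byte, that byte is gone and cannot be recovered this way.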

Bozho
OK, so I did `String projDateString2 = new String(projDateString.getBytes("Cp1252"), "UTF-16");` and I still cannot get the replaceAll to work correctly.
Derek
@Bozho: That conversion can very easily be lossy though, because the original incorrect conversion can easily lose information.
Jon Skeet
@Jon Skeet true. But you can't prevent the loss, I think.
Bozho
+10  A: 

Java strings are always in UTF-16, at least as far as the API is concerned... but you can generally just think of them as being "Unicode". The fact that they're UTF-16 is only really relevant when it comes to characters outside the Basic Multilingual Plane, i.e. with Unicode values above U+FFFF. They have to be represented as surrogate pairs in Java. But I don't think you need to worry about this in your case. So just think of the values in Strings as "Unicode text" without a specific encoding... in particular, definitely not in UTF-8 or CP1252. Those are the encodings used to convert binary data (e.g. a byte array) into text data (e.g. a string).

You shouldn't be using String.getBytes() or new String(byte[]) without specifying the encoding - that's the problem. Those always use the platform default encoding - which is almost always the wrong choice.

You say you "have a string that I have read in from a Word document" - how did you read it in? How did it start off life?

If you have the bytes and you know the relevant encoding, you should use:

    String text = new String(bytes, encoding);

You should never have to deal with a string which has been created using the wrong encoding - if you get to that stage, you're almost bound to be risking information loss. Tackle the problem as early as you possibly can, rather than trying to fix the data up later on.
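For illustration, here is a small sketch of decoding raw bytes with an explicitly named charset. The byte values are made up to resemble the data in the question, and `DecodeSketch` and `decode` are hypothetical names:

```java
import java.nio.charset.Charset;

class DecodeSketch {
    // Decodes raw bytes with an explicitly named charset instead of the platform default.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x96 is the en dash in windows-1252 (Cp1252); the other bytes are plain ASCII.
        byte[] raw = {0x30, 0x38, 0x20, (byte) 0x96, 0x20, 0x50};
        String text = decode(raw, "windows-1252");
        // The decoded dash is the single Unicode character U+2013.
        System.out.println(Integer.toHexString(text.charAt(3))); // prints 2013
    }
}
```

Once the bytes are decoded with the right charset, the string holds the Unicode en dash and no further byte-level handling should be needed.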

The next thing to understand is that the String class in Java is immutable. Calling replaceAll on a string won't change the existing string. It will instead return a new string with the replacements made.

So this statement:

    projDateString2.replaceAll("\0x96", "\u2013");

will never do what you want. Even if everything else is correct, you should be using:

    projDateString2 = projDateString2.replaceAll("\0x96", "\u2013");

(or something similar). I don't think that actually will do what you want anyway, but you need to be aware of it for when everything else is sorted out.
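The immutability point can be seen in a short sketch. It assumes the string already contains a real Unicode en dash (U+2013), which is not necessarily the case for the string in the question; `ImmutableSketch` and `withHyphen` are illustrative names:

```java
class ImmutableSketch {
    // Returns a new string; the argument itself is never modified.
    static String withHyphen(String s) {
        return s.replaceAll("\u2013", "-");
    }

    public static void main(String[] args) {
        String original = "08/2010 \u2013 Present";

        // The return value is discarded here, so nothing observable changes.
        original.replaceAll("\u2013", "-");
        System.out.println(original.contains("\u2013")); // prints true

        // Keeping the returned string is what makes the replacement visible.
        String replaced = withHyphen(original);
        System.out.println(replaced); // prints 08/2010 - Present
    }
}
```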

Jon Skeet
I will re-think this comment since you edited while I was writing it.
Derek
Not specifying an encoding for the mentioned methods simply means they use the default platform encoding, which is UTF-8 if `-Dfile.encoding` isn't specified.
Bozho
@Bozho: It's UTF-8 on *some* platforms, but not on all. Relying on it is basically a bad move. I'll edit this in.
Jon Skeet
That you should not rely on the default is completely true.
Bozho
I am using docx4j to open the word document. It seems to be using a FileInputStream and the load method can be seen here: http://dev.plutext.org/trac/docx4j/browser/trunk/docx4j/src/main/java/org/docx4j/openpackaging/packages/OpcPackage.java
Derek
Since I can't really control the input source and only have access to the String object, what is the best way for me to detect the "En Dash" character that I want to replace with a regular Unicode "-" character?
Derek
@Derek: It may well have it correctly to start with. First find out the character it's actually using in Unicode form, then just replace it with `text = text.replaceAll("\uxxxx", "-")` where `\uxxxx` is the appropriate Unicode character. But you should find this out just using *characters*, not doing anything with bytes.
Jon Skeet
I am not sure that is the case here. When I do projDateString.getBytes() and print the (int) cast of the character array, I get three characters that are 65506, 65408 and 65427, which shouldn't be that large, right? In this case, what will I need to put in the regex part of the replaceAll()?
Derek
@Derek: *Don't use getBytes()*. Just use `charAt()` and cast each char to an int. That's the way to get the Unicode values. If you use `getBytes()` you're going back into encoding it in binary, which is going to make it harder to work out what's going on. Just don't do it.
Jon Skeet
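A rough sketch of that kind of inspection, using a made-up sample string and the hypothetical names `CharDumpSketch` and `describe`, might look like this. It prints each char both in decimal and in the hexadecimal form that Java's `\uXXXX` escapes use:

```java
class CharDumpSketch {
    // Formats one char as "index: decimal (U+HEX)".
    static String describe(String s, int i) {
        char c = s.charAt(i);
        return String.format("%d: %d (U+%04X)", i, (int) c, (int) c);
    }

    public static void main(String[] args) {
        String text = "0 \u2013 P";
        for (int i = 0; i < text.length(); i++) {
            System.out.println(describe(text, i));
        }
    }
}
```

For the en dash this prints `2: 8211 (U+2013)`. The decimal value 8211 equals hexadecimal 2013, so the matching escape in Java source is `\u2013`, whereas `\u8211` names a different character.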
Ah, I see. Thanks. So I think that worked, because I got back a Unicode value of 8211 for the character in question, and a 32 on either side of it indicating the spaces. However, my replaceAll("\u8211", "-") still did not work. I am playing with that now. Do I need to use a hex value instead, perhaps?
Derek
Got it! I had to do my regex using the hex value of the character, which is 2013! Thanks for all the help!
Derek
@Derek: I would actually use `replace` rather than `replaceAll` - there's no need for a regex here or even strings, as you just want to replace one character with another.
Jon Skeet
+1  A: 

First you need to make sure that you properly convert from CP1252 bytes to Java's character representation (which is UTF-16). Since you're using a library for parsing .docx files, this has probably happened.

Now all you need to do is call projDateString.replace('\u2013', '-') and do something with the return value. No need for replaceAll(), since you're not working with regular expressions.
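A minimal sketch of that call, assuming the library has already produced a string containing a real en dash (the sample value and the names `DashReplaceSketch` and `normalizeDash` are illustrative):

```java
class DashReplaceSketch {
    // Replaces every en dash (U+2013) with an ASCII hyphen-minus; no regex is involved.
    static String normalizeDash(String s) {
        return s.replace('\u2013', '-');
    }

    public static void main(String[] args) {
        String projDateString = "08/2010 \u2013 Present";
        // Strings are immutable, so the result has to be captured.
        String cleaned = normalizeDash(projDateString);
        System.out.println(cleaned); // prints 08/2010 - Present
    }
}
```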

adietrich
I am using docx4j to open the word document. It seems to be using a FileInputStream and the load method can be seen here: http://dev.plutext.org/trac/docx4j/browser/trunk/docx4j/src/main/java/org/docx4j/openpackaging/packages/OpcPackage.java
Derek
Thanks for the tip about the return value. I had that typed up correctly in my code; it just didn't make it into my SO question.
Derek
Updated my answer, you're trying to go from "En Dash" to "-", correct? Otherwise you would have to swap the replace() parameters.
adietrich