I have a string that I read in from a Word document. I think it is in "Cp1252" encoding. Java uses UTF-8.
How do I search that string for the special Cp1252 characters and replace them with appropriate UTF-8 characters?
Specifically, I want to replace the "En Dash" character with a plain "-".
The following code takes projDateString, which comes from the Word document, and tries to do that:
// Encode the string as Cp1252 and dump each byte as hex
byte[] test = projDateString.getBytes("Cp1252");
for (int i = 0; i < test.length; i++) {
    System.out.println("test[" + i + "] = " + Integer.toHexString(test[i]));
}
// Rebuild a string from those bytes and try to swap out the dash
String projDateString2 = new String(test);
projDateString2.replaceAll("\0x96", "\u2013");
System.out.println("projDateString2: " + projDateString);
I am not sure I am setting up projDateString2 correctly. As the output below shows, the hex value of that dash is ffffff96 when I call getBytes on the string with the Cp1252 encoding. If I call getBytes with UTF-8, the dash comes out as three bytes instead of one.
This gives me the following output:
test[0] = 30
test[1] = 38
test[2] = 2f
test[3] = 32
test[4] = 30
test[5] = 31
test[6] = 30
test[7] = 20
test[8] = ffffff96
test[9] = 20
test[10] = 50
test[11] = 72
test[12] = 65
test[13] = 73
test[14] = 65
test[15] = 6e
test[16] = 74
projDateString2: 08/2010 ΓÇô Present
As you can see, the replace did nothing, and the println still gives me garbage characters instead of a plain "-".
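To make the goal concrete, here is a minimal, self-contained sketch of the behavior being asked for. It assumes the string has already been decoded correctly, so the dash is held as the single character U+2013, and that the JDK provides the windows-1252 charset (the usual name behind the "Cp1252" alias). The class EnDashDemo, the helpers hex and replaceEnDash, and the sample value are hypothetical names chosen for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EnDashDemo {
    // Hex-dump the bytes of s in the given charset, separated by spaces.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as ffffff96.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Replace every en dash (U+2013) with a plain ASCII hyphen.
    // String.replace returns a new string; the original is unchanged.
    static String replaceEnDash(String s) {
        return s.replace('\u2013', '-');
    }

    public static void main(String[] args) {
        // Assumed sample value, written with an escape for the en dash.
        String projDateString = "08/2010 \u2013 Present";

        // Assumption: windows-1252 is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println(hex("\u2013", cp1252));                 // 96
        System.out.println(hex("\u2013", StandardCharsets.UTF_8)); // e2 80 93

        String cleaned = replaceEnDash(projDateString);
        System.out.println(cleaned);                               // 08/2010 - Present
    }
}
```

The sketch works on characters rather than encoded bytes: the one-byte and three-byte forms only appear when the string is converted with getBytes, and the result of the replacement has to be assigned to a variable to be visible.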