ansaurus

Question

How to parse word-created special chars in java

Answer 1

+3 A:

Your problem almost certainly has to do with your encoding scheme not matching the encoding scheme Word saves in. Your code is probably using the Java default, likely UTF-8 if you haven't done anything to it. Your input, on the other hand, is likely Windows-1252, the default for Microsoft Word's .doc documents. See this site for more info. Notably,

Within Windows, ISO-8859-1 is replaced by Windows-1252, which often means that text copied from, say, a Microsoft Word document and pasted straight into a web page produces HTML validation errors.

So what does this mean for you? You'll have to tell your program that the input is using Windows-1252 encoding, and convert it to UTF-8. You can do this in varying flavors of "manually." Probably the most natural way is to take advantage of Java's built-in Charset class.

Windows-1252 is recognized by the IANA Charset Registry

Name: windows-1252
MIBenum: 2252
Source: Microsoft (http://www.iana.org/assignments/charset-reg/windows-1252) [Wendt]
Alias: None

so it should be Charset-compatible. I haven't done this before myself, so I can't give you a code sample, but I will point out that there is a String constructor that takes a byte[] and a Charset as arguments.
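A minimal sketch of that constructor in use, assuming the raw bytes really are Windows-1252 text. The class name `Cp1252Decode` and the sample bytes are made up for illustration, and `Charset.forName("windows-1252")` relies on the JDK shipping that charset, which mainstream JDKs do but the specification does not guarantee.

```java
import java.nio.charset.Charset;

class Cp1252Decode {
    // Decode bytes that are ASSUMED to be Windows-1252 text into a Java String.
    static String decode(byte[] raw) {
        Charset cp1252 = Charset.forName("windows-1252");
        return new String(raw, cp1252);
    }

    public static void main(String[] args) {
        // Hypothetical input: 'a', the Windows-1252 en dash byte (0x96), 'b'.
        byte[] raw = {'a', (byte) 0x96, 'b'};
        String text = decode(raw);
        // The middle character is now the Unicode en dash, U+2013.
        System.out.println(Integer.toHexString(text.charAt(1)));
    }
}
```

Once the text is decoded this way, it can be written back out in whatever encoding the rest of the program expects.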

Lord Torgamus 2010-10-22 19:55:07

ASCII and Unicode are *Character Sets*, not encodings. When you have a particular character value from a charset, you have to decide how you're going to write that value to disk. *That's* what an encoding is.

Stephen P 2010-10-22 20:30:50

@Stephen, hm, I have learned something [about semantics](http://en.wikipedia.org/wiki/Character_set#General_terminology). Neither one of us is fully right, it seems.

Lord Torgamus 2010-10-22 20:35:40

Really like your edit!

Stephen P 2010-10-22 20:47:07

@Stephen, thanks! I only meant to do a quick follow-up edit, but the more I researched, the more I realized that the original answer needed work, so... yeah.

Lord Torgamus 2010-10-22 21:32:28

I tried a couple of different settings so far; however, none have worked just yet. I forgot to mention this is Word 2007. Does that have a different encoding?

Derek 2010-10-22 21:42:00

@Derek, Word 2007 uses all of the following encodings for English: Unicode, "Windows 1250, 1252-1254, 1257, ISO8859-x" Source: [Microsoft Office help page](http://office.microsoft.com/en-us/word-help/choose-text-encoding-when-you-open-and-save-files-HA010121249.aspx#BM4)

Lord Torgamus 2010-10-22 23:24:21

Answer 2

+1 A:

Probably, that character is an en dash, and the strange blurb you see is due to a difference between the way Word encodes that character and the way that character is decoded by whatever (other) system you are using to display it.

If I remember correctly from when I did some work on character encodings in Java, String instances always hold Unicode text internally (as UTF-16 code units); so, within such an instance, you may search and replace a single character by its Unicode form. For example, let's say you would like to substitute smart quotes with plain double quotes: given a String s, you may write

s = s.replace('\u201c', '"');
s = s.replace('\u201d', '"');

where 201c and 201d are the Unicode code points for the opening and closing smart quotes. According to the link above on Wikipedia, the Unicode code point for the en dash is 2013.

Giulio Piancastelli 2010-10-22 20:19:20

If Word is auto-replacing the user's character with one of its own, I'd suspect an em dash before an en dash.

Lord Torgamus 2010-10-22 20:38:32

I made a simple test on a Word document before answering: on my screen the character seemed an en dash, but you may be right.

Giulio Piancastelli 2010-10-22 22:24:48

In Word, if you type `2010 -- Present` it replaces the two dashes with a single *en dash*

Stephen P 2010-10-23 00:02:25

As far as I can tell, the replacement is triggered even if you type a single `-`.

Giulio Piancastelli 2010-10-23 04:54:44

Answer 3

+2 A:

You are probably getting Windows-1252 which is a character set, not an encoding. (Torgamus - Googling for Windows-1232 didn't give me anything.)

Windows-1252, which Java also knows as "Cp1252", is almost identical to ISO-8859-1 (whose values match the first 256 Unicode code points), but it puts printable characters in the 0x80 to 0x9f range. The En Dash is character 150 (0x96), which falls within the Unicode C1 control character range and shouldn't be there.

You can search for char 150 and replace it with \u2013 which is the proper Unicode code point for En Dash.

There are quite a few other characters that MS has in the 0x80 to 0x9f range, which is reserved for control characters in the Unicode standard, including Em Dash, bullets, and their "smart" quotes.

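Edit: By the way, Java uses Unicode for characters internally (a String is a sequence of UTF-16 code units). UTF-8 is an encoding; which encoding Java uses by default when writing Strings to files or network connections depends on the platform, and it is UTF-8 on many systems and on Java 18 and later.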

Say you have

String stuff = MSWordUtil.getNextChunkOfText();

Where MSWordUtil would be something that you've written to somehow get pieces of an MS-Word .doc file. It might boil down to

File myDocFile = new File(pathAndFileFromUser);
InputStream input = new FileInputStream(myDocFile);
// and then start reading chunks of the file

By default, as you read byte buffers from the file and make Strings out of them, Java will decode them with the platform default charset. There are ways, as Lord Torgamus says, to tell it what encoding should be used. Without doing that, what follows applies when the bytes end up decoded as ISO-8859-1 (Latin-1): Windows-1252 is pretty close to Latin-1, except there are those pesky characters that are in the C1 control range.

After getting some String like stuff above, you won't find \u2013 or \u2014 in it, you'll find 0x96 and 0x97 instead.

At that point you should be able to do

stuff = stuff.replaceAll("\u0096", "\u2013"); // Strings are immutable, so keep the result

I don't do that in my code where I've had to deal with this issue. I loop through an input CharSequence one char at a time, decide based on 0x80 <= charValue <= 0x9f if it has to be replaced, and look up in an array what to replace it with. The above replaceAll() is far easier if all you care about is the 1252 En Dash vs. the Unicode En Dash.
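The loop-and-lookup approach described above might look roughly like the following. This is a sketch, not the author's actual code: the class name `C1Fixer` and the table layout are assumptions, and the table values are the standard Windows-1252 meanings of 0x80 to 0x9f. The five positions that Windows-1252 leaves undefined are passed through unchanged.

```java
class C1Fixer {
    // Windows-1252 meanings of 0x80..0x9F, indexed by (value - 0x80).
    // Slots 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Windows-1252,
    // so they simply map to themselves here.
    private static final char[] CP1252_C1 = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };

    // Replace every char in the C1 range with its Windows-1252 meaning;
    // all other chars (including real Unicode dashes) are copied as-is.
    static String fix(CharSequence in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            out.append(c >= 0x80 && c <= 0x9F ? CP1252_C1[c - 0x80] : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = fix("2010 \u0096 Present");
        System.out.println(Integer.toHexString(fixed.charAt(5)));
    }
}
```

Because characters outside the C1 range are left alone, this tolerates mixed input in which some text already contains proper Unicode punctuation.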

Stephen P 2010-10-22 20:19:33

+1 for the `0x80 - 0x9f` info

Lord Torgamus 2010-10-22 20:39:14

So my incoming string, which is coming out of the doc file, is in Cp1252, right? If I am going to strip the En Dash out of that, how do I go about that? Thought it might be something like `String newString = new String(oldString.getBytes("CP1252"), "UTF-8")` but that doesn't seem to work: newString still prints with the funny chars, and I searched for \u2013 and \u2014 to no avail.

Derek 2010-10-22 22:07:59
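One possible reason the attempt above does not help: `getBytes` encodes and the `String` constructor decodes, so the two charsets need to run in the opposite direction. A hedged sketch, assuming the String was produced by decoding Windows-1252 bytes as ISO-8859-1 (so the en dash shows up as the char 0x96); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Redecode {
    // ASSUMES the input was mis-decoded as ISO-8859-1 and contains only
    // chars up to U+00FF; anything higher would be turned into '?'.
    static String latin1ToCp1252(String misdecoded) {
        // Latin-1 maps chars 0..255 to the same byte values, so this
        // recovers the original bytes exactly.
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Now decode those bytes with the charset they were written in.
        return new String(original, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String fixed = latin1ToCp1252("2010 \u0096 Present");
        System.out.println(fixed.indexOf('\u2013'));
    }
}
```

If the input may already contain characters above U+00FF, this round trip would damage them, and a per-character replacement is the safer route.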

Given a `File input` object, created from the name of the Word document on disk, you may try `char[] chars = new char[(int) (input.length())]; Reader in = new InputStreamReader(new FileInputStream(input), encoding); in.read(chars); in.close(); String s = new String(chars);` where `encoding` should be the character encoding of your Word file. From then on, `s` is an ordinary Unicode String, so you can search for `\u2013` or anything else with ease.

Giulio Piancastelli 2010-10-22 22:36:59

@Derek: see my update. I have to do this because I get mixed input. As Giulio says in his comment and Torgamus in his answer, if you can specify that your input text is `Windows-1252` as in the 2nd parameter to the InputStreamReader constructor, you'll actually get a `\u2013` in your java String and won't have to worry about it.

Stephen P 2010-10-22 22:46:38
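The `InputStreamReader` suggestion from the comments above could be sketched as follows. It assumes the stream carries plain text saved as Windows-1252 (a binary .doc file would need a real Word parser first), and the class and method names are made up for illustration. Unlike a single `read` call, the loop keeps reading until the stream is exhausted.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

class Cp1252Reader {
    // Read the whole stream, decoding it with the given charset.
    static String readAll(InputStream in, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, cs))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a FileInputStream: smart quotes around "hi".
        byte[] raw = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        String s = readAll(new ByteArrayInputStream(raw),
                           Charset.forName("windows-1252"));
        System.out.println(s.replace('\u201c', '"').replace('\u201d', '"'));
    }
}
```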

OK, I edited my post too with some code from my attempt, which was not working. Can you identify why?

Derek 2010-10-22 22:47:47

Stephen P 2010-10-22 23:39:46
