Hi,

I have problems processing "special" characters in texts using the DOM API in Java. The files are in XML format. A previous post explained how XML treats the ampersand (&) and a few other characters such as < and >: http://stackoverflow.com/questions/871963/special-characters-in-xml-files-processing-with-the-dom-api

However, what can I do about other special characters in the data, such as letters specific to German and French? For example, I have the word "façade" in a text element of the XML document, but the place of the letter "ç" looks corrupted: when I open the file with the vim editor on Linux it looks like "fa^Zade", and when I open it with another editor as a .txt or .xml file, the place of "ç" shows a small empty rectangle (or an empty space). The same happens with German umlauts and other "special" symbols of other languages. They cause problems when I try to process the files with an XML parser (I get parsing errors). I suppose this is an encoding problem. In the header of the XML file I am using encoding="UTF-8". I have tried changing it (e.g. to "Unicode" or others), but it doesn't help.

How can I get these special characters recognized? Should I use some particular encoding? If there were just two or three characters that I knew for sure, I could replace them before processing with the DOM API in Java, the way I did with the ampersand (&) symbol (I converted & to &amp;). However, there are a lot of them, and potentially any "special" symbol could occur. Does the problem come from the way the data was saved? For example, should a particular encoding have been used when saving, so that the characters would be recognized now? (I did not save the data myself.)
Thank you.

A: 

If they were just two or three characters, which I knew for sure, I could have replaced them before processing with the DOM API in Java the way I have done with the ampersand (&) symbol (I have converted & to &amp;), however, they are a lot, and potentially could be any "special" symbol.

You don't need to anticipate all possible inputs. Instead, simply convert each such character to an NCR, or Numeric Character Reference. For example, &#x20AC; is the NCR for the Euro symbol €; 20AC is the hexadecimal value of the Euro symbol's Unicode code point (U+20AC).

John Feminella
Thanks, I could do so for the symbols I can identify, i.e. if I recognize that a German or French letter or the Euro symbol is used in a given place. However, the texts I am processing are written by different people (of different nationalities), and even though they write in English, they often include some words from their own languages, or simply international words that contain such non-English characters. In this sense I can expect anything, and I would like to know whether there is a way to recognize these characters in general.
Hmm, I think you may have misunderstood me, since I think this method does what you describe. Consider your input as a stream of characters. All you'd have to do is examine each character and determine whether or not it's "special". One way, for example, is simply to treat every Unicode character whose hex value is larger than 0xFF as special and encode it as an NCR.
John Feminella
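A minimal sketch of the character-by-character approach described above, in Java. The class and method names (NcrEncoder, encode) are hypothetical. Note one deliberate difference, which is an assumption of this sketch: it escapes everything above 0x7F rather than above 0xFF, so that "ç" (U+00E7) is also turned into an NCR. It only helps if the text was decoded correctly into a String in the first place; it does not repair bytes that were read with the wrong charset.

```java
import java.util.Locale;

final class NcrEncoder {
    // Hypothetical helper: replaces every code point above 0x7F (assumed
    // threshold) with a hexadecimal Numeric Character Reference.
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);            // handles surrogate pairs
            if (cp > 0x7F) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            } else {
                out.appendCodePoint(cp);             // plain ASCII stays as-is
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u escapes keep this source file itself pure ASCII.
        System.out.println(encode("fa\u00E7ade costs 5 \u20AC"));
        // prints: fa&#xE7;ade costs 5 &#x20AC;
    }
}
```

Markup characters such as & and < are left alone here; they still need the usual &amp; and &lt; treatment.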
+1  A: 

This does not seem to be an XML problem but an encoding problem. XML can handle both UTF-8 and Latin-1. But you either need to know the input encoding yourself, or you should hand the parser an InputStream rather than a Reader, so that it takes the encoding from the encoding attribute of the XML declaration (which then has to be correct).
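To illustrate the InputStream point, here is a small sketch, assuming the standard JAXP DocumentBuilder. The class name ParseFromStream, the method rootText, and the in-memory sample document are made up for the example. The bytes are Latin-1 and the declaration says so, and the parser is left to decode them itself.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ParseFromStream {
    // Hypothetical helper: parses raw bytes and returns the root element's text.
    // No Reader is involved, so the parser picks the charset from the declaration.
    static String rootText(byte[] xmlBytes) throws Exception {
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><w>fa\u00E7ade</w>";
        // Encode the sample as Latin-1 so the bytes match the declaration.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(latin1).equals("fa\u00E7ade"));
    }
}
```

For a file on disk the same idea applies with a FileInputStream, or by passing a java.io.File to parse. Wrapping the stream in a Reader with a guessed charset is what overrides the declaration.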

Are you sure the source is not corrupt? Which encoding is it actually in? Is the encoding attribute of the XML declaration in the first line correct? ^Z does not look like a UTF-8 sequence!
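One way to check whether a file really is UTF-8 is to decode its bytes strictly and see whether that fails. The following is only a sketch; Utf8Check and isValidUtf8 are hypothetical names, and the sample bytes are built in memory rather than read from the asker's file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String word = "fa\u00E7ade";
        // Latin-1 stores the cedilla as the single byte 0xE7, which is not valid UTF-8 here.
        System.out.println(isValidUtf8(word.getBytes(StandardCharsets.ISO_8859_1))); // false
        // UTF-8 stores it as the two bytes 0xC3 0xA7.
        System.out.println(isValidUtf8(word.getBytes(StandardCharsets.UTF_8)));      // true
    }
}
```

For a real file, the bytes could come from Files.readAllBytes. A false result would suggest the declaration says UTF-8 while the data was saved in some other encoding.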

Arne Burmeister
A: 

encoding="UTF-8" seems like the right way to go; with it you should not have to treat any of these characters differently. You said 'In the header of the XML file I am using encoding="UTF-8"', but are you actually writing the character data out as UTF-8 as well?
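On the writing side, a sketch of what "writing the character data out as UTF-8" could look like in Java: give the Writer an explicit charset instead of relying on the platform default. WriteUtf8 and write are hypothetical names, and the in-memory sink stands in for a FileOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class WriteUtf8 {
    // Hypothetical helper: encodes the text as UTF-8, regardless of the
    // platform default charset, and returns the resulting bytes.
    static byte[] write(String xml) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(xml);
        } // closing the writer flushes it into the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = write("<?xml version=\"1.0\" encoding=\"UTF-8\"?><w>fa\u00E7ade</w>");
        // Print the bytes in hex; the cedilla should appear as "c3 a7".
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

If the declaration says UTF-8 but the writer used a different charset, the declared and actual encodings disagree, which matches the symptoms in the question.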

In vim, I think you can use "ga" to show the code of the character under the cursor; this should help with debugging.

Matthew Wilson