Hi,
I have problems processing "special" characters in text with the DOM API in Java. The files are in XML format. A previous post explained how XML treats the ampersand (&) and a few other characters such as < and >. Here is the post: http://stackoverflow.com/questions/871963/special-characters-in-xml-files-processing-with-the-dom-api
However, what can I do about other special characters in the data, such as letters specific to German and French? For example, I have the word "façade" in a text element of the XML document, but the letter "ç" looks corrupted. When I open the file with the vim editor on Linux it shows "fa^Zade"; when I open it with another editor as a .txt or .xml file, the "ç" appears as a small empty rectangle (or an empty space). The same happens with German umlauts and other "special" symbols from other languages. These characters cause parsing errors when I process the files with an XML parser. I suppose this is an encoding problem. The header of the XML file declares encoding="UTF-8". I have tried changing it (e.g. to "Unicode" and others), but it doesn't help.
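To find out which bytes are actually stored where the "ç" should be, a small diagnostic like the one below might help. It is only a sketch: the class name `ByteInspector` and the method `suspiciousBytes` are made up for illustration, and it simply lists every byte that is not printable ASCII, tab, LF or CR together with its offset.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of any library.
class ByteInspector {

    // Returns "offset:0xHH" for each byte that is neither printable ASCII
    // nor one of the whitespace characters tab, LF, CR.
    static List<String> suspiciousBytes(byte[] data) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            boolean whitespace = b == 0x09 || b == 0x0A || b == 0x0D;
            boolean printableAscii = b >= 0x20 && b <= 0x7E;
            if (!whitespace && !printableAscii) {
                found.add(String.format(Locale.ROOT, "%d:0x%02X", i, b));
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ByteInspector <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (String entry : suspiciousBytes(data)) {
            System.out.println(entry);
        }
    }
}
```

As far as I understand, a "ç" stored as UTF-8 would show up as the two bytes 0xC3 0xA7, and as ISO-8859-1 it would be the single byte 0xE7. A value of 0x1A (which vim displays as ^Z) would be a control character, which would suggest the original letter was already lost when the file was written.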
How can I get these special characters recognized? Should I use a particular encoding?
If there were just two or three such characters, and I knew for sure which ones, I could replace them before processing with the DOM API in Java, the way I did with the ampersand symbol (I converted & to &amp;). However, there are many of them, and potentially any "special" symbol could occur.
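For reference, the ampersand replacement I mention is roughly the following. This is a sketch rather than my exact code: the class name `AmpersandEscaper` is made up, and the lookahead that skips ampersands which already start an entity or character reference is an assumption added so that already-escaped text is not escaped twice.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class AmpersandEscaper {

    // Matches an '&' that does NOT already begin a named entity (&name;),
    // a decimal reference (&#123;) or a hexadecimal reference (&#x1F;).
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String text) {
        return BARE_AMPERSAND.matcher(text).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("Tom & Jerry &amp; Co"));
    }
}
```

This kind of replacement only works for a handful of known characters, which is why I cannot apply the same approach to every accented letter.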
Does the problem come from the way the data was saved? For example, should a particular encoding have been used when saving, so that the characters would now be recognised? (I did not save the data myself.)
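To illustrate the kind of mismatch I suspect, here is a minimal sketch. It assumes (and this is only an assumption, I have not confirmed it for my files) that the data was written as ISO-8859-1 while the header claims UTF-8. The class name `EncodingSketch`, its methods and the sample document are made up for illustration. Parsing the bytes as declared fails, while wrapping the stream in a `Reader` with the real charset lets the DOM parser read the text, because the parser does not apply the declared encoding to a character stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demonstration class, not part of any library.
class EncodingSketch {

    // Sample document: the header claims UTF-8, but the bytes are ISO-8859-1,
    // so the c-cedilla is stored as the single byte 0xE7.
    static byte[] sampleLatin1Bytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><text>fa\u00E7ade</text>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Lets the parser decode the bytes using the declared encoding.
    // Returns false if parsing fails for any reason.
    static boolean parsesAsDeclared(byte[] data) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(data));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Decodes the bytes with an explicitly chosen charset before parsing,
    // and returns the text content of the root element.
    static String parseWithCharset(byte[] data, Charset charset) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset);
            Document doc = builder.parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = sampleLatin1Bytes();
        System.out.println("Parses as declared: " + parsesAsDeclared(data));
        System.out.println("Text read as ISO-8859-1: "
                + parseWithCharset(data, StandardCharsets.ISO_8859_1));
    }
}
```

If my files do turn out to be in a single known encoding, I could either read them this way or correct the encoding declaration in the header. If the characters were already replaced when the data was saved, I suppose no choice of encoding at parse time would bring them back.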
Thank you.