I have an XML document as input to a Java function that parses it and produces an output. Somewhere in the XML there is the word "stratégie". The output is "stratgie". How should I parse the XML so that I get the "é" character as well?

I do not produce the XML myself; I get it as a response from a web service, and I am positive that "stratégie" is included in it as "strat&#233;gie". In the parser, I have:

public List<Item> GetItems(InputStream stream) {
    try {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(stream);
        doc.getDocumentElement().normalize();
        NodeList nodeLst = doc.getElementsByTagName("item");
        List<Item> items = new ArrayList<Item>();

        Item currentItem = new Item();
        Node node = nodeLst.item(0);
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element item = (Element) node;
            if (node.getChildNodes().getLength() == 0) {
                return null;
            }

            NodeList title = item.getElementsByTagName("title");
            Element titleElmnt = (Element) title.item(0);
            if (null != titleElmnt)
                currentItem.setTitle(titleElmnt.getChildNodes().item(0).getNodeValue());
            ....

Using the debugger, I can see that titleElmnt.getChildNodes().item(0).getNodeValue() is "stratgie" (without the é).

Thank you for your help.

A: 

You can either use UTF-8 and have the 'é' character itself in your document instead of &#233;, or you need a parser that understands the entity. Named entities such as &eacute; exist in HTML and XHTML, and maybe in other XML dialects, but not in pure XML: pure XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;. A numeric character reference such as &#233; is not a named entity, though, and any conforming XML parser should resolve it.

You may need to specify those special-character entities in your DTD or XML Schema (I don't know which one you use) and tell your parser about them.
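
To make the options concrete, here is a minimal sketch, assuming the stock JDK DOM parser. The class name EntityDemo and the helper titleOf are made up for illustration. It reads the title three ways: with the character written literally in a UTF-8 document, as a numeric character reference, and as a named entity declared in an internal DTD subset.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EntityDemo {

    // Hypothetical helper: parses the XML text (encoded as UTF-8)
    // and returns the text of the first <title> element.
    static String titleOf(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // getTextContent() concatenates every text node under <title>,
            // so it still works if the parser splits the text around an entity.
            return doc.getElementsByTagName("title").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // 1. The character itself, in a document declared as UTF-8.
        String literal = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<item><title>strat\u00e9gie</title></item>";
        // 2. A numeric character reference.
        String numeric = "<item><title>strat&#233;gie</title></item>";
        // 3. A named entity declared in an internal DTD subset.
        String named = "<!DOCTYPE item [<!ENTITY eacute \"&#233;\">]>"
                + "<item><title>strat&eacute;gie</title></item>";

        for (String xml : new String[] {literal, numeric, named}) {
            // Compare against the expected word instead of printing it,
            // so the console encoding cannot hide or fake a problem.
            System.out.println(titleOf(xml).equals("strat\u00e9gie"));
        }
    }
}
```

If the named entity is used without any declaration, the parse should fail with an error instead of silently dropping the character, which is one way to tell the two situations apart.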

p4bl0
I have updated my post with more details and code, thank you for helping.
Manu
dan04
@dan04 I wish I could upvote your comment more than once. Shame on me for having forgotten to mention this one...
p4bl0
A: 

I strongly suspect that either you're parsing it incorrectly or (rather more likely) it's just not being displayed properly. You haven't really told us anything about the code or how you're using the result, which makes it hard to give very concrete advice.

As ever with encoding issues, the first thing to do is work out exactly where data is getting lost. Lots of logging tends to be the way forward: create a small test case that demonstrates the problem (as small as you can get away with) and log everything about the data. Don't just try to log it as raw text: log the Unicode value of each character. That way your log will have all the information even if there are problems with the font or encoding you use to view the log.
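
As a rough sketch of that kind of logging (the class name CodePointLog and the method describe are just illustrative names), something like this prints the Unicode value of every character, so the log stays readable whatever font or encoding is used to view it:

```java
class CodePointLog {

    // Renders each character of s as its Unicode code point,
    // separated by spaces, e.g. "U+00E9" for 'é'.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints U+0073 U+0074 U+0072 U+0061 U+00E9 U+0067 U+0069 U+0065
        System.out.println(describe("strat\u00e9gie"));
        // prints U+0073 U+0074 U+0072 U+0061 U+0067 U+0069 U+0065
        System.out.println(describe("stratgie"));
    }
}
```

Calling describe on the value right after parsing, and again just before it is displayed, should show at which step U+00E9 disappears.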

Jon Skeet
A: 

The answer was here: http://www.yagudaev.com/programming/java/7-jsp-escaping-html

Manu