views:

500

answers:

5

I am trying to create an XML document (an RSS feed) and have worked out all the kinks in it except for one character-encoding issue. The problem is that I declare a UTF-8 encoding, like so: <?xml version="1.0" encoding="UTF-8"?>, except the document itself is not actually encoded as UTF-8.

I am using the org.apache.ecs.xml package to create all the tags, and then doc.output(stream) to write the content. This method does not seem to write the output as UTF-8, and I don't know how to make that happen. Until I do, some symbols (the British pound sign is what I first noticed) aren't rendered properly in most readers.

--Updated with more information--

I ended up using a bad solution (as explained in the comments) to fix this problem. The correct answer seems to be: don't use the org.apache.ecs.xml library. Thank you all for the help. Stack Overflow wins again.

A: 

I'm not familiar with this package, but from the source on the web I suspect it may be broken:

http://kickjava.com/src/org/apache/ecs/xml/XMLDocument.java.htm

contains stuff like

for (int i = 0; i < prolog.size(); i++) {
    ConcreteElement e = (ConcreteElement) prolog.elementAt(i);
    e.output(out);
    // XXX really this should use line separator!
    // XXX should also probably check for pretty print
    // XXX also probably have difficulties with encoding

which suggests problems.

We use XOM (http://www.xom.nu), which specifically has a setEncoding() on its Serializer, so I would suggest changing packages...

peter.murray.rust
Unfortunately, I did see that, but I am hoping there is some sort of workaround. Regardless, thank you for the package suggestion.
UmYeah
A: 

Here is a function I wrote to convert all non-ASCII characters to their corresponding numeric entities. It might help you sanitize some PCDATA content before output.

/**
 * Replaces every non-ASCII character in the given String with its
 * corresponding numeric XML entity (e.g. '£' becomes &#x00A3;).
 */
public static String xmlEntitify(String in) {

    StringBuilder b = new StringBuilder(in.length());

    for (int i = 0; i < in.length(); i++) {

        char c = in.charAt(i);
        if (c < 128) {
            // plain ASCII: copy as-is
            b.append(c);
        }
        else if (c == '\ufeff') {
            // BOM character, just remove it
        }
        else {
            // zero-pad the hex code point to four digits
            String cstr = Integer.toHexString(c).toUpperCase();
            while (cstr.length() < 4) {
                cstr = "0" + cstr;
            }
            b.append("&#x").append(cstr).append(';');
        }
    }
    return b.toString();
}

Read your input stream into a String content, and write xmlEntitify(content) into the output stream.

Your output is guaranteed to contain only ASCII characters, so no more encoding problems.

UPDATE

Given the comments, I'll be even bolder: if you are not sanitizing your data, you are asking for trouble. I guess you are at least already replacing the < and & characters in your PCDATA; if not, you definitely should. I have another version of the above method which, instead of the first if, has:

if (c < 128 && c != '&' && c != '<' && c != '>' && c != '"') {
    b.append(c);
}

so that these characters are also converted to their corresponding numeric entities. This converts all of my PCDATA to Unicode-friendly, ASCII-only strings. I have had no more encoding problems since I started using this technique. I never output XML PCDATA that has not been passed through this method: this is not sweeping the elephant under the carpet, it is getting rid of the problem by being as generic as possible.
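For completeness, a self-contained sketch of this stricter variant might look like the following (the class name XmlEntitifier is mine, not from the original; the logic combines the method above with the stricter if):

```java
/**
 * Escapes every non-ASCII character, plus &, <, > and ", as a
 * zero-padded numeric XML entity; strips any BOM characters.
 */
public final class XmlEntitifier {

    public static String xmlEntitify(String in) {
        StringBuilder b = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c < 128 && c != '&' && c != '<' && c != '>' && c != '"') {
                // safe ASCII: copy as-is
                b.append(c);
            } else if (c == '\ufeff') {
                // BOM character, just remove it
            } else {
                // everything else becomes a zero-padded hex entity
                b.append(String.format("&#x%04X;", (int) c));
            }
        }
        return b.toString();
    }
}
```

For example, xmlEntitify("<£>") yields "&#x003C;&#x00A3;&#x003E;". Note that this variant, like the original, works character by character, so text containing surrogate pairs (code points outside the Basic Multilingual Plane) would be split into two surrogate entities.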

subtenante
This solves the wrong problem. He needs to UTF-8-encode the output stream, which is VERY different from substituting character entities for non-ASCII data. Those character entities will still point to the Latin-1 code points, not the requisite UTF-8 code points.
Jim Garrison
As Jim wrote (and my colleague pointed out to me), this is simply covering up the problem. It became my temporary solution because I needed a quick fix, but when I have time I will go back and rewrite my code, because it is simply wrong.
UmYeah
Haha, so great. I'm downvoted for the only answer that brings something so far. I love you, pals. @Jim: I know I did not answer the question in the desired way. If someone comes up with a better fix, I'll be glad to upvote it and use it in my own code. So far, sanitizing the PCDATA has always been the best way for me, and it works in all cases. @UmYeah: when you have only ASCII characters, your text is valid UTF-8. You have just changed the way the extended characters are referred to, and left to the client the responsibility of rendering them.
subtenante
@Jim: "Those character entities will still point to the Latin1 code points, not the requisite UTF-8 code points." => Wrong: they will be written as Unicode entities. Latin-1 and UTF-8 are encodings; with my solution you bypass the encoding problem entirely and make the data as systematic as possible, using only Unicode code points rather than a specific encoding. This way, it doesn't even matter what you put in the encoding attribute of your XML declaration.
subtenante
@subtenante - If i had the rep I would upvote your solution because it was exactly what I needed and helped me understand my problem a little better. Thank you for the help.
UmYeah
+1  A: 

The simplest workaround is probably to change your code as follows:

XMLDocument doc = new XMLDocument(1.0,false,Charset.defaultCharset().toString());

I'm guessing the library just uses the platform default encoding to write characters to the stream. So pass the default encoding to the prologue, so the declaration matches the actual bytes, and you should be fine.

I'll agree with the other posters that this is probably the least of your worries. Looking at the source repository for ECS, it doesn't appear to have been updated in four years (likewise the "ECS2" repository).

And some self-promotion: if you're looking to build XML documents using a simple interface, the Practical XML library has a builder. It uses the standard JDK serialization mechanism for output.

kdgregory
`Charset.defaultCharset()` returns the platform-specific default charset, which may not be the same as the XML file's declared encoding, and may not be a Unicode-based charset at all, such as `CP-1252` (ouch) or `ISO-8859-x`. You don't want that. You need to know the actual encoding of the XML file beforehand.
BalusC
If you had read the question a little more closely, you would see that the OP is actually *producing* an XML file, not consuming one. And if you had read my response a little more closely, you would have seen that my rationale for using `Charset.defaultCharset()` in the prologue was that the 3rd-party library (Jakarta ECS) appeared to be using it.
kdgregory
+1  A: 

Any chance you can write to a Writer rather than an OutputStream? That way you could specify the encoding.
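A minimal sketch of that idea (class and method names here are mine, for illustration): wrap the OutputStream in an OutputStreamWriter constructed with an explicit charset, so the Writer, not the platform default, decides how characters become bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterSketch {

    // Writes the given XML text through a Writer with an explicit
    // UTF-8 charset and returns the resulting bytes.
    public static byte[] writeUtf8(String xml) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With this approach a pound sign is written as the two UTF-8 bytes 0xC2 0xA3, regardless of the platform's default encoding.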

cjstehno
+1  A: 

Here is a solution my co-worker came up with that I THINK is the correct way to do it, but what do I know. Instead of using doc.output(stream), we used:

try {
    IOUtils.write(doc.toString(), stream, "UTF-8");
} catch (IOException e) {
    throw new RuntimeException(e);
}

To be honest, I don't completely understand the problem yet, which is why I am having problems in the first place. It seems that @subtenante's solution converted any character that plain ASCII could not represent into a Unicode entity, whereas this solution writes to the stream using the UTF-8 encoding, as I originally wanted doc.output to do. I don't know the exact difference, only that both solved my problem. Any further comments that help me understand the problem would be appreciated.
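For what it's worth, if commons-io is not on the classpath, the same effect can be had with the JDK alone (the class and method names below are mine; doc.toString() would be the ECS document text): encode the String to UTF-8 bytes explicitly, then write the bytes.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8Write {

    // JDK-only equivalent of IOUtils.write(s, stream, "UTF-8"):
    // the String is converted to UTF-8 bytes before hitting the stream,
    // so the platform default encoding never gets involved.
    public static void writeUtf8(String s, OutputStream stream) {
        try {
            stream.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The key point in both versions is the explicit "UTF-8" argument: it is the String-to-bytes conversion that must be pinned to UTF-8 to match the encoding declared in the XML prolog.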

UmYeah
This solution seems really OK, provided you have access to the commons-io library. My solution has the advantage of making the output encoding-independent, because it contains only pure ASCII. This solution uses UTF-8 and encodes extended characters the right way, as defined in your encoding attribute. The main difference in the result is that your method encodes extended characters in 2 or 3 bytes, whereas mine needs 8 bytes for each. But XML is verbose anyway. :)
subtenante