We at the company want to convert all the sites we are hosting from Latin-1 to UTF-8. After a lot of googling, we have our Perl script almost complete. The only thing still missing is the handling of XML files.

What is the best way to convert XML from Latin-1 to UTF-8 and is it useful?

I am asking because we are unsure about it, since most entries on Google explain how to do the exact opposite. Some even say that UTF-8 may cause problems with XML. Can you enlighten us on the whole XML encoding issue?

+2  A: 

What are you converting? The data or the XML tags or something else?

I think you just need to read it as Latin-1 and rewrite it as UTF-8, unless your source does something really weird. The decoding and encoding happen for you at the filehandle level. Once you have it in Perl, it's internally UTF-8 already.
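A minimal sketch of that filehandle-level approach, assuming plain text input that really is ISO-8859-1. The helper name latin1_to_utf8 and its arguments are made up for illustration. Note that it only re-encodes the bytes; an XML declaration that names the old encoding still has to be dealt with separately.

```perl
use strict;
use warnings;

# Hypothetical helper: re-encode a Latin-1 text file as UTF-8.
sub latin1_to_utf8 {
    my ($in_path, $out_path) = @_;

    # The :encoding layers decode on read and encode on write.
    open my $in, '<:encoding(ISO-8859-1)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";

    # Inside the loop, $line is a character string, not raw bytes.
    while (my $line = <$in>) {
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

Called as latin1_to_utf8('filename.txt.latin1', 'filename.txt'), this should behave much like the iconv command further down.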

What do you have so far? What problems are you having?

Is your situation too complicated to merely use xmllint?

 xmllint --encode utf8 --output filename.xml filename.xml.latin1

If you are using XML::Parser, see Juerd's Unicode Advice about that module.

If you are converting more than just XML files, iconv might help:

iconv -f ISO-8859-1 -t UTF-8 filename.txt.latin1 > filename.txt
brian d foy
That is not entirely correct! If you have an XML prologue like this one: <?xml version="1.0" encoding="latin1"?>, you have to modify or delete it once the document is encoded in UTF-8!
Johannes Weiß
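If the conversion is done by hand in Perl rather than with xmllint, the declaration could be patched along these lines. This is only a sketch: declare_utf8 is a made-up helper, it expects the whole document as an already decoded string with the declaration at the very start, and it leaves documents without an encoding declaration untouched (XML parsers assume UTF-8 in that case, absent a byte order mark).

```perl
use strict;
use warnings;

# Hypothetical helper: make the XML declaration say UTF-8.
sub declare_utf8 {
    my ($xml) = @_;

    # Replace the value of an existing encoding pseudo-attribute,
    # keeping whichever quote character the document used.
    $xml =~ s/\A(<\?xml\b[^>]*?\bencoding\s*=\s*)(["'])[^"']*\2/${1}${2}UTF-8${2}/;

    return $xml;
}
```

For example, declare_utf8 applied to a document starting with <?xml version="1.0" encoding="latin1"?> returns the same document with encoding="UTF-8" in the declaration.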
xmllint is a better solution than iconv, so I've updated my answer
brian d foy
+1  A: 

As brian mentioned, it's internally UTF-8 in Perl. Perl will convert it whether you want it or not.

The trickery is connected to the UTF8 flag, which is a bit flag attached to each string. For the data that XML::Parser returns, that UTF8 flag is set.
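To see the flag in action, here is a small sketch using the core Encode module and the built-in utf8::is_utf8 function. The sample string is made up; it stands in for Latin-1 bytes read from a raw file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "caf\xE9";                      # Latin-1 bytes, e.g. from a raw read
my $chars  = decode('ISO-8859-1', $octets);  # decoded character string
my $utf8   = encode('UTF-8', $chars);        # UTF-8 octets, ready to write out

# Report the flag state and the length of each string.
for my $pair ([ chars => $chars ], [ utf8 => $utf8 ]) {
    my ($name, $value) = @$pair;
    printf "%-5s flag=%s length=%d\n",
        $name, (utf8::is_utf8($value) ? 'on' : 'off'), length $value;
}
```

The decoded string has 4 characters, while the encoded one has 5 bytes, because the é takes two bytes in UTF-8.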

If you ever want to get rid of this behaviour, clear the UTF8 flag. One way to do it is like this:

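sub de_utf8 {
    my $bytes = $_[0];       # copy, so the caller's string is left untouched
    utf8::encode($bytes);    # turn characters into UTF-8 octets; this clears the UTF8 flag
    return $bytes;
}

This way, the resulting string holds the same UTF-8 byte data as the original string, just without the flag.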

EDIT: A bit off the topic of the OP... sorry.

kevchadders
+6  A: 

I'd use xmllint --encode utf8 FILE-NAME, for example:

xmllint --encode utf8 --output test.xml test.xml

will correctly convert test.xml (whatever encoding it may have) to UTF-8 including the XML prologue.

Johannes Weiß