If I convert a UTF-8-encoded XML document (which has an XML prolog declaring the encoding to be UTF-8) to Latin-1 using xmllint, will there be any data loss?

xmllint --encode iso-8859-1 --output test-latin1.xml test-utf8.xml

(The data will eventually be displayed as ISO-8859-1-encoded HTML.)

A: 

I converted it back to UTF-8 and the file seems to be identical to the original, so it looks like it's OK.

xmllint --encode utf-8 --output test-utf8-post.xml test-latin1.xml
Alf Eaton
+1  A: 

Whether there is data loss depends on the contents of the file. If all of its characters belong to the ISO-8859-1 subset, it'll be OK. If it contains other characters, e.g. from the Cyrillic alphabet or Old Italian, you will lose them. xmllint indicates that (with an error code).
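
One way to check a file's contents before converting is to look for characters that ISO-8859-1 cannot represent. The following is a minimal Python sketch, not part of xmllint; the helper name `non_latin1_chars` is hypothetical, and it relies only on the fact that ISO-8859-1 covers exactly the code points U+0000 through U+00FF.

```python
def non_latin1_chars(text):
    """Return (index, character) pairs that ISO-8859-1 cannot represent.

    Hypothetical helper for illustration. ISO-8859-1 maps one-to-one onto
    the Unicode code points U+0000..U+00FF, so anything above U+00FF has
    no Latin-1 byte.
    """
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFF]


# "caf\u00e9" is "café"; \u0416 is the Cyrillic capital letter ZHE.
sample = "caf\u00e9 \u0416"
print(non_latin1_chars(sample))
```

An empty list means every character has a Latin-1 equivalent. To check a whole document, read it with `open(path, encoding="utf-8")` and pass in the resulting string.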

TGV
Will those non-iso-8859-1 characters actually get lost, or will they be replaced by numeric character entities?
Alf Eaton
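
To illustrate what "replaced by numeric character entities" would look like, here is a small Python sketch. It shows the behaviour of Python's own `xmlcharrefreplace` error handler, not xmllint's, so whether xmllint does the same should be verified on your own data. The function names are hypothetical.

```python
import html


def to_latin1_with_charrefs(text):
    """Encode to ISO-8859-1, writing unrepresentable characters as &#NNNN;.

    Hypothetical helper; uses Python's built-in 'xmlcharrefreplace' handler.
    """
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def from_latin1_with_charrefs(data):
    """Decode ISO-8859-1 bytes and expand character references again.

    html.unescape also expands named entities, which is fine for a sketch
    but is not what a real XML parser would do.
    """
    return html.unescape(data.decode("iso-8859-1"))


# \u00ef is "ï" (representable in Latin-1); \u0416 is Cyrillic ZHE (not).
original = "na\u00efve \u0416"
encoded = to_latin1_with_charrefs(original)
print(encoded)
print(from_latin1_with_charrefs(encoded) == original)
```

With this approach nothing is lost: the out-of-range character becomes a reference, and an XML parser reading the Latin-1 file sees the same character again.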
+2  A: 

There will be a problem if there are any Unicode characters outside Latin-1 in your original XML file. But I suspect xmllint will detect that and refuse to do the translation.

The only case I can think of where you might get interesting conversions is if the file contains accented characters: Unicode has multiple ways of representing them, which might all be mapped to the single representation in Latin-1.
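
The "multiple ways of representing them" point can be shown with a short Python sketch. It uses Python's `unicodedata` module and Latin-1 codec, so it says nothing definite about how xmllint treats decomposed characters. The helper name `encodable_as_latin1` is hypothetical.

```python
import unicodedata


def encodable_as_latin1(text):
    """Return True if every character in text has an ISO-8859-1 byte."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


composed = "\u00e9"     # e-acute as one precomposed code point
decomposed = "e\u0301"  # plain "e" followed by COMBINING ACUTE ACCENT

print(encodable_as_latin1(composed))
print(encodable_as_latin1(decomposed))

# NFC normalization folds the decomposed pair into the precomposed form.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
print(encodable_as_latin1(normalized))
```

In Python, at least, the decomposed form is not mapped to the Latin-1 character automatically; the text has to be normalized to NFC first.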

Douglas Leeder