While parsing some HTML files with libxml, the function xmlParseFile() reports that the input contains non-UTF-8 characters. How can I change the library's default charset to ISO-8859-1? Is there any other way to solve this?

PS: The entire project is based on libxml and works in most cases, so I can't switch to another library.

+1  A: 

The encoding used for XML data must be specified in the XML declaration in the document's prolog. If no encoding is declared (and there is no byte order mark indicating UTF-16), the W3C XML specification dictates that UTF-8 must be assumed.

Why are you using an XML parser to parse HTML data? libxml has an HTML parser that is separate from its XML parser. Look at htmlParseFile() and related functions. Since HTML is not XML, there is no XML prolog to indicate the data encoding. HTML does have a <meta> tag that can be used inside the <head> tag for that purpose, though. libxml's HTML parser looks for that tag to determine the encoding if one is not explicitly passed to htmlParseFile().

Remy Lebeau - TeamB