I'm using libxml2. All of its functions work with xmlChar*, and I found that xmlChar is an unsigned char.

So I have some questions about how to work with it.

1) For example, if I am working with a UTF-16 or UTF-32 file, how does libxml2 process it and return xmlChar from its functions? Will I lose some characters?

2) If I want to do something with this string, should I cast it to char* or wchar_t*, and how?

Will I lose some characters?

A: 

xmlChar holds UTF-8 encoded text only.

So, to answer your questions:

  1. No, you won't lose any characters if using UTF-16 or UTF-32. Just use iconv or any other library to encode your UTF-16 or UTF-32 data as UTF-8 before passing it to the API.

  2. Do not just "cast" the string. Convert it to some other encoding if needed.
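
To illustrate why a cast is not a conversion, here is a minimal sketch that decodes UTF-8 bytes (the same layout as an xmlChar string) into code points. The helper name `utf8_decode` is hypothetical and not part of libxml2, and the sketch deliberately skips checks for overlong sequences and surrogate values. One character can occupy one to four bytes, so reinterpreting the buffer as `wchar_t*` would simply read the wrong values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libxml2 function): decode a NUL-terminated
 * UTF-8 byte string into Unicode code points.
 * Returns the number of code points written, or -1 if the input is
 * malformed or `out` is too small. Overlong forms are NOT rejected. */
long utf8_decode(const unsigned char *s, uint32_t *out, size_t outcap)
{
    assert(s != NULL && out != NULL);
    size_t n = 0;
    while (*s) {
        uint32_t cp;
        int extra;  /* number of continuation bytes that must follow */
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return -1;               /* stray continuation or invalid lead byte */
        s++;
        for (int i = 0; i < extra; i++, s++) {
            /* A NUL here fails the check too, so we never read past the end. */
            if ((*s & 0xC0) != 0x80) return -1;
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
        }
        if (n == outcap) return -1;   /* output buffer full */
        out[n++] = cp;
    }
    return (long)n;
}
```

A string such as "hé" followed by an emoji is seven bytes long but decodes to only three code points, which is the gap a plain cast would hide.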

Pablo Santa Cruz
Thank you, but now I have some more questions. How does it work? Even if I feed it a UTF-16 file, libxml2 still returns unsigned char*. Why, and how does that work? Second, how can I convert UTF-32 or UTF-16 to UTF-8? I don't want to use third-party libraries, and I need to do it under Unix. I know that Windows has the function WideCharToMultiByte; does Unix have something like that? And last, how can I convert xmlChar to another encoding, and to which one?
Nikita
Yes. The thing is that the API does some internal conversions. All calls are `xmlChar` based, even when the files or network feeds you parse the XML from are encoded in a different charset. On Unix, use libiconv. It's a pretty common library and, if I recall correctly, it already comes bundled with libxml2. To convert xmlChar to another encoding, again, use libiconv. Regards...
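
As a rough sketch of the iconv route on Unix, the following converts a buffer from a named encoding to UTF-8 using the POSIX `iconv_open`/`iconv`/`iconv_close` calls. The helper name `to_utf8` and its buffer-based interface are assumptions made for illustration, and the encoding name "UTF-16LE" is the spelling glibc and GNU libiconv accept; other iconv implementations may differ.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `inlen` bytes of text in encoding `from`
 * (an iconv encoding name such as "UTF-16LE") to UTF-8.
 * Writes at most outcap-1 bytes plus a terminating NUL into `out`.
 * Returns the number of UTF-8 bytes written, or -1 on any failure
 * (unknown encoding, invalid input, or output buffer too small). */
long to_utf8(const char *from, const char *in, size_t inlen,
             char *out, size_t outcap)
{
    assert(from != NULL && in != NULL && out != NULL);
    if (outcap == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", from);   /* (to, from) order */
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;       /* iconv's prototype takes a non-const char ** */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap - 1;  /* keep one byte for the NUL terminator */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return (long)(outp - out);
}
```

The resulting buffer can then be handed to functions that expect UTF-8. Going the other way (UTF-8 to another encoding) would only need the two names passed to `iconv_open` swapped.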
Pablo Santa Cruz
And one more question. Why did you say that I should first encode UTF-16 before feeding it to libxml2? I've just tried it without converting, then applied the xmlCheckUTF8 function to every element returned by libxml2, and it was OK. I guess that unsigned char* is just a sequence of bytes ...
Nikita
No. I said (at least I tried to anyway :-) ) that you should encode your UTF-16 data into UTF-8 **before** feeding it to the API if you are getting the UTF-16 encoded data from somewhere else...
Pablo Santa Cruz