I'm using libcurl to fetch some HTML pages.

The HTML pages contain some character references like: סלקום

When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨

Is it the ISO-8859-1 encoding?

If so, how do I convert it to UTF-8 to get the correct word?

Thanks

EDIT: I found the solution. MSalters was right: libxml2 does use UTF-8.

I added this to eclipse.ini:

-Dfile.encoding=utf-8

and finally I got Hebrew characters on my Eclipse console. Thanks

A: 

No. Those entities give the decimal value of each character's Unicode code point. See this page for example.

You can therefore store the code points as integers and use an algorithm to transform those integers into UTF-8 multibyte sequences. See the UTF-8 specification for this.
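As a rough illustration of that transformation, here is a minimal sketch of a code-point-to-UTF-8 encoder, following the byte layout in RFC 3629. The function name `encode_utf8` is hypothetical and not part of libxml2 or any other library; the sketch only shows the bit shuffling involved.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for values that are not valid scalar values
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate: reject
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, the decimal value 1505 is U+05E1 (the letter ס), and `encode_utf8(1505)` yields the two bytes 0xD7 0xA1.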

Benoit
Not necessary, libxml2 already does that. That's how he got the non-ASCII characters in the first place.
MSalters
A: 

This answer was given on the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.


I would guess the encoding is UTF-16 or UCS-2. Specify this as the input encoding for iconv. There might also be an endianness issue; have a look here.


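The C-style way would be (error checking omitted for clarity; both buffers are assumed to be char*):

/* iconv_open takes (tocode, fromcode); iconv advances the pointers and counts it is given */
iconv_t ic = iconv_open("UTF-8", "UCS-2");
char *in = myUCS2_Text, *out = myUTF8_Text;
size_t inLeft = inputSize, outLeft = outputSize;
iconv(ic, &in, &inLeft, &out, &outLeft);
iconv_close(ic);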
stefaanv
Those should represent Hebrew characters, but right now I'm not getting them. What would be the correct way to use iconv?
embedded
+3  A: 

Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.

You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think that you got that? You get an xmlChar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.

To answer your last question, an xmlChar* is already UTF-8 and needs no further conversion.
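One way to check whether the bytes are correct, independent of how a console or debugger renders them, is to dump them as hex. The sketch below assumes only that the data is a NUL-terminated byte string; libxml2 declares xmlChar as unsigned char, so an xmlChar* can be passed directly. The helper name `hex_bytes` is hypothetical.

```cpp
#include <string>

// Hypothetical helper: render a NUL-terminated byte string as
// space-separated lowercase hex pairs, e.g. "d7 a1".
// A null pointer yields an empty string.
std::string hex_bytes(const unsigned char* s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (; s != nullptr && *s != 0; ++s) {
        if (!out.empty()) out += ' ';
        out += digits[*s >> 4];      // high nibble
        out += digits[*s & 0x0F];    // low nibble
    }
    return out;
}
```

If the string holds ס (U+05E1) in UTF-8, the dump should read d7 a1. Output such as ׳₪׳¨ is consistent with correct UTF-8 Hebrew bytes being displayed in a legacy code page such as Windows-1255, although that is an inference from the symptoms rather than something the question confirms.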

MSalters
I just printed out the xmlChar* using cout and got ׳₪׳¨׳˜׳ ׳¨. How can I print it correctly?
embedded
`std::cout` will use the current locale. If that's not a UTF-8 locale (quite likely), then it won't work at all. `std::wcout` can usually print Unicode, but it expects `wchar_t*`, not libxml2's `xmlChar*`.
MSalters
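If the `std::wcout` route is taken, the UTF-8 bytes first have to be decoded into wide characters. Here is a minimal sketch of such a decoder. The name `utf8_to_wide` is hypothetical, and the code assumes well-formed UTF-8: it does not reject overlong forms or encoded surrogates, and it simply stops at the first malformed or truncated sequence.

```cpp
#include <string>

// Hypothetical helper: decode a NUL-terminated UTF-8 byte string
// (for example an xmlChar*, which is unsigned char*) into a std::wstring.
// Assumes well-formed input; decoding stops early on a bad sequence.
std::wstring utf8_to_wide(const unsigned char* s) {
    std::wstring out;
    if (s == nullptr) return out;
    while (*s != 0) {
        unsigned long cp;
        int extra;  // number of continuation bytes that follow the lead byte
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else break;  // stray continuation byte or invalid lead byte
        ++s;
        for (; extra > 0; --extra, ++s) {
            // A NUL or any non-continuation byte here means truncation.
            if ((*s & 0xC0) != 0x80) return out;
            cp = (cp << 6) | (*s & 0x3F);
        }
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t (e.g. Windows): emit a UTF-16 surrogate pair.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

The result can then be written with `std::wcout << utf8_to_wide(text)`. Whether it displays correctly still depends on the platform: on many systems the stream's locale has to be set first (for instance with `std::locale::global(std::locale(""))`), and the terminal itself has to support the characters.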