I'm using libcurl to fetch HTML pages.

I have some problems with Hebrew characters.

For example, the word סלקום comes back as gibberish.

How do I get Hebrew characters and not gibberish?

Do I need some HTML decoder?

Does libcurl support such operation?

Does libiconv support such operation?

I appreciate any help.

Thanks

A: 

Currently it runs on Windows, and I have Hebrew support enabled on it.

Embedded
When providing additional information about a question, **edit** the question instead of posting the additional information as an **answer**.
David Dorward
+1  A: 

Edit: OK, so what you’re seeing is UTF-8 data being decoded as Windows-1255 (so the numeric character references were a red herring). Here’s a demonstration in Python:

>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
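For reference, here is the same demonstration in Python 3 (the snippet above is Python 2), where strings are Unicode by default:

```python
# Python 3 version of the demonstration above: build the Hebrew word
# from its Unicode code points, encode it as UTF-8, then (mis)decode
# the resulting bytes as Windows-1255 to reproduce the mojibake.
u = "".join(map(chr, [1505, 1500, 1511, 1493, 1501]))  # 'סלקום'
s = u.encode("utf-8")
print(s.decode("cp1255", "replace"))
```

Bytes that are undefined in Windows-1255 (such as 0x9C and 0x9D) come out as the U+FFFD replacement character, which is why some positions show � rather than a printable character.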

The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.

An HTML document typically contains a tag in its head, like <meta charset=utf-8>, to indicate to the browser what its encoding is. A document served by a web server additionally carries an HTTP header like Content-Type: text/html; charset=utf-8.

You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
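The shape of that pipeline can be sketched in Python (in C you would read the header via libcurl’s CURLINFO_CONTENT_TYPE and convert with iconv()); the function names and the UTF-8 fallback default here are illustrative assumptions, not anything libcurl provides:

```python
# Sketch: pull the charset parameter out of a Content-Type header value
# and use it to decode the raw body bytes into text.
def parse_charset(content_type, default="utf-8"):
    """Return the charset of a Content-Type value, e.g.
    'text/html; charset=windows-1255' -> 'windows-1255'."""
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            return value.strip('"').lower() or default
    return default

def decode_body(body_bytes, content_type):
    # Decode with the declared charset; 'replace' keeps going on any
    # byte sequences that are invalid in that encoding.
    return body_bytes.decode(parse_charset(content_type), "replace")
```

With the raw bytes from libcurl and the Content-Type header in hand, decode_body(raw, "text/html; charset=windows-1255") yields proper Hebrew text instead of mojibake.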

(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)

jleedev
I'm getting characters like: ׳¡׳?׳§׳•׳? I need to work on those characters.
embedded
And characters like: ׳₪׳¨׳˜׳ ׳¨
embedded
@embedded Aha! That’s exactly what I needed.
jleedev