Well, I give up. I've tried everything I could think of to retrieve data from a target website whose content is in a simplified Chinese encoding (charset=GB2312).

I've been using the simple_html_parser like always, but it doesn't return the Chinese characters. All I get are weird question marks inside a rhomboid shape (like so: "�������ѯ�ؼ��֣�").

Declaring the encoding in the PHP file didn't do anything except get rid of some unwanted characters showing at the start of the page.

By declaring it I mean:

header('Content-Type: text/html; charset=GB2312');

I can't get any data that's written in Chinese. I also tried file_get_contents, with the same result. I'm probably missing something obvious, since I can't find any related discussion elsewhere.

Thanks in advance.

+1  A: 

Get it in whatever character set the source uses, then convert it to something usable locally, such as UTF-8. Then send it to the browser.
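
A minimal sketch of those three steps, assuming the mbstring extension is available. The helper name gb2312_to_utf8 is made up for illustration, and the hard-coded bytes stand in for whatever file_get_contents (or a similar call) would return from the real page:

```php
<?php
// Hypothetical helper (the name is illustrative, not from any library):
// convert a GB2312-encoded string to UTF-8.
function gb2312_to_utf8(string $raw): string
{
    // Signature: mb_convert_encoding($string, $to_encoding, $from_encoding)
    return mb_convert_encoding($raw, 'UTF-8', 'GB2312');
}

// Step 1: get the page in its original charset. In real use this would be
// something like file_get_contents($url); here two GB2312-encoded
// characters stand in for the downloaded page.
$raw = "\xD6\xD0\xCE\xC4";

// Step 2: convert to UTF-8 for local processing.
$html = gb2312_to_utf8($raw);

// Step 3: tell the browser what the bytes now are, then send them.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo $html;
```

Any parsing (for example with an HTML parser) would then happen on the converted string rather than on the raw download.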

Ignacio Vazquez-Abrams
+2  A: 

Have you tried converting the encoding with mb_convert_encoding or iconv, e.g.

$str = mb_convert_encoding($content, 'UTF-8', 'GB2312');

or

$str = iconv("GB2312", "UTF-8//IGNORE", $content);
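
One thing that is easy to trip over is that the two functions take their encoding arguments in opposite orders: mb_convert_encoding expects the target encoding first, while iconv expects the source encoding first. A small side-by-side sketch, assuming both extensions are loaded and using hard-coded GB2312 bytes as a stand-in for $content:

```php
<?php
// Stand-in for the downloaded page: two characters encoded as GB2312.
$content = "\xD6\xD0\xCE\xC4";

// mbstring: ($string, $to_encoding, $from_encoding)
$viaMb = mb_convert_encoding($content, 'UTF-8', 'GB2312');

// iconv: ($from_encoding, $to_encoding, $string)
// //IGNORE asks iconv to drop characters it cannot represent in the target.
$viaIconv = iconv('GB2312', 'UTF-8//IGNORE', $content);

// iconv returns false on failure, so check before using the result.
if ($viaIconv === false) {
    $viaIconv = '';
}

// For valid input, both conversions should yield the same UTF-8 string.
var_dump($viaMb === $viaIconv);
```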
Gordon
This is why I love Stack Overflow. I would've eventually given up trying and forgotten about it, but this website is just incredible. Your first example worked great. I didn't try the second one; could you point out the difference? Thanks.
johnnyArt
@johnnyArt Well, they both basically do the same thing, but the `iconv` function is somewhat more configurable and supports more encodings than `mb_*` (afaik). As for the difference between the packages, I don't have much to offer: as far as I know, `iconv` is enabled by default in a standard PHP build, while `mbstring` has to be enabled explicitly, and I think I've read that `iconv` is a bit slower. I'd say it's like GD and ImageMagick: they are just two available packages. Actually, you might want to ask about the difference in a new question.
Gordon