I am using Ruby to read a web page whose content is:

<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=GB2312" />
</HEAD>
<BODY>
中文
</BODY>
</HTML>

From the meta tag, we can see that it uses the GB2312 encoding.

My code is:

require 'net/http'

res = Net::HTTP.post_form(URI.parse("http://xxx/check"),
                          {:query => 'xxx'})

Then I use:

res.body.include?("中文")

to check whether the content contains that word, but it returns false.

I don't know why it returns false. What should I do? What encoding does Ruby 1.8.7 use? If I need to convert the encoding, how do I do it?

+1  A: 

Ruby 1.8 doesn't use encodings; it uses plain byte strings. If you want the byte string in your program to match the byte string in the web page, you'd have to save the .rb file in the same encoding the web page uses (GB2312) so that Ruby sees the same bytes.

It would probably be better to write the byte string explicitly, which avoids any dependence on the encoding of the .rb file:

res.body.include?("\xD6\xD0\xCE\xC4")

However, matching byte strings doesn't match characters reliably when multibyte encodings are in use (except for UTF-8, which is deliberately designed to allow it). If the web page had the string:

兄形男

in it, that would be encoded as "\xD0\xD6\xD0\xCE\xC4\xD0", which contains the byte sequence "\xD6\xD0\xCE\xC4", so include? would return true even though the characters 中文 are not present.
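
The false positive described above can be reproduced with a minimal sketch. It assumes Ruby 1.9 or later (for `force_encoding`), not the asker's 1.8.7, and the variable names are purely illustrative. Searching the raw bytes reports a match, whereas tagging both strings as GB2312 should make `include?` respect character boundaries, in line with what the comments report for 1.9:

```ruby
# GB2312 bytes of the page text 兄形男 and of the search term 中文,
# held as raw binary strings (roughly what Ruby 1.8 works with).
page_bytes   = "\xD0\xD6\xD0\xCE\xC4\xD0".dup.force_encoding('BINARY')
needle_bytes = "\xD6\xD0\xCE\xC4".dup.force_encoding('BINARY')

# Byte-level search: the needle is found straddling character boundaries.
byte_match = page_bytes.include?(needle_bytes)

# Character-level search: tag copies of both strings as GB2312 first.
page   = page_bytes.dup.force_encoding('GB2312')
needle = needle_bytes.dup.force_encoding('GB2312')
char_match = page.include?(needle)

puts "byte match: #{byte_match}"
puts "char match: #{char_match}"
```

The byte-level result is the one Ruby 1.8 would give; the second search is only possible where strings carry an encoding.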

If you need to handle non-ASCII characters fully reliably, you'd need a language with Unicode support.
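
Since byte matching is safe for UTF-8, one possible workaround is to convert the page to UTF-8 before searching. The following is only a sketch: `gb2312_to_utf8` is a hypothetical helper name, the Ruby 1.8 branch assumes the `iconv` standard library is available in your installation, and the Ruby 1.9+ branch assumes the bytes really are valid GB2312.

```ruby
# Hypothetical helper: convert raw GB2312 bytes to UTF-8.
def gb2312_to_utf8(bytes)
  if bytes.respond_to?(:encode)
    # Ruby 1.9+: tag a copy of the bytes as GB2312, then transcode.
    bytes.dup.force_encoding('GB2312').encode('UTF-8')
  else
    # Ruby 1.8: strings are untagged bytes, so fall back to Iconv
    # (assumed to be present in the standard library).
    require 'iconv'
    Iconv.conv('UTF-8', 'GB2312', bytes)
  end
end

page_utf8 = gb2312_to_utf8("\xD0\xD6\xD0\xCE\xC4\xD0") # GB2312 bytes of 兄形男
needle    = "\xE4\xB8\xAD\xE6\x96\x87"                 # UTF-8 bytes of 中文

puts page_utf8.include?(needle)
```

With the page converted, the search term is written as UTF-8 bytes, so the result no longer depends on how the .rb file itself is saved.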

bobince
Ruby 1.9 supports Unicode, right?
Grant Crofton
@Grant: Yep. Just tested this in 1.9 and it works as long as both strings have the `encoding` 'gb2312' set.
bobince
@bobince, nice to see you are online. I can't use Ruby 1.9 because something goes wrong when I read Chinese strings from MongoDB, but it works fine with Ruby 1.8.7.
Freewind
I just tried your code with Ruby 1.8.7, but it doesn't seem to work. The content is GB2312 and may contain the Chinese phrase `没被注册`, whose encoded form should be `\xc3\xbb\xb1\xbb\xd7\xa2\xb2\xe1` (right?). I use `content.include?("\xc3\xbb\xb1\xbb\xd7\xa2\xb2\xe1")` to check, but it always returns `false`.
Freewind
If the encoding of the HTML page really is GB2312 then yes, it should contain `\xc3\xbb\xb1\xbb\xd7\xa2\xb2\xe1`. If you `p` the HTML string, you can check what exact bytes are in there. `p` gives you octal, so you'd want to see `"...\303\273\261\273\327\242\262\341..."`.
bobince
@bobince, you are right! It works now. Thank you very much !!
Freewind
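
To compare the hex escapes with the octal output of `p` mentioned in the comments above, the bytes can be printed in octal directly. This is a small sketch; `octal_escapes` is a hypothetical helper name, not something from the thread.

```ruby
# Hypothetical helper: render each byte of a string as a three-digit
# octal escape, the notation Ruby 1.8's `p` uses for non-ASCII bytes.
def octal_escapes(str)
  str.unpack('C*').map { |byte| '\\%03o' % byte }.join
end

expected = "\xc3\xbb\xb1\xbb\xd7\xa2\xb2\xe1" # GB2312 bytes of 没被注册
puts octal_escapes(expected)
# prints \303\273\261\273\327\242\262\341
```

If this sequence does not appear in the `p` output of the response body, the page is probably not being served as GB2312.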