ansaurus

Question

In Ruby, I read a web page encoded as `GB2312`. How can I check whether the content contains a given keyword?

Answer 1

Ruby 1.8 doesn't use encodings; it uses plain byte strings. If you want the byte string in your program to match the byte string in the web page, you'd have to save the .rb file in the same encoding the web page uses (GB2312) so that Ruby will see the same bytes.

It is probably better to write the byte string explicitly, which avoids any dependence on the encoding of the .rb file:

res.include?("\xD6\xD0\xCE\xC4")

However, matching byte strings doesn't match characters reliably when multibyte encodings are in use (except for UTF-8, which is deliberately designed to allow it). If the web page had the string:

兄形男

in it, that would be encoded as "\xD0\xD6\xD0\xCE\xC4\xD0", which contains the byte sequence "\xD6\xD0\xCE\xC4", so `include?` would return true even though the characters 中文 are not present.

If you need to handle non-ASCII characters fully reliably, you'd need a language with Unicode support.
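As a rough sketch of the difference, assuming Ruby 1.9 or later (where strings carry an encoding), the snippet below runs both checks on the bytes from the example above. The helper names `byte_match?` and `char_match?` are made up for illustration; they are not part of Ruby or of the original code.

```ruby
# Bytes from the example above: the page text (three characters) and the
# keyword (two characters), both written as raw GB2312 bytes.
PAGE_BYTES    = "\xD0\xD6\xD0\xCE\xC4\xD0"
KEYWORD_BYTES = "\xD6\xD0\xCE\xC4"

# Hypothetical helper: compare raw bytes, which is what Ruby 1.8 does.
def byte_match?(haystack, needle)
  haystack.dup.force_encoding("BINARY").include?(needle.dup.force_encoding("BINARY"))
end

# Hypothetical helper: tag both strings as GB2312 so that include?
# only accepts a match that starts on a character boundary.
def char_match?(haystack, needle)
  haystack.dup.force_encoding("GB2312").include?(needle.dup.force_encoding("GB2312"))
end

puts byte_match?(PAGE_BYTES, KEYWORD_BYTES)  # prints true (false positive)
puts char_match?(PAGE_BYTES, KEYWORD_BYTES)  # prints false
```

The bytes themselves are never changed here; `force_encoding` only changes how Ruby interprets them, which is why both strings must carry the same encoding tag.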

bobince 2010-07-09 14:10:03

Ruby 1.9 supports Unicode, right?

Grant Crofton 2010-07-09 14:16:15

@Grant: Yep. Just tested this in 1.9 and it works as long as both strings have the `encoding` 'gb2312' set.

bobince 2010-07-09 15:28:15

@bobince, nice to see you are online. I can't use Ruby 1.9 because something goes wrong when I read Chinese strings from MongoDB, but it's OK with Ruby 1.8.7.

Freewind 2010-07-09 15:32:17

I just tried your code with Ruby 1.8.7, but it doesn't seem to work. The content is GB2312 and may contain the Chinese phrase `没被注册`, whose encoded string should be `\xc3\xbb\xb1\xbb\xd7\xa2\xb2\xe1` (right?). I use `content.include?("\xc3\xbb\xb1\xbb\xd7\xa2\xb2\xe1")` to check, but it always returns `false`.

Freewind 2010-07-09 15:33:21

If the encoding of the HTML page really is GB2312 then yes, it should contain `\xc3\xbb\xb1\xbb\xd7\xa2\xb2\xe1`. If you `p` the HTML string, you can check what exact bytes are in there. `p` gives you octal, so you'd want to see `"...\303\273\261\273\327\242\262\341..."`.
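To compare the two notations without converting by hand, here is a small sketch that prints a byte string as octal escapes. The helper name `octal_escapes` is hypothetical; `String#unpack` and `format` are standard Ruby.

```ruby
# The keyword bytes from the comment above, written with hex escapes.
KEYWORD = "\xc3\xbb\xb1\xbb\xd7\xa2\xb2\xe1"

# Hypothetical helper: render every byte as a three-digit octal escape,
# the notation that `p` uses in Ruby 1.8.
def octal_escapes(str)
  str.unpack("C*").map { |byte| format("\\%03o", byte) }.join
end

puts octal_escapes(KEYWORD)  # prints \303\273\261\273\327\242\262\341
```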

bobince 2010-07-09 15:44:17

@bobince, you are right! It works now. Thank you very much!

Freewind 2010-07-10 03:30:51
