So I have a Ruby script that parses HTML pages and saves the extracted strings into a DB, but I'm getting weird characters (usually question marks) instead of plain text.

Eg : ‘SOME TEXT’ instead of 'Some Text'

I've tried HTML entities and CGI::unescape, but to no avail. After some googling I also set $KCODE = 'u' and added require 'jcode', but it's still not working.

Any suggestions or pointers would be great.

Thanks

PS: using MySQL 5.1

+2  A: 

Is the DB that you're storing data in capable of handling Unicode? These symptoms seem to imply that it's not. For Unicode support under MySQL, please see this link.

It seems likely that the quotation marks in question are not the standard ASCII quotation marks but the Unicode ones.

Ruby has an iconv implementation to convert between encoding types. See here for more information.
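As a rough sketch of that conversion step: on newer Rubies (assuming 1.9 or later, where iconv is no longer needed) the built-in String#encode can do the same job. The helper name to_utf8 and the choice of Windows-1252 as the source encoding are just assumptions for illustration; use whatever encoding the scraped page actually declares.

```ruby
# Hypothetical helper (not from the original answer): re-encode text from a
# known source encoding to UTF-8. Assumes Ruby 1.9+ with String#encode.
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode(
    "UTF-8",
    invalid: :replace, # bytes that are invalid in the source encoding
    undef: :replace,   # characters with no UTF-8 equivalent
    replace: "?"
  )
end

# Bytes 0x91 and 0x92 are the curly single quotes in Windows-1252.
page_text = "\x91Some Text\x92"
puts to_utf8(page_text, "Windows-1252")
```

If the page is already UTF-8, no conversion is needed and the string can be stored as-is.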

Andrew Flanagan
but note that you only need to _convert_ encoding if the original page wasn't in UTF-8 and you want to store your copy in UTF-8
Alnitak
... or it was in UTF-8 and you don't want to store it in UTF-8.
Andrew Flanagan
+4  A: 

Your script is storing the UTF-8 encoded bytes of Unicode (curly) quotation marks, rather than ASCII quotation marks, in the database.

That's actually good: it shows that the DB itself is working fine, although for best results you should ensure that the table uses a UTF-8 character set with a matching collation (e.g. 'utf8_general_ci' or 'utf8_unicode_ci') so that string sorting works properly.

The fact that the output is displayed as "‘" just means that your terminal (and/or web page) output encoding is incorrect.
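To see where those three characters come from, here is a small sketch (assuming Ruby 1.9 or later) that takes the UTF-8 bytes of a left curly quote and deliberately misreads them as Windows-1252, which is one common way a viewer gets the encoding wrong. The variable names are only illustrative.

```ruby
# U+2018 LEFT SINGLE QUOTATION MARK, which UTF-8 encodes as bytes E2 80 98.
curly = "\u2018"

# Simulate a viewer that wrongly assumes Windows-1252: reinterpret the same
# bytes in that encoding, then convert to UTF-8 so the result is printable.
misread = curly.dup.force_encoding("Windows-1252").encode("UTF-8")

puts curly.bytes.map { |b| format("%02X", b) }.join(" ")
puts misread
```

The stored bytes are the same in both cases; only the interpretation differs, which is why fixing the output encoding is enough.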

If it's terminal output, make sure that the LANG environment variable (ENV['LANG'] in Ruby) is set to an appropriate UTF-8 locale (e.g. en_US.UTF-8) and that the terminal emulator itself is set the same way.
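A quick way to check this from inside the script might look like the following. It's only a sketch: utf8_locale? is a made-up helper, and the lookup order (LC_ALL, then LC_CTYPE, then LANG) reflects the usual POSIX precedence rather than anything specific to your setup.

```ruby
# Hypothetical helper: does a locale string such as "en_US.UTF-8" name UTF-8?
def utf8_locale?(locale)
  !!(locale.to_s =~ /utf-?8/i)
end

# Pick the first locale variable that is set and non-empty.
locale = %w[LC_ALL LC_CTYPE LANG].map { |key| ENV[key] }
                                 .find { |value| value && !value.empty? }

puts "locale: #{locale.inspect}, UTF-8: #{utf8_locale?(locale)}"
puts "default external encoding: #{Encoding.default_external}"
```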

If it's HTML output, make sure that the page encoding is set to UTF-8 as well, i.e.:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Alnitak