ansaurus

Question

Answer 1

+2 A:

When you say "looks like this," are you viewing this value IRB? It's going to escape non-ASCII range characters with C-style escaping of the byte sequences that represent the characters.

If you print them with puts, you'll get them back as you expect, presuming your shell console is using the same encoding as the string in question (Apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a handle should also result in UTF-8 sequences.

If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're in Ruby 1.9 or 1.8.6.

For 1.9: http://blog.grayproductions.net/articles/ruby_19s_string for 1.8, you probably need to look at Iconv.

Also, if you need to interact with COM components in Windows, you'll need to tell ruby to use the correct encoding with something like the following:

require 'win32ole'

WIN32OLE.codepage = WIN32OLE::CP_UTF8

If you're interacting with mysql, you'll need to set the collation on the table to one that supports the encoding that you're working with. In general, it's best to set the collation to UTF-8, even if some of your content is coming back in other encodings; you'll just need to convert as necessary.

Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave explanation of that to someone else.

JasonTrue 2010-04-03 20:06:52

Hi Jason, Thanks so much for all the help. Got it working perfectly. I set my MySQL DB encoding to UTF-8 as well as my terminal profile.

Moe 2010-04-03 21:31:55

Answer 2

A:

Just to add a cross-reference, this SO page gives some related information:

http://stackoverflow.com/questions/2567029/how-to-make-nokogiri-transparently-return-un-encoded-html-entities-untouched

Greg 2010-04-04 00:45:47

Answer 3

A:

You need to convert the response from the website being scraped (here epicurious.com) into utf-8 encoding.

as per the html content from the page being scraped, its "ISO-8859-1" for now. So, you need to do something like this:

require 'iconv'
doc = Nokogiri::HTML(Iconv.conv('utf-8//IGNORE', 'ISO-8859-1', open(link).read))

Read more about it here: http://www.quarkruby.com/2009/9/22/rails-utf-8-and-html-screen-scraping

Nakul 2010-04-04 08:21:07

From the sample provided, it's clear that his content is already in UTF-8.

JasonTrue 2010-04-08 06:22:19

nope it isn't. else he would get ù only. the webpage is not utf-8 encoded

Nakul 2010-04-08 13:50:52

\303\271 are c-escaped UTF-8 byte values, which is how they appear in IRB when you look at an evaluated string; it's octal for C3 B9, which is the UTF-8 sequence for ù. If it were iso-8859-1, he would have gotten the octal for F9, or \371.

JasonTrue 2010-04-09 23:26:04

but then, why would it look like Ã¹ in mysql? As i understand its irb not able or display it utf-8, right?

Nakul 2010-04-13 10:36:59

That was a separate problem, which I explained in my answer. Mysql collation needs to be set for UTF-8 on the table that you're storing data in.IRB can display UTF-8 text on appropriate terminals, but it won't display evaluated expressions as UTF-8. It shows evaluated expressions as ASCII + Octal escaped sequences. ("puts" may will behave differently. See `puts "\001"` vs `"\001"` in irb for an example that isn't UTF-8 specific.)

JasonTrue 2010-04-21 05:53:26

Answer 4

A:

Try setting the encoding option of Nokogiri, like so:

require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
doc.encoding = 'utf-8'
title = doc.at_css("title")

Koen. 2010-07-31 15:50:00

ansaurus

tags:

views:

answers:

Nokogiri and Special Characters

related questions