I'm opening a CSV file and reading values from it using File.open(filename).

So I do something like this:

my_file = File.open(filename)
my_file.each_line do |line|
 line_array = line.split("\t")
 ratio = line_array[1]
 puts "#{ratio}"
 puts ratio.isutf8?
end

The issue I'm having is that the values in line_array seem to be in a strange format. For example, one of the values in a cell of the CSV file is 0.86, but when I print it out it looks like " 0 . 8 6".

So it kind of behaves like a string but I'm not sure how it's encoded. When I try to do some introspection:

ratio.isutf8?
I get this:
=> undefined method 'isutf8?' for "\0000\000.\0008\0006\000":String

What the heck is going on?! How do I get ratio into a normal string that I can then call ratio.to_f on?

Thanks.

+2  A: 

Looks like your input data is encoded as UTF-16 or UCS-2.

Try something like this:

require 'iconv'

ratio = Iconv.conv('UTF-8', 'UTF-16', line_array[1])
puts "Ratio is now '#{ratio}'."

Come to think of it, you should probably run Iconv.conv on the whole line before calling split on it; otherwise there will be stray zero bytes at the end of the strings (unless you change your delimiter to "\000\t", which looks rather ugly). Note the double quotes: Ruby does not interpret \000 or \t inside single quotes.
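To illustrate the decode-before-split point, here is a minimal sketch. It assumes the data is UTF-16BE and uses String#encode (Ruby 1.9 and later) in place of Iconv, which is no longer in the standard library as of Ruby 2.0. The helper name decode_fields and the sample row are made up for the example.

```ruby
# Decode one raw line of bytes to UTF-8 first, then split it on tabs.
# source_encoding is an assumption; 'UTF-16BE' matches the sample below.
def decode_fields(raw_line, source_encoding = 'UTF-16BE')
  text = raw_line.dup.force_encoding(source_encoding).encode('UTF-8')
  text.chomp.split("\t")
end

# Hypothetical sample row, built in memory as raw UTF-16BE bytes.
raw_line = "name\t0.86\n".encode('UTF-16BE').force_encoding('BINARY')

# Splitting the raw bytes breaks the tab's two-byte code unit
# and leaves NUL bytes inside the fields.
p raw_line.split("\t")[1]

# Decoding before splitting yields a plain string that to_f understands.
ratio = decode_fields(raw_line)[1]
p ratio.to_f
```

If the file turns out to be little-endian, passing 'UTF-16LE' as the second argument should be the only change needed.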

Lars Haugseth
Awesome - let me try that. Will let you know how it goes!
+2  A: 

Unpacking a binary string is generally called decoding. It looks like your data is in UTF-16, but you should find out what encoding it is actually using (e.g. by investigating the workflow/configuration that produced it) before assuming this is true.
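One quick check, when you cannot ask whoever produced the file, is to look for a byte order mark at the start. The sketch below is only a heuristic: the helper name utf16_bom_encoding is made up, many UTF-16 files carry no BOM at all, and the bytes FF FE can also begin a UTF-32LE BOM.

```ruby
# Guess a UTF-16 byte order from the first two bytes of a binary string.
# Returns nil when no UTF-16 BOM is present; the encoding must then be
# determined some other way.
def utf16_bom_encoding(bytes)
  head = bytes.to_s.byteslice(0, 2).unpack('C*')
  case head
  when [0xFE, 0xFF] then 'UTF-16BE'
  when [0xFF, 0xFE] then 'UTF-16LE'
  end
end
```

It could be called as utf16_bom_encoding(File.binread(filename, 2)), which reads only the first two bytes of the file.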

In Ruby 1.9 (decode on the fly):

my_file = File.open(filename).set_encoding('UTF-16BE:UTF-8')
# the rest as in the original

In Ruby 1.8 (read in whole file, then decode and parse it; may not work for super large files):

require 'iconv'

# …

my_file = File.open(filename)
my_text = Iconv.conv('UTF-8', 'UTF-16BE', my_file.read)
my_text.each_line do |line|
 # the rest as in the original
end
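
For Ruby 1.9 and later, the external and internal encodings can also be given in the mode string passed to File.open, so the file is transcoded as it is read. This is a sketch under the assumption that the file is UTF-16BE and tab-separated as in the question; the method name read_ratios and the sample data are hypothetical.

```ruby
require 'tempfile'

# Read a tab-separated file and return the second column as floats.
# The mode string "rb:UTF-16BE:UTF-8" asks Ruby to transcode the
# UTF-16BE bytes to UTF-8 while reading.
def read_ratios(path, source_encoding = 'UTF-16BE')
  ratios = []
  File.open(path, "rb:#{source_encoding}:UTF-8") do |file|
    file.each_line do |line|
      fields = line.chomp.split("\t")
      ratios << fields[1].to_f
    end
  end
  ratios
end

# Hypothetical sample data written as UTF-16BE, standing in for the real file.
Tempfile.create('ratios') do |tmp|
  tmp.close
  File.binwrite(tmp.path, "a\t0.86\nb\t1.25\n".encode('UTF-16BE'))
  p read_ratios(tmp.path)
end
```

Because the file is read line by line, this variant avoids loading the whole file into memory.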
Chris Johnsen
Great response as well. Thanks!