I'm using this little bit of Ruby:

File.open(ARGV[0], "r").each_line do |line|
  puts "encoding: #{line.encoding}"
  line.chomp.split(//).each do |char|
    puts "[#{char}]"
  end
end

And I'm feeding in a sample file that just contains three periods and a newline.

When I save this file with a fileencoding of utf-8 (in vim: set fileencoding=utf-8) and run this script on it I get this output:

encoding: UTF-8
[]
[.]
[.]
[.]

And then if I change the fileencoding to latin1 (in vim: set fileencoding=latin1) and run the script, I don't get that first blank char:

encoding: UTF-8
[.]
[.]
[.]

What's going on here? I understand that saving as UTF-8 can put some bytes (a BOM) at the start of the file to mark it as UTF-8 encoded, but I thought they were supposed to be invisible when processing the text (i.e. the Ruby runtime was supposed to consume them). What am I missing?

btw:

ubuntu:~$ ruby --version
ruby 1.9.2p0 (2010-08-18 revision 29034) [i686-linux]

Thanks!

Update:

Hex dump of the file with the extra char (the BOM):

ubuntu:~$ hexdump new.board
0000000 bbef 2ebf 2e2e 0a0d 0a0d
000000a
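(Editor's note: hexdump's default output groups bytes into little-endian 16-bit words, so the word `bbef` above is really the bytes `EF BB` on disk. A small sketch that unswaps the dump back into byte order:)

```ruby
# hexdump prints 16-bit little-endian words by default, so each 4-digit
# group has its two bytes swapped. Unswap the dump shown above:
words = %w[bbef 2ebf 2e2e 0a0d 0a0d]
bytes = words.flat_map { |w| [w[2, 2], w[0, 2]] }.map { |h| h.to_i(16) }
puts bytes.map { |b| format("%02X", b) }.join(" ")
# => EF BB BF 2E 2E 2E 0D 0A 0D 0A
# i.e. the UTF-8 BOM, three periods, and two CRLF line endings.
```

(`hexdump -C` would show the bytes in file order directly.)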
+1  A: 

Try running

data = IO.read(ARGV[0])
puts data.dump

and see what you get. This will print the escape codes of any nonprinting characters.
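(Editor's note: a minimal illustration, using an in-memory string rather than the asker's file, of how `String#dump` makes a BOM visible:)

```ruby
# String#dump escapes non-ASCII and nonprinting characters, so a BOM
# shows up as an escape sequence instead of an invisible character.
s = "\u{feff}...\r\n"   # BOM + three periods + CRLF, as a UTF-8 string
puts s.dump             # the BOM is rendered as a \uFEFF-style escape
```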

It doesn't look like the UTF-8 byte order mark. If I set the BOM using `:set bomb` in vim on the file and try your code I get

[?]
[?]
[?]
[.]
[.]
[.]

while dump gives me

"\357\273\277...\n"

which is the octal representation of the BOM (EF BB BF in hex).

Scott Wales
The file saved with the utf8 fileencoding in vim has `"\u{feff}"` at the start of the data.dump output, and the latin1 file doesn't have anything.
Stewart Johnson
So that would suggest, I suppose, that Ruby is treating the files as UTF-8 regardless of what's on disk, and incorrectly processing the BOM? (i.e. treating it as a char rather than an encoding hint)
Stewart Johnson
@Stewart - that's U+FEFF, the byte order mark code point; I've no idea why that would be there.
Scott Wales
@Stewart I've only got Ruby 1.8, which I believe doesn't have Unicode support. Generally UTF-8 files won't have any byte order mark, as one somewhat defeats the purpose of UTF-8 being backwards compatible with programs expecting ASCII input. Try doing a hex dump of your input file. If it starts with `FEFF` or `FFFE` then vim is putting the wrong BOM onto your file; if it's `EFBBBF` then Ruby is doing something weird, possibly converting input to UTF-16.
Scott Wales
Using the `file` command returns this for the file with `\u{feff}` at the start: `UTF-8 Unicode (with BOM) text, with CRLF line terminators`. For the file without the BOM, `file` returns `ASCII text, with CRLF line terminators`. WTF.
Stewart Johnson
I've added the hexdump of the file with the BOM to the original question. Ruby is apparently losing its mind.
Stewart Johnson
I think the solution here may be to `set nobomb` in vim on the files I'm reading in, and then never speak of this again. Thanks for your help!
Stewart Johnson
@Stewart From your hexdump, the file starts with the UTF-8 BOM (EF BB BF), which Ruby 1.9 decodes as a regular character instead of stripping. The BOM is a nonprinting character (the zero-width no-break space), but if it's a problem I guess you'll have to manually delete it. Something like `line.sub!("\u{feff}", "")` should do (note the double quotes, so the escape is interpreted); you shouldn't have BOMs elsewhere in the file.
Scott Wales
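(Editor's note: pulling the thread's conclusion together, a sketch assuming Ruby 1.9+. It shows both the manual delete suggested above and the `"r:bom|utf-8"` open mode, which tells Ruby to consume a leading BOM itself. The temp file stands in for the asker's `new.board`:)

```ruby
require 'tempfile'

# Recreate a file like the asker's: UTF-8 BOM + three periods + CRLF.
bom_file = Tempfile.new("bom-example")
bom_file.binmode
bom_file.write("\xEF\xBB\xBF...\r\n")
bom_file.close

# Option 1: the "bom|utf-8" external encoding strips a leading BOM on open.
text = File.open(bom_file.path, "r:bom|utf-8", &:read)
puts text.dump   # no \uFEFF escape at the start

# Option 2: read as plain UTF-8 and delete the BOM manually. Double quotes
# matter: '\u{feff}' in single quotes is eight literal characters.
line = File.open(bom_file.path, "r:utf-8", &:read)
line.sub!("\u{feff}", "")
puts line.dump
```

Option 1 is the cleaner fix when you control the `File.open` call; option 2 works when the string is already in hand.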