I'm using this little bit of Ruby:

File.open(ARGV[0], "r").each_line do |line|
  puts "encoding: #{line.encoding}"
  line.chomp.split(//).each do |char|
    puts "[#{char}]"
  end
end

And I'm feeding in a sample file that just contains three periods and a newline.

When I save this file with a fileencoding of utf-8 (in vim: set fileencoding=utf-8) and run this script on it I get this output:

encoding: UTF-8
[]
[.]
[.]
[.]

And then if I change the fileencoding to latin1 (in vim: set fileencoding=latin1) and run the script, I don't get that first blank char:

encoding: UTF-8
[.]
[.]
[.]

What's going on here? I understand that saving as UTF-8 can put some bytes (a BOM) at the start of the file to mark it as UTF-8 encoded, but I thought they were supposed to be invisible when processing the text (i.e. the Ruby runtime was supposed to consume them). What am I missing?

btw:

ubuntu:~$ ruby --version
ruby 1.9.2p0 (2010-08-18 revision 29034) [i686-linux]

Thanks!

Update:

Hex dump of the file with the extra char (the BOM):

ubuntu:~$ hexdump new.board
0000000 bbef 2ebf 2e2e 0a0d 0a0d
000000a
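(Editor's note: hexdump's default output groups bytes into little-endian 16-bit words, so the word `bbef` above is really the bytes `EF BB` on disk. A small sketch that unswaps the dump back into byte order:)

```ruby
# hexdump prints 16-bit little-endian words by default, so each 4-digit
# group has its two bytes swapped. Unswap the dump shown above:
words = %w[bbef 2ebf 2e2e 0a0d 0a0d]
bytes = words.flat_map { |w| [w[2, 2], w[0, 2]] }.map { |h| h.to_i(16) }
puts bytes.map { |b| format("%02X", b) }.join(" ")
# => EF BB BF 2E 2E 2E 0D 0A 0D 0A
# i.e. the UTF-8 BOM, three periods, and two CRLF line endings.
```

(`hexdump -C` would show the bytes in file order directly.)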
+1  A: 

Try running

data = IO.read(ARGV[0])
puts data.dump

and see what you get. This will print the escape codes of any nonprinting characters.
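(Editor's note: a minimal illustration, using an in-memory string rather than the asker's file, of how `String#dump` makes a BOM visible:)

```ruby
# String#dump escapes non-ASCII and nonprinting characters, so a BOM
# shows up as an escape sequence instead of an invisible character.
s = "\u{feff}...\r\n"   # BOM + three periods + CRLF, as a UTF-8 string
puts s.dump             # the BOM is rendered as a \uFEFF-style escape
```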

It doesn't look like the UTF-8 byte order mark. If I set the BOM using `:set bomb` in vim on the file and try your code I get

[?]
[?]
[?]
[.]
[.]
[.]

while dump gives me

"\357\273\277...\n"

which is the octal representation of the BOM (EF BB BF in hex).

Scott Wales
The file saved with the utf8 fileencoding in vim has `"\u{feff}"` at the start of the data.dump output, and the latin1 file doesn't have anything.
Stewart Johnson
So that would suggest, I suppose, that Ruby is treating the files as UTF-8 regardless of what's on disk, and incorrectly processing the BOM? (i.e. treating it as a char rather than an encoding hint)
Stewart Johnson
@Stewart - that's U+FEFF, the byte order mark code point; I've no idea why that would be there.
Scott Wales
@Stewart I've only got Ruby 1.8, which I believe doesn't have Unicode support. Generally UTF-8 files won't have any byte order mark, as one somewhat defeats the purpose of UTF-8 being backwards compatible with programs expecting ASCII input. Try doing a hex dump of your input file. If it starts with `FEFF` or `FFFE` then vim is putting the wrong BOM onto your file; if it's `EFBBBF` then Ruby is doing something weird, possibly converting input to UTF-16.
Scott Wales
Using the `file` command returns this for the file with `\u{feff}` at the start: `UTF-8 Unicode (with BOM) text, with CRLF line terminators`. For the file without the BOM, `file` returns `ASCII text, with CRLF line terminators`. WTF.
Stewart Johnson
I've added the hexdump of the file with the BOM to the original question. Ruby is apparently losing its mind.
Stewart Johnson
I think the solution here may be to `set nobomb` in vim on the files I'm reading in, and then never speak of this again. Thanks for your help!
Stewart Johnson
@Stewart From your hexdump, the file starts with the UTF-8 BOM (EF BB BF), which Ruby 1.9 decodes as a regular character instead of stripping. The BOM is a nonprinting character (the zero-width no-break space), but if it's a problem I guess you'll have to manually delete it. Something like `line.sub!("\u{feff}", "")` should do (note the double quotes, so the escape is interpreted); you shouldn't have BOMs elsewhere in the file.
Scott Wales
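(Editor's note: pulling the thread's conclusion together, a sketch assuming Ruby 1.9+. It shows both the manual delete suggested above and the `"r:bom|utf-8"` open mode, which tells Ruby to consume a leading BOM itself. The temp file stands in for the asker's `new.board`:)

```ruby
require 'tempfile'

# Recreate a file like the asker's: UTF-8 BOM + three periods + CRLF.
bom_file = Tempfile.new("bom-example")
bom_file.binmode
bom_file.write("\xEF\xBB\xBF...\r\n")
bom_file.close

# Option 1: the "bom|utf-8" external encoding strips a leading BOM on open.
text = File.open(bom_file.path, "r:bom|utf-8", &:read)
puts text.dump   # no \uFEFF escape at the start

# Option 2: read as plain UTF-8 and delete the BOM manually. Double quotes
# matter: '\u{feff}' in single quotes is eight literal characters.
line = File.open(bom_file.path, "r:utf-8", &:read)
line.sub!("\u{feff}", "")
puts line.dump
```

Option 1 is the cleaner fix when you control the `File.open` call; option 2 works when the string is already in hand.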