views:

2090

answers:

1

I'm consuming a data feed that has recently added a Unicode BOM header (U+FEFF), and my rake task is now messed up by it.

I can skip the first 3 bytes with file.gets[3..-1] but is there a more elegant way to read files in Ruby which can handle this correctly, whether a BOM is present or not?

+5  A: 

I wouldn't blindly skip the first three bytes; what if the producer stops adding the BOM again? What you should do is examine the first few bytes, and if they're 0xEF 0xBB 0xBF, ignore them. That's the form the BOM character (U+FEFF) takes in UTF-8; I prefer to deal with it before trying to decode the stream because BOM handling is so inconsistent from one language/tool/framework to the next.

In fact, that's how you're supposed to deal with a BOM. If a file has been served as UTF-16, you have to examine the first two bytes before you start decoding so you know whether to read it as big-endian or little-endian. Of course, the UTF-8 BOM has nothing to do with byte order, it's just there to let you know that the encoding is UTF-8, in case you didn't already know that.

Alan Moore