ansaurus

Question

Why is my Bash script adding <feff> to the beginning of files?

Answer 1

+6 A:

U+FEFF is the code point for a byte order mark (BOM). Your files most likely contain data saved in UTF-16, and the BOM has been corrupted by your 'cleaning process', which is most likely expecting ASCII. Rather than removing the BOM, it's probably better to fix your scripts so they don't corrupt it in the first place.

Mark Byers 2009-12-29 00:54:38
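One way to check whether a file really starts with a byte order mark is to look at its first few raw bytes. The following is only a minimal sketch: the helper names `detect_bom` and `detect_file_bom` are made up for illustration, and it only recognizes the BOM signatures that Python's standard `codecs` module defines.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def detect_file_bom(path):
    """Hypothetical convenience wrapper: inspect the first 4 bytes of a file."""
    with open(path, "rb") as fh:
        return detect_bom(fh.read(4))
```

Running `detect_file_bom` on the original CSV before any cleaning would show whether the BOM was already there.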

This is what I thought too, but he clearly states in the question that the BOM is not in the original file.

ithcy 2009-12-29 01:03:09

A BOM is invisible. My best guess, given the information in the question, is that the clean.sed script changes unprintable characters to their hex representation, and possibly also removes NUL characters. So the BOM may have been there all along; it just becomes more visible after the "cleaning".

Mark Byers 2009-12-29 01:07:03
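To illustrate why a BOM can sit in a file unnoticed, here is a small sketch in Python. The sample string `a,b` is made up; the point is only that the BOM is an ordinary (invisible) character unless the decoder is told to consume it, and that UTF-16 text is full of NUL bytes that a byte-oriented tool would not expect.

```python
import codecs

# A tiny UTF-16 (little-endian) payload with an explicit BOM in front.
raw = codecs.BOM_UTF16_LE + "a,b".encode("utf-16-le")

# "utf-16-le" does not treat the BOM specially: it stays in the text as U+FEFF.
text_keep = raw.decode("utf-16-le")

# "utf-16" uses the BOM to pick the byte order and then drops it.
text_strip = raw.decode("utf-16")

print(raw)               # b'\xff\xfea\x00,\x00b\x00'
print(repr(text_keep))   # '\ufeffa,b'
print(repr(text_strip))  # 'a,b'
```

Printing `text_keep` without `repr` would typically show just `a,b`, which is why the mark is easy to miss until some tool renders it as `<feff>`.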

Here is clean.sed:

    s/\",\"/XXX/g;
    :a
    s/,//g
    ta
    s/XXX/\",\"/g;

SDGuero 2009-12-29 01:12:15

I'm sure you're right, it's the only answer that makes sense. I'm just taking him at his word... (BOM is easily visible with cat utf16_file.txt)

ithcy 2009-12-29 01:16:53

Shouldn't vi display the BOM from the get-go? If it is there, vi cannot see it in the original file but can see it after the sed edits. I posted clean.sed... Please let me know if this is the root cause. Thanks! :)

SDGuero 2009-12-29 01:22:33

No, vi knows how to handle Unicode and will not display the BOM. Do a :set fenc in vi and it will show you the encoding of the current file. Mark Byers is correct: you are probably seeing a mangled BOM after your sed because sed is outputting ASCII.

ithcy 2009-12-29 01:28:18

...So to summarize, your CSV file is UTF-16 encoded, and sed is probably not going to work for you unless you have the option to convert the file to ASCII first (try man iconv). If you can't do that, use something like a simple Python script to do the text replacements.

ithcy 2009-12-29 01:37:13
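A rough sketch of what such a Python script might look like, mirroring the three steps of the clean.sed posted above (protect `","`, drop every remaining comma, restore `","`). The function names, the placeholder character, and the choice of UTF-8 for the output are all assumptions for illustration, not something stated in the thread.

```python
# Assumed placeholder: a private-use character unlikely to occur in real data
# (the sed script used the literal text XXX for the same purpose).
PLACEHOLDER = "\ue000"


def clean_text(text: str) -> str:
    """Remove commas inside fields while keeping the "," field separators."""
    text = text.replace('","', PLACEHOLDER)   # protect the separators
    text = text.replace(",", "")              # drop every other comma
    return text.replace(PLACEHOLDER, '","')   # restore the separators


def clean_file(src, dst, src_encoding="utf-16", dst_encoding="utf-8"):
    """Decode src (the BOM is consumed by the utf-16 codec), clean, write dst."""
    # newline="" leaves line endings exactly as they are in the source file.
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as fout:
        fout.write(clean_text(text))
```

For example, `clean_text('"a,b","c,d"')` gives `"ab","cd"`. Because the file is decoded before any replacement happens, the BOM never reaches the replacement step as stray bytes.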
