Greetings,

While I've gotten many answers off this site, this is my first question, and I'm kinda excited about it... :)

I've written a script that cleans up .csv files using sed, removing some bad commas and bad quotes ("bad" meaning they break an in-house program we use to transform these files):

# remove all commas, and re-insert the good commas using clean.sed
sed -f clean.sed "$1" > "$1.1st"

# remove all quotes
sed 's/\"//g' "$1.1st" > "$1.tmp"

# add the good quotes around good commas
sed 's/\,/\"\,\"/g' "$1.tmp" > "$1.tmp1"

# add leading quotes
sed 's/^/\"/' "$1.tmp1" > "$1.tmp2"

# add trailing quotes
sed 's/$/\"/' "$1.tmp2" > "$1.tmp3"

# remove utf characters
sed 's/<feff>//' "$1.tmp3" > "$1.tmp4"

# replace original file with new stripped version and delete .tmp files
cp -rf "$1.tmp4" "quotes_$1"

Here is clean.sed:

s/\",\"/XXX/g;
:a
s/,//g
ta
s/XXX/\",\"/g;

Then it removes the temp files and voilà, we have a new file whose name starts with "quotes_" that we can use for our other processes.
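For reference, the same steps can probably be collapsed into one sed invocation with no temp files. This is only a sketch: the function name clean_csv is made up for illustration, the XXX placeholder is the one from clean.sed, and the <feff> step is left out on purpose because that is the part in question.

```shell
# Sketch: the cleanup steps from the script above as a single sed call.
# clean_csv is a hypothetical name; XXX is the placeholder used in clean.sed.
clean_csv() {
    # 1-3: protect the good "," separators, drop every other comma, restore them
    # 4:   remove all quotes
    # 5:   put quotes back around the remaining (good) commas
    # 6-7: add the leading and trailing quote on each line
    sed -e 's/","/XXX/g' -e 's/,//g' -e 's/XXX/","/g' \
        -e 's/"//g' \
        -e 's/,/","/g' \
        -e 's/^/"/' -e 's/$/"/' \
        "$1"
}
```

It would be called as clean_csv input.csv > quotes_input.csv, where both file names are placeholders. Like the original, it breaks if a field legitimately contains the text XXX.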

My question is:
Why do I have to add a sed statement to remove the <feff> tag from that temp file? The original file doesn't have it, but it always appears in the replacement. At first I thought cp was causing this, but if I put in the sed statement to remove it before the cp, it isn't there.

Maybe I'm just missing something... Any help is appreciated.

Thanks, Ryan

+6  A: 

U+FEFF is the code point for a byte order mark (BOM). Your files most likely contain data saved in UTF-16, and the BOM has been corrupted by your 'cleaning process', which is most likely expecting ASCII. Rather than removing the BOM, it's probably better to fix your scripts so they don't corrupt it in the first place.

Mark Byers
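One way to check whether the original file really starts with a BOM is to look at its first bytes in hex, since an editor may hide them. A minimal sketch, assuming head -c and od are available; the function names first_bytes and detect_bom are invented for this example, and UTF-32 is not handled (a UTF-32LE BOM would be reported as UTF-16LE).

```shell
# Print the first three bytes of a file as one lowercase hex string.
first_bytes() {
    head -c 3 "$1" | od -An -tx1 | tr -d ' \n'
}

# Classify the start of the file by its BOM, if any:
#   ef bb bf -> UTF-8 BOM
#   ff fe    -> UTF-16 little-endian BOM
#   fe ff    -> UTF-16 big-endian BOM
detect_bom() {
    case "$(first_bytes "$1")" in
        efbbbf) echo "UTF-8 BOM" ;;
        fffe*)  echo "UTF-16LE BOM" ;;
        feff*)  echo "UTF-16BE BOM" ;;
        *)      echo "no BOM" ;;
    esac
}
```

Running detect_bom on the untouched .csv and again on each temp file should show whether the mark was there from the start and at which step it stops being the first thing in the file.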
This is what I thought too, but he clearly states in the question that the BOM is not in the original file.
ithcy
A BOM is invisible. My best guess given the information in the question is that the clean.sed script changes unprintable characters to their hex representation, and possibly also removes NUL characters. So the BOM maybe was there all along, it just becomes more visible after the "cleaning".
Mark Byers
here is clean.sed: s/\",\"/XXX/g; :a s/,//g ta s/XXX/\",\"/g;
SDGuero
I'm sure you're right, it's the only answer that makes sense. I'm just taking him at his word... (BOM is easily visible with cat utf16_file.txt)
ithcy
Shouldn't vi display the BOM from the get-go? If it is there, vi cannot see it in the original file but can see it after the sed edits. I posted clean.sed... Please let me know if this is the root cause. Thanks! :)
SDGuero
No, vi knows how to handle Unicode and will not display the BOM. Do a :set fenc in vi and it will show you the encoding of the current file. Mark Byers is correct, you are probably seeing a mangled BOM after your sed because sed is outputting ASCII.
ithcy
...So to summarize, your csv file is UTF-16 encoded, and sed is probably not going to work for you unless you have the option to convert the file to ASCII first (try man iconv). If you can't do that, use something like a simple Python script to do the text replacements.
ithcy
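The iconv suggestion above could look roughly like this. It is a sketch that assumes the file really is UTF-16 with a BOM and that iconv is installed; to_utf8 is a made-up helper name. It converts to UTF-8 rather than strict ASCII, which gives identical bytes for plain ASCII content and does not fail on the occasional non-ASCII character.

```shell
# Convert a UTF-16 file to UTF-8 on stdout so byte-oriented tools like sed
# see one byte per ASCII character. With "-f UTF-16", iconv uses the BOM to
# pick the byte order; without a BOM the assumed order depends on the
# iconv implementation.
to_utf8() {
    iconv -f UTF-16 -t UTF-8 "$1"
}
```

A possible use is to_utf8 input.csv | sed -f clean.sed > cleaned.csv, with both file names being placeholders.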