Hi,

Does anyone have an idea what an awk script (presumably a one-liner) for removing a BOM would look like?

Specification:

  • print every line after the first (NR > 1)
  • for the first line: if it starts with #FE #FF or #FF #FE (the UTF-16 BOMs), remove those bytes and print the rest
+7  A: 

Found here:

http://unix.derkeiler.com/pdf/Newsgroups/comp.unix.shell/2008-10/msg00031.pdf

awk '{if(NR==1)sub(/^\xef\xbb\xbf/, "");print}'

Enjoy!

Bartosz
It seems that the dot in the middle of the sub statement is superfluous (at least, my awk complains about it). Apart from that, it's exactly what I was looking for, thanks!
Boldewyn
This solution, however, works **only** for UTF-8 encoded files. For others, like UTF-16, see Wikipedia for the corresponding BOM representation: http://en.wikipedia.org/wiki/Byte_order_mark
Boldewyn
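To illustrate the point about other encodings, here is a minimal sketch that reports which BOM a file starts with. The function name `bom_type` is made up for this example, and it only knows the UTF-8 and UTF-16 signatures (UTF-32 is deliberately ignored, so a UTF-32LE file would be reported as utf-16le).

```shell
# bom_type FILE: hypothetical helper, prints utf-8, utf-16be, utf-16le or none.
bom_type() {
    # First three bytes as lowercase hex without separators, e.g. "efbbbf".
    head3=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case $head3 in
        efbbbf) echo utf-8 ;;      # EF BB BF
        feff*)  echo utf-16be ;;   # FE FF
        fffe*)  echo utf-16le ;;   # FF FE
        *)      echo none ;;
    esac
}
```

Usage would be something like `bom_type INFILE`, so you can decide which bytes to strip before running the awk line.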
I agree with the earlier comment; the dot does not belong in the middle of this statement and makes this otherwise great little snippet an example of an awk syntax error.
Brandon Craig Rhodes
So: `awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' INFILE > OUTFILE` and make sure INFILE and OUTFILE are different!
mrclay
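If you do want the result to end up in the same file, one way around the INFILE/OUTFILE problem is a temporary file. This is only a sketch: `strip_bom_inplace` is a name invented here, `mktemp` is assumed to be available, and I swapped the `\x` escapes for octal ones under `LC_ALL=C` on the assumption that not every awk accepts `\x` in a regex.

```shell
# strip_bom_inplace FILE: hypothetical helper, not part of the answer above.
strip_bom_inplace() {
    file=$1
    tmp=$(mktemp) || return 1
    # Remove a leading UTF-8 BOM (EF BB BF, written in octal) from line 1 only.
    if LC_ALL=C awk 'NR==1{sub(/^\357\273\277/,"")} {print}' "$file" > "$tmp"; then
        # Copy back with cat rather than mv, so the original file keeps
        # its permissions and inode.
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}
```

Called as `strip_bom_inplace INFILE`, it leaves files without a BOM unchanged.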
+2  A: 

Not awk, but simpler:

tail -c +4 UTF8 > UTF8.nobom

To check for BOM:

hd -n 3 UTF8

If a BOM is present, you'll see: 00000000 ef bb bf ...
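Note that `tail -c +4` drops the first three bytes whether or not they are a BOM. A rough sketch that combines the check and the strip could look like this; the function names are made up for illustration, and it assumes text files (no NUL bytes at the start) and a `head` that supports `-c`.

```shell
# has_utf8_bom FILE: succeeds if FILE starts with the bytes EF BB BF.
has_utf8_bom() {
    [ "$(head -c 3 "$1")" = "$(printf '\357\273\277')" ]
}

# strip_utf8_bom FILE: print FILE to stdout, skipping a leading BOM if present.
strip_utf8_bom() {
    if has_utf8_bom "$1"; then
        tail -c +4 "$1"   # start output at byte 4, i.e. after the BOM
    else
        cat "$1"          # no BOM, pass the file through untouched
    fi
}
```

Then `strip_utf8_bom UTF8 > UTF8.nobom` is safe to run on files that have no BOM.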

mrclay
The tail trick is cool. Thanks!
Boldewyn
+2  A: 

Using sed:

# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt

Advantage of using GNU sed: the -i parameter means "in place", and will update the files without the need for redirections or weird tricks.
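As far as I know, both -i and the \xHH escapes are GNU extensions, so for other sed implementations something along these lines may be needed. It is a sketch only: `strip_bom_stdout` is a hypothetical name, the BOM bytes are produced with `printf` octal escapes, and the result goes to stdout instead of being written in place.

```shell
# The three UTF-8 BOM bytes (EF BB BF), built with POSIX printf octal escapes.
bom=$(printf '\357\273\277')

# strip_bom_stdout FILE: hypothetical helper, prints FILE without a leading BOM.
strip_bom_stdout() {
    # LC_ALL=C makes sed treat the pattern as plain bytes.
    LC_ALL=C sed "1s/^$bom//" "$1"
}
```

For example: `strip_bom_stdout notes.txt > notes.nobom.txt` (the file names are placeholders).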

Denilson Sá
That's nice, too. Thanks!
Boldewyn