In a recent question it was noted that on OS X, running sed on a non-ASCII file gives strange results. For instance, if you run the following (/usr/bin/cal is just an arbitrary binary file)

sed 's/[^A-Z]//g' /usr/bin/cal

sed removes all of the printable characters other than A-Z, but many nonprintable characters remain. If, however, you run

LANG='' sed 's/[^A-Z]//g' /usr/bin/cal

only A-Z (and newlines) are output. Why?

Normally LANG is en_US.UTF-8. What is going on? I cannot see any way that the output of sed could be considered correct in UTF-8. Is it broken, or is there some notion of "working" that I do not understand?

I know that the OS X sed conforms to POSIX, and is therefore different from the beloved GNU sed.

+3  A: 

Binary data, such as the contents of /usr/bin/cal, is not UTF-8, and so will confuse any code that reads it as if it were. In particular, any byte with the high bit set (i.e., a value >= 128) will be interpreted as part of a multi-byte sequence representing a single character; that whole character matches [^A-Z] and is elided from the output. Not every sequence of high-bit bytes is valid UTF-8, however, and bytes that do not decode to a character are presumably left untouched, which probably explains why some non-printable characters remain but (possibly) not others.
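To see the byte-level behaviour in isolation, here is a minimal sketch. It assumes that the empty LANG in the question ends up selecting the byte-oriented C locale (typically the case when LC_ALL and LC_CTYPE are unset) and makes that choice explicit with LC_ALL=C. The function name keep_upper_bytes is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper (not from the original post): keep only A-Z,
# treating the input as raw bytes rather than as UTF-8 text.
keep_upper_bytes() {
    # LC_ALL=C selects the byte-oriented "C" locale and overrides
    # both LANG and LC_CTYPE, so every byte counts as one character.
    LC_ALL=C sed 's/[^A-Z]//g'
}

# Input bytes: 'A', the two bytes of UTF-8 "é" (0xC3 0xA9),
# a lone 0xFF (never valid in UTF-8), then 'B' and a newline.
printf 'A\303\251\377B\n' | keep_upper_bytes
# prints: AB
```

In the C locale there are no multi-byte characters, so [^A-Z] matches each high-bit byte individually, whether or not the bytes would form valid UTF-8. What sed does with the lone 0xFF under a UTF-8 locale depends on the implementation, so no output is claimed for that case.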

In short: if you want to use text-oriented tools on binary data, don't.

Marcelo Cantos