ansaurus

Question

Answer 1

A:

You can match the bytes by their hex escapes, \xff and \xfe, and replace them with nothing.

schoetbi 2010-08-08 17:53:36

Answer 2

+1 A:

sed 's/[^ -~]//g'

or as the other answer implies

sed 's/[\x80-\xff]//g'

See section 3.9 of the sed info pages, the chapter titled "Escapes".

Edit: on OS X, the native LANG setting is en_US.UTF-8.

try

LANG='' sed 's/[^ -~]//g' myfile

This works on an OS X machine here; I'm not entirely sure why it does not work under the UTF-8 locale.
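A minimal sketch of the same idea, assuming a made-up sample consisting of the bytes FF FE followed by `5,6`. It uses `LC_ALL=C` rather than `LANG=''` because `LC_ALL` takes precedence over both `LC_CTYPE` and `LANG`, so it still forces byte-wise matching when one of those is set in the environment:

```sh
# Hypothetical sample data: FF FE, then "5,6" and a newline.
# Octal escapes (\377 = 0xFF, \376 = 0xFE) are used because POSIX printf
# does not guarantee \x escapes.
cleaned=$(printf '\377\3765,6\n' | LC_ALL=C sed 's/[^ -~]//g')

# Show what survived the substitution.
printf '%s\n' "$cleaned"   # prints: 5,6
```

In the C locale the range ` -~` covers exactly the printable ASCII bytes 0x20 to 0x7e, so everything else, including FF and FE, is deleted.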

deinst 2010-08-08 17:54:53

Thanks - but this doesn't seem to work for me. When I run this on the test file, the only change is a line feed (0x0A) appended to the end of the file.

Greg Harman 2010-08-08 18:00:43

The last comment was in regard to the first approach. The second one strips off the first legit character (5) but leaves the FF and FE bytes. It doesn't make sense to me why...

Greg Harman 2010-08-08 18:03:21

Oh. Are you outputting the result of sed to a new file, i.e. `sed 's/[^ -~]//g' test.csv > test1.csv`? sed itself does not change the file; it writes the changed version to stdout.

deinst 2010-08-08 18:05:03

Yes, I'm just doing it in-line for purposes of posting here.

Greg Harman 2010-08-08 18:11:54

@Greg Which version of OS X, and have you replaced the original sed?

deinst 2010-08-08 18:38:33

This is v10.6.4, and is the original sed AFAIK

Greg Harman 2010-08-08 22:51:05

See my update: the problem is that LANG=en_US.UTF-8 (assuming, perhaps wrongly, that you're a USian). I have no idea why that screws things up.

deinst 2010-08-08 22:57:28

Bingo! (and yes, I am US)

Greg Harman 2010-08-08 23:14:04

I'm going to ask a question as to why it screws up.

deinst 2010-08-08 23:28:18

@deinst it screws up (at least as I understand it) because the FF FE isn't treated as part of the content of the file, but as formatting metadata -- and hence the editing rules don't get applied to it. Similarly, if you did `sed 's/.//g' | xxd` you'll get `fffe 0a0a` because the 0A (linefeeds) aren't part of the lines, they're line terminators, and hence don't have the "delete everything" rule applied.

Gordon Davisson 2010-08-09 23:10:02

@Gordon Thanks, I am beginning to understand the subtleties of UTF-8. Give me back the days when men were men and everything was ascii.

deinst 2010-08-10 00:07:11

Answer 3

+1 A:

The FF and FE bytes at the beginning of your file are what is called a "byte order mark" (BOM). It can appear at the start of Unicode text streams to indicate the endianness of the text. FF FE indicates little-endian UTF-16.

Here's an excerpt from the FAQ:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

2. Some protocols allow optional BOMs in the case of untagged text. In those cases,

   a. Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.

   b. Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
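Returning to the FF FE bytes described at the top of this answer, a rough sketch of how the first two bytes of a stream could be checked. The helper name `detect_bom` and the sample data are hypothetical, and the check is deliberately narrow: it only looks at two bytes, so it cannot tell a UTF-16LE BOM from a UTF-32LE one (FF FE 00 00).

```sh
# Hypothetical helper: report the encoding suggested by the first two bytes on stdin.
detect_bom() {
  # od prints the bytes as hex; tr removes the spaces and the newline.
  first_two=$(head -c 2 | od -An -tx1 | tr -d ' \n')
  case "$first_two" in
    fffe) echo "UTF-16LE" ;;
    feff) echo "UTF-16BE" ;;
    *)    echo "no UTF-16 BOM" ;;
  esac
}

# Made-up sample: FF FE followed by some text (octal escapes for portability).
encoding=$(printf '\377\3765,6\n' | detect_bom)
echo "$encoding"
```

To test a real file, redirect it into the function, for example `detect_bom < test.csv`.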

References

unicode.org/FAQ/UTF BOM

polygenelubricants 2010-08-08 18:57:43

Answer 4

+1 A:

This will strip out every occurrence of the byte sequence FF FE:

sed -e 's/\xff\xfe//g' hexquestion.txt

The reason your negated regexes aren't working is that [] specifies a character class, and sed is assuming a particular character set, probably ASCII. The FF and FE bytes in your file aren't 7-bit ASCII characters (both have the high bit set), so sed doesn't know how to deal with them. The solution above doesn't use character classes, so it should be more portable between platforms and character sets.
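The `\xff` escape inside a sed expression is an extension that not every sed supports. One possible workaround, sketched here under the assumption that the shell passes raw bytes through unchanged, is to let printf produce the two bytes and interpolate them into the expression. The variable names and the sample data are made up for illustration:

```sh
# Build the two literal bytes FF FE with octal escapes, which POSIX printf supports.
bom=$(printf '\377\376')

# Hypothetical sample: FF FE followed by "5,6". LC_ALL=C keeps sed matching byte by byte.
stripped=$(printf '\377\3765,6\n' | LC_ALL=C sed "s/$bom//g")

printf '%s\n' "$stripped"   # prints: 5,6
```

Neither byte has a special meaning in a regular expression, so no escaping is needed when they are inserted into the pattern.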

Gary 2010-08-08 20:05:45

Thanks for this - I didn't know that about the []. Unfortunately, it doesn't seem to solve my particular problem.

Greg Harman 2010-08-08 22:50:17

I re-read your question and updated my answer to catch all occurrences of this pattern. Also, it turns out that this solution works for me on cygwin and Redhat linux 4.8 but fails on an older Redhat system and Solaris 9. Older versions of sed might not be able to deal with non-ASCII bytes.

Gary 2010-08-08 23:39:58

Answer 5

A:

On OS X, the Byte Order Mark is probably being read as a single word. Try either sed 's/^\xfffe//g' or sed 's/^\xfeff//g' depending on endianness.

drewk 2010-08-08 23:07:45

Nope... good idea though!

Greg Harman 2010-08-08 23:10:12

Answer 6

A:

To show that this isn't an issue of the Unicode BOM, but an issue of eight-bit versus seven-bit characters and tied to the locale, try this:

Show all the bytes:

$ printf '123 abc\xff\xfe\x7f\x80' | hexdump -C
00000000  31 32 33 20 61 62 63 ff  fe 7f 80                 |123 abc....|

Have sed remove characters that aren't alpha-numeric in the user's locale. Notice that the space and 0x7f are removed:

$ printf '123 abc\xff\xfe\x7f\x80'|sed 's/[^[:alnum:]]//g' | hexdump -C
00000000  31 32 33 61 62 63 ff fe  80                       |123abc...|

Have sed remove characters that aren't alpha-numeric in the C locale. Notice that only "123abc" remains:

$ printf '123 abc\xff\xfe\x7f\x80'|LANG=C sed 's/[^[:alnum:]]//g' | hexdump -C
00000000  31 32 33 61 62 63                                 |123abc|
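One way to check the result without reading a hex dump is to count the bytes above 0x7f that survive. The sketch below assumes a hypothetical helper, `count_high_bytes`, which is not part of the original answer; it deletes every 7-bit byte with tr and counts whatever is left:

```sh
# Hypothetical helper: count the bytes on stdin whose value is above 0x7f.
count_high_bytes() {
  # Delete every 7-bit byte (octal 000-177), count the remainder,
  # and strip the padding some wc implementations add.
  LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# Same test bytes as above, written with octal escapes for portability.
before=$(printf '123 abc\377\376\177\200' | count_high_bytes)
after=$(printf '123 abc\377\376\177\200' | LC_ALL=C sed 's/[^[:alnum:]]//g' | count_high_bytes)

echo "before=$before after=$after"   # prints: before=3 after=0
```

The three bytes counted before the substitution are FF, FE and 80; 7F is a 7-bit byte, so it is not included.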

Dennis Williamson 2010-08-08 23:27:43

Answer 7

+1 A:

As an alternative you may use ed(1):

printf '%s\n' H $'g/[\xff\xfe]/s///g' ',p' | ed -s test.csv

printf '%s\n' H $'g/[\xff\xfe]/s///g' wq | ed -s test.csv  # in-place edit

bashfu 2010-08-09 12:59:35

Stripping hex bytes with sed - no match
