ansaurus

Question

Answer 1

+2 A:

The easiest way is to just convert the text file to utf-8 and pipe that to grep:

iconv -f utf-16 -t utf-8 file.txt | grep query

I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.

It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:

grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt

If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii.
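One possible explanation, offered as an assumption rather than something verified in this answer: in bash and similar shells, command substitution (backticks or `$(...)`) drops NUL bytes, so a UTF-16 query made of ASCII characters reaches grep as plain ASCII. A minimal sketch that compares byte counts before and after the substitution (the variable names are made up for illustration, and utf-16le is used so there is no BOM to strip):

```shell
# Bytes produced by iconv for "query" in UTF-16LE: 5 characters x 2 bytes each.
raw_len=$(printf 'query' | iconv -f utf-8 -t utf-16le | wc -c | tr -d ' ')

# The same bytes after a round trip through command substitution.
# The shell discards the NUL bytes (bash may print a warning to stderr).
captured=$(printf 'query' | iconv -f utf-8 -t utf-16le)
captured_len=$(printf '%s' "$captured" | wc -c | tr -d ' ')

echo "from iconv: $raw_len bytes, after substitution: $captured_len bytes"
```

If the two counts differ, the query was altered by the shell before grep ever saw it.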

EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:

hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`

How does it work? Well, it converts your file to hex (without any of the extra formatting that hexdump usually applies) and pipes that into grep. Grep uses a query that is constructed by echoing your query (without a newline) into iconv, which converts it to utf-16. This is then piped into sed to remove the BOM (the first two bytes of a utf-16 file, used to determine endianness). Finally, the result is piped into hexdump so that the query and the input are both in the same hex form.

Unfortunately I think this will end up printing out the ENTIRE file if there is a single match. Also this won't work if the utf-16 in your binary file is stored in a different endianness than your machine.
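A rough sketch of the same hex-comparison idea that only reports whether a match exists instead of printing the whole file. The function names `to_hex`, `utf16le_hex` and `hex_contains` are hypothetical, `od` stands in for `hexdump`, and the endianness is fixed to utf-16le rather than left to the machine default:

```shell
# to_hex: hex-encode stdin as one unbroken lowercase string.
# Uses od instead of hexdump (an assumption made for portability).
to_hex() {
  od -An -v -tx1 | tr -d ' \n'
}

# utf16le_hex STRING: hex form of STRING re-encoded as UTF-16LE.
# Naming utf-16le explicitly means iconv writes no BOM, so no sed step is needed.
utf16le_hex() {
  printf '%s' "$1" | iconv -f utf-8 -t utf-16le | to_hex
}

# hex_contains FILE STRING: succeed if FILE contains STRING encoded as UTF-16LE.
# grep -q prints nothing; -F treats the hex query as a fixed string.
hex_contains() {
  to_hex < "$1" | grep -qF "$(utf16le_hex "$2")"
}
```

For example, `hex_contains test.txt Test && echo found`. Because each byte becomes two hex digits, a match in the hex stream can in principle start in the middle of a byte, so treat a hit as a hint rather than proof.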

EDIT2: Got it!!!!

grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt

This searches for the hex version of the string Test (in utf-16) in the file test.txt
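The same approach can be sketched with the steps split out, assuming GNU grep built with PCRE support. The function names `utf16le_pattern` and `grep_utf16le` are hypothetical, `od` replaces `hexdump`, and utf-16le is named explicitly so there is no BOM to remove:

```shell
# utf16le_pattern STRING: print STRING as a \xHH byte pattern in UTF-16LE.
utf16le_pattern() {
  printf '%s' "$1" |
    iconv -f utf-8 -t utf-16le |   # explicit endianness, no BOM to strip
    od -An -v -tx1 |               # one two-digit hex value per byte
    tr -d ' \n' |                  # join into a single hex string
    sed 's/../\\x&/g'              # prefix every byte with \x
}

# grep_utf16le STRING FILE: search FILE for STRING encoded as UTF-16LE.
# -a treats binary input as text, -P enables \xHH escapes (GNU grep with PCRE);
# LC_ALL=C is meant to keep grep matching raw bytes rather than UTF-8 characters.
grep_utf16le() {
  LC_ALL=C grep -aP "$(utf16le_pattern "$1")" "$2"
}
```

Usage would look like `grep_utf16le Test test.txt`. Since every byte is written as a `\xHH` escape, characters such as `.` in the query are matched literally, and because the pattern is built inside a function and passed through `$(...)`, the sed expression needs only two backslashes.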

Niki Yoshiuchi 2010-09-23 18:01:12

`iconv` won't work, as it's a binary file with a lot of non-utf-16 data, and `iconv` exits on the first error.

taw 2010-09-24 13:27:40

Ouch...I'm still looking into giving grep a utf-16 query out of curiosity (I don't think it's converting because it doesn't really know the encoding, it's gotta be doing something else weird) and I'll let you know if I come up with something.

Niki Yoshiuchi 2010-09-24 14:23:09

Check out my edit. Got something that works.

Niki Yoshiuchi 2010-09-24 15:58:57

It seems to be working after a minor modification: `pcregrep \`echo -n "test" | iconv -f utf-8 -t utf-16le | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'\` <binary.file`. Most importantly, it doesn't require utf-16 characters to be on a 2-byte boundary, something all previous methods had big problems with. It even works with `-i`.

taw 2010-09-27 07:39:50

Awesome! I discovered that the problem I was having was with the backticks. For some reason they return utf-8 strings, and escape the backslashes. This is why sed has four '\'s.

Niki Yoshiuchi 2010-09-27 14:11:27

grepping binary files and UTF16
