I want to do this:

 findstr /s /c:some-symbol *

or the grep equivalent

 grep -R some-symbol *

but I need the utility to autodetect files encoded in UTF-16 (and friends) and search them appropriately. My files even start with the byte order mark (FF FE), so I'm not even looking for heroic autodetection.

Any suggestions?

Thanks, David

A: 

According to this blog article by Damon Cortesi, grep doesn't work with UTF-16 files, as you found out. However, the article presents this workaround:

# GREP_FOR holds the pattern to search for, e.g. GREP_FOR=some-symbol
find . -type f -exec file {} + | grep UTF-16 | cut -f1 -d: | while IFS= read -r f
        do iconv -f UTF-16 -t UTF-8 "$f" | grep -iH --label="$f" -e "$GREP_FOR"
done

This is obviously for Unix; I'm not sure what the equivalent on Windows would be. The author of that article also provides a shell script that does the above, which you can find on GitHub here.

This only greps files that are UTF-16. You'd also grep your ASCII files the normal way.
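To cover both kinds of file in one pass, the same idea can be sketched in Python, which also runs on Windows. This is only a rough illustration, not the script from the article: `sniff_encoding` and `grep_tree` are names made up for this example, files without a BOM are assumed to be UTF-8 (which also covers ASCII), and each file is read into memory whole.

```python
import codecs
import os
import re

# Check the UTF-32 BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the order of this list matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(head, default="utf-8"):
    """Return a codec name based on the BOM at the start of `head`."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default  # assumption: no BOM means UTF-8/ASCII


def grep_tree(root, pattern):
    """Yield (path, line_number, line) for every line matching `pattern`."""
    regex = re.compile(pattern)
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            path = os.path.join(dirpath, fname)
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # skip files that cannot be read
            # The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM.
            text = data.decode(sniff_encoding(data[:4]), errors="replace")
            for lineno, line in enumerate(text.splitlines(), 1):
                if regex.search(line):
                    yield path, lineno, line
```

Something like `for hit in grep_tree(".", "some-symbol"): print(*hit, sep=":")` would then print grep-style results for the whole tree.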

Mark A. Nicolosi
A: 

You didn't say which platform you want to do this on.

On Windows, you could use PowerGREP, which automatically detects Unicode files that start with a byte order mark. (There's also an option to auto-detect files without a BOM. The auto-detection is very reliable for UTF-8, but limited for UTF-16.)
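To give a feel for why detecting UTF-16 without a BOM is limited, here is a minimal Python sketch of one common heuristic. It is not PowerGREP's actual method, just an illustration; the function name and both thresholds are made up for this example. It relies on the fact that ASCII-range characters in UTF-16 carry a zero byte, at odd offsets in little-endian and at even offsets in big-endian.

```python
def guess_utf16_without_bom(sample, threshold=0.4):
    """Guess 'utf-16-le', 'utf-16-be', or None from where NUL bytes fall.

    Assumes mostly ASCII-range text. Text made up of other characters
    (CJK, for example) has few zero bytes and will not be recognized.
    """
    if len(sample) < 2:
        return None
    even = sample[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = sample[1::2]   # bytes at offsets 1, 3, 5, ...
    even_nuls = even.count(b"\x00") / len(even)
    odd_nuls = odd.count(b"\x00") / len(odd)
    if odd_nuls >= threshold and even_nuls < 0.1:
        return "utf-16-le"
    if even_nuls >= threshold and odd_nuls < 0.1:
        return "utf-16-be"
    return None
```

With this heuristic, `"hello world".encode("utf-16-le")` is recognized, while the same encoding of Japanese text is not, because none of its bytes are zero.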

Jan Goyvaerts
+3  A: 

Thanks for the suggestions. I was referring to Windows Vista and XP.

I also discovered this workaround, using free Sysinternals strings.exe:

C:\> strings -s -b dir_tree_to_search | grep regexp

Strings.exe extracts all of the strings it finds (from binaries, but it works fine with text files too) and prepends each result with a filename and colon, so take that into account in the regexp (or use cut or another step in the pipeline). The -s makes it do a recursive extraction and -b just suppresses the banner message.
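One way to account for the filename prefix is to split it off before matching, so that a pattern which also appears in a file's name does not produce false hits. A small Python sketch, assuming each output line has the form `filename: text` (the exact separator is an assumption, and `filter_strings_output` is a name made up for this example):

```python
import re


def filter_strings_output(lines, pattern, sep=": "):
    """Yield (filename, text) pairs whose text part matches `pattern`.

    Assumes each line looks like '<filename><sep><text>'. Lines without
    the separator are skipped.
    """
    regex = re.compile(pattern)
    for line in lines:
        # partition() splits at the first separator only; the "C:\" of a
        # drive letter is not followed by a space, so it is left alone.
        filename, found, text = line.rstrip("\r\n").partition(sep)
        if found and regex.search(text):
            yield filename, text
```

Fed the lines from `strings -s -b dir_tree_to_search` (via `sys.stdin`, for instance), this only reports files whose contents match, not files whose names happen to match.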

Ultimately, I'm still kind of surprised that the flagship searching utilities GNU grep and findstr don't handle Unicode character encodings natively.

David Martin
In their home Unix environments, UTF-16 is much less common, and files are generally in UTF-8, which they handle just fine.
bdonlan
A: 

type and find both handle Unicode just fine.

Greg Stigers