
I am frustrated that grep fails to find a word like "hello" in my UTF-16 documents.

Can anyone recommend a version of grep that attempts to guess the file encoding and then properly handle it?

Thanks!

+1  A: 

Perl has a much more powerful regex syntax than grep, and it supports UTF-8 and UTF-16. I'm not sure how good it is at guessing the encoding, but if you tell it which encoding to use, it can read these files without any issues and run regexes over them. You'll have to write yourself a tiny Perl program for that (your own micro-grep implementation in Perl, so to speak), but that isn't too hard. Perl exists for all major operating systems.
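
A minimal sketch of such a micro-grep might look like this (the script name is made up, and it assumes your files are UTF-16 with a byte-order mark; adjust the encoding layer if yours differ):

 #!/usr/bin/env perl
 # minigrep.pl -- hypothetical name for a tiny grep-like script for UTF-16 files
 use strict;
 use warnings;
 
 die "usage: $0 PATTERN FILE...\n" if @ARGV < 2;
 my ($pattern, @files) = @ARGV;
 my $re = qr/$pattern/;
 
 # Print matches as UTF-8 so wide characters don't trigger warnings on output.
 binmode STDOUT, ':encoding(UTF-8)';
 
 for my $file (@files) {
     # Assumes the file starts with a BOM; the UTF-16 layer detects the byte
     # order from it. Use :encoding(UTF-16LE) or :encoding(UTF-16BE) explicitly
     # if your files have no BOM.
     open my $fh, '<:encoding(UTF-16)', $file
         or do { warn "$0: cannot open $file: $!\n"; next };
     while (my $line = <$fh>) {
         print "$file:$line" if $line =~ $re;
     }
     close $fh;
 }

You'd run it as, say, perl minigrep.pl hello *.txt (all of these names are just illustrative).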

Mecki
There are even a few examples of very basic grep replacements written in Perl throughout the Perldoc website. I believe they're generally about 5 or 6 lines, though they'd be longer if you wanted to add any sort of sophisticated command-line parsing.
Chris Lutz
+1  A: 

ack as a Perl-based grep replacement?

You'll definitely want to check out ack.

It supports Unicode encodings, and is basically grep, but better.

try a matching Unicode locale with grep

If you are on Linux, Unix, etc., you may want to change your LANG environment variable to an encoding that matches your documents.

Check your locale first. Here is what mine is set to by default on my MacBook Pro:

 $ locale 
 LANG="en_US.UTF-8"
 LC_COLLATE="en_US.UTF-8"
 LC_CTYPE="en_US.UTF-8"
 LC_MESSAGES="en_US.UTF-8"
 LC_MONETARY="en_US.UTF-8"
 LC_NUMERIC="en_US.UTF-8"
 LC_TIME="en_US.UTF-8" 
 LC_ALL=

For example, under bash:

$ LANG="foo" grep 'gotta be found now' file.name

Or something a little more permanent (be careful with this):

$ export LANG="foo"
$ grep 'bar' mitz.vah
popcnt