On Linux I have a directory with lots of files. Some of them have non-ASCII characters in their names, but they are all valid UTF-8. One programme has a bug that prevents it from working with non-ASCII filenames, and I have to find out how many files are affected. I was going to do this with find, then grep to pick out the names containing non-ASCII characters, and then wc -l to count them. It doesn't have to be grep; I can use any standard Unix regex tool, like perl, sed, awk, etc.

However, I'm not sure if there is a regex for 'any character that's not an ASCII character'. Is there?

+1  A: 

This will match a single non-ASCII char:

[^\x00-\x7F]

This is valid PCRE (Perl-Compatible Regular Expression).
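
If a scripting language is an option, the same character class can be used to do the counting directly. Below is a minimal sketch in Python, assuming the filenames decode as UTF-8 (as the question states). The function names `has_non_ascii`, `count_non_ascii` and `count_in_tree` are made up for this illustration.

```python
import os
import re

# Same character class as above: any character outside 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def has_non_ascii(name):
    """Return True if the name contains at least one non-ASCII character."""
    return NON_ASCII.search(name) is not None


def count_non_ascii(names):
    """Count how many of the given names contain a non-ASCII character."""
    return sum(1 for name in names if has_non_ascii(name))


def count_in_tree(root):
    """Walk root recursively and count files with non-ASCII names."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += count_non_ascii(filenames)
    return total
```

Calling `count_in_tree("/path/to/dir")` should then give the number of affected files. Only the filename is checked here, not the directory part of the path.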


EDIT: [^[:print:]] will probably suffice for you.

Alix Axel
Don't you mean [~\x20-\x7f]?
adrianm
@adrianm: No, `^` is valid in PCRE; at the start of a character class it negates the class.
Alix Axel
That's exactly right. However, you have to use pcregrep (or GNU grep with -P, if it was built with PCRE support), not standard grep in its default mode. [^[:print:]] won't work if your terminal is set up for UTF-8.
Rory
+1  A: 

You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:

\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
Rubens Farias
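
To see what a class like \p{Control} would and would not catch, here is a small Python sketch. Python's built-in `re` module does not support `\p{...}`, so this uses `unicodedata.category(ch) == "Cc"` as a stand-in for the control-character class; that substitution and the helper names `is_control` and `has_non_ascii` are assumptions made for this illustration.

```python
import unicodedata


def is_control(ch):
    """True if ch is in Unicode general category Cc (control characters)."""
    return unicodedata.category(ch) == "Cc"


def has_non_ascii(name):
    """True if any character in name lies outside the ASCII range."""
    return any(ord(ch) > 0x7F for ch in name)


name = "café.txt"
# An accented letter is non-ASCII but is not a control character,
# so a control-character class alone would miss this filename.
controls = [ch for ch in name if is_control(ch)]
non_ascii = has_non_ascii(name)
```

Here `controls` ends up empty while `non_ascii` is true, so for the question as asked the negated ASCII range looks like the better fit.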