Hi,

I need a way to detect whether a file contains characters from a certain charset.

Specifically, I want to detect the presence of UTF-8-encoded Cyrillic characters in a series of files. Is there a tool to do this?

Thanks

+2  A: 

IIRC, the ICU library has code that does character set detection, though it's basically a best-effort guess.

Edit: I did remember correctly; check out this paper / tutorial

Glen
Thanks, the tutorial is helpful. Bookmarking it for future reference.
dasp
+2  A: 

If you are looking for a ready-made solution, you might want to try Enca.

However, if you only want to detect the presence of byte sequences that could be decoded as UTF-8 Cyrillic characters (without a complete UTF-8 validity check), you just have to grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/ (this exact regexp matches n or more consecutive UTF-8-encoded Russian Cyrillic characters, i.e. U+0410 to U+044F plus Ё and ё; replace n with a number). To additionally check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1).
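If you'd rather do it in a script than with grep, here is a rough Python sketch of the same idea. It is only an illustration: the function names are made up, n = 3 is an arbitrary threshold, and the decode-based validity check is a stand-in for isutf8(1), not a copy of how that tool works.

```python
import re

# Same byte pattern as the regexp above, with n = 3 (arbitrary, assumed threshold).
# \xD0[\x81\x90-\xBF] covers Ё and А..п; \xD1[\x80-\x8F\x91] covers р..я and ё.
CYRILLIC_RUN = re.compile(rb"(?:\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){3,}")


def has_cyrillic_run(data: bytes) -> bool:
    """True if data contains at least 3 consecutive Russian Cyrillic UTF-8 sequences."""
    return CYRILLIC_RUN.search(data) is not None


def is_valid_utf8(data: bytes) -> bool:
    """True if the whole byte string decodes as UTF-8 (rough stand-in for isutf8)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_has_utf8_cyrillic(path: str) -> bool:
    """Hypothetical helper: read a file as raw bytes and apply both checks."""
    with open(path, "rb") as f:
        data = f.read()
    return is_valid_utf8(data) and has_cyrillic_run(data)
```

Call file_has_utf8_cyrillic() on each file in your series. Reading the whole file into memory is fine for small files; for big ones you would want to scan in chunks and take care of sequences split across chunk boundaries.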

Both methods have their strengths and weaknesses, and either may sometimes give wrong results.

drdaeman
Grepping for the specified regex solved my problem. Thanks!
dasp