Detect presence of a specific charset

Hi,

I need a way to detect whether a file contains characters from a certain charset.

Specifically, I want to detect the presence of UTF-8-encoded Cyrillic characters in a series of files. Is there a tool that does this?

Thanks

+2 A:

IIRC the ICU library has code that does character set detection, though it's basically a best-effort guess.

Edit: I did remember correctly; check out this paper / tutorial.

Glen 2009-06-09 11:01:56

Thanks, the tutorial is helpful. Bookmarking it for future reference.

dasp 2009-06-10 13:06:20

+2 A:

If you are looking for a ready-made solution, you might want to try Enca.

However, if you only want to detect the presence of byte sequences that could be decoded as UTF-8 Cyrillic characters (without a full UTF-8 validity check), you just have to grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/. This exact regexp matches n or more consecutive UTF-8-encoded Russian Cyrillic characters: \xD0\x81 is Ё, \xD0\x90-\xD0\xBF and \xD1\x80-\xD1\x8F cover А through я, and \xD1\x91 is ё. To additionally check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1).

Both methods have their strengths and weaknesses, and either may sometimes give wrong results.
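For anyone who would rather script this than grep, here is a rough Python sketch of the same idea. It is only an illustration: the function names (`cyrillic_run_pattern`, `has_russian_cyrillic`, `is_valid_utf8`) are made up for this example, and the validity check simply tries to decode the bytes, which is an assumption about what a tool like isutf8(1) does rather than a description of it. Because the pattern works on raw bytes, it can still give false positives on data that is not really UTF-8.

```python
import re
import sys


def cyrillic_run_pattern(n: int = 1):
    """Compile a byte pattern for n or more consecutive Russian Cyrillic letters."""
    # D0 81 = Ё, D0 90..D0 BF = А..п, D1 80..D1 8F = р..я, D1 91 = ё
    return re.compile(
        rb"(?:\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){%d,}" % n
    )


def has_russian_cyrillic(data: bytes, n: int = 1) -> bool:
    """Return True if data contains a run of at least n matching byte pairs."""
    return cyrillic_run_pattern(n).search(data) is not None


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical usage: python detect_cyrillic.py file1 file2 ...
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            data = fh.read()
        print(path, has_russian_cyrillic(data, 3), is_valid_utf8(data))
```

Requiring a run of several characters (n = 3 above) and checking validity as well should cut down on accidental matches in binary or differently encoded files, at the cost of missing files with only isolated Cyrillic letters.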

drdaeman 2009-06-09 12:10:56

Grepping for the specified regex solved my problem. Thanks!

dasp 2009-06-10 13:00:07
