Hi,
I need a way to detect whether a file contains characters from a certain charset.
Specifically, I want to detect the presence of UTF8-encoded cyrillic characters in a series of files. Is there a tool to do this?
Thanks
Hi,
I need a way to detect whether a file contains characters from a certain charset.
Specifically, I want to detect the presence of UTF8-encoded cyrillic characters in a series of files. Is there a tool to do this?
Thanks
IIRC the ICU library has code that does character set detection. Though it's basically a best effort guess.
Edit: I did remember correctly, check out this paper / tutorial
If you are looking for ready solution, you might want to try Enca.
However, if you only want to detect presence of what can be possibly decoded as UTF-8 Cyrillic characters (without any complete UTF-8 validity checks), you just have to grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){
n,}/
(this exact regexp is for n subsequent UTF8-encoded Russian Cyrillic characters). For additional check that the whole file contains only valid UTF-8 data you can use something like isutf8(1)
.
Both methods have their good and bad sides and may sometimes give wrong results.