Does anyone know of a Windows app that can scan through a directory and check which scripts are/aren't encoded as a specified charset (UTF-8 in this case)? I could do it manually, but that could take a while and is quite error-prone!
UTF-8 isn't a character set; it's an encoding for Unicode characters. And since this is not programming-related, I'm nudging it over to Super User.
If you do want to write a program for detecting invalid UTF-8 sequences, it's pretty easy:
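Illegal UTF-8 initial sequences

UTF-8 Sequence        Reason for Illegality
10xxxxxx              illegal as initial byte of character (80..BF)
1100000x              illegal, overlong (C0..C1 80..BF)
11100000 100xxxxx     illegal, overlong (E0 80..9F)
11110000 1000xxxx     illegal, overlong (F0 80..8F)
11111000 10000xxx     illegal, overlong (F8 80..87)
11111100 100000xx     illegal, overlong (FC 80..83)
1111111x              illegal; prohibited by spec (FE..FF)

(Under the current definition of UTF-8, RFC 3629, every initial octet from F5 to FF is illegal, so five- and six-octet sequences are invalid outright, not only their overlong forms.)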
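Then, provided the first octet is legal, just remember that the number of octets forming a code point can be obtained by counting the number of 1 bits before the first 0 bit in that octet. (A first octet of the form 0xxxxxxx has no leading 1 bits and stands alone as a single-octet character.)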
For example, 11110xxx is the start of a 4-octet sequence, so you should skip ahead 4 octets once you've established its legality.
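The bit-counting rule can be sketched in a few lines of Python. This is only an illustration: the function name sequence_length is made up here, and it assumes the current four-octet limit rather than the older five- and six-octet forms. It reports only the length announced by the first octet and does not check for overlong forms.

```python
def sequence_length(lead: int) -> int:
    """Return the number of octets announced by a first octet.

    Returns 0 when the octet cannot start a sequence. Hypothetical helper,
    assuming sequences of at most four octets (RFC 3629).
    """
    if lead < 0x80:
        return 1  # 0xxxxxxx: no leading 1 bits, a single-octet character

    # Count the 1 bits before the first 0 bit, starting at the top bit.
    ones = 0
    mask = 0x80
    while mask and (lead & mask):
        ones += 1
        mask >>= 1

    # Exactly one leading 1 bit means a continuation octet (10xxxxxx);
    # more than four leading 1 bits is outside modern UTF-8.
    return ones if 2 <= ones <= 4 else 0


if __name__ == "__main__":
    for octet in (0x41, 0xC3, 0xE2, 0xF0, 0x80, 0xF8):
        print(f"{octet:08b} -> {sequence_length(octet)}")
```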
The other thing to do is ensure that all continuation octets start with 10.
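Putting those rules together, here is a rough sketch of a checker that also walks a directory, which is what the question asked for. The names is_valid_utf8 and find_non_utf8 are invented for this example. It is stricter than the table in two ways that come from the current standard rather than from this answer: it rejects encoded surrogates (ED A0..BF) and anything above U+10FFFF (F4 90 and up).

```python
from pathlib import Path
from typing import List


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:            # 0xxxxxxx: single octet
            i += 1
            continue
        # Pick the sequence length and the allowed range of the second octet.
        if 0xC2 <= lead <= 0xDF:   # C0 and C1 would be overlong
            length, low, high = 2, 0x80, 0xBF
        elif 0xE0 <= lead <= 0xEF:
            length = 3
            low = 0xA0 if lead == 0xE0 else 0x80    # E0 80..9F is overlong
            high = 0x9F if lead == 0xED else 0xBF   # ED A0..BF are surrogates
        elif 0xF0 <= lead <= 0xF4:
            length = 4
            low = 0x90 if lead == 0xF0 else 0x80    # F0 80..8F is overlong
            high = 0x8F if lead == 0xF4 else 0xBF   # F4 90.. exceeds U+10FFFF
        else:
            return False           # 80..BF, C0..C1, or F5..FF as first octet
        if i + length > n:
            return False           # sequence is cut off at end of data
        if not (low <= data[i + 1] <= high):
            return False
        # Remaining continuation octets must all look like 10xxxxxx.
        for octet in data[i + 2:i + length]:
            if (octet & 0xC0) != 0x80:
                return False
        i += length
    return True


def find_non_utf8(root: str, pattern: str = "*") -> List[Path]:
    """List files under root matching pattern that are not valid UTF-8."""
    return sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and not is_valid_utf8(p.read_bytes())
    )
```

Something like find_non_utf8(r"C:\scripts", "*.php") (a made-up path) would then list the offending files. Each file is read into memory whole, which should be fine for scripts but not for very large files. Also bear in mind that pure ASCII files pass this check, since ASCII is a subset of UTF-8.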
Not sure if this is what you're looking for, but I use a command shell for-loop and dump the first few bytes of each file using my hdump utility, which displays the bytes of the file in hexadecimal form. I then look for the leading 3-byte UTF-8 signature (Byte Order Mark, EF BB BF) at the start of each file.
My hdump utility is available at: http://david.tribble.com/programs.html
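If you'd rather not use a hex dump tool, the same signature check can be approximated in Python. This is a sketch, not a replacement for hdump: the function names are invented here, and the only library piece relied on is the codecs.BOM_UTF8 constant from the standard library. Note that many UTF-8 files are saved without a signature, so a missing BOM does not prove a file isn't UTF-8.

```python
import codecs
from pathlib import Path
from typing import List


def starts_with_bom(head: bytes) -> bool:
    """True if the bytes begin with the UTF-8 signature EF BB BF."""
    return head.startswith(codecs.BOM_UTF8)


def file_has_bom(path) -> bool:
    """Read only the first three bytes of a file and test for the signature."""
    with open(path, "rb") as f:
        return starts_with_bom(f.read(3))


def files_without_bom(root: str, pattern: str = "*") -> List[Path]:
    """List files under root matching pattern that lack the signature."""
    return sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and not file_has_bom(p)
    )
```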