ansaurus

Question

Answer 1

+2 A:

UTF-8 isn't a character set, it's an encoding for Unicode characters. And, since this is not programming related, I'm nudging it over to superuser.

If you do want to write a program for detecting those sequences, it's pretty easy:

Illegal UTF-8 initial sequences

UTF-8 Sequence       Reason for Illegality 
10xxxxxx             illegal as initial byte of character (80..BF) 
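1100000x             illegal, overlong (C0..C1 80..BF) 
11100000  100xxxxx   illegal, overlong (E0 80..9F) 
11110000  1000xxxx   illegal, overlong (F0 80..8F) 
11111000  10000xxx   illegal, overlong (F8 80..87) 
11111100  100000xx   illegal, overlong (FC 80..83) 
1111111x             illegal; prohibited by spec (FE..FF)

Under the current UTF-8 definition (RFC 3629), which stops at four octets, every lead byte from F8 to FF is illegal regardless of what follows.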

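Then, provided the first octet is legal, just remember that for a multi-octet sequence the number of octets forming a code point can be obtained by counting the number of 1 bits before the first 0 bit of the first octet. An octet that starts with a 0 bit (0xxxxxxx) is a complete single-octet character.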

For example, 11110xxx is the start of a 4-octet sequence so you should skip ahead 4 octets once you've established its legality.

The other thing to do is ensure that all continuation octets start with 10.
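The steps above could be sketched in Python roughly as follows. This is only an illustration: the function names `leading_ones` and `first_illegal_offset` are made up for this example, and the sketch assumes the four-octet limit of RFC 3629, so the 5- and 6-octet lead bytes from the table are rejected outright rather than checked for overlong forms. It does not check for surrogates or for code points above U+10FFFF, which a strict decoder such as Python's built-in `bytes.decode("utf-8")` also rejects.

```python
def leading_ones(octet: int) -> int:
    """Count the 1 bits before the first 0 bit of an octet (hypothetical helper)."""
    count = 0
    mask = 0x80
    while mask and (octet & mask):
        count += 1
        mask >>= 1
    return count


def first_illegal_offset(data: bytes) -> int:
    """Return the offset of the first illegal sequence, or -1 if none is found.

    Assumption: sequences longer than four octets are treated as illegal.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        n = leading_ones(lead)
        if n == 0:
            i += 1  # 0xxxxxxx: single-octet character
            continue
        # n == 1: continuation octet used as a lead (80..BF)
        # n > 4:  lead bytes F8..FF
        # C0, C1: overlong two-octet forms
        if n == 1 or n > 4 or lead in (0xC0, 0xC1):
            return i
        seq = data[i:i + n]
        if len(seq) < n:
            return i  # sequence is cut off by the end of the data
        # Every continuation octet must start with the bits 10.
        if any((b & 0xC0) != 0x80 for b in seq[1:]):
            return i
        # Overlong three- and four-octet forms (E0 80..9F, F0 80..8F).
        if (lead == 0xE0 and seq[1] < 0xA0) or (lead == 0xF0 and seq[1] < 0x90):
            return i
        i += n  # skip ahead over the whole sequence
    return -1
```

To check a file, read it in binary mode and pass the bytes in, for example `first_illegal_offset(open(name, "rb").read())`; a result of -1 means no illegal sequence was found by these rules.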

paxdiablo 2009-11-18 03:06:58

Answer 2

+1 A:

Not sure if this is what you're looking for, but I use a command shell for-loop and dump the first few bytes of each file using my hdump utility, which displays the bytes of the file in hexadecimal form. I then look for the leading 3-byte UTF-8 signature (Byte Order Mark) at the start of each file.

My hdump utility is available at: http://david.tribble.com/programs.html
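For anyone without a hex dump tool at hand, the same signature check could be sketched in Python. This is an illustrative equivalent, not the hdump approach itself; `has_utf8_bom` and `files_without_bom` are hypothetical names chosen for this example. Keep in mind that a missing signature does not mean a file is not UTF-8, only that it has no BOM.

```python
import codecs
from pathlib import Path


def has_utf8_bom(path) -> bool:
    """Return True if the file starts with the 3-byte UTF-8 signature EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def files_without_bom(directory) -> list:
    """List the names of regular files in a directory that lack the signature."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and not has_utf8_bom(p)
    )
```

Calling `files_without_bom(".")` lists the files in the current directory that do not start with the signature.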

Loadmaster 2009-11-18 03:07:39

Not all UTF-8 files have a BOM.

Matthew Talbert 2009-11-18 03:17:58


Checking all files are encoded as UTF-8
