How to identify UTF-8 encoded strings

+6 Q:

What's the best way to identify whether a string is (or might be) UTF-8 encoded? The Win32 API IsTextUnicode isn't of much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And yes, I know that only characters above the ASCII range are encoded with more than 1 byte.

+5 A:

There is no fully reliable way. However, a random sequence of bytes (e.g. a string in a standard 8-bit encoding) is very unlikely to be a valid UTF-8 string: if the most significant bit of a byte is set, there are very specific rules as to what kind of bytes can follow it in UTF-8. So you can try decoding the string as UTF-8 and treat it as UTF-8 if there are no decoding errors.

Determining whether there were decoding errors is another problem altogether: many Unicode libraries simply replace invalid characters with a question mark without indicating that an error occurred. So you need an explicit way of determining whether an error occurred while decoding.
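As an illustration of the "try to decode and watch for errors" approach, here is a minimal sketch in Python (not the asker's Win32 environment). The function name is_probably_utf8 is made up for this example. Python's built-in decoder raises an exception in strict mode instead of silently substituting characters, which gives the explicit error signal described above.

```python
def is_probably_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8.

    Hypothetical helper for illustration; note that pure ASCII input
    also passes, since ASCII is a subset of UTF-8.
    """
    try:
        # errors="strict" raises instead of inserting replacement characters.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, is_probably_utf8("naïve".encode("utf-8")) is True, while the same word encoded as Latin-1 is rejected because the lone 0xEF byte is not followed by valid continuation bytes.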

Laurent 2008-12-18 09:15:06

+4 A:

This W3C page has a Perl regular expression for validating UTF-8.
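For readers without the link, a byte-level pattern in the same spirit can be sketched in Python. This is written from the UTF-8 byte-range rules and is not a copy of the W3C expression; the names UTF8_PATTERN and matches_utf8 are made up for this example.

```python
import re

# Each alternative is one well-formed UTF-8 sequence; the ranges exclude
# overlong encodings, UTF-16 surrogates, and code points above U+10FFFF.
UTF8_PATTERN = re.compile(
    rb"""
    (?:
        [\x00-\x7F]                          # ASCII
      | [\xC2-\xDF][\x80-\xBF]               # 2-byte, no overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )*
    """,
    re.VERBOSE,
)


def matches_utf8(data: bytes) -> bool:
    """Return True if the whole byte string consists of valid sequences."""
    # fullmatch anchors the pattern at both ends of the input.
    return UTF8_PATTERN.fullmatch(data) is not None
```

Because fullmatch anchors both ends, a single stray high byte anywhere in the input makes the check fail.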

hamishmcn 2008-12-18 09:18:20

If you're reading a stream and you might not have the beginning, you should either lose the \A at the beginning or add a ".{0,5}?" just after it to capture the first truncated character.
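The same idea can be sketched without a regular expression. The example below assumes modern UTF-8, where a character is at most 4 bytes, so a chunk cut mid-character starts with at most three continuation bytes; the function name looks_like_utf8_fragment is made up for this example. It also tolerates a character cut off at the end of the chunk, which the comment above does not cover.

```python
import codecs


def looks_like_utf8_fragment(chunk: bytes) -> bool:
    """Validate a chunk that may start or end in the middle of a character."""
    # Skip at most three leading continuation bytes (10xxxxxx): the tail of
    # a character whose lead byte was cut off before the chunk began.
    start = 0
    while start < min(3, len(chunk)) and 0x80 <= chunk[start] <= 0xBF:
        start += 1

    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False keeps an incomplete trailing sequence buffered
        # instead of raising an error for it.
        decoder.decode(chunk[start:], final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Skipping the leading bytes weakens the check slightly, since the skipped bytes cannot be validated against the lead byte that was lost.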

mat 2008-12-18 09:27:11

I would recommend doing this by using the language's standard Unicode library rather than reimplementing it through regular expressions.

Laurent 2008-12-18 09:38:36

+9 A:

chardet is the character set detection code developed by Mozilla and used in Firefox. Source code

jchardet is a Java port of the source from Mozilla's automatic charset detection algorithm.

NCharDet is a .Net (C#) port of a Java port of the C++ used in the Mozilla and Firefox browsers.

Code project C# sample that uses Microsoft's MLang for character encoding detection.

UTRAC is a command-line tool and library written in C++ to detect string encoding.

cpdetector is a Delphi library used for encoding detection.

Another useful post that points to a lot of libraries to help you determine character encoding: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

You could also take a look at the related question http://stackoverflow.com/questions/373081/how-can-i-best-guess-the-encoding-when-the-bom-byte-order-mark-is-missing, which has some useful content.

Edward Wilde 2008-12-18 10:40:33
