In the absence of a BOM, is there a quick and dirty way to check whether a char* buffer contains UTF-8 text?

+4  A: 

Not reliably. See Raymond Chen's series of posts on the subject.

The problem is that UTF-8 without a BOM is all too often indistinguishable from an equally valid ANSI encoding. I think most solutions (like the Win32 API IsTextUnicode) use various heuristics to make a best guess at the format of the text.

Mark Pim
+5  A: 

You can test the hypothesis that it could be UTF-8, but I believe the only thing you can establish with certainty is that it is not. In other words, you can examine the buffer to see that all byte sequences are legal UTF-8, that the code points are represented with the fewest possible bytes, that no 16-bit surrogate codes are present, and so forth. A buffer that passes all of those checks might seem to be text, but you could be fooled.
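The checks listed above can be sketched in C roughly as follows. This is only a minimal illustration, not a reference implementation: the function name looks_like_utf8 is made up for this example, and it treats an empty buffer as valid. A true result only means the bytes could be UTF-8; a false result means they definitely are not.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8.
 * Rejects stray or missing continuation bytes, overlong (non-shortest)
 * forms, surrogate code points, and values above U+10FFFF. */
bool looks_like_utf8(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i = 0;

    while (i < len) {
        unsigned char b = p[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp, min; /* decoded code point, smallest legal value */

        if (b < 0x80) {        /* plain ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;      /* stray continuation or invalid lead byte */
        }

        if (len - i <= extra)
            return false;      /* sequence is cut off at the end of the buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80)
                return false;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return false;      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;      /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;      /* beyond the Unicode range */

        i += extra + 1;
    }
    return true;
}
```

Note that a pure 7-bit ASCII buffer passes this test too, which is exactly why a positive answer proves nothing about what the bytes were meant to be.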

Beyond the cases in the Raymond Chen discussion at The Old New Thing cited in Mark Pim's answer, the buffer could actually contain x86 machine code that just happens to be restricted to the subset that looks like 7-bit printable ASCII. Amazingly, you can write meaningful programs in that subset; one example is the EICAR anti-virus test file.

Of course, a buffer that contains byte sequences that are malformed UTF-8 is probably not UTF-8 text at all. In that case, you have a high degree of confidence. Then the trick is to figure out what encoding it might actually be.

If you know (or can assume) something about the semantic content of the buffer, then you could also use that to support your determination. For example, if the buffer is supposed to contain English text, then it is highly unlikely to have codepoints from Korean in it, and it should generally be spelled correctly, follow English grammar, and so forth. This can get expensive to test, of course...
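As a rough illustration of that kind of semantic check, here is a sketch in C that looks for code points in the Hangul Syllables block (U+AC00 to U+D7A3). The function name contains_hangul_syllable is invented for this example, and the code assumes the buffer has already been found to be well-formed UTF-8; it covers only that one block and is nowhere near a real language detector.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the buffer contains at least one
 * code point in the Hangul Syllables block (U+AC00..U+D7A3).
 * Assumes buf[0..len) is already known to be well-formed UTF-8, so any
 * byte in 0xE0..0xEF starts a three-byte sequence. */
bool contains_hangul_syllable(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;

    for (size_t i = 0; i + 2 < len; i++) {
        if ((p[i] & 0xF0) == 0xE0) {
            /* Decode the three-byte sequence into a code point. */
            unsigned long cp = ((unsigned long)(p[i] & 0x0F) << 12)
                             | ((unsigned long)(p[i + 1] & 0x3F) << 6)
                             |  (unsigned long)(p[i + 2] & 0x3F);
            if (cp >= 0xAC00 && cp <= 0xD7A3)
                return true;
        }
    }
    return false;
}
```

If a buffer that is supposed to hold English prose returns true here, that is a hint that either the assumption about the content or the guess about the encoding is wrong.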

RBerteig
A: 

For quick and dirty, you can't do much better than the regex on this page. If you just want to know whether it's safe to decode the bytes as UTF-8, that's all you need.

Alan Moore