Is it possible to know whether a file has Unicode (16 bits per char) or 8-bit ASCII content?

+4  A: 

You may be able to read a byte order mark (BOM), if the file has one.
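As a rough illustration of the BOM check, here is a minimal Python sketch. The function name `detect_bom` is made up for this example; the BOM constants come from the standard `codecs` module. A file without a BOM simply yields `None`, so this is only a first test, not a complete detector.

```python
import codecs

# BOM signatures, longest first: the UTF-32LE BOM begins with the
# same two bytes as the UTF-16LE BOM, so it must be checked earlier.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Reading only the first four bytes of the file is enough for this check, for example `detect_bom(open(path, "rb").read(4))`.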

Brian Agnew
+1  A: 

If the files you need to classify are always long enough, and you have some idea what they are supposed to contain (say, English text in Unicode or English text in ASCII), you can do a simple frequency analysis on the chars and see whether the distribution looks like that of ASCII or of Unicode.
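One very crude form of such a frequency analysis is to look at which byte values dominate. The sketch below is only an assumption about what a useful profile might look like; `byte_profile` and its three categories are invented for illustration.

```python
from collections import Counter


def byte_profile(data: bytes) -> dict:
    """Return the fraction of NUL, other 7-bit, and high-bit bytes in data."""
    if not data:
        return {"nul": 0.0, "seven_bit": 0.0, "high_bit": 0.0}
    counts = Counter(data)  # maps byte value (0..255) to occurrence count
    total = len(data)
    nul = counts[0]
    high = sum(n for value, n in counts.items() if value >= 0x80)
    return {
        "nul": nul / total,
        "seven_bit": (total - nul - high) / total,
        "high_bit": high / total,
    }
```

English text stored as UTF-16 should show a NUL fraction close to one half, while the same text in ASCII should show none at all.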

Pascal Cuoq
+1  A: 

Unicode is an alphabet, not an encoding. You probably meant UTF-16. There are lots of libraries around (python-chardet comes to mind instantly) that autodetect the encoding of text, though they all use heuristics.

dottedmag
Thanks, I have fixed the topic.
Soubok
Unfortunately Microsoft have really confused this issue by consistently calling the UTF-16LE encoding “Unicode”.
bobince
A: 

First off, ASCII is 7-bit, so if any byte has its high bit set you know the file isn't ASCII.

The various "common" character sets such as ISO-8859-x, Windows-1252, etc, are 8-bit and rarely contain 0 bytes in text. So if every other byte is 0, you know that you're dealing with Unicode (UTF-16) that only uses the ISO-8859-1 characters.
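The "every other byte is 0" test can be sketched as follows. This is a heuristic under stated assumptions, not a definitive detector: the name `guess_utf16_by_zero_bytes` and both cutoffs (0.9 and 0.1) are arbitrary choices for illustration, and the text is assumed to be mostly Latin-1 characters.

```python
def guess_utf16_by_zero_bytes(data: bytes, mostly: float = 0.9, rarely: float = 0.1):
    """Guess the UTF-16 byte order from where the zero bytes fall, or return None."""
    if len(data) < 2:
        return None
    even = data[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = data[1::2]   # bytes at offsets 1, 3, 5, ...
    even_zero = even.count(0) / len(even)
    odd_zero = odd.count(0) / len(odd)
    # Little-endian: low byte first, so the zero high bytes sit at odd offsets.
    if odd_zero >= mostly and even_zero <= rarely:
        return "utf-16-le"
    # Big-endian: zero high byte first, so the zeros sit at even offsets.
    if even_zero >= mostly and odd_zero <= rarely:
        return "utf-16-be"
    return None
```

A result of `None` only means the pattern was not found; the data could still be UTF-16 text in a non-Latin script.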

You'll run into problems when you're trying to distinguish between Unicode and some encoding such as UTF-8. In this case, almost every byte will have a nonzero value, so you can't make an easy decision. You can, as Pascal says, do some sort of statistical analysis of the content: Arabic and Ancient Greek probably won't be in the same file. However, this is probably more work than it's worth.


Edit in response to OP's comment:

I think that it will be sufficient to check for the presence of 0-value bytes (ASCII NUL) within your content, and make the choice based on that. The reason being that JavaScript keywords are ASCII, and ASCII is a subset of Unicode. Therefore any Unicode representation of those keywords will consist of one byte containing the ASCII character (low byte), and another containing 0 (the high byte).
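A minimal sketch of that decision, written in Python for brevity rather than in the C of the actual embedding. The helper `pick_compile_function` is hypothetical and only returns the name of the function to call; it assumes the script is either plain 8-bit text or UTF-16 and contains no literal NUL characters.

```python
def pick_compile_function(data: bytes) -> str:
    """Return the name of the compile function suggested by the NUL-byte test."""
    # Any zero byte is taken as evidence of UTF-16, because every ASCII
    # character in the source occupies one zero byte and one nonzero byte.
    if b"\x00" in data:
        return "JS_CompileUCScript"
    return "JS_CompileScript"
```

The same test in C is a single pass over the buffer with `memchr(buf, 0, len)`.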

My one caveat is that you should carefully read the documentation to ensure that their use of the word "Unicode" is correct (I looked at this page to understand the function, but did not look any further).

kdgregory
I have to choose between JS_CompileScript() and JS_CompileUCScript() to compile JavaScript files for my native embedding (http://code.google.com/p/jslibs)
Soubok
A: 

For your specific use case, it's very easy to tell. Just scan the file: if you find any NUL byte ("\0"), it must be UTF-16. JavaScript has to contain ASCII chars, and in UTF-16 each of them is paired with a 0 byte.

ZZ Coder
+1  A: 

Ditto to what Brian Agnew said about reading the byte order mark, a special two-byte sequence that might appear at the beginning of the file.

You can also tell whether it is ASCII by scanning every byte in the file and checking that they are all less than 128. If they are all less than 128, then it's just an ASCII file. If any of them is 128 or more, there is some other encoding in there.
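The byte scan described above might look like this in Python. The function name `is_ascii` is just for illustration, and an empty file is treated as ASCII here.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte is below 128, i.e. has its high bit clear."""
    # Iterating over a bytes object yields integers in the range 0..255.
    return all(byte < 128 for byte in data)
```

On Python 3.7 and later, the built-in `bytes.isascii()` method performs the same check.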

David Grayson