I have a question that may be quite naive, but I feel the need to ask, because I don't really know what is going on. I'm on Ubuntu.

Suppose I do

echo "t" > test.txt

if I then

file test.txt

I get test.txt: ASCII text

If I then do

echo "å" > test.txt

Then I get

test.txt: UTF-8 Unicode text

How does that happen? How does file "know" the encoding, or, alternatively, how does it guess it?

Thanks.

+4  A: 

From the file manpage:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as ''text'' because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only ''character data'' because, while they contain text, it is text that will require translation before it can be read. In addition, file will attempt to determine other characteristics of text-type files. If the lines of a file are terminated by CR, CRLF, or NEL, instead of the Unix-standard LF, this will be reported. Files that contain embedded escape sequences or overstriking will also be identified.

schnaader
right. thanks. this is the part that I'm after though: "[utf-8] can be distinguished by the different ranges and sequences of bytes that constitute printable text", what exactly distinguishes them?
Dervin Thunk
Ctrl+C and then Ctrl+V
Isaac
+3  A: 

There are certain byte sequences that suggest that UTF-8 encoding may be in use (see Wikipedia). If file finds one or more of those and doesn't find anything that can't occur in UTF-8, it's a fair guess that the file is encoded in UTF-8. But again, just a guess. For the basic ASCII character set (normal characters like 't'), the binary representation is the same in all encodings, so file just goes with ASCII by default.
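As a rough sketch of the guess described above (in Python, and only an approximation — this is not file's actual implementation), the logic amounts to: if everything decodes as 7-bit ASCII, report ASCII; otherwise, if the multi-byte sequences are all valid UTF-8, report UTF-8.

```python
def guess_text_encoding(data: bytes) -> str:
    """Hypothetical helper approximating file(1)'s text heuristic."""
    try:
        data.decode("ascii")
        return "ASCII text"          # pure 7-bit bytes: report ASCII
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF-8 Unicode text"  # valid multi-byte sequences found
    except UnicodeDecodeError:
        return "data"                # violates UTF-8 rules: not UTF-8 text

print(guess_text_encoding(b"t\n"))                 # ASCII text
print(guess_text_encoding("å\n".encode("utf-8")))  # UTF-8 Unicode text
```

Note that this mirrors the behavior in the question: "t" is reported as ASCII even though it is also valid UTF-8, because ASCII is the more specific answer.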

The other thing to take note of is that your shell's locale is set to UTF-8, which is why the file gets written in UTF-8 in the first place. Conceivably, you could set the locale to another encoding like UTF-16, and then the command

echo "å" > test.txt

would write a file using UTF-16.
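To illustrate how much the bytes on disk depend on the encoding in effect, here is the same character encoded three ways (a Python sketch; the shell simply writes whichever bytes its locale dictates):

```python
# The bytes written for "å\n" depend entirely on the active encoding
print("å\n".encode("utf-8"))      # b'\xc3\xa5\n'
print("å\n".encode("utf-16-le"))  # b'\xe5\x00\n\x00'
print("å\n".encode("latin-1"))    # b'\xe5\n'
```

The UTF-8 bytes (c3 a5 0a) are exactly what a UTF-8 terminal produces for the echo command in the question.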

David Zaslavsky
You're correct about the xterm being initialized in utf-8.
Dervin Thunk
+2  A: 

It inserts a BOM at the very beginning of the file.

A BOM (Byte-Order Mark) tells editors the encoding of the file (and other things, like big/little-endian byte order).

You can find out whether a BOM exists by checking the file size. It's more than 2 bytes (I guess it's 4 or 5 bytes).

This article about BOMs on Wikipedia can help a lot.


Update:

Yes, I was wrong.

Even though there is a BOM for UTF-8, most editors do NOT insert it at the beginning, because the BOM bytes are not ASCII-compatible, and one of the goals of UTF-8's design is ASCII compatibility. So it's really bad to insert a BOM in UTF-8!

So editors really do guess whether files are encoded in UTF-8 or not.
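For reference, the BOM byte values themselves are available as constants in Python's standard library (a small sketch):

```python
import codecs

print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf' (3 bytes)
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'     (2 bytes)

# Every byte of the UTF-8 BOM is >= 0x80, which is why prepending it
# breaks the ASCII compatibility that UTF-8 was designed to preserve.
assert all(b >= 0x80 for b in codecs.BOM_UTF8)
```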


So another question:

It seems there is a possibility that editors guess wrong about the real encoding of a file. Are such situations rare? Clearly, smaller texts have a greater chance of being misidentified.

Isaac
BOMs are not universally used and if they're not present, all you can do is guess.
Artelius
In particular, echo "å" > test.txt will probably NOT insert a BOM into the file, because echo isn't designed to create files.
Artelius
Artelius is right. BOMs are, in fact, rarely used, except by MSVS. In any case, if I hexdump, there is still no BOM: c3 a5 0a (0a is the newline)...
Dervin Thunk
Guessing the encoding of a text is not simple. It needs statistical analysis. Especially when your file contains just ONE character, it's almost impossible! Check the file size to see whether there is a BOM ;)
Isaac
A BOM encoded into UTF-8 is three bytes. But BOMs should not generally be used in UTF-8 files as they are meaningless and non-ASCII-compatible. Unfortunately some Microsoft software does put them in anyway.
bobince
@bobince , @Dervin Thunk , @Artelius - Thanks. I was wrong! =]
Isaac
+2  A: 

UTF-8 is "ASCII-friendly", in the sense that a text file consisting only of ASCII characters will be exactly the same, whether it is encoded with ASCII or UTF-8.

Note: some people think there are 256 ASCII characters. There are only 128. ISO-8859-x is a family of encodings whose first 128 characters are ASCII and the rest are other characters.

Also, UTF-8 is very well designed and gives you several useful properties. For instance, some characters are encoded with 1 byte, some with 2, 3, or 4, but a 4-byte character will never contain the bytes of any shorter character, nor will a 3- or 2-byte character. All 1-byte characters are encoded with bytes 0 to 127, while all longer characters are encoded as sequences of bytes in the range 128 to 255.
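Those properties are easy to observe from Python (a quick illustration; the characters are just examples of each length class):

```python
# One example character per UTF-8 length class: 1, 2, 3, and 4 bytes
for ch in ("t", "å", "€", "𝄞"):
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex(" "))

# 1-byte characters use only bytes 0..127; every byte of a longer
# character lies in 128..255, so an ASCII byte can never appear
# inside the encoding of a multi-byte character.
assert all(b < 0x80 for b in "t".encode("utf-8"))
assert all(b >= 0x80 for b in "å€𝄞".encode("utf-8"))
```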

A non-UTF-8 byte stream (for instance, a binary file or a UTF-16 file) can usually be ruled out as UTF-8, because it is likely to violate such properties. The only exception is plain ASCII files, which of course can be harmlessly interpreted as UTF-8 anyway.

So in short, UTF-8 files can be detected as such because most "random" byte sequences are illegal in UTF-8, and so something that doesn't violate any rules is quite likely to be UTF-8.
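A tiny sketch of why this guess works: sequences that break UTF-8's rules are rejected outright, and UTF-16 data for the same text fails UTF-8 validation too.

```python
# A lead byte 0xC3 must be followed by a continuation byte (0x80..0xBF);
# 0x28 ('(') is not one, so this two-byte sequence is illegal UTF-8.
invalid = bytes([0xC3, 0x28])
try:
    invalid.decode("utf-8")
    result = "valid UTF-8"
except UnicodeDecodeError:
    result = "not UTF-8"
print(result)  # not UTF-8

# UTF-16 bytes for ordinary text also fail UTF-8 validation
# (here the leading BOM byte 0xFF can never occur in UTF-8).
utf16_bytes = "hello å".encode("utf-16")
try:
    utf16_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("UTF-16 data ruled out as UTF-8")
```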

Artelius