views:

427

answers:

5
+1  Q: 

Get file encoding

How can I figure out with PHP what file encoding a file has?

+3  A: 

mb_detect_encoding should be able to do the job.

http://us.php.net/manual/en/function.mb-detect-encoding.php

In its default setup, it'll only detect ASCII, UTF-8, and a few Japanese JIS variants. It can be configured to detect more encodings if you specify them manually. If a file is both ASCII and UTF-8, it'll return UTF-8.
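As a rough sketch of specifying the candidate encodings manually: the helper name `guessFileEncoding()` and the default candidate list are assumptions made for illustration, not anything from the manual. Which encoding wins when several candidates are valid can differ between PHP versions, so treat the result as a guess.

```php
<?php
// Sketch only: guessFileEncoding() is a hypothetical helper, not a PHP built-in.
function guessFileEncoding(string $path, array $candidates = ['ASCII', 'UTF-8', 'ISO-8859-1'])
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        return false; // the file could not be read
    }
    // Strict mode (third argument true) discards candidates for which the
    // data contains invalid byte sequences.
    return mb_detect_encoding($contents, $candidates, true);
}
```

Calling `guessFileEncoding('/path/to/file.txt', ['UTF-8', 'ISO-8859-1'])` returns the name of a matching candidate, or `false` if none of them fits.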

BlackAura
I don't think your last statement is right. If it were, then ASCII would never be detected because all ASCII strings are also UTF-8. I think the mb_detect_order() function is relevant for determining what encoding is returned when more than one could be valid. By default ASCII comes before UTF-8.
Rob Kennedy
According to PHP's documentation, it should work that way, yes. It just doesn't seem to. If it worked as the documentation says it should, it would never return UTF-8. When I've used it in the past, it prefers UTF-8 over ASCII, returning ASCII only when the string isn't a valid UTF-8 string.
BlackAura
A: 

You can't really, unless the file is kind enough to tell you somewhere inside it.

For example, HTML files are meant to contain a content-type meta tag near the top, so that your web browser knows what encoding is used, e.g.

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

or

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

There are methods that try to guess by looking at the file and spotting byte sequences that suggest certain encodings, but these are really only guessing.
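If the file is HTML, reading the declared charset could look something like the sketch below. The helper name `declaredHtmlCharset()`, the 1024-byte window and the regular expression are all assumptions for illustration; a real HTML parser would be more robust, and the declaration itself can be wrong.

```php
<?php
// Sketch only: declaredHtmlCharset() is a hypothetical helper, not a PHP built-in.
function declaredHtmlCharset(string $html)
{
    // Only look near the top of the document, where the declaration should be.
    $head = substr($html, 0, 1024);
    // Matches <meta http-equiv=... content="...; charset=X"> as well as <meta charset="X">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $head, $m)) {
        return strtoupper($m[1]);
    }
    return null; // the document does not declare a charset
}
```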

rikh
A: 

You can use the fread() function to look at the first few bytes of the file for the "magic number", and then map that magic number against a list of known magic numbers for file types.

Spike Williams
Only up to a rather limited point. The encoding for a UTF-16 file is indicated by a BOM (byte-order mark), to distinguish between little-endian and big-endian (UTF-16LE and UTF-16BE). But for other code sets, there is no mandatory identification - they just get on with presenting the data in their encoding.
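A minimal sketch of checking for a BOM with fread() might look like this. `detectBom()` is a hypothetical helper name, and the UTF-8 and UTF-32 signatures are additions beyond the UTF-16 case mentioned above. A file without a BOM tells you nothing, and a UTF-16LE file whose first character is U+0000 would be misreported as UTF-32LE.

```php
<?php
// Sketch only: detectBom() is a hypothetical helper, not a PHP built-in.
function detectBom(string $path)
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    $prefix = (string) fread($handle, 4);
    fclose($handle);

    // Longer signatures come first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    $signatures = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($signatures as $bytes => $encoding) {
        if (strncmp($prefix, $bytes, strlen($bytes)) === 0) {
            return $encoding;
        }
    }
    return null; // no BOM found: the encoding is still unknown
}
```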
Jonathan Leffler
+3  A: 

Detecting the encoding is really hard for all 8-bit character sets except UTF-8 (which is easier because not every byte sequence is valid UTF-8), and it usually requires semantic knowledge of the text whose encoding is to be detected.

Think about it: a piece of plain text is just a sequence of bytes with no encoding information attached. Any single byte could mean anything, so to have a chance of detecting the encoding you have to look at each byte in the context of the bytes around it and apply heuristics based on the languages the text is likely to be written in.

For 8-bit character sets, though, you can never be sure.

A demonstration of heuristics going wrong is here for example:

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html

With some 16-bit encodings you have a chance of detecting them, because the data might start with a byte order mark or have every second byte set to 0.
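The "every second byte is 0" idea can be sketched as below. `looksLikeUtf16()` and its 0.9 threshold are assumptions for illustration. The check only works for text that is mostly in the Latin-1 range, and like any heuristic it can be fooled.

```php
<?php
// Sketch only: looksLikeUtf16() is a hypothetical helper, not a PHP built-in.
function looksLikeUtf16(string $data, float $threshold = 0.9)
{
    $pairs = intdiv(strlen($data), 2);
    if ($pairs === 0) {
        return null; // not enough data to judge
    }
    $evenZeros = 0; // zero bytes at offsets 0, 2, 4, ...
    $oddZeros = 0;  // zero bytes at offsets 1, 3, 5, ...
    for ($i = 0; $i < $pairs * 2; $i += 2) {
        if ($data[$i] === "\x00") {
            $evenZeros++;
        }
        if ($data[$i + 1] === "\x00") {
            $oddZeros++;
        }
    }
    // Little-endian puts the low byte first, so the zero high byte sits at odd offsets.
    if ($evenZeros === 0 && $oddZeros / $pairs >= $threshold) {
        return 'UTF-16LE';
    }
    if ($oddZeros === 0 && $evenZeros / $pairs >= $threshold) {
        return 'UTF-16BE';
    }
    return null; // no clear pattern
}
```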

If you just want to detect UTF-8, you can either use mb_detect_encoding as already explained, or you can use this handy little function:

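// Returns true if $string contains at least one well-formed UTF-8 multibyte
// sequence. The pattern is not anchored, so it does not validate the whole
// string, and pure ASCII input returns false.
function isUTF8($string){
    return (bool) preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]                # non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )+%xs', $string);
}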
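Note that the pattern above is not anchored, so it reports whether the string contains some valid multibyte sequence, not whether the whole string is valid. If you need the whole-string check, a sketch using mbstring's `mb_check_encoding()` could look like this. The wrapper name `isValidUtf8()` is an assumption. Plain ASCII passes this check, because every ASCII string is also valid UTF-8.

```php
<?php
// Sketch only: isValidUtf8() is a hypothetical wrapper name;
// mb_check_encoding() itself is part of the mbstring extension.
function isValidUtf8(string $string): bool
{
    // True only if every byte sequence in $string is well-formed UTF-8.
    return mb_check_encoding($string, 'UTF-8');
}
```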
pilif
A: 

BlackAura's suggestion is very good, IMHO.

Another option is to call file(1) on the file in question using system() or the like. Often, it is able to tell you the encoding as well. It should be available in any sane UNIX environment.
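Calling file(1) from PHP might look like the sketch below. `fileCommandEncoding()` is a hypothetical helper, and it assumes a `file` implementation that supports `-b` and `--mime-encoding` (current Linux and macOS versions do, but older or non-standard builds may not). The answer is still a guess made by file(1).

```php
<?php
// Sketch only: fileCommandEncoding() is a hypothetical helper wrapping file(1).
function fileCommandEncoding(string $path)
{
    // -b omits the file name from the output; --mime-encoding prints only the charset guess.
    $command = 'file -b --mime-encoding ' . escapeshellarg($path) . ' 2>/dev/null';
    exec($command, $output, $status);
    if ($status !== 0 || !isset($output[0])) {
        return null; // file(1) is missing or failed
    }
    return trim($output[0]); // e.g. "utf-8" or "iso-8859-1"
}
```

The escapeshellarg() call matters here, since the path ends up on a shell command line.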

rodion