I'm getting console input from the user and want to encode it to UTF-8. My understanding is that C++ does not specify an encoding for input streams; instead it depends on the compiler, the runtime environment, localization, and so on.

How can I determine the input encoding?

+3  A: 

In general, you can't. If I shoot a stream of randomly generated bytes at your app, how can it determine their "encoding"? You simply have to specify that your application accepts certain encodings, or assume that what the OS hands you will be suitably encoded.

anon
To build on this, you can use the environment to determine a sensible default. Take a look at $LANG and the locale command, if your OS supports them.
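
For instance, a rough sketch of that approach on a POSIX-like system (nl_langinfo and <langinfo.h> are POSIX, not standard C++, so this is an assumption about your platform):

// Rough sketch: derive a default input encoding from the user's locale.
// Assumes a POSIX-like system that provides nl_langinfo(CODESET).
#include <clocale>
#include <cstdio>
#include <langinfo.h>

int main()
{
    // Adopt the locale from the environment (LANG / LC_CTYPE / LC_ALL).
    std::setlocale(LC_ALL, "");

    // CODESET names the character encoding of that locale,
    // e.g. "UTF-8" or "ISO-8859-1".
    const char* codeset = nl_langinfo(CODESET);
    std::printf("assumed input encoding: %s\n", codeset);
    return 0;
}
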
Roger Pate
+2  A: 

Generally, checking whether input is UTF-8 is a matter of heuristics -- there is no definitive algorithm that will tell you "yes" or "no". The more sophisticated the heuristic, the fewer false positives and negatives you will get, but there is no sure way.

For an example of such a heuristic, take a look at this library: http://utfcpp.sourceforge.net/

#include <fstream>
#include <iterator>
#include "utf8.h" // from the utfcpp library linked above

bool valid_utf8_file(const char* file_name)
{
    std::ifstream ifs(file_name);
    if (!ifs)
        return false; // even better, throw here

    std::istreambuf_iterator<char> it(ifs.rdbuf());
    std::istreambuf_iterator<char> eos;

    return utf8::is_valid(it, eos);
}

You can either use it directly, or check its source to see how it is done.

Kornel Kisielewicz
Note - this tells you it COULD be UTF-8; you can't know that it is. A stream of plain 7-bit ASCII is valid UTF-8 until you hit the first accented character.
Martin Beckett
*Checking* whether input is valid UTF-8 or not isn't heuristic (it's what your function does), but determining if UTF-8 was the user's *intention* is.
Roger Pate
Language shortcut :/
Kornel Kisielewicz
A: 

Use the operating system's built-in facilities; these vary from one OS to another. On Windows, it's always better to use the wide-character (WideChar) APIs and not think about encoding at all.
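
For example, a minimal sketch of that approach (Win32 only; the buffer size and function name are arbitrary choices for this example):

#include <windows.h>
#include <string>

// Minimal sketch: read one chunk of console input as UTF-16 and convert it to UTF-8.
std::string read_console_utf8()
{
    HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
    wchar_t wbuf[1024];
    DWORD nread = 0;

    // ReadConsoleW hands back UTF-16 regardless of the console code page.
    if (!ReadConsoleW(in, wbuf, 1024, &nread, nullptr))
        return std::string();

    // First call asks how many UTF-8 bytes are needed; second call converts.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, wbuf, static_cast<int>(nread),
                                    nullptr, 0, nullptr, nullptr);
    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wbuf, static_cast<int>(nread),
                        &utf8[0], bytes, nullptr, nullptr);
    return utf8;
}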

And if your input comes from a file, as opposed to a real console, then all bets are off.

Seva Alekseyev
A: 

Jared Oberhaus answered this well on a related question specific to Java.

Basically there are a few steps you can take to make a reasonable guess, but ultimately it's just guesswork without an explicit indication (hence the (in)famous BOM marker in UTF-8 files).
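
For instance, a rough sketch of a BOM check (the function name is made up for this example; a BOM strongly suggests UTF-8, but its absence proves nothing):

#include <string>

// Rough sketch: does this byte sequence start with the UTF-8 BOM (0xEF 0xBB 0xBF)?
bool starts_with_utf8_bom(const std::string& bytes)
{
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}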

John Weldon
A: 

As has already been said in response to the question John Weldon pointed to, there are a number of libraries that do character-encoding recognition. You could also take a look at the source of the Unix file command and see what tests it uses to determine file encoding. From the man page of file:

ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set.
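
As an illustration of that idea, a very rough sketch of such a byte-range test (the names and categories are made up for this example, and real detectors do much more, e.g. rejecting overlong sequences):

#include <cstddef>
#include <string>

enum class Guess { Ascii, MaybeUtf8, ExtendedEightBit };

// Rough sketch: classify a buffer as pure ASCII, plausible UTF-8, or 8-bit extended.
Guess classify(const std::string& bytes)
{
    bool saw_high = false;
    bool utf8_ok = true;
    for (std::size_t i = 0; i < bytes.size(); ) {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80) { ++i; continue; }
        saw_high = true;
        // Count the continuation bytes a UTF-8 lead byte promises...
        int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : (c >= 0xC0) ? 1 : -1;
        if (extra < 0) { utf8_ok = false; ++i; continue; }
        // ...and check that each one is of the form 10xxxxxx.
        for (int k = 1; k <= extra; ++k) {
            if (i + k >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i + k]) & 0xC0) != 0x80) {
                utf8_ok = false;
                break;
            }
        }
        i += extra + 1;
    }
    if (!saw_high) return Guess::Ascii;
    return utf8_ok ? Guess::MaybeUtf8 : Guess::ExtendedEightBit;
}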

PCRE provides a function to test whether a given string is entirely valid UTF-8.

ferdystschenko