I'm getting console input from the user and want to encode it to UTF-8. My understanding is that C++ does not specify an encoding for input streams; instead it depends on the compiler, the runtime environment, localization, and so on.

How can I determine the input encoding?

+3  A: 

In general, you can't. If I shoot a stream of randomly generated bytes at your app, how can it determine their "encoding"? You simply have to specify that your application accepts certain encodings, or assume that what the OS hands you will be suitably encoded.

anon
To build on this, you can use the environment to determine a sensible default. Take a look at $LANG and the locale command, if your OS supports them.
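
For instance, a rough sketch of that approach on a POSIX-like system (nl_langinfo and <langinfo.h> are POSIX, not standard C++, so this is an assumption about your platform):

// Rough sketch: derive a default input encoding from the user's locale.
// Assumes a POSIX-like system that provides nl_langinfo(CODESET).
#include <clocale>
#include <cstdio>
#include <langinfo.h>

int main()
{
    // Adopt the locale from the environment (LANG / LC_CTYPE / LC_ALL).
    std::setlocale(LC_ALL, "");

    // CODESET names the character encoding of that locale,
    // e.g. "UTF-8" or "ISO-8859-1".
    const char* codeset = nl_langinfo(CODESET);
    std::printf("assumed input encoding: %s\n", codeset);
    return 0;
}
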
Roger Pate
+2  A: 

Generally, checking whether input is UTF-8 is a matter of heuristics -- there is no definitive algorithm that will tell you "yes" or "no". The more sophisticated the heuristic, the fewer false positives and negatives you will get, but there is no sure way.

For an example of such a heuristic, take a look at this library: http://utfcpp.sourceforge.net/

#include <fstream>
#include <iterator>
#include "utf8.h" // from the utfcpp library linked above

bool valid_utf8_file(const char* file_name)
{
    std::ifstream ifs(file_name);
    if (!ifs)
        return false; // even better, throw here

    std::istreambuf_iterator<char> it(ifs.rdbuf());
    std::istreambuf_iterator<char> eos;

    return utf8::is_valid(it, eos);
}

You can either use it directly, or check its source to see how it is done.

Kornel Kisielewicz
Note - this tells you it COULD be UTF-8; you can't know that it is. A stream of plain 7-bit ASCII is valid UTF-8 until you hit the first accented character.
Martin Beckett
*Checking* whether input is valid UTF-8 or not isn't heuristic (it's what your function does), but determining if UTF-8 was the user's *intention* is.
Roger Pate
Language shortcut :/
Kornel Kisielewicz
A: 

Use the operating system's built-in facilities; these vary from one OS to another. On Windows, it's always better to use the wide-character (WideChar) APIs and not think about encoding at all.
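
For example, a minimal sketch of that approach (Win32 only; the buffer size and function name are arbitrary choices for this example):

#include <windows.h>
#include <string>

// Minimal sketch: read one chunk of console input as UTF-16 and convert it to UTF-8.
std::string read_console_utf8()
{
    HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
    wchar_t wbuf[1024];
    DWORD nread = 0;

    // ReadConsoleW hands back UTF-16 regardless of the console code page.
    if (!ReadConsoleW(in, wbuf, 1024, &nread, nullptr))
        return std::string();

    // First call asks how many UTF-8 bytes are needed; second call converts.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, wbuf, static_cast<int>(nread),
                                    nullptr, 0, nullptr, nullptr);
    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wbuf, static_cast<int>(nread),
                        &utf8[0], bytes, nullptr, nullptr);
    return utf8;
}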

And if your input comes from a file, as opposed to a real console, then all bets are off.

Seva Alekseyev
A: 

Jared Oberhaus answered this well on a related question specific to Java.

Basically there are a few steps you can take to make a reasonable guess, but ultimately it's just guesswork without an explicit indication (hence the (in)famous BOM marker in UTF-8 files).
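
For instance, a rough sketch of a BOM check (the function name is made up for this example; a BOM strongly suggests UTF-8, but its absence proves nothing):

#include <string>

// Rough sketch: does this byte sequence start with the UTF-8 BOM (0xEF 0xBB 0xBF)?
bool starts_with_utf8_bom(const std::string& bytes)
{
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}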

John Weldon
A: 

As has already been said in response to the question John Weldon pointed to, there are a number of libraries that do character-encoding recognition. You could also take a look at the source of the Unix file command and see what tests it uses to determine file encoding. From the man page of file:

ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set.
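
As an illustration of that idea, a very rough sketch of such a byte-range test (the names and categories are made up for this example, and real detectors do much more, e.g. rejecting overlong sequences):

#include <cstddef>
#include <string>

enum class Guess { Ascii, MaybeUtf8, ExtendedEightBit };

// Rough sketch: classify a buffer as pure ASCII, plausible UTF-8, or 8-bit extended.
Guess classify(const std::string& bytes)
{
    bool saw_high = false;
    bool utf8_ok = true;
    for (std::size_t i = 0; i < bytes.size(); ) {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80) { ++i; continue; }
        saw_high = true;
        // Count the continuation bytes a UTF-8 lead byte promises...
        int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : (c >= 0xC0) ? 1 : -1;
        if (extra < 0) { utf8_ok = false; ++i; continue; }
        // ...and check that each one is of the form 10xxxxxx.
        for (int k = 1; k <= extra; ++k) {
            if (i + k >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i + k]) & 0xC0) != 0x80) {
                utf8_ok = false;
                break;
            }
        }
        i += extra + 1;
    }
    if (!saw_high) return Guess::Ascii;
    return utf8_ok ? Guess::MaybeUtf8 : Guess::ExtendedEightBit;
}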

PCRE provides a function to test whether a given string is entirely valid UTF-8.

ferdystschenko