views: 113
answers: 4

Just curious about the encodings the system uses when handling string storage (if it cares at all) and printing.

Question 1: If I store a one-byte string in std::string or a two-byte string in std::wstring, will the underlying integer values differ depending on the encoding currently in use? (I remember Bjarne saying that an encoding is the mapping between chars and integers, so a char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping.)

Question 2: If so, must std::string and std::wstring have knowledge of the encoding themselves (although another person told me this is NOT true)? Otherwise, how can they translate the chars to the correct integers and store them? How does the system know the encoding?

Question 3: What is the default encoding on a particular system, and how do I change it (is it the so-called "locale")? I guess the same mechanism applies?

Question 4: If I print a string to the screen with std::cout, is it the same encoding?

A: 

Question 1: If I store a one-byte string in std::string or a two-byte string in std::wstring, will the underlying integer values differ depending on the encoding currently in use? (I remember Bjarne saying that an encoding is the mapping between chars and integers, so a char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping.)

You're sort of thinking about this backwards. The encoding doesn't change the integers that get stored; rather, different encodings interpret the same underlying integers as different characters (or as parts of characters, if we're talking about a multi-byte character set).

Question 2: If so, must std::string and std::wstring have knowledge of the encoding themselves (although another person told me this is NOT true)? Otherwise, how can they translate the chars to the correct integers and store them? How does the system know the encoding?

Both std::string and std::wstring are completely encoding-agnostic. As far as C++ is concerned, they simply store arrays of char objects and wchar_t objects respectively. The only requirement is that char is one byte, and wchar_t is some implementation-defined width (usually 2 bytes on Windows and 4 on Linux/UNIX).
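A minimal sketch of that point: the element widths are visible through sizeof, and a std::string happily stores whatever bytes you hand it (the UTF-8 bytes of the euro sign are used here purely as example data):

#include <iostream>
#include <string>

int main() {
    std::cout << "sizeof(char)    = " << sizeof(char)    << "\n";  // always 1
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << "\n";  // typically 2 on Windows, 4 on Linux/UNIX

    // Three bytes that happen to be the UTF-8 encoding of the euro sign;
    // to std::string they are simply three char values, nothing more.
    std::string bytes = "\xE2\x82\xAC";
    std::cout << "bytes.size()    = " << bytes.size() << "\n";     // prints 3, not 1
}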

Question 3: What is the default encoding on a particular system, and how do I change it (is it the so-called "locale")?

That depends on the platform. ISO C++ only talks about the global locale object, std::locale(), which generally refers to your current system-specific settings.
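A minimal sketch of querying those locale objects (the names printed depend entirely on your system settings):

#include <iostream>
#include <locale>

int main() {
    std::locale sys("");                                            // the system/environment-preferred locale
    std::cout << "system locale: " << sys.name() << "\n";           // e.g. en_US.UTF-8
    std::cout << "global locale: " << std::locale().name() << "\n"; // "C" unless someone changed it
}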

Question 4: If I print a string to the screen with std::cout, is it the same encoding?

Generally, if you output to the screen through stdout, the characters you see displayed are interpreted and rendered according to your system's current locale settings.

Charles Salvia
@Charles: what happens with this line of code: std::wstring ws = L"してる~"; It should be translated from chars to integer(s) before being stored, right? So if you're printing things out, then yes, it goes from integer(s) to chars; on the other hand, if you're storing the string, it goes from chars to integer(s). Do you agree? :)
Eric
That string is already just a bunch of integers: 0x3057, 0x3066, 0x308B, and 0xFF5E. Those integers are *interpreted* (not converted) as whatever character, depending on the character set. For example, in the UTF-16 encoding, 0x3057 is a two-byte sequence representing the hiragana letter SI. (http://www.unicodemap.org/details/0x3057/index.html)
Charles Salvia
@Charles: So do you mean it is NOT using an encoding to map the chars to the integer(s) that get stored in memory? Then what is it using?
Eric
What do you mean by *it*? Your system's font-engine? Your console? What?
Charles Salvia
@Charles: I mean how the string stores the data. OK, let me rephrase it like this: if I give std::wstring a string with a Japanese character, the integers stored in memory are independent of the encoding. Is this correct?
Eric
The string just stores data as bytes; it doesn't care what those bytes "represent" according to any character set. If you type a Japanese character into your text editor, the text editor application decides how the text you see on your screen is actually represented as binary integers. If your text editor is set to use UTF-16, for example, then when you type a Japanese character it will be stored as a 2-byte integer in memory and on disk. Another application may later interpret that same 2-byte integer differently, if it is not set to UTF-16.
Charles Salvia
+1  A: 

Encoding and decoding are inherently the same process, i.e. they both transform one integral sequence into another integral sequence.

The difference between encoding and decoding is on the conceptual level. When you "decode" a character, you transform an integral sequence encoded in a known encoding ("string") into a system-specific integral sequence ("text"). And when you "encode", you're transforming a system-specific integral sequence ("text") into an integral sequence encoded in a particular encoding ("string").

This difference is conceptual, not physical: in memory, the decoded "text" is still held as integers, just like a "string". However, since a particular system always represents "text" in one particular encoding, text transformations do not need to deal with the specifics of the actual system encoding, and can safely assume they are working on a sequence of conceptual "characters" instead of "bytes".

Generally, however, the encoding used for "text" has properties that make it easy to work with (e.g. fixed-length characters, a simple one-to-one mapping between characters and byte sequences, etc.), while the encoded "string" uses an efficient encoding (e.g. variable-length characters, context-dependent encoding, etc.).
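As a minimal sketch of that idea, here is a hand-rolled transformation from a conceptual character (a Unicode code point) to its UTF-8 byte sequence; the function name is just for illustration:

#include <cstdio>
#include <string>

// Encode a single Unicode code point as UTF-8 (1 to 4 bytes).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    for (unsigned char byte : encode_utf8(0x20AC))  // U+20AC, the euro sign
        std::printf("0x%02X ", byte);               // prints 0xE2 0x82 0xAC
    std::printf("\n");
}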

Joel On Software has a writeup on this: http://www.joelonsoftware.com/articles/Unicode.html

This one is a good one as well: http://www.jerf.org/programming/encoding.html

Lie Ryan
+3  A: 

(I remember Bjarne saying that an encoding is the mapping between chars and integers, so a char should be stored as integer(s) in memory)

Not quite. Make sure you understand one important distinction.

  • A character is the minimum unit of text. A letter, digit, punctuation mark, symbol, space, etc.
  • A byte is the minimum unit of memory. On the overwhelming majority of computers, this is 8 bits.

Encoding is converting a sequence of characters to a sequence of bytes. Decoding is converting a sequence of bytes to a sequence of characters.

The confusing thing for C and C++ programmers is that char means byte, NOT character! The name char for the byte type is a legacy from the pre-Unicode days when everyone (except East Asians) used single-byte encodings. But nowadays, we have Unicode, and its encoding schemes which have up to 4 bytes per character.

Question 1: If I store a one-byte string in std::string or a two-byte string in std::wstring, will the underlying integer values depend on the encoding currently in use?

Yes, it will. Suppose you have std::string euro = "€"; then, depending on the encoding in effect (a sketch after this list shows how to inspect the actual bytes):

  • With the windows-1252 encoding, the string will be encoded as the byte 0x80.
  • With the ISO-8859-15 encoding, the string will be encoded as the byte 0xA4.
  • With the UTF-8 encoding, the string will be encoded as the three bytes 0xE2, 0x82, 0xAC.
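A minimal sketch of such an inspection: dump the bytes the compiler actually placed in the literal (the output depends on your source and execution character sets):

#include <cstdio>
#include <string>

int main() {
    std::string euro = "€";            // whatever bytes the compiler stored for the literal
    for (unsigned char byte : euro)
        std::printf("0x%02X ", byte);  // e.g. 0x80 (windows-1252) or 0xE2 0x82 0xAC (UTF-8)
    std::printf("\n");
}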

Question 3: What is the default encoding on a particular system, and how do I change it (is it the so-called "locale")?

Depends on the platform. On Unix, the encoding can be specified as part of the LANG environment variable.

~$ echo $LANG
en_US.utf8

Windows has a GetACP function to get the "ANSI" code page number.
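A minimal sketch of querying that default programmatically, assuming only getenv on Unix-like systems and GetACP on Windows:

#include <cstdlib>
#include <iostream>

#ifdef _WIN32
#include <windows.h>
#endif

int main() {
#ifdef _WIN32
    std::cout << "ANSI code page: " << GetACP() << "\n";         // e.g. 1252
#else
    const char* lang = std::getenv("LANG");
    std::cout << "LANG: " << (lang ? lang : "(unset)") << "\n";  // e.g. en_US.utf8
#endif
}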

Question 4: If I print a string to the screen with std::cout, is it the same encoding?

Not necessarily. On Windows, the command line uses the "OEM" code page, which is usually different from the "ANSI" code page used elsewhere.
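A minimal sketch (Windows only) of inspecting the code pages involved; SetConsoleOutputCP is one way to change what the console expects:

#include <windows.h>
#include <cstdio>

int main() {
    std::printf("ANSI code page:    %u\n", GetACP());              // e.g. 1252
    std::printf("OEM code page:     %u\n", GetOEMCP());            // e.g. 437
    std::printf("Console output CP: %u\n", GetConsoleOutputCP());  // how console output is interpreted
    // SetConsoleOutputCP(CP_UTF8);  // one way to make UTF-8 output display correctly
}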

dan04
A: 

Anyone working with encodings should read this Joel on Software article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I found it useful when I started working with encodings.

Question 1: If I store a one-byte string in std::string or a two-byte string in std::wstring, will the underlying integer values differ depending on the encoding currently in use?

C/C++ programmers are used to thinking of characters as bytes, because almost everyone starts out working with the ASCII character set and its 8-bit extensions, which map the integers 0-255 to symbols such as letters of the alphabet and Arabic numerals. The fact that the C char datatype is actually a byte doesn't help matters.

The std::string class stores its data as 8-bit integers, and std::wstring stores its data as wider wchar_t integers (16-bit on Windows, typically 32-bit on Linux/UNIX). Neither class contains any concept of encoding. You can use any 8-bit encoding such as ASCII, UTF-8, Latin-1, or Windows-1252 with a std::string, and a wider encoding such as UTF-16 or UTF-32 with a std::wstring.

Data stored in a std::string or std::wstring must always be interpreted according to some encoding. This generally comes into play when you interact with the operating system: reading or writing data from a file or a stream, or making OS API calls that deal with strings.

So to answer your question: if you store the same byte value in a std::string and in a std::wstring, the memory will contain the same value (except that the wstring element is padded out with zero bytes, since wchar_t is wider), but the interpretation of that byte will depend on the encoding in use.

If you store the same character in each of the strings, then the bytes may differ, again depending on the encoding. For example, the euro symbol (€) might be stored in the std::string using the UTF-8 encoding, which corresponds to the bytes 0xE2 0x82 0xAC. In the std::wstring, it might be stored using the UTF-16 encoding, which would be the single 16-bit value 0x20AC.
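A minimal sketch of that comparison, with the euro sign written out explicitly as UTF-8 bytes and as a single wide code unit so the program does not depend on the source file's encoding:

#include <cstdio>
#include <string>

int main() {
    std::string  utf8 = "\xE2\x82\xAC";  // the euro sign encoded as UTF-8 by hand
    std::wstring wide = L"\u20AC";       // the euro sign as a single wchar_t code unit

    for (unsigned char byte : utf8)
        std::printf("0x%02X ", byte);                         // 0xE2 0x82 0xAC
    std::printf("\n");

    for (wchar_t unit : wide)
        std::printf("0x%04X ", static_cast<unsigned>(unit));  // 0x20AC
    std::printf("\n");
}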

Question 3: What is the default encoding on a particular system, and how do I change it (is it the so-called "locale")? I guess the same mechanism applies?

Yes, the locale determines how the OS interprets strings at its API boundaries. Locales define more than just the encoding; they also include information on how money, dates, times, and other things should be formatted. On Linux or OS X, you can use the locale command in the terminal to see what the current locale is:

mch@bohr:/$ locale
LANG=en_CA.UTF-8
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=

So in this case, my locale is Canadian English. Each locale defines an encoding used to interpret strings. In this case the locale name makes it clear that it is using a UTF-8 encoding, but you can run locale -ck LC_CTYPE to see more information about the current encoding:

mch@bohr:/$ locale -ck LC_CTYPE
LC_CTYPE
ctype-class-names="upper";"lower";"alpha";"digit";"xdigit";"space";"print";"graph";"blank";"cntrl";"punct";"alnum";"combining";"combining_level3"
ctype-map-names="toupper";"tolower";"totitle"
ctype-width=16
ctype-mb-cur-max=6
charmap="UTF-8"
... output snipped ...

If you want to test a program with different encodings, you can set the LC_ALL environment variable to the locale you want to use. You can also change the locale from within the program using setlocale. How to permanently change the locale depends on your distribution.
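A minimal sketch of picking up the environment's locale from inside a program, using either the C setlocale or the C++ std::locale::global:

#include <clocale>
#include <iostream>
#include <locale>

int main() {
    std::setlocale(LC_ALL, "");                 // C runtime: adopt the LANG/LC_* settings
    std::locale::global(std::locale(""));       // C++ runtime: same idea for C++ streams and facets
    std::cout << std::locale().name() << "\n";  // e.g. en_CA.UTF-8
}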

On Windows, most API functions come in a narrow and a wide variant. For example, GetCurrentDirectory comes in GetCurrentDirectoryW (Unicode) and GetCurrentDirectoryA (ANSI) variants. Unicode, in this context, means UTF-16.

I don't know enough about Windows to tell you how to set the locale, other than to try the languages control panel.

Question 4: If I print a string to the screen with std::cout, is it the same encoding?

When you print a string to std::cout, the OS will interpret that string in the encoding set by the locale. If your string is UTF-8 encoded but the OS is using Windows-1252, it will need to be converted to that encoding before it displays correctly. One way to do this is with the iconv library.
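A minimal sketch of such a conversion with iconv (this assumes a POSIX/glibc-style iconv prototype; error handling is kept to the bare minimum):

#include <iconv.h>
#include <iostream>
#include <stdexcept>
#include <string>

// Convert a UTF-8 encoded string to Windows-1252 using iconv.
std::string utf8_to_windows1252(const std::string& input) {
    iconv_t cd = iconv_open("WINDOWS-1252", "UTF-8");   // (to-encoding, from-encoding)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("conversion not supported");

    std::string output(input.size() * 2, '\0');         // generous output buffer
    char* in_ptr = const_cast<char*>(input.data());
    size_t in_left = input.size();
    char* out_ptr = &output[0];
    size_t out_left = output.size();

    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("conversion failed");
    }
    iconv_close(cd);
    output.resize(output.size() - out_left);             // trim the unused space
    return output;
}

int main() {
    std::string euro_utf8 = "\xE2\x82\xAC";               // "€" in UTF-8
    std::cout << utf8_to_windows1252(euro_utf8) << "\n";  // emits the single byte 0x80
}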

mch