What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a Unicode string and write it to file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?
You are correct that std::string is encoding agnostic. It simply holds an array of char elements. How those char elements are interpreted as text depends on the environment: if your locale is not set to some form of Unicode (e.g. UTF-8 or UTF-16), then when you output the string it will not be displayed or interpreted as Unicode.
Are you sure your string literal "abcdefgàèíüŷÀ" is actually Unicode and not, for example, Latin-1 (ISO-8859-1) or possibly Windows-1252? You need to determine which locale your platform is currently configured to use.
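One way to check, as a minimal sketch: constructing std::locale with an empty string requests the environment's preferred locale, and its name usually reveals the character encoding (e.g. "en_US.UTF-8").

#include <iostream>
#include <locale>

int main() {
    // The empty string requests the user's preferred (environment) locale.
    std::locale loc("");
    std::cout << "Configured locale: " << loc.name() << '\n';  // e.g. "en_US.UTF-8"
    return 0;
}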
-----------EDIT-----------
I think I know your problem: some of the Unicode characters in your charset string literal, such as the accented character "À", are two-byte characters (assuming a UTF-8 encoding). When you index the character-set string with the [] operator in your random_string function, you return only half of a Unicode character, so random_string builds an invalid character string.
For example, consider the following code:
std::string s = "À";
std::cout << s.length() << std::endl;
In an environment where the string literal is interpreted as UTF-8, this program will output 2. The first element of the string (s[0]) is therefore only half of a Unicode character, and not valid on its own. Since your random_string function indexes the string one byte at a time with the [] operator, you are creating invalid random strings.
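To make the byte-level view concrete, here is a minimal sketch (assuming the source file itself is saved as UTF-8) that prints the individual bytes of "À":

#include <cstdio>
#include <string>

int main() {
    // Saved as UTF-8, "À" is stored as the two bytes 0xC3 0x80.
    std::string s = "À";
    for (unsigned char c : s)
        std::printf("%02X ", c);   // prints: C3 80
    std::printf("\n");
    // s[0] is just the lead byte 0xC3 -- half of the encoded character,
    // which is exactly what the []-indexing in random_string hands back.
    return 0;
}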
So yes, you need to use std::wstring and create your charset string literal with the L prefix.
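As a sketch of what that might look like (random_wstring below is a hypothetical stand-in for your random_string, since the original isn't shown; it assumes every character in the charset fits in a single wchar_t, which holds for these accented Latin letters):

#include <iostream>
#include <locale>
#include <random>
#include <string>

// Hypothetical wide-character version of the random-string idea.
// With a wide literal, each character of this charset occupies a single
// wchar_t element, so indexing with [] no longer splits a character.
std::wstring random_wstring(const std::wstring& charset, std::size_t length) {
    static std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> dist(0, charset.size() - 1);

    std::wstring result;
    result.reserve(length);
    for (std::size_t i = 0; i < length; ++i)
        result += charset[dist(gen)];
    return result;
}

int main() {
    // Imbuing the environment locale is often needed so that std::wcout
    // can encode wide characters for the console (platform-dependent).
    std::locale::global(std::locale(""));
    std::wcout.imbue(std::locale());

    const std::wstring charset = L"abcdefgàèíüŷÀ";
    std::wcout << random_wstring(charset, 10) << L'\n';
    return 0;
}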