I’ve just encountered some strange behaviour when dealing with the ominous typographic apostrophe ( ’ ), as opposed to the typewriter apostrophe ( ' ). When used in a wide string literal, the apostrophe breaks wofstream.

This code works

ofstream file("test.txt");
file << "A’B" ;
file.close();

==> A’B

This code works

wofstream file("test.txt");
file << "A’B" ;
file.close();

==> A’B

This code fails

wofstream file("test.txt");
file << L"A’B" ;
file.close();

==> A

This code fails...

wstring test = L"A’B";
wofstream file("test.txt");
file << test ;
file.close();

==> A

Any ideas?

A: 

Are you sure it's not your compiler's support for unicode characters in source files that is "broken"? What if you use \x or similar to encode the character in the string literal? Is your source file even in whatever encoding might map to a wchar_t for your compiler?
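As a minimal sketch of that suggestion, the apostrophe can be written as a universal character name, assuming the character in question is U+2019 (RIGHT SINGLE QUOTATION MARK). The function name apostrophe_text is only illustrative.

```cpp
#include <string>

// U+2019 written as a universal character name, so the literal
// no longer depends on how the compiler decodes the source file.
std::wstring apostrophe_text() {
    return L"A\u2019B";
}
```

If the stream still stops after "A" with this literal, the source file's encoding is not the culprit and the conversion done by the stream is the more likely cause.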

Logan Capaldo
What puzzles me is that, when using unicode (http://mariusbancila.ro/blog/?p=135), wofstream and ’ work correctly. But then why does ofstream without unicode work too?
"unicode" is too vague. You can use e.g. UTF-8 with ofstream and it's still unicode, but you wouldn't be using wchar_ts. Again, this is most likely an interaction between your source file's encoding, what you are actually putting into the string literals, and what your compiler expects your source file's encoding to be. The blog post is using Windows APIs; are you on Windows using VC++?
Logan Capaldo
A: 

Try wrapping the stream insertion statement in a try-catch block and tell us what exception, if any, it throws.

I am not sure what is going on here, but I'll hazard a guess anyway. The typographic apostrophe probably has a value that fits into one byte. This works with "A’B" since it blindly copies bytes without bothering about the underlying encoding. However, with L"A’B", an implementation-dependent encoding factor comes into play. It probably doesn't find the proper UTF-16 (if you are on Windows) or UTF-32 (if you are on *nix/Mac) value to store for this particular character.
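One caveat for the try-catch idea: standard streams do not throw by default, so exceptions have to be switched on with exceptions() first. A rough sketch follows; the helper name write_wide_checked is made up for illustration. The explicit flush matters because the wide-to-narrow conversion typically happens when the buffer is written out, not at the << itself.

```cpp
#include <exception>
#include <fstream>
#include <ios>
#include <string>

// Returns true if the text was opened, converted and flushed without error.
bool write_wide_checked(const char* path, const std::wstring& text) {
    std::wofstream file;
    // Ask the stream to throw instead of silently setting failbit/badbit.
    file.exceptions(std::ios_base::failbit | std::ios_base::badbit);
    try {
        file.open(path);
        file << text;
        file.flush();  // forces the conversion to the external encoding
    } catch (const std::exception&) {
        // std::ios_base::failure derives from std::exception.
        return false;
    }
    return true;
}
```

Calling write_wide_checked("test.txt", L"A\u2019B") and checking the result would show whether the stream entered an error state once it reached the apostrophe.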

dirkgently
A: 

You should "enable" locale before using wofstream:

std::locale::global(std::locale("")); // Enable locale support: "" selects the user's system locale
wofstream file("test.txt");
file << L"A’B";

So if your system locale is en_US.UTF-8, the file test.txt will contain UTF-8 encoded data (5 bytes, because ’ takes 3 bytes in UTF-8). If your system locale is en_US.ISO8859-1, the text would be written in that 8-bit encoding (3 bytes), unless ISO 8859-1 lacks the character, in which case the conversion fails.
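To check the UTF-8 byte count by hand, here is a small sketch of a UTF-8 encoder. It is not what the stream does internally, just the encoding rule written out; the names encode_code_point and encode_utf8 are illustrative, and surrogates and out-of-range values are not validated.

```cpp
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
void encode_code_point(char32_t cp, std::string& out) {
    if (cp < 0x80) {                       // 1 byte: US-ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: U+2019 lands here
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole UTF-32 string as UTF-8.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        encode_code_point(cp, out);
    }
    return out;
}
```

For U"A\u2019B" this yields the bytes 41 E2 80 99 42: one byte each for A and B, three for the apostrophe.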

wofstream file("test.txt");
file << "A’B" ;
file.close();

This code works because "A’B" is actually a UTF-8 string (assuming your source file is saved as UTF-8), and its bytes are written to the file one by one.

Note: I assume you are using a POSIX-like OS and that your system locale is something other than "C", which is the locale every program starts with.
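Instead of changing the global locale, the locale can also be set on the stream itself with imbue(). The following is only a sketch under that assumption; write_with_locale and file_size_bytes are hypothetical helper names.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Write `text` to `path`, converting wide characters with the codecvt
// facet of `loc`. Returns false if opening, converting or closing fails.
bool write_with_locale(const char* path, const std::wstring& text,
                       const std::locale& loc) {
    std::wofstream file;
    file.imbue(loc);   // set the locale before the file is opened
    file.open(path);
    if (!file) {
        return false;
    }
    file << text;
    file.close();      // flushes; sets failbit if the write-out fails
    return !file.fail();
}

// Count the bytes actually stored in the file.
std::size_t file_size_bytes(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return bytes.size();
}
```

With a UTF-8 system locale, write_with_locale("test.txt", L"A\u2019B", std::locale("")) should leave five bytes in the file. Note that std::locale("") throws std::runtime_error if the environment names a locale that is not installed.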

Artyom
std::locale::global(std::locale("french")); works. I think I understand now (or am beginning to). Usually, when dealing with a Unicode character (too vague... I know) without ‘L’, the compiler (VS) warns me with “character represented by universal-character-name ... cannot be represented in the current code page”. This time I was surprised not to see this warning, so I assumed something was wrong. Also, I think/thought that UTF-8 uses a 1-byte encoding only for the 128 US-ASCII characters...
1st: UTF-8 is compatible with US-ASCII. I don't actually know how VS represents Unicode characters; gcc uses UTF-8 by default, while VS may use the local charset. But the general idea is that you need to set up a locale that converts wide characters to the locale's 8-bit encoding. Under Unix this is usually UTF-8.
Artyom