I need a string that won't properly convert to ANSI using several code pages.


My .NET library has to marshal strings to a C library that expects text encoded using the system's default ANSI code page. Since .NET strings are Unicode, users can pass the library a string that doesn't convert properly to ANSI. For example, on an English machine, "デスクトップ" will turn into "??????" when passed to the C library.

To address this, I wrote a method that detects when this will happen by comparing the original string to a string converted using the ANSI code page. I'd like to test this method, but I really need a string that's guaranteed not to be encodable. For example, we test our code on English and Japanese machines (among other languages). If I write the test to use the Japanese string above, the test will fail when the Japanese system properly encodes the string. I could write the test to check the current system's encoding, but then I have a maintenance nightmare every time we add or remove a language.
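The round-trip comparison described above can be sketched roughly as follows. This is a Python illustration of the idea, not the .NET method itself: `survives_code_page` is a hypothetical name, and Python's codecs do not apply the best-fit mappings that Windows may use when marshaling, so results for some characters could differ on a real system.

```python
def survives_code_page(text: str, code_page: str) -> bool:
    """Return True if text comes back unchanged after a trip through code_page."""
    # errors="replace" substitutes '?' for characters the code page cannot
    # represent, similar to the "??????" result described in the question.
    encoded = text.encode(code_page, errors="replace")
    return encoded.decode(code_page, errors="replace") == text


sample = "デスクトップ"
print(survives_code_page(sample, "cp1252"))  # Western European code page
print(survives_code_page(sample, "cp932"))   # Japanese code page
```

A test built on this kind of check only needs a string for which the function returns False on every code page the test machines use.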

Is there a Unicode character that doesn't encode with any ANSI code page? Failing that, could a string be constructed with characters from enough different code pages to guarantee failure? My first attempt was to use Chinese characters since we don't cover Chinese, but apparently the Japanese code page can convert the Chinese characters I tried.

Edit: I'm going to accept the answer that proposes a Georgian string for now, but I was really expecting a result with a smattering of characters from different languages. I don't know if we plan on supporting Georgian, so it seems OK for now. Now I have to test it on each language. Joy!

There are Windows code pages that cover all Unicode characters (e.g. Cp1200, Cp12000, Cp65000 and Cp65001), so it's not always possible to create a string that is not convertible.

jarnbjo 2009-10-09 16:21:43

What do you mean by an 'ANSI code page'? On Windows, the code pages are Microsoft, not ANSI. ISO defines the 8859-x series of code sets; Microsoft has Windows code pages analogous to most of these.

Are you thinking of single-byte code sets? If so, you should look for Unicode characters in esoteric languages for which there is less likely to be a non-Unicode, single-byte code set.

You could look at scripts such as Devanagari, Ol Chiki, Cherokee, and Ogham.

Jonathan Leffler 2009-10-09 16:29:08

+1 A:

If by "ANSI" you mean Windows code pages, I am pretty sure that characters outside the BMP are not covered by any Windows code page.

For instance, try some of the Byzantine Musical Symbols.
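As a rough check of this suggestion, here is a small Python sketch that tries to encode the first code point of the Byzantine Musical Symbols block (U+1D000) with a handful of legacy code pages. The list of code pages and the helper name `can_encode` are assumptions chosen for illustration, and Python's codec tables may not match Windows exactly. GB18030 is included as a contrast because it is a full Unicode encoding.

```python
# Sample of legacy code pages; not an exhaustive list.
LEGACY_CODE_PAGES = ["cp1251", "cp1252", "cp932", "cp936", "cp949", "cp950"]


def can_encode(text: str, code_page: str) -> bool:
    """Return True if every character of text is representable in code_page."""
    try:
        text.encode(code_page)  # strict mode raises on unencodable characters
        return True
    except UnicodeEncodeError:
        return False


symbol = "\U0001D000"  # first code point of the Byzantine Musical Symbols block

for code_page in LEGACY_CODE_PAGES + ["gb18030"]:
    print(code_page, can_encode(symbol, code_page))
```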

Nemanja Trifunovic 2009-10-09 16:35:01

FWIW, some Chinese characters outside the BMP are present in GB18030. My 2 cents,

Serge - appTranslator 2009-10-09 22:57:42

+3 A:

There are quite a few Unicode-only languages. Georgian is one of them. Here's the word 'English' in Georgian: ინგლისური. You can find more in the Georgian file (ka.xml) of the CLDR DB.
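One way to sanity-check a candidate like this before relying on it in a test is to try it against every code page you care about. The sketch below does that in Python; the code page list and the name `failing_code_pages` are assumptions for illustration, and Python's codecs are only an approximation of what a Windows machine would do.

```python
# Common Windows ANSI code pages (Thai, Japanese, Chinese, Korean, 125x family).
ANSI_CODE_PAGES = ["cp874", "cp932", "cp936", "cp949", "cp950"] + [
    f"cp{number}" for number in range(1250, 1259)
]


def failing_code_pages(text: str, code_pages: list[str]) -> list[str]:
    """Return the code pages in which text does not survive a round trip."""
    failures = []
    for code_page in code_pages:
        round_tripped = text.encode(code_page, errors="replace").decode(
            code_page, errors="replace"
        )
        if round_tripped != text:
            failures.append(code_page)
    return failures


georgian = "ინგლისური"
failures = failing_code_pages(georgian, ANSI_CODE_PAGES)
print(len(failures), "of", len(ANSI_CODE_PAGES), "code pages cannot represent it")
```

If the returned list equals the full list of code pages, the string is a reasonable candidate for the "never encodable" test case on those systems.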

HTH,

Serge - appTranslator 2009-10-09 21:55:55
