Hi,

I am in doubt:

Either I abstract away my string type and implicitly use the local (platform-native) string type, or I use something like ICU everywhere and convert to the local type when needed.

Here is an example of the first option:

enum StringKind {
  ICU_STRING,
  STD_STRING,
  MSCORLIB_STRING,
  NSSTRING
  /* ... you get the picture */
};

// Primary template, parameterized on the enum value (a non-type parameter).
template<StringKind E>
class _MyString {
};

// One specialization per backing string type.
template<>
class _MyString<ICU_STRING> {};

template<>
class _MyString<NSSTRING> {};

// Pick the platform's string type at compile time.
#if defined(__ICU_INSTALLED__)
typedef _MyString<ICU_STRING> MyString;
#elif defined(__DOT_NET__)
typedef _MyString<MSCORLIB_STRING> MyString;
/* ... */
#endif

Or I just use the ICU implementation throughout my code and convert the UnicodeString to the character encoding of that runtime when needed. Be aware that strings can get very big in my implementation!

Which should I choose?

Thank you,

Filip

A: 

Personally I favor the typedef. I'm not even sure why.

100MB strings. That settles it. Use the typedef.

Joshua
A: 

Why is the size of a string an issue? Either you need Unicode (or at least something beyond ASCII), and you accept the additional memory requirements, or you use something like std::string. At a quick glance, ICU will work with UTF-8, although with a little extra work, and that's identical to ASCII when dealing with only ASCII characters. – David Thornley

The size of the string is the biggest issue. Imagine you have a string that takes 100 MB in memory. Suppose the last option is chosen and all strings are stored as UnicodeString (ICU). Since the code is cross-platform, some other code needs the content in its own format, let's say NSString on Mac or System.String on .NET platforms.

Now you have to create a temporary buffer of the same size, possibly even bigger (UTF-8 can take up to 4 bytes per code point, and up to 3 bytes for a character that fits in a single UTF-16 code unit), run a converter on it, and then create the new string of your chosen type with the contents of that buffer. Somewhere in that process you end up with 3 strings, all holding the same content. That is 300+ MB used just because a line of code wanted something in its own type... What a waste!
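To make the cost concrete, here is a minimal sketch of such a conversion in standard C++ only, assuming the source is UTF-16 (which is what ICU's UnicodeString stores) and the target is UTF-8. The function name utf16_to_utf8 is hypothetical and this is not ICU's converter. It only illustrates that the result is a second, separately allocated buffer that can be up to 1.5 times the size of the source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of ICU: converts UTF-16 to UTF-8 into a
// freshly allocated string. Unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    // Worst case is 3 bytes of UTF-8 per UTF-16 code unit, so the
    // destination may need 1.5x the bytes of the source.
    out.reserve(in.size() * 3);
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    // Sanity check on the worst-case bound used for reserve().
    assert(out.size() <= in.size() * 3);
    return out;
}
```

While the function runs, the source and the destination are both alive. For an input of one million '€' characters, the UTF-16 source occupies 2,000,000 bytes and the UTF-8 copy 3,000,000 bytes, before the platform string type makes its own copy of the result.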

Now imagine that this conversion code is invoked multiple times, and maybe on multiple threads.

Aren't we lucky that there is 64-bit to solve all our memory problems ;-)

Flaps
If this is the case, then the typedef is certainly the way to go, since it avoids the need to convert a 100 MB string.
David Thornley
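
A minimal sketch of the typedef approach, assuming only a std::string back end is available. The names BasicMyString, MYSTRING_USE_ICU, platform_api and demo are hypothetical. The question's _MyString is renamed here because identifiers that start with an underscore followed by a capital letter are reserved in C++. The point it tries to show is that native() exposes the platform string by reference, so passing a large string to platform code needs no converted copy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

enum StringKind { ICU_STRING, STD_STRING };

// Primary template is only declared; each supported kind gets a specialization.
template <StringKind K>
class BasicMyString;

// Specialization that simply wraps the platform's own string type.
template <>
class BasicMyString<STD_STRING> {
public:
    typedef std::string native_type;

    explicit BasicMyString(native_type s) : value_(std::move(s)) {}

    std::size_t size() const { return value_.size(); }

    // Returns the wrapped string by reference: no conversion, no copy.
    const native_type& native() const { return value_; }

private:
    native_type value_;
};

// Hypothetical configuration macro, in the spirit of the question's
// __ICU_INSTALLED__. The ICU specialization is not written in this sketch.
#if defined(MYSTRING_USE_ICU)
typedef BasicMyString<ICU_STRING> MyString;
#else
typedef BasicMyString<STD_STRING> MyString;
#endif

// Stand-in for platform code that insists on its native string type.
std::size_t platform_api(const std::string& s) { return s.size(); }

// Usage: the same buffer is seen by MyString and by the platform call.
void demo() {
    MyString s(std::string(1000, 'x'));
    const std::string& view = s.native();
    assert(&view == &s.native());            // same object, not a copy
    assert(platform_api(view) == s.size());
}
```

The trade-off is that every specialization has to offer the same interface over a different encoding, which is more work up front than standardizing on one string type.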
A: 

Take a look at ICU's UText interfaces. They are designed to allow non-contiguous storage of strings.

Steven R. Loomis