I'm implementing several APIs in C and C++ and am wondering what techniques are available to keep clients from getting the encoding wrong when they receive strings from the framework or pass them back. For instance, imagine a simple plugin API in C++ which customers can implement to influence translations. It might feature a function like this:

const char *getTranslatedWord( const char *englishWord );

Now, let's say that I'd like to enforce that all strings are passed as UTF-8. Of course I'd document this requirement, but I'd like the compiler to enforce the right encoding, maybe by using dedicated types. For instance, something like this:

class Word {
public:
  static Word fromUtf8( const char *data ) { return Word( data ); }
  const char *toUtf8() const { return m_data; }

private:
  Word( const char *data ) : m_data( data ) { }

  const char *m_data;
};

I could now use this specialized type in the API:

Word getTranslatedWord( const Word &englishWord );

Unfortunately, it's easy to make this very inefficient. The Word class lacks proper copy constructors, assignment operators, etc., and I'd like to avoid unnecessary copying of data as much as possible. I also see the danger that Word gets extended with more and more utility functions (like length, fromLatin1, or substr), and I'd rather not write Yet Another String Class. I just want a little container which avoids accidental encoding mixups.

I wonder whether anybody else has some experience with this and can share some useful techniques.

EDIT: In my particular case, the API is used on Windows and Linux using MSVC 6 - MSVC 10 on Windows and gcc 3 & 4 on Linux.

+3  A: 

You could pass around a std::pair instead of a char*:

struct utf8_tag_t{} utf8_tag;
std::pair<const char*,utf8_tag_t> getTranslatedWord(std::pair<const char*,utf8_tag_t> englishWord);

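The generated machine code should be close to identical on a decent modern compiler. Note, however, that std::pair stores both members as plain data members, so the empty tag still takes up space; a pair type that applies the empty base class optimization (boost::compressed_pair, for example) avoids that overhead.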
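As a rough sketch of how the tagged pair might be used in practice: the typedef Utf8Ptr, the helper asUtf8 and the toy body of getTranslatedWord below are hypothetical names made up for illustration, not part of any existing API. The point is only that a bare const char* no longer converts silently.

```cpp
#include <cstring>
#include <utility>

// Empty tag type; it carries no data and exists only to mark the encoding.
struct utf8_tag_t {};

// Hypothetical alias for readability (an assumption, not from the answer).
typedef std::pair<const char*, utf8_tag_t> Utf8Ptr;

// Hypothetical helper: the one place where a caller states "this is UTF-8".
// It does not check anything; it only makes the claim explicit in the code.
inline Utf8Ptr asUtf8(const char* data) {
    return Utf8Ptr(data, utf8_tag_t());
}

// Toy plugin implementation, for illustration only.
inline Utf8Ptr getTranslatedWord(Utf8Ptr englishWord) {
    if (std::strcmp(englishWord.first, "hello") == 0) {
        return asUtf8("hallo");
    }
    return englishWord;  // unknown words are passed through unchanged
}

// getTranslatedWord("hello");  // does not compile: const char* is not a Utf8Ptr
```

A caller would write getTranslatedWord(asUtf8("hello")) and read the result through .first. The pair is non-owning, so the usual lifetime rules for the underlying char* still apply.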


I wouldn't bother with this, though. I'd just use char*s and document that the input has to be UTF-8. If the data could come from an untrusted source, you're going to have to check the encoding at runtime anyway.
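For the runtime check, something along these lines could work. It is a minimal sketch, assuming NUL-terminated input; the function name isValidUtf8 is made up here. It rejects overlong forms, surrogates and values above U+10FFFF, but it does nothing beyond well-formedness (no normalization, for instance).

```cpp
// Returns true if the NUL-terminated string s is well-formed UTF-8.
// Hypothetical helper for illustration; not taken from any library.
inline bool isValidUtf8(const char* s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    while (*p != 0) {
        const unsigned char lead = *p++;
        int trailing = 0;         // number of continuation bytes expected
        unsigned char lo = 0x80;  // allowed range for the first continuation byte
        unsigned char hi = 0xBF;
        if (lead <= 0x7F) {
            continue;             // plain ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            trailing = 1;         // 0xC0 and 0xC1 would be overlong
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            trailing = 2;
            if (lead == 0xE0) lo = 0xA0;  // exclude overlong 3-byte forms
            if (lead == 0xED) hi = 0x9F;  // exclude surrogates U+D800..U+DFFF
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            trailing = 3;
            if (lead == 0xF0) lo = 0x90;  // exclude overlong 4-byte forms
            if (lead == 0xF4) hi = 0x8F;  // exclude values above U+10FFFF
        } else {
            return false;         // stray continuation byte or invalid lead
        }
        for (int i = 0; i < trailing; ++i) {
            const unsigned char c = *p;
            if (c < lo || c > hi) return false;  // also catches a truncated sequence
            ++p;
            lo = 0x80;            // later continuation bytes use the full range
            hi = 0xBF;
        }
    }
    return true;
}
```

A framework could call this once at the API boundary and reject (or log) anything that fails, rather than trusting the plugin.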

Joe Gauterin
+1 That's a pretty creative idea. :-)
Frerich Raabe
+1 for 'don't bother'… Just use utf-8.
Steven R. Loomis
+1  A: 

I suggest that you use std::wstring.

Check out this other question for details.

radman
Yes, std::wstring looks like a candidate. However, I was wondering whether there is maybe something which doesn't require people to link their plugins against the standard C++ library. At least with Visual Studio 2009 it's not all inline template magic as far as I can see.
Frerich Raabe
Using std::wstring isn't a good idea. It's a sequence of wchar_t, which is a 16-bit integer type on Microsoft compilers and a 32-bit integer type on gcc. So a std::wstring could reasonably contain UTF-16LE, UTF-16BE, UTF-32BE or UTF-32LE.
Joe Gauterin
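To see the wchar_t difference mentioned in the comment above on a given toolchain, a tiny check like the following can be compiled on each platform. The function name wcharBits is hypothetical, and the widths in the comment are the typical ones for MSVC and for gcc on Linux rather than a guarantee.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits for the compiler and platform in use.
// Typically 16 with MSVC on Windows and 32 with gcc on Linux.
inline std::size_t wcharBits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

If the two sides of a plugin boundary report different values here, a std::wstring cannot be passed across it without an explicit conversion.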
A: 

The ICU project provides a Unicode support library for C++.

jopa
True, but I'd rather not pull in a whole new library.
Frerich Raabe
Unless you need other functions it provides…
Steven R. Loomis