Hi,

What's the current best practice for handling generic text in a platform independent way?

For example, on Windows there are the "A" and "W" versions of APIs. Down at the C layer we have the "_tcs" functions (like _tcscpy) which map to either "wcscpy" or "strcpy". And in the STL I've frequently used something like:

typedef std::basic_string<TCHAR> tstring;
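For context, here is roughly what that machinery expands to on Windows (a sketch, assuming the usual _UNICODE convention; the real headers are more involved):

// Sketch of the TCHAR mapping, controlled by _UNICODE:
#ifdef _UNICODE
typedef wchar_t TCHAR;            // "W" APIs, L"..." literals
#define _tcscpy  wcscpy
#else
typedef char TCHAR;               // "A" APIs, narrow "..." literals
#define _tcscpy  strcpy
#endif

typedef std::basic_string<TCHAR> tstring;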

What issues, if any, arise from these sorts of patterns on other systems?

+1  A: 

One thing to be careful of: make sure that all of your static libraries, and the modules that use them, are built with the same character type. Otherwise your code will compile but fail to link.
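For example, a mismatch may only show up at link time. Here is a hypothetical header shared by a static library and an application; if the library is built with _UNICODE defined and the application is not, TCHAR resolves to different types, the mangled names differ, and the link fails:

// log.h -- hypothetical header shared by library and application
#include <string>
#include <tchar.h>

void LogMessage(const std::basic_string<TCHAR>& msg);

// Library built with _UNICODE:   LogMessage(const std::wstring&)
// App built without _UNICODE:    LogMessage(const std::string&)
//   -> "unresolved external symbol" at link time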

I typically create my own "t" types based on the STL types: tstring, tstringstream, and even down to Boost types like tpath_t.
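One way those typedefs might look (a sketch; tpath_t here assumes the old Boost.Filesystem path/wpath split, and the name is just for illustration):

#include <string>
#include <sstream>
#include <boost/filesystem/path.hpp>

// TCHAR-parameterized aliases for the standard string types:
typedef std::basic_string<TCHAR>        tstring;
typedef std::basic_stringstream<TCHAR>  tstringstream;

// Pick the matching Boost.Filesystem path type:
#ifdef _UNICODE
typedef boost::filesystem::wpath        tpath_t;
#else
typedef boost::filesystem::path         tpath_t;
#endif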

Brian R. Bondy
+3  A: 

There is no support for a generic (variable-width) character like TCHAR in standard C++. C++ does have wchar_t, but its encoding isn't guaranteed. C++1x will improve things considerably once we have char16_t and char32_t as well as UTF-{8,16,32} literals.
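For illustration, the new literal forms are expected to look roughly like this:

const char*     a = u8"UTF-8 text";     // char, UTF-8 encoded
const char16_t* b = u"UTF-16 text";     // char16_t
const char32_t* c = U"UTF-32 text";     // char32_t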

I personally am not a big fan of generic characters because they lead to some nasty problems (like conversion), and what's more, if you are using a type (like TCHAR) whose width might ever be only 8 bits, you might as well code with char. If you really need that backwards compatibility, just use UTF-8; it is specifically designed to be a strict superset of ASCII. You may have to use conversion APIs (especially on Windows, which for some bizarre reason uses UTF-16), but at least it'll be consistent.
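On Windows, that conversion boundary might look something like this (a minimal sketch using MultiByteToWideChar; error handling omitted, and the function name is just for illustration):

#include <windows.h>
#include <string>

// Convert a UTF-8 string to UTF-16 for the "W" Windows APIs.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call computes the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  (int)utf8.size(), NULL, 0);
    std::wstring utf16(len, L'\0');
    // Second call performs the actual conversion.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        (int)utf8.size(), &utf16[0], len);
    return utf16;
}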

EDIT: To actually answer the original question: other platforms typically have no such construct. You will have to define your own TCHAR on those platforms, or else use a library that provides one (though as you can no doubt guess, I'm not a big fan of that concept in libraries either).

coppro
+1  A: 

Unicode character set + whichever encoding makes the most sense for your data. I typically use UTF-8 because it's convenient with traditional C/C++ functions, and the data I deal with doesn't bloat much in that encoding.
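For instance, UTF-8 passes through the byte-oriented C functions unchanged; you just have to remember that they count bytes, not characters:

#include <string.h>

const char* s = "caf\xC3\xA9";   // "café" encoded as UTF-8
size_t n = strlen(s);            // 5 bytes, but only 4 characters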

Some APIs (Windows) and cross-language tools (Java) use UTF-16, so that might be a consideration.

One practice I wish we had been better at is leaving text as an array of bytes for low-tech operations like copying, simple comparison, simple searching, etc. When you need the richer, more character-aware operations, you can convert to some "super" string (ICU strings are nice, but heavy) and define the layers / entry points that need to do this, as opposed to naively doing it everywhere. The needless conversions kill our performance -- especially when combined with an XML DOM library that also uses the "super" strings.
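A sketch of that layering (using ICU's UnicodeString::fromUTF8; the boundary function name is hypothetical):

#include <string>
#include <unicode/unistr.h>

// Keep text as raw UTF-8 bytes for copying/comparison/searching;
// convert to an ICU string only at the entry points that truly
// need character-aware operations.
icu::UnicodeString ToRichString(const std::string& utf8)
{
    return icu::UnicodeString::fromUTF8(utf8);
}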

maccullt