What are the best practices for Unicode processing in C++?
Here is a checklist for Windows programming:
- Enclose all string literals in _T("my string")
- Replace strlen() and similar functions with _tcslen() and the other TCHAR equivalents
- Use LPTSTR and LPCTSTR instead of char * and const char *
- When starting new projects in Dev Studio, religiously make sure the Unicode option is selected in your project properties.
- For C++ strings, use std::wstring instead of std::string
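As a rough sketch of what the first few items look like in code: the `tstring` alias and both helper functions are made up for illustration, and the `#else` branch is only a stand-in so the snippet also compiles off Windows. On Windows this assumes both `UNICODE` and `_UNICODE` are defined, which is what the project option does.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

#ifdef _WIN32
#include <tchar.h>
#include <windows.h>
#else
// Stand-ins (assumption: narrow build) so the sketch compiles elsewhere.
typedef char TCHAR;
typedef const TCHAR* LPCTSTR;
#define _T(x) x
#define _tcslen std::strlen
#endif

// Hypothetical alias: a std::basic_string that follows the TCHAR width.
typedef std::basic_string<TCHAR> tstring;

// Length in TCHAR units (not user-visible characters).
std::size_t label_length(LPCTSTR text) {
    return _tcslen(text);
}

// Builds a string from a _T() literal plus a caller-supplied name.
tstring make_greeting(LPCTSTR name) {
    tstring greeting = _T("Hello, ");
    greeting += name;
    return greeting;
}
```

With the Unicode option on, `TCHAR` is `wchar_t`, `_T("...")` becomes `L"..."`, and `_tcslen` maps to `wcslen`, so the same source builds either way.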
- Use ICU for dealing with your data (or a similar library)
- In your own data store, make sure everything is stored in the same encoding
- Make sure you always use your Unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like `isalpha` unless that is the definition you want.
- I can't say it enough: _**never iterate over the indices of a `string` if you care about correctness; always use your Unicode library for this.**_
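To see why indices are a trap, here is a small dependency-free illustration. It is not a substitute for ICU; `count_code_points` is a hypothetical helper that assumes its input is valid UTF-8 and does nothing about combining characters.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return count;
}
```

For "naïve" stored as UTF-8 (`"na" "\xC3\xAF" "ve"`), `size()` reports 6 while `count_code_points` reports 5, so index 2 lands in the middle of a character. Even code points are not the whole story: "e" followed by U+0301 (`"e" "\xCC\x81"`) is 3 bytes and 2 code points but displays as a single character, which is exactly the kind of thing a Unicode library handles for you.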
See the related question on case-insensitive string comparison in C++: http://stackoverflow.com/questions/11635/case-insensitive-string-comparison-in-c
That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx
The left-hand navigation pane on MSDN next to that article lists a lot more information pertaining to Unicode functions. The article is part of a chapter on "Encoding Characters" (http://msdn.microsoft.com/en-us/library/cc194786.aspx).
It has the following subsections:
- The Code-Page Model
- Double-Byte Character Sets in Windows
- Unicode
- Compatibility Issues in Mixed Environments
- Unicode Data Conversion
- Migrating Windows-Based Programs to Unicode
- Summary
Our company (and others) uses the open source International Components for Unicode (ICU) library, originally developed by Taligent.
It handles strings, locales, conversions, dates/times, collation, transformations, and more.
Start with the ICU User Guide.