I have been working for some years now as a C++ developer, using MS Visual Studio as my working platform. Since I privately prefer Linux, I recently took the chance to move my working environment to Linux as well. Having optimized my Windows environment for several years, I naturally find that several things are missing or not working as expected, so I have some questions for which I have not yet found useful answers.
Let's start with the following problem; other questions will probably follow later. It is something I have already stumbled upon several times, whenever I was forced to debug platform-specific bugs on non-Windows platforms.
Simply speaking: how can I display Unicode (UCS-2 encoded) strings while debugging on Linux?
Now for some more details I have figured out so far. Our library internally uses a Unicode-based string class that encodes every character as a 16-bit Unicode value (we do not support multi-word encodings, so we can effectively only use the UCS-2 encodable subset of UTF-16, but that covers nearly all scripts in use anyway).
This already poses one problem: most platforms (i.e. Linux/Unix) consider wchar_t to be 4 bytes wide, while on Windows it is only 2 bytes. So I cannot simply cast the internal string buffer to wchar_t*, and I am not sure whether that would really help any debugger anyway.
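As a minimal sketch of the mismatch (the names here are placeholders, not our real API): on a 4-byte-wchar_t platform, each 16-bit code unit has to be widened explicitly, along these lines:

    #include <cstddef>
    #include <vector>

    // Widening copy from a 16-bit UCS-2 buffer to Linux's 32-bit wchar_t;
    // zero-extending each code unit is lossless, since every UCS-2 code
    // point fits into a 4-byte wchar_t.
    std::vector<wchar_t> to_wchar(const unsigned short* ucs2, std::size_t len)
    {
        return std::vector<wchar_t>(ucs2, ucs2 + len);
    }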
For gdb I have figured out that I can call functions from the debugged code to print debug messages. So I inserted a special function into our library that can arbitrarily transform the string data and write it to a new buffer. Currently I transcode our internal buffer to UTF-8, since I expect this to be the most likely encoding to work.
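To illustrate what I mean, here is roughly what such a helper looks like. The function and member names are hypothetical placeholders rather than our actual API, and the sketch assumes the internal buffer is a plain array of 16-bit UCS-2 code units (no surrogate pairs, per the constraints above):

    // Sketch of a gdb-callable transcoding helper (hypothetical names).
    // The static buffer keeps the result alive after the call returns so
    // gdb can print it; extern "C" avoids name mangling in gdb's "call".
    extern "C" const char* debug_to_utf8(const unsigned short* src,
                                         unsigned int len)
    {
        static char out[4096];
        char* dst = out;
        for (unsigned int i = 0; i < len && dst < out + sizeof(out) - 4; ++i) {
            unsigned int c = src[i];
            if (c < 0x80) {                  // ASCII: one byte
                *dst++ = (char)c;
            } else if (c < 0x800) {          // two-byte UTF-8 sequence
                *dst++ = (char)(0xC0 | (c >> 6));
                *dst++ = (char)(0x80 | (c & 0x3F));
            } else {                         // three bytes cover the rest of UCS-2
                *dst++ = (char)(0xE0 | (c >> 12));
                *dst++ = (char)(0x80 | ((c >> 6) & 0x3F));
                *dst++ = (char)(0x80 | (c & 0x3F));
            }
        }
        *dst = '\0';
        return out;
    }

From gdb this can then be invoked as, for example, print debug_to_utf8(str.buffer, str.length) (member names again made up), and the result is a plain char* that gdb prints verbatim.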
But so far this solves the problem only partially: if the string is Latin, I now get readable output (whereas one cannot directly print the Latin data while it is still 16-bit encoded). However, I also have to deal with other scripts (e.g. CJK (a.k.a. Hanzi / Kanji), Cyrillic, Greek, ...), and by "deal with" I mean I have to specifically debug data using such scripts, since the scripts used directly influence the control flow. In these cases, of course, I only see the ISO characters that correspond to the individual bytes making up a UTF-8 character, which makes debugging CJK data even more cryptic than correctly displayed strings would be.
Generally, gdb allows setting several host and target encodings, so it should be possible to send a correctly encoded UTF-8 data stream to the console.
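For reference, this is the kind of gdb session I have been experimenting with; the charset names are the ones my gdb accepts (UTF-32 matching Linux's 4-byte wchar_t), and the printed variable is a hypothetical UTF-8 buffer:

    (gdb) set host-charset UTF-8
    (gdb) set target-charset UTF-8
    (gdb) set target-wide-charset UTF-32
    (gdb) show charset
    (gdb) print utf8_buffer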
But of course I'd prefer to use an IDE for debugging. Currently I am trying to make friends with Eclipse and CDT, but for debugging I have also tested KDbg. In both applications I could so far only obtain incorrectly decoded UTF-8 data. On the other hand, I once debugged a Java project in Eclipse on a Windows platform and all internal strings were displayed correctly (though that application was not using our library and its strings), so at least in some situations Eclipse can display Unicode characters correctly.
The most annoying point for me is that so far I could not even find proof that displaying true Unicode data (i.e. non-ISO characters) works in any setup on Linux (even the gdb scripts for QStrings that I have found seem to display only the Latin characters and skip the remainder). Yet nearly every Linux application seems to support Unicode data, so there must be people out there who debug true Unicode data on Linux platforms, and I really cannot imagine that they are all reading hex codes instead of directly displaying Unicode strings.
So any pointers to setups that allow debugging of Unicode strings, based on other string classes (e.g. QString) and/or other IDEs, would also be appreciated.