views: 645
answers: 2
I have been working for some years now as a C++ developer, using MS Visual Studio as my working platform. Since I privately prefer Linux, I recently took the chance to move my working environment to Linux as well. Having optimized my Windows environment for several years, it of course turns out that several things are missing or not working as expected. Thus I have some questions for which I could not come up with useful answers yet.

Let's start with the following problem; different questions will probably follow later. It is something I have already stumbled upon several times, whenever I was forced to debug platform-specific bugs on non-Windows platforms.

Simply speaking: how can I display Unicode (UCS-2 encoded) strings while debugging on Linux?

Now some more details I have figured out so far. Our library internally uses a Unicode-based string class which encodes every char as a 16-bit Unicode value (we do not support multi-word encodings, so we can basically only use the UCS-2 encodable subset of UTF-16, but this covers nearly all scripts in use anyway). This already poses one problem: on most platforms (i.e. Linux/Unix) the wchar_t type consists of 4 bytes, while on Windows it is only 2 bytes, so I cannot simply cast the internal string buffer to (wchar_t *), and I am not sure whether this would really help any debugger.
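The size mismatch can be pinned down at compile time; this tiny check (my addition, not part of the library) passes on typical Linux/glibc toolchains and fails under MSVC:

```cpp
// On Linux with glibc/libstdc++, wchar_t is 4 bytes (UTF-32 units), so a
// buffer of 16-bit code units cannot simply be reinterpreted as wchar_t*.
// This compile-time check documents that assumption.
static_assert(sizeof(wchar_t) == 4, "expected 32-bit wchar_t on Linux");
```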

For gdb I have figured out that I can call functions from the debugged code to print debug messages. Thus I inserted a special function into our library that can arbitrarily transform the string data and write it to a new buffer. Currently I transcode our internal buffer to UTF-8, since I expect this to be the most likely to work.

But so far this solves the problem only partially: if the string is Latin, I now get readable output (whereas one cannot directly print the Latin data while it is 16-bit encoded), but I also have to deal with other scripts (e.g. CJK (a.k.a. Hanzi / Kanji), Cyrillic, Greek, ...), and by dealing I mean I have to specifically debug data using such scripts, since the scripts in use directly influence the control flow. Of course, in these cases I only see the ISO characters that correspond to the individual bytes making up a UTF-8 character, which makes debugging CJK data even more cryptic than correctly displayed strings would be.

Generally, gdb allows setting several host and target encodings, so it should be possible to send a correctly encoded UTF-8 data stream to the console.
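For reference, a gdb session along these lines might look like the following (a sketch only; the available charset names depend on how gdb and its iconv support were built, and debug_print_utf8 stands in for whatever helper the library exposes):

```
(gdb) set host-charset UTF-8
(gdb) set target-charset UTF-8
(gdb) set target-wide-charset UTF-32
(gdb) call debug_print_utf8(myString)
```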

But of course I'd prefer to use an IDE for debugging. Currently I am trying to make friends with Eclipse and CDT, but for debugging I have also tested kdbg. In both applications I could so far only obtain incorrectly decoded UTF-8 data. On the other hand, I once debugged a Java project in Eclipse on a Windows platform, and all internal strings were displayed correctly (though that application was not using our library and the corresponding strings), so at least in some situations Eclipse can display Unicode characters correctly.

The most annoying point for me is that so far I could not even come up with any proof that displaying true Unicode data (i.e. non-ISO characters) works in any setup on Linux (even the gdb scripts for QStrings that I have found seem to only display Latin characters and skip the remainder). Yet nearly every Linux application seems to support Unicode data, so there must be people around who debug true Unicode data on Linux platforms, and I really cannot imagine that they are all reading hex codes instead of directly displaying Unicode strings.

Thus any pointers to setups that allow debugging of Unicode strings, based on any other string class (e.g. QString) and/or IDE, would also be appreciated.

A: 

I assume you are under X? Are the proper fonts installed?

If on the console, are you using a framebuffer as the terminal device? A VGA text mode can only show 256/512 characters max (the 512 case, IIRC, eating up a bit of the color space).

Marco van de Voort
Yes; more precisely, I am currently using Ubuntu 9.10 with GNOME. I'd like to use a graphical frontend (Eclipse / kdbg / KDevelop / ddd ...), so this has to be configured as well. For most IDEs it seems to be no problem to view UTF-8 encoded source code, i.e. I can see all characters my current charset supports, but while debugging I get at best escaped hex codes for non-ASCII characters.
+2  A: 

Most Linux distros tend to have excellent Unicode support. However, I would say that using UTF-16 on Linux is a mistake. I realize this would be natural coming from a Windows environment, but it will just make things more difficult for you on Linux.

As long as your locale is set to Unicode, it's trivial to output UTF-32 (wchar_t) strings using wprintf or wcout, and of course you can output UTF-8 strings using the normal output facilities. However, with UTF-16 you are essentially limited to building a custom string class that uses int16_t, which, as you've discovered, is going to be difficult to print in a debugger.

You mentioned that you created a function which translates the UTF-16 to UTF-8 for the purposes of debugging, but the variable-length characters make it difficult to deal with. Why not simply make a function that translates the UTF-16 to UTF-32, so that each Unicode codepoint is one character? That way you can use wide-character output to read the strings. GDB doesn't allow you to output wide-character strings by default, but you can fix that using this simple script.
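Since the strings are UCS-2 only (no surrogate pairs), that translation is a plain widening copy; a minimal sketch (the name debug_ucs2_to_utf32 and the static buffer are my assumptions):

```cpp
#include <cstddef>
#include <cstdint>

// Widens a UCS-2 buffer to UTF-32, which matches wchar_t on Linux, so the
// result can be printed as a wide string once gdb's target-wide-charset is
// set to UTF-32. Each 16-bit unit is exactly one codepoint, so no
// variable-length decoding is needed. Not thread-safe; debugger use only.
extern "C" const wchar_t* debug_ucs2_to_utf32(const uint16_t* s, std::size_t len)
{
    static wchar_t out[1024];
    if (len >= 1024)
        len = 1023;                           // truncate to fit the buffer
    for (std::size_t i = 0; i < len; ++i)
        out[i] = static_cast<wchar_t>(s[i]);  // zero-extend each unit
    out[len] = L'\0';
    return out;
}
```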

Charles Salvia
I am not sure whether using UTF-16 is really a mistake, since 32-bit chars could nearly double our memory footprint and could thus have a significant negative impact on performance (which is considered critical for our purposes); on the other hand, 32-bit charsets only enable additional characters that we will very likely never need. Logging data in arbitrary encodings is no problem. I have now adjusted my dump method to UTF-32 and tried to cast the result to wchar, setting target-wide-char to utf-32 in gdb; that again works in the console, but not in kdbg. Now I will test this with other IDEs.
Well, you're right that UTF-32 consumes twice the memory. My usual operating strategy is to store Unicode data as UTF-8, and if I ever have to process it, I first convert it to UTF-32.
Charles Salvia
Just tried casting UTF-32 to wchar in Eclipse; the result looks like L"\344\270\203ABER", i.e. the Latin chars are displayed correctly, but the leading Kanji is again garbled up into escaped byte values...