views: 75
answers: 2

Is it good/safe/possible to use the tiny utfcpp library for converting everything I get back from the wide Windows API (FindFirstFileW and such) to a valid UTF8 representation using utf16to8?

I would like to use UTF8 internally, but am having trouble getting the correct output (via wcout after another conversion or plain cout). Normal ASCII characters work of course, but ñä gets messed up.

Or is there an easier alternative?

Thanks!

UPDATE: Thanks to Hans (below), I now have an easy UTF8<->UTF16 conversion through the Windows API. Two-way conversion works, but the UTF8 string converted from UTF16 has some extra characters that might cause me trouble later on. I'll share it here out of pure friendliness :):

#include <windows.h>    // WideCharToMultiByte, MultiByteToWideChar, CP_UTF8
#include <stdexcept>
#include <string>

// UTF16 -> UTF8 conversion
std::string toUTF8( const std::wstring &input )
{
    // first call: ask for the required buffer size
    int length = WideCharToMultiByte( CP_UTF8, 0,
                                      input.c_str(), static_cast<int>( input.size() ),
                                      NULL, 0,
                                      NULL, NULL );
    if( length <= 0 )
        return std::string();

    std::string result;
    result.resize( length );

    // second call: do the actual conversion
    if( WideCharToMultiByte( CP_UTF8, 0,
                             input.c_str(), static_cast<int>( input.size() ),
                             &result[0], length,
                             NULL, NULL ) > 0 )
        return result;

    throw std::runtime_error( "Failure to execute toUTF8: conversion failed." );
}

// UTF8 -> UTF16 conversion
std::wstring toUTF16( const std::string &input )
{
    // first call: ask for the required buffer size
    int length = MultiByteToWideChar( CP_UTF8, 0,
                                      input.c_str(), static_cast<int>( input.size() ),
                                      NULL, 0 );
    if( length <= 0 )
        return std::wstring();

    std::wstring result;
    result.resize( length );

    // second call: do the actual conversion
    if( MultiByteToWideChar( CP_UTF8, 0,
                             input.c_str(), static_cast<int>( input.size() ),
                             &result[0], length ) > 0 )
        return result;

    throw std::runtime_error( "Failure to execute toUTF16: conversion failed." );
}
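As an aside, what utf16to8 (and WideCharToMultiByte with CP_UTF8) actually computes can be sketched portably. The helper below is illustrative only, not part of either API; the function name utf16ToUtf8 is made up, and error handling for lone surrogates is omitted:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Portable sketch of the UTF-16 -> UTF-8 algorithm: combine surrogate
// pairs into code points, then emit 1-4 UTF-8 bytes per code point.
std::string utf16ToUtf8( const std::u16string &input )
{
    std::string out;
    for( std::size_t i = 0; i < input.size(); ++i )
    {
        std::uint32_t cp = input[i];
        // high surrogate followed by low surrogate -> one code point
        if( cp >= 0xD800 && cp <= 0xDBFF && i + 1 < input.size() )
        {
            std::uint32_t low = input[i + 1];
            if( low >= 0xDC00 && low <= 0xDFFF )
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if( cp < 0x80 )                    // 1 byte: 0xxxxxxx
            out += static_cast<char>( cp );
        else if( cp < 0x800 )              // 2 bytes: 110xxxxx 10xxxxxx
        {
            out += static_cast<char>( 0xC0 | (cp >> 6) );
            out += static_cast<char>( 0x80 | (cp & 0x3F) );
        }
        else if( cp < 0x10000 )            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        {
            out += static_cast<char>( 0xE0 | (cp >> 12) );
            out += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
            out += static_cast<char>( 0x80 | (cp & 0x3F) );
        }
        else                               // 4 bytes for code points above the BMP
        {
            out += static_cast<char>( 0xF0 | (cp >> 18) );
            out += static_cast<char>( 0x80 | ((cp >> 12) & 0x3F) );
            out += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
            out += static_cast<char>( 0x80 | (cp & 0x3F) );
        }
    }
    return out;
}
```

Characters such as ñ and ä from the question come out as the expected two-byte sequences.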
+3  A: 

Why do you want to use UTF8 internally? Are you working with so much text that using UTF16 would create unreasonable memory demands? Even if that were the case, you're probably better off using wide chars anyway, and dealing with memory issues in some other way (a disk cache, better algorithms or data structures).

Your code will be much cleaner and easier to deal with if you use the wide chars native to the Win32 API internally, and only do UTF8 conversions when reading or writing data that requires it (e.g. XML files or REST APIs).

Your problem may also occur at the point where you print your output to the console, see: http://stackoverflow.com/questions/2492077/output-unicode-strings-in-windows-console-app

Finally I haven't used the utfcpp library, but UTF8 conversions are fairly trivial to perform using Win32's WideCharToMultiByte and MultiByteToWideChar with CP_UTF8 as the code page. Personally I would do a one time conversion and work with the text in UTF16 until it was time to output or transfer it in UTF8 if needed.

Brook Miles
Note that wide characters on Windows are 16 bits, and thus have to be encoded as UTF-16. That, too, is a variable-width encoding: even though you are less likely to encounter them, Unicode code points that need two 16-bit code units (a surrogate pair) do exist, so you cannot assume that each 16-bit value is an individual character.
sbi
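sbi's point can be demonstrated with a plain C++11 snippet, no Win32 needed: a single character outside the Basic Multilingual Plane occupies two 16-bit code units.

```cpp
#include <cassert>
#include <string>

// U+1F600 is outside the BMP, so UTF-16 encodes it as a surrogate pair:
// two 16-bit code units for one character. Counting code units therefore
// does not count characters.
const std::u16string grin = u"\U0001F600";
// grin.size() == 2: grin[0] is the high surrogate (0xD83D),
//                   grin[1] is the low surrogate  (0xDE00)
```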
True, the main benefit is that UTF16 is the native encoding for Windows, and working with it means not having to continually convert to and from some other encoding when calling APIs.
Brook Miles
I'm developing a cross-platform app, and on Linux wchar_t is double the size it is on Windows. All I need the Win32 API for is filenames; all the rest is plain text (ASCII characters only). I don't see a reason to process double the amount of bytes when a simple std::string will suffice.
rubenvb
The reason is that a) double the amount of bytes is irrelevant in this case unless it's a huge amount or you're on a very limited platform, and b) it's the native OS encoding and is therefore simpler to use. Basically, I don't think it's worth all the extra effort and complexity to use UTF8 with no external requirement to do so.
Brook Miles
As I said, the app is cross-platform, and I'd have to create a much larger abstraction layer if I want it to run on any non-Windows system. It's either UTF8 or UTF16, but one end is going to have to be converted anyway. I'm not delving into the TCHAR business.
rubenvb
+4  A: 

The Win32 API already has a function to do this, WideCharToMultiByte() with CodePage = CP_UTF8. Saves you from having to rely on another library.

You cannot normally use the result with wcout. Its output goes to the console, which uses an 8-bit OEM encoding for legacy reasons. You can change the output code page with SetConsoleOutputCP(); 65001 is the code page for UTF-8 (CP_UTF8).

Your next stumbling block would be the font that's used for the console. You'll have to change it but finding a font that's fixed-pitch and has a full set of glyphs to cover Unicode is going to be difficult. You'll see you have a font problem when you get square rectangles in the output. Question marks are encoding problems.

Hans Passant
Just to clarify: a font (at least a TT font) allows you to specify what glyph will be shown for a codepoint for which the font doesn't contain a glyph. That's *typically* an empty rectangle, but could be essentially anything the font designer chose.
Jerry Coffin
I thought these were available, but I didn't know they were for UTF-8 -> UTF-16 conversion (I stupidly thought they used the UCS-2 encoding instead). Console output is indeed a messy thing. Perhaps I can output the UTF-8 to a file and open that with, say Notepad++ (it's only for checking what my program does)?
rubenvb
Sure, ought to work. As long as you can convince it that this is a UTF-8 file, it normally requires a BOM at the start of the file. Write 0xef 0xbb 0xbf first to be sure.
Hans Passant
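Hans's BOM suggestion as code, a minimal sketch: the helper name writeUtf8File and the filename are made up for illustration. It prepends the three BOM bytes so that editors such as Notepad++ recognize the file as UTF-8:

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Write UTF-8 text to a file, preceded by the UTF-8 byte order mark
// (0xEF 0xBB 0xBF) so editors detect the encoding.
void writeUtf8File( const std::string &path, const std::string &utf8Text )
{
    std::ofstream out( path.c_str(), std::ios::binary );
    out << "\xEF\xBB\xBF";   // UTF-8 BOM
    out << utf8Text;         // already UTF-8-encoded bytes, e.g. from toUTF8()
}
```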