I'm having trouble getting my head around using UTF-8 with Poco::XML::XMLWriter. In the following code example, everything works fine when the input contains only ASCII characters. Occasionally, however, the string in wordmapIt->first contains a non-ASCII value, such as a char with the value -105 in the middle of the string. When this happens, the XML stream seems to terminate at the -105 char even though many other words follow it. I want to preserve whatever string was there, so just stripping the char out isn't the right answer. There must be some kind of encoding I can apply (I think), but what?

I'm clearly missing something conceptually, but for the life of me I can't figure out the right way to do this.

Poco::XML::XMLString EDocument::makeXMLString()
{
    std::stringstream xmlstream;
    Poco::UTF8Encoding utf8encoding;
    Poco::XML::XMLWriter writer(xmlstream, 0, "UTF-8", &utf8encoding);

    writer.startDocument();
    std::map<std::string, std::string>::iterator wordmapIt;

    for ( wordmapIt = nodeinfo->wordmap.begin(); wordmapIt != nodeinfo->wordmap.end(); wordmapIt++ )
    {
        writer.startElement("", "", "word");
        writer.characters(Poco::XML::toXMLString(wordmapIt->first));
        writer.endElement("", "", "word");
    }
    writer.endDocument();
    return xmlstream.str();
}

Edit: Solution based on answer below.

Poco::XML::XMLString EDocument::makeXMLString()
{
    std::stringstream xmlstream;
    Poco::UTF8Encoding utf8encoding;
    Poco::XML::XMLWriter writer(xmlstream, 0, "UTF-8", &utf8encoding);

    // Converts the cp1252 input bytes to UTF-8 before they reach the writer.
    Poco::Windows1252Encoding windows1252encoding;
    Poco::TextConverter textconverter(windows1252encoding, utf8encoding);

    writer.startDocument();
    std::map<std::string, std::string>::iterator wordmapIt;

    for ( wordmapIt = nodeinfo->wordmap.begin(); wordmapIt != nodeinfo->wordmap.end(); wordmapIt++ )
    {
        std::string strword;
        textconverter.convert(wordmapIt->first, strword);
        writer.startElement("", "", "word");
        writer.characters(strword);
        writer.endElement("", "", "word");
    }
    writer.endDocument();
    return xmlstream.str();
}
+1  A: 

It sounds like you have a byte string in Windows code page 1252 encoding. “Character -105” presumably really means byte 0x97, which would map to Unicode character U+2014 Em Dash (—) in cp1252.
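
To see why a char printed as -105 is the same thing as byte 0x97, here is a minimal sketch. The helper name byteValue is made up for illustration and is not part of Poco; it only reinterprets the char as an unsigned byte.

```cpp
// Hypothetical helper (not a Poco function): returns the raw byte value of a char.
// On platforms where char is signed, bytes 0x80..0xFF show up as negative numbers.
constexpr unsigned byteValue(char c) {
    return static_cast<unsigned char>(c);
}

// -105 + 256 == 151 == 0x97, the cp1252 byte for U+2014.
static_assert(byteValue(static_cast<char>(-105)) == 0x97,
              "char -105 is byte 0x97");
```

The static_assert is checked at compile time, so the file compiling at all confirms the arithmetic.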

I'm not familiar with Poco, but I would guess you're expected to convert your cp1252 strings to UTF-8 output encoding using a TextConverter with Windows1252Encoding and UTF8Encoding.
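
For illustration, here is roughly what such a cp1252-to-UTF-8 conversion does under the hood. This is a hand-rolled sketch using only the standard library, not Poco's implementation; the names cp1252ToUtf8 and kCp1252High are assumptions made up for this example, and mapping undefined cp1252 bytes to U+FFFD is a choice of this sketch.

```cpp
#include <cstddef>
#include <string>

// Unicode code points for cp1252 bytes 0x80..0x9F.
// 0 marks the five bytes that cp1252 leaves undefined.
static const unsigned short kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

// Hypothetical helper: converts a cp1252 byte string to UTF-8.
std::string cp1252ToUtf8(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        unsigned cp = b;  // 0x00..0x7F and 0xA0..0xFF map to the same code point
        if (b >= 0x80 && b <= 0x9F) {
            cp = kCp1252High[b - 0x80];
            if (cp == 0) cp = 0xFFFD;  // undefined byte: use the replacement character
        }
        if (cp < 0x80) {               // 1-byte UTF-8 sequence
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {       // 2-byte UTF-8 sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                       // 3-byte UTF-8 sequence (all remaining cases here)
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this sketch, the single byte 0x97 becomes the three UTF-8 bytes 0xE2 0x80 0x94, which is the form an XML writer producing UTF-8 output expects. In real code the library's own converter is the better choice.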

However, if what you really have is an “ANSI string” (a byte string in the default code page for the current machine's locale), 1252 might not be the right answer, and you might have to use a function from another library to do the conversion properly.

bobince
Perfect! Thank you so much. My confusion arose because I'm scraping strings out of IE and was thinking, 'well, the webpage is UTF-8, so what's the problem?' But as you pointed out, the string was cp1252-encoded. Using TextConverter as you suggested to map from cp1252 to UTF-8 gave the right result. I'm editing my question to contain the answer because finding examples of this stuff is a drag.
Andrew Bucknell