I am working with a Wikipedia XML dump that is encoded in UTF-8. Right now, I am reading everything in as std::string, so when I std::cout to the screen, foreign characters are displayed as gibberish.

The actual parsing process only looks for ASCII characters, but when I write the parsed file to disk, I want to preserve the foreign characters. In other words, I want the output to have the same encoding as the input.

Is it OK to use std::string, or am I going to have to use something like ICU? The libraries I have looked at seem overly complicated. Is there something quick I can use to do this?

+1  A: 

As long as you do not break up the text in the middle of non-ASCII characters, you are safe. You can use std::string without any problem.

I mean that as long as you do not operate on the content of the XML itself, for example by splitting it into letters or words or by converting text to upper case, you will not have any problems.
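To illustrate what "breaking the text" means here: in UTF-8, every byte of a multi-byte character after the first has the bit pattern 10xxxxxx, so cutting a std::string at an arbitrary byte index can land inside a character. A minimal sketch, assuming the input is valid UTF-8; the names is_continuation_byte and truncate_utf8 are made up for this example and are not part of any library:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if b is a UTF-8 continuation byte (10xxxxxx).
bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: returns at most max_bytes bytes of s, backing up so
// that the cut never falls inside a multi-byte character.
// Assumes s holds valid UTF-8.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) {
        return s;
    }
    std::size_t cut = max_bytes;
    // If the first dropped byte is a continuation byte, the cut is inside a
    // character, so move back to that character's lead byte.
    while (cut > 0 && is_continuation_byte(static_cast<unsigned char>(s[cut]))) {
        --cut;
    }
    return s.substr(0, cut);
}
```

If you only copy whole strings around, or cut at ASCII delimiters such as '<' or '{', you never need anything like this.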

Artyom
Will this still work even if I remove characters from the string? For example, I want to scan each character and remove "{{" from the string by copying all other characters to a new string. Will this still work without converting?
Ryan Rosario
Removing ASCII characters will not cause a problem. UTF-8 is designed so that no byte in a multi-byte character is in the range 0-127 and thus can't be confused with an ASCII character.
John Machin
+1  A: 

UTF-8 is the default encoding for XML documents. Just write it to your file as is. There is no point in converting it to another representation (such as wide characters) and back again. If it is accidentally dumped to your screen, avert your gaze :-)

Removing ASCII characters like '{' will not cause a problem. UTF-8 is designed so that no byte in a multi-byte character is in the range 0-127 and thus can't be confused with an ASCII character.
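For the "{{" case from the question, a byte-by-byte copy along these lines should be enough. This is only a sketch: strip_double_braces is a name invented for this example, and it removes every occurrence of "{{" while copying all other bytes, including the bytes of multi-byte characters, through unchanged:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of `in` with every "{{" removed.
// All other bytes are copied as-is, so UTF-8 sequences stay intact.
std::string strip_double_braces(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        // '{' is ASCII (0x7B); it can never be part of a multi-byte
        // UTF-8 character, whose bytes are all >= 0x80.
        if (in[i] == '{' && i + 1 < in.size() && in[i + 1] == '{') {
            ++i;  // skip the second brace as well
            continue;
        }
        out.push_back(in[i]);
    }
    return out;
}
```

Write the result to disk with a plain std::ofstream and the output will have the same encoding as the input.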

John Machin