Is this even possible? I've been trying to read a simple file that contains Russian, and it's clearly not working.

I've called file.imbue(loc) (at this point loc is correct: Russian_Russia.1251), and buf is of type basic_string<wchar_t>.

The reason I'm using basic_ifstream<wchar_t> is because this is a template (so technically, basic_ifstream<T>, but in this case, T=wchar_t).

This all works perfectly with English characters...

while (file >> ch)
{
    if(isalnum(ch, loc))
    {
        buf += ch;
    }
    else if(!buf.empty())
    {
        // Do stuff with buf.
        buf.clear();
    }
}

I don't see why I'm getting garbage when reading Russian characters. For example, if the file contains хеы хеы хеы, I get "яюE", 5(square), K(square), etc.

+1  A: 

Code page 1251 isn't Unicode -- it's Windows-1251, an 8-bit Cyrillic encoding (similar in purpose to ISO 8859-5, but with a different layout). Unfortunately, chances are that your iostream implementation doesn't support UTF-16 "out of the box." This is a bit strange, since doing so would just involve passing the data through unchanged, but most still don't support it. For what it's worth, at least if I recall correctly, C++0x is supposed to add this.

Jerry Coffin
So, std::basic_ifstream<wchar_t> just cannot be done? Then why does it exist? Forgive the nature of my questions, I just don't see a way, at all, to read multibyte characters using streams, and have them be anything but garbage as soon as they're read, unless you write code specifically for each kind of multibyte encoding - which defeats the point of templates altogether.
Mark
@Mark: The important point here is that your input isn't Unicode. Is your implementation expecting Unicode?
David Thornley
I'm not really sure what you mean - all I know is that the file will be in either ASCII or Unicode (and it's supposed to be selectable at compile time whether or not to use wide or narrow characters - using a template).
Mark
basic_[io]stream<wchar_t> can be done, but most implementations assume the external encoding will be something like ISO 8859-x or Shift JIS rather than Unicode. Though they didn't really plan it that way, it's possible to make them read/write files in UTF-8 encoded Unicode. Getting it to work with UTF-16 or UTF-32/UCS-4 would be more difficult. Given that you're doing different transformations with each, at some point you need unique code for each encoding. The template reduces unnecessary duplication elsewhere.
Jerry Coffin
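As a rough illustration of the "unique code for each encoding" point, here is a minimal sketch of decoding UTF-16LE by hand from raw bytes (for example, bytes read through a narrow stream opened in binary mode). The helper name decode_utf16le is hypothetical. It assumes little-endian input with an optional FF FE byte-order mark, and it passes unpaired surrogates through unchanged rather than rejecting them.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-16LE bytes into UTF-32 code points.
// Assumes little-endian input; a leading FF FE byte-order mark is skipped.
std::u32string decode_utf16le(const std::string& bytes)
{
    // Read one 16-bit little-endian code unit starting at byte offset pos.
    auto unit = [&bytes](std::size_t pos) -> char32_t {
        return static_cast<unsigned char>(bytes[pos]) |
               (static_cast<unsigned char>(bytes[pos + 1]) << 8);
    };

    std::size_t i = 0;
    if (bytes.size() >= 2 && unit(0) == 0xFEFF)
        i = 2; // skip the byte-order mark

    std::u32string out;
    while (i + 1 < bytes.size())
    {
        char32_t u = unit(i);
        i += 2;
        // A high surrogate followed by a low surrogate encodes one
        // code point above U+FFFF.
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < bytes.size())
        {
            char32_t lo = unit(i);
            if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        out += u; // unpaired surrogates are copied as-is in this sketch
    }
    return out;
}
```

A UTF-8 or code page 1251 file would need a different decoder in the same position; the code that consumes the decoded characters can stay shared.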
A: 

Iostreams, by default, assume any data on disk is in a non-Unicode format, for compatibility with existing programs that do not handle Unicode. C++0x will fix this by allowing native Unicode support, but at this time there is a std::codecvt<wchar_t, char, mbstate_t> used by iostreams to convert the normal char data into wide characters for you. See cplusplus.com's description of std::codecvt.

If you want to use Unicode with iostreams, you need to specify a codecvt facet with the form std::codecvt<wchar_t, wchar_t, mbstate_t>, which just passes through data unchanged.

Billy ONeal
How is this used? If you don't mind me asking.
Mark
You just pass the facet to basic_istream<wchar_t>::use_facet, like you would with any other facet.
Billy ONeal
I'm not sure that exists... Maybe I'm misunderstanding how facets work, but I don't see how you could pass one to use_facet, since I don't think use_facet is defined for basic_ifstream. I could be wrong...
Mark
Sorry -- I'm not very familiar with this stuff :( I think the method you're looking for is `std::basic_ifstream<t>::imbue`.
Billy ONeal
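For reference, a minimal sketch of imbuing a conversion facet. It assumes a C++11 library that ships <codecvt> (the header C++0x added; it was deprecated in C++17), and it assumes the file is UTF-16LE. The leading "яю" in the question hints at that: the bytes FF FE of a UTF-16LE byte-order mark display as "яю" in code page 1251. File streams look up the std::codecvt<wchar_t, char, mbstate_t> facet of the imbued locale, and std::codecvt_utf16<wchar_t> derives from it, so the locale built here swaps in a UTF-16 converter. The function name read_utf16le_file is hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: read a whole UTF-16LE file into a wide string.
// Assumption: the file is UTF-16, little-endian unless a byte-order
// mark says otherwise.
std::wstring read_utf16le_file(const char* path)
{
    typedef std::codecvt_utf16<
        wchar_t, 0x10ffff,
        static_cast<std::codecvt_mode>(std::consume_header |
                                       std::little_endian)> utf16_facet;

    std::wifstream file;
    // Imbue before opening so the facet is in place for the first read.
    // The locale takes ownership of the facet and deletes it.
    file.imbue(std::locale(file.getloc(), new utf16_facet));
    // Binary mode keeps the runtime from translating line endings
    // inside the 16-bit code units.
    file.open(path, std::ios::binary);

    std::wstring text;
    wchar_t ch;
    while (file.get(ch)) // get() does not skip whitespace, unlike >>
        text += ch;
    return text;
}
```

Where wchar_t is 16 bits wide (as on Windows), this facet can only produce characters up to U+FFFF, which is enough for Russian text.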
+1  A: 

There are still lots of STL implementations that don't have a std::codecvt that can handle Unicode encodings. Their wchar_t templated streams will default to the system code page, even though they are otherwise Unicode enabled for, say, the filename. If the file actually contains UTF-8, they'll produce junk. Maybe this will help.

Hans Passant
A: 

I am not sure, but you can try to call setlocale(LC_CTYPE, "");

VitalyVal
Err.. no, that's the default locale in any case.
Billy ONeal