I have a text file that was created by a Microsoft reporting tool. The file starts with the BOM bytes 0xFF 0xFE and then contains ASCII character output with null bytes between the characters (i.e., "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8, with UCS-2LE as the input format and UTF-8 as the output format... it works great.
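(For reference, the same conversion can be done in-process with the POSIX iconv API instead of the command-line tool. A minimal sketch, assuming the glibc-style iconv() signature -- some platforms declare the input buffer const; ucs2leToUtf8 is a made-up name and error handling is reduced to exceptions:

#include <iconv.h>
#include <stdexcept>
#include <string>

using namespace std;

// Convert a buffer of UCS-2LE bytes to UTF-8 -- the in-process
// equivalent of running the iconv command-line tool.
string ucs2leToUtf8(string input)
{
  if (input.empty())
    return string();

  iconv_t cd = iconv_open("UTF-8", "UCS-2LE");
  if (cd == (iconv_t)-1)
    throw runtime_error("iconv_open failed");

  // Each 2-byte UCS-2 unit expands to at most 3 UTF-8 bytes,
  // so twice the input size is always enough room.
  string output(input.size() * 2, '\0');
  char* inPtr = &input[0];
  size_t inLeft = input.size();
  char* outPtr = &output[0];
  size_t outLeft = output.size();

  size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
  iconv_close(cd);
  if (rc == (size_t)-1)
    throw runtime_error("iconv conversion failed");

  output.resize(output.size() - outLeft);
  return output;
})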

My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e., "Field1 Field2"). I have tried string and wstring versions of getline, and while it reads the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.

How do I read the UCS-2LE data into a C++ string and extract the data values? I have looked at Boost and ICU as well as numerous Google searches but have not found anything that works. What am I missing here? Please help!

My example code looks like this:

wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
// ...
wstring srcBuf;
// ...
while (getline(srcFile, srcBuf))
{
    wstring field1 = srcBuf.substr(12, 12);
    // ...
}

So if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s.,", then the substr call above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.". What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using Boost (or something else) to read these strings from the file and convert them to a fixed-width representation for internal use? BTW, I am on a Mac using Eclipse and gcc... is it possible my STL does not understand wide-character strings?

Thanks

A: 

substr works fine for me on Linux with g++ 4.3.3. The program

#include <string>
#include <iostream>

using namespace std;

int main()
{
  wstring s1 = L"Hello, world";
  wstring s2 = s1.substr(3,5);
  wcout << s2 << endl;
}

prints "lo, w" as it should.

However, the file reading probably does something different from what you expect. It converts the file from the locale's encoding to wchar_t, which causes each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.
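If you want to stay in C++ and skip the external iconv step, one workaround is to read the file as raw bytes and decode the UCS-2LE code units by hand. A minimal sketch, assuming the whole file fits in memory and contains only BMP characters (readUcs2le is a made-up name):

#include <fstream>
#include <iterator>
#include <string>
#include <vector>

using namespace std;

// Decode a UCS-2LE file into a wstring: each code unit is two
// bytes, least significant byte first.
wstring readUcs2le(const char* path)
{
  ifstream in(path, ios_base::in | ios_base::binary);
  vector<char> bytes((istreambuf_iterator<char>(in)),
                     istreambuf_iterator<char>());
  wstring out;
  size_t i = 0;
  // Skip the byte-order mark (bytes 0xFF 0xFE) if present.
  if (bytes.size() >= 2 &&
      (unsigned char)bytes[0] == 0xFF &&
      (unsigned char)bytes[1] == 0xFE)
    i = 2;
  for (; i + 1 < bytes.size(); i += 2)
  {
    unsigned char lo = bytes[i];
    unsigned char hi = bytes[i + 1];
    out.push_back((wchar_t)(lo | (hi << 8)));
  }
  return out;
}

Splitting the result on L'\n' (and dropping a trailing L'\r', since the file comes from a Windows tool) then gives you lines on which substr counts whole characters.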

Martin v. Löwis
Thanks for the reply. I see the same behavior. As you say, I don't think the UTF-16 to wchar_t conversion is supported. I used iconv to convert the file to UTF-8 and it solved my problem.
Cryptik