Hi again,

I have this simple code:

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main(void)
{
    ifstream in("file.txt");
    string line;
    while (getline(in, line))
    {
        cout << line << "    starts with char: " << line.at(0) << " " << (int) line.at(0) << endl;
    }
    in.close();
    return 0;
}

which prints:

  0.000000 0.000000 0.010909 0.200000    starts with char:   32
A 0.023636 0.000000 0.014545 0.200000    starts with char: A 65
B 0.050909 0.000000 0.014545 0.200000    starts with char: B 66
C 0.078182 0.000000 0.014545 0.200000    starts with char: C 67

...

, 0.152727 0.400000 0.003636 0.200000    starts with char: , 44
< 0.169091 0.400000 0.005455 0.200000    starts with char: < 60
. 0.187273 0.400000 0.003636 0.200000    starts with char: . 46
> 0.203636 0.400000 0.005455 0.200000    starts with char: > 62
/ 0.221818 0.400000 0.010909 0.200000    starts with char: / 47
? 0.245455 0.400000 0.009091 0.200000    starts with char: ? 63
¡ 0.267273 0.400000 0.005455 0.200000    starts with char: � -62
£ 0.285455 0.400000 0.012727 0.200000    starts with char: � -62
¥ 0.310909 0.400000 0.012727 0.200000    starts with char: � -62
§ 0.336364 0.400000 0.009091 0.200000    starts with char: � -62
© 0.358182 0.400000 0.016364 0.200000    starts with char: � -62
® 0.387273 0.400000 0.018182 0.200000    starts with char: � -62
¿ 0.418182 0.400000 0.009091 0.200000    starts with char: � -62
À 0.440000 0.400000 0.012727 0.200000    starts with char: � -61
Á 0.465455 0.400000 0.014545 0.200000    starts with char: � -61

Strange... How can I really get the first character of the string?

Thanks in advance!

A: 

I think the last characters belong to the extended ASCII table, which C++ does not support.

ASCII Table

Edit 1: No, from a quick look the characters at the bottom do not appear to be in extended ASCII either. Maybe check what Martin York said.

Muggen
+7  A: 

You are getting the first character in the string.

But it looks like the string is a UTF-8 string (or possibly some other multibyte character format).

This means each symbol (glyph) that is printed is made up of one or more chars.
If it is UTF-8, then any character outside the ASCII range (0-127) is encoded as two or more chars. When the whole string is printed, the complete byte sequence reaches the output and is displayed correctly. But a single char greater than 127 cannot be decoded on its own, which is why you see the replacement symbol, and the value is negative because char is signed on your platform.
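To make the byte layout concrete, here is a minimal sketch (not part of the original answer) that reads the sequence length off the lead byte and returns all the bytes of the first character. The names utf8SequenceLength and firstUtf8Char are made up for illustration, and the code assumes well-formed UTF-8 without validating continuation bytes. The values -62 and -61 in the question's output are the lead bytes 0xC2 and 0xC3 read as signed char.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 0 for a byte that cannot start a sequence (a continuation
// byte 10xxxxxx or an invalid lead byte).
std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Bytes that make up the first character of a UTF-8 string.
// Hypothetical helper: assumes well-formed input.
std::string firstUtf8Char(const std::string& s) {
    if (s.empty()) return std::string();
    std::size_t n = utf8SequenceLength(static_cast<unsigned char>(s[0]));
    if (n == 0) n = 1;      // malformed lead byte: fall back to one byte
    return s.substr(0, n);  // substr clamps if the string is truncated
}
```

In the question's loop, printing firstUtf8Char(line) instead of line.at(0) would emit the whole two-byte sequence for the lines that start with characters such as ¡ or À.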

Personally I think variable-width character formats are not a good idea to use internally in a program (they are OK for transport and storage), as they make string manipulation much more complex. I would recommend that you convert the string into a fixed-width format for internal processing, then convert it back to UTF-8 for storage.

Personally I would use UTF-16 (or UTF-32, depending on what wchar_t is) internally. (Yes, I know that technically UTF-16 is not fixed-width, but in all reasonable teaching circumstances it is; once we include Sanskrit we may need to use UTF-32.) You just need to imbue the input/output stream with the appropriate codecvt facet for the automatic translation. Internally the text can then be manipulated as single characters using the wchar_t type.
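As a rough illustration of the "convert to fixed width" step, here is a hand-written sketch that decodes UTF-8 into a std::u32string (C++11), one char32_t per code point. It is not the codecvt facet approach mentioned above, only a stand-in that shows what the conversion does. The name decodeUtf8 is hypothetical, and the decoder is deliberately simplified: malformed or truncated sequences become U+FFFD, and overlong encodings and surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 code points, one char32_t per character.
// Sketch only: see the stated simplifications in the text above.
std::u32string decodeUtf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

After decoding, indexing works per character: for a non-empty line, decodeUtf8(line)[0] is its first code point, for example 0xA1 for a line that starts with ¡.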

Martin York
this might also help http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring
celavek
Could you please post an example that uses a `codecvt` facet to convert from UTF-8 to `wchar_t`?
Philipp
Found the boost stuff. Though it looks like beta code: http://beta.boost.org/doc/libs/1_35_0/libs/serialization/doc/codecvt.html
Martin York
A: 

string is a container for char, which is only one byte. It should only be used for ASCII strings or binary data. Anything else should use Unicode via wstring, a container for wchar_t.

But the problem of how your Unicode text is encoded still exists; for that, see the answers above.

`std::string` can store Unicode strings if you use an appropriate encoding such as UTF-8. Unicode is not an encoding.
Philipp
Although possible, this is not very good because you can't use [0] reliably. What's the point of building abstractions (characters rather than bytes) if you use them so incorrectly?
+1  A: 

The file is UTF-8 encoded. Use a Unicode library such as ICU to get access to the code points:

#include <iostream>
#include <fstream>
#include <string>
#include <utility>

#include "unicode/utf.h"

using namespace std;

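// Decodes the first code point of a UTF-8 string with ICU's U8_NEXT macro.
// Returns the code point and the number of bytes it occupies.
pair<UChar32, int32_t>
getFirstUTF8CodePoint(const string& str) {
  const uint8_t* ptr = reinterpret_cast<const uint8_t*>(str.data());
  const int32_t length = static_cast<int32_t>(str.length());
  int32_t offset = 0;
  UChar32 cp = 0;
  if (length > 0) {  // U8_NEXT expects offset < length, so skip empty lines
    U8_NEXT(ptr, offset, length, cp);  // advances offset past the code point
  }
  return make_pair(cp, offset);
}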

int main(void)
{
    ifstream in("file.txt");
    string line;
    while (getline(in, line))
    {
      pair<UChar32, string::size_type> cp = getFirstUTF8CodePoint(line);
      cout << line << "    starts with char: " << line.substr(0, cp.second) << " " << static_cast<unsigned long>(cp.first) << endl;
    }
    in.close();
    return 0;
}
Philipp
A: 

A very similar code sample with the use of the UTF-8 CPP library can be found here: http://utfcpp.sourceforge.net/#introsample

Nemanja Trifunovic