The cplusplus.com example for reading text files shows that a line can be read using the getline function. However, I don't want to get an entire line; I want to get only a certain number of characters. How can this be done in a way that preserves character encoding?

I need a function that does something like this:

ifstream fileStream;
fileStream.open("file.txt", ios::in);
resultStream << getstring(fileStream, 10); // read first 10 chars
file.ftell(10); // move to the next item
resultStream << getstring(fileStream, 10); // read 10 more chars

I thought about reading to a char buffer, but wouldn't this change the character encoding?

+1  A: 

Maybe istream::getline is what you are looking for?

Space_C0wb0y
I don't think the OP wants to stop on newline.
Douglas Leeder
I believe so too, but I was wondering what happens if you pass '\0' as the delimiter parameter (third overload in the documentation). Maybe this works as he intends?
Space_C0wb0y
+2  A: 

C++ itself doesn't have a concept of character encoding. A char is always the same size, as is a wchar_t. So if you need to read X characters of a multibyte character set (such as UTF-8), you'll either have to read the data yourself and test the MBCS signals (one single-byte char at a time, e.g. using istream::get(), or X chars speculatively using istream::getline()), or use a third-party library to do it.

If the charset is a fixed width encoding, and you don't mind stopping when you get to a newline, then getline(), which allows you to specify the maximum number of chars to read, is probably what you want.
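
One caveat with the getline() approach is that istream::getline() sets failbit when the buffer fills up before a newline is found, so the flag has to be cleared before reading on. A minimal sketch, assuming a single-byte encoding and no '\0' bytes in the data; the name readChunkOrLine is made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): read at most maxChars chars,
// stopping early at a newline. Assumes the data contains no '\0' bytes.
std::string readChunkOrLine(std::istream& in, std::size_t maxChars)
{
    std::vector<char> buffer(maxChars + 1, '\0');  // +1 for the terminator
    in.getline(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    const std::string chunk(buffer.data());        // text up to the terminator

    // getline() sets failbit when the buffer fills before a newline is seen.
    // Clear only that flag so the next call continues on the same line.
    if (in.fail() && !in.eof() && chunk.size() == maxChars) {
        in.clear(in.rdstate() & ~std::ios_base::failbit);
    }
    return chunk;
}

// Usage sketch on an in-memory stream; an ifstream behaves the same way.
void demoReadChunkOrLine()
{
    std::istringstream in("abcdefghij\nrest");
    assert(readChunkOrLine(in, 4) == "abcd");
    assert(readChunkOrLine(in, 4) == "efgh");
    assert(readChunkOrLine(in, 4) == "ij");    // stopped at the newline
    assert(readChunkOrLine(in, 4) == "rest");
}
```

The newline itself is consumed but not returned, so the caller cannot tell a short chunk at a line end from a short chunk at end of file without also checking the stream state.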

Phil Nash
+1  A: 

As a few people have mentioned, the C/C++ Standard Libraries don't really provide anything that operates above essentially byte level. So if you want to do this using only the core libraries, you don't have a ready-made option.

That leaves three choices: check whether your chosen platform(s) provide another library that implements this capability, write your own parser for handling character encodings, or punch something like "c++ utf8 library" or "posix unicode" into Google and take a look at what turns up.

Possible interesting hits:

I'll leave further investigation to the reader.

Adam Luchjenbroers
+4  A: 

I really suspect that there's some confusion here regarding the term "character." Judging from the OP's question, he is using the term "character" to refer to a char (as opposed to a logical "character", like a multi-byte UTF-8 character), and thus for the purpose of reading from a text-file the term "character" is interchangeable with "byte."

If that is the case, you can read a certain number of bytes from disk using ifstream::read(), e.g.

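ifstream fileStream;
fileStream.open("file.txt", ios::in);
char buffer[1024];
fileStream.read(buffer, sizeof(buffer)); // copies up to 1024 bytes into buffer
streamsize bytesRead = fileStream.gcount(); // may be less than sizeof(buffer) at end of file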

Reading into a char buffer won't affect the character encoding at all. The exact sequence of bytes stored on disk will be copied into the buffer.
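
For the getstring(fileStream, 10) call in the question, read() and gcount() can be wrapped in a small helper. This is only a sketch under the assumption that "character" means byte; the function name is borrowed from the question and is not a standard library function:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper named after the getstring() call in the question.
// Reads up to `count` chars (bytes) and returns exactly the bytes read.
std::string getstring(std::istream& in, std::size_t count)
{
    std::string result(count, '\0');
    in.read(&result[0], static_cast<std::streamsize>(count));
    // gcount() is smaller than count when the stream ends early.
    result.resize(static_cast<std::size_t>(in.gcount()));
    return result;
}

// Usage sketch on an in-memory stream; an ifstream behaves the same way.
void demoGetstring()
{
    std::istringstream in("0123456789abcdefghijXYZ");
    assert(getstring(in, 10) == "0123456789");  // first 10 chars
    assert(getstring(in, 10) == "abcdefghij");  // next 10 chars
    assert(getstring(in, 10) == "XYZ");         // short read at the end
    assert(in.eof());
}
```

Because read() advances the stream position, consecutive calls return consecutive chunks and no explicit seek is needed between them. With a multi-byte encoding, a chunk boundary can still fall in the middle of a character.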

However, it is a different story if you are using a multi-byte character set where each character is variable-length. If characters are not fixed-size, there's no way to read exactly N characters from disk with a single disk read. This is not a limitation of C++, this is simply the reality of dealing with block devices (disks). At the lowest levels of your OS, block devices are addressed in terms of blocks, which in turn are made up of bytes. So you can always read an exact number of bytes from disk, but you can't read an exact number of logical characters from disk, unless each character is a fixed number of bytes. For character-sets like UTF-8 where each character is variable length, you'll have to either read in the entire file, or else perform speculative reads and parse the read buffer after each read to determine if you need to read more.
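
As a rough illustration of the parsing involved, here is a sketch that reads N UTF-8 characters by inspecting each lead byte and letting the stream's own buffer handle the underlying block reads. It assumes the file really is UTF-8, counts code points rather than user-perceived characters, and does not validate the input; both function names are made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Number of bytes in a UTF-8 sequence, judged from its lead byte.
// Returns 1 for invalid lead bytes so the reader always makes progress.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // stray continuation or invalid byte
}

// Hypothetical helper: read up to `count` UTF-8 code points and return
// them as the raw bytes found in the stream (no validation, no decoding).
std::string readUtf8Chars(std::istream& in, std::size_t count)
{
    const int eof = std::char_traits<char>::eof();
    std::string result;
    for (std::size_t i = 0; i < count; ++i) {
        const int first = in.get();
        if (first == eof) break;
        result.push_back(static_cast<char>(first));

        const std::size_t length =
            utf8SequenceLength(static_cast<unsigned char>(first));
        for (std::size_t j = 1; j < length; ++j) {
            // Peek first so a truncated sequence does not swallow the
            // lead byte of the following character.
            const int next = in.peek();
            if (next == eof || (next & 0xC0) != 0x80) break;
            result.push_back(static_cast<char>(in.get()));
        }
    }
    return result;
}

// Usage sketch: 'h', e-acute (2 bytes), "llo ", euro sign (3 bytes),
// an emoji (4 bytes) and '!'.
void demoReadUtf8Chars()
{
    std::istringstream in("h\xC3\xA9" "llo \xE2\x82\xAC" "\xF0\x9F\x98\x80!");
    assert(readUtf8Chars(in, 2) == "h\xC3\xA9");
    assert(readUtf8Chars(in, 5) == "llo \xE2\x82\xAC");
    assert(readUtf8Chars(in, 5) == "\xF0\x9F\x98\x80!");  // only 2 left
}
```

The returned string holds the same bytes as the file, so the encoding is preserved; only the chunk boundaries are chosen to fall between characters.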

Charles Salvia
Good call on multi-byte characters.
nbolton
+1 This is dead on.
luke
A: 

I think you can use the sgetn member function of the stream's associated streambuf:

char buf[32];
streamsize i = fileStream.rdbuf()->sgetn(&buf[0], 10);

This reads up to 10 chars into buf (fewer if fewer are available) and returns the number of chars actually read.
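
One thing to keep in mind is that sgetn works on the streambuf directly, so it does not update the stream's state flags the way read() does. A small sketch that wraps the call and sets eofbit by hand on a short read; the helper name is made up, and it assumes the stream has a valid buffer attached:

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: read up to `count` chars through the streambuf.
// Assumes in.rdbuf() is not null.
std::string readViaStreambuf(std::istream& in, std::size_t count)
{
    std::string result(count, '\0');
    const std::streamsize got =
        in.rdbuf()->sgetn(&result[0], static_cast<std::streamsize>(count));
    result.resize(static_cast<std::size_t>(got));

    // sgetn() bypasses the stream, so record the short read ourselves.
    if (static_cast<std::size_t>(got) < count) {
        in.setstate(std::ios_base::eofbit);
    }
    return result;
}

// Usage sketch on an in-memory stream.
void demoReadViaStreambuf()
{
    std::istringstream in("0123456789abc");
    assert(readViaStreambuf(in, 10) == "0123456789");
    assert(!in.eof());
    assert(readViaStreambuf(in, 10) == "abc");  // only 3 chars were left
    assert(in.eof());
}
```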