I've been writing a binary version of iostreams. It essentially allows you to write binary files, while giving you fine-grained control over the format of the file. Example usage:

my_file << binary::u32le << my_int << binary::u16le << my_string;

This would write my_int as an unsigned 32-bit integer, and my_string as a length-prefixed string (where the prefix is u16le). To read the file back, you would flip the arrows. It works great. However, I hit a bump in the design, and I'm still on the fence about it, so it's time to ask SO. (We make a couple of assumptions for the moment, such as 8-bit bytes, two's-complement ints, and IEEE floats.)
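For concreteness, the read side (same manipulators, arrows flipped) looks like:

my_file >> binary::u32le >> my_int >> binary::u16le >> my_string;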

Under the hood, iostreams use streambufs. It's a fantastic design, really -- the iostream handles the serialization of an int into text and lets the underlying streambuf handle the rest. Thus you get cout, fstreams, stringstreams, etc. Both the iostreams and the streambufs are templated, usually on char, but sometimes also on wchar_t. My data, however, is a byte stream, which is best represented by unsigned char.

My first attempt was to template the classes on unsigned char. std::basic_string templates well enough, but streambuf does not: I ran into several problems with the codecvt facet, which I could never get to follow the unsigned char theme. This raises two questions:

1) Why is a streambuf responsible for such things? Code conversion seems to lie well outside a streambuf's responsibility -- a streambuf should take a stream and buffer data to and from it, nothing more. Something as high-level as code conversion feels like it belongs in the iostream.

Since I couldn't get the templated streambufs to work with unsigned char, I went back to char and merely cast data between char and unsigned char. I tried to minimize the number of casts, for obvious reasons. Most of the data basically winds up in a read() or write() function, which then invokes the underlying streambuf (using a cast in the process). The read function is basically:

size_t read(unsigned char *buffer, size_t size)
{
    size_t ret;
    ret = stream()->sgetn(reinterpret_cast<char *>(buffer), size);
    // deal with ret for return size, eof, errors, etc.
    ...
}
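The write side is symmetric -- a sketch, assuming the same stream() accessor:

size_t write(const unsigned char *buffer, size_t size)
{
    std::streamsize ret;
    ret = stream()->sputn(reinterpret_cast<const char *>(buffer), size);
    // deal with ret for return size, errors, etc., as in read()
    return static_cast<size_t>(ret);
}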

2) Good solution, bad solution?


The first two answers indicate that more info is needed. First, projects such as boost::serialization were looked at, but they exist at a higher level, in that they define their own binary format. This library is meant for reading and writing at a lower level, where you wish to define the format yourself, where the format is already defined, or where the bulk metadata is not required or desired.

Second, some have asked about the binary::u32le modifier. It is an instantiation of a class that holds the desired endianness and width -- and perhaps, in the future, signedness. The stream holds a copy of the last-passed instance of that class and uses it during serialization. This was a bit of a workaround; I originally tried overloading the << operator like so:

bostream &operator << (uint8_t n);
bostream &operator << (uint16_t n);
bostream &operator << (uint32_t n);
bostream &operator << (uint64_t n);

However, at the time this didn't seem to work: I had several problems with ambiguous function calls. This was especially true of constants, although you could, as one poster suggested, cast them or merely declare them as const <type>. I seem to remember there was some other, larger problem, however.
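For example, given only those four overloads, a plain int literal matches none of them exactly. A self-contained sketch of the failure mode (bostream here is a stand-in, not the actual class):

#include <stdint.h>

struct bostream {  // stand-in for the asker's class
    bostream& operator<<(uint8_t)  { return *this; }
    bostream& operator<<(uint16_t) { return *this; }
    bostream& operator<<(uint32_t) { return *this; }
    bostream& operator<<(uint64_t) { return *this; }
};

int main()
{
    bostream b;
    uint32_t n = 42;
    b << n;            // OK: exact match for the uint32_t overload
    // b << 6;         // error: ambiguous -- int converts equally well
                       //        to all four unsigned parameter types
    b << uint16_t(6);  // OK: the cast selects a single overload
    return 0;
}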

+1  A: 

As I understand it, the stream properties that you're using to specify types would be more appropriate for specifying endian-ness, packing, or other "meta-data" values. The handling of types themselves should be done by the compiler. At least, that's the way the STL seems to be designed.

If you use overloads to separate the types automatically, you would need to specify the type only when it was different from the declared type of the variable:

Stream& operator<<(int8_t);
Stream& operator<<(uint8_t);
Stream& operator<<(int16_t);
Stream& operator<<(uint16_t);
etc.

uint32_t x;
stream << x << (uint16_t)x;

Reading types other than the declared type would be a little messier. In general, though, reading to or writing from variables of a type different from the output type should be avoided, I think.

I believe the default version of std::codecvt does nothing, returning "noconv" for everything. It only really does anything when using the "wide" character streams. Can't you set up a similar definition for codecvt? If, for some reason, it's impractical to define a no-op codecvt for your stream, then I don't see any problem with your casting solution, especially since it's isolated to one location.
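For what it's worth, a no-op facet along those lines might look like the following sketch; whether the primary codecvt template (and char_traits) cooperates for unsigned char is implementation-dependent, which may be exactly the wall the asker hit:

#include <locale>
#include <cwchar>

// Hypothetical no-op codecvt facet for unsigned char streams.
// do_always_noconv() returning true tells the stream machinery
// to pass bytes through untouched.
class byte_codecvt
    : public std::codecvt<unsigned char, char, std::mbstate_t>
{
protected:
    virtual bool do_always_noconv() const throw() { return true; }
};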

Finally, are you sure you wouldn't be better off using some standard serialization code, like Boost, rather than rolling your own?

Tim Sylvester
Boost::serialization resides at a slightly higher level, and cannot be used to aid reading of existing binary protocols, for example, nor does it give quite the fine-grained control. As for the codecvt, I made some initial stabs at writing a no-op one for unsigned char, but was unsuccessful. As for the definitions, I originally started with what you had, but ran into problems with ambiguous function calls, and moved to the current solution. I might try again, as using the type would be much more natural.
Thanatos
Well, it sounds like you know what you're doing. That's all the feedback I have. If you want to take another crack at either the overloading or codecvt problems, I'm sure SO would be happy to look at it.
Tim Sylvester
I think that this is the way to go. Maybe you should try to work out the 'ambiguous function call' errors. I don't think (though I have not tried it) that the overload design is flawed -- I could be proven wrong.
David Rodríguez - dribeas
A: 

We needed to do something similar to what you are doing, but we followed another path. I am interested in how you have defined your interface. Part of what I don't see is how you can handle the manipulators you have defined (binary::u32le, binary::u16le).

With basic_ streams, a manipulator controls how all the following elements will be read or written, but in your case that probably does not make sense, as the size (part of your manipulator information) is tied to the particular variable being passed in or out.

binary_istream in;
int i;
int i2;
short s;
in >> binary::u16le >> i >> binary::u32le >> i2 >> s;

In the code above, it can make sense to decide that, even though the i variable is 32 bits (assuming int is 32 bits), you want to extract only 16 bits from the serialized stream, while extracting the full 32 bits into i2. After that, either the user is forced to introduce a manipulator for each and every value that follows, or else the manipulator stays in effect, so that when the short is passed in, 32 bits are read with a possible overflow; either way, the user will probably get unexpected results.

Size does not seem to belong (in my opinion) to manipulators.

Just as a side note: in our case we had other constraints, such as runtime definition of types, so we ended up building our own meta-type system to construct types at runtime (a kind of variant), and then implemented de/serialization for those types (Boost style). As a result, our serializers don't work with basic C++ types, but rather with serialization/data pairs.

David Rodríguez - dribeas
I made some edits to the question in regards to your answer. You and the other poster now have me rethinking my use of manipulators (good name -- they needed one...). I feel like there were issues with ambiguous function calls. This is especially true of in << 6, although this can be done with in << uint16_t(6). The manipulators persist, and it is an error to attempt to read into a 16-bit variable with a 32-bit manipulator present. I am going to think about this usage, however, and see if perhaps the pattern described by you two is a better fit.
Thanatos
The name is not mine, but rather standard. In the C++ standard, chapter 27.6 is titled 'Formatting and manipulators', and unlike your version, they are not implemented as objects that are passed into the stream, but rather as free (templated) functions that get executed inside the stream (ios_base or basic_ios<>). In each case they take and return a reference to the given type (basic_istream<>/basic_ostream<>, basic_ios<>, or ios_base).
David Rodríguez - dribeas
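For illustration, the free-function idiom described above might map onto a binary stream like this (binary_ostream and its members are hypothetical, not the asker's actual class):

// Hypothetical minimal stream type, just to show the shape of the idiom.
struct binary_ostream {
    int  width_bits;     // width of the next integer field, in bits
    bool little_endian;  // byte order of the next field
};

// A standard-style manipulator: a free function the stream executes.
inline binary_ostream& u32le(binary_ostream& os)
{
    os.width_bits = 32;
    os.little_endian = true;
    return os;
}

// An operator<< overload that accepts pointers to such functions,
// mirroring how basic_ostream handles std::endl and friends.
inline binary_ostream& operator<<(binary_ostream& os,
                                  binary_ostream& (*manip)(binary_ostream&))
{
    return manip(os);  // usage: out << u32le << my_int;
}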
A: 

I wouldn't use operator<<, as it's too intimately associated with formatted text I/O.

I wouldn't use an operator overload at all for this, actually. I'd find another idiom.

legalize
I consider its use here to be just the movement of things to/from a stream, but perhaps I see your point. Nonetheless, it's just a function call -- it could just as easily be replaced with .read(...) or .write(...), but then you would end up with stream.read(x).read(y).read(z), which may or may not make sense. However, since the class hierarchy matches that of iostreams, why not the API too?
Thanatos
The standard library already has methods that can read and write binary data. They are methods, not overloads of << or >>. The reason that I suggest not using << and >> for binary I/O is that people think of *formatted* input and output when they see those operators. Formatting implies things like locales, field justification, etc. You're not doing any of that with binary I/O, which is why the standard library provides the basic_istream<T>::read and basic_ostream<T>::write methods.
legalize
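For reference, a minimal sketch of the unformatted read()/write() usage mentioned above:

#include <fstream>
#include <stdint.h>

int main()
{
    uint32_t v = 42, r = 0;

    // basic_ostream::write / basic_istream::read move raw bytes,
    // with no locales, field widths, or other formatting involved.
    std::ofstream out("data.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(&v), sizeof v);
    out.close();

    std::ifstream in("data.bin", std::ios::binary);
    in.read(reinterpret_cast<char*>(&r), sizeof r);
    return 0;
}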
A: 

I agree with legalize. I needed to do almost exactly what you're doing, and looked at overloading << / >>, but came to the conclusion that iostream was just not designed to accommodate it. For one thing, I didn't want to have to subclass the stream classes to be able to define my overloads.

My solution (which only needed to serialize data temporarily on a single machine, and therefore did not need to address endianness) was based on this pattern:

#include <istream>
#include <ostream>
#include <stdint.h>
#include <boost/utility/enable_if.hpp>
#include <boost/type_traits/is_pod.hpp>

// deducible template argument read
template <class T>
void read_raw(std::istream& stream, T& value,
    typename boost::enable_if< boost::is_pod<T> >::type* dummy = 0)
{
    stream.read(reinterpret_cast<char*>(&value), sizeof(value));
}

// explicit template argument read
template <class T>
T read_raw(std::istream& stream)
{
    T value;
    read_raw(stream, value);
    return value;
}

// deducible template argument write
template <class T>
void write_raw(std::ostream& stream, const T& value,
    typename boost::enable_if< boost::is_pod<T> >::type* dummy = 0)
{
    stream.write(reinterpret_cast<const char*>(&value), sizeof(value));
}

I then further overloaded read_raw/write_raw for any non-POD types (e.g. strings). Note that only the first version of read_raw need be overloaded; if you use ADL correctly, the second (1-arg) version can call 2-arg overloads defined later and in other namespaces.
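A sketch of such a non-POD overload, building on the read_raw/write_raw templates above; the uint32_t length prefix is an arbitrary choice for this example:

#include <string>

// Length-prefixed string overloads in the same pattern.
inline void write_raw(std::ostream& stream, const std::string& value)
{
    const uint32_t size = static_cast<uint32_t>(value.size());
    write_raw(stream, size);            // reuse the POD writer for the prefix
    stream.write(value.data(), size);
}

inline void read_raw(std::istream& stream, std::string& value)
{
    uint32_t size = 0;
    read_raw(stream, size);             // reuse the POD reader for the prefix
    value.resize(size);
    if (size)
        stream.read(&value[0], size);
}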

Write example:

int32_t x = 123;
int64_t y = 456;
int8_t z = 7;
write_raw(os, x);
write_raw(os, y);
write_raw<int16_t>(os, z); // explicitly write int8_t as int16_t

Read example:

int32_t x = read_raw<int32_t>(is); // explicit form
int64_t y;
read_raw(is, y); // implicit form
int8_t z = boost::numeric_cast<int8_t>(read_raw<int16_t>(is));

It's not as sexy as overloaded operators, and things don't fit on one line as easily (which I tend to avoid anyway, since debug breakpoints are line-oriented), but I think it turned out simpler, more obvious, and not much more verbose.

Trevor Robinson
A: 

From The C++ Programming Language, Section 20.10, Exercise 15: "Implement versions of istream and ostream that read and write numbers in their binary form."

Anyone have the answers to the exercises?

AFAIK you need to implement << and >> if istream_iterator and ostream_iterator are to work, which of course are needed to use the STL algorithms.
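For illustration, the STL usage that depends on those operators -- istream_iterator only compiles if operator>> exists for the element type:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::vector<int> values;

    // istream_iterator<int> invokes operator>>(istream&, int&) per element;
    // a custom type would likewise need its own operator>> to work here.
    std::copy(std::istream_iterator<int>(std::cin),
              std::istream_iterator<int>(),
              std::back_inserter(values));
    return 0;
}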