Hi,

I have a set of classes whose data I wish to serialize. There is a lot of data, though (we're talking about a std::map with a million or more class instances).

Not wishing to optimize my code too early, I thought I'd try a simple and clean XML implementation, so I used TinyXML to save the data out to XML, but it was just far too slow. I've since started looking at Boost.Serialization, writing and reading plain ASCII or binary.

It seems much better suited to the task, as I don't have to allocate all that memory as overhead before I get started.

My question is essentially how to go about planning an optimal serialization strategy for a file format. I don't particularly want to serialize the whole map if it's not necessary, as it's really only the contents I'm after. Having played around with serialization a little (and looked at the output), I don't understand how the loading code could know when it has reached the end of the map, for example, if I simply save out all the items one after another. What issues do you need to consider when planning a serialization strategy?

Thanks.

+3  A: 

Read this FAQ! Does that help you get started?

dirkgently
+1  A: 

I don't particularly want to serialize the whole map if it's not necessary, as it's really only the contents I'm after.

Does that mean you don't really need to serialize the whole object? Maybe you should reconsider just using a text-based format. If you really need to serialize only a subset of the key/value pairs in a map then you should probably just write them to a text file and read them in later. You don't necessarily need XML; just one line per map key followed by one line with the value should work.
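A minimal sketch of that scheme, assuming (purely for illustration) string keys and values with no embedded newlines:

#include <fstream>
#include <map>
#include <string>

typedef std::map<std::string, std::string> Dict;

// One line per key followed by one line per value.
void save(const Dict& m, const char* path) {
    std::ofstream out(path);
    for (Dict::const_iterator it = m.begin(); it != m.end(); ++it)
        out << it->first << '\n' << it->second << '\n';
}

// Read pairs until end-of-file; the stream itself tells us when to stop,
// which answers the "how do I know I've reached the end?" concern.
void load(Dict& m, const char* path) {
    std::ifstream in(path);
    std::string key, value;
    while (std::getline(in, key) && std::getline(in, value))
        m[key] = value;
}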

+1  A: 

If all you want is key/value pairs, then the important thing is the types the keys and values take; this will colour how you deal with things.

Serialising the map itself would be a poor plan in general, since you may wish to change your associative container type later without invalidating (or having to translate) previously serialised files.

Serialising the container itself can be useful in certain circumstances, if you wish to avoid the cost of rebuilding the container on load (though pre-sizing the container is normally sufficient to avoid the vast majority of this overhead), but this should be a decision based on specific aspects of your application and usage.
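For a std::map specifically there is no pre-sizing as such, but when the entries come back in key order, hinted insertion achieves much the same effect. A sketch (the int/string types are just for illustration):

#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Rebuild a std::map from entries saved in key order. Each new key is the
// largest seen so far, so inserting with the end() hint is amortised
// constant time per element instead of a full tree search.
std::map<int, std::string>
rebuild(const std::vector<std::pair<int, std::string> >& sorted_entries) {
    std::map<int, std::string> m;
    for (std::size_t i = 0; i < sorted_entries.size(); ++i)
        m.insert(m.end(), sorted_entries[i]);
    return m;
}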

If you supply the types of the keys/values we can help more. Without this, here are some general tips:

  • If they are amenable to string representation, then a simple CSV file may be sufficient (but use an existing reader/writer library for it; reading and writing legitimate CSV is harder than it looks).
  • If they are fixed width, then a simple binary format will make reading and writing very easy and quick (see the sketch after this list), but care should be taken to acknowledge the issues of:
    • endianness
    • whether you wish to allow simple catting of such files together or to add CRC-like values for integrity (you can do both, but it's harder)
    • you lose the ability to grep the files (this is a real loss; you may end up having to reinvent parts of your toolchain for this)
    • whether changing platform/compiler/size_t will break the format
  • Some structured textual format that is lighter than XML: there are several (JSON, YAML, etc.). These will provide extensibility you quite likely don't require.
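To illustrate the fixed-width binary option, here is a minimal sketch (the uint32_t key / double value record is an assumption; a real format would also want a magic number and a version header):

#include <stdint.h>
#include <cstdio>
#include <map>
#include <utility>

// A fixed-width record: every entry occupies exactly 12 bytes on disk.
#pragma pack(push, 1)
struct Record {
    uint32_t key;
    double   value;
};
#pragma pack(pop)

bool save(const std::map<uint32_t, double>& m, const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    for (std::map<uint32_t, double>::const_iterator it = m.begin();
         it != m.end(); ++it) {
        Record r = { it->first, it->second };
        // NOTE: this sketch assumes a little-endian host; on a big-endian
        // machine the bytes would need swapping here.
        std::fwrite(&r, sizeof r, 1, f);
    }
    return std::fclose(f) == 0;
}

bool load(std::map<uint32_t, double>& m, const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    Record r;
    // Fixed-width records mean the end of the map is simply end-of-file.
    while (std::fread(&r, sizeof r, 1, f) == 1)
        m.insert(m.end(), std::make_pair(r.key, r.value));
    std::fclose(f);
    return true;
}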
ShuggyCoUk
Apologies, I missed saying in my original question that I intended to leave XML entirely, as I felt plain ASCII/binary would suffice (I have now edited the question). Thanks for your points though; there's some useful information there.
Dan
+1  A: 

Use Google's Protocol Buffers, a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.

There are bindings for C++, Java, Python, Perl, C#, and Ruby.

You describe your data in .proto definition files:

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
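
Running the protocol buffer compiler over that file (for example, protoc --cpp_out=. person.proto) generates person.pb.h and person.pb.cc, which define the Person class used below.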

Then you would use it in C++ like this:

#include <fstream>
#include "person.pb.h"  // generated header; declares the Person class
using namespace std;

Person person;
person.set_id(123);
person.set_name("Bob");
person.set_email("bob@example.com");  // placeholder address

// Write the message in its compact binary wire format.
fstream out("person.pb", ios::out | ios::binary | ios::trunc);
person.SerializeToOstream(&out);
out.close();

Or like this:

// (same includes as above, plus <iostream> and <cstdlib> for the I/O and exit)
Person person;
fstream in("person.pb", ios::in | ios::binary);
if (!person.ParseFromIstream(&in)) {
  cerr << "Failed to parse person.pb." << endl;
  exit(1);
}

cout << "ID: " << person.id() << endl;
cout << "name: " << person.name() << endl;
// optional fields come with a has_xxx() accessor
if (person.has_email()) {
  cout << "e-mail: " << person.email() << endl;
}

For a more complete example, see the tutorials.

chrish
+2  A: 

There are many advantages to Boost.Serialization. For instance, as you say, just including a method with a specified signature allows the framework to serialize and deserialize your data. Also, Boost.Serialization includes serializers and readers for all the standard STL containers, so you don't have to worry about whether all keys have been stored (they will be) or how to detect the last entry in the map when deserializing (it will be detected automatically).
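
As a rough sketch of what that looks like in practice (the MyData class and its fields are invented for illustration; a text or XML archive could be swapped in for the binary one):

#include <fstream>
#include <map>
#include <string>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/access.hpp>
#include <boost/serialization/map.hpp>     // teaches the archives about std::map
#include <boost/serialization/string.hpp>

class MyData {
public:
    MyData() : value(0) {}
    MyData(int v, const std::string& s) : value(v), label(s) {}
private:
    friend class boost::serialization::access;
    // The single method with the specified signature; the framework
    // drives it for both saving and loading.
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & value;
        ar & label;
    }
    int value;
    std::string label;
};

void save_map(const std::map<int, MyData>& m, const char* path) {
    std::ofstream ofs(path, std::ios::binary);
    boost::archive::binary_oarchive oa(ofs);
    oa << m;  // the element count is written as part of the map...
}

void load_map(std::map<int, MyData>& m, const char* path) {
    std::ifstream ifs(path, std::ios::binary);
    boost::archive::binary_iarchive ia(ifs);
    ia >> m;  // ...so the end of the map is detected automatically
}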

There are, however, some considerations to keep in mind. For example, if you have fields in your class that are calculated, or used as speed-ups, such as indexes or hash tables, you don't have to store these, but you do have to remember to reconstruct them from the data read from disk.
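
With Boost.Serialization, one way to handle that is to split the serialize method into save/load halves, so the load side can rebuild the derived member after reading. A sketch (the checksum member and its rebuild logic are invented for illustration):

#include <string>
#include <boost/serialization/access.hpp>
#include <boost/serialization/split_member.hpp>
#include <boost/serialization/string.hpp>

class Record {
public:
    Record() : checksum(0) {}
    explicit Record(const std::string& t) : text(t) { rebuild_checksum(); }
private:
    friend class boost::serialization::access;

    // Save only the real data...
    template <class Archive>
    void save(Archive& ar, const unsigned int /*version*/) const {
        ar & text;
    }
    // ...and recompute the derived field after loading it back.
    template <class Archive>
    void load(Archive& ar, const unsigned int /*version*/) {
        ar & text;
        rebuild_checksum();
    }
    BOOST_SERIALIZATION_SPLIT_MEMBER()

    void rebuild_checksum() {
        checksum = 0;
        for (std::string::size_type i = 0; i < text.size(); ++i)
            checksum += static_cast<unsigned char>(text[i]);
    }

    std::string text;
    unsigned long checksum;  // derived from 'text'; cheap to rebuild, so not stored
};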

As for the "file format" you mention, I think we sometimes try to focus on the format rather than on the data. I mean, the exact format of the file doesn't matter as long as you are able to retrieve the data seamlessly using (say) Boost.Serialization. If you want to share the file with other utilities that don't use the same serialization library, that's another matter. But just for the purposes of (de)serialization, you don't have to care about the internal file format.

Diego Sevilla