views: 205
answers: 6

Good morning all,

I'm searching for a very fast binary serialization technique for C++. I only need to serialize data contained in objects (no pointers etc.). I'd like it to be as fast as possible. If it's specific to x86 hardware, that's acceptable.

I'm familiar with the C methods of doing this. As a test I've benchmarked a couple of techniques. I've found the C method is 40% faster than the best C++ method I implemented.

Any suggestions on how to improve the C++ method (or libraries that do this)? Anything good available for memory-mapped files?

Thanks

// C-style writes (needs <cstdio>, <ctime>, <cstdint>)
{
   #pragma pack(push, 1)
   struct item
   {
      uint64_t off;
      uint32_t size;
   } data;
   #pragma pack(pop)

   clock_t start = clock();

   FILE* fd = fopen( "test.c.dat", "wb" );
   for ( long i = 0; i < tests; i++ )
   {
      data.off = i;
      data.size = i & 0xFFFF;
      fwrite( (char*) &data, sizeof(data), 1, fd );
   }
   fclose( fd );

   clock_t stop = clock();

   double d = ((double)(stop-start))/ CLOCKS_PER_SEC;
   printf( "%8.3f seconds\n", d );
}

About 1.6 seconds for tests = 10000000

// C++-style ofstream writes (needs <fstream>, <cstdio>, <ctime>, <cstdint>)

// define a DTO class
class test
{
public:
   test(){}

   uint64_t off;
   uint32_t size;

   friend std::ostream& operator<<( std::ostream& stream, const test& v );
};

// write to the stream
std::ostream& operator<<( std::ostream &stream,  const test& v )
{
   stream.write( reinterpret_cast<const char*>(&v.off), sizeof(v.off) );
   stream.write( reinterpret_cast<const char*>(&v.size), sizeof(v.size) );
   return stream;
}

{
   test data;

   clock_t start = clock();

   std::ofstream out;
   out.open( "test.cpp.dat", std::ios::out | std::ios::trunc | std::ios::binary );
   for ( long i = 0; i < tests; i++ )
   {
      data.off = i;
      data.size = i & 0xFFFF;
      out << data;
   }
   out.close();

   clock_t stop = clock();

   double d = ((double)(stop-start))/ CLOCKS_PER_SEC;
   printf( "%8.3f seconds\n", d );
}

About 2.6 seconds for tests = 10000000

+1  A: 

Hello,

Is there any way you can take advantage of things that stay the same?

I mean, you're just trying to run through "test.c.dat" as fast as you possibly can, right? Can you take advantage of the fact that the file does not change between your serialization attempts? If you're serializing the same input over and over again, you can optimize based on this: the first attempt takes the same amount of time as yours, plus a tiny bit extra for another check, and then every subsequent run on the same input goes much faster.
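
A minimal sketch of that check, assuming POSIX stat and a cached modification time (the helper name is made up):

#include <sys/stat.h>
#include <ctime>

// Illustrative: skip the work entirely when the input file
// is the same as it was the last time we looked at it.
bool input_unchanged(const char* path, time_t cached_mtime)
{
   struct stat st;
   return stat(path, &st) == 0 && st.st_mtime == cached_mtime;
}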

I understand that this may just be a carefully crafted example, but you seem to be focused on making the language accomplish your task as quickly as possible, instead of asking "do I need to accomplish this again at all?" What is the context of this approach?

I hope this is helpful.

-Brian J. Stinar-

Brian Stinar
It's going to be used as a configuration database. The code I wrote was simply to test the overhead of the methods. Good idea though.
Jay
+1  A: 

If you're on a Unix system, mmap on the file is the way to do what you want.

See http://msdn.microsoft.com/en-us/library/aa366556(VS.85).aspx for the equivalent on Windows.
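
For example, a minimal POSIX sketch (file name and sizes are illustrative; the file has to be pre-sized before you write through the mapping):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

#pragma pack(push, 1)
struct item { uint64_t off; uint32_t size; };   // matches the packed C struct
#pragma pack(pop)

int main()
{
   const long tests = 10000000;
   const size_t bytes = tests * sizeof(item);

   int fd = open( "test.mmap.dat", O_RDWR | O_CREAT | O_TRUNC, 0644 );
   if ( fd < 0 || ftruncate( fd, bytes ) != 0 ) return 1;   // pre-size the file

   void* p = mmap( nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
   if ( p == MAP_FAILED ) return 1;

   item* out = static_cast<item*>(p);
   for ( long i = 0; i < tests; i++ )
   {
      out[i].off  = i;
      out[i].size = i & 0xFFFF;
   }

   munmap( p, bytes );   // dirty pages are written back to the file by the kernel
   close( fd );
   return 0;
}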

Alexandre C.
That was next on my list. Thanks for the confirmation
Jay
+4  A: 

If the task to be performed is really serialization, you might check out Google's Protocol Buffers. They provide fast serialization of C++ classes. The site also mentions some alternative libraries, e.g. boost.serialization (only to state that Protocol Buffers outperform them in most cases, of course ;-)
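
For a feel of the API, a hypothetical sketch assuming a generated message Item with fields off and size (the schema and file names are made up, not from the question):

#include <fstream>
#include "item.pb.h"   // generated by protoc from an assumed item.proto

void write_items( long tests )
{
   std::ofstream out( "test.pb.dat", std::ios::binary | std::ios::trunc );
   Item item;
   for ( long i = 0; i < tests; i++ )
   {
      item.set_off( i );
      item.set_size( i & 0xFFFF );
      // Note: concatenated messages need a length prefix to be read back
      // individually; SerializeToOstream alone merges fields on parse.
      item.SerializeToOstream( &out );
   }
}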

thorsten
Protocol Buffers (as much as I love it) is not really serialization; it is meant more for message passing. The difference is that with protocol buffers you define a Message class, while in serialization there is no intermediary representation.
Matthieu M.
Thinking a bit more about it, you could use the protobuf class to hold your data within the real class; this way you would be able to use protobuf for data keeping and encoding/decoding while hiding this fact from your users.
Matthieu M.
A: 

A lot of the performance is going to depend on memory buffers and how you fill up blocks of memory before writing to disk. And there are some tricks to making standard C++ streams a little faster, like std::ios_base::sync_with_stdio(false); and supplying a larger stream buffer.
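
For instance (illustrative; whether pubsetbuf actually takes effect is implementation-defined, and it generally has to be called before open):

#include <fstream>
#include <iostream>
#include <vector>

int main()
{
   std::ios_base::sync_with_stdio( false );   // drop C-stdio synchronization

   std::vector<char> buf( 1 << 20 );          // 1 MiB buffer; size is arbitrary
   std::ofstream out;
   out.rdbuf()->pubsetbuf( buf.data(), buf.size() );   // before open()
   out.open( "test.cpp.dat", std::ios::out | std::ios::trunc | std::ios::binary );
   // ... write records as in the question ...
   return 0;
}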

But IMHO, the world doesn't need another implementation of serialization. Here are some that other folks maintain that you might want to look into:

  • Boost: Fast, assorted C++ library including serialization
  • protobuf: Fast cross-platform, cross-language serialization with C++ module
  • thrift: Flexible cross-platform, cross-language serialization with C++ module
samkass
Show me a serialization package that is useful in a constrained environment with deterministic memory usage and I'll show you the only serialization package you will ever need. Until then, it's a bit specious to say that we don't need another serialization package when everyone's requirements for serialization differ in contradictory ways.
MSN
I looked at Boost. It jumps through all kinds of hoops to serialize any object, and I only need PODs. Why pay for extra features you don't need?
Jay
@Jay: If you only need support for PODs, why not just use your C approach?
jalf
I was hoping someone here had thought of something I didn't :(
Jay
+1  A: 

There are just very few real-life cases where that matters at all. You only ever serialize to make your objects compatible with some kind of external resource: disk, network, etcetera. The code that transmits the serialized data to the resource is always orders of magnitude slower than the code needed to serialize the object. If you make the serialization code twice as fast, you've made the overall operation no more than 0.5% faster, give or take. That is worth neither the risk nor the effort.

Measure three times, cut once.

Hans Passant
Excellent point. Thanks
Jay
A: 

Well, if you want the fastest serialization possible, then you can just write your own serialization class and give it methods to serialize each of the POD types.

The less safety you bring in, the faster it'll run and the harder it'll be to debug. However, there are only a fixed number of built-in types, so you can enumerate them:

#include <deque>

class Buffer
{
public:
  Buffer& operator<<(int i)   // one overload per built-in type, etc...
  {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&i);
    mData.insert(mData.end(), p, p + sizeof(i));
    return *this;
  }
private:
  std::deque<unsigned char> mData;
};
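
Usage would then be as simple as (illustrative):

Buffer buf;
buf << 42 << 7;   // appends the raw bytes of both ints to the buffer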

I must admit I don't understand your problem:

  • What do you actually want to do with the serialized message?
  • Are you saving it for later?
  • Do you have to worry about forward / backward compatibility?

There might be better approaches than serialization.

Matthieu M.
I will be persisting the data to disk. It will only be loaded on the same machine it's saved on. I was considering putting a version number on the objects so it would be better able to deal with changes. If you know better approaches I'd be happy to hear about them.
Jay
Versioning is a must, otherwise you're stuck. You can choose how much you version, though: versioning every single struct will make things a bit more costly, while having a single version isn't that easy to maintain. I would also suggest using some 'sync' markers from place to place, and perhaps a CRC code to check data integrity (in case the file gets corrupted). I've commented on Thorsten77's answer about protobuf; it looks like it could help you a lot.
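
For illustration, a versioned record header along those lines might look like this (the field names and layout are made up):

#include <cstdint>

#pragma pack(push, 1)
struct RecordHeader
{
   uint32_t magic;         // 'sync' marker to resynchronize after corruption
   uint16_t version;       // bumped whenever the payload layout changes
   uint32_t payload_size;  // bytes of payload that follow this header
   uint32_t crc32;         // checksum of the payload, for integrity checks
};
#pragma pack(pop)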
Matthieu M.