views:

229

answers:

2

Using Google protobuf, I am saving my serialized messaged data to a file - in each file there are several messages. We have both C++ and Python versions of the code, so I need to use protobuf functions that are available in both languages. I have experimented with using SerializeToArray and SerializeAsString and there seems to be the following unfortunate conditions:

  1. SerializeToArray: As suggested in one answer, the best way to use this is to prefix each message with it's data size. This would work great for C++, but in Python it doesn't look like this is possible - am I wrong?

  2. SerializeAsString: This generates a serialized string equivalent to it's binary counterpart - which I can save to a file, but what happens if one of the characters in the serialization result is \n - how do we find line endings, or the ending of messages for that matter?

Update:

Please allow me to rephrase slightly. As I understand it, I cannot write binary data in C++ because then our Python application cannot read the data, since it can only parse string serialized messages. Should I then instead use SerializeAsString in both C++ and Python? If yes, then is it best practice to store such data in a text file rather than a binary file? My gut feeling is binary, but as you can see this doesn't look like an option.

+1  A: 

The best practice for concatenating messages in this way is to prepend each message with its size. That way you read in the size (try a 32bit int or something), then read that number of bytes into a buffer and deserialize it. Then read the next size, etc. etc.

The same goes for writing, you first write out the size of the message, then the message itself.

See Streaming Multiple Messages on the protobuf documentation for more information.

Mike Weller
Should I be using a binary file or a text file?
nbolton
The protobuf data is always binary, it's just that python and C++ strings can hold binary data. So it doesn't matter what data type you end up using.
Mike Weller
A: 

We have had great success base64 encoding the messages, and using a simple \n to separate messages. This will ofcoirse depend a lot on your use - we need to store the messages in "log" files. It naturally has overhead encoding/decoding this - but this has not even remotely been an issue for us.

The advantage of keeping these messages as line separated text has so far been invaluable for maintenance and debugging. Figure out how many messages are in a file ? wc -l . Find the Nth message - head ... | tail. Figure out what's wrong with a record on a remote system you need to access through 2 VPNs and a citrix solution ? copy paste the message and mail it to the programmer.

nos

related questions