I would like to find out the best way to store some common data types that are not included in the list of types supported by Protocol Buffers.

  • datetime (seconds precision)
  • datetime (milliseconds precision)
  • decimals with fixed precision
  • decimals with variable precision
  • lots of bool values (if you have lots of them, it looks like you'll have 1-2 bytes of overhead for each of them due to their tags)

The idea is also to map them easily to the corresponding C++/Python/Java data types.

+1  A: 

Sorry, not a complete answer, but a "me too".

I think this is a great question, one I'd love an answer to myself. The inability to natively describe fundamental types like datetimes and (for financial applications) fixed-point decimals, or to map them to language-specific or user-defined types, is a real killer for me. It has more or less prevented me from using the library, which I otherwise think is fantastic.

Declaring your own "DateTime" or "FixedPoint" message in the proto grammar isn't really a solution, because you'll still need to convert your platform's representation to/from the generated objects manually, which is error prone. Additionally, these nested messages get stored as pointers to heap-allocated objects in C++, which is wildly inefficient when the underlying type is basically just a 64-bit integer.

Specifically, I'd want to be able to write something like this in my proto files:

message Something {
    required fixed64 time = 1 [cpp_type="boost::posix_time::ptime"];
    required int64 price = 2 [cpp_type="fixed_point<int64_t, 4>"];
    ...
}

I would then be required to provide whatever glue is necessary to convert these types to/from fixed64 and int64 so that the serialization works. Maybe through something like adobe::promote?

Bklyn
+2  A: 

Here are some ideas based on my experience with a wire protocol similar to Protocol Buffers.

datetime (seconds precision)

datetime (milliseconds precision)

I think the answer to these two would be the same; you would typically just be dealing with a smaller range of numbers in the case of seconds precision.

Use a sint64/sfixed64 to store the offset in seconds/milliseconds from some well-known epoch, such as midnight GMT on 1/1/1970. This is how Date objects are represented internally in Java (as milliseconds since that epoch). I'm sure there are analogs in Python and C++.
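For what it's worth, here is a minimal Python sketch of the epoch-offset idea, assuming milliseconds precision and timezone-aware datetime values. The helper names to_epoch_millis and from_epoch_millis are made up for illustration; the resulting integer is what would go into the sint64/sfixed64 field.

```python
from datetime import datetime, timedelta, timezone

# Well-known epoch: midnight UTC on 1 January 1970.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Return whole milliseconds since the epoch (hypothetical helper).

    Uses integer timedelta arithmetic rather than float timestamps,
    so no rounding error creeps in. Sub-millisecond parts are floored.
    """
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis):
    """Inverse of to_epoch_millis; the result is always in UTC."""
    return EPOCH + timedelta(milliseconds=millis)
```

For seconds precision the same pattern should work with timedelta(seconds=1) in place of timedelta(milliseconds=1).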

If you need time zone information, pass around your date/times in terms of UTC and model the pertinent time zone as a separate string field. For that, you can use the identifiers from the Olson Zoneinfo database since that has become somewhat standard.

This way you have a canonical representation for date/time, but you can also localize to whatever time zone is pertinent.

decimals with fixed precision

My first thought is to use a string, similar to how one constructs Decimal objects with Python's decimal module. I suppose that could be inefficient relative to a numerical representation.

There may be better solutions depending on what domain you're working with. For example, if you're modeling a monetary value, maybe you can get away with using a uint32/64 to communicate the value in cents as opposed to fractional dollar amounts.
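As a rough illustration of the "value in cents" idea, the sketch below scales a Python Decimal to an integer with a fixed number of fractional digits. The scale of 4 and the names to_scaled_int and from_scaled_int are assumptions for the example; both sides of the wire would have to agree on the scale out of band, and a signed field (e.g. sint64) would be needed if negative amounts can occur.

```python
from decimal import Decimal

SCALE = 4  # assumed number of digits after the decimal point


def to_scaled_int(value, scale=SCALE):
    """Convert a Decimal to an integer count of 10**-scale units.

    Raises ValueError if the value has more fractional digits than
    the scale allows, rather than silently rounding.
    """
    scaled = value.scaleb(scale)  # shift the decimal point right
    if scaled != scaled.to_integral_value():
        raise ValueError("value has more than %d fractional digits" % scale)
    return int(scaled)


def from_scaled_int(number, scale=SCALE):
    """Inverse of to_scaled_int."""
    return Decimal(number).scaleb(-scale)  # shift the decimal point left
```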

There are also some useful suggestions in this thread.

decimals with variable precision

Doesn't Protocol Buffers already support this with float/double scalar types? Maybe I've misunderstood this bullet point.

Anyway, if you had a need to go around those scalar types, you can encode using IEEE-754 to uint32 or uint64 (float vs double respectively). For example, Java allows you to extract the IEEE-754 representation and vice versa from Float/Double objects. There are analogous mechanisms in C++/Python.
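In Python, one way to get at the IEEE-754 bit pattern is the standard struct module. This is only a sketch of the round trip described above; the function names are invented for the example.

```python
import struct


def double_to_bits(value):
    """Return the IEEE-754 binary64 bit pattern of a float as an int (fits a uint64)."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def bits_to_double(bits):
    """Inverse of double_to_bits."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def float_to_bits(value):
    """Same idea for binary32; the result fits a uint32.

    Note that Python floats are doubles, so packing as "<f" rounds
    the value to single precision first.
    """
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_float(bits):
    """Inverse of float_to_bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```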

lots of bool values (if you have lots of them, it looks like you'll have 1-2 bytes of overhead for each of them due to their tags)

If you are concerned about wasted bytes on the wire, you could use bit-masking techniques to compress many booleans into a single uint32 or uint64.
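A minimal sketch of the bit-masking approach in Python follows. It assumes both sides agree on which bit position corresponds to which flag; pack_bools and unpack_bools are hypothetical helper names, and the 64-flag limit matches a single uint64 field.

```python
def pack_bools(values):
    """Pack up to 64 booleans into one integer; values[i] goes to bit i."""
    if len(values) > 64:
        raise ValueError("at most 64 flags fit in a uint64")
    mask = 0
    for position, flag in enumerate(values):
        if flag:
            mask |= 1 << position
    return mask


def unpack_bools(mask, count):
    """Recover the first `count` booleans from a mask built by pack_bools."""
    return [bool((mask >> position) & 1) for position in range(count)]
```

The receiver needs to know the flag count separately, since trailing False values leave no trace in the mask.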

Because there isn't first-class support in Protocol Buffers, all of these techniques require a bit of a gentlemen's agreement between agents. Perhaps using a naming convention on your fields, like "_dttm" or "_mask", would help communicate when a given field has additional encoding semantics above and beyond the default behavior of Protocol Buffers.

Joe Holloway
+2  A: 

The protobuf design rationale is most likely to keep data type support as "native" as possible, so that it's easy to add support for new languages in the future. I suppose they could provide built-in message types, but where do you draw the line?

My solution was to create two message types:

DateTime
TimeSpan

This is only because I come from a C# background, where these types are taken for granted.

In retrospect, TimeSpan and DateTime may have been overkill, but it was a "cheap" way of avoiding conversion from h/m/s to s and vice versa; that said, it would have been simple to just implement a utility function such as:

int TimeUtility::ToSeconds(int h, int m, int s)
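For comparison, the same kind of helper is only a few lines in Python. This is just a sketch: to_seconds mirrors the signature above, and from_seconds is an assumed inverse that was not part of my original code.

```python
def to_seconds(hours, minutes, seconds):
    """Collapse an h/m/s triple into a single count of seconds."""
    return hours * 3600 + minutes * 60 + seconds


def from_seconds(total):
    """Split a non-negative count of seconds back into (hours, minutes, seconds)."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds
```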

Bklyn pointed out that heap memory is used for nested messages. In some cases this is a very valid concern, and we should always be aware of how memory is used. In other cases it matters less, because we're more worried about ease of implementation (this is the Java/C# philosophy, I suppose).

There's also a small disadvantage to using non-intrinsic types with the protobuf TextFormat::Printer; you cannot specify the format in which it is displayed, so it'll look something like:

my_datetime {
    seconds: 10
    minutes: 25
    hours: 12
}

... which is too verbose for some. That said, it would be harder to read if it were represented in seconds.

To conclude, I'd say:

  • If you're worried about memory/parsing efficiency, use seconds/milliseconds.
  • However, if ease of implementation is the objective, use nested messages (DateTime, etc).
nbolton