ansaurus

Question

How to handle changing data structures on program version update?

Answer 1

A:

You may want to take a look at how Boost Serialization library deals with that issue.

Nemanja Trifunovic 2009-11-03 03:15:46

My problem with Boost serialization is that it used to be broken and requires support code inside classes. Has anyone had good experiences with it?

piotr 2009-11-03 06:34:16

You can't do serialization without support code inside classes for two reasons: 1) you need to be able to get to the private data of the class and 2) C++ has no reflection capability.

jmucchiello 2009-11-03 07:21:54

Mawg 2009-11-04 03:46:52

Answer 2

+4 A:

There's a huge concept that the relational database people use.

It's called breaking the architecture into "Logical" and "Physical" layers.

Your structs are both a logical and a physical layer mashed together into a hard-to-change thing.

You want your program to depend on a logical layer. You want your logical layer to -- in turn -- map to physical storage. That allows you to make changes without breaking things.

You don't need to reinvent SQL to accomplish this.

If your data lives entirely in memory, then think about this. Divorce the physical file representation from the in-memory representation. Write the data in some "generic", flexible, easy-to-parse format (like JSON or YAML). This allows you to read in a generic format and build your highly version-specific in-memory structures.

If your data is synchronized onto a filesystem, you have more work to do. Again, look at the RDBMS design idea.

Don't code a simple brainless struct. Create a "record" which maps field names to field values. It's a linked list of name-value pairs. This is easily extensible to add new fields or change the data type of the value.
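As a rough illustration of the "record" idea above, here is a minimal sketch. It assumes C++17 (for std::variant) and swaps the linked list for a std::map to keep it short; the names Record, FieldValue and get_or are made up for this example.

```cpp
#include <map>
#include <string>
#include <utility>
#include <variant>

// Hypothetical value type: the few primitive types a field may hold.
using FieldValue = std::variant<long, double, std::string>;

// A "record" maps field names to values instead of fixing a struct layout.
// (std::map stands in for the linked list of name-value pairs.)
class Record {
public:
    void set(const std::string& name, FieldValue value) {
        fields_[name] = std::move(value);
    }

    bool has(const std::string& name) const {
        return fields_.count(name) != 0;
    }

    // Return the field if it exists with the expected type, else the fallback.
    template <typename T>
    T get_or(const std::string& name, T fallback) const {
        auto it = fields_.find(name);
        if (it == fields_.end()) return fallback;
        if (const T* p = std::get_if<T>(&it->second)) return *p;
        return fallback;
    }

private:
    std::map<std::string, FieldValue> fields_;
};
```

A newer version can add a field or store a different type under the same name; older code that asks with get_or simply falls back to its default.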

S.Lott 2009-11-03 04:02:53

Of course that is also very slow compared to the brainless struct.

jmucchiello 2009-11-03 07:16:25

I don't know about "very" slow. It involves some extra de-referencing, but with a little care it's only a few extra de-references: pointers to pointers instead of pointers to objects.

S.Lott 2009-11-03 12:50:58

Answer 3

+2 A:

Some simple guidelines if you're talking about a structure use as in a C API:

Have a structure size field at the start of the struct. This way code using the struct can always ensure it is dealing only with valid data. For example, many of the structures the Windows API uses start with a cbSize field, so these APIs can handle calls made by code compiled against old SDKs, or even newer SDKs that have added fields.
Never remove a field. If you don't need to use it anymore, that's one thing, but to keep things sane for dealing with code that uses an older version of the structure, don't remove the field.
It may be wise to include a version number field, but often the size field can be used for that purpose.

Here's an example - I have a bootloader that looks for a structure at a fixed offset in a program image for information about that image that may have been flashed into the device.

The loader has been revised, and it supports additional items in the struct for some enhancements. However, an older program image might be flashed, and that older image uses the old struct format. Since the rules above were followed from the start, the newer loader is fully able to deal with that. That's the easy part.

And if the struct is revised further and a new image uses the new struct format on a device with an older loader, that loader will be able to deal with it, too - it just won't do anything with the enhancements. But since no fields have been (or will be) removed, the older loader will be able to do whatever it was designed to do and do it with the newer image that has a configuration structure with newer information.
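The loader side of this could look roughly like the sketch below. The struct names and fields (ImageInfoV1, ImageInfoV2, entry_point, crc32) are invented for illustration and are not the actual bootloader layout; the sketch also assumes the stored size is at least large enough to cover the size field itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical image descriptor, version 1 layout.
struct ImageInfoV1 {
    std::uint32_t size;         // bytes of valid data, including this field
    std::uint32_t entry_point;
};

// Version 2 only appends; nothing is removed or reordered.
struct ImageInfoV2 {
    std::uint32_t size;
    std::uint32_t entry_point;
    std::uint32_t crc32;        // new in v2
};

// Read a descriptor of any older or newer size into the layout this
// loader knows. Fields the writer did not provide stay zero.
inline ImageInfoV2 read_image_info(const unsigned char* raw) {
    ImageInfoV2 info{};                        // zero-initialised defaults
    std::uint32_t stored = 0;
    std::memcpy(&stored, raw, sizeof stored);  // size field always comes first
    std::size_t n = stored < sizeof info ? stored : sizeof info;
    std::memcpy(&info, raw, n);                // copy only what both sides know
    info.size = static_cast<std::uint32_t>(n);
    return info;
}

// True if the image actually supplied the v2 field.
inline bool has_crc(const ImageInfoV2& info) {
    return info.size >= offsetof(ImageInfoV2, crc32) + sizeof info.crc32;
}
```

An old image yields a descriptor where has_crc() is false, and an image from a future revision is simply truncated to the part this loader understands.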

If you're talking about an actual database that has metadata about the fields, etc., then these guidelines don't really apply.

Michael Burr 2009-11-03 04:03:41

1) TLV. Thanks. 2) Never remove a field - we may not have that luxury in embedded devices. A nice idea for PCs, though. 3) Version number? If the structure of a field changes radically, give it a different type, I would say. Thanks for the suggestions.

Mawg 2009-11-04 03:43:25

See the revised answer for an example of how these rules were used to enable a robust data sharing protocol between a bootloader and program images.

Michael Burr 2009-11-04 07:12:38

Answer 4

+2 A:

What you're looking for is forward-compatible data structures. There are several ways to do this. Here is the low-level approach.

struct address_book
{
  unsigned int length; // total length of this struct in bytes
  char items[];        // C99 flexible array member: the items follow the header
};

where 'items' is a variable length array of a structure that describes its own size and type

struct item
{
  unsigned int size; // how long data[] is
  unsigned int id;   // first name, phone number, picture, ...
  unsigned int type; // string, integer, jpeg, ...
  char data[];       // C99 flexible array member: 'size' bytes of payload
};

In your code, you iterate through these items (address_book->length will tell you when you've hit the end) with some intelligent casting. If you hit an item whose ID you don't know or whose type you don't know how to handle, you just skip it by jumping over that data (from item->size) and continue on to the next one. That way, if someone invents a new data field in the next version or deletes one, your code is able to handle it. Your code should be able to handle conversions that make sense (if employee ID went from integer to string, it should probably handle it as a string), but you'll find that those cases are pretty rare and can often be handled with common code.
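The skip-what-you-don't-know loop might be sketched as follows. The item header matches the struct above, but the ids, the type code, the Contact struct and the put_item/parse_contact helpers are assumptions made up for this example, and the buffer is handled with memcpy rather than casts.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Item header as in the answer: payload size, field id, value type.
struct ItemHeader {
    std::uint32_t size;
    std::uint32_t id;
    std::uint32_t type;
};

// Hypothetical ids and types known to this version of the program.
enum : std::uint32_t { ID_FIRST_NAME = 1, ID_PHONE = 2 };
enum : std::uint32_t { TYPE_STRING = 1 };

struct Contact {
    std::string first_name;
    std::string phone;
};

// Append one item (header followed by payload) to a buffer.
inline void put_item(std::vector<unsigned char>& buf, std::uint32_t id,
                     std::uint32_t type, const std::string& data) {
    ItemHeader h{static_cast<std::uint32_t>(data.size()), id, type};
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&h);
    buf.insert(buf.end(), p, p + sizeof h);
    buf.insert(buf.end(), data.begin(), data.end());
}

// Walk the items, keep what we understand, skip everything else.
inline Contact parse_contact(const std::vector<unsigned char>& buf) {
    Contact c;
    std::size_t pos = 0;
    while (buf.size() - pos >= sizeof(ItemHeader)) {
        ItemHeader h;
        std::memcpy(&h, buf.data() + pos, sizeof h);
        pos += sizeof h;
        if (h.size > buf.size() - pos) break;  // truncated item: stop reading
        if (h.type == TYPE_STRING) {
            std::string s(reinterpret_cast<const char*>(buf.data() + pos), h.size);
            if (h.id == ID_FIRST_NAME) c.first_name = s;
            else if (h.id == ID_PHONE) c.phone = s;
        }
        pos += h.size;  // known or not, jump over the payload
    }
    return c;
}
```

An item with an id or type this version has never heard of costs nothing more than the jump over its payload.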

Variable Length Coder 2009-11-03 04:05:12

I would make `size` be the length of the whole `item` structure, to make quick navigation easier, but this is a good approach.

Carl Norum 2009-11-03 19:42:16

Classic TLV as I know it from the telecoms world. Sounds good, thanks, but most here are addressing how the data is stored (MySQL, serialized, TLV, etc.) and what concerns me more is the underlying approach to the logic of how to handle new/modified/deleted fields.

Mawg 2009-11-04 03:41:50

Answer 5

+1 A:

Lately I'm using bencoded data. It's the format that BitTorrent uses. It's simple and you can easily inspect it visually, so it's easier to debug than binary data, and it is tightly packed. I borrowed some code from the high-quality C++ libtorrent. For your problem it's as simple as checking that the fields exist when you read them back. And for a gzip-compressed file it's as simple as doing:

ogzstream os(meta_path_new.c_str(), ios_base::out | ios_base::trunc);
Bencode map(Bencode::TYPE_MAP);
map.insert_key("url", url.get());
map.insert_key("http", http_code);
os << map;
os.close();

To read it back:

igzstream is(metaf, ios_base::in | ios_base::binary);
is.exceptions(ios::eofbit | ios::failbit | ios::badbit);
try {
   torrent::Bencode b;
   is >> b;
   if( b.has_key("url") )
      d->url = b["url"].as_string();
} catch(...) {
}

I have used Sun's XDR format in the past, but I prefer this now. Also it's much easier to read with other languages such as perl, python, etc.

piotr 2009-11-03 06:22:38

Answer 6

+1 A:

Embed a version number in the struct or do as Win32 does and use a size parameter.
If the passed struct is not the latest version, then fix up the struct.

About 10 years ago I wrote a system similar to the above for a computer game's save-game system. I actually stored the class data in a separate class description file, and if I spotted a version number mismatch then I could run through the class description file, locate the class and then upgrade the binary class based on the description. This obviously required default values to be filled in for new class member entries. It worked really well and it could be used to auto-generate .h and .cpp files as well.

Goz 2009-11-03 07:54:06

Answer 7

+2 A:

I have handled this in the past, in systems with very limited resources, by doing the translation on the PC as a part of the s/w upgrade process. Can you extract the old values, translate to the new values and then update the in-place db?

For a simplified embedded db I usually don't reference any structs directly, but do put a very light weight API around any parameters. This does allow for you to change the physical structure below the API without impacting the higher level application.
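A very small version of such an API might look like this. The parameter names, the Layout struct and the deciseconds detail are hypothetical, chosen only to show that callers never see how a value is physically stored.

```cpp
#include <cstdint>

// Hypothetical parameter ids; the application only ever uses these.
enum class Param { Brightness, TimeoutSeconds };

// The physical layout is private to this module and free to change.
namespace storage {
struct Layout {
    std::uint8_t  brightness;  // 0..255
    std::uint16_t timeout_ds;  // this layout happens to store deciseconds
};
inline Layout& db() {
    static Layout instance{128, 300};
    return instance;
}
}  // namespace storage

// Read a parameter in application units, whatever the storage uses.
inline int get_param(Param p) {
    switch (p) {
        case Param::Brightness:     return storage::db().brightness;
        case Param::TimeoutSeconds: return storage::db().timeout_ds / 10;
    }
    return 0;
}

// Write a parameter, converting to the storage representation.
inline void set_param(Param p, int value) {
    switch (p) {
        case Param::Brightness:
            storage::db().brightness = static_cast<std::uint8_t>(value);
            break;
        case Param::TimeoutSeconds:
            storage::db().timeout_ds = static_cast<std::uint16_t>(value * 10);
            break;
    }
}
```

If a later release stores the timeout in milliseconds, or moves it to a different offset, only these two functions change.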

simon 2009-11-03 15:15:11

Thanks, Simon. I did argue for doing it on the PC during s/w upgrade on my previous (embedded) project, but was argued down because they "have always done it in the device" and "it seems like too much new work". Something to push for on new projects, I would say.

Mawg 2009-11-04 03:46:06

btw, you have come closest to my "use case". Any general "philosophy" on adding/deleting/modifying a field, or upgraders who skip a version from x to x+2, or who roll back to a previous version?

Mawg 2009-11-04 03:48:08

@mawg downgrading can be problematic. We actually drew the line here and did not try to maintain values on a downgrade; the user just had to re-enter the settings manually. The decision was based on the fact that, for our product, most people would not be downgrading, and if they did they had this slight inconvenience.

simon 2009-11-05 02:09:57

@mawg as for skipping, again it was handled on the PC. If we could upgrade straight to the new values we would, and if we had to make two conversions, we would do it on the PC, then stuff all the new values back into the target. If the parameters were changing drastically, we would probably call it a new parameter, and make a judgement call on whether we could do a clean up-conversion of the value. But generally for new parameters there are no old values to up-convert; similarly for deleting, there is no new parameter to convert to. So it is really just the modify case.

simon 2009-11-05 02:17:47

@mawg, an added bonus for extracting to the pc is the ability to save an entire configuration, then restore to the state of the saved file. This was probably as useful as the upgrade conversion, allowing for multiple configurations, that the user would not otherwise get in a very limited resource situation.

simon 2009-11-05 02:21:52

Answer 8

+3 A:

I do have some code where a longer string is puzzled together from two shorter segments if necessary. Yuck. Here's my experience after 12 years of keeping some data compatible:

Define your goals - there are two:

new versions should be able to read what old versions write
old versions should be able to read what new versions write (harder)

Add version support to release 0 - At least write a version header. Together with keeping (potentially a lot of) old reader code around that can solve the first case primitively. If you don't want to implement case 2, start rejecting new data right now!

If you need only case 1, and the expected changes over time are rather minor, you are set. Anyway, these two things done before the first release can save you many headaches later.

Convert during serialization - at run time, only keep the data in the "new format" in memory. Do necessary conversions and tests at persistence limits (convert to newest when reading, implement backward compatibility when writing). This isolates version problems in one place, helping to avoid hard-to-track-down bugs.
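One way to keep the conversions in a single place is a chain of small upgrade steps applied at read time. The sketch below assumes the reader has already decoded the bytes into a version-specific struct; SettingsV1 to SettingsV3 and their fields are invented for the example.

```cpp
#include <string>

// Hypothetical settings as three successive releases stored them.
struct SettingsV1 { int width; };
struct SettingsV2 { int width; int height; };
struct SettingsV3 { int width; int height; std::string title; };

// One small step per release; defaults fill in the newly added fields.
inline SettingsV2 upgrade(const SettingsV1& s) {
    return SettingsV2{s.width, 0};
}
inline SettingsV3 upgrade(const SettingsV2& s) {
    return SettingsV3{s.width, s.height, "untitled"};
}

// The newest format needs no conversion.
inline SettingsV3 to_latest(const SettingsV3& s) { return s; }

// Anything older is upgraded one step and tried again, so a file that
// skipped a release still arrives in the newest format.
template <typename Old>
SettingsV3 to_latest(const Old& s) {
    return to_latest(upgrade(s));
}
```

The rest of the program only ever sees SettingsV3; adding a V4 means writing one more upgrade step and changing the target type.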

Keep a set of test data from all versions around.

Store a subset of available types - limit the actually serialized data to a few data types, such as int, string, double. In most cases, the extra storage size is made up by reduced code size supporting changes in these types. (That's not always a tradeoff you can make on an embedded system, though).

e.g. don't store integers shorter than the native width. (you might need to do that when you need to store long integer arrays).

add a breaker - store some key that allows you to intentionally make old code display an error message that this new data is incompatible. You can use a string that is part of the error message - then your old version could display an error message it doesn't know about - "you can import this data using the ConvertX tool from our web site" is not great in a localized application but still better than "Ungültiges Format".
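A breaker can be as small as two extra header fields. This is only a sketch of the idea: FileHeader, min_reader_version, upgrade_hint and kReaderVersion are hypothetical names, and a real format would also need to serialize the header.

```cpp
#include <string>

// Hypothetical file header written by every version since release 0.
struct FileHeader {
    int format_version;        // version that wrote the file
    int min_reader_version;    // the breaker: oldest reader allowed to open it
    std::string upgrade_hint;  // message chosen by the (newer) writer
};

// Version of the reader this code is compiled into (assumed value).
const int kReaderVersion = 2;

// Returns false and hands back the writer's message when the data was
// deliberately marked as unreadable for this reader.
inline bool can_open(const FileHeader& h, std::string& error) {
    if (h.min_reader_version > kReaderVersion) {
        error = h.upgrade_hint;
        return false;
    }
    error.clear();
    return true;
}
```

Because the message travels with the data, an old reader can show text that did not exist when it was shipped.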

Don't serialize structs directly - that's the logical / physical separation. We work with a mix of two, both having their pros and cons. None of these can be implemented without some runtime overhead, which can pretty much limit your choices in an embedded environment. At any rate, don't use fixed array/string lengths during persistence, that should already solve half of your troubles.

(A) a proper serialization mechanism - we use a binary serializer that allows you to start a "chunk" when storing, which has its own length header. When reading, extra data is skipped and missing data is default-initialized (which simplifies implementing "read old data" a lot in the serialization code). Chunks can be nested. That's all you need on the physical side, but it needs some sugar-coating for common tasks.

(B) use a different in-memory representation - the in-memory representation could basically be a map<id, record> where id would likely be an integer, and record could be

empty (not stored)
a primitive type (string, integer, double - the less you use the easier it gets)
an array of primitive types
an array of records

I initially wrote that so the guys don't ask me for every format compatibility question, and while the implementation has many shortcomings (I wish I'd recognize the problem with the clarity of today...) it could solve

Querying a non-existent value will by default return a default/zero-initialized value. Keeping that in mind when accessing the data and when adding new data helps a lot: imagine version 1 calculates "foo length" automatically, whereas in version 2 the user can override that setting. A value of zero in the "calculation type" or "length" should mean "calculate automatically", and you are set.
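The "foo length" case could be sketched like this, using a plain map of integers as a stand-in for the map<id, record>. The ids, the Settings class and effective_foo_length are assumptions for the example, not the original implementation.

```cpp
#include <map>

// Hypothetical ids for the "foo length" example.
enum SettingId { FOO_LENGTH_MODE = 1, FOO_LENGTH = 2 };
enum LengthMode { MODE_AUTO = 0, MODE_MANUAL = 1 };  // zero means "automatic"

class Settings {
public:
    void set_int(int id, long value) { ints_[id] = value; }

    // A value that was never stored reads as zero.
    long get_int(int id) const {
        std::map<int, long>::const_iterator it = ints_.find(id);
        return it == ints_.end() ? 0 : it->second;
    }

private:
    std::map<int, long> ints_;
};

// Version 2 logic that also works on data written by version 1,
// which never stored a mode and therefore reads back as MODE_AUTO.
inline long effective_foo_length(const Settings& s, long computed) {
    if (s.get_int(FOO_LENGTH_MODE) == MODE_AUTO) return computed;
    return s.get_int(FOO_LENGTH);
}
```

The design choice that matters is picking zero as the "old behaviour" value for every new setting, so that missing data needs no special case.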

The following are "change" scenarios you can expect:

a flag (yes/no) is extended to an enum ("yes/no/auto")
a setting splits up into two settings (e.g. "add border" could be split into "add border on even days" / "add border on odd days".)
a setting is added, overriding (or worse, extending) an existing setting.

For implementing case 2, you also need to consider:

no value may ever be removed or replaced by another one. (But in the new format, it could say "not supported", and a new item is added)
an enum may contain unknown values, other changes of valid range

Phew, that was a lot. But it's not as complicated as it seems.

peterchen 2009-11-04 08:05:00

That was some *excellent* advice!! And good philosophy too :-) All good, but my favourites were: "Define your goals - there are two: new versions should be able to read what old versions write; old versions should be able to read what new versions write (harder)" and "Add version support to release 0 - At least write a version header." (Most people don't even see a problem until v 2, and then it may be too late.)

Mawg 2009-11-05 06:14:48

thanks :) --- *"most people don't even see a problem until v 2"* - and you see it forever in their file formats!

peterchen 2009-11-05 07:18:49

Answer 9

+1 A:

I agree with S.Lott in that the best solution is to separate the physical and logical layers of what you are trying to do. You are essentially combining your interface and your implementation into one object/struct, and in doing so you are missing out on some of the power of abstraction.

However if you must use a single struct for this, there are a few things you can do to help make things easier.

1) Some sort of version number field is practically required. If your structure is changing, you will need an easy way to look at it and know how to interpret it. Along these same lines, it is sometimes useful to have the total length of the struct stored in a structure field somewhere.

2) If you want to retain backwards compatibility, you will want to remember that code will internally reference structure fields as offsets from the structure's base address (from the "front" of the structure). If you want to avoid breaking old code, make sure to add all new fields to the back of the structure and leave all existing fields intact (even if you don't use them). That way, old code will be able to access the structure (but will be oblivious to the extra data at the end) and new code will have access to all of the data.

3) Since your structure may be changing sizes, don't rely on sizeof(struct myStruct) to always return accurate results. If you follow #2 above, then you can see that you must assume that a structure may grow larger in the future. Calls to sizeof() are calculated once (at compile time). Using a "structure length" field allows you to make sure that when you (for example) memcpy the struct you are copying the entire structure, including any extra fields at the end that you aren't aware of.

4) Never delete or shrink fields; if you don't need them, leave them blank. Don't change the size of an existing field; if you need more space, create a new field as a "long version" of the old field. This can lead to data duplication problems, so make sure to give your structure a lot of thought and try to plan fields so that they will be large enough to accommodate growth.

5) Don't store strings in the struct unless you know that it is safe to limit them to some fixed length. Instead, store only a pointer or array index and create a string storage object to hold the variable-length string data. This also helps protect against a string buffer overflow overwriting the rest of your structure's data.
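Point 3 in particular is easy to get wrong, so here is a minimal sketch of copying by the stored length rather than by sizeof. The Config layout and copy_config are hypothetical, and the sketch assumes the length field is trustworthy and at least sizeof(Config).

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Layout this (older) code was compiled against. Hypothetical fields.
struct Config {
    std::uint16_t version;
    std::uint16_t length;     // total bytes, as written by the producer
    std::uint32_t baud_rate;
};

// Copy a config that may be longer than sizeof(Config). The fields
// appended by newer versions are carried along untouched.
inline std::vector<unsigned char> copy_config(const unsigned char* src) {
    Config head;
    std::memcpy(&head, src, sizeof head);  // the prefix this code understands
    return std::vector<unsigned char>(src, src + head.length);
}
```

Using sizeof(Config) here instead of head.length would silently drop whatever a newer producer appended.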

Several embedded projects I have worked on have used this method to modify structures without breaking backwards/forwards compatibility. It works, but it is far from the most efficient method. Before long, you end up wasting space with obsolete/abandoned structure fields, duplicate data, data that is stored piecemeal (first word here, second word over there), etc etc. If you are forced to work within an existing framework then this might work for you. However, abstracting away your physical data representation using an interface will be much more powerful/flexible and less frustrating (if you have the design freedom to use such a technique).

bta 2009-11-09 22:44:00
