
Hi,

I have an idea for how to solve this problem, but I wanted to know if there's an easier, more extensible approach.

The program I'm working on has two basic forms of data: images, and the information associated with those images. The image information was previously stored in a JET database of extreme simplicity (four tables), which turned out to be both slow and incomplete in the fields it stored. We're moving to a new data storage implementation. Given the simplicity of the data structures involved, I was thinking that a database was overkill.

Each image will have information of its own (capture parameters), will be part of a group of interrelated images (taken in the same thirty-minute period, say), and will in turn be part of a larger group altogether (taken of the same person). Right now, I'm storing people in a dictionary keyed by a unique identifier. Each person then has a List of the different picture groups, and each picture group has a List of pictures. All of these classes are serializable, and I'm just serializing and deserializing the dictionary. Fairly straightforward stuff. Images are stored separately, so that the dictionary doesn't become astronomical in size.

The problem is: what happens when I need to add new information fields? Is there an easy way to set up these data structures to account for potential future revisions? In the past, the way I'd handle this in C was to create a serializable struct with lots of empty bytes (at least a kilobyte) for future extensibility, with one of the bytes in the struct indicating the version. When the program read the struct, it would know which deserialization to use based on a massive switch statement (and old versions could read new data, because the extra data would just go into fields that were ignored).

Does such a scheme exist in C#? Say I have a class that's a group of String and Int fields, and then I add another String field to the class: how can I deserialize an object from disk and then add the string to it? Do I need to resign myself to having multiple versions of the data classes, and a factory that takes a deserialization stream and handles deserialization based on version information stored in a base class? Or is a class like Dictionary ideal for storing this kind of information, since it will deserialize all the fields on disk automatically, and if new fields are added, I can just catch exceptions and substitute blank Strings and Ints for those values?
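A minimal sketch of .NET 2.0's version-tolerant serialization, which addresses roughly this scenario (class and field names here are hypothetical):

    using System;
    using System.Runtime.Serialization;

    [Serializable]
    public class CaptureInfo
    {
        public string CameraModel;
        public int ExposureMs;

        // Field added in version 2 of the class. [OptionalField]
        // tells the formatter not to fail when the field is absent
        // from a stream written by an older version.
        [OptionalField(VersionAdded = 2)]
        public string Operator;

        [OnDeserializing]
        private void SetDefaults(StreamingContext context)
        {
            // Runs before deserialization; old streams that lack
            // the new field leave this default in place.
            Operator = "";
        }
    }

BinaryFormatter then reads old streams without throwing, and the callback substitutes a blank value for the missing field, with no version switch statement required.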

If I go with the dictionary approach, is there a speed hit associated with file reads/writes as well as with parameter retrieval? I figure that if there are just fields in a class, field retrieval is essentially instant, whereas a dictionary lookup carries some small overhead.
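For comparison, a sketch of the dictionary-of-fields idea, using TryGetValue with a fallback rather than catching exceptions (names are hypothetical):

    using System.Collections.Generic;

    public static class ParameterBag
    {
        // Missing keys (fields added after the file was written)
        // fall back to a supplied default instead of throwing.
        public static T Get<T>(IDictionary<string, object> bag,
                               string key, T fallback)
        {
            object value;
            return bag.TryGetValue(key, out value) ? (T)value : fallback;
        }
    }

    // e.g. int exposure = ParameterBag.Get(bag, "ExposureMs", 0);

The per-lookup cost is a hash computation and a cast instead of a direct field load: small, but measurable if done millions of times.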

Thanks!

+2  A: 

SQLite is what you want. It's a fast, embeddable, single-file database with bindings for most languages.

With regards to extensibility, you can store your models with default attributes, and then have a separate table for attribute extensions for future changes.

A year or two down the road, if the code is still in use, you'll be happy that 1) other developers won't have to learn a customized code structure to maintain the code, 2) you can export, view, and modify the data with standard database tools (there's an ODBC driver for SQLite files, and various query tools), and 3) you'll be able to scale up to a larger database with minimal code changes.
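A minimal sketch of that layout, assuming the System.Data.SQLite ADO.NET provider and hypothetical table/column names:

    using System.Data.SQLite;

    class SchemaSetup
    {
        static void Main()
        {
            using (var conn = new SQLiteConnection("Data Source=images.db"))
            {
                conn.Open();
                const string sql = @"
                    CREATE TABLE IF NOT EXISTS Picture (
                        PictureID  INTEGER PRIMARY KEY,
                        FilePath   TEXT NOT NULL
                    );
                    -- Extension table: new per-picture attributes can
                    -- be added later without altering Picture itself.
                    CREATE TABLE IF NOT EXISTS PictureAttribute (
                        PictureID  INTEGER REFERENCES Picture(PictureID),
                        Name       TEXT NOT NULL,
                        Value      TEXT
                    );";
                using (var cmd = new SQLiteCommand(sql, conn))
                    cmd.ExecuteNonQuery();
            }
        }
    }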

Douglas Mayle
So, a separate table for each of the other tables? Would I then load them into class objects, and so have to maintain both the objects and the data tables? How would I upgrade tables? Lots of details in these database implementations I don't understand...
mmr
+1  A: 

There's a database schema, whose name I can't remember, that can handle this sort of situation. You basically have two tables: one stores the variable names, and the other stores the variable values. If you want to group the variables, add a third table with a one-to-many relationship to the variable-name table. This setup has the advantage of letting you keep adding different variables without having to keep changing your database schema. It has saved my bacon quite a few times when dealing with departments that change their minds frequently (like Marketing).

The only drawback is that the variable-value table has to store the actual value as a string column (varchar or nvarchar, in practice). Then you have to deal with the hassle of converting the values back to their native representations. I currently maintain something like this: the value table has around 800 million rows, and it's still fairly fast, as I can retrieve certain variations of values in under a second.
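A sketch of that conversion hassle (the type-name tag and mapping are hypothetical): every value comes back as text and has to be parsed into its native type on read.

    using System;

    static class EavConvert
    {
        // Map a stored type tag plus the raw text back to a value.
        public static object FromString(string typeName, string raw)
        {
            switch (typeName)
            {
                case "int":      return int.Parse(raw);
                case "double":   return double.Parse(raw);
                case "datetime": return DateTime.Parse(raw);
                default:         return raw; // leave as plain text
            }
        }
    }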

hectorsosajr
That's called an EAV model (Entity/Attribute/Value). It's popular in fields like pharmaceutical testing where they don't have a good handle on the possible columns.
Turnkey
Interesting. So, I'd have a couple of tables (probably in SQLite, I'm gathering), and then maintain a bunch of classes that the data is loaded into, using the dictionary as the in-program memory, right?
mmr
+1  A: 

My brain is fried at the moment, so I'm not sure I can advise for or against a database, but if you're looking for version-agnostic serialization, you'd be a fool to not at least check into Protocol Buffers.

Here's a quick list of implementations I know about for C#/.NET:

  • protobuf-net (Marc Gravell's implementation)
  • protobuf-csharp-port (Jon Skeet's port of Google's code)
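As a rough sketch of why this helps with versioning, here's what a message might look like with protobuf-net (hypothetical class and field names):

    using System.IO;
    using ProtoBuf;

    [ProtoContract]
    public class Capture
    {
        [ProtoMember(1)] public string CameraModel;
        [ProtoMember(2)] public int ExposureMs;
        // A field added later just takes the next unused tag number;
        // old readers skip unknown tags, and old files simply leave
        // the new field at its default.
        [ProtoMember(3)] public string Operator;
    }

    class Demo
    {
        static void Main()
        {
            var c = new Capture { CameraModel = "X100", ExposureMs = 20 };
            using (var fs = File.Create("capture.bin"))
                Serializer.Serialize(fs, c);
            using (var fs = File.OpenRead("capture.bin"))
                c = Serializer.Deserialize<Capture>(fs);
        }
    }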

Alex Lyman
Sadly, this is C#, so Protocol Buffers aren't available for that language (yet!)
mmr
Sure there are, mmr! I've updated the answer to include links to the implementations I know about.
Alex Lyman
Thanks, I'll check them out. I'm loving that there are so many different answers to try out... if only I had the time to understand them all
mmr
For what I need, protocol buffers ended up being the right call; we're using Marc Gravell's version.
mmr
Glad I could help out, mmr.
Alex Lyman
+1  A: 

I'm no C# programmer, but I like the mmap() call, and I saw there's a project doing that sort of thing for C#.

See Mmap

Structured files perform very well when tailored to a specific application, but they are difficult to manage and the code is hardly reusable. A better solution is a virtual-memory-like implementation:

  • Up to 4 gigabytes of information can be managed.
  • Space can be optimized to the real data size.
  • All the data can be viewed as a single array and accessed with read/write operations.
  • No need to define a structure before storing; just use and store.
  • Can be cached, and is highly reusable.
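For reference, .NET 4.0 and later ship memory-mapped files in the framework itself (System.IO.MemoryMappedFiles); a minimal sketch:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    class MmapDemo
    {
        static void Main()
        {
            // Map a file into memory and address it as a flat array
            // of bytes, as the answer describes.
            using (var mmf = MemoryMappedFile.CreateFromFile(
                       "data.bin", FileMode.OpenOrCreate, null, 1024 * 1024))
            using (var view = mmf.CreateViewAccessor())
            {
                view.Write(0, 12345);        // int at offset 0
                int n = view.ReadInt32(0);   // read it back
            }
        }
    }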
epatel
Why is that better than serializing/deserializing a dictionary to disk? It looks like I'll still have extensibility problems in the future with this.
mmr
A: 

So go with SQLite, for the following reasons:
1. You don't need to read/write the entire database from disk every time.
2. It's much easier to add to, even if you didn't leave enough placeholders at the beginning.
3. It's easier to search on anything you want.
4. It's easier to change the data in ways the application wasn't designed for.

Problems with the Dictionary approach:
1. Unless you make a smart dictionary, you need to read/write the entire database every time (and unless you carefully design the data structure, it will be very hard to maintain backwards compatibility).
   a) If you did not leave enough placeholders, you're out of luck.
2. It appears you'd have to do a linear search through all the photos in order to search on one of the capture attributes.
3. Can a picture be in more than one group? Can a picture be under more than one person? Can two people be in the same group? With dictionaries these things can get hairy...

With a database table, if you get a new attribute you can just say ALTER TABLE Picture ADD COLUMN NewAttribute Type. Then, as long as you don't require the attribute to have a value, you can still load and save older versions, while newer versions use the new attribute.
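For instance (hypothetical column name, against a System.Data.SQLite connection):

    using System.Data.SQLite;

    static class SchemaUpgrade
    {
        // Add a nullable column, so rows written by older versions
        // remain loadable: they simply read back as NULL.
        public static void AddOperatorColumn(SQLiteConnection conn)
        {
            using (var cmd = new SQLiteCommand(
                    "ALTER TABLE Picture ADD COLUMN Operator TEXT", conn))
                cmd.ExecuteNonQuery();
        }
    }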

Also, you don't need to save the picture itself in the database; just store the path to it. When the app needs the picture, load it from the disk file. This keeps the database small, and the extra seek time to read the file will most likely be insignificant compared to the time to load the image.

Probably your tables should be something like the following (a concrete DDL sketch follows the list):
Picture(PictureID, GroupID?, File Path, Capture Parameter 1, Capture Parameter 2, etc..)

If you want more flexibility you could make a table CaptureParameter(PictureID, ParameterName, ParameterValue) ... I would advise against this because it is a lot less efficient than just putting them in one table (not to mention the queries to retrieve/search the Capture Parameters would be more complicated).

Person(PersonID, Any Person Attributes like Name/Etc.)
Group(GroupID, Group Name, PersonID?)
PersonGroup?(PersonID, GroupID)
PictureGroup?(GroupID, PictureID)
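Rendered as concrete DDL, the outline above might look like this (a sketch; "Group" is renamed ImageGroup because GROUP is a reserved word, and capture parameters become ordinary columns on Picture):

    static class PhotoSchema
    {
        // Join tables make the many-to-many cases from the earlier
        // comment (picture in several groups, person in several
        // groups) straightforward.
        public const string Ddl = @"
            CREATE TABLE Person (
                PersonID  INTEGER PRIMARY KEY,
                Name      TEXT
            );
            CREATE TABLE ImageGroup (
                GroupID   INTEGER PRIMARY KEY,
                GroupName TEXT
            );
            CREATE TABLE Picture (
                PictureID INTEGER PRIMARY KEY,
                FilePath  TEXT NOT NULL
                -- capture parameters as additional columns here
            );
            CREATE TABLE PersonGroup (
                PersonID  INTEGER REFERENCES Person(PersonID),
                GroupID   INTEGER REFERENCES ImageGroup(GroupID)
            );
            CREATE TABLE PictureGroup (
                GroupID   INTEGER REFERENCES ImageGroup(GroupID),
                PictureID INTEGER REFERENCES Picture(PictureID)
            );";
    }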

Cervo
+2  A: 

Just a wee word of warning: SQLite, Protocol Buffers, mmap, et al. are all very good, but you should prototype and test each implementation and make sure you're not going to hit the same perf issues or different bottlenecks.

The simplest option may be to upsize to SQL Server Express (you may be surprised at the perf gain) and fix whatever's missing from the present database design. Then, if perf is still an issue, start investigating these other technologies.

Kev