I have an application that loads lots of data into memory (it needs to perform mathematical simulations on big data sets). This data comes from several database tables that all refer to each other.

The consistency rules on the data are rather complex, and looking up all the relevant data requires quite a few hash tables and other auxiliary data structures built on top of the data.

The problem is that this data may also be changed interactively by the user in a dialog. When the user presses the OK button, I want to perform all the checks to verify that they didn't introduce inconsistencies in the data. In practice, all the data needs to be checked at once, so I cannot update my data set incrementally and perform the checks one by one.

However, all the checking code works on the actual data set loaded in memory and uses the hash tables and other data structures. This means I have to do the following:

  • Take the user's changes from the dialog
  • Apply them to the big data set
  • Perform the checks on the big data set
  • Undo all the changes if the checks fail

I don't like this solution, since other threads are also continuously using the data set, and I don't want to halt them while performing the checks. Also, the undo means the old state needs to be kept aside, which is not feasible either.

An alternative is to separate the checking code from the data set (and let it work on explicitly given data, e.g. coming from the dialog), but this means the checking code cannot use the hash tables and other additional data structures, because they are only built on the big data set, making the checks much slower.

What is a good practice for checking a user's changes to complex data before applying them to the application's data set?

+3  A: 

I would try by any means to verify changes before applying them to the data set, as undoing the ripple effects of changes which later turn out to be invalid can easily become a nightmare.

If there is really a lot of data, I understand that creating a full copy of it may not be feasible - although in general "copy on write" would be the simplest and safest solution. If you really are only able to verify the changes by taking into account the whole set of data, you could try a "decorator"-like approach, i.e. somehow creating a "view" of the changes layered on top of the existing body of data, without actually modifying the latter. This could be used to validate the changes, and if the validation succeeds, you can actually apply the changes; otherwise you can simply throw away the "view" and the changes, without affecting the original data in any way.
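To make the idea concrete, here is a rough Java sketch of such a layered "view"; the Item type and the Long keys are just hypothetical stand-ins for the real model:

```java
import java.util.HashMap;
import java.util.Map;

// Pending edits live in an overlay map; lookups fall through to the
// untouched base data, so the original is never modified.
class OverlayView {
    private final Map<Long, Item> base;                       // shared, never modified here
    private final Map<Long, Item> overlay = new HashMap<>();  // this dialog's pending edits

    OverlayView(Map<Long, Item> base) {
        this.base = base;
    }

    void stageChange(Long id, Item changed) {
        overlay.put(id, changed);                  // shadows the base entry
    }

    Item get(Long id) {
        Item local = overlay.get(id);
        return local != null ? local : base.get(id);
    }

    void commitTo(Map<Long, Item> target) {        // only after validation succeeds
        target.putAll(overlay);
    }

    void discard() {                               // on failure: throw the view away
        overlay.clear();
    }
}

class Item { /* placeholder for a real domain object */ }
```

Validation runs against the view via get(), so it sees the pending edits layered over the unchanged base; a failed validation simply calls discard().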

Péter Török
+4  A: 

This is probably not much help now, since your app is already built and you probably don't want to reimplement it, but I'll mention it for reference.

Using an ORM framework would help you here. Not only does it handle getting the data from the database into an object-oriented representation, it also provides the tools to implement isolated temporary changes and views:

  • Using the ORM framework with transactions, you can allow the user to change the objects in the model without affecting other users, and without committing the data "for real" until it has been checked. The ACID guarantees of transactions ensure that your changes are not persisted to the database, but held in your transaction, visible only to you. You can then run checks on the data and commit the transaction only if the data validates. If the data doesn't validate, you roll back the transaction and discard the changes. If it does validate, you commit the transaction and the changes are made permanent. (See the sketch after this list.)

  • Alternatively, you can create views which provide your data for validation. The views combine the base data and temporary tables (local to your current connection). This avoids locking tables, at the expense of having to write and maintain the views.
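As a rough sketch of the first approach, assuming JPA (the DataRow entity, the persistence unit, and the checksPass call are placeholders for your real model and checks):

```java
import javax.persistence.*;

@Entity
class DataRow {                        // placeholder entity for illustration
    @Id Long id;
    String value;
}

class ValidatedUpdate {
    // emf is assumed to be configured for your persistence unit
    static void applyAndValidate(EntityManagerFactory emf, long rowId, String newValue) {
        EntityManager em = emf.createEntityManager();
        try {
            em.getTransaction().begin();
            // Changes to managed entities are local to this transaction
            DataRow row = em.find(DataRow.class, rowId);
            row.value = newValue;

            if (checksPass(em)) {
                em.getTransaction().commit();    // make the changes permanent
            } else {
                em.getTransaction().rollback();  // discard them entirely
            }
        } finally {
            if (em.getTransaction().isActive()) em.getTransaction().rollback();
            em.close();
        }
    }

    static boolean checksPass(EntityManager em) {
        return true;  // stand-in for the real consistency checks
    }
}
```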

EDIT: If you already have a rich object model in memory, the hardest part of making it support incremental, local and isolated changes is direct references between objects. When you want to replace object A with A', which contains a change, you don't want to do a deep copy with all references, since you mention that your object model is large. You also don't want to have to update all objects that were pointing to A to reference A'. As an example, consider a very large doubly linked list: it's not possible to create a new list that is the same as the old one with just one element changed, without duplicating the entire list.

You can achieve isolation by storing the identifiers of related objects rather than the objects themselves. E.g. instead of referencing A explicitly, your collaborators store a reference to the unique key that identifies A, key(A). This key is used to fetch the actual object at the time it is needed (e.g. during verification). Your model then becomes a large map of keys to objects, which can be decorated for local changes. When looking up an object by key, first check the local map for a value, and if it is not found, check the universal map. To change A to A', you add an entry to the local map that maps key(A) to A'. (Note that A and A' have the same key, since logically they are the same item.) When you run your verification code, local changes are then incorporated, since objects referring to key(A) will get A', while other users using key(A) will get the original, A.

This may sound complex written down, but removing explicit references and resolving them on demand is the only way of supporting isolated updates without having to do a deep copy of the data.
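Here is a minimal Java sketch of that layered map, with a hypothetical Node type and String keys standing in for your real types:

```java
import java.util.HashMap;
import java.util.Map;

// Objects refer to each other by key; a local map of changes is layered
// over the big shared map, so one user's edits stay invisible to others.
class LayeredModel {
    private final Map<String, Node> shared;                  // the big shared data set
    private final Map<String, Node> local = new HashMap<>(); // this user's pending changes

    LayeredModel(Map<String, Node> shared) {
        this.shared = shared;
    }

    // Replace A with A' locally: both share the same key.
    void put(String key, Node replacement) {
        local.put(key, replacement);
    }

    // Lookup used by the verification code: local changes win.
    Node resolve(String key) {
        Node n = local.get(key);
        return n != null ? n : shared.get(key);
    }
}

// Nodes store keys of related objects, not direct references.
class Node {
    final String key;
    final String nextKey;  // e.g. the "next" link of a linked list, by key
    Node(String key, String nextKey) { this.key = key; this.nextKey = nextKey; }
}
```

The verification code walks the structure through resolve(), so it automatically sees A' wherever this user replaced A, while other threads resolving the same keys against the shared map still see the original A.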

An alternative, but equivalent, way is for your validator to use a map to look up objects' replacements before it uses them. E.g. your user modifies A, so you put A->A' into the map. The validator iterates over the model and comes across A. Before using A, it checks the map and finds A', which it then uses. The difficulty with this approach is that you have to make sure you check the map every time before an object is used. If you miss one, your view of the model will be inconsistent.
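A tiny sketch of that lookup as a generic helper (the Replacements name is hypothetical):

```java
import java.util.Map;

// Resolve an object through the replacement map before every use.
class Replacements<T> {
    private final Map<T, T> map;  // original -> modified (A -> A')

    Replacements(Map<T, T> map) { this.map = map; }

    // Must be applied on *every* access; a missed lookup makes the
    // validator's view of the model inconsistent.
    T resolve(T original) {
        return map.getOrDefault(original, original);
    }
}
```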

mdma
The ORM wouldn't help here. Getting all the data from the DB into an object-oriented representation in memory is not the problem; it's about applying changes with complex checks on the OO representation, and doing this without updating all the classes before doing the checks. The view concept itself is a better idea.
Patrick
The ORM gives you more than what you are dismissing. It gives you the isolation that you presently don't have, and the ability to apply changes to individual objects without mass duplication. I've updated my post with how you can manually code something similar.
mdma
Do you mean I could use an ORM tool as an in-memory database? And have my threads connect to this in-memory database as different users?
Patrick
@mdma, Great explanation. Sorry I can only give a +1 and not +2 for this.
Patrick
I have something similar, but in my case, A (and thus A') keeps track of the objects pointing to A. How would you deal with these back-pointers if they were needed during validation?
eli
@Patrick An ORM can map to in-memory databases such as HSQLDB: http://hsqldb.org/
Xavier Combelle
A: 

Hmm, rather than working on the loaded data directly, I would suggest copying it in memory. This is expensive, but it allows you to work on all the data concurrently. When the changes to the copy are valid, apply them to the main data set using some locking strategy. This way you don't need any undo, as long as you can apply the changes atomically. You could even try some transaction system if your needs are more complex. Also think about lazy-loading (copying) your data only as you really need it. Finally, if you need to work on large data sets from databases using transactions, consider using Prolog: it might be reasonable to formulate your checks as predicates.
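A minimal sketch of this copy-validate-publish idea, using an atomic reference to an immutable snapshot; DataSet, Edit and their methods are placeholders:

```java
import java.util.concurrent.atomic.AtomicReference;

// Readers always see a consistent snapshot; a failed validation needs
// no undo because the shared reference is never touched.
class SnapshotHolder {
    private final AtomicReference<DataSet> current =
            new AtomicReference<>(new DataSet());

    DataSet read() {
        return current.get();                         // readers need no locking
    }

    boolean tryApply(Edit edit) {
        DataSet copy = current.get().copyWith(edit);  // expensive, but isolated
        if (!copy.isConsistent()) {
            return false;                             // drop the copy, no undo needed
        }
        // With multiple concurrent writers, use compareAndSet in a retry loop.
        current.set(copy);                            // atomic publish
        return true;
    }
}

// Minimal placeholders so the sketch compiles
class DataSet {
    DataSet copyWith(Edit e) { return new DataSet(); }
    boolean isConsistent() { return true; }
}
class Edit {}
```

Because the shared reference is only swapped after validation succeeds, a failed check costs nothing but the discarded copy.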

Gabriel Ščerbák
As I said, the application is a complex scientific application that uses a lot of data that is continuously changed (because of simulation aspects). Some of the bigger data sets at my customers require about 6 GB of RAM (it's a 64-bit application). The sheer size of the data and how intensively it is used mean that I only want to load it once from the database (so no continuous loading from the database), and I don't want to have multiple copies of the data in memory either.
Patrick
@Patrick Then I guess you cannot get around this: "I don't like this solution since other threads are also continuously using the data set, and I don't want to halt them while performing the checks."
Gabriel Ščerbák
A: 

Sounds as if you should instead move the rules etc. to the database where they belong; by having the checks in your app, you will always have issues. By placing as much of the logic as possible in, for instance, stored procedures that are run when the user inserts the values, you could catch and roll back invalid input. But I guess you have your reasons for keeping it all in memory.
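From the application side it could look roughly like this with plain JDBC; the table name, the check_consistency procedure and the connection URL are all hypothetical:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class DbSideValidation {
    static void insertWithCheck(String url, String value) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url)) {
            conn.setAutoCommit(false);            // manual transaction control
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO measurements(value) VALUES (?)")) {
                ps.setString(1, value);
                ps.executeUpdate();

                // Let the database enforce the rules, e.g. via a stored
                // procedure assumed to raise an error on invalid data.
                try (CallableStatement check =
                         conn.prepareCall("{call check_consistency()}")) {
                    check.execute();
                }
                conn.commit();                    // all checks passed
            } catch (SQLException e) {
                conn.rollback();                  // invalid input: undo the insert
                throw e;
            }
        }
    }
}
```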

Anders K.
Yep, I have. Imagine Excel only recalculating all its formulas when you save your Excel file. Or Word only recalculating the table of contents when you save your Word file. Interactive consistency checking is very important in my application.
Patrick