Pushing data changes vs. pulling data changes within an application.

Suppose you have an application that consists of two layers:

A: A data layer that stores all the data loaded from a database or from a file
B: A layer that shows the data in a nice user interface, e.g. a graphical report

Now, data is changed in layer A. We have 2 approaches to make sure that the reports from layer B are correctly updated.

The first approach is the PUSH approach. Layer A notifies layer B via observers so layer B can update its reports.

There are several disadvantages in the PUSH approach:

If data is changed multiple times (e.g. during load or in algorithms that change a lot of data), the observers are executed many times. This can be solved by introducing a kind of buffering (suppressing observer calls while you are still making changes), but this can be very tricky, and making the right buffering calls is often forgotten.
If a lot of data is changed, the observer calls may cause an overhead that is not acceptable in the application.
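
To make the PUSH approach and the buffering idea concrete, here is a minimal Python sketch. The names `DataLayer`, `subscribe`, and `batch` are hypothetical and chosen only for illustration; a real data layer would look different.

```python
from contextlib import contextmanager


class DataLayer:
    """Layer A: stores values and pushes change notifications to observers."""

    def __init__(self):
        self._values = {}
        self._observers = []
        self._batch_depth = 0      # > 0 while notifications are being buffered
        self._pending = set()      # keys changed during the current batch

    def subscribe(self, callback):
        self._observers.append(callback)

    def set(self, key, value):
        self._values[key] = value
        if self._batch_depth > 0:
            self._pending.add(key)   # buffer instead of notifying
        else:
            self._notify({key})

    @contextmanager
    def batch(self):
        """Suppress notifications inside the block, then send one at the end."""
        self._batch_depth += 1
        try:
            yield
        finally:
            self._batch_depth -= 1
            if self._batch_depth == 0 and self._pending:
                changed, self._pending = self._pending, set()
                self._notify(changed)

    def _notify(self, changed_keys):
        for callback in self._observers:
            callback(changed_keys)


calls = []
layer_a = DataLayer()
layer_a.subscribe(lambda keys: calls.append(set(keys)))

for i in range(3):
    layer_a.set(i, i * i)        # unbuffered: one notification per change

with layer_a.batch():
    for i in range(1000):
        layer_a.set(i, i)        # buffered: one notification for the whole block
```

The context manager only partly addresses the "forgotten buffering call" problem: the closing call can no longer be forgotten, but callers still have to remember to wrap bulk changes in `batch()` in the first place.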

The other approach is the PULL approach. Layer A just remembers which data was changed and sends out no notifications (layer A is flagged dirty). After the action that was executed by the user (running an algorithm, loading a file, or something else), we go through all user interface components and ask them to update themselves. In this case layer B is asked to update itself. First it checks whether any of its underlying layers (here layer A) is dirty. If so, it gets the changes and updates itself. If layer A is not dirty, the report knows it has nothing to do.
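
A minimal Python sketch of the PULL approach, under the assumption that there is a single consumer (one report) that clears the dirty state when it fetches the changes. `DataLayer`, `Report`, `is_dirty`, and `take_changes` are hypothetical names used only for this illustration.

```python
class DataLayer:
    """Layer A: records that something changed, sends no notifications."""

    def __init__(self):
        self._values = {}
        self._dirty_keys = set()

    def set(self, key, value):
        self._values[key] = value
        self._dirty_keys.add(key)     # just remember, no callbacks

    def is_dirty(self):
        return bool(self._dirty_keys)

    def take_changes(self):
        """Return the changed keys and values, and clear the dirty state."""
        changes = {k: self._values[k] for k in self._dirty_keys}
        self._dirty_keys.clear()
        return changes


class Report:
    """Layer B: refreshes itself only when asked, and only if A is dirty."""

    def __init__(self, source):
        self._source = source
        self.shown = {}
        self.refresh_count = 0

    def update(self):
        if not self._source.is_dirty():
            return False              # nothing to do
        self.shown.update(self._source.take_changes())
        self.refresh_count += 1
        return True


layer_a = DataLayer()
report = Report(layer_a)

for i in range(1000):                 # the user action changes a lot of data
    layer_a.set(i, i * 2)

first = report.update()               # one refresh for all 1000 changes
second = report.update()              # A is clean now, so this is a no-op
```

With several reports reading the same layer, clearing the flag on first read would hide the change from the others, so each consumer would need its own way to track what it has already seen.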

The best solution depends on the situation. In my situation, the PULL approach seems much better.

The situation becomes much more difficult if we have more than 2 layers. Suppose we have the following 4 layers:

A: A data layer that stores all the data loaded from a database or from a file
B: A layer that uses the data layer (layer A), e.g. to filter the data from A using a complex filter function
C: A layer that uses layer B, e.g. to aggregate data from layer B into smaller pieces of information
D: A report that interprets the results of layer C and presents it in a nice graphical way to the user

In this case, PUSHING the changes will almost certainly introduce a much higher overhead.

On the other hand, PULLING the changes requires that:

layer D has to call layer C to ask if it is dirty
layer C has to call layer B to ask if it is dirty
layer B has to call layer A to ask if it is dirty
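
The chain of questions above can be sketched in Python as follows. As an assumption for this sketch, a version counter is used instead of a boolean dirty flag, so that each layer can remember on its own what it has already seen; the class and method names (`Source`, `Derived`, `pull`) are hypothetical.

```python
class Source:
    """Layer A: bumps a version number on every change."""

    def __init__(self):
        self.version = 0
        self.queries = 0              # counts how often A is asked

    def change(self):
        self.version += 1

    def pull(self):
        self.queries += 1
        return self.version


class Derived:
    """Layers B, C and D: each one pulls from the layer directly below."""

    def __init__(self, below):
        self._below = below
        self._seen = -1               # forces a computation on the first pull
        self.recomputed = 0

    def pull(self):
        latest = self._below.pull()   # ask the layer below
        if latest != self._seen:
            self._seen = latest
            self.recomputed += 1      # stand-in for filter / aggregate / render
        return latest


a = Source()
b = Derived(a)
c = Derived(b)
d = Derived(c)

d.pull()        # first refresh: every layer computes once
d.pull()        # nothing changed: calls travel down, nothing is recomputed
a.change()
a.change()
d.pull()        # two changes in A, but each layer recomputes only once
```

In this sketch a refresh with no changes costs one call per layer boundary and one integer comparison per layer, regardless of how much data layer A holds.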

If nothing has changed, the number of calls needed before you know that nothing has changed and there is nothing to do is rather large. It seems that the performance overhead we tried to avoid by not using PUSH is now coming back to us in the PULL approach, because of the many calls asking whether anything is dirty.

Are there patterns that solve this kind of problem in a nice and high-performance (low overhead) way?

No. No free lunch, no silver bullet. It's all down to careful design. You've pretty much covered the common techniques; it's applying them cleverly that needs care and the avoidance of assumptions.

I query two of your statements:

You imply that the controlling of PUSH notifications is unduly difficult. I would have expected that in many cases you tend to have a master computation engine, which grabs data and does calculations. The engine must surely stop at some point, and at that point it can send the "New Data Ready" event, which can contain finer-grained information about what's changed.
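
As a rough Python sketch of that idea, assuming a single engine that applies a batch of updates and then stops: the engine collects what changed and sends one event at the end. `Engine`, `NewDataReady`, and `changed_keys` are hypothetical names for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NewDataReady:
    """Hypothetical event payload: which keys changed during one engine run."""
    changed_keys: frozenset


@dataclass
class Engine:
    data: dict = field(default_factory=dict)
    listeners: list = field(default_factory=list)

    def run(self, updates):
        """Apply all updates, then send a single event when the run stops."""
        changed = set()
        for key, value in updates:
            if key not in self.data or self.data[key] != value:
                self.data[key] = value
                changed.add(key)
        if changed:
            event = NewDataReady(frozenset(changed))
            for listener in self.listeners:
                listener(event)
        return changed


events = []
engine = Engine()
engine.listeners.append(events.append)

engine.run([("a", 1), ("b", 2), ("a", 3)])   # three updates, one event
engine.run([("a", 3)])                       # no actual change, so no event
```

The listener receives the set of changed keys, so a report can decide whether the change affects it at all before doing any work.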

You say that making 4 inter-layer calls is too expensive. What's the basis for that? Compared with what? If you are concerned by the multiplier factor, (10 D instances) call (5 C instances) call (2 B instances) call (1 A instance), so A gets hit with 100 calls, then surely we optimise? Each level can say "If I'm currently calling or I heard the answer recently, no need to call again".
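
A small Python sketch of the "heard the answer recently" optimisation. It assumes one reading of the example, namely that every D asks all 5 Cs, every C asks both Bs, and every B asks A, which gives 10 × 5 × 2 = 100 calls to A. The `Node` class and the `cycle` parameter (an identifier for the current refresh pass) are hypothetical.

```python
class Node:
    """One instance in any layer; caches its answer for the current refresh cycle."""

    def __init__(self, name, below=(), use_cache=True):
        self.name = name
        self.below = list(below)
        self.use_cache = use_cache
        self.dirty = False          # would be set by real changes in layer A
        self.calls = 0              # how often this node was asked
        self._cached_cycle = None
        self._cached_answer = False

    def is_dirty(self, cycle):
        self.calls += 1
        if self.use_cache and self._cached_cycle == cycle:
            return self._cached_answer          # heard the answer recently
        # Ask every dependency (no short-circuit, to show the worst case).
        answers = [node.is_dirty(cycle) for node in self.below]
        answer = self.dirty or any(answers)
        self._cached_cycle, self._cached_answer = cycle, answer
        return answer


def build(use_cache):
    a = Node("A", use_cache=use_cache)
    bs = [Node(f"B{i}", [a], use_cache) for i in range(2)]
    cs = [Node(f"C{i}", bs, use_cache) for i in range(5)]
    ds = [Node(f"D{i}", cs, use_cache) for i in range(10)]
    return a, ds


def calls_to_a(use_cache):
    a, ds = build(use_cache)
    for d in ds:
        d.is_dirty(cycle=1)
    return a.calls
```

Without the cache, `calls_to_a(False)` counts the full 100 calls to A. With the cache, each B asks A only once per cycle, so `calls_to_a(True)` counts 2.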

When we consider the scaling benefits of the layers, a few cheap queries may not be excessive.

I don't have a few instances; I have millions of instances. But your answer made me realize something important. Pushing changes is normally executed on the instance level (every instance change may push the change forward), while pulling changes is executed on the layer level (since every layer only has to pull once, regardless of the number of instances). This implies that pulling might be much faster than pushing (at least in my case). Thanks for the tip.

Patrick 2010-10-29 13:00:38

By instances I meant the report processes, the Ds, rather than the number of data items. You don't have a million D's do you? If so then I'm impressed.

djna 2010-10-30 13:45:38
