I've been developing an in-house DSP application (Java with hooks for Groovy/Jython/JRuby, plugins via OSGi, plenty of JNI to go around) in a data-flow/diagram style, similar to Pure Data and Simulink. My current design is a push model: the user interacts with some source component, causing it to push data onto the next component, and so on until an end block (typically a display or file writer). This design has some unique challenges, specifically when a component starves for input: there is no easy way to request more. I have mitigated some of this with feedback control flow, e.g. an FFT block can broadcast to the source block of its chain that it needs more data. I've contemplated adding support for components to be push, pull, or both.
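To make the starvation case concrete, here is a minimal sketch of a block that accepts pushed samples but can also pull from upstream when it is short of a full frame. All names (`Source`, `Sink`, `FrameBlock`, `flush`) are hypothetical and only illustrate the push/pull/both idea; they are not taken from the actual application.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class HybridBlockSketch {

    /** Pull half (hypothetical): something that can be asked for more samples. */
    interface Source {
        double[] read(int count);
    }

    /** Push half (hypothetical): something that accepts samples handed to it. */
    interface Sink {
        void push(double[] samples);
    }

    /** A block that needs fixed-size frames, as an FFT block would. */
    static final class FrameBlock implements Sink {
        private final int frameSize;
        private final Source upstream;   // may be null in a pure-push graph
        private final Sink downstream;
        private final Deque<Double> buffer = new ArrayDeque<>();

        FrameBlock(int frameSize, Source upstream, Sink downstream) {
            this.frameSize = frameSize;
            this.upstream = upstream;
            this.downstream = downstream;
        }

        /** Push path: buffer the samples and emit every complete frame. */
        @Override
        public void push(double[] samples) {
            for (double s : samples) {
                buffer.addLast(s);
            }
            while (buffer.size() >= frameSize) {
                double[] frame = new double[frameSize];
                for (int i = 0; i < frameSize; i++) {
                    frame[i] = buffer.removeFirst();
                }
                downstream.push(frame);
            }
        }

        /** Pull path: if a partial frame is waiting, ask upstream for exactly what is missing. */
        void flush() {
            if (buffer.isEmpty() || upstream == null) {
                return;
            }
            int missing = frameSize - buffer.size();
            push(upstream.read(missing));
        }
    }
}
```

With this shape the feedback broadcast becomes a direct call: the starved block asks its upstream `Source` for the missing samples instead of signalling the head of the chain.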

I'm looking for responses regarding the merits of push vs pull vs both/hybrid. Have you done this before? What are some of the "gotchas"? How did you handle them? Is there a better solution for this problem?

+1  A: 

Some experience with a "mostly-pull" approach in a large-scale product:

Model: Nodes build a 1:N tree, i.e. each component (except the root) has 1 parent and 1..N children. Data flows almost exclusively from parent to children. Change notifications can originate from any node in the tree.

Implementation: All leaves are notified with the sending node's id and a "generation" counter. Leaves know which node path they depend on, so they know whether they need to update. (Any other child-node update algorithm would do, too, and might have been better in hindsight.)

Leaves query their parent for current data, and the query bubbles up recursively. The generation counter is included, so the bubble-up stops at the originating node.
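Roughly, the notify-then-pull mechanism looks like the sketch below. This is a simplified illustration, not the product's code: `Node`, `Tree`, `change`, `pull` and the string payload are all made-up names, and a single tree-wide generation counter is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class GenerationPullSketch {

    static final class Node {
        final String id;
        final Node parent;                       // null for the root
        final UnaryOperator<String> transform;   // how this node derives its data from the parent's
        String cached = "";
        long cachedGeneration = 0;
        int recomputations = 0;                  // only here to make the behaviour observable

        Node(String id, Node parent, UnaryOperator<String> transform) {
            this.id = id;
            this.parent = parent;
            this.transform = transform;
        }

        /** True if the given node lies on the path from this node up to the root. */
        boolean dependsOn(String nodeId) {
            for (Node n = this; n != null; n = n.parent) {
                if (n.id.equals(nodeId)) {
                    return true;
                }
            }
            return false;
        }

        /** Pull current data; the query stops at the first node already at this generation. */
        String pull(long generation) {
            if (parent == null || cachedGeneration >= generation) {
                return cached;
            }
            cached = transform.apply(parent.pull(generation));
            cachedGeneration = generation;
            recomputations++;
            return cached;
        }
    }

    static final class Tree {
        final List<Node> leaves = new ArrayList<>();
        final List<String> displayed = new ArrayList<>();
        long generation = 0;

        /** A change at any node: bump the generation, then notify all leaves. */
        void change(Node origin, String newData) {
            generation++;
            origin.cached = newData;
            origin.cachedGeneration = generation;
            for (Node leaf : leaves) {
                if (leaf.dependsOn(origin.id)) {   // the leaf decides whether it cares
                    displayed.add(leaf.id + "=" + leaf.pull(generation));
                }
            }
        }
    }
}
```

Intermediate nodes shared by several leaves recompute once per generation, and leaves outside the originating node's subtree never pull at all.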

Advantages:

  • Parent nodes don't need much/any information about their children. Data can be consumed by anyone - this allowed a generic approach to implementing some (initially unexpected) non-UI functionality on top of the data intended for display
  • Child nodes can aggregate and delay updates (avoiding repaints sure beats fast painting)
  • Inactive leaves cause no data traffic at all
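The last two points can be sketched like this: a notification only marks the leaf dirty, and data is pulled at most once per refresh tick, and never while the leaf is inactive. `Leaf`, `onChanged` and `tick` are hypothetical names for illustration, and the parent is reduced to a `Supplier`.

```java
import java.util.function.Supplier;

final class CoalescingLeafSketch {

    static final class Leaf {
        private final Supplier<String> pullFromParent;  // stands in for the query to the parent
        private boolean active = true;
        private boolean dirty = false;
        int pulls = 0;        // only here to make the behaviour observable
        String shown = "";

        Leaf(Supplier<String> pullFromParent) {
            this.pullFromParent = pullFromParent;
        }

        /** Change notification: cheap, no data is transferred. */
        void onChanged() {
            dirty = true;
        }

        void setActive(boolean active) {
            this.active = active;
        }

        /** Called e.g. once per UI frame: many notifications collapse into one pull. */
        void tick() {
            if (!active || !dirty) {
                return;
            }
            shown = pullFromParent.get();
            pulls++;
            dirty = false;
        }
    }
}
```

Because the dirty flag survives while the leaf is inactive, the leaf catches up with a single pull on the first tick after it is reactivated.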

Disadvantages:

  • Incremental updates are expensive, as full data is published. The implementation actually allows for different data packets to be requested (and the generation counter could prevent unnecessary data traffic), but the data packets as initially designed are very large. Slicing them was an afterthought, but works OK.
  • You need a really good generation mechanism. The one initially implemented collided with initial updates (which need special handling - see "incremental updates") and with aggregation of updates
  • The need for data travelling up the tree was greatly underestimated.
  • Publish is cheap only when the node offers read-only access to current data. This might require additional update synchronization, though
  • Sometimes you want intermediate nodes to update, even when all leaves are inactive
  • Some leaves ended up implementing polling, and some base nodes ended up relying on that. Ugly.


Generally:

Data-Pull "feels" more native to me when the data and processing layers should know nothing about the UI. However, it requires a complex change notification mechanism to avoid "Updating the universe".

Data-Push simplifies incremental updates, but only if the sender intimately knows the receiver.

I have no experience of similar scale using other models, so I can't really make a recommendation. Looking back, I see that I've mostly used pull, which was less of a hassle. It would be interesting to see other people's experiences.

peterchen
Since this is mostly pull, is the root node of your tree the last node in the flow/chain? Is it possible to have multiple end nodes (multiple displays)?
basszero
Leaves pull from their parent (so the general flow direction is from root towards leaf). Multiple displays would be multiple leaves.
peterchen
A: 

I work on a pure-pull image processing library. It's more geared to batch-style operations where we don't have to deal with dynamic inputs and for that it seems to work very well. Pull works especially well for large data sets and for threading: we scale linearly to at least 32 CPUs (depending on the graph being evaluated, of course, heh).

We have a GUI that allows leaves to be dynamic data sources (for example, a video camera delivering frames) and they are handled by throwing away and rebuilding the relevant parts of the graph on a change. This is cheap in our case, so the overhead isn't that high.
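As a rough illustration of the "throw away and rebuild" idea (not the library's actual code; `Node`, `buildGraph` and the toy operations are assumptions made for this sketch), each node evaluates lazily on first pull and caches its result, and a new frame simply gets a fresh graph:

```java
import java.util.Arrays;
import java.util.function.Supplier;

final class RebuildOnChangeSketch {

    /** Pull node: computes its value on the first request, then caches it for good. */
    static final class Node<T> {
        private final Supplier<T> compute;
        private T value;
        private boolean evaluated;

        Node(Supplier<T> compute) {
            this.compute = compute;
        }

        synchronized T get() {
            if (!evaluated) {
                value = compute.get();
                evaluated = true;
            }
            return value;
        }
    }

    /** Builds a toy chain (source -> double each value -> sum) for one input frame. */
    static Node<Integer> buildGraph(int[] frame) {
        Node<int[]> source = new Node<>(() -> frame);
        Node<int[]> doubled = new Node<>(
            () -> Arrays.stream(source.get()).map(v -> v * 2).toArray());
        return new Node<>(() -> Arrays.stream(doubled.get()).sum());
    }
}
```

Since nodes never change once evaluated, there is no invalidation logic at all: when the camera delivers a new frame, the caller drops the old result node and calls `buildGraph` again for the part of the graph that depends on it.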

jrgc