Design pattern advice: push model vs. pull model

Hello everyone,

My application has several workers (separate processes, each doing a different job) and some resources (work units). Every worker needs to process every work unit. For example, with workers W1, W2 and W3 and work units U1 and U2, W1 needs to process U1 and U2, and so do W2 and W3. The restriction is that two workers cannot work on the same work unit at the same time.

I have two designs and would like advice on which one is better.

Push model: a central job scheduler assigns work units to the workers and ensures that no two workers are working on the same work unit at the same time;
Pull model: each worker asks a central job scheduler for a work unit to process, and the scheduler selects an appropriate work unit that is not being processed by any other worker.

I want to know the pros and cons of each design. One of my major concerns is finding a loosely coupled design (it is one of my major goals, but not the only one). I am not sure whether the push model or the pull model has better extensibility (is option 1 more loosely coupled?).

Thanks in advance, George

The advantage of the "pull" model is that each worker knows locally how much it is loaded and thus can manage its load.

Also, the "pull" model might be more "decoupled" as the variable of "load" is kept local to the worker whereas in the "push" model one would need a communication protocol (and overhead) to communicate this state.

Think of the success of the "pull" model in the auto industry: it went from the traditional "push" model where inventories would be difficult to track and required lots of feedback to the now successful and ubiquitous "pull" model.

When it comes to scaling, you can have an intermediate layer of "schedulers" that "poll" for jobs from the layer above. The base workers can now interact with the intermediate layer in a partitioned way.

Note that in either model a coordination communication protocol is required: it is the nature of the coordination protocol that differs. In the "push" model, there is an additional control loop required to report/poll the "load factor" of each worker. As one scales the system, more bandwidth is required, more state must be kept on the scheduler side, more latency is incurred, etc.
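To make the pull model concrete, here is a minimal single-process sketch, not a definitive implementation. The class `PullScheduler` and its methods `request` and `done` are hypothetical names chosen for illustration; a real system would put them behind whatever inter-process transport the workers use. It assumes, as in the question, that every worker must process every unit and that a unit may be held by only one worker at a time.

```python
import threading


class PullScheduler:
    """Hypothetical central scheduler for the pull model (illustration only)."""

    def __init__(self, workers, units):
        self._lock = threading.Lock()
        # Every worker must eventually process every unit.
        self._pending = {w: set(units) for w in workers}
        # Units currently held by some worker.
        self._busy = set()

    def request(self, worker):
        """Return a unit this worker still needs and nobody holds, else None."""
        with self._lock:
            free = sorted(self._pending[worker] - self._busy)
            if not free:
                return None
            self._busy.add(free[0])
            return free[0]

    def done(self, worker, unit):
        """Called by the worker when it has finished the unit."""
        with self._lock:
            self._busy.discard(unit)
            self._pending[worker].discard(unit)


# Example: three workers, two units, as in the question.
sched = PullScheduler(["W1", "W2", "W3"], ["U1", "U2"])
first = sched.request("W1")   # W1 pulls a unit
second = sched.request("W2")  # W2 pulls a different unit
third = sched.request("W3")   # nothing free right now, W3 retries later
```

Note that the scheduler keeps no load information here: a worker that is busy or overloaded simply does not call `request`.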

The downside is that you then need to build in tracking to ensure that workloads that have been pulled and then fail can be redistributed. Having said that, I do prefer the pull model as it allows decoupling of the central workload from those requesting work.

MattC 2009-12-01 13:59:00

This can be solved in the usual way: the status of an assigned job must be reported to a "control entity" of some sort.

jldupont 2009-12-01 14:01:16

If a "job" fails to be reported within "time X", then a failure recovery can be triggered.

jldupont 2009-12-01 14:02:24
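The status-report-plus-timeout idea from the two comments above could be sketched roughly as follows. `LeaseTracker`, `assign`, `report` and `expire` are hypothetical names, and the injectable `clock` parameter is an assumption added only so the behaviour can be exercised without waiting in real time.

```python
import time


class LeaseTracker:
    """Hypothetical "control entity": a job not reported within `timeout`
    seconds of being assigned is treated as failed."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._deadlines = {}  # job id -> time by which it must be reported

    def assign(self, job):
        """Record that a job has been handed to a worker."""
        self._deadlines[job] = self._clock() + self._timeout

    def report(self, job):
        """Worker reports completion. Returns False if the job is unknown
        or has already been expired."""
        return self._deadlines.pop(job, None) is not None

    def expire(self):
        """Return the jobs whose deadline has passed and forget them,
        so that the caller can redistribute them."""
        now = self._clock()
        late = [job for job, deadline in self._deadlines.items() if deadline <= now]
        for job in late:
            del self._deadlines[job]
        return late
```

The same tracker works whether the job was pushed or pulled; only the place where `assign` is called differs.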

"Also, the "pull" model might be more "decoupled" as the variable of "load" is kept local to the worker whereas in the "push" model one would need a communication protocol (and overhead) to communicate this state." -- I would like more detail: what communication protocol is needed for the push model? And why does the pull model not need one? (I think that, pull or push, two parties need a communication protocol in order to work together.)

George2 2009-12-01 14:18:42

MattC, regarding your comment -- "pull model as it allows decoupling of the central workload from those requesting work" -- why do you think the push model does not have the same decoupling effect? Could you describe this in more detail, please?

George2 2009-12-01 14:20:27

MattC, regarding your comment -- "you then need to build in tracking to ensure that workloads that have been pulled and fail can be redistributed" -- I think both the pull and the push model need to track failures. Why do you think only the pull model has this overhead?

George2 2009-12-01 14:21:20

@Georges2: why don't you draw a block diagram so we can have a more productive interaction?

jldupont 2009-12-01 14:38:49

"In the "push model", there is an additional control loop required to report/poll the "load factor" of each worker" -- I agree, but I think it is the same for both the push and the pull model. Suppose that, in the pull model, a worker asks the scheduler which work unit it should run. To answer that request with the optimal work unit, the scheduler also has to maintain load factors and historical data.

George2 2009-12-01 17:34:17

hmmm... a "worker" *pulls* a job from the *job todo repository*. In asking for a Job, the client worker can specify its "limits". Please draw a diagram before we continue this discussion.

jldupont 2009-12-01 17:50:11

@Georges2: better still, start a new Question with specific points (don't forget to "accept" an answer from here ;-)

jldupont 2009-12-01 18:08:14

Cool, question answered!

George2 2009-12-03 05:50:02

Why simpler? Could you provide more details please?

George2 2009-12-01 14:16:05

Although I am a proponent of the "pull" model, your "answer", without any supporting facts, seems "vaporous" at best.

jldupont 2009-12-01 14:31:55

updated the content

Jader Dias 2009-12-01 15:31:47

Jader, I am not sure why the push model needs duplex communication. In my design of the push model, the scheduler starts a worker and assigns it work units; the worker then starts its work and, when it completes, reports the completion status to the scheduler. So there is just one-way communication, i.e. the worker sends work item status information to the scheduler.

George2 2009-12-01 17:30:45

No: 1. worker subscribes to scheduler; 2. scheduler sends job to worker; 3. worker sends results back.

Jader Dias 2009-12-01 17:41:38
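The three steps in the comment above (subscribe, scheduler sends job, worker sends result back) might look like this in a single process. This is only a sketch under stated assumptions: `PushScheduler`, `subscribe`, `dispatch` and `result` are hypothetical names, and the `deliver` callback stands in for whatever channel the scheduler would use to reach a worker process.

```python
class PushScheduler:
    """Hypothetical scheduler for the push model (illustration only)."""

    def __init__(self, units):
        self._units = list(units)
        self._inbox = {}    # worker -> callable used to push a unit to it
        self._pending = {}  # worker -> units it has not processed yet
        self._current = {}  # worker -> unit it is processing right now
        self.results = []   # (worker, unit, outcome) tuples reported back

    def subscribe(self, worker, deliver):
        """Step 1: the worker registers a way for the scheduler to reach it."""
        self._inbox[worker] = deliver
        self._pending[worker] = list(self._units)

    def dispatch(self):
        """Step 2: push one free unit to every idle worker.
        Returns the (worker, unit) pairs that were sent."""
        sent = []
        for worker, deliver in self._inbox.items():
            if worker in self._current:
                continue  # this worker is still busy
            busy = set(self._current.values())
            free = [u for u in self._pending[worker] if u not in busy]
            if free:
                self._current[worker] = free[0]
                deliver(free[0])
                sent.append((worker, free[0]))
        return sent

    def result(self, worker, outcome):
        """Step 3: the worker sends its result back, which releases the unit."""
        unit = self._current.pop(worker)
        self._pending[worker].remove(unit)
        self.results.append((worker, unit, outcome))
        return unit
```

Both directions are visible here: `deliver` carries data from scheduler to worker, and `result` carries data from worker to scheduler.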

Thanks Jader, question answered!

George2 2009-12-03 05:50:37
