We have a design like the one below, and I would like to get opinions or protocol guidelines for the following error scenario.

   Layer1                                             Remote entity
 ---------------                                     ---------------
  |      ^    ^
  | (1)  |(4) |(6)
  v      |    |
 ---------------                                     ---------------

   Layer0-----------------(2)----------------------------->Layer0
   Layer0<----------------(3)------------------------------Layer0
   Layer0<----------------(5)------------------------------Layer0

1. New session request to remote entity.
2. Establish link + data (session request).
3. Link establishment ongoing.
4. Link establishment pending.
5. Link established + data (session accepted).
6. Session accepted.

Suppose Layer1 decides, between steps 4 and 6, that it no longer needs the remote entity's service; i.e., event 4 has been received, and event 6 is still outstanding because of some error.

1) Should Layer1 wait for event 6 to happen and then initiate a session release, or
2) Should Layer1 instruct Layer0 to terminate the connection establishment procedure immediately?

Which is the correct way?

The problem with (1) is that, even though we already know we are going to terminate the session because of an error, we still have to handle all the other events until event 6 arrives.
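To make the window concrete, here is a minimal sketch (all names are hypothetical, purely for illustration) of the Layer1 states and the gap in which the cancellation decision arises:

    from enum import Enum, auto

    class SessionState(Enum):
        IDLE = auto()
        REQUEST_SENT = auto()   # after event 1
        LINK_PENDING = auto()   # after event 4
        ESTABLISHED = auto()    # after event 6

    class Layer1:
        def __init__(self):
            self.state = SessionState.IDLE

        def on_link_pending(self):       # event 4 from Layer0
            self.state = SessionState.LINK_PENDING

        def on_session_accepted(self):   # event 6 from Layer0
            self.state = SessionState.ESTABLISHED

        def cancel(self):
            if self.state is SessionState.LINK_PENDING:
                # The question: wait here for event 6 and then release
                # (option 1), or tell Layer0 to abort the establishment
                # immediately (option 2)?
                pass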

+7  A: 

I am a fan of fail-fast designs. As soon as you know that you can't continue, you should notify the other side and quit.

If for some reason you need to verify that the other side got your quit message, I prefer either responding to subsequent requests with an UNABLE_TO_COMPLY message or discarding those events entirely. The problem is that you can end up in a half-open state.

One way to handle the situation where the other side is still waiting for responses to earlier requests after you have already sent the fail message is to use a priority queue. Instead of processing requests strictly in the order they are received, you mark some messages for immediate processing no matter when they arrive. Higher-priority messages are inserted at the front of the queue, so quit_on_failure events are not blocked behind requests you already know you cannot really process.
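A minimal sketch of such a queue in Python (the names quit_on_failure and PRIO_CONTROL are mine, not part of the question's protocol):

    import heapq
    import itertools

    PRIO_CONTROL = 0   # e.g. quit_on_failure; lower number = served first
    PRIO_NORMAL = 1    # ordinary requests

    class PriorityEventQueue:
        """Pops events by priority; FIFO within the same priority."""

        def __init__(self):
            self._heap = []
            self._counter = itertools.count()  # tie-breaker keeps arrival order

        def push(self, event, priority=PRIO_NORMAL):
            heapq.heappush(self._heap, (priority, next(self._counter), event))

        def pop(self):
            _, _, event = heapq.heappop(self._heap)
            return event

    q = PriorityEventQueue()
    q.push("data_request_1")
    q.push("data_request_2")
    q.push("quit_on_failure", priority=PRIO_CONTROL)
    print(q.pop())  # -> "quit_on_failure", even though it arrived last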

I also generally dislike time-based watchdogs (the interval the developer chooses is never correct for every situation), but for these kinds of protocols you often have to define a worst-case scenario in which the other side never responds to your fail message. In those situations, a configurable timeout is usually the only way to clean up. Timeouts should always be the last resort, never the first.
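Sketched roughly (the threading.Event signalling and the function name are assumptions of this example, not the asker's design), the last-resort timeout might look like:

    import threading

    def wait_for_quit_ack(ack_received: threading.Event, timeout_s: float) -> bool:
        """Wait for the peer to acknowledge our quit/fail message.

        Returns True on a clean acknowledgement; False means the peer
        never answered and we clean up unilaterally instead of hanging.
        """
        if ack_received.wait(timeout=timeout_s):
            return True   # peer confirmed; orderly shutdown
        return False      # no ack within the configured window: force cleanup

The timeout_s value would come from configuration, precisely because no single hard-coded value suits every deployment.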

Christopher
A: 

Are there really four messages in a row in one direction (through the layers) without any acknowledgement from the receiving side? Making the client send confirmation messages, like the SYN-ACK/RST packets in TCP, would automatically take care of the scenario in question and would also help with other error modes (when the client or the network fails).
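As a rough sketch of that idea (the SESSION_* message names are hypothetical, loosely modelled on TCP's SYN/SYN-ACK/RST):

    SESSION_REQ = "SESSION_REQ"   # carried in step 2
    SESSION_ACK = "SESSION_ACK"   # immediate confirmation of receipt
    SESSION_RST = "SESSION_RST"   # abort, sendable by either side at any time

    def remote_handle(msg, session_alive):
        """Remote Layer0: confirm or reset, never leave the peer guessing."""
        if msg == SESSION_RST:
            return None, False            # peer aborted: tear down, no reply
        if msg == SESSION_REQ and session_alive:
            return SESSION_ACK, True      # confirm before doing the real work
        return SESSION_RST, False         # unknown or late message: reset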

ima
+1  A: 

You should add some kind of (non-)acknowledgement messages to your protocol, along with appropriate timeouts. A request by Layer1 to cancel a pending session can then be implemented either as a NACK to the next message from the remote session, or simply as your client failing to respond at all, with the remote session timing out due to inactivity.

As a previous poster correctly states, you cannot have a complete protocol without timeout handling, because timeouts are a good way to catch underlying transport failures. Whether you rely on timeouts alone as the way of signalling the termination of the protocol is a design decision. I would normally try to send a NACK or cancel message in the protocol, to at least try to finish the protocol in a timely way at both ends. But you need the timeout as well.
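A sketch of the remote side under these assumptions (raw sockets and a literal b"NACK" stand in for whatever transport and encoding the real protocol uses):

    import socket

    def remote_await_client(conn: socket.socket, idle_timeout_s: float) -> str:
        """Treat an explicit NACK *or* silence as cancellation.

        The NACK finishes the protocol promptly at both ends; the
        inactivity timeout is the fallback that also catches
        underlying transport failures.
        """
        conn.settimeout(idle_timeout_s)
        try:
            data = conn.recv(4096)
        except socket.timeout:
            return "cancelled_by_inactivity"
        if not data or data == b"NACK":
            return "cancelled_by_peer"
        return "proceed"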

Alan Moore
A: 

1) Should it wait for event 6 to happen and initiate a session release

IMHO, you shouldn't ever block the client on a termination request. It may be that an application needs to close because the system is shutting down. You can't prevent termination in reality anyway; the admin has control of the power switch at the lowest level. Not accepting this reality will earn products built on your protocol a reputation for being annoying.

2) Layer1 should instruct Layer 0 to terminate the connection establishment procedure

Yes.

I notice that Layer1 may also need to cancel at any time after step 1, not just after step 4.

It looks like, at any time after step 2, termination may take some degree of 'unwinding'.
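One way to picture that unwinding (a hypothetical table, not from the question):

    # How much unwinding a cancellation needs depends on how far
    # establishment had progressed when the cancel request arrived.
    UNWIND_ACTIONS = {
        "AFTER_STEP_1": ["discard local request"],                   # nothing on the wire yet
        "AFTER_STEP_2": ["send link abort", "await ack or timeout"],
        "AFTER_STEP_4": ["send link abort", "await ack or timeout"],
        "AFTER_STEP_5": ["send session release", "release link"],    # full teardown
    }

    def unwind(progress):
        return UNWIND_ACTIONS.get(progress, [])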

The local and remote entities have a network delay between them and may have different perceptions of what state things are in at any given time. For example, locally you may have completed step 2, while the remote entity believes he has completed step 5. If messages can arrive out of order (e.g., over UDP), even more possibilities arise.

The 'termination' may succeed to varying degrees, or not at all. What happens if you send your step 2, then a termination, and never hear back from the remote entity?

Higher-level clients may impose additional requirements. For example, is cancelling after step 5 different from cancelling after step 2? Do steps 3 and 5 represent a persistent modification at the remote entity (like a database commit)? Does the higher-level client need to know how far things actually progressed before his termination request got processed?

Work out all the possible combinations and make good decisions, putting yourself in the user's shoes. The user here may be an engineer trying to build on your protocol to make it 'just work' for his own users. After you've done that, consider what happens when each message gets lost or delayed along the way, particularly during the unwinding process.

Then consider it from the remote entity's point of view. His admin needs to be able to power him off sometimes, and so will need a way to efficiently terminate the connection from every state as well.

Marsh Ray
A: 

Can you tell me what advantage there is in waiting? You've already pointed out the problems with that approach, and it sounds to me like you've already discovered that option 2 is actually the best way. Correctness is just a matter of opinion here, but I would say option 2.

Layer1 should instruct Layer0 to end the connection process. That instruction should also tell Layer0 that Layer1 is no longer interested in receiving events. Layer1 can then carry on in the disconnected state.

To me, that's the simplest, most obvious, and least problematic thing to do. As you've already said, you don't need to handle any more state at Layer1, because you won't be expecting those other messages once you've closed the connection.
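A minimal sketch of that (abort_establishment and unsubscribe are hypothetical Layer0 operations, named here only for illustration):

    class Layer1:
        def __init__(self, layer0):
            self.layer0 = layer0
            self.connected = False

        def cancel_establishment(self):
            # Abort the link setup and stop receiving its events, so the
            # late-arriving events 4/6 never reach us; Layer1 simply
            # carries on in the disconnected state.
            self.layer0.abort_establishment()   # assumed Layer0 operation
            self.layer0.unsubscribe(self)       # assumed Layer0 operation
            self.connected = False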

Matt H