views:

57

answers:

2

We've been dealing with a problem with our event gateways since we upgraded from ColdFusion 8 Enterprise to ColdFusion 9 Enterprise.

We have an event gateway setup to establish a connection to a third party. They went us updates at least every 10 seconds and sometimes many times a second.

We have a Java class configured as the Event Gateway listener, and it pushes events and data to a CFC function. In that function we actually use a named <cflock> to ensure that requests are dealt with in order, and requests will be queued at this point to have exclusive access to the named lock. This lock has a 30 second timeout.

I've also got lots of debugging in this function, and I've noticed a few things:

  • Requetst will queue up before the <cflock> tag, waiting for the lock, and this queue can be over 40 events long
  • All requests are processed when they get the lock within 6 seconds. The average is actually 1.5 to 2 sec from when they get the lock to when they complete processing

So the problem is that, now and again, I will have a lock timeout occur, and it's after 30 seconds. First I log if the request is waiting for the lock. It looks like this:

"Information","Thread-23","06/23/10","15:45:18","APP1","F4B42A5A-F34D-C614-DE01B150E6906F78 (1277304318606) : PRE LOCK"

And then further down the logs, I see this for the same request:

"Error","Thread-23","06/23/10","15:45:48","APP1","F4B42A5A-F34D-C614-DE01B150E6906F78 (1277304348607) : LOCK ERROR: A timeout occurred while attempting to lock lock_ResponseDispatcher."

There's 30 seconds between them. The requests and any event data associated with it is lost at this point. Not good for me.

So I thought I'd see if the queue was being processed fast enough. I'm unsure how events are queued by <cflock>. Is there a hard limit?

Anyway, on this particular run, I saw that:

  • The request went onto the queue when there was 6 requests there already, so it's number 7 in the queuea

  • Over the next 30 seconds about 17 requests are removed from the queue

  • About the same number of requests are added to the queue

  • In this period, the request in question is not processed and after 30 seconds times out

I can't believe my eyes! It's as if the <cflock> queue is not processed as first in first out (FIFO), but first in last out (FILO)!

Is such a thing possible? Has anyone else every seen this kind of behaviour?

Many thanks in advance to anyone with any thoughts.

Ciaran

+1  A: 

I'm not sure what to make of your FIFO vs LIFO issue; but I can offer a suggestion for this:

There's 30 seconds between them. The requests and any event data associated with it is lost at this point. Not good for me.

The CFLock tag has an attribute named throwOnTimeout, which defaults to true. If you set it to false, instead of an exception being thrown, in the event of a timeout, processing will skip the locked-block of code and continue normally. You can use this to your advantage, with something like this:

<cfset made_it_through_lock = false />
<cflock name="single_threaded_lock_name" throwOnTimeout="false">
    <!--- ... do stuff ... --->
    <cfset made_it_through_lock = true />
</cflock>
<cfif made_it_through_lock eq false>
    <!--- 
          log the event data that you don't want to lose, then abort, 
          setting the necessary http status code & headers
    --->
</cfif>
Adam Tuttle
I was actually catching the exception by wrapping the locked code in a try/catch. But +1 for that suggestion.
Ciaran Archer
Yeah, that's another option but throwing and catching exceptions is expensive. If it's an expected behavior, you would probably see better performance by not using them. I'm wondering -- can you distill down the FIFO vs LIFO problem into a simple example that does just the locking and timeouts to show that it _really is_ LIFO?
Adam Tuttle
@Adam, it's not really expected behaviour you see. However, I was trying to think of a way to demonstrate the problem simply. I suspect the key issue is that fact that the CFC in which this lock exists is called through an async event gateway. Calling this CFC normally just results in the requests getting queued by CF (and they show up in cfstat). Any calls via an event gateway don't even register on cfstat. I'll do some tinkering and get back to you.
Ciaran Archer
@Adam, I think I've figured this out - see below! Thanks.
Ciaran Archer
+2  A: 

I think the key here is that fact that I am using event gateways which are asynchronous. In fact after conducting an experiment the reason for my problem seemed very obvious :)

I have specifed the number of threads that can be used to handle event gateway requests in CF admin (see under Event Gateway -> Settings). This setting is stuck at 1 on CF Dev Edition, but can be increased in Enterprise Edition. I increased it to 5 for this experiment. It's only when it's increased that you see this strange behaviour.

So my event gateway code is really simple, it just creates a UUID (so I can track the request in my logs) and then locks the thread for 5 seconds to simulate real processing. This sleep occurs inside of a cflock call, so only one thread can do the processing at once. We need this to avoid duplicates being processed in the real code.

Here is the CFC in full:

component {

    public void function onIncomingMessage (required struct msg) {

        var sys = createObject("java", "java.lang.System");
        var tag = createUUID();
        var logFile = "test\gatewaytest";

        writelog (file=logFile, text="#tag# - about to queue");

        try {

            lock name="myTestLock" timeout="120" {
                writelog (file=logFile, text="#tag# - got lock");

                thread action="sleep" duration="5000"; //ms 
            }

            writelog (file=logFile, text="#tag# - released lock");

        } catch (any e) {
            writelog (file=logFile, text="#tag# - ERROR - #e.message#");
        }
    }
}

Note the very long timeout value (2 mins) on the lock. This is to deal with the problem posed by the event gateway asynchronous processing.

The event gateway is a simple built-in CFML type with an ID of 'TestGW' which I link to the CFC above.

I set up a simple script to send events to the event gateway, here it is in full:

<cfset msg = {mymessage = "hello gateway"} />
<cfset sendGatewayMessage("TestGW", msg) />

SCENARIO 1 - SINGLE THREAD:

If the number of Event Gateway Processing Threads is set to 1 and I pound the gateway I see the following log output:

"Information","Thread-17","06/25/10","10:32:09",,"50805BB4-1C23-9073-67A70A86CA6F8E54 - about to queue"
"Information","Thread-17","06/25/10","10:32:09",,"50805BB4-1C23-9073-67A70A86CA6F8E54 - got lock"
"Information","Thread-17","06/25/10","10:32:14",,"50805BB4-1C23-9073-67A70A86CA6F8E54 - released lock"
"Information","Thread-17","06/25/10","10:32:14",,"50811F1A-1C23-9073-67AD3E9C0BF2000C - about to queue"
"Information","Thread-17","06/25/10","10:32:14",,"50811F1A-1C23-9073-67AD3E9C0BF2000C - got lock"
"Information","Thread-17","06/25/10","10:32:19",,"50811F1A-1C23-9073-67AD3E9C0BF2000C - released lock"
"Information","Thread-17","06/25/10","10:32:19",,"5081E27F-1C23-9073-67B5D2EF6AED8426 - about to queue"
"Information","Thread-17","06/25/10","10:32:19",,"5081E27F-1C23-9073-67B5D2EF6AED8426 - got lock"
"Information","Thread-17","06/25/10","10:32:24",,"5081E27F-1C23-9073-67B5D2EF6AED8426 - released lock"
"Information","Thread-17","06/25/10","10:32:24",,"5082A5E1-1C23-9073-674E9467F395686F - about to queue"
"Information","Thread-17","06/25/10","10:32:24",,"5082A5E1-1C23-9073-674E9467F395686F - got lock"
"Information","Thread-17","06/25/10","10:32:29",,"5082A5E1-1C23-9073-674E9467F395686F - released lock"

The key thing to notice here is that it is single threaded. It's all about queuing the events one at a time and everything happens in sequence.

SCENARIO 2 - MORE THREADS:

If the number of Event Gateway Processing Threads is increased to 5 and I pound the gateway I see the following log output:

"Information","Thread-18","06/25/10","11:26:01",,"526CC05B-C9E1-FADE-73CE3426BC0A3F92 - about to queue"
"Information","Thread-18","06/25/10","11:26:01",,"526CC05B-C9E1-FADE-73CE3426BC0A3F92 - got lock"
"Information","Thread-27","06/25/10","11:26:01",,"526CD0EB-049E-D382-2C3A7E3C0DBF8ED3 - about to queue"
"Information","Thread-21","06/25/10","11:26:02",,"526CDEED-C2B3-3C92-0F57CFA317AC02F8 - about to queue"
"Information","Thread-20","06/25/10","11:26:02",,"526CEE25-F25B-890C-F7501B5489C6BB21 - about to queue"
"Information","Thread-25","06/25/10","11:26:02",,"526CFD3C-EAFD-40E7-EBA2BE59B87D5936 - about to queue"
"Information","Thread-24","06/25/10","11:26:03",,"526D0FC5-E5E2-642E-452636C8838ADE33 - about to queue"
"Information","Thread-26","06/25/10","11:26:03",,"526D1096-C82E-535B-36D57D3A431D1436 - about to queue"
"Information","Thread-23","06/25/10","11:26:03",,"526D1F9C-9A9C-FA84-E153A944123E77BE - about to queue"
"Information","Thread-19","06/25/10","11:26:04",,"526D2EDC-EA54-4D83-3F6BB681A5CCAA89 - about to queue"
"Information","Thread-22","06/25/10","11:26:04",,"526D3F09-073F-2B0C-E94652D1C95B09CB - about to queue"
"Information","Thread-18","06/25/10","11:26:06",,"526CC05B-C9E1-FADE-73CE3426BC0A3F92 - released lock"
"Information","Thread-22","06/25/10","11:26:06",,"526D3F09-073F-2B0C-E94652D1C95B09CB - got lock"
"Information","Thread-22","06/25/10","11:26:11",,"526D3F09-073F-2B0C-E94652D1C95B09CB - released lock"
"Information","Thread-27","06/25/10","11:26:11",,"526CD0EB-049E-D382-2C3A7E3C0DBF8ED3 - got lock"
"Information","Thread-27","06/25/10","11:26:16",,"526CD0EB-049E-D382-2C3A7E3C0DBF8ED3 - released lock"
"Information","Thread-19","06/25/10","11:26:16",,"526D2EDC-EA54-4D83-3F6BB681A5CCAA89 - got lock"
"Information","Thread-19","06/25/10","11:26:21",,"526D2EDC-EA54-4D83-3F6BB681A5CCAA89 - released lock"

Pay particular attention to the request with UUID 526D3F09-073F-2B0C-E94652D1C95B09CB. It is the last request logged and so is at the end of the queue. Hoever, as soon as lock is available, it jumps and grabs the lock - not the 526CD0EB-049E-D382-2C3A7E3C0DBF8ED3 request which was there first.

CONCLUSION:

Since when we use more than one thread we cannot guarantee the order in which threads are processed when waiting for cflock when using event gateways. We need to ensure that the timeout value for the lock is high enough so that during busy times the queue of events can be processed in full before any one request exceeds the lock timeout.

I guess this might serve us right for using cflock with multi-threaded event gateways!

I hope this might help anyone else who experiences this issue.

Cheers, Ciaran.

Ciaran Archer