I've had an unusual client request regarding fraud detection in an online raffle, which I'm trying to build efficiently by pushing as much of the work as possible into SQL.
The structure is thus:
table: codes

raffle_code | ticket_type | sequence
A00000001   | Red         | 1
A00000002   | Red         | 2
...
A0000000X   | Red         | X
B00000001   | Blue        | 1
etc.
There can be X ticket_types, each with its own sequence. The raffle codes are unique hashes; I've simplified them for the example.
Now the problem is that these codes are issued at random, so any event where they are entered in sequence is (probably) fraudulent. The entries of the codes are recorded in a transaction file like so:
table: transactions

user_id | raffle_code | transaction_number | timestamp
1       | A00000032   | 1                  | ...
etc.
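For concreteness, a rough schema sketch (the column types are my assumptions, not the client's spec):

CREATE TABLE codes (
    raffle_code  VARCHAR(64) PRIMARY KEY,  -- a unique hash in practice
    ticket_type  VARCHAR(16) NOT NULL,
    sequence     INTEGER     NOT NULL,
    UNIQUE (ticket_type, sequence)         -- each type runs its own sequence
);

CREATE TABLE transactions (
    transaction_number INTEGER     PRIMARY KEY,  -- global entry counter
    user_id            INTEGER     NOT NULL,
    raffle_code        VARCHAR(64) NOT NULL REFERENCES codes (raffle_code),
    timestamp          TIMESTAMP   NOT NULL
);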
The desired rule is that if more than 5 raffle codes of a given type are entered in sequence within a given window of 100 transactions, the event is suspect and the users involved should be pulled.
My initial thought was to detect fraud events first, then scoop up the fraudulent user transactions for each fraud event separately. The rough SQL I was thinking of was something like:
SELECT c.ticket_type,
       MIN(t.timestamp), MAX(t.timestamp),
       MIN(t.transaction_number), MAX(t.transaction_number)
FROM codes c
JOIN transactions t ON c.raffle_code = t.raffle_code
GROUP BY c.ticket_type
HAVING (MAX(c.sequence) - MIN(c.sequence)) = (COUNT(*) - 1)
   AND (MAX(t.transaction_number) - MIN(t.transaction_number)) < 100
   AND COUNT(*) > 4;
(Thanks to Bell over in this question for sparking the idea)
This would give a place in the transaction file (a transaction_number range and timestamp range) to re-query for user IDs. It would require eyes-on verification before a suspect event is declared fraud, but that's currently an accepted caveat of the system.
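Pulling the suspect users out of a flagged event would then be a simple follow-up range query (the range and type here are placeholders for values returned by the query above):

SELECT DISTINCT t.user_id
FROM transactions t
JOIN codes c ON c.raffle_code = t.raffle_code
WHERE t.transaction_number BETWEEN 1234 AND 1333  -- transaction_number range from the event
  AND c.ticket_type = 'Red';                      -- type flagged in that event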
The problem with the above SQL is that if "red" tickets with the sequences 3, 4, 5, 6, 7, 48 occur within a given 100 transactions, the event is not detected: the sequence range is 45, far greater than the count of 6, and therefore looks like an expected occurrence if the tickets were being entered randomly.
The transaction file is destined to be quite large and see frequent hits, so dredging it 100 transactions at a time with server-side code that picks through it and builds transaction stacks to analyse would get very inefficient.
Has anyone encountered anything like this before? The client has been quite vague about their requirements beyond a very short text brief, so I'm also entertaining alternative concepts for detecting sequential input.
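One direction I've been toying with (an untested sketch, assuming a database with window functions, e.g. PostgreSQL or SQL Server, and that each code is entered at most once) is the classic gaps-and-islands pattern: partition entries by ticket_type, order by sequence, and group rows whose sequence climbs by exactly 1. That way the 3,4,5,6,7 run forms its own island, separate from the 48 outlier, and each run's length and transaction spread can be tested directly:

WITH entered AS (
    SELECT c.ticket_type, c.sequence, t.transaction_number
    FROM transactions t
    JOIN codes c ON c.raffle_code = t.raffle_code
),
runs AS (
    SELECT ticket_type, sequence, transaction_number,
           -- rows in the same consecutive run share (sequence - row number):
           sequence - ROW_NUMBER() OVER (
               PARTITION BY ticket_type ORDER BY sequence
           ) AS island
    FROM entered
)
SELECT ticket_type,
       MIN(sequence)           AS first_seq,
       MAX(sequence)           AS last_seq,
       COUNT(*)                AS run_length,
       MIN(transaction_number) AS first_txn,
       MAX(transaction_number) AS last_txn
FROM runs
GROUP BY ticket_type, island
HAVING COUNT(*) > 5                                             -- more than 5 in sequence
   AND MAX(transaction_number) - MIN(transaction_number) < 100; -- within 100 transactions

The caveat is that the 100-transaction window is tested against the whole island, so a long run spread across more than 100 transactions would need its island split further by transaction proximity, but it at least stops a stray outlier from masking a genuine run.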