views:

94

answers:

1

I've had an unusual client request in regard to fraud detection in an online raffle which I'm trying to build as efficently as possible by pushing as much of the task into SQL as possible.

The structure is thus:


table: codes

raffle_code | ticket_type | sequence

A00000001     Red            1
A00000002     Red            2
...
A0000000X     Red            X
B00000001     Blue           1

etc

There can be X ticket_types, each with it's own sequence. The Raffle codes are unique hashes, I've just simplified for the example.

Now the problem is that these codes are issued at random and any event where they are entered in sequence is (probably) fraudulent. The enteries of the codes are recorded in a transaction file like so:

table: transactions

user_id  |  raffle_code | transaction_number | timestamp
1           A00000032      1              etc...

The desired rule is that if more than 5 raffle codes, of a given type, are entered in sequence within a given set of 100 transactions it is suspect and the users involved should be pulled.

My intial thought was to attempt to detect fraud events first, then scoop the fraudulent user-transactions for each fraud event seperately. The rough SQL I was thinking of was something like:


SELECT c.raffle_code, MIN(t.timestamp), MAX(t.timestamp), MIN(t.transaction_number), MAX(t.transaction_number)
FROM codes c, transactions t
WHERE c.raffle_code = t.raffle_code
GROUP BY c.raffle_code
HAVING ( MAX(c.sequence) - MIN(c.sequence) ) = ( COUNT(*) -1 )
AND    ( MAX(t.transaction_number) - MIN(t.transaction_number) )  4;

(Thanks to Bell over in this question for sparking the idea)

This would give a place in the transaction file (transaction_number range and timestamp range) to re-query for User IDs. It would require eyes-on verification of a suspect event to be declared fraud, but that's currently an accepted caveat of the system.

The problem with the above SQL is if "red" tickets with the sequence "3,4,5,6,7,48" occur in a given 100 transactions, it's not detected as the range is 45, which is far greater than the count of 6 and therefore to be an expected occurance if the tickets were being entered randomly.

The transaction file is destined to be quite large and have frequent hits, so dredging it 100 transactions at a time using server code to pick through it and build transaction stacks to analyse would get very inefficent.

Anyone encountered anything like this before? The client has still been quite vague with their requirements other than a very short text brief, so I'm also entertaining alternative concepts for detecting sequential input.

+1  A: 

The impression I get from your question is that you're saying the set "3,4,5,6,7,48" does not count as fraudulent as it doesn't meet the specified criteria but that you feel it should count as it feels a bit suspect.

The problem of detecting a possible fraudulent transaction set is certainly tricky - it's easy for us to feel that a transaction set is fraudulent without being able to define clear rules for the computer to follow.

Why not simply lower the length of a sequential set that counts as being fraudulent?

Your intuition might be that this would throw up too many false positives. Ignore that unless you have data to back it up!

Consider this:

  • What is the probability of a non-cheating user entering a few as two sequential raffle codes? Call this probability X.
  • What is the probability of a cheating user entering as few as two sequential raffle codes? Call this probability Y.

If X >> Y, then any set of two or more sequential raffle codes has a high probability of being fraudulent.

Try starting out by defining a fraudulent transaction set as containing two or more sequential raffle codes from the same user (within a set of 100 transactions). See how often this actually happens.

If you find you get too many false positives, increase the minimum sequential sequence length to 3. Repeat as needed.

I feel you're being overly cautious about identifying as potentially fraudulent sets of transactions that are not. Try not to think too hard about this.

This has more to do with probability than precision so you're never going to be 100% right and you're not going to be close to 100% right to begin with. Some training and testing and examination of what actually happens is the better approach.

Jon Cram
I agree that collecting false positives is nigh inevitable, a legitimate user obtaining a run of tickets may happen since they'll be physically issued from one pool.The first point you make is of course right, 3,4,5,6,7,48 is fraudulent, but only 3,4,5,6,7 - disregarding the 48, I think I confused issues by saying that a sequence range of 45 was to be expected, as in 3, 12, 48 is ok, but the above is not.Probability of being legit vs fraudulent is a good point, I've been developing a user weighting system that highlights users that exceed a certain number of suspect actions.
bkgordon