Hi,

I am working on a possible architecture for an abuse-detection mechanism in an account management system. What I want is to detect possible duplicate users based on certain correlating fields within a table. To keep the problem simple, let's say I have a USER table with the following fields:

Name
Nationality
Current Address
Login
Interests

It is quite possible that one user has created multiple records within this table, and there might be a pattern in how this user has created his/her accounts. What would it take to mine this table to flag records that may be duplicates? Another concern is scale: if we have, say, a million users, taking one user and matching it against all the remaining users is computationally unrealistic. What if these records are distributed across various machines in various geographic locations?

What are some of the techniques I can use to solve this problem? I have tried to pose the question in a technology-agnostic manner in the hope that people can provide multiple perspectives.

Thanks

+1  A: 

The answer really depends upon how you model your users and what constitutes a duplicate.

There could be a user who names all of his accounts after Harry Potter characters. Good luck finding that pattern :)

If you are looking for records that are approximately similar, try this simple approach: hash each word in the record and pick the min shingle (i.e. the minimum hash value). Do this for k different hash functions and concatenate these min hashes. What you get is a near-duplicate signature.

To be clear, let's say a record has words w_1, ..., w_n and your hash functions are h_1, ..., h_k.

let m_i = min_j h_i(w_j)

and the signature is S = m_1.m_2. ... .m_k

The cool thing about this signature is that if two documents contain 90% of the same words, each individual min-hash has roughly a 90% chance of being the same for the two documents, so the concatenated signature matches with probability about 0.9^k. Hence, instead of looking for near duplicates, you look for exact duplicates among the signatures. If you want to increase the number of matches, decrease k; if you are getting too many false positives, increase k.
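To make this concrete, here is a minimal sketch in Python. Salted MD5 standing in for the k independent hash functions, the choice k = 4, and the sample records are illustrative assumptions only, not a prescribed implementation.

    # Minimal sketch of the min-hash signature described above.
    # Salted MD5 plays the role of the k independent hash functions;
    # k = 4 and the sample records are made up for illustration.
    import hashlib

    K = 4  # number of hash functions; tune as described above

    def hash_word(word, salt):
        """One of the k hash functions: salted MD5, reduced to an integer."""
        return int(hashlib.md5(f"{salt}:{word}".encode("utf-8")).hexdigest(), 16)

    def signature(record_text, k=K):
        """Concatenate the k min-hashes m_1 ... m_k into one signature string."""
        words = record_text.lower().split()
        mins = [min(hash_word(w, i) for w in words) for i in range(k)]
        return ".".join(str(m) for m in mins)

    # Records with the same word set always get the same signature;
    # near-duplicates collide with a probability that grows with their overlap.
    a = signature("john smith london uk hiking chess photography")
    b = signature("smith john uk london chess hiking photography")  # same words, reordered
    c = signature("maria garcia madrid spain tennis painting")
    print(a == b)  # True
    print(a == c)  # False (with overwhelming probability)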

Of course, there is also the approach of using implicit features of users, such as their IP addresses, cookies, etc.

@user485440 - Can you elaborate on what you mean by picking the min shingle? I am also a little confused about why we need to use k different hash functions and then concatenate them. And I am wondering: what sort of hash functions are out there for hashing alphanumeric text within an acceptable runtime?
sc_ray
With low values of k (say 1), two records that differ by, say, 20-30% are also likely to have the same signature. As you increase k, it becomes less and less likely that such records have the same signature. But for two records that are, say, 99% the same, even with k = 10 they will most likely still have the same signature. Hope this helps.
Also, I don't know much about well-performing hash functions, but perhaps that deserves another question on SO.
@user485440 - So the idea is to apply a hash to a record, keep track of the result, concatenate it with the result of applying the next hash to the same record, and build up a string of hashes for k different hash functions. Do the same for the other record that needs to be compared, and if the signatures match up we have a duplicate. Is that what you are suggesting? Won't there be overhead in computing the hashes and then doing an exact match on the hash results? Multiplying this by a million+/billion records, what sort of performance implications would there be?
sc_ray
I think my original post was probably not clear enough; I have added more text to explain the idea. But basically, the advantage of using a min-hash-based signature is that you can now look for exact matches rather than near matches, which can be done in linear time using hash tables or in n log n time using a sort.
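The exact-match step can be sketched like this: bucket every record by its signature in a hash table (a dict), which is one linear pass over the data. The signature() routine repeats the min-hash sketch from the answer above; the record ids, the sample text, and k = 4 are again made-up illustrations.

    # Minimal sketch of the exact-match step: bucket records by their
    # min-hash signature using a hash table, i.e. a single linear pass.
    import hashlib
    from collections import defaultdict

    K = 4  # number of hash functions, as in the sketch above

    def signature(text, k=K):
        """Min-hash signature: concatenation of the k per-hash minimums."""
        words = text.lower().split()
        mins = [min(int(hashlib.md5(f"{i}:{w}".encode()).hexdigest(), 16) for w in words)
                for i in range(k)]
        return ".".join(str(m) for m in mins)

    # Made-up records: (record id, concatenated USER fields)
    records = [
        (1, "john smith london uk hiking chess photography"),
        (2, "smith john uk london chess hiking photography"),
        (3, "maria garcia madrid spain tennis painting"),
    ]

    # One linear pass: group record ids by signature.
    buckets = defaultdict(list)
    for record_id, text in records:
        buckets[signature(text)].append(record_id)

    # Any bucket with more than one record id is a candidate duplicate group.
    print([ids for ids in buckets.values() if len(ids) > 1])  # [[1, 2]]

This grouping also parallelizes naturally when records live on many machines: each machine computes signatures for its own records locally, and the group-by-signature step becomes a shuffle on the signature key (MapReduce-style), which is one way to address the distribution concern raised in the question.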