ansaurus

Question

Flagging possible identical users in an account management system

Answer 1

+1 A:

The answer really depends upon how you model your users and what constitutes a duplicate.

There could be a user who takes every name from Harry Potter characters. Good luck finding that pattern :)

If you are looking for records that are approximately similar, try this simple approach: hash each word in the document (here, a user record) and pick the minimum hash value (the "min shingle"). Do this for k different hash functions and concatenate the k minimum values. What you have is a signature for detecting near duplicates.

To be clear, let's say a record has words w1...wn and your hash functions are h1...hk.

let m_i = min_j h_i(w_j)

and the signature is S = m1.m2.m3....mk

The cool thing with this signature is that if two documents share 90% of their words, each min hash m_i has roughly a 90% chance of being the same for both documents, so with a small k there is a good chance that the whole signature is the same. Hence, instead of looking for near duplicates, you look for exact duplicates in the signatures. If you want more matches, decrease k; if you are getting too many false positives, increase k.
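As a rough illustration of the signature described above, here is a minimal Python sketch. The names `seeded_hash` and `minhash_signature` are made up for this example, and deriving the k hash functions by salting BLAKE2b with the index i is only one assumed way to get h1...hk; any family of reasonably independent hash functions should do.

```python
import hashlib

def seeded_hash(seed: int, word: str) -> int:
    """Hypothetical h_i: BLAKE2b salted with the index i, read as a 64-bit integer."""
    digest = hashlib.blake2b(
        word.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(record: str, k: int = 4) -> tuple:
    """Return S = (m_1, ..., m_k), where m_i = min_j h_i(w_j) over the record's words."""
    # Assumption: words are whitespace-separated and compared case-insensitively.
    words = set(record.lower().split())
    if not words:
        raise ValueError("record has no words")
    return tuple(min(seeded_hash(i, w) for w in words) for i in range(k))
```

Because the minimum is taken over the set of words, word order and repeated words do not change the signature, so `minhash_signature("harry potter hogwarts")` and `minhash_signature("Hogwarts harry Potter")` come out identical.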

Of course, there is also the approach of using implicit features of users, such as their IP addresses, cookies, etc.

2010-10-24 23:50:17

@user485440 - Can you elaborate on what you mean by picking the min shingle? I am also a little confused about why we need to use k different hash functions and then concatenate the results. I am also wondering what sort of hash functions are out there for hashing alphanumeric text within an acceptable runtime?

sc_ray 2010-10-25 13:56:59

With low values of k (say 1), two records that differ by, say, 20-30% are also likely to have the same signature. As you increase k, it becomes less and less likely that these records have the same signature. But for two records that are, say, 99% the same, even with k = 10 they will most likely have the same signature. Hope this helps.
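To put rough numbers on this: if "x% the same" is read as the Jaccard similarity s of the two word sets, and the k hash functions are assumed to behave like independent random functions, each min hash matches with probability s and the whole signature matches with probability about s^k. The helper name below is made up for this sketch.

```python
def signature_match_probability(similarity: float, k: int) -> float:
    """Approximate chance that all k min hashes agree, assuming independent hashes."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    return similarity ** k

# Compare records that are 75% similar with records that are 99% similar.
for k in (1, 5, 10):
    print(k,
          round(signature_match_probability(0.75, k), 3),
          round(signature_match_probability(0.99, k), 3))
```

Under these assumptions, records that are 75% similar share a full signature about 6% of the time at k = 10, while records that are 99% similar still match about 90% of the time.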

2010-10-25 14:43:37

Also, I don't have much idea about well performing hashes, but perhaps this deserves another question on SO.

2010-10-25 14:44:22

@user485440 - So the idea is to apply a hash to a record, keep the result, concatenate it with the result of applying another hash to the same record, and so build up a string of hashes for k different hash functions. Then do the same for the other record that needs to be compared. If the hash strings match, we have a duplicate. Is that what you are suggesting? Won't there be an overhead in performing the hashes and then doing a pattern match on the results? Multiplied across a million or a billion records, what sort of performance implications will there be?

sc_ray 2010-10-25 18:49:29

I think my original post was probably not clear enough, so I have added more text to explain the idea. Basically, the advantage of using a min-hash-based signature is that you can now look for exact matches instead of near matches, which can be done in linear time using hash tables or in O(n log n) time using a sort.
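A minimal sketch of the hash-table variant, assuming the signatures have already been computed and are stored as hashable tuples keyed by a record id (the function name and the example ids are made up here):

```python
from collections import defaultdict

def flag_duplicates(signatures: dict) -> list:
    """Group record ids whose signatures are exactly equal, in one pass."""
    buckets = defaultdict(list)
    for record_id, signature in signatures.items():
        # Exact-match lookup: records with the same signature share a bucket.
        buckets[signature].append(record_id)
    # Only buckets holding more than one record are possible duplicates.
    return [ids for ids in buckets.values() if len(ids) > 1]

suspects = flag_duplicates({"u1": (1, 2), "u2": (1, 2), "u3": (3, 4)})
```

Here `suspects` is `[["u1", "u2"]]`. Each record is touched once, so the cost grows linearly with the number of records, aside from the cost of computing the signatures themselves.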

2010-10-25 21:19:46
