Hi,
I am working on a possible architecture for an abuse detection mechanism on an account management system. What I want is to detect possible duplicate users based on certain correlating fields within a table. To make the problem simplistic, lets say I have a USER table with the following fields:
Name
Nationality
Current Address
Login
Interests
It is quite possible that one user has created multiple records within this table. There might be a certain pattern in which this user has created his/her accounts. What would it take to mine this table to flag records that may be possible duplicates. Another concern is scale. If we have lets say a million users, taking one user and matching it against the remaining users is unrealistic computationally. What if these records are distributed across various machines in various geographic locations?
What are some of the techniques, that I can use, to solve this problem? I have tried to pose this question in a technologically agnostic manner with the hopes that people can provide me with multiple perspectives.
Thanks