views:

279

answers:

9

I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit, such as [email protected], [email protected], [email protected], etc. I want to find similar email addresses for evaluation. Currently I'm using a Levenshtein algorithm to check each e-mail against the others in the list and report any with an edit distance of less than 2. However, this is painstakingly slow. Is there a more efficient approach?

The test code I'm using now is:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Threading;

namespace LevenshteinAnalyzer
{
    class Program
    {
        const string INPUT_FILE = @"C:\Input.txt";
        const string OUTPUT_FILE = @"C:\Output.txt";

        static void Main(string[] args)
        {
            var inputWords = File.ReadAllLines(INPUT_FILE);
            var outputWords = new SortedSet<string>();

            for (var i = 0; i < inputWords.Length; i++)
            {
                if (i % 100 == 0) 
                    Console.WriteLine("Processing record #" + i);

                var word1 = inputWords[i].ToLower();
                for (var n = i + 1; n < inputWords.Length; n++)
                {
                    if (i == n) continue;
                    var word2 = inputWords[n].ToLower();

                    if (word1 == word2) continue;
                    if (outputWords.Contains(word1)) continue;
                    if (outputWords.Contains(word2)) continue;
                    var distance = LevenshteinAlgorithm.Compute(word1, word2);

                    if (distance <= 2)
                    {
                        outputWords.Add(word1);
                        outputWords.Add(word2);
                    }
                }
            }

            File.WriteAllLines(OUTPUT_FILE, outputWords.ToArray());
            Console.WriteLine("Found {0} words", outputWords.Count);
        }
    }
}

Edit: Some of the stuff I'm trying to catch looks like:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

+8  A: 

You could start by applying some prioritization to which emails to compare to one another.

A key reason for the performance limitations is the O(n2) performance of comparing each address to every other email address. Prioritization is the key to improving performance of this kind of search algorithm.

For instance, you could bucket all emails that have a similar length (+/- some amount) and compare that subset first. You could also strip all special charaters (numbers, symbols) from emails and find those that are identical after that reduction.

You may also want to create a trie from the data rather than processing it line by line, and use that to find all emails that share a common set of suffixes/prefixes and drive your comparison logic from that reduction. From the examples you provided, it looks like you are looking for addresses where a part of one address could appear as a substring within another. Tries (and suffix trees) are an efficient data structure for performing these types of searches.

Another possible way to optimize this algorithm would be to use the date when the email account is created (assuming you know it). If duplicate emails are created they would likely be created within a short period of time of one another - this may help you reduce the number of comparisons to perform when looking for duplicates.

LBushkin
+2  A: 

You could add a few optimizations:

1) Keep a list of known frauds and compare to that first. After you get going in your algorithm, you might be able hit against this list faster than you hit the main list.

2) Sort the list first. It won't take too long (in comparison) and will increase the chance of matching the front of the string first. Have it sort by domain name first, then by username. Perhaps put each domain in its own bucket, then sort and also compare against that domain.

3) Consider stripping the domain in general. [email protected] and [email protected] will never trigger your flag.

glowcoder
Chris
You are doing `toLower` very often as well. Consider doing `toLower` to the entire array before you start so you're not doing it every time.
glowcoder
+1  A: 

If you can define a suitable mapping to some k-dimensional space, and a suitable norm on that space, this reduces to the All Nearest Neighbours Problem which can be solved in O(n log n) time.

Finding such a mapping, however, might be difficult. Maybe someone will take this partial answer and run with it.

Thomas
+5  A: 

Well you can make some optimizations, assuming that the Levenshtein difference is your bottleneck.

1) With a Levenshtein distance of 2, the emails are going to be within 2 characters length of one another, so don't bother to do the distance calculations unless abs(length(email1)-length(email2)) <= 2

2) Again, with a distance of 2, there are not going to be more than 2 characters different, so you can make HashSets of the characters in the emails, and take the length of the union minus the length of the intersection of the two. (I believe this is a SymmetricExceptWith) If the result is > 2, skip to the next comparison.

OR

Code your own Levenshtein distance algorithm. If you are only interested in lengths < k, you can optimize the run time. See "Possible Improvements" on the Wikipedia page: http://en.wikipedia.org/wiki/Levenshtein_distance.

Jeff B
A: 

Just for completeness, you should consider the semantics of email addresses as well, in terms of:

  1. Gmail treats user.name and username as being the same, so both are valid email addresses belonging to the same user. Other services may do this as well. LBushkin's suggestion to strip special characters would help here.

  2. Sub-adrressing can potentially trip your filter if users wise up to it. You'd want to drop the sub-address data before comparison.

alexandru
I would say (using gmail as my example as well) parse the address or a + sign and strip everything from the + up to the @, then strip .'s, then pull all duplicates. Obviously you can't work backwards from this, so it's going to turn into a tuple where the first is a unique key, the second the original, the third the modified, then work with the list of the third ordinal of the tuple. Idk, that's how I would approach it, but based on potential sub-addressing schemes per available provider. ~~ Of course, users on their own domain could pretty easily spawn whatever address they want, so...
drachenstern
A: 

You might want to look at the full data set to see if there is other commonality between accounts that have spoofed emails.

i don't know what your application does, but if there are other key points, then use those to filter down what addresses you are going to compare.

Chris Lively
A: 

Sort everything into a hashtable first. The key should be the domain name of the email; "gmail.com". Strip out special characters from the values, as was mentioned above.

Then check all the gmail.com's against one another. That should be much faster. Do not compare things that are more than 3 characters different in length.

As a second step, check all the keys against one another, and develop groupings there. (gmail.com == googlemail.com, for example.)

Dean J
A: 

I agree with others comments about comparing email addresses not being to helpful, since users could just aswell create fraudulent disimilar looking addresses.

I think a better to come with other solutions, such as limiting the amount of emails you can write down per hour/day, or the time between those addresses being received by you and being sent to the users. Basically, work it out in a way where it is comfortable to send to send a few invites per day, but a PITA to send out many. I guess most users would forget/give up to do it if they had to do it through out a relatively long period of time in order to get their freebies.

Francisco Noriega
A: 

Is there any way you can do a check on the IP address of the person creating the email. That would be a simple way to determine, or at least give you added information about whether the different email addresses have come from the same person.

Andrew Campion