views:

189

answers:

4

Hi!

Suppose We have a, IEnumerable Collection with 20 000 Person object items. Then suppose we have created another Person object.

We want to list all Persons that ressemble this Person. That means, for instance, if the Surname affinity is more than 90 % , add that Person to the list.

e.g. ("Andrew" vs "Andrw")

What is the most effective / quick way of doing this?

Iterating through the collection and comparing char by char with affinity determination? OR? Any ideas?

Thank you!

+6  A: 

You may be interested in:

Levenshtein Distance Algorithm

Peter Norvig - How to Write a Spelling Corrector (you'll be interested in the part where he compares a word against a collection of existing words)

Ryan Ische
Yeah this is the algorithm I was looking for.. Thank you
PaN1C_Showt1Me
+1  A: 

Depending on how often you'll need to do this search, the brute force iterate and compare method might be fast enough. Twenty thousand records really isn't all that much and unless the number of requests is large your performance may be acceptable.

That said, you'll have to implement the comparison logic yourself and if you want a large degree of flexibility (or if you need find you have to work on performance) you might want to look at something like Lucene.Net. Most of the text search engines I've seen and worked with have been more file-based, but I think you can index in-memory objects as well (however I'm not sure about that).

Good luck!

akmad
+1  A: 

I'm not sure if you're asking for help writing the search given your existing affinity function, or if you're asking for help writing the affinity function. So for the moment I'll assume you're completely lost.

Given that assumption, you'll notice that I divided the problem into two pieces, and that's what you need to do as well. You need to write a function that takes two string inputs and returns a boolean value indicating whether or not the inputs are sufficiently similar. Then you need a separate search a delegate that will match any function with that kind of signature.

The basic signature for your affinity function might look like this:

bool IsAffinityMatch(string p1, string p2)

And then your search would look like this:

MyPersonCollection.Where(p => IsAffinityMatch(p.Surname, OtherPerson.Surname));
Joel Coehoorn
Well I think I didnt ask correctly, but what I really needed is that algorithm.
PaN1C_Showt1Me
A: 

I provide the source code of that Affinity method:

        /// <summary>
        /// Compute Levenshtein distance according to the Levenshtein Distance Algorithm
        /// </summary>
        /// <param name="s">String 1</param>
        /// <param name="t">String 2</param>
        /// <returns>Distance between the two strings.
        /// The larger the number, the bigger the difference.
        /// </returns>
        private static int Compare(string s, string t)
        {
            /* if both string are not set, its uncomparable. But others fields can still match! */
            if (string.IsNullOrEmpty(s) && string.IsNullOrEmpty(t)) return 0;

            /* if one string has value and the other one hasn't, it's definitely not match */
            if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(t)) return -1;

            s = s.ToUpper().Trim();
            t = t.ToUpper().Trim();

            int n = s.Length;
            int m = t.Length;
            int[,] d = new int[n + 1, m + 1];

            int cost;

            if (n == 0) return m;
            if (m == 0) return n;

            for (int i = 0; i <= n; d[i, 0] = i++) ;
            for (int j = 0; j <= m; d[0, j] = j++) ;

            for (int i = 1; i <= n; i++)
            {
                for (int j = 1; j <= m; j++)
                {
                    cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);

                    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                              d[i - 1, j - 1] + cost);
                }
            }

            return d[n, m];
        }

that means, if 0 is returned, 2 strings are identical.

PaN1C_Showt1Me