ansaurus

Question

How to match 2 lists most effectively (fast)?

Answer 1

A:

I think the best way would be to prepare an index. Like this (Javascript)

index = [];
index["1"] = [0,1,2];
index["2"] = [3,4];

Well sorted lists aren't really required in this case.

Arpit Tambi 2009-08-09 09:32:37

That works on JavaScript, but what C# data structure supports this? I'll give you +1 if it doesn't match `<*,*<*>>`.

Kobi 2009-08-09 10:36:47

Arpit, that would require the not so fast datastructure Dictionary.

Dykam 2009-08-09 15:36:06

Answer 2

+1 A:

Edited, rewritten, nontested, should have O(source + target) performance. Usage could be MatchMaker.Match(source, target).ToList();

public static class MatchMaker
{
    public class Source
    {
     char Term { get; set; }
     IEnumerable<string> Results { get; set; }
    }

    public static IEnumerable<Source> Match(IEnumerable<string> source, IEnumerable<string> target)
    {
     int currentIndex = 0;
     var matches = from term in source
          select new Source
          {
           Term = term[0],
           Result = from result in target.FromIndex(currentIndex)
               .TakeWhile((r, i) => {
                currentIndex = i;
                return r[0] == term[0];
               })
              select result
          };
    }
    public static IEnumerable<T> FromIndex<T>(this IList<T> subject, int index)
    {
     while (index < subject.Count) {
      yield return subject[index++];
     }
    }
}

A simple LinQ, probably not the uttermost fast, but the clearest:

var matches = from result in target
              from term in source
              where result[0] == term[0]
              select new {
              Term: term,
              Result: result
              };

I'm against premature optimalization.

Dykam 2009-08-09 09:39:42

Thanks, is your 3rd line supposed to be term instead of source? I don't understand why you index as [0]?

Joan Venge 2009-08-09 09:43:07

It is indeed term. And indexed 0 to get the char value of it. You can't compare string and char.But your question was a little unclear. Now I get it, this will probably not do what you need.

Dykam 2009-08-09 09:54:00

This is O(n^2), so it might not be good enough, considering that the title includes the word "fast".

erikkallen 2009-08-09 09:55:51

While I was benchmarking anyway, this LINQ solution took 0.87 seconds for 1M elements (I added a GroupBy). Still not too bad really.

Thorarin 2009-08-09 09:59:41

Answer 3

+3 A:

I'm not sure this is worth attempting to optimize very far. You could implement some kind of binary search with this, but it's effectiveness would be rather limited. How many elements are we talking about?

Without unmatched elements in target

Assuming the lists are sorted, and no elements in target exist that cannot be matched against source:

static List<string>[] FindMatches(string[] source, string[] target)
{
    // Initialize array to hold results
    List<string>[] matches = new List<string>[source.Length];
    for (int i = 0; i < matches.Length; i++)
        matches[i] = new List<string>();

    int s = 0;
    for (int t = 0; t < target.Length; t++)
    {
        while (!MatchName(source[s], target[t]))
        {
            s++;
            if (s >= source.Length)
                return matches;
        }

        matches[s].Add(target[t]);
    }

    return matches;
}

With unmatched elements

If there is a possibility of elements existing in target that have no match in source, the above will break (if the elements are not at the end of target). To fix that, it would be best to go with a different implementation for the comparison. Instead of a boolean, we need it to return 'less than', 'equal' or 'greater than', like a Comparer when used in sorting:

static List<string>[] FindMatches(string[] source, string[] target)
{
    // Initialize array to hold results
    List<string>[] matches = new List<string>[source.Length];
    for (int i = 0; i < matches.Length; i++)
        matches[i] = new List<string>();

    int s = 0;
    for (int t = 0; t < target.Length; t++)
    {
        int m = CompareName(source[s], target[t]);
        if (m == 0)
        {
            matches[s].Add(target[t]);
        }
        else if (m > 0)
        {
            s++;
            if (s >= source.Length)
                return matches;
            t--;
        }
    }

    return matches;
}

static int CompareName(string source, string target)
{
    // Whatever comparison you need here, this one is really basic :)
    return target[0] - source[0];
}

The two are otherwise essentially the same. As you can see, you loop through the target elements once, advancing the index to the source array when you no longer find a match.

If the number of source elements is limited, it might be worth doing a little smarter search. If the number of source elements is also large, the supposed benefit of that would decrease.

Then again, the first algorithm takes 0.18 seconds with 1 million target elements on my machine, in Debug mode. The second one is even faster (0.03 seconds), but that because of the simpler comparison that is being done. Possibly you would have to compare everything up to the first whitespace character, making it significantly slower.

Thorarin 2009-08-09 09:42:13

Thanks, around 1M elements.

Joan Venge 2009-08-09 09:44:15

1M, that is quite a lot. Is it something to be done live?

Dykam 2009-08-09 09:47:05

No just once when the program starts.

Joan Venge 2009-08-09 10:21:19

Only problem I have with this answer is it fails if there are elements in the target list that are not matchable by the source list. Which might be fine of course!

Andrew Barrett 2009-08-09 10:35:40

@Andrew: I've added a different flavor to account for that possibility, but I've moved away from the `NameMatch` signature that OP provided. You would need to need a full comparer for that (less than, equal, greater than), like in your example.

Thorarin 2009-08-09 11:25:07

@Thorarin I was very tempted to write something like that, but I didn't want to add the complexity of fiddling with the loop counter for the OP.

Andrew Barrett 2009-08-09 11:53:57

Answer 4

A:

Well you obviously stop looping over the target list as soon as you go past the current source prefix. In this case you be better of with a prefix method than the matches one so you can tell what the current prefix is and stop searching the target if you go past it.

pjp 2009-08-09 09:49:16

Answer 5

+1 A:

As the items are sorted, you can just loop though the lists:

string[] source = {"1", "2", "A", "B" };
string[] target = { "1 - new music", "1 / classic", "1 | pop", "2 edit", "2 no edit", "A - sing", "B (listen)" };

List<string>[] matches = new List<string>[source.Length];
int targetIdx = 0;
for (int sourceIdx = 0; sourceIdx < source.Length; sourceIdx++) {
   matches[sourceIdx] = new List<string>();
   while (targetIdx < target.Length && NameMatch(source[sourceIdx], target[targetIdx])) {
      matches[sourceIdx].Add(target[targetIdx]);
      targetIdx++;
   }
}

Guffa 2009-08-09 10:11:00

Essentially the same as my implementation, but with the source and target looping swapped. For some reason, the compiler seems to like this solution slightly less. Probably because there is some optimization for the `for` loop, and the number of target elements is larger than the number of source elements. In any case, the difference is very minor, and arguably your version might be more easy to understand, as it doesn't use negative logic?

Thorarin 2009-08-09 10:21:24

Answer 6

+1 A:

Here is an answer that only loops through both lists once, using the logic that both are sorted as an optimisation. Like most have said, I wouldn't worry too much about optimisations as it's likely to be fast enough with any of these answers, I would go for the most readable and maintainable solution.

That being said, I need something to do with my cup of coffee so here you go. One of the advantages of the below is that it allows things in the target list that have no matches in the source list, although I'm unsure if you'd require that functionality.

class Program
{
    public class Source
    {
        private readonly string key;
        public string Key { get { return key;}}

        private readonly List<string> matches = new List<string>();
        public List<string> Matches { get { return matches;} }

        public Source(string key)
        {
            this.key = key;
        }
    }

    static void Main(string[] args)
    {
        var sources = new List<Source> {new Source("A"), new Source("C"), new Source("D")};
        var targets = new List<string> { "A1", "A2", "B1", "C1", "C2", "C3", "D1", "D2", "D3", "E1" };

        var ixSource = 0;
        var currentSource = sources[ixSource++];

        foreach (var target in targets)
        {
            var compare = CompareSourceAndTarget(currentSource, target);

            if (compare > 0)
                continue;

            // Try and increment the source till we have one that matches 
            if (compare < 0)
            {
                while ((ixSource < sources.Count) && (compare < 0))
                {
                    currentSource = sources[ixSource++];
                    compare = CompareSourceAndTarget(currentSource, target);
                }
            }

            if (compare == 0)
            {
                currentSource.Matches.Add(target);
            }

            // no more sources to match against
            if ((ixSource > sources.Count))
                break;
        }

        foreach (var source in sources)
        {
            Console.WriteLine("source {0} had matches {1}", source.Key, String.Join(" ", source.Matches.ToArray()));
        }
    }

    private static int CompareSourceAndTarget(Source source, string target)
    {
        return String.Compare(source.Key, target.Substring(0, source.Key.Length), StringComparison.OrdinalIgnoreCase);
    }
}

Andrew Barrett 2009-08-09 10:18:54

Hmm, too many conditionals for my liking, and I find the way you iterate over sources slightly confusing. Maybe it's just that *I* haven't had my coffee yet. On the other hand, it does put the matches in the source objects like the question stated, but I figured that was an exercise to be left to the OP, for code brevity :)

Thorarin 2009-08-09 10:32:26

The iteration over the source comes when (compare < 0), currentSource = sources[ixSource++];. You only need to iterate over the source list when it is less than the current target. It's not the nicest code in the world, but I gave myself the challenge to write it while I was drinking my coffee, so it's what I could do in that time. :)

Andrew Barrett 2009-08-09 10:33:42

Well, my new version (that accounts for elements that can't be matched) isn't all that pretty either. Particularly the `t--` to counteract the advancement in the `for` loop, but the alternative was a `while` loop and some dodgy use of `continue` :P

Thorarin 2009-08-09 11:35:34

Answer 7

A:

Since they are sorted, isn't it just a basic O(N) merge loop?

ia = ib = 0;
while(ia < na && ib < nb){
  if (A[ia] < B[ib]){
    // A[ia] is unmatched
    ia++;
  }
  else if (B[ib] < A[ia]){
    // B[ib] is unmatched
    ib++;
  }
  else {
    // A[ia] matches B[ib]
    ia++;
    ib++;
  }
}
while(ia < na){
  // A[ia] is unmatched
  ia++;
}
while(ib < nb){
  // B[ib] is unmatched
  ib++;
}

Mike Dunlavey 2009-08-10 01:51:06

ansaurus

tags:

views:

answers:

How to match 2 lists most effectively (fast)?

Without unmatched elements in target

With unmatched elements

related questions