views:

394

answers:

7

I have 2 lists<string> of items, source and target. The items in the source list will have 0 to n matches in the target list, but there will not be duplicate matches.

Considering both lists are sorted, how would you do the matching most effectively in terms of performance.

Example:

source = {"1", "2", "A", "B", ...}
target = {"1 - new music", "1 / classic", "1 | pop", "2 edit", "2 no edit", "A - sing", "B (listen)", ...}

Basically the match is simple prefix match, but say you have a method called MatchName. You can use a new function if you are gonna do more optimized searches. NameMatch just compares 2 strings and returns a bool.

In the end source[0] will have source[0].Matches to contain target[0, 1 and 2] in this case.

A: 

I think the best way would be to prepare an index. Like this (Javascript)

index = [];
index["1"] = [0,1,2];
index["2"] = [3,4];

Well sorted lists aren't really required in this case.

Arpit Tambi
That works on JavaScript, but what C# data structure supports this? I'll give you +1 if it doesn't match `<*,*<*>>`.
Kobi
Arpit, that would require the not so fast datastructure Dictionary.
Dykam
+1  A: 

Edited, rewritten, nontested, should have O(source + target) performance. Usage could be MatchMaker.Match(source, target).ToList();

public static class MatchMaker
{
    public class Source
    {
     char Term { get; set; }
     IEnumerable<string> Results { get; set; }
    }

    public static IEnumerable<Source> Match(IEnumerable<string> source, IEnumerable<string> target)
    {
     int currentIndex = 0;
     var matches = from term in source
          select new Source
          {
           Term = term[0],
           Result = from result in target.FromIndex(currentIndex)
               .TakeWhile((r, i) => {
                currentIndex = i;
                return r[0] == term[0];
               })
              select result
          };
    }
    public static IEnumerable<T> FromIndex<T>(this IList<T> subject, int index)
    {
     while (index < subject.Count) {
      yield return subject[index++];
     }
    }
}


A simple LinQ, probably not the uttermost fast, but the clearest:

var matches = from result in target
              from term in source
              where result[0] == term[0]
              select new {
              Term: term,
              Result: result
              };

I'm against premature optimalization.

Dykam
Thanks, is your 3rd line supposed to be term instead of source? I don't understand why you index as [0]?
Joan Venge
It is indeed term. And indexed 0 to get the char value of it. You can't compare string and char.But your question was a little unclear. Now I get it, this will probably not do what you need.
Dykam
This is O(n^2), so it might not be good enough, considering that the title includes the word "fast".
erikkallen
While I was benchmarking anyway, this LINQ solution took 0.87 seconds for 1M elements (I added a GroupBy). Still not too bad really.
Thorarin
+3  A: 

I'm not sure this is worth attempting to optimize very far. You could implement some kind of binary search with this, but it's effectiveness would be rather limited. How many elements are we talking about?

Without unmatched elements in target

Assuming the lists are sorted, and no elements in target exist that cannot be matched against source:

static List<string>[] FindMatches(string[] source, string[] target)
{
    // Initialize array to hold results
    List<string>[] matches = new List<string>[source.Length];
    for (int i = 0; i < matches.Length; i++)
        matches[i] = new List<string>();

    int s = 0;
    for (int t = 0; t < target.Length; t++)
    {
        while (!MatchName(source[s], target[t]))
        {
            s++;
            if (s >= source.Length)
                return matches;
        }

        matches[s].Add(target[t]);
    }

    return matches;
}

With unmatched elements

If there is a possibility of elements existing in target that have no match in source, the above will break (if the elements are not at the end of target). To fix that, it would be best to go with a different implementation for the comparison. Instead of a boolean, we need it to return 'less than', 'equal' or 'greater than', like a Comparer when used in sorting:

static List<string>[] FindMatches(string[] source, string[] target)
{
    // Initialize array to hold results
    List<string>[] matches = new List<string>[source.Length];
    for (int i = 0; i < matches.Length; i++)
        matches[i] = new List<string>();

    int s = 0;
    for (int t = 0; t < target.Length; t++)
    {
        int m = CompareName(source[s], target[t]);
        if (m == 0)
        {
            matches[s].Add(target[t]);
        }
        else if (m > 0)
        {
            s++;
            if (s >= source.Length)
                return matches;
            t--;
        }
    }

    return matches;
}

static int CompareName(string source, string target)
{
    // Whatever comparison you need here, this one is really basic :)
    return target[0] - source[0];
}

The two are otherwise essentially the same. As you can see, you loop through the target elements once, advancing the index to the source array when you no longer find a match.

If the number of source elements is limited, it might be worth doing a little smarter search. If the number of source elements is also large, the supposed benefit of that would decrease.

Then again, the first algorithm takes 0.18 seconds with 1 million target elements on my machine, in Debug mode. The second one is even faster (0.03 seconds), but that because of the simpler comparison that is being done. Possibly you would have to compare everything up to the first whitespace character, making it significantly slower.

Thorarin
Thanks, around 1M elements.
Joan Venge
1M, that is quite a lot. Is it something to be done live?
Dykam
No just once when the program starts.
Joan Venge
Only problem I have with this answer is it fails if there are elements in the target list that are not matchable by the source list. Which might be fine of course!
Andrew Barrett
@Andrew: I've added a different flavor to account for that possibility, but I've moved away from the `NameMatch` signature that OP provided. You would need to need a full comparer for that (less than, equal, greater than), like in your example.
Thorarin
@Thorarin I was very tempted to write something like that, but I didn't want to add the complexity of fiddling with the loop counter for the OP.
Andrew Barrett
A: 

Well you obviously stop looping over the target list as soon as you go past the current source prefix. In this case you be better of with a prefix method than the matches one so you can tell what the current prefix is and stop searching the target if you go past it.

pjp
+1  A: 

As the items are sorted, you can just loop though the lists:

string[] source = {"1", "2", "A", "B" };
string[] target = { "1 - new music", "1 / classic", "1 | pop", "2 edit", "2 no edit", "A - sing", "B (listen)" };

List<string>[] matches = new List<string>[source.Length];
int targetIdx = 0;
for (int sourceIdx = 0; sourceIdx < source.Length; sourceIdx++) {
   matches[sourceIdx] = new List<string>();
   while (targetIdx < target.Length && NameMatch(source[sourceIdx], target[targetIdx])) {
      matches[sourceIdx].Add(target[targetIdx]);
      targetIdx++;
   }
}
Guffa
Essentially the same as my implementation, but with the source and target looping swapped. For some reason, the compiler seems to like this solution slightly less. Probably because there is some optimization for the `for` loop, and the number of target elements is larger than the number of source elements. In any case, the difference is very minor, and arguably your version might be more easy to understand, as it doesn't use negative logic?
Thorarin
+1  A: 

Here is an answer that only loops through both lists once, using the logic that both are sorted as an optimisation. Like most have said, I wouldn't worry too much about optimisations as it's likely to be fast enough with any of these answers, I would go for the most readable and maintainable solution.

That being said, I need something to do with my cup of coffee so here you go. One of the advantages of the below is that it allows things in the target list that have no matches in the source list, although I'm unsure if you'd require that functionality.

class Program
{
    public class Source
    {
        private readonly string key;
        public string Key { get { return key;}}

        private readonly List<string> matches = new List<string>();
        public List<string> Matches { get { return matches;} }

        public Source(string key)
        {
            this.key = key;
        }
    }

    static void Main(string[] args)
    {
        var sources = new List<Source> {new Source("A"), new Source("C"), new Source("D")};
        var targets = new List<string> { "A1", "A2", "B1", "C1", "C2", "C3", "D1", "D2", "D3", "E1" };

        var ixSource = 0;
        var currentSource = sources[ixSource++];

        foreach (var target in targets)
        {
            var compare = CompareSourceAndTarget(currentSource, target);

            if (compare > 0)
                continue;

            // Try and increment the source till we have one that matches 
            if (compare < 0)
            {
                while ((ixSource < sources.Count) && (compare < 0))
                {
                    currentSource = sources[ixSource++];
                    compare = CompareSourceAndTarget(currentSource, target);
                }
            }

            if (compare == 0)
            {
                currentSource.Matches.Add(target);
            }

            // no more sources to match against
            if ((ixSource > sources.Count))
                break;
        }

        foreach (var source in sources)
        {
            Console.WriteLine("source {0} had matches {1}", source.Key, String.Join(" ", source.Matches.ToArray()));
        }
    }

    private static int CompareSourceAndTarget(Source source, string target)
    {
        return String.Compare(source.Key, target.Substring(0, source.Key.Length), StringComparison.OrdinalIgnoreCase);
    }
}
Andrew Barrett
Hmm, too many conditionals for my liking, and I find the way you iterate over sources slightly confusing. Maybe it's just that *I* haven't had my coffee yet. On the other hand, it does put the matches in the source objects like the question stated, but I figured that was an exercise to be left to the OP, for code brevity :)
Thorarin
The iteration over the source comes when (compare < 0), currentSource = sources[ixSource++];. You only need to iterate over the source list when it is less than the current target. It's not the nicest code in the world, but I gave myself the challenge to write it while I was drinking my coffee, so it's what I could do in that time. :)
Andrew Barrett
Well, my new version (that accounts for elements that can't be matched) isn't all that pretty either. Particularly the `t--` to counteract the advancement in the `for` loop, but the alternative was a `while` loop and some dodgy use of `continue` :P
Thorarin
A: 

Since they are sorted, isn't it just a basic O(N) merge loop?

ia = ib = 0;
while(ia < na && ib < nb){
  if (A[ia] < B[ib]){
    // A[ia] is unmatched
    ia++;
  }
  else if (B[ib] < A[ia]){
    // B[ib] is unmatched
    ib++;
  }
  else {
    // A[ia] matches B[ib]
    ia++;
    ib++;
  }
}
while(ia < na){
  // A[ia] is unmatched
  ia++;
}
while(ib < nb){
  // B[ib] is unmatched
  ib++;
}
Mike Dunlavey