Hello,

I have several huge sorted enumerable sequences that I want to merge. These lists are manipulated as IEnumerable but are already sorted. Since the input lists are sorted, it should be possible to merge them in one pass, without re-sorting anything.

I would like to keep the deferred execution behavior.

I tried to write a naive algorithm that does that (see below). However, it looks pretty ugly and I'm sure it can be optimized. There may be a more academic algorithm...

IEnumerable<T> MergeOrderedLists<T, TOrder>(IEnumerable<IEnumerable<T>> orderedlists, 
                                            Func<T, TOrder> orderBy)
{
    var enumerators = orderedlists.ToDictionary(l => l.GetEnumerator(), l => default(T));
    IEnumerator<T> tag = null;

    var firstRun = true;
    while (true)
    {
        var toRemove = new List<IEnumerator<T>>();
        var toAdd = new List<KeyValuePair<IEnumerator<T>, T>>();
        foreach (var pair in enumerators.Where(pair => firstRun || tag == pair.Key))
        {
            if (pair.Key.MoveNext())
                toAdd.Add(pair);
            else
                toRemove.Add(pair.Key);
        }

        foreach (var enumerator in toRemove)
            enumerators.Remove(enumerator);

        foreach (var pair in toAdd)
            enumerators[pair.Key] = pair.Key.Current;

        if (enumerators.Count == 0)
            yield break;

        var min = enumerators.OrderBy(t => orderBy(t.Value)).FirstOrDefault();
        tag = min.Key;
        yield return min.Value;

        firstRun = false;
    }
}

The method can be used like this:

// Person lists are already sorted by age
MergeOrderedLists(orderedLists, p => p.Age);

assuming the following Person class exists somewhere:

    public class Person
    {
        public int Age { get; set; }
    }

Duplicates should be preserved; we don't care about their relative order in the new sequence. Do you see any obvious optimization I could use?

+1  A: 

I suspect LINQ is smart enough to take advantage of the pre-existing sort order:

IEnumerable<string> BiggerSortedList =  BigListOne.Union(BigListTwo).OrderBy(s => s);
Brent Arias
On any `IEnumerable` in general? I doubt it.
Thomas
Does Union preserve duplicates, though? I know UNION in SQL doesn't.
mgroves
It seems impossible to know that. Using OrderBy performs a sort that buffers the whole sequence in memory -- something I would like to avoid.
franck
@mgroves, Union is deferred and causes no dupes.
pipelinecache
Use Concat instead of Union to preserve all elements of the sources. Also - This is a full re-ordering and does not take advantage of the pre-sorting.
David B
If Union *doesn't* preserve dupes, then that *won't* work for franck, right?
mgroves
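As an aside, a quick illustration of the Union vs. Concat difference discussed in these comments, using two small made-up int lists:

var one = new List<int> { 1, 3, 5 };
var two = new List<int> { 3, 5, 7 };

var union  = one.Union(two);   // 1, 3, 5, 7        -- duplicates removed
var concat = one.Concat(two);  // 1, 3, 5, 3, 5, 7  -- every element kept, but the result is not merged in order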
+4  A: 

How many lists do you expect to need to merge? It looks like your algorithm will not be efficient if you have many different lists to merge. This line is the issue:

var min = enumerators.OrderBy(t => orderBy(t.Value)).FirstOrDefault();

This will be run once for each element in all the lists, so your runtime will be O(n * m), where n is the TOTAL number of elements in all the lists, and m is the number of lists. Expressed in terms of the average length of a list in the list of lists, the runtime is O(a * m^2).

If you are going to need to merge a lot of lists, I would suggest using a heap. Then each iteration you can remove the smallest value from the heap, and add the next element to the heap from the list that the smallest value came from.

Daniel Plaisted
That's a good observation. But I would say that lists to merge are likely to be 2 - 10 at most.
franck
+1, assuming you mean a heap as a priority queue implementation
orip
+2  A: 

Here is my solution:
The algorithm takes the first element of each list and puts them into a small helper class (a sorted list that accepts multiple elements with the same value). This sorted list uses a binary insert.
So the first element in this list is the element we want to return next. After doing so we remove it from the sorted list and insert the next element from its original source list (as long as that list contains any more elements). Again, we can return the first element of our sorted list. Once the sorted list is empty, we have used all elements from all the source lists and are done.

This solution uses fewer foreach statements and no OrderBy in each step, which should improve the runtime behaviour. Only the binary insert has to be done again and again.

IEnumerable<T> MergeOrderedLists<T, TOrder>( IEnumerable<IEnumerable<T>> orderedlists, Func<T, TOrder> orderBy )
{
    // Get an enumerator for each list, create a sortedList
    var enumerators = orderedlists.Select( enumerable => enumerable.GetEnumerator() );
    var sortedEnumerators = new SortedListAllowingDoublets<TOrder, IEnumerator<T>>();

    // Point each enumerator onto the first element
    foreach( var enumerator in enumerators )
    {
        // Missing: assert true as the return value
        enumerator.MoveNext();

        //  Initially add the first value
        sortedEnumerators.AddSorted( orderBy( enumerator.Current ), enumerator );
    }

    // Continue as long as we have elements to return
    while( sortedEnumerators.Count != 0 )
    {
        // The first element of the sortedEnumerator list always
        // holds the next element to return
        var enumerator = sortedEnumerators[0].Value;

        // Return this enumerators current value
        yield return enumerator.Current;

        // Remove the element we just returned
        sortedEnumerators.RemoveAt( 0 );

        // Check if there is another element in the list of the enumerator
        if( enumerator.MoveNext() )
        {
            // Ok, so add it to the sorted list
            sortedEnumerators.AddSorted( orderBy( enumerator.Current ), enumerator );
        }
    }
}

My helper class (using a simple binary insert):

private class SortedListAllowingDoublets<TOrder, T> : Collection<KeyValuePair<TOrder, T>> where T : IEnumerator
{
    public void AddSorted( TOrder value, T enumerator )
    {
        Insert( GetSortedIndex( value, 0, Count - 1 ), new KeyValuePair<TOrder, T>( value, enumerator ) );
    }

    private int GetSortedIndex( TOrder item, int startIndex, int endIndex )
    {
        if( startIndex > endIndex )
        {
            return startIndex;
        }
        var midIndex = startIndex + ( endIndex - startIndex ) / 2;
        return Comparer<TOrder>.Default.Compare( this[midIndex].Key, item ) < 0
            ? GetSortedIndex( item, midIndex + 1, endIndex )
            : GetSortedIndex( item, startIndex, midIndex - 1 );
    }
}

What's not implemented right now: check for an empty list, which will cause problems.
And the SortedListAllowingDoublets class could be improved to take a comparer instead of using the Comparer<TOrder>.Default on its own.
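The comparer improvement could look roughly like this (just a sketch of that idea, not tested):

private class SortedListAllowingDoublets<TOrder, T> : Collection<KeyValuePair<TOrder, T>> where T : IEnumerator
{
    private readonly IComparer<TOrder> comparer;

    public SortedListAllowingDoublets() : this( Comparer<TOrder>.Default ) { }

    public SortedListAllowingDoublets( IComparer<TOrder> comparer )
    {
        this.comparer = comparer;
    }

    public void AddSorted( TOrder value, T enumerator )
    {
        Insert( GetSortedIndex( value, 0, Count - 1 ), new KeyValuePair<TOrder, T>( value, enumerator ) );
    }

    private int GetSortedIndex( TOrder item, int startIndex, int endIndex )
    {
        if( startIndex > endIndex )
        {
            return startIndex;
        }
        var midIndex = startIndex + ( endIndex - startIndex ) / 2;
        return comparer.Compare( this[midIndex].Key, item ) < 0
            ? GetSortedIndex( item, midIndex + 1, endIndex )
            : GetSortedIndex( item, startIndex, midIndex - 1 );
    }
}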

tanascius
This is at least an order more elegant than mine. I like it.
sixlettervariables
Have you tried using a linked list instead of a Collection? It seems like it could be faster.
Gabe
@Gabe: No, I didn't try a linked list. This answer is a solution I just coded for the OP - I didn't use a profiler, either. But I am very optimistic that it is a fast algorithm (which, however, can be improved further ^^)
tanascius
+9  A: 

One guess I would make that might improve clarity and performance is this (sketched in code after the list):

  • Create a priority queue over pairs of T, IEnumerable<T> ordered according to your comparison function on T
  • For each IEnumerable<T> being merged, add the item to the priority queue annotated with a reference to the IEnumerable<T> where it originated
  • While the priority queue is not empty
    • Extract the minimum element from the priority queue
    • Advance the IEnumerable<T> in its annotation to the next element
    • If MoveNext() returned true, add the next element to the priority queue annotated with a reference to the IEnumerable<T> you just advanced
    • If MoveNext() returned false, don't add anything to the priority queue
    • Yield the dequeued element
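For illustration, here is a minimal sketch of these steps, assuming the PriorityQueue<TElement, TPriority> type from .NET 6+ (any binary-heap priority queue would do):

public static IEnumerable<T> MergeOrderedLists<T, TOrder>(
    this IEnumerable<IEnumerable<T>> orderedLists,
    Func<T, TOrder> orderBy)
{
    // each queue entry is an enumerator, prioritised by the key of its current element
    var queue = new PriorityQueue<IEnumerator<T>, TOrder>();

    foreach (var list in orderedLists)
    {
        var e = list.GetEnumerator();
        if (e.MoveNext())
            queue.Enqueue(e, orderBy(e.Current));
    }

    while (queue.Count > 0)
    {
        // extract the enumerator whose current element is the overall minimum
        var e = queue.Dequeue();
        var value = e.Current;

        // advance it; if it still has elements, re-insert it under its new key
        if (e.MoveNext())
            queue.Enqueue(e, orderBy(e.Current));
        else
            e.Dispose();

        yield return value;
    }
}

It can then be called like the OP's method: orderedLists.MergeOrderedLists(p => p.Age).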
Doug McClean
Incidentally, this is also how you would structure a concurrent mergesort.
MSN
Lasse V. Karlsen
Um, yup, exactly like that. Well done, Lasse.
Doug McClean
+3  A: 

Edit: Updated to use a simple linear search for re-insert; it is faster than binary search here, probably because the lists are all presorted.

Here is my fourth cut at it (thanks to @tanascius for pushing this along to something much more LINQ-y):

public static IEnumerable<T> MergePreserveOrder3<T, TOrder>(
    this IEnumerable<IEnumerable<T>> aa,
    Func<T, TOrder> orderFunc)
where TOrder : IComparable<TOrder>
{
    var items = aa.Select(xx => xx.GetEnumerator()).Where(ee => ee.MoveNext())
        .OrderBy(ee => orderFunc(ee.Current)).ToList();

    while (items.Count > 0)
    {
        yield return items[0].Current;

        var next = items[0];
        items.RemoveAt(0);
        if (next.MoveNext())
        {
            // simple sorted linear insert
            var value = orderFunc(next.Current);
            var ii = 0;
            for ( ; ii < items.Count; ++ii)
            {
                if (value.CompareTo(orderFunc(items[ii].Current)) <= 0)
                {
                    items.Insert(ii, next);
                    break;
                }
            }

            if (ii == items.Count) items.Add(next);
        }
        else next.Dispose(); // whoops! can't forget IDisposable
    }
}

Results:

for (int pp = 0; pp < people.Count; ++pp)
{
    Console.WriteLine("List {0}:", pp + 1);
    Console.WriteLine("\t{0}", String.Join(", ", people[pp].Select(xx => xx.Name).ToArray()));
}

Console.WriteLine("Merged:");
foreach (var person in people.MergePreserveOrder3(pp => pp.Age))
{
    Console.WriteLine("\t{0}", person.Name);
}

List 1:
        8yo, 22yo, 47yo, 49yo
List 2:
        35yo, 47yo, 60yo
List 3:
        28yo, 55yo, 64yo
Merged:
        8yo
        22yo
        28yo
        35yo
        47yo
        47yo
        49yo
        55yo
        60yo
        64yo
sixlettervariables
Nice, it is much cleaner and performs a bit better than my awful version. If you think of another version, don't hesitate to update your answer. Thanks!
franck
Latest cut MergePreserveOrder2 is linear w.r.t. number of people or lists. My original and your original are both much worse in terms of growth.
sixlettervariables
Ok, I think this is the best solution according to my needs. The algorithm seems to perform only the necessary operations, which is probably optimal in terms of performance, and it is still easy to read/understand.
franck
A: 

My version of sixlettervariables's answer. I reduced the number of calls to orderFunc (each element passes through orderFunc only once), and in the case of ties, sorting is skipped. This is optimized for a small number of sources, larger numbers of elements within each source, and a possibly expensive orderFunc.

public static IEnumerable<T> MergePreserveOrder<T, TOrder>(
  this IEnumerable<IEnumerable<T>> sources, 
  Func<T, TOrder> orderFunc)  
  where TOrder : IComparable<TOrder> 
{
  Dictionary<TOrder, List<IEnumerator<T>>> keyedSources =
    sources.Select(source => source.GetEnumerator())
      .Where(e => e.MoveNext())
      .GroupBy(e => orderFunc(e.Current))
      .ToDictionary(g => g.Key, g => g.ToList()); 

  while (keyedSources.Any())
  {
     //this is the expensive line
    KeyValuePair<TOrder, List<IEnumerator<T>>> firstPair = keyedSources
      .OrderBy(kvp => kvp.Key).First();

    keyedSources.Remove(firstPair.Key);
    foreach (IEnumerator<T> e in firstPair.Value)
    {
      yield return e.Current;
      if (e.MoveNext())
      {
        TOrder newKey = orderFunc(e.Current);
        if (!keyedSources.ContainsKey(newKey))
        {
          keyedSources[newKey] = new List<IEnumerator<T>>() { e };
        }
        else
        {
          keyedSources[newKey].Add(e);
        }
      }
    }
  }
}

I'm betting this could be further improved by a SortedDictionary, but am not brave enough to try a solution using one without an editor.
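For what it's worth, a rough sketch of what that SortedDictionary variant could look like (untested; the AddToBucket helper is just made up for this sketch):

public static IEnumerable<T> MergePreserveOrderSorted<T, TOrder>(
  this IEnumerable<IEnumerable<T>> sources,
  Func<T, TOrder> orderFunc)
  where TOrder : IComparable<TOrder>
{
  var keyedSources = new SortedDictionary<TOrder, List<IEnumerator<T>>>();

  foreach (var e in sources.Select(source => source.GetEnumerator())
                           .Where(en => en.MoveNext()))
  {
    AddToBucket(keyedSources, orderFunc(e.Current), e);
  }

  while (keyedSources.Count > 0)
  {
    // a SortedDictionary enumerates in ascending key order,
    // so First() is the minimum -- no OrderBy per iteration
    KeyValuePair<TOrder, List<IEnumerator<T>>> firstPair = keyedSources.First();
    keyedSources.Remove(firstPair.Key);

    foreach (IEnumerator<T> e in firstPair.Value)
    {
      yield return e.Current;
      if (e.MoveNext())
      {
        AddToBucket(keyedSources, orderFunc(e.Current), e);
      }
    }
  }
}

static void AddToBucket<T, TOrder>(
  SortedDictionary<TOrder, List<IEnumerator<T>>> buckets, TOrder key, IEnumerator<T> e)
{
  if (!buckets.TryGetValue(key, out List<IEnumerator<T>> bucket))
  {
    buckets[key] = bucket = new List<IEnumerator<T>>();
  }
  bucket.Add(e);
}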

David B
You use a Dictionary/List combination to sort the elements - I am not sure about creating a separate `List` for each value. The OP says that he wants to merge large lists - so the initialization of so many lists could be a problem. I had a solution using a `SortedDictionary`, but the key has to be unique - so the value needs to be a collection again. That's why I decided to use a single list which is able to contain multiple keys (and uses a fast binary search)
tanascius
Good point, it is not difficult to pool the Lists.
David B
+3  A: 

Here's a solution with NO SORTING ... just the minimum number of comparisons. (I omitted passing the actual order function, for simplicity.) Updated to build a balanced tree:

    /// <summary>
    /// Merge a pair of ordered lists
    /// </summary>
    public static IEnumerable<T> Merge<T>(IEnumerable<T> aList, IEnumerable<T> bList)
        where T:IComparable<T>
    {
        var a = aList.GetEnumerator();
        bool aOK = a.MoveNext();

        foreach (var b in bList)
        {
            while (aOK && a.Current.CompareTo(b) <= 0) {yield return a.Current; aOK = a.MoveNext();}
            yield return b;
        }
        // And anything left in a
        while (aOK) { yield return a.Current; aOK = a.MoveNext(); }
    }

    /// <summary>
    /// Merge lots of sorted lists
    /// </summary>
    public static IEnumerable<T> Merge<T>(IEnumerable<IEnumerable<T>> listOfLists)
        where T : IComparable<T>
    {
        int n = listOfLists.Count();
        if (n < 2) 
            return listOfLists.FirstOrDefault();
        else
            return Merge (Merge(listOfLists.Take(n/2)), Merge(listOfLists.Skip(n/2)));
    }


public static void Main(string[] args)
{

    var sample = Enumerable.Range(1, 5).Select((i) => Enumerable.Range(i, i+5).Select(j => string.Format("Test {0:00}", j)));

    Console.WriteLine("Merged:");
    foreach (var result in Merge(sample))
    {
        Console.WriteLine("\t{0}", result);
    }
}
Hightechrider
This is how I would have done it.
Craig Wilson
listOfLists.FirstOrDefault() <- the elements of this list will be enumerated m times, where m is the number of lists to be merged. http://www.joelonsoftware.com/articles/fog0000000319.html
David B
This is a very clever solution ... and it is fast for a small number of lists, too. @Craig: It deserves more than a simple "This is how I would have done it". But for lots of lists it will perform poorly.
tanascius
What did you want me to say? He proposed a solution that looked very similar to mine, therefore I didn't post the same thing twice. Sure, in the Merge listOfLists routine, continually building another enumerable would probably be faster, but how many lists are we talking about here really? Premature optimization may be a problem. This is a dead simple solution that is lazily enumerated and solves the problem.
Craig Wilson
@tanascius - take another look at the improved balanced tree approach.
Hightechrider
It doesn't compile anymore ^^ No overload of Merge() takes two IEnumerables. But I guess that this wouldn't help any further - the problem with your approach is that the comparison between the elements of the different lists is a big overhead (see Daniel's answer). But again, this only happens for a large number of lists (2-5 certainly won't be an issue).
tanascius
A: 

The issue is that by merging several lists at once, you're doing far more comparisons than is generally a good idea. It would perhaps be faster just to stick them together and re-sort. It's possible to merge two pre-sorted lists in linear time (in the total number of elements). More than that, and you're looking at log(n).

If you're familiar with merge sort, the second part of that algorithm merges a bunch of sorted lists together in a sorted fashion. You could basically just use that. In fact, it might well be flat-out faster to just concatenate the lists and quicksort.
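In LINQ terms, the concatenate-and-resort idea is roughly the line below (assuming orderedLists holds the source sequences and the Person type from the question); note that OrderBy buffers the whole input, so the deferred streaming the OP asked for is lost:

// flatten all the lists, then sort the whole thing in one go
var merged = orderedLists.SelectMany(list => list).OrderBy(p => p.Age);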

DeadMG
Yeah that's not right. If all he needs to do is the second part, then it won't be faster to do the whole thing. The time it would take to do 3 lists is m + n + o, which ultimately comes back down to O(n).
Craig Wilson
+1  A: 

This looks like a terribly useful function to have around, so I decided to take a stab at it. My approach is a lot like Hightechrider's in that it breaks the problem down into merging two sorted IEnumerables into one, then merging that one with the next in the list. There is most likely some optimization you can do, but it works with my simple test case:

      public static IEnumerable<T> mergeSortedEnumerables<T>(
            this IEnumerable<IEnumerable<T>> listOfLists, 
            Func<T, T, Boolean> func)
      {
            IEnumerable<T> l1 = new List<T>{};
            foreach (var l in listOfLists)
            {
                 l1 = l1.mergeTwoSorted(l, func);
            }

            foreach (var t in l1)
            {
                 yield return t;
            }
      }

      public static IEnumerable<T> mergeTwoSorted<T>(
            this IEnumerable<T> l1, 
            IEnumerable<T> l2, 
            Func<T, T, Boolean> func)
      {
            using (var enumerator1 = l1.GetEnumerator())
            using (var enumerator2 = l2.GetEnumerator())
            {
                 bool enum1 = enumerator1.MoveNext();
                 bool enum2 = enumerator2.MoveNext();
                 while (enum1 || enum2)
                 {
                      //if enumerator1 is exhausted, just drain enumerator2
                      if (!enum1)
                      {
                            T t2 = enumerator2.Current;
                            enum2 = enumerator2.MoveNext();
                            yield return t2;
                      }
                      //if enumerator2 is exhausted, just drain enumerator1
                      else if (!enum2)
                      {
                            T t1 = enumerator1.Current;
                            enum1 = enumerator1.MoveNext();
                            yield return t1;
                      }
                      //both enumerators still have elements
                      else
                      {
                            T t1 = enumerator1.Current;
                            T t2 = enumerator2.Current;

                            //if func returns true then t1 < t2
                            if (func(t1, t2))
                            {
                                 enum1 = enumerator1.MoveNext();
                                 yield return t1;
                            }
                            else
                            {
                                 enum2 = enumerator2.MoveNext();
                                 yield return t2;
                            }
                      }
                 }
            }
      }

Then to test it:

                // assuming a comparison delegate along these lines, returning true when
                // the first argument should come first:
                Func<int, int, bool> compareInts = (a, b) => a <= b;

                List<int> ws = new List<int>() { 1, 8, 9, 16, 17, 21 };
                List<int> xs = new List<int>() { 2, 7, 10, 15, 18 };
                List<int> ys = new List<int>() { 3, 6, 11, 14 };
                List<int> zs = new List<int>() { 4, 5, 12, 13, 19, 20 };
                List<IEnumerable<int>> lss = new List<IEnumerable<int>> { ws, xs, ys, zs };

                foreach (var v in lss.mergeSortedEnumerables(compareInts))
                {
                     Console.WriteLine(v);
                }
cirons42