views:

86

answers:

4

Hello Everyone,

I was looking for some fast and efficient method do merge items in array. This is my scenario. The collection is sorted by From. Adjacent element not necessarily differ by 1, that is there can be gaps between the last To and the next From, but they never overlap.

var list = new List<Range>();
list.Add(new Range() { From = 0, To = 1, Category = "AB" });
list.Add(new Range() { From = 2, To = 3, Category = "AB" });
list.Add(new Range() { From = 4, To = 5, Category = "AB" });
list.Add(new Range() { From = 6, To = 8, Category = "CD" });
list.Add(new Range() { From = 9, To = 11, Category = "AB" }); // 12 is missing, this is ok
list.Add(new Range() { From = 13, To = 15, Category = "AB" });

I would like the above collection to be merged in such way that the first three (this number can vary, from at least 2 elements to as many as the condition is satisfied) elements become one element. Cannot merge elements with different category.

new Range() { From = 0, To = 5, Category = "AB" };

So that the resulting collection would have 4 elements total.

0 - 5    AB
6 - 8    CD
9 - 11   AB // no merging here, 12 is missing
13 - 15  AB

I have a very large collection with over 2.000.000 items and I would like to this as efficiently as possible. Any ideas?

Thank you.

+2  A: 

Well, from the statement of the problem I think it is obvious that you cannot avoid iterating through the original collection of 2 million items:

var output = new List<Range>();
var currentFrom = list[0].From;
var currentTo = list[0].To;
var currentCategory = list[0].Category;
for (int i = 1; i < list.Count; i++)
{
    var item = list[i];
    if (item.Category == currentCategory && item.From == currentTo + 1)
        currentTo = item.To;
    else
    {
        output.Add(new Range { From = currentFrom, To = currentTo,
            Category = currentCategory });
        currentFrom = item.From;
        currentTo = item.To;
        currentCategory = item.Category;
    }
}
output.Add(new Range { From = currentFrom, To = currentTo,
    Category = currentCategory });

I’d be interested to see if there is a solution more optimised for performance.

Edit: I assumed that the input list is sorted. If it is not, I recommend sorting it first instead of trying to fiddle this into the algorithm. Sorting is only O(n log n), but if you tried to fiddle it in, you easily get O(n²), which is worse.

list.Sort((a, b) => a.From < b.From ? -1 : a.From > b.From ? 1 : 0);

As an aside, I wrote this solution because you asked for one that is performance-optimised. To this end, I didn’t make it generic, I didn’t use delegates, I didn’t use Linq extension methods, and I used local variables of primitive types and tried to avoid accessing object fields as much as possible.

Timwi
Tomislav Markovski
OK, fixed. ————
Timwi
This method works as well, and is as fast as the generic solution answer below. Both produced identical results with the same speed. Thank you.
Tomislav Markovski
I ran my code with 2 million entries through all of the 4 answers proposed here and all of them completed in exactly the same time with identical results, 5 seconds. I'm really sorry, but I can only accept one answer.
Tomislav Markovski
Thanks for explaining. I’m surprised that it would be the same speed, but since you ran it, I believe you. If it is really the same speed, then the answer you accepted is definitely better because it is reusable.
Timwi
+5  A: 

Here's a generic, reusable solution rather than an ad hoc, specific solution. (Updated based on comments)

IEnumerable<T> Merge<T>(this IEnumerable<T> coll, 
                      Func<T,T,bool> canBeMerged, Func<T,T,T>mergeItems)
{
    using(IEnumerator<T> iter = col.GetEnumerator())
    {
      if (iter.MoveNext())
      {
          T lhs = iter.Current;
          while(iter.MoveNext())
          {
              T rhs = iter.Current;
              if (canBeMerged(lhs, rhs)
                 lhs=mergeItems(lhs, rhs);
              else
              {
                 yield return lhs;
                 lhs= rhs;
              }
          }
          yield return lhs;
      }
    }
}

You will have to provide method to determine if the item can be merged, and to merge them. These really should be part of your Range class, so it would be called like them:

list.Merge((l,r)=> l.IsFollowedBy(r), (l,r)=> l.CombineWith(r));

If you don't have these method, then you would have to call it like:

list.Merge((l,r)=> l.Category==r.Category && l.To +1 == r.From,
           (l,r)=> new Range(){From = l.From, To=r.To, Category = l.Category});
James Curran
Nice, general solution. The only problem I have is that you are using `GetEnumerator()` without a `using`...
Timwi
Why was this downvoted? It works, it's generic, and it returns a nice IEnumerable<T> usable in real-time (as the merge is happening.)
Kirk Woll
@Timwi: I'm not sure what you are getting at. `IEnumerator` does not imply `IDisposable`.
James Curran
@James, take a look at all the implementations in class Enumerable. (And take a look at the declaration of IEnumerator<T>: public interface IEnumerator<T> : IDisposable, IEnumerator)
Kirk Woll
@James, `IEnumerator<T>` implements `IDisposable`... check MSDN ;) http://msdn.microsoft.com/en-us/library/78dfe2yb.aspx
Thomas Levesque
@Thomas, @Timwi: D'Oh! I looked up `IEnumerator` insted of `IEnumerator<T>`. I'll fix the code.
James Curran
Actually, this solution works only if mergeable items are adjacent... for instance, if the 1st and 3rd items can be merged (but not the 2nd), this method won't merge them.
Thomas Levesque
Just tested this and it works. Thank you.
Tomislav Markovski
A: 

Assuming that the list is sorted -and- the ranges are non overlapping, as you have stated in the question, this will run in O(n) time:

var flattenedRanges = new List<Range>{new Range(list.First())};

foreach (var range in list.Skip(1))
{
    if (flattenedRanges.Last().To + 1 == range.From && flattenedRanges.Last().Category == range.Category)
        flattenedRanges.Last().To = range.To;
    else
        flattenedRanges.Add(new Range(range));
}

This is assuming you have a copy-constructor for Range

EDIT: Here's an in-place algorithm:

    for (int i = 1; i < list.Count(); i++)
    {
        if (list[i].From == list[i - 1].To+1  && list[i-1].Category == list[i].Category)
        {
            list[i - 1].To = list[i].To;
            list.RemoveAt(i--);
        }
    }

EDIT:

Added the category check, and fixed the inplace version.

Ani
① This solution trashes the input collection. I assumed that the input collection should stay intact. ② This solution forgets to compare the categories.
Timwi
@ Timwi: It is a minor change to use a for-loop with indices instead.
Ani
@Ani: Huh? You’re modifying the objects from the input collection, how does a minor change with a for loop fix that?
Timwi
@ Timwi: Ok I will write it.
Ani
@Ani: Whether you use a `for` loop or `foreach` makes no difference. You are trashing the input collection either way.
Timwi
Oh, furthermore, your solution does not compile if `Range` is a value type.
Timwi
I think it's clear from the input collection and the expected result in the example that the bounds are inclusive.
Tomislav Markovski
@ Timwi: Ok now? I don't understand why this was a big issue for you anyway, considering the Range type was never even defined properly, we had to infer it from the property setters.
Ani
@ Tomislav: Look at the original question withou the edits.
Ani
@Ani: No, it’s getting worse. Your in-place algorithm skips an item every time it removes one. I give up.
Timwi
@ Timwi: I concede that, and its fixed, but the inplace went in only on your request, there was nothing wrong with the original.
Ani
I still need to check the condition if the category is the same.
Tomislav Markovski
@ Tomislav: The check is there, is it incorrect?
Ani
Yes, this works as well. This method modifies the original collection, but that's not an issue. Please don't down rate this answer.Thank you.
Tomislav Markovski
@ Tomislav: Thanks. As I said, I added the copy-constructor so that the original Range objects would not be modified. You can use the first version to create a new list, and the second to modify the original in constant space. Cheers.
Ani
+1  A: 

Here's another one :

IEnumerable<Range> Merge(IEnumerable<Range> input)
{
    input = input.OrderBy(r => r.Category).ThenBy(r => r.From).ThenBy(r => r.To).ToArray();
    var ignored = new HashSet<Range>();
    foreach (Range r1 in input)
    {
        if (ignored.Contains(r1))
            continue;

        Range tmp = r1;
        foreach (Range r2 in input)
        {
            if (tmp == r2 || ignored.Contains(r2))
                continue;

            Range merged;
            if (TryMerge(tmp, r2, out merged))
            {
                tmp = merged;
                ignored.Add(r1);
                ignored.Add(r2);
            }
        }
        yield return tmp;
    }
}

bool TryMerge(Range r1, Range r2, out Range merged)
{
    merged = null;
    if (r1.Category != r2.Category)
        return false;
    if (r1.To + 1 < r2.From || r2.To + 1 < r1.From)
        return false;
    merged = new Range
    {
        From = Math.Min(r1.From, r2.From),
        To = Math.Max(r1.To, r2.To),
        Category = r1.Category
    };
    return true;
}

You could use it directly:

var mergedList = Merge(list);

But that would be very inefficient it you have many items as the complexity is O(n²). However, since only items in the same category can be merged, you can group them by category and merge each group, then flatten the result:

var mergedList = list.GroupBy(r => r.Category)
                    .Select(g => Merge(g))
                    .SelectMany(g => g);
Thomas Levesque
Thank you, worked well.
Tomislav Markovski