views:

339

answers:

5

I've got an interresting problem: Given an IEnumerable<string>, is it possible to yield a sequence of IEnumerable<string> that groups identical adjacent strings in one pass?

Let me explain.

1. Basic illustrative sample :

Considering the following IEnumerable<string> (pseudo representation):

{"a","b","b","b","c","c","d"}

How to get an IEnumerable<IEnumerable<string>> that would yield something of the form:

{ // IEnumerable<IEnumerable<string>>
    {"a"},         // IEnumerable<string>
    {"b","b","b"}, // IEnumerable<string>
    {"c","c"},     // IEnumerable<string>
    {"d"}          // IEnumerable<string>
}

The method prototype would be:

public IEnumerable<IEnumerable<string>> Group(IEnumerable<string> items)
{
    // todo
}

But it could also be :

public void Group(IEnumerable<string> items, Action<IEnumerable<string>> action)
{
    // todo
}

...where action would be called for each subsequence.

2. More complicated sample

Ok, the first sample is very simple, and only aims to make the high level intent clear.

Now imagine we are dealing with IEnumerable<Anything>, where Anything is a type defined like this:

public class Anything
{
    public string Key {get;set;}
    public double Value {get;set;}
}

We now want to generate the subsequences based on the Key, (group every consecutive Anything that have the same key) to later use them in order to calculate the total value by group:

public void Compute(IEnumerable<Anything> items)
{
    Console.WriteLine(items.Sum(i=>i.Value));
}

// then somewhere, assuming the Group method 
// that returns an IEnumerable<IEnumerable<Anything>> actually exists:
foreach(var subsequence in Group(allItems))
{
    Compute(subsequence);
}

3. Important notes

  • Only one iteration over the original sequence
  • No intermediary collections allocations (we can assume millions of items in the original sequence, and millions consecutives items in each group)
  • Keeping enumerators and defered execution behavior
  • We can assume that resulting subsequences will be iterated only once, and will be iterated in order.

Is it possible, and how would you write it?

A: 

Your second bullet is the problematic one. Here's why:

var groups = CallMagicGetGroupsMethod().ToList();
foreach (string x in groups[3])
{
    ...
}
foreach (string x in groups[0])
{
    ...
}

Here, it's trying to iterate over the fourth group and then the first group... that's clearly only going to work if all the groups are buffered or it can reread the sequence, neither of which is ideal.

I suspect you want a more "reactive" approach - I don't know offhand whether Reactive Extensions does what you want (the "consecutive" requirement is unusual) but you should basically provide some sort of action to be executed on each group... that way the method won't need to worry about having to return you something which could be used later on, after it's already finished reading.

Let me know if you'd like me to try to find a solution within Rx, or whether you would be happy with something like:

void GroupConsecutive(IEnumerable<string> items,
                      Action<IEnumerable<string>> action)
Jon Skeet
I perfectly understand what you say. However, you can consider that I have complete control over the calling code, and that each subsequence will be iterated only once, and in order. "Providing an action to be executed on each group" -- How to pass the group (as an IEnumerable<string>) to the action?
Romain Verdier
That is a very good point. I think something along the lines of what the OP is trying to do, in spirit, though, *is* possible. He just needs to understand its limitations, e.g., that trying to use the resulting value just like any other `IEnumerable` (such as by calling `ToList` on it) is going to cause problems.
Dan Tao
@Romain: `action(group);` where of course `group is IEnumerable<string>`. Momentary blackout?
Jon
@Jon: I mean: how to yield the IEnumerable<string> to pass to the action? Having an action only ensure that subsequences are consumed sequentially, and only once, but my question is more about "how to partitionate an enumerable into several enumerables in *one trip*.
Romain Verdier
Although the *other* Jon here has a very valid point, I 've just added an answer. Check it out.
Jon
Question updated. I'm curious about an Rx based solution though...
Romain Verdier
+5  A: 

Way Better Solution That Meets All Requirements

OK, scrap my previous solution (I'll leave it below, just for reference). Here's a much better approach that occurred to me after making my initial post.

Write a new class that implements IEnumerator<T> and provides a few additional properties: IsValid and Previous. This is all you really need to resolve the whole mess with having to maintain state inside an iterator block using yield.

Here's how I did it (pretty trivial, as you can see):

internal class ChipmunkEnumerator<T> : IEnumerator<T> {

    private readonly IEnumerator<T> _internal;
    private T _previous;
    private bool _isValid;

    public ChipmunkEnumerator(IEnumerator<T> e) {
        _internal = e;
        _isValid = false;
    }

    public bool IsValid {
        get { return _isValid; }
    }

    public T Previous {
        get { return _previous; }
    }

    public T Current {
        get { return _internal.Current; }
    }

    public bool MoveNext() {
        if (_isValid)
            _previous = _internal.Current;

        return (_isValid = _internal.MoveNext());
    }

    public void Dispose() {
        _internal.Dispose();
    }

    #region Explicit Interface Members

    object System.Collections.IEnumerator.Current {
        get { return Current; }
    }

    void System.Collections.IEnumerator.Reset() {
        _internal.Reset();
        _previous = default(T);
        _isValid = false;
    }

    #endregion

}

(I called this a ChipmunkEnumerator because maintaining the previous value reminded me of how chipmunks have pouches in their cheeks where they keep nuts. Does it really matter? Stop making fun of me.)

Now, utilizing this class in an extension method to provide exactly the behavior you want isn't so tough!

Notice that below I've defined GroupConsecutive to actually return an IEnumerable<IGrouping<TKey, T>> for the simple reason that, if these are grouped by key anyway, it makes sense to return an IGrouping<TKey, T> rather than just an IEnumerable<T>. As it turns out, this will help us out later anyway...

public static IEnumerable<IGrouping<TKey, T>> GroupConsecutive<T, TKey>(this IEnumerable<T> source, Func<T, TKey> keySelector)
    where TKey : IEquatable<TKey> {

    using (var e = new ChipmunkEnumerator<T>(source.GetEnumerator())) {
        if (!e.MoveNext())
            yield break;

        while (e.IsValid) {
            yield return e.GetNextDuplicateGroup(keySelector);
        }
    }
}

public static IEnumerable<IGrouping<T, T>> GroupConsecutive<T>(this IEnumerable<T> source)
    where T : IEquatable<T> {

    return source.GroupConsecutive(x => x);
}

private static IGrouping<TKey, T> GetNextDuplicateGroup<T, TKey>(this ChipmunkEnumerator<T> e, Func<T, TKey> keySelector)
    where TKey : IEquatable<TKey> {

    return new Grouping<TKey, T>(keySelector(e.Current), e.EnumerateNextDuplicateGroup(keySelector));
}

private static IEnumerable<T> EnumerateNextDuplicateGroup<T, TKey>(this ChipmunkEnumerator<T> e, Func<T, TKey> keySelector)
    where TKey : IEquatable<TKey> {

    do {
        yield return e.Current;

    } while (e.MoveNext() && keySelector(e.Previous).Equals(keySelector(e.Current)));
}

(To implement these methods, I wrote a simple Grouping<TKey, T> class that implements IGrouping<TKey, T> in the most straightforward way possible. I've omitted the code just so as to keep moving along...)

OK, check it out. I think the code example below pretty well captures something resembling the more realistic scenario you described in your updated question.

var entries = new List<KeyValuePair<string, int>> {
    new KeyValuePair<string, int>( "Dan", 10 ),
    new KeyValuePair<string, int>( "Bill", 12 ),
    new KeyValuePair<string, int>( "Dan", 14 ),
    new KeyValuePair<string, int>( "Dan", 20 ),
    new KeyValuePair<string, int>( "John", 1 ),
    new KeyValuePair<string, int>( "John", 2 ),
    new KeyValuePair<string, int>( "Bill", 5 )
};

var dupeGroups = entries
    .GroupConsecutive(entry => entry.Key);

foreach (var dupeGroup in dupeGroups) {
    Console.WriteLine(
        "Key: {0} Sum: {1}",
        dupeGroup.Key.PadRight(5),
        dupeGroup.Select(entry => entry.Value).Sum()
    );
}

Output:

Key: Dan   Sum: 10
Key: Bill  Sum: 12
Key: Dan   Sum: 34
Key: John  Sum: 3
Key: Bill  Sum: 5

Notice this also fixes the problem with my original answer of dealing with IEnumerator<T> objects that were value types. (With this approach, it doesn't matter.)

There's still going to be a problem if you try calling ToList here, as you will find out if you try it. But considering you included deferred execution as a requirement, I doubt you would be doing that anyway. For a foreach, it works.


Original, Messy, and Somewhat Stupid Solution

Something tells me I'm going to get totally refuted for saying this, but...

Yes, it is possible (I think). See below for a damn messy solution I threw together. (Catches an exception to know when it's finished, so you know it's a great design!)

Now, Jon's point about there being a very real problem in the event that you try to do, for instance, ToList, and then access the values in the resulting list by index, is totally valid. But if your only intention here is to be able to loop over an IEnumerable<T> using a foreach -- and you're only doing this in your own code -- then, well, I think this could work for you.

Anyway, here's a quick example of how it works:

var ints = new int[] { 1, 3, 3, 4, 4, 4, 5, 2, 3, 1, 6, 6, 6, 5, 7, 7, 8 };

var dupeGroups = ints.GroupConsecutiveDuplicates(EqualityComparer<int>.Default);

foreach (var dupeGroup in dupeGroups) {
    Console.WriteLine(
        "New dupe group: " +
        string.Join(", ", dupeGroup.Select(i => i.ToString()).ToArray())
    );
}

Output:

New dupe group: 1
New dupe group: 3, 3
New dupe group: 4, 4, 4
New dupe group: 5
New dupe group: 2
New dupe group: 3
New dupe group: 1
New dupe group: 6, 6, 6
New dupe group: 5
New dupe group: 7, 7
New dupe group: 8

And now for the (messy as crap) code:

Note that since this approach requires passing the actual enumerator around between a few different methods, it will not work if that enumerator is a value type, as calls to MoveNext in one method are only affecting a local copy.

public static IEnumerable<IEnumerable<T>> GroupConsecutiveDuplicates<T>(this IEnumerable<T> source, IEqualityComparer<T> comparer) {
    using (var e = source.GetEnumerator()) {
        if (e.GetType().IsValueType)
            throw new ArgumentException(
                "This method will not work on a value type enumerator."
            );

        // get the ball rolling
        if (!e.MoveNext()) {
            yield break;
        }

        IEnumerable<T> nextDuplicateGroup;

        while (e.FindMoreDuplicates(comparer, out nextDuplicateGroup)) {
            yield return nextDuplicateGroup;
        }
    }
}

private static bool FindMoreDuplicates<T>(this IEnumerator<T> enumerator, IEqualityComparer<T> comparer, out IEnumerable<T> duplicates) {
    duplicates = enumerator.GetMoreDuplicates(comparer);

    return duplicates != null;
}

private static IEnumerable<T> GetMoreDuplicates<T>(this IEnumerator<T> enumerator, IEqualityComparer<T> comparer) {
    try {
        if (enumerator.Current != null)
            return enumerator.GetMoreDuplicatesInner(comparer);
        else
            return null;

    } catch (InvalidOperationException) {
        return null;
    }
}

private static IEnumerable<T> GetMoreDuplicatesInner<T>(this IEnumerator<T> enumerator, IEqualityComparer<T> comparer) {
    while (enumerator.Current != null) {
        var current = enumerator.Current;
        yield return current;

        if (!enumerator.MoveNext())
            break;

        if (!comparer.Equals(current, enumerator.Current))
            break;
    }
}
Dan Tao
Hey @Dan, well done. That's one correct solution. Thanks!
Romain Verdier
@Dan - +1 for improved usage. I had the same realization about `IsValid` and `Previous`. Your solution is a bit nicer than mine from a usage perspective, but it uses the same approach.
dss539
@dss539: Nice, looks like great minds think alike ;) Personally, I do like the idea of having an `IEnumerator<T>` that provides `Previous` and `IsValid` properties, independent of any specific problem, as I feel it could prove useful in other scenarios as well. But your approach is certainly more concise!
Dan Tao
A: 

I think you are looking for the GroupBy operator.

var items = new[] { "a", "b", "c", "b", "c", "d", "b" };
var groups = items.GroupBy(i => i);

foreach(var @group in groups)
{
    Console.WriteLine(@group.Key);
    foreach(var v in @group)
        Console.WriteLine("    " + v);
}
Craig Wilson
This isn't quite going to do it. He wants groups of *consecutive* characters, this would group non-consecutive characters together.
Anthony Pegram
Yep, I see that. My bad.
Craig Wilson
+4  A: 

Here's a solution that I think satisfies your requirements, works with any type of data item, and is quite short and readable:

public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> list)
{
    var current = list.FirstOrDefault();

    while (!Equals(current, default(T))) {
        var cur = current;
        Func<T, bool> equalsCurrent = item => item.Equals(cur);
        yield return list.TakeWhile(equalsCurrent);
        list = list.SkipWhile(equalsCurrent);
        current = list.FirstOrDefault();
    }
}

Notes:

  1. Deferred execution is there (both TakeWhile and SkipWhile do it).
  2. I think this iterates over the entire collection only once (with SkipWhile); it does iterate over the collection once more when you process the returned IEnumerables, but the partitioning itself iterates only once.
  3. If you don't care about value types, you can add a constraint and change the while condition to a test for null.

If I am somehow mistaken, I 'd be especially interested in comments pointing out the mistakes!

Very Important Aside:

This solution will not allow you to enumerate the produced enumerables in any order other than the one it provides them in. However, I think the original poster has been pretty clear in comments that this is not a problem.

Jon
Interesting approach, but you iterate the entire list twice. You do break up the iteration into chunks, but every item is compared twice (1 for Take, then 1 for Skip). Also, this precludes default values as part of the data set (e.g. null strings or integer value 0). Still, this is pretty cool and I do not have a better approach.
dss539
@dss: Well, any solution obviously needs to iterate once over the collection to partition it (this is what `SkipWhile` does here). The second iteration only happens when *you* iterate over the results this method provides (only *then* is `TakeWhile` executed). Am I wrong in this? Regarding value types: as I mention, this is the best that can be done if you want to support them. :-)
Jon
Thanks for answering Jon! This solution seems correct but there is a little problem though, regarding the first constraint: Using TakeWhile then SkipWhile makes you iterate **twice** over each group, so you iterate the collection twice.
Romain Verdier
@Jon - Can you please take a look at my new answer and tell me if you see anything wrong? http://stackoverflow.com/questions/2828203/grouping-consecutive-identical-items-ienumerablet-to-ienumerableienumerablet/2829939#2829939
dss539
@Jon: I think you're mistaken that any solution needs to iterate twice. Take a look at my answer, which only iterates once (you can confirm this by considering that the entire solution only uses a single enumerator which is never reset).
Dan Tao
+3  A: 

Is this what you are looking for?

  • Iterate list only once.
  • Defer execution.
  • No intermediate collections (my other post failed on this criteria).

This solution relies on object state because it's difficult to share state between two IEnumerable methods that use yield (no ref or out params).

internal class Program
{
    static void Main(string[] args)
    {
        var result = new[] { "a", "b", "b", "b", "c", "c", "d" }.Partition();
        foreach (var r in result)
        {
            Console.WriteLine("Group".PadRight(16, '='));
            foreach (var s in r)
                Console.WriteLine(s);
        }
    }
}

internal static class PartitionExtension
{
    public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> src)
    {
        var grouper = new DuplicateGrouper<T>();
        return grouper.GroupByDuplicate(src);
    }
}

internal class DuplicateGrouper<T>
{
    T CurrentKey;
    IEnumerator<T> Itr;
    bool More;

    public IEnumerable<IEnumerable<T>> GroupByDuplicate(IEnumerable<T> src)
    {
        Itr = src.GetEnumerator();
        More = Itr.MoveNext();

        while (More)
            yield return GetDuplicates();
    }

    IEnumerable<T> GetDuplicates()
    {
        CurrentKey = Itr.Current;
        while (More && CurrentKey.Equals(Itr.Current))
        {
            yield return Itr.Current;
            More = Itr.MoveNext();
        }
    }
}

Edit: Added extension method for cleaner usage. Fixed loop test logic so that "More" is evaluated first.

dss539
+1: looks good to me
Jon
Simple and correct solution, thanks!
Romain Verdier
+1: Nicely done.
Dan Tao