tags:

views:

108

answers:

4

I have recently been in a situation where I needed to perform an operation a grouped slowly yielding Linq query.

Now, groupBy looses it's lazyness, that means that you have to wait for the entire Sequence to finish until you get any groups returned. This to me logically seems not the best solution, as a group can be returned as soon as it is first encountered.

I have written the following code, which seems to work fine, and am looking for pitfalls and general improvements, as well as thoughts on the concept itself (eg. can/should a groupBy method return groups as soon as possible).

public static IEnumerable<KeyValuePair<R, IEnumerable<T>>> GroupByLazy<T, R>(this IEnumerable<T> source, Func<T, R> keySelector)
        {
            var dic = new Dictionary<R, BlockingCollection<T>>();
            foreach (var item in source)
            {
                var Key = keySelector(item);
                BlockingCollection<T> i;
                if (!dic.TryGetValue(Key, out i))
                {
                    i = new BlockingCollection<T>();
                    i.Add(item);
                    dic.Add(Key, i);
                    yield return new KeyValuePair<R, IEnumerable<T>>(Key, i);
                }
                else i.TryAdd(item);
            }
            // mark all the groups as completed so that enumerations of group-items can finish
            foreach (var groupedValues in dic.Values)
                groupedValues.CompleteAdding();
        }

Simple Test:

var slowIE = Observable.Interval(TimeSpan.FromSeconds(1)).ToEnumerable().Take(10);
            var debug = slowIE.Do(i => Console.WriteLine("\teval " + i));

            var gl = debug.GroupByLazy(i => i % 2 == 0);

            var g = debug.GroupBy(i => i % 2 == 0);

            Console.WriteLine("Lazy:");
            gl.Run(i => Console.WriteLine("Group returned: " + i.Key));
            Console.WriteLine(gl.Single(i => i.Key).Value.Count());

            Console.WriteLine("NonLazy:");
            g.Run(i => Console.WriteLine("Group returned: " + i.Key));
            Console.WriteLine(g.Single(i => i.Key).Count());

            Console.ReadLine();

which prints:

Lazy:
        eval 0
Group returned: True
        eval 1
Group returned: False
        eval 2
        eval 3
        eval 4
        eval 5
        eval 6
        eval 7
        eval 8
        eval 9
NonLazy:
        eval 0
        eval 1
        eval 2
        eval 3
        eval 4
        eval 5
        eval 6
        eval 7
        eval 8
        eval 9
Group returned: True
Group returned: False

As you can see, in my LazyGroupBy the groups are returned as soon as they are first encountered, and can thus be acted upon without waiting for the entire sequence to be grouped.

Thoughts?

Edit: quick thought, I think "Lazy" is not the right term...I'm not a native speaker, what term am I actually looking for?

+4  A: 

In your solution it appears that the returned groups will change after the group is returned. This may suit some programming patterns, but I don't see it as generally useful.

Imagine that you process a group when it is first returned, and then at some later time a new item is added to the group. How do you know to reprocess the group's members? I imagine some grouped items might never be processed by the caller. Even though CompleteAdding is called, no notification is provided to the consumer of LazyGroupBy.

Again, this may suit some situations, but I can't think of when I would use it offhand.

Drew Noakes
+2  A: 

It's interesting but can you show a real world use case for this?

I would imagine that in most real world situations you would iterate over the groups and for each group iterate over the elements, or call some aggregate function on that group. In this case that aggregate operation will be blocking anyway. In this situation there is no advantage over using GroupBy.

Another situation is where you aren't interested in the elements, only the groups. But then you don't need a GroupBy at all - you can use Select then Distinct.

If you have a situation where you have needed this "lazy" GroupBy then please add it to your question to give a bit of background and motivation.

Mark Byers
My scenario is I have a few thousand files, that can logically be grouped. Foreach group i need to do some expensive web lookups, and only when that is complete (and a bit more postprocessing) do I need the actual files that belong to each group (to write them to the db).
Steven
A: 

I would go about it differrently

public static IEnumerable<KeyValuePair<R, IEnumerable<T>>> GroupByLazy<T, R>(this IEnumerable<T> source, Func<T, R> keySelector)
        {
            var set = HashSet();
            foreach (var item in source)
            {
                var Key = keySelector(item);
                if(set.Add(Key))
                {
                   var groupedItems = from i in source 
                                      where keySelector(i) == Key
                                      select i;
                   yield return new KevValuePair<R,IEnumerable<T>>(Key, groupedItems);
                }
            }
        }

the down side is of cause that the filtering will be applied to the entire source for each group but usually when Lazy evaluation is a must it's due to latency more than end-to-end speed

Rune FS
My first thought is that your solution goes around a few limitations (mentioned in Guffa's answer). My key selector is a bit expensive unforntunaltyl though, perhaps it's worth to look at memoization.
Steven
A: 

This "lazy" execution is called deferred execution.

When you return a group, it only contains the first item, and no items will be added to it until you get more groups. So this approach only works if you handle the groups in a separate thread so that the main thread can continue reading the collection, or if you first read all groups and then process them, which of course makes the deferred processing pointless.

Also, you always have to read all groups for the groups to be complete, if you use Take to limit the query, the method won't complete, and the already returned groups might never be completed. This also means that you may have threads still waiting for data that will never be there.

Guffa
This is true, however i believe Rune FS' approach would get around both of these, correct?
Steven
@Steven: Yes, it will. However, it adds another limitiation; it only works if you can read the collection multiple times. If you for example read from a database, this will requery the result for each group, which can make it much slower than waiting for the entire result once.
Guffa