I have a query that is bothering me. It is encapsulated as a new query operator, and I made two versions of it to see which one performs better. Both perform horribly.

First attempt: declarative style

public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
    return source.Any()
        ? source.Take(length).Cons(source.Skip(length).Section(length))
        : Enumerable.Empty<IEnumerable<α>>();
}
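
For reference, Cons isn't a standard LINQ operator; it's assumed here to be a helper that lazily prepends a head element to a sequence, something like:

public static IEnumerable<T> Cons<T>(this T head, IEnumerable<T> tail)
{
    // Lazily prepend a single element to the front of a sequence.
    yield return head;
    foreach (var item in tail)
        yield return item;
}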

Second attempt: imperative "yield return" style

public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
    var fst = source.Take(length);
    var rst = source.Skip(length);

    yield return fst;

    if (rst.Any())
        foreach (var section in rst.Section(length))
            yield return section;
}

In fact the second attempt is worse in terms of readability, compositionality, and speed.

Any clues as to how to optimize this?

+5  A: 

Wherever possible, I try to iterate through a source only once within an operator. If the source is something like the result of a Reverse() operator, calling Any, Take and Skip can cause some really nasty performance problems.

It's not entirely clear what your operator is trying to do, but if you can do it without reading the source multiple times, that may well help - although it will very much depend on what the input is.
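
One cheap way to check this (a throwaway probe, not in the original answer) is to count how many times the source actually gets enumerated:

static int _starts;

static IEnumerable<int> Probe(int count)
{
    _starts++;   // the iterator body runs once each time the sequence is enumerated
    for (int i = 0; i < count; i++)
        yield return i;
}

// _starts = 0;
// Probe(100).Section(10).SelectMany(s => s).Count();
// Console.WriteLine(_starts);   // far greater than 1 for the recursive versions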

Jon Skeet
Hi Jon, I am aware that operators like Reverse and Sum are dangerous since they evaluate the whole sequence; however, that is not quite what I'm doing. Actually, Reverse().Any() ought to be fast, because it should be possible to implement it without actually reversing the collection - unfortunately, that is likely not how LINQ to Objects works.
Bent Rasmussen
Reverse().Any(..) would be a kind of optimization that says "the value I'm looking for is known (or assumed) to be closer to the end than to the start, so looking from the end is faster", provided the Reverse().Any() is actually possible to optimize, which in the general case it isn't.
Lasse V. Karlsen
It appears to pretty much be the `Batch()` operator from MoreLinq.
LBushkin
Not at all Lasse, it would check if there is just a single value (it doesn't care about where, or how many more) and it would do this on the un-reversed enumerable.
Bent Rasmussen
+9  A: 

If I understand your question correctly, you're trying to build a lazy implementation of an enumerator that splits a larger collection of items into smaller enumerable collections of items.

For instance, a sequence of a million numbers could be split up into "sections" that each produce only 100 of them, and you want it all done lazily, i.e. without collecting 100 items into a list before producing them.
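
In code terms, the goal is roughly this (an illustration of the intent, not an implementation):

var sections = Enumerable.Range(0, 1000000).Section(100);
// sections should lazily yield 10,000 sections of 100 numbers each,
// never buffering a whole section into a list before handing it out.
foreach (var section in sections.Take(3))
    Console.WriteLine(section.First());   // 0, 100, 200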

First, your attempts re-iterate over the collection many times, which is bad; hence the performance problem.

If you're trying to build a pure lazy implementation, you should consider the following issues:

  • You want to iterate over the underlying collection only once
  • You should return enumerables that re-use the underlying enumerator
  • You need to handle the case where a section you return isn't completely enumerated (for instance, the calling code only needed the first 50 of those 100 items).

Edit: Before I go into my simplistic solution, here are a few caveats about it:

  • You cannot save each section for later, i.e. you cannot do collection.Section(10).ToArray() to get an array of sections.
  • You cannot enumerate over each section more than once, because when you do, it changes a hidden underlying data structure.

Basically: my solution is not general purpose. If you need that, you should go with @LBushkin's suggestion of MoreLinq's Batch. I would hesitate to put my code into a class library; it would have to stay local to where it is needed, or be renamed to something that clearly warns you about its issues.


Here's a simplistic implementation. I'm pretty sure there are bugs here, so you might want to write a ton of unit tests for edge cases:

using System;
using System.Collections.Generic;
using System.Linq;

namespace ConsoleApplication20
{
    // Wraps a shared enumerator: every section returned by Section<T> reads
    // from the same underlying enumerator. That is what makes the whole thing
    // single-pass, and also why each section can only be enumerated once.
    class SectionEnumerable<T> : IEnumerable<T>
    {
        private readonly IEnumerator<T> _Enumerator;

        public SectionEnumerable(IEnumerator<T> enumerator, int sectionSize)
        {
            _Enumerator = enumerator;
            Left = sectionSize;
        }

        public IEnumerator<T> GetEnumerator()
        {
            // The caller (Section<T>) has already positioned the enumerator
            // on the first item of this section, so yield before moving.
            while (Left > 0)
            {
                Left--;
                yield return _Enumerator.Current;
                if (Left > 0)
                    if (!_Enumerator.MoveNext())
                        break;
            }
        }

        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }

        // Number of items in this section not yet produced; Section<T> uses
        // this to skip past whatever the caller didn't consume.
        public int Left { get; private set; }
    }

    static class SequenceExtensions
    {
        public static IEnumerable<IEnumerable<T>> Section<T>(this IEnumerable<T> collection, int sectionSize)
        {
            if (collection == null)
                throw new ArgumentNullException("collection");
            if (sectionSize < 1)
                throw new ArgumentOutOfRangeException("sectionSize");

            using (IEnumerator<T> enumerator = collection.GetEnumerator())
            {
                while (enumerator.MoveNext())
                {
                    SectionEnumerable<T> enumerable = new SectionEnumerable<T>(enumerator, sectionSize);
                    yield return enumerable;
                    // Skip past whatever the caller left unconsumed so the
                    // next section starts at the right element.
                    for (int index = 0; index < enumerable.Left; index++)
                        if (!enumerator.MoveNext())
                            yield break;
                }
            }
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var sequence = Enumerable.Range(0, 100);
            var sections = sequence.Section(10);
            foreach (var section in sections)
            {
                Console.WriteLine(
                    String.Join(", ",
                    section.Take(5).Select(i => i.ToString()).ToArray()));
            }
            Console.ReadLine();
        }
    }
}

Output:

0, 1, 2, 3, 4
10, 11, 12, 13, 14
20, 21, 22, 23, 24
30, 31, 32, 33, 34
40, 41, 42, 43, 44
50, 51, 52, 53, 54
60, 61, 62, 63, 64
70, 71, 72, 73, 74
80, 81, 82, 83, 84
90, 91, 92, 93, 94

Things you should unit-test (see the sketch after this list):

  • Empty input collection produces no sections
  • A collection that has just the right number of elements, produces only one section
  • A collection that contains a multiple of the section size (i.e. 10, 20, 30, etc. elements with a section size of 5 or 10) doesn't produce an empty section after all the expected ones
  • That it is actually lazy: if you enumerate over the first 10-element section but only the first 5 of the second section, only the first 15 elements of the underlying collection are enumerated
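
For example, the first of those could look like this (a sketch, assuming an NUnit-style framework; not part of the original answer):

[Test]
public void Empty_input_produces_no_sections()
{
    // An empty source should yield zero sections.
    var sections = Enumerable.Empty<int>().Section(10);
    Assert.IsFalse(sections.Any());
}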
Lasse V. Karlsen
Hm, I guess the default Standard Query Operators implementation is not quite as smart as I'd like. I will probably have to build in some memoization then.
Bent Rasmussen
I dislike creating custom IEnumerable implementations for this on principle, but I can see how in practice, at least given the current C# compiler, this will help.
Bent Rasmussen
I must mark this as an answer. The custom IEnumerable implementation is extremely fast. Thanks, Lasse :-)
Bent Rasmussen
+1 I like how your approach delegates the enumeration to the SectionEnumerable.
JoshBerke
This looks a lot like the `Batch()` operator in MoreLinq; see: http://code.google.com/p/morelinq/source/browse/trunk/MoreLinq/Batch.cs
LBushkin
@LBushkin, it's similar, except that my implementation is extremely lazy: it doesn't collect a complete batch/section before producing it, so there's little memory usage and no latency from periodically collecting a batch/section before moving on.
Lasse V. Karlsen
Note that there is one rather big problem with the code in my answer, which isn't present in the Batch algorithm linked to by @LBushkin: you cannot enumerate over the sections more than once. This is contrary to how collections typically work, and you might consider that bad enough behaviour to use the Batch implementation provided in MoreLinq instead. Just saying. I'll edit my answer with a few of the issues.
Lasse V. Karlsen
Yes, this is very undesirable in general but will probably work for me.
Bent Rasmussen
Thanks LBushkin
Bent Rasmussen
This is deeply broken. The moment you step outside of your test case and do anything different, you get crazy results. For example: "var sections = sequence.Section(10).ToArray(); foreach (var num in sections[5]) Console.WriteLine(num);" -- you'd expect to get {50...59}. Instead you get {99}.
Eric Lippert
Hi Eric, yes, I know, as I mentioned in my answer, this is for just basic one-time local usage. The batch implementation in MoreLinq, as indicated by @LBushkin, is the correct way, but not quite as lazy. It depends on what he wants.
Lasse V. Karlsen
+1  A: 

Is this any faster? It should be, because it only needs a single iteration through the source sequence.

public static IEnumerable<IEnumerable<T>> Section<T>(
    this IEnumerable<T> source, int length)
{
    return source
        .Select((x, i) => new { Value = x, Group = i / length })
        .GroupBy(x => x.Group, y => y.Value);
}
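
A quick usage sketch (not part of the original answer); note that GroupBy buffers the entire source before yielding the first group, so this is single-pass but not lazy section-by-section:

foreach (var section in Enumerable.Range(0, 10).Section(3))
    Console.WriteLine(String.Join(", ",
        section.Select(i => i.ToString()).ToArray()));

// Output:
// 0, 1, 2
// 3, 4, 5
// 6, 7, 8
// 9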
LukeH
Maybe, but still orders of magnitude slower than Lasse's custom IEnumerator.
Bent Rasmussen
Of course, but if you need raw performance then why use LINQ at all? LINQ, in my opinion, is all about readability, not raw performance. Use LINQ when you want concise, readable, declarative code. When you need out-and-out speed you should probably avoid LINQ altogether.
LukeH
I'm sorry, I want compositionality, readability and speed; I don't want any sacrifices. So even writing a custom IEnumerable, as Lasse has done, is preferable to forgoing LINQ for me.
Bent Rasmussen
+2  A: 

Here's another approach that doesn't use LINQ; it was much faster than your second method:

public static IEnumerable<IEnumerable<a>> Section<a>(this IEnumerable<a> source, int length)
{
    using (var enumerator = source.GetEnumerator())
    {
        var continueLoop = true;
        do
        {
            // Buffer one section at a time into its own list.
            var list = new List<a>();
            for (int i = 0; i < length; i++)
            {
                if (enumerator.MoveNext())
                {
                    list.Add(enumerator.Current);
                }
                else
                {
                    continueLoop = false;
                    break;
                }
            }
            if (list.Count > 0)
            {
                yield return list;
            }
        } while (continueLoop);
    }
}
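
(Design note, not part of the original answer: because each section is buffered into its own List, the sections here can be stored and re-enumerated safely, unlike the shared-enumerator approach above; the trade-off is holding one full section in memory at a time.)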
JoshBerke
+1 This is very similar to a non-lazy sectioning method I wrote when I couldn't get GroupBy to give me sections fast enough.
David B
+6  A: 

I suspect that the problem you're having is related to the fact that enumerating the final result is at least an O(n^2) operation, possibly worse; I haven't worked it all out in my head yet.

Why is that? Well, suppose you have [1, 2, 3, 4, 5, 6] and you split that up into what you think is { {1, 2}, {3, 4}, {5, 6} }.

That's not what you've done. You've in fact split this up into { take the first two, take the first two and discard them and then take the next two, take the first two and discard them and then take the next two and discard them and then take the third two }.

Notice how each step along the way re-calculates the result? That's because the array could be changing between calls to the enumeration. LINQ was designed to always get you up-to-date results; if you write a query that means "skip the first four and iterate the next two", that's exactly what you get -- a query that executes that code when you enumerate it.
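
Spelled out against the first attempt, the sections the query actually describes look like this (an illustration):

// With source = { 1, 2, 3, 4, 5, 6 } and length = 2:
// section 0: source.Take(2)
// section 1: source.Skip(2).Take(2)           // re-reads items 1-2 just to skip them
// section 2: source.Skip(2).Skip(2).Take(2)   // re-reads items 1-4
// Every later section re-enumerates everything before it, which is where
// the O(n^2) in source reads comes from.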

Is the original sequence small enough and fast enough that you can read the whole thing into memory and split it all up at once, rather than trying to do so lazily? Alternatively, is the sequence indexable? If all you get is forward access to the sequence, and it is too big or slow to read into memory all at once, then there is not a whole lot you can do here. But if you have one or both of those properties, then you can make this at least linear.
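
If the sequence is indexable, the linear approach could look something like this (a sketch under that assumption; the name SectionByIndex and the IList<T> requirement are not from the answer):

public static IEnumerable<IEnumerable<T>> SectionByIndex<T>(
    this IList<T> source, int length)
{
    // Each section indexes directly into the source, so enumerating all
    // sections costs O(n) overall, and sections can be re-enumerated safely.
    for (int start = 0; start < source.Count; start += length)
        yield return Enumerable.Range(start, Math.Min(length, source.Count - start))
                               .Select(i => source[i]);
}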

Eric Lippert
I like to design for laziness, to remain flexible. I also know the source will not change - but that's not going to help me now. I'll try to disassemble the .NET code and look a bit closer into what the different operators and the compiled code actually do. I imagine there could be some smarts around combinations of operators.
Bent Rasmussen
How about designing for immutability? When you have a sequence you know absolutely cannot change, wouldn't you want to optimize LINQ for this scenario, which will become increasingly common? I don't know how to do it; maybe there could be a way to extend the C# compiler to treat a LINQ query specially at compile-time. - Or else create a custom LINQ provider that optimizes for immutable streams, albeit with the overhead of "JIT" "LINQ compilation".
Bent Rasmussen
@Bent: Indeed, this subject is one we are researching. It's a hard problem and I don't know if we'll realistically make any headway on it soon.
Eric Lippert
Yes, I was also thinking that in iterator code, if you have side-effect-free steps, then the Enumerable.Skip method could skip the computations too, completely bypassing all work. But of course, it would require a completely different treatment of iterators and LINQ to Objects.
Bent Rasmussen
Nice explanation of the problem and the causes behind it... as usual :)
Ian
Agreed, Eric's explanation is great. - I think I found a pretty nice solution to it now though :-) Although I'd still prefer an L2O optimizer.
Bent Rasmussen
A: 

I had an idea today; check this out

public static IEnumerable<α> Take<α>(this IEnumerator<α> iterator, int count)
{
    // Pull up to `count` items out of an already-running enumerator.
    for (var i = 0; i < count && iterator.MoveNext(); i++)
        yield return iterator.Current;
}

public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerator<α> iterator, int length)
{
    var sct = Enumerable.Empty<α>();
    do
    {
        // Materialize one section; ToArray buffers `length` items but makes
        // the yielded section safe to enumerate more than once.
        sct = iterator.Take(length).ToArray();
        if (sct.Any())
            yield return sct;
    }
    while (sct.Any());
}

This is still not super-elegant but at least the implementation is very short and readable.

It may be quite interesting to investigate query operators over IEnumerator.

And for convenience

public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
    using (var iterator = source.GetEnumerator())
        foreach (var e in iterator.Section(length))
            yield return e;
}
Bent Rasmussen
Actually, removing the ToArray call here is a good speed booster. One could apply memoization to the whole Section sequence, but that appears to give rather slow performance with the memoizing enumerable I've tested.
Bent Rasmussen
A: 

Do you need to keep your original source around for some reason as you progress? If not, why don't you use recursion in hd :: tl style: pull the head, pass the tail into the recursive call, and on every even-numbered recursion merge the two elements you have sitting around into a section?

Ryan Riley
Not sure I follow you; maybe you could provide an example? In any event, C# does not like recursion: as far as I know it does not apply tail-call optimization as F# does. The implementation provided here, though, appears to be fast and correct - although I'm sure you can squeeze a bit more juice out of it, at the cost of elegance.
Bent Rasmussen
Ran out of time. I did think of another potential solution, similar to yours: create lists using the indexer overloads and then zip them.
Ryan Riley
A: 

How about an extension method

public static class IEnumerableExtensions
{
    public static IEnumerable<List<T>> InSetsOf<T>(this IEnumerable<T> source, int max)
    {
        List<T> toReturn = new List<T>();
        foreach (var item in source)
        {
            toReturn.Add(item);
            if (toReturn.Count == max)
            {
                // A full set: hand it out and start a fresh list, so the
                // yielded sets are independent and safe to keep around.
                yield return toReturn;
                toReturn = new List<T>();
            }
        }
        if (toReturn.Any())
        {
            yield return toReturn;
        }
    }
}

some tests:

[TestFixture]
public class When_asked_to_return_items_in_sets
{
    [Test]
    public void Should_return_the_correct_number_of_sets_if_the_input_contains_a_multiple_of_the_setSize()
    {
        List<string> input = "abcdefghij".Select(x => x.ToString()).ToList();
        var result = input.InSetsOf(5);
        result.Count().ShouldBeEqualTo(2);
        result.First().Count.ShouldBeEqualTo(5);
        result.Last().Count.ShouldBeEqualTo(5);
    }

    [Test]
    public void Should_separate_the_input_into_sets_of_size_requested()
    {
        List<string> input = "abcdefghijklm".Select(x => x.ToString()).ToList();
        var result = input.InSetsOf(5);
        result.Count().ShouldBeEqualTo(3);
        result.First().Count.ShouldBeEqualTo(5);
        result.Last().Count.ShouldBeEqualTo(3);
    }
}        
Handcraftsman
That's a very good and simple idea. One small optimization to add is to set the capacity of toReturn to max, i.e. new List<T>(max);
Bent Rasmussen
It's actually a bit slower than my implementation if I remove the call to ToArray. It would be nice to have fast memoizing enumerators as well, but in my tests they have quite an overhead by themselves.
Bent Rasmussen