ansaurus

Question

Answer 1

+4 A:

~~Your code looks fine.~~

My only suggestion would be replacing IEnumerable<object> with the non-generic IEnumerable. (In System.Collections)

EDIT:

On further inspection, you're casting more times than necessary.

Replace the if with the following code:

var token = list as Token;
if (token != null && token.TokenType == type) {

Also, you can get rid your currentT variable by writing t[t.Count - 1] or t.Last(). This will make the code clearer, but might have a tiny negative effect on performance.
Alternatively, you could store a reference to the inner list in a variable and use it directly. (This will slightly improve performance)

Finally, if you can change the return type to List<List<Object>>, you can return t directly; this will avoid an array copy and will be noticeably faster for large lists.

By the way, your variable names are confusing; you should swap the names of t and list.

SLaks 2010-02-16 01:45:18

Answer 2

A:

Do you need to convert it to an array? You could potentially use LINQ and delayed execution to return the results.

EDIT:
With the clarified question it would be hard to bend LINQ to make it return the results the way you want. If you still want to have the execution of each cycle delayed you could write your own enumerator.

I recommend perf testing this compared to the other options to see if there is a performance gain for your scenario if you attempt this approach. It might cause more overhead managing the iterator which would be bad for cases with little data.

I hope this helps.

// This is the easy way to make your own iterator using the C# syntax
// It will return sets of separated tokens in a lazy fashion
// This sample is based on the version provided by @Ants
public static IEnumerable<IEnumerable<object>> Split(this IEnumerable<object> tokens,
                                          TokenType type) {        
    var current = new List<object>();

    foreach (var item in tokens) 
    {
        Token token = item as Token;
        if (token != null && token.TokenType == type) 
        {
            if( current.Count > 0)
            {
                yield return current;
                current = new List<object>();
            }
        }
        else 
        {
            current.Add(item);
        }
    }

    if( current.Count > 0)
        yield return current;
}

Warning: This compiles but has still might have hidden bugs. It is getting late here.

// This is doing the same thing but doing it all by hand. 
// You could use this method as well to lazily iterate through the 'current' list as well
// This is probably overkill and substantially more complex
public class TokenSplitter : IEnumerable<IEnumerable<object>>, IEnumerator<IEnumerable<object>>
{
    IEnumerator<object> _enumerator;
    IEnumerable<object> _tokens;
    TokenType _target;

    List<object> _current;
    bool _isDone = false;

    public TokenSplitter(IEnumerable<object> tokens, TokenType target)
    {
        _tokens = tokens;
        _target = target;
        Reset();
    }        

    // Cruft from the IEnumerable and generic IEnumerator
    public IEnumerator<IEnumerable<object>> GetEnumerator() { return this; }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() 
    {
        return GetEnumerator();
    }

    public IEnumerable<object> Current { get { return _current; } }                
    public void Dispose() { }        

    #region IEnumerator Members

    object System.Collections.IEnumerator.Current { get { return Current; } }

    // See if there is anything left to get
    public bool MoveNext()
    {
        if (_isDone) return false;

        FillCurrent();

        return !_isDone;            
    }

    // Reset the enumerators so that you could reuse this structure if you wanted
    public void Reset()
    {
        _isDone = false; 
        _enumerator = _tokens.GetEnumerator();
        _current = new List<object>();
        FillCurrent();
    }

    // Fills the current set of token and then begins the next set
    private void FillCurrent()
    {
        // Try to accumulate as many tokens as possible, this too could be an enumerator to delay the process more
        bool hasNext = _enumerator.MoveNext();
        for( ; hasNext; hasNext = _enumerator.MoveNext())
        {            
            Token token = _enumerator.Current as Token;
            if (token == null || token.TokenType != _target)
            {
                _current.Add(_enumerator.Current);
            }
            else
            {
                _current = new List<object>();
            }
        }

        // Continue removing matching tokens and begin creating the next set
        for( ; hasNext; hasNext = _enumerator.MoveNext())
        {
            Token token = _enumerator.Current as Token;
            if (token == null || token.TokenType != _target)
            {
                _current.Add(_enumerator.Current);
                break;
            }
        }

        _isDone = !hasNext;
    }

    #endregion
}

smaclell 2010-02-16 01:47:04

I may be able to use this method, can you elaborate on how it would be done?

Dested 2010-02-16 06:33:19

Answer 3

+2 A:

My first thought would be instead of looking up t[currentT] all the time, just store a currentList and add directly to that.

Anon. 2010-02-16 01:47:49

This would make the code much simpler and more readable.

SLaks 2010-02-16 01:52:51

Answer 4

+1 A:

I think that there are broken cases for these scenarios where assuming that list items are lower case letters, and the item with matching token type is T:

{ T a b c ... };
{ ... x y z T };
{ ... j k l T T m n o ... };
{ T }; and
{ }

Which will result in:

{ { } { a b c ... } };
{ { ... x y z } { } };
{ { ... j k l } { } { } { m n o ... } };
{ { } }; and
{ }

Doing a straight refactoring:

public static IEnumerable<object>[] Split(this IEnumerable<object> tokens,
                                          TokenType type) {
    var outer = new List<List<object>>();
    var inner = new List<object>();
    foreach (var item in tokens) {
        Token token = item as token;
        if (token != null && token.TokenType == type) {
            outer.Add(inner);
            inner = new List<object>();
            continue;
        }
        inner.Add(item);
    }
    outer.Add(inner);
    return outer.ToArray();
}

To fix the broken cases (assuming that those are truly broken), I recommend:

public static IEnumerable<object>[] Split(this IEnumerable<object> tokens,
                                          TokenType type) {
    var outer = new List<List<object>>();
    var inner = new List<object>();
    var enumerator = tokens.GetEnumerator();
    while (enumerator.MoveNext()) {
        Token token = enumerator.Current as token;
        if (token == null || token.TokenType != type) {
            inner.Add(enumerator.Current);
        }
        else if (inner.Count > 0) {
            outer.Add(inner);
            inner = new List<object>();
        }
    }
    return outer.ToArray();
}

Which will result in:

{ { a b c ... } };
{ { ... x y z } };
{ { ... j k l } { m n o ... } };
{ }; and
{ }

Ants 2010-02-16 05:14:15

Answer 5

+3 A:

Type-testing and casts can be a performance killer. If at all possible, your token types should implement a common interface or abstract class. Instead of passing in and object, you should pass in an IToken which wraps your object.

Here's some concept code you can use to get started:

using System;
using System.Collections.Generic;

namespace Juliet
{
    interface IToken<T>
    {
        bool IsDelimeter { get; }
        T Data { get; }
    }

    class DelimeterToken<T> : IToken<T>
    {
        public bool IsDelimeter { get { return true; } }
        public T Data { get { throw new Exception("No data"); } }
    }

    class DataToken<T> : IToken<T>
    {
        public DataToken(T data)
        {
            this.Data = data;
        }

        public bool IsDelimeter { get { return false; } }
        public T Data { get; private set; }
    }

    class TokenFactory<T>
    {
        public IToken<T> Make()
        {
            return new DelimeterToken<T>();
        }

        public IToken<T> Make(T data)
        {
            return new DataToken<T>(data);
        }
    }

    class Program
    {

        static List<List<T>> SplitTokens<T>(IEnumerable<IToken<T>> tokens)
        {
            List<List<T>> res = new List<List<T>>();
            foreach (IToken<T> token in tokens)
            {
                if (token.IsDelimeter)
                {
                    res.Add(new List<T>());
                }
                else
                {
                    if (res.Count == 0)
                    {
                        res.Add(new List<T>());
                    }

                    res[res.Count - 1].Add(token.Data);
                }
            }

            return res;
        }

        static void Main(string[] args)
        {
            TokenFactory<string> factory = new TokenFactory<string>();
            IToken<string>[] tokens = new IToken<string>[]
                {
                    factory.Make("a"), factory.Make("b"), factory.Make("c"), factory.Make(),
                    factory.Make("d"), factory.Make("e"), factory.Make(),
                    factory.Make("f"), factory.Make("g"), factory.Make("h"), factory.Make(),
                    factory.Make("i"), factory.Make("j"), factory.Make("k"), factory.Make("l"), factory.Make(),
                    factory.Make("m")
                };

            List<List<string>> splitTokens = SplitTokens(tokens);
            for (int i = 0; i < splitTokens.Count; i++)
            {
                Console.Write("{");
                for (int j = 0; j < splitTokens[i].Count; j++)
                {
                    Console.Write("{0}, ", splitTokens[i][j]);
                }
                Console.Write("}");
            }

            Console.ReadKey(true);
        }
    }
}

In principle, you can create instances of IToken<object> to have it generalized to tokens of multiple classes.

Juliet 2010-02-19 04:31:56

Answer 6

+1 A:

Using LINQ you could try this: (I did not test it...)

    public static IEnumerable<object>[] Split(this  IEnumerable<object> tokens, TokenType type)
    {
        List<List<object>> l = new List<List<object>>();
        l.Add(new List<object>());
        return tokens.Aggregate(l, (c, n) => 
        {
            var t = n as Token;
            if (t != null && t.TokenType == type)
            {
                t.Add(new List<object>());
            }
            else
            {
                l.Last().Add(n);
            }
            return t;
        }).ToArray();
    }

Second try:

public static IEnumerable<object>[] Split(this  IEnumerable<object> tokens, TokenType type)
{
    var indexes = tokens.Select((t, index) => new { token = t, index = index }).OfType<Token>().Where(t => t.token.TokenType == type).Select(t => t.index);
    int prevIndex = 0;
    foreach (int item in indexes)
    {
        yield return tokens.Where((t, index) => (index > prevIndex && index < item));
        prevIndex = item;
    }
}

moi_meme 2010-02-19 20:26:04

Interesting approach! The speed problems seem to come from the creation and enumeration of the generic lists. What I really need is a solution that could somehow just return me the pieces of the Ienumerable<object> that I want in an array, as opposed to constantly adding to a List.

Dested 2010-02-19 20:51:53

@Dested : well, it seems I missed your comment. If you want to do arrays, just replace the list creation and `list.AddRange(EnumerateArraySegment(items, startIndex, endIndex - 1));` section in my last piece of code with `Array.Copy()`.I hope this helps.

andras 2010-02-22 00:20:43

Answer 7

+3 A:

A: An all-lazy implementation will suffice if you just iterate through the results in a nested foreach:

using System;
using System.Collections.Generic;

public static class Splitter
{
    public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> source, Predicate<T> match)
    {
        using (IEnumerator<T> enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                yield return Split(enumerator, match);
            }
        }
    }

    static IEnumerable<T> Split<T>(IEnumerator<T> enumerator, Predicate<T> match)
    {
        do
        {
            if (match(enumerator.Current))
            {
                yield break;
            }
            else
            {
                yield return enumerator.Current;
            }
        } while (enumerator.MoveNext());
    }
}

Use it like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace MyTokenizer
{
    class Program
    {
        enum TokenTypes { SimpleToken, UberToken }

        class Token { public TokenTypes TokenType = TokenTypes.SimpleToken;    }

        class MyUberToken : Token { public MyUberToken() { TokenType = TokenTypes.UberToken; } }

        static void Main(string[] args)
        {
            List<object> objects = new List<object>(new object[] { "A", Guid.NewGuid(), "C", new MyUberToken(), "D", new MyUberToken(), "E", new MyUberToken() });
            var splitOn = TokenTypes.UberToken;
            foreach (var list in objects.Split(x => x is Token && ((Token)x).TokenType == splitOn))
            {
                foreach (var item in list)
                {
                    Console.WriteLine(item);
                }
                Console.WriteLine("==============");
            }
            Console.ReadKey();
        }

    }
}

B: If you need to process the results some time later and you wish to do it out-of-order, or you partition on one thread and then possibly dispatch the segments to multiple threads, then this would probably provide a good starting point:

using System;
using System.Collections.Generic;
using System.Linq;

public static class Splitter2
{
    public static IEnumerable<IEnumerable<T>> SplitToSegments<T>(this IEnumerable<T> source, Predicate<T> match)
    {
        T[] items = source.ToArray();
        for (int startIndex = 0; startIndex < items.Length; startIndex++)
        {
            int endIndex = startIndex;
            for (; endIndex < items.Length; endIndex++)
            {
                if (match(items[endIndex])) break;
            }
            yield return EnumerateArraySegment(items, startIndex, endIndex - 1);
            startIndex = endIndex;
        }
    }

    static IEnumerable<T> EnumerateArraySegment<T>(T[] array, int startIndex, int endIndex)
    {
        for (; startIndex <= endIndex; startIndex++)
        {
            yield return array[startIndex];
        }
    }
}

C: If you really must return a collection of List<T> -s - which I doubt, unless you explicitly want to mutate them some time later on -, then try to initialize them to a given capacity before copying:

public static List<List<T>> SplitToLists<T>(this IEnumerable<T> source, Predicate<T> match)
{
    List<List<T>> lists = new List<List<T>>();
    T[] items = source.ToArray();
    for (int startIndex = 0; startIndex < items.Length; startIndex++)
    {
        int endIndex = startIndex;
        for (; endIndex < items.Length; endIndex++)
        {
            if (match(items[endIndex])) break;
        }
        List<T> list = new List<T>(endIndex - startIndex);
        list.AddRange(EnumerateArraySegment(items, startIndex, endIndex - 1));
        lists.Add(list);
        startIndex = endIndex;
    }
    return lists;
}

D: If this is still not enough, I suggest you roll your own lightweight List implementation that can copy a range directly to its internal array from another instance.

andras 2010-02-19 21:44:16

I like it! That is super clean. Well done.

smaclell 2010-02-20 07:16:27

@Dested: If you need arrays, just replace the part around `list.Addrange` with creating a new array and doing an `Array.Copy`.

andras 2010-02-22 18:58:59

Answer 8

+1 A:

Here is one possibility

The Token class ( could be what ever class )

public class Token
{
    public string Name { get; set; }
    public TokenType TokenType { get; set; }
}

Now the Type enum ( this could be what ever other grouping factor )

public enum  TokenType
{
    Type1,
    Type2,
    Type3,
    Type4,
    Type5,
}

The Extention Method (Declare this anyway you choose)

public static class TokenExtension
{
    public static IEnumerable<Token>[] Split(this IEnumerable<Token> tokens)
    {
        return tokens.GroupBy(token => ((Token)token).TokenType).ToArray();
    }
}

Sample of use ( I used a web project to spin this )

List<Token> tokens = new List<Token>();
        tokens.Add(new Token { Name = "a", TokenType = TokenType.Type1 });
        tokens.Add(new Token { Name = "b", TokenType = TokenType.Type1 });
        tokens.Add(new Token { Name = "c", TokenType = TokenType.Type1 });

        tokens.Add(new Token { Name = "d", TokenType = TokenType.Type2 });
        tokens.Add(new Token { Name = "e", TokenType = TokenType.Type2  });

        tokens.Add(new Token { Name = "f", TokenType = TokenType.Type3 });
        tokens.Add(new Token { Name = "g", TokenType = TokenType.Type3 });
        tokens.Add(new Token { Name = "h", TokenType = TokenType.Type3 });

        tokens.Add(new Token { Name = "i", TokenType = TokenType.Type4 });
        tokens.Add(new Token { Name = "j", TokenType = TokenType.Type4 });
        tokens.Add(new Token { Name = "k", TokenType = TokenType.Type4 });
        tokens.Add(new Token { Name = "l", TokenType = TokenType.Type4 });

        tokens.Add(new Token { Name = "m", TokenType = TokenType.Type5 });

        StringBuilder stringed = new StringBuilder();

        foreach (Token token in tokens)
        {
            stringed.Append(token.Name);
            stringed.Append(", ");
        }

        Response.Write(stringed.ToString());
        Response.Write("</br>");


        var q = tokens.Split();
        foreach (var list in tokens.Split())
        {
            stringed = new StringBuilder();
            foreach (Token token in list)
            {
                stringed.Append(token.Name);
                stringed.Append(", ");
            }
            Response.Write(stringed.ToString());
            Response.Write("</br>");
        }

So all I am soing is using Linq, feel free to add or remove, you can actualy go crazy on this and group on many diferent properties.

Oakcool 2010-02-19 23:27:13

Just heads up, if you want you can declare a List of objects and change the method to test with "as" operator before atempting to read the property.Or you can just go nut and use reflection or even go to very low level with MSIL.

Oakcool 2010-02-19 23:31:00

Answer 9

+1 A:

This is the best I could do to eliminate as much of the allocation times as possible for the function (should only allocate when it goes over the capacity, which should be no more than what is required to create the largest sub list in the results). I've tested this implementation and it works as you described.

Please note that the results of the prior sub list are destroyed when the next list in the group is accessed.

public static IEnumerable<IEnumerable> Split(this  IEnumerable tokens, TokenType type)
{
    ArrayList currentT = new ArrayList();
    foreach (object list in tokens)
    {
        Token token = list as Token;
        if ((token != null) && token.TokenType == type)
        {
            yield return currentT;
            currentT.Clear();
            //currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance
        }
        else if ((list is TokenType) && ((TokenType)list) == type)
        {
            yield return currentT;
            currentT.Clear();
            //currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance
        }
        else
        {
            currentT.Add(list);
        }
    }
}

EDIT Here's another version that doesn't make use of another list at all (shouldn't be doing any allocations). Not sure how well this will compare, but it does work as requested (also I've got no idea how this one will go if you try to cache a sub 'array').

Also, both of these require a "using System.Collections" statement (in addition to the Generic namespace).

private static IEnumerable SplitInnerLoop(IEnumerator iter, TokenType type)
{
    do
    {
        Token token = iter.Current as Token;
        if ((token != null) && token.TokenType == type)
        {
            break;
        }
        else if ((iter.Current is TokenType) && ((TokenType)iter.Current) == type)
        {
            break;
        }
        else
        {
            yield return iter.Current;
        }
    } while (iter.MoveNext());
}

public static IEnumerable<IEnumerable> Split(this  IEnumerable tokens, TokenType type)
{
    IEnumerator iter = tokens.GetEnumerator();
    while (iter.MoveNext())
    {
        yield return SplitInnerLoop(iter, type);
    }
}

Grant Peters 2010-02-22 03:44:03

I'd also suggest that if this function seems to be your bottleneck, then you'll probably get a better performance boost by trying to find a better algorithm for the overall task you are trying to perform, or that you should cache the results of each operation (my implementation doesn't lend itself to caching though, if you want that, use the original implementation)

Grant Peters 2010-02-22 03:46:15

ansaurus

tags:

views:

answers:

Optimized Generic List Split

related questions