views:

334

answers:

9

Read the edit below for more information.

I have some code below that I use to split a generic list of Object when the item is of a certain type.

    public static IEnumerable<object>[] Split(this  IEnumerable<object> tokens, TokenType type) {

        List<List<object>> t = new List<List<object>>();
        int currentT = 0;
        t.Add(new List<object>());
        foreach (object list in tokens) {
            if ((list is Token) && (list as Token).TokenType == type) {
                currentT++;
                t.Add(new List<object>());

            }
            else if ((list is TokenType) && ((TokenType)list )== type) {
                currentT++;
                t.Add(new List<object>());

            }
            else {
                t[currentT].Add(list);
            }
        }

        return t.ToArray();
    }

I dont have a clear question as much as I am curious if anyone knows of any ways I can optimize this code. I call it many times and it seems to be quite the beast as far as clock cycles go. Any ideas? I can also make it a Wiki if anyone is interested, maybe we can keep track of the latest changes.

Update: Im trying to parse out specific tokens. Its a list of some other class and Token classes. Token has a property (enum) of TokenType. I need to find all the Token classes and split on each of them.

{a b c T d e T f g h T i j k l T m} 

would split like

{a b c}{d e}{f g h}{i j k l}{m}

EDIT UPDATE: It seems like all of my speed problems come into the constant creation and addition of Generic Lists. Does anyone know how I can go about this without that? This is the profile of what is happening if it helps anyone.

alt text

+4  A: 

Your code looks fine.

My only suggestion would be replacing IEnumerable<object> with the non-generic IEnumerable. (In System.Collections)

EDIT:

On further inspection, you're casting more times than necessary.

Replace the if with the following code:

var token = list as Token;
if (token != null && token.TokenType == type) {

Also, you can get rid your currentT variable by writing t[t.Count - 1] or t.Last(). This will make the code clearer, but might have a tiny negative effect on performance.
Alternatively, you could store a reference to the inner list in a variable and use it directly. (This will slightly improve performance)

Finally, if you can change the return type to List<List<Object>>, you can return t directly; this will avoid an array copy and will be noticeably faster for large lists.

By the way, your variable names are confusing; you should swap the names of t and list.

SLaks
A: 

Do you need to convert it to an array? You could potentially use LINQ and delayed execution to return the results.

EDIT:
With the clarified question it would be hard to bend LINQ to make it return the results the way you want. If you still want to have the execution of each cycle delayed you could write your own enumerator.

I recommend perf testing this compared to the other options to see if there is a performance gain for your scenario if you attempt this approach. It might cause more overhead managing the iterator which would be bad for cases with little data.

I hope this helps.

// This is the easy way to make your own iterator using the C# syntax
// It will return sets of separated tokens in a lazy fashion
// This sample is based on the version provided by @Ants
public static IEnumerable<IEnumerable<object>> Split(this IEnumerable<object> tokens,
                                          TokenType type) {        
    var current = new List<object>();

    foreach (var item in tokens) 
    {
        Token token = item as Token;
        if (token != null && token.TokenType == type) 
        {
            if( current.Count > 0)
            {
                yield return current;
                current = new List<object>();
            }
        }
        else 
        {
            current.Add(item);
        }
    }

    if( current.Count > 0)
        yield return current;
}

Warning: This compiles but has still might have hidden bugs. It is getting late here.

// This is doing the same thing but doing it all by hand. 
// You could use this method as well to lazily iterate through the 'current' list as well
// This is probably overkill and substantially more complex
public class TokenSplitter : IEnumerable<IEnumerable<object>>, IEnumerator<IEnumerable<object>>
{
    IEnumerator<object> _enumerator;
    IEnumerable<object> _tokens;
    TokenType _target;

    List<object> _current;
    bool _isDone = false;

    public TokenSplitter(IEnumerable<object> tokens, TokenType target)
    {
        _tokens = tokens;
        _target = target;
        Reset();
    }        

    // Cruft from the IEnumerable and generic IEnumerator
    public IEnumerator<IEnumerable<object>> GetEnumerator() { return this; }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() 
    {
        return GetEnumerator();
    }

    public IEnumerable<object> Current { get { return _current; } }                
    public void Dispose() { }        

    #region IEnumerator Members

    object System.Collections.IEnumerator.Current { get { return Current; } }

    // See if there is anything left to get
    public bool MoveNext()
    {
        if (_isDone) return false;

        FillCurrent();

        return !_isDone;            
    }

    // Reset the enumerators so that you could reuse this structure if you wanted
    public void Reset()
    {
        _isDone = false; 
        _enumerator = _tokens.GetEnumerator();
        _current = new List<object>();
        FillCurrent();
    }

    // Fills the current set of token and then begins the next set
    private void FillCurrent()
    {
        // Try to accumulate as many tokens as possible, this too could be an enumerator to delay the process more
        bool hasNext = _enumerator.MoveNext();
        for( ; hasNext; hasNext = _enumerator.MoveNext())
        {            
            Token token = _enumerator.Current as Token;
            if (token == null || token.TokenType != _target)
            {
                _current.Add(_enumerator.Current);
            }
            else
            {
                _current = new List<object>();
            }
        }

        // Continue removing matching tokens and begin creating the next set
        for( ; hasNext; hasNext = _enumerator.MoveNext())
        {
            Token token = _enumerator.Current as Token;
            if (token == null || token.TokenType != _target)
            {
                _current.Add(_enumerator.Current);
                break;
            }
        }

        _isDone = !hasNext;
    }

    #endregion
}
smaclell
I may be able to use this method, can you elaborate on how it would be done?
Dested
+2  A: 

My first thought would be instead of looking up t[currentT] all the time, just store a currentList and add directly to that.

Anon.
This would make the code much simpler and more readable.
SLaks
+1  A: 

I think that there are broken cases for these scenarios where assuming that list items are lower case letters, and the item with matching token type is T:

  1. { T a b c ... };
  2. { ... x y z T };
  3. { ... j k l T T m n o ... };
  4. { T }; and
  5. { }

Which will result in:

  1. { { } { a b c ... } };
  2. { { ... x y z } { } };
  3. { { ... j k l } { } { } { m n o ... } };
  4. { { } }; and
  5. { }

Doing a straight refactoring:

public static IEnumerable<object>[] Split(this IEnumerable<object> tokens,
                                          TokenType type) {
    var outer = new List<List<object>>();
    var inner = new List<object>();
    foreach (var item in tokens) {
        Token token = item as token;
        if (token != null && token.TokenType == type) {
            outer.Add(inner);
            inner = new List<object>();
            continue;
        }
        inner.Add(item);
    }
    outer.Add(inner);
    return outer.ToArray();
}

To fix the broken cases (assuming that those are truly broken), I recommend:

public static IEnumerable<object>[] Split(this IEnumerable<object> tokens,
                                          TokenType type) {
    var outer = new List<List<object>>();
    var inner = new List<object>();
    var enumerator = tokens.GetEnumerator();
    while (enumerator.MoveNext()) {
        Token token = enumerator.Current as token;
        if (token == null || token.TokenType != type) {
            inner.Add(enumerator.Current);
        }
        else if (inner.Count > 0) {
            outer.Add(inner);
            inner = new List<object>();
        }
    }
    return outer.ToArray();
}

Which will result in:

  1. { { a b c ... } };
  2. { { ... x y z } };
  3. { { ... j k l } { m n o ... } };
  4. { }; and
  5. { }
Ants
+3  A: 

Type-testing and casts can be a performance killer. If at all possible, your token types should implement a common interface or abstract class. Instead of passing in and object, you should pass in an IToken which wraps your object.

Here's some concept code you can use to get started:

using System;
using System.Collections.Generic;

namespace Juliet
{
    interface IToken<T>
    {
        bool IsDelimeter { get; }
        T Data { get; }
    }

    class DelimeterToken<T> : IToken<T>
    {
        public bool IsDelimeter { get { return true; } }
        public T Data { get { throw new Exception("No data"); } }
    }

    class DataToken<T> : IToken<T>
    {
        public DataToken(T data)
        {
            this.Data = data;
        }

        public bool IsDelimeter { get { return false; } }
        public T Data { get; private set; }
    }

    class TokenFactory<T>
    {
        public IToken<T> Make()
        {
            return new DelimeterToken<T>();
        }

        public IToken<T> Make(T data)
        {
            return new DataToken<T>(data);
        }
    }

    class Program
    {

        static List<List<T>> SplitTokens<T>(IEnumerable<IToken<T>> tokens)
        {
            List<List<T>> res = new List<List<T>>();
            foreach (IToken<T> token in tokens)
            {
                if (token.IsDelimeter)
                {
                    res.Add(new List<T>());
                }
                else
                {
                    if (res.Count == 0)
                    {
                        res.Add(new List<T>());
                    }

                    res[res.Count - 1].Add(token.Data);
                }
            }

            return res;
        }

        static void Main(string[] args)
        {
            TokenFactory<string> factory = new TokenFactory<string>();
            IToken<string>[] tokens = new IToken<string>[]
                {
                    factory.Make("a"), factory.Make("b"), factory.Make("c"), factory.Make(),
                    factory.Make("d"), factory.Make("e"), factory.Make(),
                    factory.Make("f"), factory.Make("g"), factory.Make("h"), factory.Make(),
                    factory.Make("i"), factory.Make("j"), factory.Make("k"), factory.Make("l"), factory.Make(),
                    factory.Make("m")
                };

            List<List<string>> splitTokens = SplitTokens(tokens);
            for (int i = 0; i < splitTokens.Count; i++)
            {
                Console.Write("{");
                for (int j = 0; j < splitTokens[i].Count; j++)
                {
                    Console.Write("{0}, ", splitTokens[i][j]);
                }
                Console.Write("}");
            }

            Console.ReadKey(true);
        }
    }
}

In principle, you can create instances of IToken<object> to have it generalized to tokens of multiple classes.

Juliet
+1  A: 

Using LINQ you could try this: (I did not test it...)

    public static IEnumerable<object>[] Split(this  IEnumerable<object> tokens, TokenType type)
    {
        List<List<object>> l = new List<List<object>>();
        l.Add(new List<object>());
        return tokens.Aggregate(l, (c, n) => 
        {
            var t = n as Token;
            if (t != null && t.TokenType == type)
            {
                t.Add(new List<object>());
            }
            else
            {
                l.Last().Add(n);
            }
            return t;
        }).ToArray();
    }

Second try:

public static IEnumerable<object>[] Split(this  IEnumerable<object> tokens, TokenType type)
{
    var indexes = tokens.Select((t, index) => new { token = t, index = index }).OfType<Token>().Where(t => t.token.TokenType == type).Select(t => t.index);
    int prevIndex = 0;
    foreach (int item in indexes)
    {
        yield return tokens.Where((t, index) => (index > prevIndex && index < item));
        prevIndex = item;
    }
}
moi_meme
Interesting approach! The speed problems seem to come from the creation and enumeration of the generic lists. What I really need is a solution that could somehow just return me the pieces of the Ienumerable<object> that I want in an array, as opposed to constantly adding to a List.
Dested
@Dested : well, it seems I missed your comment. If you want to do arrays, just replace the list creation and `list.AddRange(EnumerateArraySegment(items, startIndex, endIndex - 1));` section in my last piece of code with `Array.Copy()`.I hope this helps.
andras
+3  A: 

A: An all-lazy implementation will suffice if you just iterate through the results in a nested foreach:

using System;
using System.Collections.Generic;

public static class Splitter
{
    public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> source, Predicate<T> match)
    {
        using (IEnumerator<T> enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                yield return Split(enumerator, match);
            }
        }
    }

    static IEnumerable<T> Split<T>(IEnumerator<T> enumerator, Predicate<T> match)
    {
        do
        {
            if (match(enumerator.Current))
            {
                yield break;
            }
            else
            {
                yield return enumerator.Current;
            }
        } while (enumerator.MoveNext());
    }
}

Use it like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace MyTokenizer
{
    class Program
    {
        enum TokenTypes { SimpleToken, UberToken }

        class Token { public TokenTypes TokenType = TokenTypes.SimpleToken;    }

        class MyUberToken : Token { public MyUberToken() { TokenType = TokenTypes.UberToken; } }

        static void Main(string[] args)
        {
            List<object> objects = new List<object>(new object[] { "A", Guid.NewGuid(), "C", new MyUberToken(), "D", new MyUberToken(), "E", new MyUberToken() });
            var splitOn = TokenTypes.UberToken;
            foreach (var list in objects.Split(x => x is Token && ((Token)x).TokenType == splitOn))
            {
                foreach (var item in list)
                {
                    Console.WriteLine(item);
                }
                Console.WriteLine("==============");
            }
            Console.ReadKey();
        }

    }
}

B: If you need to process the results some time later and you wish to do it out-of-order, or you partition on one thread and then possibly dispatch the segments to multiple threads, then this would probably provide a good starting point:

using System;
using System.Collections.Generic;
using System.Linq;

public static class Splitter2
{
    public static IEnumerable<IEnumerable<T>> SplitToSegments<T>(this IEnumerable<T> source, Predicate<T> match)
    {
        T[] items = source.ToArray();
        for (int startIndex = 0; startIndex < items.Length; startIndex++)
        {
            int endIndex = startIndex;
            for (; endIndex < items.Length; endIndex++)
            {
                if (match(items[endIndex])) break;
            }
            yield return EnumerateArraySegment(items, startIndex, endIndex - 1);
            startIndex = endIndex;
        }
    }

    static IEnumerable<T> EnumerateArraySegment<T>(T[] array, int startIndex, int endIndex)
    {
        for (; startIndex <= endIndex; startIndex++)
        {
            yield return array[startIndex];
        }
    }
}

C: If you really must return a collection of List<T> -s - which I doubt, unless you explicitly want to mutate them some time later on -, then try to initialize them to a given capacity before copying:

public static List<List<T>> SplitToLists<T>(this IEnumerable<T> source, Predicate<T> match)
{
    List<List<T>> lists = new List<List<T>>();
    T[] items = source.ToArray();
    for (int startIndex = 0; startIndex < items.Length; startIndex++)
    {
        int endIndex = startIndex;
        for (; endIndex < items.Length; endIndex++)
        {
            if (match(items[endIndex])) break;
        }
        List<T> list = new List<T>(endIndex - startIndex);
        list.AddRange(EnumerateArraySegment(items, startIndex, endIndex - 1));
        lists.Add(list);
        startIndex = endIndex;
    }
    return lists;
}

D: If this is still not enough, I suggest you roll your own lightweight List implementation that can copy a range directly to its internal array from another instance.

andras
I like it! That is super clean. Well done.
smaclell
@Dested: If you need arrays, just replace the part around `list.Addrange` with creating a new array and doing an `Array.Copy`.
andras
+1  A: 

Here is one possibility

The Token class ( could be what ever class )

public class Token
{
    public string Name { get; set; }
    public TokenType TokenType { get; set; }
}

Now the Type enum ( this could be what ever other grouping factor )

public enum  TokenType
{
    Type1,
    Type2,
    Type3,
    Type4,
    Type5,
}

The Extention Method (Declare this anyway you choose)

public static class TokenExtension
{
    public static IEnumerable<Token>[] Split(this IEnumerable<Token> tokens)
    {
        return tokens.GroupBy(token => ((Token)token).TokenType).ToArray();
    }
}

Sample of use ( I used a web project to spin this )

List<Token> tokens = new List<Token>();
        tokens.Add(new Token { Name = "a", TokenType = TokenType.Type1 });
        tokens.Add(new Token { Name = "b", TokenType = TokenType.Type1 });
        tokens.Add(new Token { Name = "c", TokenType = TokenType.Type1 });

        tokens.Add(new Token { Name = "d", TokenType = TokenType.Type2 });
        tokens.Add(new Token { Name = "e", TokenType = TokenType.Type2  });

        tokens.Add(new Token { Name = "f", TokenType = TokenType.Type3 });
        tokens.Add(new Token { Name = "g", TokenType = TokenType.Type3 });
        tokens.Add(new Token { Name = "h", TokenType = TokenType.Type3 });

        tokens.Add(new Token { Name = "i", TokenType = TokenType.Type4 });
        tokens.Add(new Token { Name = "j", TokenType = TokenType.Type4 });
        tokens.Add(new Token { Name = "k", TokenType = TokenType.Type4 });
        tokens.Add(new Token { Name = "l", TokenType = TokenType.Type4 });

        tokens.Add(new Token { Name = "m", TokenType = TokenType.Type5 });

        StringBuilder stringed = new StringBuilder();

        foreach (Token token in tokens)
        {
            stringed.Append(token.Name);
            stringed.Append(", ");
        }

        Response.Write(stringed.ToString());
        Response.Write("</br>");


        var q = tokens.Split();
        foreach (var list in tokens.Split())
        {
            stringed = new StringBuilder();
            foreach (Token token in list)
            {
                stringed.Append(token.Name);
                stringed.Append(", ");
            }
            Response.Write(stringed.ToString());
            Response.Write("</br>");
        }

So all I am soing is using Linq, feel free to add or remove, you can actualy go crazy on this and group on many diferent properties.

Oakcool
Just heads up, if you want you can declare a List of objects and change the method to test with "as" operator before atempting to read the property.Or you can just go nut and use reflection or even go to very low level with MSIL.
Oakcool
+1  A: 

This is the best I could do to eliminate as much of the allocation times as possible for the function (should only allocate when it goes over the capacity, which should be no more than what is required to create the largest sub list in the results). I've tested this implementation and it works as you described.

Please note that the results of the prior sub list are destroyed when the next list in the group is accessed.

public static IEnumerable<IEnumerable> Split(this  IEnumerable tokens, TokenType type)
{
    ArrayList currentT = new ArrayList();
    foreach (object list in tokens)
    {
        Token token = list as Token;
        if ((token != null) && token.TokenType == type)
        {
            yield return currentT;
            currentT.Clear();
            //currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance
        }
        else if ((list is TokenType) && ((TokenType)list) == type)
        {
            yield return currentT;
            currentT.Clear();
            //currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance
        }
        else
        {
            currentT.Add(list);
        }
    }
}

EDIT Here's another version that doesn't make use of another list at all (shouldn't be doing any allocations). Not sure how well this will compare, but it does work as requested (also I've got no idea how this one will go if you try to cache a sub 'array').

Also, both of these require a "using System.Collections" statement (in addition to the Generic namespace).

private static IEnumerable SplitInnerLoop(IEnumerator iter, TokenType type)
{
    do
    {
        Token token = iter.Current as Token;
        if ((token != null) && token.TokenType == type)
        {
            break;
        }
        else if ((iter.Current is TokenType) && ((TokenType)iter.Current) == type)
        {
            break;
        }
        else
        {
            yield return iter.Current;
        }
    } while (iter.MoveNext());
}

public static IEnumerable<IEnumerable> Split(this  IEnumerable tokens, TokenType type)
{
    IEnumerator iter = tokens.GetEnumerator();
    while (iter.MoveNext())
    {
        yield return SplitInnerLoop(iter, type);
    }
}
Grant Peters
I'd also suggest that if this function seems to be your bottleneck, then you'll probably get a better performance boost by trying to find a better algorithm for the overall task you are trying to perform, or that you should cache the results of each operation (my implementation doesn't lend itself to caching though, if you want that, use the original implementation)
Grant Peters