I'm trying to optimize a concurrent collection that tries to minimize lock contention for reads. First pass was using a linked list, which allowed me to only lock on writes while many simultaneous reads could continue unblocked. This used a custom IEnumerator to yield the next link value. Once i started comparing iteration over the collection to a plain List<T> i noticed my implementation was about half as fast (for from x in c select x on a collection of 1*m* items, i got 24ms for List<T> and 49ms for my collection).
So i thought i'd use a ReaderWriteLockSlim and sacrifice a little contention on reads so i could use a List<T> as my internal storage. Since I have to capture the read lock on iteration start and release it upon completion, i first did a yield pattern for my IEnumerable, foreaching over the internal List<T>. Now i was getting only 66ms.
I peeked at what List actually does and it uses an internal store of T[] and a custom IEnumerator that moves the index forward and returns the current index value. Now, manually using T[] as storage means a lot more maintenance work, but wth, i'm chasing microseconds.
However even mimicking the IEnumerator moving the index on an array, the best I could do was about ~38ms. So what gives List<T> its secret sauce or alternatively what's a faster implementation for an iterator?
UPDATE: Turns out my main speed culprit was running Debug compile, while List<T> is obviously a Release compile. In release my implementation is still a hair slower than List<T>, altough on mono it's now faster.
One other suggestion i got from a friend is that the BCL is faster because it's in the GAC and therefore can get pre-compiled by the system. Will have to put my test in the GAC to test that theory.