views:

552

answers:

6

I have a local class with a method used to build a list of strings, and I'm finding that when I call this method in a loop of 1000 iterations, it often doesn't return the number of strings I request.

I have a global variable:

string[] cachedKeys

A parameter passed to the method:

int requestedNumberToGet

The method looks similar to this:

List<string> keysToReturn = new List<string>();
int numberPossibleToGet = (cachedKeys.Length <= requestedNumberToGet) ? 
cachedKeys.Length : requestedNumberToGet;
Random rand = new Random();

DateTime breakoutTime = DateTime.Now.AddMilliseconds(5);

// Do we have enough to fill the request within the time? Otherwise give
// however many we currently have
while (DateTime.Now < breakoutTime
    && keysToReturn.Count < numberPossibleToGet
    && cachedKeys.Length >= numberPossibleToGet)
{
    string randomKey = cachedKeys[rand.Next(0, cachedKeys.Length)];
    if (!keysToReturn.Contains(randomKey))
        keysToReturn.Add(randomKey);
}

if (keysToReturn.Count != numberPossibleToGet)
    Debugger.Break();

I have approximately 40 strings in cachedKeys, none exceeding 15 characters in length.

I'm no expert with threading, so I'm literally just calling this method 1000 times in a loop and consistently hitting that debugger break.

The machine this is running on is a fairly beefy desktop, so I would expect the breakout time to be realistic; in fact it breaks at random points in the loop (I've seen it break in the 20s, 100s, 200s, and 300s of iterations).

Anyone have any ideas where I'm going wrong with this?

Edit: Limited to .NET 2.0

Edit: The purpose of the breakout is that if the method takes too long to execute, the clients (several web servers using the data for XML feeds) won't have to wait while the other project dependencies initialise; they'll just be given 0 results.

Edit: Thought I'd post the performance stats

Original

  • 0.0042477465711424217323710136 (10 iterations)
  • 0.0479597267250446634977350473 (100 iterations)
  • 0.4721072091564710039963179678 (1000 iterations)

Skeet

  • 0.0007076318358897569383818334 (10 iterations)
  • 0.007256508857969378789762386 (100 iterations)
  • 0.0749829936486341141122684587 (1000 iterations)

Freddy Rios

  • 0.0003765841748043396576939248 (10 iterations)
  • 0.0046003053460705201359390649 (100 iterations)
  • 0.0417058592642360970458535931 (1000 iterations)
A: 

The problem could quite possibly be here:

if (!keysToReturn.Contains(randomKey))
    keysToReturn.Add(randomKey);

This requires iterating over the list to determine whether the key is already in it. To be sure, though, you should try profiling this with a tool. Also, 5 ms (0.005 seconds) is a very tight window; you may want to increase it.

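Lacking a full profiler, a rough Stopwatch harness can at least confirm where the time goes. A minimal sketch timing List<string>.Contains against a stand-in for the ~40-key array described in the question (the class and key names are placeholders):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class ContainsTiming
{
    static void Main()
    {
        // Stand-in for cachedKeys: 40 short strings.
        List<string> keys = new List<string>();
        for (int i = 0; i < 40; i++)
            keys.Add("key" + i);

        const int lookups = 1000000;
        Stopwatch sw = Stopwatch.StartNew();
        int hits = 0;
        for (int i = 0; i < lookups; i++)
        {
            // Contains is a linear scan: O(n) string comparisons per call.
            if (keys.Contains("key" + (i % 40)))
                hits++;
        }
        sw.Stop();
        Console.WriteLine("{0} lookups ({1} hits) took {2} ms",
                          lookups, hits, sw.ElapsedMilliseconds);
    }
}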
Elie
This code will be running on a couple of uber web servers; this method may conceivably be hit thousands of times a second. 5ms was the target, so anything that'll get me closer to that is great. I know that List.Contains isn't the most performant; I'm just not sure what else to use. Recommended tools?
TreeUK
There have been several questions on SO about performance tuning and performance analysis tools; just do a search and include the language.
Elie
+8  A: 

Why not just take a copy of the list - O(n) - shuffle it - also O(n) - and then return the number of keys that have been requested? In fact, the shuffle only needs to be O(nRequested): keep swapping a random member of the unshuffled part of the list with the very start of the unshuffled part, then expand the shuffled part by 1 (it's just a notional counter).

EDIT: Here's some code which yields the results as an IEnumerable<T>. Note that it uses deferred execution, so if you change the source that's passed in before you first start iterating through the results, you'll see those changes. After the first result is fetched, the elements will have been cached.

static IEnumerable<T> TakeRandom<T>(IEnumerable<T> source,
                                    int sizeRequired,
                                    Random rng)
{
    List<T> list = new List<T>(source);

    sizeRequired = Math.Min(sizeRequired, list.Count);

    for (int i=0; i < sizeRequired; i++)
    {
        int index = rng.Next(list.Count-i);            
        T selected = list[i + index];
        list[i + index] = list[i];
        list[i] = selected;
        yield return selected;
    }
}

The idea is that at any point after you've fetched n elements, the first n elements of the list will be those elements - so we make sure that we don't pick those again. We then pick a random element from "the rest", swap it into the right position, and yield it.

Hope this helps. If you're using C# 3 you might want to make this an extension method by putting "this" in front of the first parameter.

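For example, a .NET 2.0 caller could materialize the results like this (a sketch building on the TakeRandom above; the wrapper name GetKeys is illustrative, and note the single shared Random):

// One shared Random: instances constructed in quick succession
// can end up with identical time-based seeds.
static readonly Random rng = new Random();

static List<string> GetKeys(string[] cachedKeys, int requestedNumberToGet)
{
    // List<T>'s constructor drains the deferred sequence eagerly.
    return new List<string>(TakeRandom(cachedKeys, requestedNumberToGet, rng));
}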
Jon Skeet
Nice sample, I've not used the yield keyword yet. How would you go about incorporating the breakout here?
TreeUK
I wouldn't, without good reason. This is an O(n) solution, and shouldn't take long at all. I suspect the reason your previous solution was sometimes failing was due to the granularity of the timer involved... do you really think it's going to fail to give back enough results in time, in real life?
Jon Skeet
Building this list was failing to bring back the results in sufficient time by itself, and this isn't even the key functionality. The strings are actually memcache IDs used for a Memcache.Gets call.
TreeUK
How big is the list in your real code? You may well find that this method - which is more efficient, despite the initial copying - works without the time constraints.
Jon Skeet
At the moment it only contains 49 items, though the estimate is for an increase to 200. Given the efficiency of the algorithms you guys have supplied, I've removed the breakout entirely. Thanks!
TreeUK
+4  A: 

A few thoughts.

First, your keysToReturn list is potentially being added to each time through the loop, right? You're creating an empty list and then adding each new key to it. Since the list was not pre-sized, any add that has to grow the backing array becomes an O(n) operation (see the MSDN documentation for List<T>.Add). To fix this, try pre-sizing your list like this:

int numberPossibleToGet = (cachedKeys.Length <= requestedNumberToGet) ? cachedKeys.Length : requestedNumberToGet;
List<string> keysToReturn = new List<string>(numberPossibleToGet);

Second, your breakout time is unrealistic (ok, ok, impossible) on Windows. All of the information I've ever read on Windows timing suggests that the best you can possibly hope for is 10 millisecond resolution, but in practice it's more like 15-18 milliseconds. In fact, try this code:

for (int iv = 0; iv < 10000; iv++) {
    Console.WriteLine( DateTime.Now.Millisecond.ToString() );
}

What you'll see in the output are discrete jumps. Here is a sample output that I just ran on my machine.

13
...
13
28
...
28
44
...
44
59
...
59
75
...

The millisecond value jumps from 13 to 28 to 44 to 59 to 75. That's roughly a 15-16 millisecond resolution in the DateTime.Now function for my machine. This behavior is consistent with what you'd see in the C runtime ftime() call. In other words, it's a systemic trait of the Windows timing mechanism. The point is, you should not rely on a consistent 5 millisecond breakout time because you won't get it.

Third, am I right to assume that the purpose of the breakout time is to prevent the main thread from locking up? If so, it'd be pretty easy to spawn your function off to a ThreadPool thread and let it run to completion regardless of how long it takes. Your main thread can then operate on the data.

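A minimal sketch of that hand-off, with a hypothetical FetchKeys method standing in for the real key selection (.NET 2.0-compatible; the caller owns the deadline rather than the worker racing one):

using System;
using System.Collections.Generic;
using System.Threading;

class KeyFetcher
{
    static List<string> result;
    static readonly ManualResetEvent done = new ManualResetEvent(false);

    static void Main()
    {
        ThreadPool.QueueUserWorkItem(delegate(object state)
        {
            // The worker runs to completion, however long it takes.
            result = FetchKeys(10);
            done.Set();
        });

        // The caller decides how long it is willing to wait.
        if (done.WaitOne(5, false))
            Console.WriteLine("Got {0} keys", result.Count);
        else
            Console.WriteLine("Timed out; serving 0 results");
    }

    // Hypothetical stand-in for the question's key-selection method.
    static List<string> FetchKeys(int count)
    {
        List<string> keys = new List<string>(count);
        for (int i = 0; i < count; i++)
            keys.Add("key" + i);
        return keys;
    }
}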
Matt Davis
Nice tip on the list sizing. Setting the breakout to 18 ms seems to have done the trick for 10,000 consecutive hits. Do you have any supporting links I can show people to demonstrate this is a more realistic target?
TreeUK
The timing information came from an internal study that our project conducted, and I don't have the final report with me. Our researcher got a lot of information from Google and experiments on local machines.
Matt Davis
Appreciated anyway :)
TreeUK
+2  A: 

Use a HashSet instead; HashSet is much faster for lookups than List:

HashSet<string> keysToReturn = new HashSet<string>();
int numberPossibleToGet = (cachedKeys.Length <= requestedNumberToGet) ? cachedKeys.Length : requestedNumberToGet;
Random rand = new Random();

DateTime breakoutTime = DateTime.Now.AddMilliseconds(5);
int length = cachedKeys.Length;

while (DateTime.Now < breakoutTime && keysToReturn.Count < numberPossibleToGet) {
    int i = rand.Next(0, length);
    // Add returns false when the item is already in the set; walk
    // forward (wrapping at the end) until we find an unused key.
    while (!keysToReturn.Add(cachedKeys[i])) {
        i++;
        if (i == length)
            i = 0;
    }
}
chaowman
Absolutely, I forgot to specify the .NET version (2.0) in my question. Will do so now.
TreeUK
Then you can use Dictionary instead of HashSet.
chaowman
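For reference, a minimal sketch of that substitution (Dictionary<TKey, TValue> is available in .NET 2.0; the bool values are throwaways, only the keys matter, and the method name GetKeys is just for illustration):

static ICollection<string> GetKeys(string[] cachedKeys, int requestedNumberToGet)
{
    int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
    Dictionary<string, bool> keysToReturn = new Dictionary<string, bool>();
    Random rand = new Random();
    DateTime breakoutTime = DateTime.Now.AddMilliseconds(5);
    int length = cachedKeys.Length;

    while (DateTime.Now < breakoutTime && keysToReturn.Count < numberPossibleToGet)
    {
        int i = rand.Next(0, length);
        // ContainsKey is an O(1) hash lookup, unlike List<T>.Contains.
        while (keysToReturn.ContainsKey(cachedKeys[i]))
        {
            i++;
            if (i == length)
                i = 0;
        }
        keysToReturn[cachedKeys[i]] = true;
    }
    return keysToReturn.Keys;
}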
+1  A: 

Consider using Stopwatch instead of DateTime.Now. It may simply be down to the inaccuracy of DateTime.Now when you're talking about milliseconds.

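A minimal sketch of that swap against the question's loop (same variables as in the question; Stopwatch lives in System.Diagnostics and is available on .NET 2.0):

Stopwatch breakout = Stopwatch.StartNew();

// Stopwatch uses the high-resolution performance counter where the
// hardware provides one, so a 5 ms budget is actually measurable.
while (breakout.ElapsedMilliseconds < 5
    && keysToReturn.Count < numberPossibleToGet)
{
    string randomKey = cachedKeys[rand.Next(0, cachedKeys.Length)];
    if (!keysToReturn.Contains(randomKey))
        keysToReturn.Add(randomKey);
}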
Daniel Earwicker
My concern here is over thread safety and the points raised about Stopwatch behaving unreliably on different machine setups (http://kristofverbiest.blogspot.com/2008/10/beware-of-stopwatch.html)
TreeUK
Indeed, you can only really compare results obtained from within a single-threaded test program, where the whole test runs in a single execution of the program. If you run it twice, it may be equivalent to re-running it on different hardware!
Daniel Earwicker
+3  A: 

The main issue is the use of retries in a random scenario to ensure you get unique values. This quickly gets out of control, especially if the number of items requested is close to the total number of items available; increasing the number of keys makes the issue show up less often, but it can be avoided entirely.

The following method does it by keeping a list of the keys remaining.

List<string> GetSomeKeys(string[] cachedKeys, int requestedNumberToGet)
{
    int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
    List<string> keysRemaining = new List<string>(cachedKeys);
    List<string> keysToReturn = new List<string>(numberPossibleToGet);
    Random rand = new Random();
    for (int i = 0; i < numberPossibleToGet; i++)
    {
        int randomIndex = rand.Next(keysRemaining.Count);
        keysToReturn.Add(keysRemaining[randomIndex]);
        keysRemaining.RemoveAt(randomIndex);
    }
    return keysToReturn;
}

The timeout was necessary in your version because you could potentially keep retrying for a long time, especially when you wanted to retrieve the whole list, in which case the retry-based version would almost certainly fail.

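For intuition: drawing with replacement from n keys takes about n·(1 + 1/2 + … + 1/n) ≈ n ln n draws on average before every key has been seen (the coupon-collector bound). With the question's ~40 keys that is roughly 40 × 4.28 ≈ 171 draws, most of them burned re-drawing keys you already have once only a few remain.
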
Update: The above performs better than these variations:

List<string> GetSomeKeysSwapping(string[] cachedKeys, int requestedNumberToGet)
{
    int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
    List<string> keys = new List<string>(cachedKeys);
    List<string> keysToReturn = new List<string>(numberPossibleToGet);
    Random rand = new Random();
    for (int i = 0; i < numberPossibleToGet; i++)
    {
        int index = rand.Next(keys.Count - i) + i; // pick from the whole unshuffled remainder
        keysToReturn.Add(keys[index]);
        keys[index] = keys[i];
    }
    return keysToReturn;
}
List<string> GetSomeKeysEnumerable(string[] cachedKeys, int requestedNumberToGet)
{
    Random rand = new Random();
    // List<T>'s constructor drains the IEnumerable<T>; .ToList() is
    // LINQ-only and unavailable on the question's .NET 2.0 target.
    return new List<string>(TakeRandom(cachedKeys, requestedNumberToGet, rand));
}

Some numbers with 10,000 iterations:

Function Name             Elapsed Inclusive Time    Number of Calls
GetSomeKeys                6,190.66                  10,000
GetSomeKeysEnumerable     15,617.04                  10,000
GetSomeKeysSwapping        8,293.64                  10,000
eglasius
Wow - I'm surprised it's the fastest! I guess List<>.RemoveAt() is more performant than I thought. +1 for thoroughness!
teedyay
@teedyay To be honest, I was too. The important change is removing the retries; from there on it's minor differences. None of the swapping implementations need Lists, so I wonder if simple arrays would make a difference... that said, I find the remaining-keys version easier to understand :)
eglasius
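For reference, an array-based working copy might look like this (a sketch of the variation being wondered about, mirroring GetSomeKeysSwapping above):

List<string> GetSomeKeysSwappingArray(string[] cachedKeys, int requestedNumberToGet)
{
    int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
    string[] keys = (string[])cachedKeys.Clone(); // plain array instead of List<string>
    List<string> keysToReturn = new List<string>(numberPossibleToGet);
    Random rand = new Random();
    for (int i = 0; i < numberPossibleToGet; i++)
    {
        int index = rand.Next(keys.Length - i) + i;
        keysToReturn.Add(keys[index]);
        keys[index] = keys[i];
    }
    return keysToReturn;
}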
I swapped the keysToReturn to an array to test the performance difference and couldn't see any noticeable change in the stats.
TreeUK
Ok, it was just a guess; good to know :) Just make sure you are doing this out of curiosity/understanding - we don't want to be caught over-optimizing :) (unless there is a need for it and you have identified this as the slow piece). Any of the no-retries versions is good enough for most uses.
eglasius