views:

552

answers:

6

I have a local class with a method used to build a list of strings, and I'm finding that when I call this method in a loop of 1000 iterations, it often doesn't return the number of strings I request.

I have a global variable:

string[] cachedKeys

A parameter passed to the method:

int requestedNumberToGet

The method looks similar to this:

List<string> keysToReturn = new List<string>();
int numberPossibleToGet = (cachedKeys.Length <= requestedNumberToGet) ? 
cachedKeys.Length : requestedNumberToGet;
Random rand = new Random();

DateTime breakoutTime = DateTime.Now.AddMilliseconds(5);

// Do we have enough to fill the request within the time? Otherwise give
// however many we currently have
while (DateTime.Now < breakoutTime
    && keysToReturn.Count < numberPossibleToGet
    && cachedKeys.Length >= numberPossibleToGet)
{
    string randomKey = cachedKeys[rand.Next(0, cachedKeys.Length)];
    if (!keysToReturn.Contains(randomKey))
        keysToReturn.Add(randomKey);
}

if (keysToReturn.Count != numberPossibleToGet)
    Debugger.Break();

I have approximately 40 strings in cachedKeys, none exceeding 15 characters in length.

I'm no expert with threading, so I'm literally just calling this method 1000 times in a loop and consistently hitting that debugger break.

The machine this is running on is a fairly beefy desktop, so I would expect the breakout time to be realistic; in fact it breaks at random points in the loop (I've seen it break in the 20s, 100s, 200s, and 300s of iterations).

Anyone have any ideas where I'm going wrong with this?

Edit: Limited to .NET 2.0

Edit: The purpose of the breakout is that if the method takes too long to execute, the clients (several web servers using the data for XML feeds) won't have to wait while the other project dependencies initialise; they'll just be given 0 results.

Edit: Thought I'd post the performance stats

Original

  • 0.0042477465711424217323710136 (10 iterations)
  • 0.0479597267250446634977350473 (100 iterations)
  • 0.4721072091564710039963179678 (1000 iterations)

Skeet

  • 0.0007076318358897569383818334 (10 iterations)
  • 0.007256508857969378789762386 (100 iterations)
  • 0.0749829936486341141122684587 (1000 iterations)

Freddy Rios

  • 0.0003765841748043396576939248 (10 iterations)
  • 0.0046003053460705201359390649 (100 iterations)
  • 0.0417058592642360970458535931 (1000 iterations)
A: 

The problem could quite possibly be here:

if (!keysToReturn.Contains(randomKey))
    keysToReturn.Add(randomKey);

This requires iterating over the list to determine whether the key is already in it. To be sure, though, you should try profiling this with a tool. Also, 5 ms (0.005 seconds) is a very tight window; you may want to increase it.

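Lacking a full profiler, a rough Stopwatch harness can at least confirm where the time goes. A minimal sketch timing List<string>.Contains against a stand-in for the ~40-key array described in the question (the class and key names are placeholders):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class ContainsTiming
{
    static void Main()
    {
        // Stand-in for cachedKeys: 40 short strings.
        List<string> keys = new List<string>();
        for (int i = 0; i < 40; i++)
            keys.Add("key" + i);

        const int lookups = 1000000;
        Stopwatch sw = Stopwatch.StartNew();
        int hits = 0;
        for (int i = 0; i < lookups; i++)
        {
            // Contains is a linear scan: O(n) string comparisons per call.
            if (keys.Contains("key" + (i % 40)))
                hits++;
        }
        sw.Stop();
        Console.WriteLine("{0} lookups ({1} hits) took {2} ms",
                          lookups, hits, sw.ElapsedMilliseconds);
    }
}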
Elie
This code will be running on a couple of uber web servers; this method may conceivably be hit thousands of times a second. 5ms was the target, so anything that'll get me closer to that is great. I know that List.Contains isn't the most performant; I'm just not sure what else to use. Recommended tools?
TreeUK
There have been several questions on SO about performance tuning and performance analysis tools; just do a search and include the language.
Elie
+8  A: 

Why not just take a copy of the list - O(n) - shuffle it - also O(n) - and then return the number of keys that have been requested? In fact, the shuffle only needs to be O(nRequested): keep swapping a random member of the unshuffled part of the list with the very start of the unshuffled part, then expand the shuffled part by 1 (it's just a notional counter).

EDIT: Here's some code which yields the results as an IEnumerable<T>. Note that it uses deferred execution, so if you change the source that's passed in before you first start iterating through the results, you'll see those changes. After the first result is fetched, the elements will have been cached.

static IEnumerable<T> TakeRandom<T>(IEnumerable<T> source,
                                    int sizeRequired,
                                    Random rng)
{
    List<T> list = new List<T>(source);

    sizeRequired = Math.Min(sizeRequired, list.Count);

    for (int i=0; i < sizeRequired; i++)
    {
        int index = rng.Next(list.Count-i);            
        T selected = list[i + index];
        list[i + index] = list[i];
        list[i] = selected;
        yield return selected;
    }
}

The idea is that at any point after you've fetched n elements, the first n elements of the list will be those elements - so we make sure that we don't pick those again. We then pick a random element from "the rest", swap it into the right position, and yield it.

Hope this helps. If you're using C# 3 you might want to make this an extension method by putting "this" in front of the first parameter.

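For example, a .NET 2.0 caller could materialize the results like this (a sketch building on the TakeRandom above; the wrapper name GetKeys is illustrative, and note the single shared Random):

// One shared Random: instances constructed in quick succession
// can end up with identical time-based seeds.
static readonly Random rng = new Random();

static List<string> GetKeys(string[] cachedKeys, int requestedNumberToGet)
{
    // List<T>'s constructor drains the deferred sequence eagerly.
    return new List<string>(TakeRandom(cachedKeys, requestedNumberToGet, rng));
}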
Jon Skeet
Nice sample, I've not used the yield keyword yet. How would you go about incorporating the breakout here?
TreeUK
I wouldn't, without good reason. This is an O(n) solution, and shouldn't take long at all. I suspect the reason your previous solution was sometimes failing was due to the granularity of the timer involved... do you really think it's going to fail to give back enough results in time, in real life?
Jon Skeet
Building this list was failing to bring back the results in sufficient time by itself, and this isn't even the key functionality. The strings are actually memcache IDs used for a Memcache.Gets call.
TreeUK
How big is the list in your real code? You may well find that this method - which is more efficient, despite the initial copying - works without the time constraints.
Jon Skeet
At the moment it only contains 49 items, though the estimate is for an increase to 200. Given the efficiency of the algorithms you guys have supplied, I've removed the breakout entirely. Thanks!
TreeUK
+4  A: 

A few thoughts.

First, your keysToReturn list is potentially being added to each time through the loop, right? You're creating an empty list and then adding each new key to it. Since the list was not pre-sized, any add that has to grow the backing array becomes an O(n) operation (see the MSDN documentation for List<T>.Add). To fix this, try pre-sizing your list like this:

int numberPossibleToGet = (cachedKeys.Length <= requestedNumberToGet) ? cachedKeys.Length : requestedNumberToGet;
List<string> keysToReturn = new List<string>(numberPossibleToGet);

Second, your breakout time is unrealistic (ok, ok, impossible) on Windows. All of the information I've ever read on Windows timing suggests that the best you can possibly hope for is 10 millisecond resolution, but in practice it's more like 15-18 milliseconds. In fact, try this code:

for (int iv = 0; iv < 10000; iv++) {
    Console.WriteLine( DateTime.Now.Millisecond.ToString() );
}

What you'll see in the output are discrete jumps. Here is a sample output that I just ran on my machine.

13
...
13
28
...
28
44
...
44
59
...
59
75
...

The millisecond value jumps from 13 to 28 to 44 to 59 to 75. That's roughly a 15-16 millisecond resolution in the DateTime.Now function for my machine. This behavior is consistent with what you'd see in the C runtime ftime() call. In other words, it's a systemic trait of the Windows timing mechanism. The point is, you should not rely on a consistent 5 millisecond breakout time because you won't get it.

Third, am I right to assume that the purpose of the breakout time is to prevent the main thread from locking up? If so, it'd be pretty easy to spawn your function off to a ThreadPool thread and let it run to completion regardless of how long it takes. Your main thread can then operate on the data.

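A minimal sketch of that hand-off, with a hypothetical FetchKeys method standing in for the real key selection (.NET 2.0-compatible; the caller owns the deadline rather than the worker racing one):

using System;
using System.Collections.Generic;
using System.Threading;

class KeyFetcher
{
    static List<string> result;
    static readonly ManualResetEvent done = new ManualResetEvent(false);

    static void Main()
    {
        ThreadPool.QueueUserWorkItem(delegate(object state)
        {
            // The worker runs to completion, however long it takes.
            result = FetchKeys(10);
            done.Set();
        });

        // The caller decides how long it is willing to wait.
        if (done.WaitOne(5, false))
            Console.WriteLine("Got {0} keys", result.Count);
        else
            Console.WriteLine("Timed out; serving 0 results");
    }

    // Hypothetical stand-in for the question's key-selection method.
    static List<string> FetchKeys(int count)
    {
        List<string> keys = new List<string>(count);
        for (int i = 0; i < count; i++)
            keys.Add("key" + i);
        return keys;
    }
}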
Matt Davis
Nice tip on the list sizing. Setting the breakout to 18 ms seems to have done the trick for 10,000 consecutive hits. Do you have any supporting links I can show people to demonstrate this is a more realistic target?
TreeUK
The timing information came from an internal study that our project conducted, and I don't have the final report with me. Our researcher got a lot of information from Google and experiments on local machines.
Matt Davis
Appreciated anyway :)
TreeUK
+2  A: 

Use a HashSet instead; HashSet is much faster for lookups than List:

HashSet<string> keysToReturn = new HashSet<string>();
int numberPossibleToGet = (cachedKeys.Length <= requestedNumberToGet) ? cachedKeys.Length : requestedNumberToGet;
Random rand = new Random();

DateTime breakoutTime = DateTime.Now.AddMilliseconds(5);
int length = cachedKeys.Length;

while (DateTime.Now < breakoutTime && keysToReturn.Count < numberPossibleToGet) {
    int i = rand.Next(0, length);
    // Add returns false when the item is already in the set; walk
    // forward (wrapping at the end) until we find an unused key.
    while (!keysToReturn.Add(cachedKeys[i])) {
        i++;
        if (i == length)
            i = 0;
    }
}
chaowman
Absolutely, I forgot to specify the .NET version (2.0) in my question. Will do so now.
TreeUK
Then you can use Dictionary instead of HashSet.
chaowman
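For reference, a minimal sketch of that substitution (Dictionary<TKey, TValue> is available in .NET 2.0; the bool values are throwaways, only the keys matter, and the method name GetKeys is just for illustration):

static ICollection<string> GetKeys(string[] cachedKeys, int requestedNumberToGet)
{
    int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
    Dictionary<string, bool> keysToReturn = new Dictionary<string, bool>();
    Random rand = new Random();
    DateTime breakoutTime = DateTime.Now.AddMilliseconds(5);
    int length = cachedKeys.Length;

    while (DateTime.Now < breakoutTime && keysToReturn.Count < numberPossibleToGet)
    {
        int i = rand.Next(0, length);
        // ContainsKey is an O(1) hash lookup, unlike List<T>.Contains.
        while (keysToReturn.ContainsKey(cachedKeys[i]))
        {
            i++;
            if (i == length)
                i = 0;
        }
        keysToReturn[cachedKeys[i]] = true;
    }
    return keysToReturn.Keys;
}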
+1  A: 

Consider using Stopwatch instead of DateTime.Now. It may simply be down to the inaccuracy of DateTime.Now when you're talking about milliseconds.

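A minimal sketch of that swap against the question's loop (same variables as in the question; Stopwatch lives in System.Diagnostics and is available on .NET 2.0):

Stopwatch breakout = Stopwatch.StartNew();

// Stopwatch uses the high-resolution performance counter where the
// hardware provides one, so a 5 ms budget is actually measurable.
while (breakout.ElapsedMilliseconds < 5
    && keysToReturn.Count < numberPossibleToGet)
{
    string randomKey = cachedKeys[rand.Next(0, cachedKeys.Length)];
    if (!keysToReturn.Contains(randomKey))
        keysToReturn.Add(randomKey);
}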
Daniel Earwicker
My concern here is over thread safety and the points raised about Stopwatch behaving unreliably on different machine setups (http://kristofverbiest.blogspot.com/2008/10/beware-of-stopwatch.html)
TreeUK
Indeed, you can only really compare results obtained from within a single-threaded test program, where the whole test runs in a single execution of the program. If you run it twice, it may be equivalent to re-running it on different hardware!
Daniel Earwicker
+3  A: 

The main issue is the use of retries in a random scenario to ensure you get unique values. This quickly gets out of control, especially if the number of items requested is close to the total number of items available; increasing the number of keys makes the issue show up less often, but it can be avoided entirely.

The following method does it by keeping a list of the keys remaining.

List<string> GetSomeKeys(string[] cachedKeys, int requestedNumberToGet)
{
    int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
    List<string> keysRemaining = new List<string>(cachedKeys);
    List<string> keysToReturn = new List<string>(numberPossibleToGet);
    Random rand = new Random();
    for (int i = 0; i < numberPossibleToGet; i++)
    {
        int randomIndex = rand.Next(keysRemaining.Count);
        keysToReturn.Add(keysRemaining[randomIndex]);
        keysRemaining.RemoveAt(randomIndex);
    }
    return keysToReturn;
}

The timeout was necessary in your version because you could potentially keep retrying for a long time, especially when you wanted to retrieve the whole list, in which case the retry-based version would almost certainly fail.

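For intuition: drawing with replacement from n keys takes about n·(1 + 1/2 + … + 1/n) ≈ n ln n draws on average before every key has been seen (the coupon-collector bound). With the question's ~40 keys that is roughly 40 × 4.28 ≈ 171 draws, most of them burned re-drawing keys you already have once only a few remain.
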
Update: The above performs better than these variations:

List<string> GetSomeKeysSwapping(string[] cachedKeys, int requestedNumberToGet)
{
    int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
    List<string> keys = new List<string>(cachedKeys);
    List<string> keysToReturn = new List<string>(numberPossibleToGet);
    Random rand = new Random();
    for (int i = 0; i < numberPossibleToGet; i++)
    {
        int index = rand.Next(keys.Count - i) + i; // pick from the whole unshuffled remainder
        keysToReturn.Add(keys[index]);
        keys[index] = keys[i];
    }
    return keysToReturn;
}
List<string> GetSomeKeysEnumerable(string[] cachedKeys, int requestedNumberToGet)
{
    Random rand = new Random();
    // List<T>'s constructor drains the IEnumerable<T>; .ToList() is
    // LINQ-only and unavailable on the question's .NET 2.0 target.
    return new List<string>(TakeRandom(cachedKeys, requestedNumberToGet, rand));
}

Some numbers with 10,000 iterations:

Function Name             Elapsed Inclusive Time    Number of Calls
GetSomeKeys                6,190.66                  10,000
GetSomeKeysEnumerable     15,617.04                  10,000
GetSomeKeysSwapping        8,293.64                  10,000
eglasius
Wow - I'm surprised it's the fastest! I guess List<>.RemoveAt() is more performant than I thought. +1 for thoroughness!
teedyay
@teedyay To be honest, I was too. The important change is removing the retries; from there on it's minor differences. None of the swapping implementations need Lists, so I wonder if simple arrays would make a difference... that said, I find the remaining-keys version easier to understand :)
eglasius
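For reference, an array-based working copy might look like this (a sketch of the variation being wondered about, mirroring GetSomeKeysSwapping above):

List<string> GetSomeKeysSwappingArray(string[] cachedKeys, int requestedNumberToGet)
{
    int numberPossibleToGet = Math.Min(cachedKeys.Length, requestedNumberToGet);
    string[] keys = (string[])cachedKeys.Clone(); // plain array instead of List<string>
    List<string> keysToReturn = new List<string>(numberPossibleToGet);
    Random rand = new Random();
    for (int i = 0; i < numberPossibleToGet; i++)
    {
        int index = rand.Next(keys.Length - i) + i;
        keysToReturn.Add(keys[index]);
        keys[index] = keys[i];
    }
    return keysToReturn;
}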
I swapped the keysToReturn to an array to test the performance difference and couldn't see any noticeable change in the stats.
TreeUK
Ok, it was just a guess; good to know :) Just make sure you are doing this out of curiosity/understanding - we don't want to be caught over-optimizing :) (unless there is a need for it and you have identified this as the slow piece). Any of the no-retries versions is good enough for most uses.
eglasius