Originally I wanted to ask for the fastest way to query a DataTable for a specific row.

I have tested five different methods for their performance, with a surprising (for me) result.

Background: I have created a view in a MS SQL Server 2005 database. This view currently has a total of 6318 rows. Because I have to check very often whether a given ID exists in this view, I wondered what the most efficient way to do this is. I created a DataAdapter in a strongly typed DataSet which returns all rows and fills a DataTable.

My first approach was to create a shared generic List(Of Int32), fill it with the IDs from the view once at application start, and then use List.Contains to check whether the current ID is in this list. Because all rows are distinct, I wondered whether it wouldn't be faster to use a SortedList and its ContainsKey method. Then I also checked the performance of direct access to the DataTable with its Select method, its automatically generated FindBy method (generated when a column is defined as primary key) and, last but not least, the DataRowCollection.Contains method. So I have five methods to check whether my ID is in that view (or the mapped List/SortedList).

I measured their performance with System.Diagnostics.Stopwatch and got some interesting results. I thought SortedList.ContainsKey must be faster than List.Contains because the items are distinct and sorted, but the opposite is true. The most surprising thing for me, though, was that the DataRowCollection.Contains method (which I had at first forgotten) is by far the fastest. It is even 50 times faster than the DataTable's FindBy method.

  1. What causes these differences?
  2. Have I overlooked a better way?
  3. Is my measurement method correct (I think I should rather loop them and take those values)?
  4. Are the values transferable, or do they depend on the size of the DataTable/collection?
  5. After my update (1,000,000 iterations), ContainsKey is the fastest. Is this because I always check for the same ID, or is it true in general? Is there some kind of SortedList without the overhead of a Dictionary's key/value pairs?

Results [values in square brackets are for 1,000,000 iterations]

  • Timespan 1 = SortedList.ContainsKey = Ø 0.65634 [238.1095] ms
  • Timespan 2 = List.Contains = Ø 0.06802 [57045.37955] ms
  • Timespan 3 = DataTable.FindByIdData (auto-generated method) = Ø 0.31580 [1542.62345] ms
  • Timespan 4 = DataTable.Select = Ø 0.27790 [26029.39635] ms
  • Timespan 5 = DataRowCollection.Contains = Ø 0.00638 [1202.79735] ms

    1.)
    Timespan 1: 0,6913 ms
    Timespan 2: 0,1053 ms
    Timespan 3: 0,3279 ms
    Timespan 4: 0,1002 ms
    Timespan 5: 0,0056 ms
    
    
    2.)
    Timespan 1: 0,6405 ms
    Timespan 2: 0,0588 ms
    Timespan 3: 0,3112 ms
    Timespan 4: 0,3872 ms
    Timespan 5: 0,0067 ms
    
    
    3.)
    Timespan 1: 0,6502 ms
    Timespan 2: 0,0588 ms
    Timespan 3: 0,3092 ms
    Timespan 4: 0,1268 ms
    Timespan 5: 0,007 ms
    
    
    4.)
    Timespan 1: 0,6504 ms
    Timespan 2: 0,0586 ms
    Timespan 3: 0,3092 ms
    Timespan 4: 0,3893 ms
    Timespan 5: 0,0063 ms
    
    
    5.)
    Timespan 1: 0,6493 ms
    Timespan 2: 0,0586 ms
    Timespan 3: 0,3215 ms
    Timespan 4: 0,386 ms
    Timespan 5: 0,0063 ms
    
    
    Timespan 1: 0,6913 0,6405 0,6502 0,6504 0,6493 = Ø 0,65634
    Timespan 2: 0,1053 0,0588 0,0588 0,0586 0,0586 = Ø 0,06802
    Timespan 3: 0,3279 0,3112 0,3092 0,3092 0,3215 = Ø 0,31580
    Timespan 4: 0,1002 0,3872 0,1268 0,3893 0,3860 = Ø 0,27790
    Timespan 5: 0,0056 0,0067 0,0070 0,0063 0,0063 = Ø 0,00638
    

And, for the sake of completeness, part of the VB.NET source:

Dim applies As Boolean
Dim clock As New System.Diagnostics.Stopwatch

' Method 1: SortedList.ContainsKey
clock.Start()
For i As Int32 = 1 To 1000000
    applies = sortedListAC17NextClaims.ContainsKey(myClaim.idData)
Next
clock.Stop()
Dim timeSpan1 As String = "Timespan 1: " & clock.Elapsed.TotalMilliseconds.ToString & " ms"

' Method 2: List.Contains
clock.Reset()
clock.Start()
For i As Int32 = 1 To 1000000
    applies = listAC17NextClaims.Contains(myClaim.idData)
Next
clock.Stop()
Dim timeSpan2 As String = "Timespan 2: " & clock.Elapsed.TotalMilliseconds.ToString & " ms"

' Method 3: auto-generated FindBy method of the typed DataTable
clock.Reset()
clock.Start()
For i As Int32 = 1 To 1000000
    applies = MyDS.AC17NextClaims.FindByIdData(myClaim.idData) IsNot Nothing
Next
clock.Stop()
Dim timeSpan3 As String = "Timespan 3: " & clock.Elapsed.TotalMilliseconds.ToString & " ms"

' Method 4: DataTable.Select with a filter expression
clock.Reset()
clock.Start()
For i As Int32 = 1 To 1000000
    applies = MyDS.AC17NextClaims.Select("idData=" & myClaim.idData).Length > 0
Next
clock.Stop()
Dim timeSpan4 As String = "Timespan 4: " & clock.Elapsed.TotalMilliseconds.ToString & " ms"

' Method 5: DataRowCollection.Contains (uses the primary key)
clock.Reset()
clock.Start()
For i As Int32 = 1 To 1000000
    applies = MyDS.AC17NextClaims.Rows.Contains(myClaim.idData)
Next
clock.Stop()
Dim timeSpan5 As String = "Timespan 5: " & clock.Elapsed.TotalMilliseconds.ToString & " ms"
  • UPDATE: I have changed my results and the source above. The values in square brackets are for 1,000,000 iterations. Now the result is completely different: the fastest of these methods is now clearly the SortedList's ContainsKey.

  • UPDATE 2: I forgot the alternative of using List.BinarySearch. It seems to be the fastest for me:

    clock.Start()
    For i As Int32 = 1 To 1000000
        applies = listAC17NextClaims.BinarySearch(myClaim.idData) > -1
    Next
    clock.Stop()
    

    It needs only 219.1805 ms to perform 1,000,000 iterations and hence is the fastest, without the overhead of a SortedList's key/value pairs. I can use it without sorting the list first because the DataAdapter filled the DataTable with an ORDER BY clause.

+4  A: 

It doesn't seem to me that you're providing nearly enough work to get useful timings here. All your times are sub-millisecond and are almost certainly just noise: caching, JITting, pre-emption, etc.

Make your collections big enough that the tests take seconds to run, or at the very least run each test enough times in a tight loop that it takes seconds.

Will Dean
The JITting is already covered; he does 5 runs. However, yes, it's usual practice to do a number of runs and calculate the average time from them.
Adam
He's got times in there below 70 µs. I don't think the useful resolution of this sort of timing technique (or any other timing technique on Windows) goes down anywhere near that far.
Will Dean
It's not clear that those 5 runs are in the same instance of the VM. If he re-executes the code in the question, then it will JIT every time, won't it?
Novelocrat
1,000,000 iterations really make a major difference. Now the winner changes to the ContainsKey method (see my update).
Tim Schmelter
Right - but don't forget that still only tells you that searching a list of 6K items 1000000 times is faster with that method, not that it would be the fastest method to search a list of 6bn items once. You're probably mainly measuring the start-up time of the algorithm rather than its searching performance. This is really why O(f(n)) notation is only part of the story - because the real time might be better approximated with O(f(n)) + K, and if n is small, K might dominate.
Will Dean
A: 

As has been noted, your code only runs each action once. A usual tactic is to run the code a number of times (say, three searches) to get larger numbers (so if three searches take 0.9 seconds, you can say one search takes 0.3). Then loop this a few times so you can calculate an average (including the standard deviation if you like, to filter out wild results), and on top of that, do one untimed warm-up run first so that any JITting has already happened before you measure.

Also, run the code in release mode with no debugger attached.
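That procedure could look roughly like the following sketch. It reuses `listAC17NextClaims` and `myClaim.idData` from the question's code; the run and iteration counts are arbitrary choices, not prescribed values.

```vbnet
' Sketch of the warm-up + averaged-runs approach described above.
Dim clock As New System.Diagnostics.Stopwatch
Dim applies As Boolean

' One untimed warm-up pass so JIT compilation is out of the way.
applies = listAC17NextClaims.Contains(myClaim.idData)

Const runs As Int32 = 5
Const iterations As Int32 = 1000000
Dim totalMs As Double = 0

For run As Int32 = 1 To runs
    clock.Reset()
    clock.Start()
    For i As Int32 = 1 To iterations
        applies = listAC17NextClaims.Contains(myClaim.idData)
    Next
    clock.Stop()
    totalMs += clock.Elapsed.TotalMilliseconds
Next

' Average wall-clock time per timed run of 1,000,000 lookups.
Dim averageMs As Double = totalMs / runs
```

Keeping the warm-up pass outside the timed loops means the JIT cost is paid once and never pollutes the averages.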

Adam
+3  A: 

Why don't you use a collection that has a hash table as the underlying data structure (Dictionary<TKey, TValue> or HashSet<T>)? A hash table provides O(1) lookup time as long as there are few key collisions (and your keys are distinct, as you've stated), and it doesn't need the overhead of sorting.

EDIT: If you only want to store the keys, you should use HashSet<T> which is available in .NET 3.5 and above.

From MSDN on SortedList:

Operations on a SortedList object tend to be slower than operations on a Hashtable object because of the sorting.

To target .NET 2.0, you could roll your own or use a pre-built one like Wintellect's Power Collections (you could easily just use the source as well).
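If rolling your own seems like overkill, a Dictionary can stand in for a HashSet on .NET 2.0 by simply ignoring the values. A minimal sketch, reusing the list and claim names from the question (the value type Boolean is an arbitrary throwaway choice):

```vbnet
' .NET 2.0 has no HashSet(Of T); a Dictionary with throwaway values
' gives the same average O(1) lookup, at the cost of some extra memory.
Dim idSet As New Dictionary(Of Int32, Boolean)

' Fill once at application start from the list of IDs.
For Each id As Int32 In listAC17NextClaims
    idSet(id) = True
Next

' Average-case O(1) membership test:
Dim applies As Boolean = idSet.ContainsKey(myClaim.idData)
```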

TheCloudlessSky
The overhead of sorting occurs only once because the SortedList is shared. One of my questions was whether there is a better collection/dictionary than the SortedList, because of its memory overhead from the key/value pairs. I don't need a key/value pair because there are only IDs in the view. Is a generic list faster for Contains when it is sorted beforehand, or is there another collection type that can search in a non-linear way? In my case the key would be the same as the value, which seems redundant.
Tim Schmelter
@Tim - I'm pretty sure a `SortedList` would use Binary Search to find an element (which is `O(log(n))`). You could instead use a `HashSet<T>` that *only* stores keys (and would still be `O(1)` to search).
TheCloudlessSky
Thanks for the hint, but unfortunately I'm using Framework 2.0, and HashSet is part of 3.5.
Tim Schmelter
I wonder why Array.BinarySearch(myList.ToArray(), myClaim.idData) takes 18404.573 ms for 1,000,000 iterations and is slower than ContainsKey (roughly 1:77), and even slower than DataTable.FindBy and DataRowCollection.Contains.
Tim Schmelter
@Tim - Because you're calling `.ToArray()` each time, which is going to be pretty expensive. Also, see my edit above about using .NET 2.0. In all honesty you could just use a `Dictionary<TKey, TValue>` with the extra memory overhead...
TheCloudlessSky
OMG, I forgot that I don't need to create that array, because a generic list has its own: listAC17NextClaims.BinarySearch(myClaim.idData) > -1 needs only 219.1805 ms to perform 1,000,000 iterations and hence is the fastest. I can use BinarySearch because I sort the rows per SQL with ORDER BY before I add them to the list.
Tim Schmelter
@Tim - But if you implement your *own* `HashSet<T>` or use one from the link I provided, you can get *even faster lookups*, because searching is `O(1)` versus `O(log(n))` for binary search.
TheCloudlessSky