Recently I had to do some very processing-heavy work with data stored in a DataSet. It was heavy enough that I ended up using a tool to help identify bottlenecks in my code. While analyzing them, I noticed that although DataSet lookups were not terribly slow (they weren't the bottleneck), they were slower than I expected. I had always assumed that DataSets used some sort of Hashtable-style implementation, which would make lookups O(1) (or at least that's what I think hash tables give you). My lookups seemed to be significantly slower than that.

I was wondering if anyone who knows anything about the implementation of .NET's DataSet class would care to share what they know.

If I do something like this:

    DataTable dt = new DataTable();
    if (dt.Columns.Contains("SomeColumn")) {
        object o = dt.Rows[0]["SomeColumn"];
    }

How fast would the lookup be for the Contains(...) method, and for retrieving the value to store in object o? I would have thought it would be very fast, like a Hashtable (assuming what I understand about hash tables is correct), but it doesn't seem like it...

I wrote that code from memory so some things may not be "syntactically correct".


I imagine that any lookups would be O(n), as I don't think they would use any type of hashtable, but would actually use more of an array for finding rows and columns.

That would be O(n^2), since you are doing a string comparison over each item.
David L Morris

Actually, I believe the column names are stored in a Hashtable. That should give O(1), or constant-time, lookup for case-sensitive lookups. If it had to look through each column, then of course it would be O(n).
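
To illustrate the case-sensitivity point, here is a minimal sketch. It relies on the documented behaviour that DataColumnCollection name lookups are only conditionally case-sensitive: a name in a different case still resolves unless two columns differ only by case. The table name "Demo", the column name and the helper BuildTable are made up for the example, and the sketch shows behaviour only, not how fast each path is.

```csharp
using System;
using System.Data;

public static class ColumnLookupDemo
{
    // Builds a one-column, one-row table. "Demo" and "SomeColumn" are illustrative names.
    public static DataTable BuildTable()
    {
        var dt = new DataTable("Demo");
        dt.Columns.Add("SomeColumn", typeof(string));
        dt.Rows.Add("hello");
        return dt;
    }

    public static void Main()
    {
        DataTable dt = BuildTable();

        // Name given in the same case as the column was declared.
        Console.WriteLine(dt.Columns.Contains("SomeColumn"));   // True

        // A different case still resolves, because no other column
        // differs from "SomeColumn" only by case.
        Console.WriteLine(dt.Columns.Contains("somecolumn"));   // True
        Console.WriteLine(dt.Rows[0]["SOMECOLUMN"]);            // hello
    }
}
```

If the exact-case lookup really is the hashed one, as suggested above, then spelling column names exactly as declared is the cheap way to stay on that path.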

+2  A: 

Actually, it's advisable to use an integer index when referencing a column, which can improve performance considerably. To keep things manageable, you could declare a constant integer for each column. So instead of what you did, you could do

    const int SomeTable_SomeColumn = 0;

    DataTable dt = new DataTable();
    object o = dt.Rows[0][SomeTable_SomeColumn];
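
A variation on the same idea, in case hard-coded constants feel brittle when the column order changes: resolve the column by name once, then reuse its Ordinal inside the loop. This is only a sketch; SumColumn, the table name and the sample data are hypothetical and chosen for illustration.

```csharp
using System;
using System.Data;

public static class OrdinalLookupDemo
{
    // Sums an int column, resolving the column name once instead of once per row.
    public static int SumColumn(DataTable dt, string columnName)
    {
        DataColumn column = dt.Columns[columnName];   // single name lookup
        if (column == null)
            throw new ArgumentException("No such column: " + columnName);

        int ordinal = column.Ordinal;                 // zero-based position of the column
        int total = 0;
        foreach (DataRow row in dt.Rows)
        {
            total += (int)row[ordinal];               // integer indexer, no name lookup
        }
        return total;
    }

    public static void Main()
    {
        var dt = new DataTable("Demo");
        dt.Columns.Add("SomeColumn", typeof(int));
        dt.Rows.Add(1);
        dt.Rows.Add(2);
        dt.Rows.Add(3);

        Console.WriteLine(SumColumn(dt, "SomeColumn"));   // 6
    }
}
```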
+1  A: 

Via Reflector the steps for DataRow["ColumnName"] are:

  1. Get the DataColumn from ColumnName. Uses the row's DataColumnCollection["ColumnName"]. Internally, DataColumnCollection stores its DataColumns in a Hashtable. O(1)
  2. Get the DataRow's row index. The index is stored in an internal member. O(1)
  3. Get the DataColumn's value at the index using DataColumn[index]. DataColumn stores its data in a System.Data.Common.DataStorage (internal, abstract) member:

    return dataColumnInstance._storage.Get(recordIndex);

    A sample concrete implementation is System.Data.Common.StringStorage (internal, sealed). StringStorage (and the other concrete DataStorages I checked) store their values in an array. Get(recordIndex) simply grabs the object in the value array at the recordIndex. O(1)

So overall you're O(1), but that doesn't mean the hashing and function calls during the operation are free. It just means the cost doesn't grow as the number of DataRows or DataColumns increases.
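
If you want to see how much that constant cost matters on your own machine, a rough timing sketch like the one below compares the three DataRow indexers (by name, by ordinal, by DataColumn). The row count, the labels and the helpers BuildTable and Time are arbitrary choices for the example, and a single Stopwatch pass is only a crude measurement, so treat the numbers as indicative at best.

```csharp
using System;
using System.Data;
using System.Diagnostics;

public static class IndexerTimingDemo
{
    // Builds a single-column table holding the values 0..rowCount-1.
    public static DataTable BuildTable(int rowCount)
    {
        var dt = new DataTable("Demo");
        dt.Columns.Add("SomeColumn", typeof(int));
        for (int i = 0; i < rowCount; i++)
            dt.Rows.Add(i);
        return dt;
    }

    // Reads every row through the supplied accessor, prints the elapsed time,
    // and returns a checksum so the reads cannot be skipped.
    public static long Time(string label, DataTable dt, Func<DataRow, object> read)
    {
        long checksum = 0;
        Stopwatch sw = Stopwatch.StartNew();
        foreach (DataRow row in dt.Rows)
            checksum += (int)read(row);
        sw.Stop();
        Console.WriteLine("{0}: {1} ms (checksum {2})", label, sw.ElapsedMilliseconds, checksum);
        return checksum;
    }

    public static void Main()
    {
        DataTable dt = BuildTable(200000);
        DataColumn column = dt.Columns["SomeColumn"];
        int ordinal = column.Ordinal;

        Time("by name", dt, row => row["SomeColumn"]);
        Time("by ordinal", dt, row => row[ordinal]);
        Time("by DataColumn", dt, row => row[column]);
    }
}
```

All three accessors should report the same checksum; only the elapsed times are expected to differ.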

Interesting that DataStorage uses an array for values. Can't imagine that's easy to rebuild when you add or remove rows.

Corbin March