I'm trying to get a handle on the amount of memory overhead associated with a .NET DataTable, and with individual DataRows within a table.
In other words, how much more memory does a data table occupy than what would be needed simply to store a properly typed array of each column of data?
I guess there will be some basic table overhead, plus some amount per column, and then again an additional amount per row.

So can anyone give an estimate (and, I guess, explanation!) of each/any of these three kinds of overhead?

+9  A: 

Well, don't forget that a DataTable stores up to three versions of each row's data - Original, Current and Proposed. It also has a lot of references since it is cell-based, plus change-tracking overhead per row. It would be hard to quantify the exact memory...
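For what it's worth, those extra versions can be observed directly via the DataRowVersion indexer; a minimal sketch:

    using System;
    using System.Data;

    class VersionDemo
    {
        static void Main()
        {
            var table = new DataTable("demo");
            table.Columns.Add("Name", typeof(string));
            DataRow row = table.Rows.Add("original");
            table.AcceptChanges();   // snapshot becomes the Original version

            row["Name"] = "updated"; // Current now differs from Original
            Console.WriteLine(row["Name", DataRowVersion.Original]); // original
            Console.WriteLine(row["Name", DataRowVersion.Current]);  // updated
        }
    }

Both values are held in memory until AcceptChanges (or RejectChanges) is called again, which is part of the per-row cost.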

Personally, I very rarely use DataTable - typed POCO classes are a much more sensible bet in my view. I wouldn't use an array (directly), though - List<T> or BindingList<T> or similar would be far more common.

As a crude measure, you could create a lot of tables etc and look at the memory usage; for example, the following shows a ~4.3 factor - i.e. more than 4 times as expensive, but obviously that depends a lot on the number of columns vs rows vs tables etc:

    // takes **roughly** 112Mb  (taskman)
    List<DataTable> tables = new List<DataTable>();
    for (int j = 0; j < 5000; j++)
    {
        DataTable table = new DataTable("foo");
        for (int i = 0; i < 10; i++)
        {
            table.Columns.Add("Col " + i, i % 2 == 0 ? typeof(int)
                                : typeof(string));
        }
        for (int i = 0; i < 100; i++)
        {
            table.Rows.Add(i, "a", i, "b", i, "c", i, "d", i, "e");
        }
        tables.Add(table);
    }
    Console.WriteLine("done");
    Console.ReadLine();

vs

    // takes **roughly** 26Mb (taskman)
    List<List<Foo>> lists = new List<List<Foo>>(5000);
    for (int j = 0; j < 5000; j++)
    {
        List<Foo> list = new List<Foo>(100);
        for (int i = 0; i < 100; i++)
        {
            Foo foo = new Foo { Prop1 = "a", Prop3 = "b",
                 Prop5 = "c", Prop7 = "d", Prop9 = "e"};
            foo.Prop0 = foo.Prop2 = foo.Prop4 = foo.Prop6 = foo.Prop8 = i;
            list.Add(foo);
        }
        lists.Add(list);
    }
    Console.WriteLine("done");
    Console.ReadLine();

(based on the following class)

class Foo
{
    public int Prop0 { get; set; }
    public string Prop1 { get; set; }
    public int Prop2 { get; set; }
    public string Prop3 { get; set; }
    public int Prop4 { get; set; }
    public string Prop5 { get; set; }
    public int Prop6 { get; set; }
    public string Prop7 { get; set; }
    public int Prop8 { get; set; }
    public string Prop9 { get; set; }
}
Marc Gravell
@Marc - No harm to include a mention that AcceptChanges and friends can be used to manipulate the stock of versions being stored. @Nick: Bottom line is that if you want light, DataX is not the place to look, and you don't even need to measure to get there. Why not write some tests to measure?
Ruben Bartelink
@Ruben - Working on it ;-p
Marc Gravell
@Marc: DataTable doesn't box values; it stores value types in typed arrays.
Pop Catalin
... Also, I did some profiling a long time ago to test POCO vs DataTable: a DataTable is faster to fill and uses less memory than a collection of objects, for a reasonable number of rows (10K plus). Also, I'm not sure of this, but I think a DataTable interns a lot of strings (I think the shorter ones).
Pop Catalin
Re the "faster to fill", that would be interesting to quantify... for raw data, the simplest option (POCO) should be faster. Any data on this?
Marc Gravell
I removed the boxing point - good catch.
Marc Gravell
I'm preparing some tests, I'll get back to you with my findings :)
Pop Catalin
@Marc - you're clearly not able to leave well enough alone - I was prodding Nick in the direction of doing the tests! :P
Ruben Bartelink
Yes, thanks for the prod. I would if I had the time, but that's short right now, and for this application I really need the DataTable facilities. But it's something to follow up. Also, I realised that any indexing implies binary trees, with additional overhead!
Nick
+5  A: 

Overhead is pretty low if you don't define indexes on columns. You can also get a pretty low memory footprint if you use string caching: use a HashSet or Dictionary to keep just one string instance per distinct string value. This sounds weird, but if you fetch data from a database and you have multiple rows with the same string value (e.g. "ALFKI"), the string values are equal but the string instances are not: the string is stored multiple times in memory. If you first use a HashSet to filter out duplicate instances, you effectively use the same string instance for a given string value everywhere in your DataTable. This can greatly reduce the memory footprint. Of course, if the string values are already statically defined somewhere (so not read from an outside source), it's not worth the effort.
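A sketch of that caching idea (the Cached helper and table layout here are illustrative, not any particular API; in real code the value would come from a data reader):

    using System;
    using System.Collections.Generic;
    using System.Data;

    class StringCacheDemo
    {
        static readonly Dictionary<string, string> cache =
            new Dictionary<string, string>();

        // Return one canonical instance per distinct string value,
        // so equal strings share a single object instead of one per row.
        static string Cached(string value)
        {
            if (value == null) return null;
            string existing;
            if (cache.TryGetValue(value, out existing)) return existing;
            cache[value] = value;
            return value;
        }

        static void Main()
        {
            var table = new DataTable("customers");
            table.Columns.Add("CustomerId", typeof(string));
            for (int i = 0; i < 1000; i++)
            {
                // Simulate a freshly allocated string per row, as a
                // database reader would produce (not an interned literal).
                string fromReader = new string("ALFKI".ToCharArray());
                table.Rows.Add(Cached(fromReader));
            }

            // All 1000 cells now reference the same string instance:
            Console.WriteLine(ReferenceEquals(table.Rows[0][0],
                                              table.Rows[999][0])); // True
        }
    }

Without the Cached call each row would hold its own copy of the string data; with it, the table holds 1000 references to one instance.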

Frans Bouma
+2  A: 

It depends on how much data and what kind of data you are storing. Obviously the more data, the more memory. There is some overhead associated with the DataTable itself which makes it a little more expensive. You also need to be aware of the Large Object Heap: if you allocate an object over 85 KB, it will be stored on the LOH. This can wreak havoc on your garbage collection, as the LOH is only reclaimed during a full (gen 2) collection. If you are up for testing it, look into a memory profiler to watch the memory footprint of the DataTable.
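As a quick sanity check on the LOH threshold (a sketch; the ~85,000-byte cutoff and the generation reported for fresh LOH allocations are runtime implementation details that can vary):

    using System;

    class LohDemo
    {
        static void Main()
        {
            byte[] small = new byte[80 * 1024]; // below the threshold: small object heap
            byte[] large = new byte[90 * 1024]; // above it: allocated on the LOH

            // On the desktop CLR, LOH objects typically report generation 2
            // immediately, while small fresh objects start in generation 0.
            Console.WriteLine(GC.GetGeneration(small));
            Console.WriteLine(GC.GetGeneration(large));
        }
    }
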

Eric Brown
Good tip with the strings; +1 for that alone ;-p
Marc Gravell