views: 52
answers: 7

I want to put millions of records into memory. The records have fields that cannot be determined at compile-time. The fields may have different types, some doubles, some ints, some strings, etc. Since I have to store so many of them, I want the in-memory representation of these records to be as efficient as possible.

In C++, I'd make each record a fixed-size buffer that holds all the data, and determine where in the buffer to read from to get the data back out. In C#, I cannot do that (can I?).

What's the way to go about this? Build structs at runtime, using ILGenerator? Managed C++? Use an array of byte[]?

A: 

Using List<List<object>> would make your task a lot simpler unless I am completely missing something. You can choose a better collection type based on the kind of manipulation you need to do.
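For illustration, a minimal sketch of that approach (the field values here are made up):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Each inner list is one record; element types can vary per column.
        var records = new List<List<object>>
        {
            new List<object> { 42, 3.14, "hello" },
            new List<object> { 7, 2.71, "world" }
        };

        // Reading a field back requires a cast (value types were boxed going in).
        int id = (int)records[0][0];
        string name = (string)records[0][2];
        Console.WriteLine(id + " " + name);
    }
}
```

Every int and double is boxed into its own heap object here, which is the overhead debated in the comments.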

ChaosPandion
Allocating millions of Lists and then putting boxed doubles and ints into each of them results in a lot of wasted space, because of all the pointers, and all the individual allocations. Even if I had the space to do it, it would likely make the garbage collector freak out.
Marvin
@Marvin - The garbage collector is very efficient. I have built massive collections in memory with zero issues. Don't take my word for it, just try it.
ChaosPandion
A: 

Could you just use generics? I would think that's reasonably efficient. As in:

class Record
{
    UntypedSpecialField f;
}

...

abstract class UntypedSpecialField { }
class SpecialField<T> : UntypedSpecialField { }

...
List<Record> database;

This has the advantage that if there are code commonalities related to the field in question, they can go in UntypedSpecialField and you can get a well-structured OO system.
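A slightly fleshed-out sketch of that idea; the `Value` field, the `Describe` method, and the list of fields are my additions for illustration, not part of the original answer:

```csharp
using System;
using System.Collections.Generic;

abstract class UntypedSpecialField
{
    // Column-specific behaviour shared by all field types can live here.
    public abstract string Describe();
}

class SpecialField<T> : UntypedSpecialField
{
    public T Value;
    public SpecialField(T value) { Value = value; }
    public override string Describe() => typeof(T).Name + ": " + Value;
}

class Record
{
    public List<UntypedSpecialField> Fields = new List<UntypedSpecialField>();
}

class Program
{
    static void Main()
    {
        var r = new Record();
        r.Fields.Add(new SpecialField<int>(42));
        r.Fields.Add(new SpecialField<string>("hello"));
        Console.WriteLine(r.Fields[0].Describe()); // Int32: 42
    }
}
```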

Reinderien
I think that would give me similar memory usage as using a List<object>. Each Record would still be a collection of pointers to the actual values.
Marvin
I think you're right that the memory usage would be the same. Surely, though, there is some meaning to each of your columns, even if the types vary? And based on that meaning you might want column-specific behaviour, something that a generic class could help with.
Reinderien
A: 

I agree with ChaosPandion: List<List<object>> would probably be the simplest, and yet efficient, construct to use in .NET. A single List<> can hold about 2 billion records (int.MaxValue), or 2GB (the max data size for a single object in 32-bit .NET). Since the records are themselves references to other objects, each element is a 4- (or 8-) byte IntPtr, making the outer List's maximum size about 500 million elements (it uses an array under the hood). The inner List is a cakewalk for most records unless one of the fields is BLOB-ish.

A better answer would be easier to give if you told us how this data was being brought into the system. Is it from a file? A database? A data stream of an external peripheral like a weather station?

Seriously, I would just let the managed runtime do its job. Even arrays in .NET have some overhead before index zero, unlike unmanaged C/C++ where an array is just a block o' memory where &array[0] = &array. You'll pretty much have to be in a 64-bit architecture to handle the type of memory you're talking about (in 32-bit architectures, the entire process's memory usage must be less than 2GB, including the CLR, assemblies, call stack, object overhead and actual data), and when you're in 64-bit, memory is limited by RAM and page file space; the ability of the system to address memory is several orders of magnitude greater than current hardware. You should have plenty of space even with the added overhead of managed objects.
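The per-object overhead is easy to observe empirically; a rough sketch (the exact numbers vary by runtime and architecture):

```csharp
using System;

class Program
{
    static void Main()
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);
        var boxes = new object[1000000];
        for (int i = 0; i < boxes.Length; i++)
            boxes[i] = (double)i; // each boxed double is its own heap object
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(boxes);
        // Expect well over the 8 MB of raw doubles: an object header and
        // method-table pointer per box, plus the reference array itself.
        Console.WriteLine((after - before) / (1024 * 1024) + " MB");
    }
}
```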

KeithS
Tempting as that is, consider the case of Weka (http://www.cs.waikato.ac.nz/ml/weka/). I believe they did exactly what you suggested, and the result is that keeping a 300MB text file's worth of data in memory takes 7GB. If I want to run on 600MB source files, I need to buy bigger machines. Admittedly, that's Java, and .net might be better about that.
Marvin
A List<T> cannot "hold about 2 billion records." A List<T> can hold about (2 billion)/sizeof(T) items. When T is a reference type (an object), sizeof(T) is the size of a reference: 4 bytes in 32-bit, and 8 bytes in 64-bit. So the maximum number of references you can store in a List<T> is around 536 million in the 32-bit runtime, or 268 million in the 64-bit runtime. Of course, in 32-bit you wouldn't have space for the items that are pointed to because each object requires a minimum of 16 bytes.
Jim Mischel
@Jim: A List can index as many records as int.MaxValue. However, you are right. I did mention the 2GB size limit, so I should have said 2 billion elements or 2GB, *whichever is less*. In 64-bit architectures, pointers can address exabytes of data, making Int32.MaxValue (still the largest possible index of a list) the limiting factor by far in a List's capacity.
KeithS
+1  A: 

You can generate dynamic types by emitting IL. A pretty nice article about this technique is on CodeProject: http://www.codeproject.com/KB/cs/Creating_Dynamic_Types2.aspx
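A minimal sketch of the technique using the modern Reflection.Emit API (the type and field names are made up; real code would drive `DefineField` from your runtime metadata):

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

class Program
{
    static void Main()
    {
        // Build a value type at runtime whose fields are decided dynamically.
        var asm = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName("Dyn"), AssemblyBuilderAccess.Run);
        var mod = asm.DefineDynamicModule("DynModule");
        var tb = mod.DefineType("DynRecord",
            TypeAttributes.Public | TypeAttributes.SequentialLayout,
            typeof(ValueType));
        tb.DefineField("Id", typeof(int), FieldAttributes.Public);
        tb.DefineField("Price", typeof(double), FieldAttributes.Public);
        Type t = tb.CreateType();

        // Instances pack their fields inline, like a compile-time struct.
        object rec = Activator.CreateInstance(t);
        t.GetField("Id").SetValue(rec, 42);
        Console.WriteLine(t.GetField("Id").GetValue(rec)); // 42
    }
}
```

Accessing fields via reflection is slow; in practice you would also emit getter/setter IL or use typed delegates.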

TcKs
+1  A: 

I'm not sure whether the limitations make it unusable, but you can use a fixed-size buffer in C# (unsafe code). See the MSDN docs.
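A minimal sketch of that feature (requires compiling with /unsafe; the buffer size and field layout are my own choices for illustration):

```csharp
using System;

// Fixed-size buffers require an unsafe context and the /unsafe compiler flag.
unsafe struct Record
{
    public fixed byte Data[16]; // raw storage; interpretation is up to the caller
}

class Program
{
    static unsafe void Main()
    {
        Record r = default;
        // Store a double at offset 0 and an int at offset 8, C-style.
        *(double*)r.Data = 3.5;
        *(int*)(r.Data + 8) = 42;
        Console.WriteLine(*(int*)(r.Data + 8)); // 42
    }
}
```

Since `Record` is a struct, an array of them is a single contiguous allocation, which is about as close to the C++ fixed-size-buffer approach as C# gets.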

Richard Hein
+1  A: 

This sounds like something you'd use a C/C++ union for. That is (if I remember my C):

union Thing
{
  int iThing;
  uint uThing;
  char * stringThing;
  double doubleThing;
};

That takes up as much memory as the largest type defined in it. So here I guess it'd be 8 bytes (for the double).

If you know what the type of the thing is, you can access the corresponding field:

Thing myThing = GetThing();
int i = myThing.iThing;  // if you know it's an int

How you know its type is up to you.

Anyway, as you probably know by now, there's no such thing as a union in C#, but you can simulate one very effectively with the StructLayout attribute found in System.Runtime.InteropServices:

[StructLayout(LayoutKind.Explicit)]
struct Thing
{
    [FieldOffset(0)]
    int iThing;
    [FieldOffset(0)]
    uint uThing;
    [FieldOffset(0)]
    double doubleThing;
    // The CLR refuses to load a type whose object reference overlaps value
    // fields, so the string must live at its own, non-overlapping offset.
    [FieldOffset(8)]
    string stringThing;
}

You can create an array or a List of those, no problem. Of course, this is a value type, so you have to keep in mind the value type semantics. Also note that the reference field cannot share an offset with the value fields (the runtime throws a TypeLoadException for an overlapped object reference), so the struct is 12 or 16 bytes: 8 for the overlapped value fields plus a 4- (or, in 64-bit, 8-) byte reference to a string, which is stored on the heap. That is, the cost of a string is 4 bytes (8 in 64-bit) plus the storage for the string itself.

There are more efficient ways to store strings, by the way. How you store them depends on whether you'll want to modify them and how quickly you need to access them, but you can easily save very close to 50% of the space required by .NET to store strings in English and most western European languages.
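As an illustration of the space claim: .NET strings are UTF-16 (two bytes per char), so storing mostly-ASCII text as UTF-8 byte arrays roughly halves the character data, at the cost of decoding on access:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        string s = "hello world";
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        // 11 bytes as UTF-8 vs 22 bytes of character data as UTF-16.
        Console.WriteLine(utf8.Length + " vs " + s.Length * 2);
        string back = Encoding.UTF8.GetString(utf8);
        Console.WriteLine(back == s); // True
    }
}
```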

Jim Mischel
A: 

Are you sure you really need to put all of those millions of records into memory? I would think twice about whether it is truly necessary... perhaps you only need part of each record, or only a subset of the records...

Sorry, that is not an answer to the question asked... But perhaps you should look not at memory-usage optimization (which might gain you 10-30%), but at your algorithm/architecture, which could double your efficiency... ?

Hope that helps.

Thanks.

Budda