I have an application that holds a huge number of instances in memory for performance reasons. I don't want to write them to disk or anywhere else; I want to hold everything in memory.

public class MyObject
{
    public string Name;
    public object Tag;
    public DateTime DateTime1;
    public DateTime DateTime2;
    public DateTime DateTime3;
    public long Num1;
    public uint Num2;
    public uint Num3;
    public ushort Num4;
}

In many cases I don't actually use all the fields, or I don't take advantage of a field's whole size. So I thought about turning this class into an interface with properties and writing many implementing classes that store the data in different ways: some use smaller fields (for example int instead of long) and some omit unused fields.

example:

public interface IMyObject
{
    string Name { get; set; }
    object Tag { get; set; }
    DateTime DateTime1 { get; set; }
    DateTime DateTime2 { get; set; }
    DateTime DateTime3 { get; set; }
    long Num1 { get; set; }
    uint Num2 { get; set; }
    uint Num3 { get; set; }
    ushort Num4 { get; set; }
}

public class MyObject1 : IMyObject
{
    public string Name { get; set; }
    public object Tag { get; set; }
    public DateTime DateTime1 { get; set; }
    public DateTime DateTime2 { get; set; }
    public DateTime DateTime3 { get; set; }
    public long Num1 { get; set; }
    public uint Num2 { get; set; }
    public uint Num3 { get; set; }
    public ushort Num4 { get; set; }
}

public class MyObject2 : IMyObject
{
    private int _num1;

    public string Name { get; set; }
    public object Tag { get; set; }
    public DateTime DateTime1 { get; set; }
    public DateTime DateTime2 { get; set; }
    public DateTime DateTime3 { get; set; }
    public long Num1
    {
        get { return _num1; }
        set { _num1 = (int)value; }
    }
    public uint Num2 { get; set; }
    public uint Num3 { get; set; }
    public ushort Num4 { get; set; }
}

public class MyObject3 : IMyObject
{
    public string Name { get; set; }
    public object Tag { get; set; }
    public DateTime DateTime1
    {
        get { return DateTime.MinValue; }
        set { throw new NotSupportedException(); }
    } 
    public DateTime DateTime2 { get; set; }
    public DateTime DateTime3 { get; set; }
    public long Num1 { get; set; }
    public uint Num2 { get; set; }
    public uint Num3 { get; set; }
    public ushort Num4 { get; set; }
}

// ...

In theory this approach reduces the memory footprint. In practice, as you can see, it leads to a Cartesian product of all the combinations of smaller and omitted fields: big, ugly code that can't be maintained once it is written.

Another thought about the strings:

All strings in a .NET application are represented in UTF-16. If I could store them in UTF-8 instead, it would cut the memory used by the strings roughly in half (assuming mostly ASCII text).

A: 

What about using System.Tuple? You could dynamically specify which fields you want to use.

edit:
I'd definitely look into string interning.

Also, there is System.Dynamic.ExpandoObject

Chris Bednarski
I need a standard interface to access all the fields. Can Tuple provide that?
DxCK
What do you mean by a standard interface?
Chris Bednarski
See the interface IMyObject in the question. I want to have an array, IMyObject[].
DxCK
No, it will not work with Tuple. Also, Tuples are read-only, once created.
Chris Bednarski
Read-only is good enough, but the problem is that I need to put them all into an array; I don't want to test types and cast on each access.
DxCK
Thanks. I looked into System.Dynamic.ExpandoObject. It seems to be a good pattern, but it uses more memory and reduces performance because it stores everything as an Object: every value is boxed, a pointer is stored for each value, and GC time increases for collecting all those boxed values.
DxCK
+1  A: 

From looking at your profile I am going to take a punt and guess that the "Name" property is in fact a file path. If space is more important than time, then you could use an encoding scheme to represent the path, where there is likely to be a lot of repeating data.

Represent each file path as a Path, which is an array of ints, plus a FileName, which is a string holding the actual file name (file names are less likely to repeat, so they are not worth encoding). Split the path into its constituent parts and use a pair of dictionaries for forward and reverse lookups. That reduces each path to an array of ints, which can be much smaller than the string when the same segments repeat across many paths.
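A minimal sketch of that encoding, assuming Windows-style paths split on a backslash. The class name PathEncoder and its members are hypothetical, not something from the question:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps each distinct path segment to an int ID and back.
public sealed class PathEncoder
{
    // Forward lookup: segment text -> ID.
    private readonly Dictionary<string, int> _idBySegment = new Dictionary<string, int>();
    // Reverse lookup: ID -> segment text (the ID is the list index).
    private readonly List<string> _segmentById = new List<string>();

    // Splits a directory path on '\' and returns one ID per segment.
    public int[] Encode(string directory)
    {
        string[] segments = directory.Split('\\');
        int[] ids = new int[segments.Length];
        for (int i = 0; i < segments.Length; i++)
        {
            int id;
            if (!_idBySegment.TryGetValue(segments[i], out id))
            {
                id = _segmentById.Count;          // next free ID
                _segmentById.Add(segments[i]);
                _idBySegment.Add(segments[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    // Rebuilds the directory string from its segment IDs.
    public string Decode(int[] ids)
    {
        string[] segments = new string[ids.Length];
        for (int i = 0; i < ids.Length; i++)
        {
            segments[i] = _segmentById[ids[i]];
        }
        return string.Join("\\", segments);
    }
}
```

Each int[] is itself a heap object with its own header, so the saving depends on how long and how repetitive the paths are; it is worth measuring on real data.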

chibacity
+1  A: 

Storing strings in UTF8:

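// Encode once and keep only the bytes in memory.
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes("asdf");

// Decode on demand; this allocates a new string on every call.
string text = System.Text.Encoding.UTF8.GetString(utf8Bytes);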

(edit: thought op wanted ASCII at first)
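One way to use this behind a property is a small wrapper that holds the bytes and decodes on access. This is only a sketch; the type name Utf8Text is hypothetical:

```csharp
using System;
using System.Text;

// Hypothetical wrapper: keeps a string as UTF-8 bytes and decodes it on demand.
public struct Utf8Text
{
    private readonly byte[] _bytes;

    public Utf8Text(string value)
    {
        _bytes = value == null ? null : Encoding.UTF8.GetBytes(value);
    }

    // Payload bytes held in memory (the array's own header is not counted).
    public int ByteCount
    {
        get { return _bytes == null ? 0 : _bytes.Length; }
    }

    // Allocates a new string on every call, so avoid it in hot loops.
    public override string ToString()
    {
        return _bytes == null ? string.Empty : Encoding.UTF8.GetString(_bytes);
    }
}
```

For "asdf" the payload is 4 bytes instead of the 8 that UTF-16 needs. The trade-off is CPU time and a temporary allocation each time the text is read.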

Idea 1: If you expect that most values won't be filled in most of the time, you could store each field in a separate key-value lookup structure of some sort -- a dictionary, an ordered list with binary search, a binary tree, etc. An ordered list with binary search would probably be the most space-efficient, though lookup would be O(log n).

So instead of MyObject[] objects, you would have

Dictionary<int, string> names; // or List<Tuple<int,string>> names;
Dictionary<int, object> tags;
Dictionary<int, DateTime> datetime1s;
...

where the int key of each entry is the ID of the object it belongs to.
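A rough sketch of the ordered-list variant, using two parallel lists and List<T>.BinarySearch. The class name SparseColumn is hypothetical:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sparse column: stores a value only for the IDs that have one.
public sealed class SparseColumn<T>
{
    private readonly List<int> _ids = new List<int>();   // kept sorted
    private readonly List<T> _values = new List<T>();    // parallel to _ids

    public void Set(int id, T value)
    {
        int index = _ids.BinarySearch(id);
        if (index >= 0)
        {
            _values[index] = value;        // overwrite the existing entry
        }
        else
        {
            _ids.Insert(~index, id);       // ~index is the sorted insertion point
            _values.Insert(~index, value);
        }
    }

    // O(log n) lookup; returns false when the ID has no value in this column.
    public bool TryGet(int id, out T value)
    {
        int index = _ids.BinarySearch(id);
        if (index >= 0)
        {
            value = _values[index];
            return true;
        }
        value = default(T);
        return false;
    }
}
```

Inserting in random ID order costs O(n) per insert because of the shifting, so this fits best when the data is loaded in ID order or loaded once and then only read.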

Idea 2: If you're confident that those DateTimes fall within a limited range of, say, January 1 2010 (a signed 32-bit count of seconds covers about 68 years in either direction), you could convert each one to a 32-bit int holding the number of seconds since or before that date. That saves 4 bytes per DateTime.
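A small sketch of that conversion. The CompactTime class and its method names are hypothetical, and the epoch is simply the example date mentioned above:

```csharp
using System;

// Hypothetical helper: stores a DateTime as whole seconds relative to a fixed epoch.
public static class CompactTime
{
    // Assumed epoch, taken from the example date in the text.
    public static readonly DateTime Epoch =
        new DateTime(2010, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Sub-second precision is discarded. The checked cast makes
    // out-of-range dates throw OverflowException instead of wrapping.
    public static int ToSeconds(DateTime value)
    {
        return checked((int)((value - Epoch).Ticks / TimeSpan.TicksPerSecond));
    }

    public static DateTime FromSeconds(int seconds)
    {
        return Epoch.AddSeconds(seconds);
    }
}
```

The DateTimeKind of the original value is not preserved, so this assumes all stored values use the same kind.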

Idea 3: You might consider making a really space-efficient serialization scheme, where the first byte of each field specifies which field in the class the subsequent bytes hold. String values might just be delimited with a \n or something. Store this whole thing in a byte array, and deserialize it on demand.

So something like this, without the whitespace, and in binary values where appropriate:

1 //indicates field 1 (Name)

beck.asf\n //the value

6 //indicates field 6 (Num1)

3545623 //the value, in a 64-bit binary int

If Tag refers to a live object, you might need to keep it in a separate wrapper struct outside the serialization. Or, as in the first idea, you could store just an int identifying the tag, and keep a List<object> outside that holds the actual references to the tags.
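A sketch of such a tagged format for two of the fields, using the field numbers 1 and 6 from the example. It swaps the \n delimiter for the length-prefixed strings that BinaryWriter produces, and the CompactRecord name and its members are hypothetical:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical record format: [field id byte][payload], repeated.
// Fields that are absent are simply not written.
public static class CompactRecord
{
    public const byte NameField = 1;
    public const byte Num1Field = 6;

    public static byte[] Write(string name, long? num1)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream, Encoding.UTF8))
        {
            if (name != null)
            {
                writer.Write(NameField);
                writer.Write(name);          // length-prefixed UTF-8 string
            }
            if (num1.HasValue)
            {
                writer.Write(Num1Field);
                writer.Write(num1.Value);    // 64-bit binary integer
            }
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static void Read(byte[] data, out string name, out long? num1)
    {
        name = null;
        num1 = null;
        using (var reader = new BinaryReader(new MemoryStream(data), Encoding.UTF8))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                byte field = reader.ReadByte();
                switch (field)
                {
                    case NameField: name = reader.ReadString(); break;
                    case Num1Field: num1 = reader.ReadInt64(); break;
                    default: throw new InvalidDataException("Unknown field id " + field);
                }
            }
        }
    }
}
```

With this layout the example record (Name "beck.asf", Num1 3545623) takes 19 bytes: 10 for the name and 9 for the number. Every property access pays for a scan of the record, so this trades CPU for space.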

Rei Miyasaka
Concerning the "storing strings as ASCII" -- the OP actually wants to store them as UTF-8 instead of the default UTF-16.
stakx
Whoops, my bad.
Rei Miyasaka
+1  A: 

Thoughts:

  • are any of the strings shared? You could use a custom interner when loading your data, to ensure that none of these are duplicated (note: don't use the inbuilt interner, as you will saturate it; even a Dictionary<string,string> would do)
  • are there any other common elements that are likely to be duplicated and could be moved into shared objects? You still have the cost of the reference field, but hopefully this (and the new object) is a net gain
  • if you have vast numbers of similar entities, could any of them be modelled as immutable value-types? This isn't usually my preferred option, but an advantage then is that you can stick them in an array:
    • you save the cost of an object reference and an object header per entity
    • you can use the offset in the array (int) rather than a reference; for 64-bit, this is a decent saving when added up
  • you seem to be suggesting a sparse-object approach, and indeed you want to avoid the Cartesian product; but for the low number of members you describe, a property bag would probably be more expensive in memory; plus, since you mention you are doing this for performance, I suspect it would hurt CPU too
  • the DateTimes - are they (for example) always whole days? You'd be surprised what you can gain by using just the int number of days into an epoch
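The custom interner from the first point can be as small as the following sketch. The class name StringPool is hypothetical:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical application-level interner backed by a Dictionary<string, string>.
public sealed class StringPool
{
    private readonly Dictionary<string, string> _pool = new Dictionary<string, string>();

    // Returns the pooled instance equal to value, adding value if it is new.
    public string Intern(string value)
    {
        if (value == null) return null;

        string existing;
        if (_pool.TryGetValue(value, out existing))
        {
            return existing;       // reuse the instance already held
        }
        _pool.Add(value, value);
        return value;
    }
}
```

Run each string through Intern while loading, for example obj.Name = pool.Intern(name). The pool can be dropped once loading finishes, since the objects themselves keep the shared strings alive.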
Marc Gravell