views:

776

answers:

7

I'm running into problems serializing lots of objects in .NET. The object graph is pretty big with some of the new data sets being used, so I'm getting:

System.Runtime.Serialization.SerializationException
"The internal array cannot expand to greater than Int32.MaxValue elements."

Has anyone else hit this limit? How have you solved it?

It would be good if I could still use the built-in serialization mechanism if possible, but it seems like I'll have to roll my own (and maintain backwards compatibility with the existing data files).

The objects are all POCO and are being serialized using BinaryFormatter. Each object being serialized implements ISerializable to selectively serialize its members (some of them are recalculated during loading).

It looks like this is an open issue for MS (details here), but it's been resolved as Won't Fix. The details (from the link) are:

Binary serialization fails for object graphs with more than ~13.2 million objects. The attempt to do so causes an exception in ObjectIDGenerator.Rehash with a misleading error message referencing Int32.MaxValue.

Upon examination of ObjectIDGenerator.cs in the SSCLI source code, it appears that larger object graphs could be handled by adding additional entries into the sizes array. See the following lines:

// Table of prime numbers to use as hash table sizes. Each entry is the
// smallest prime number larger than twice the previous entry.
private static readonly int[] sizes = {5, 11, 29, 47, 97, 197, 397,
797, 1597, 3203, 6421, 12853, 25717, 51437, 102877, 205759, 
411527, 823117, 1646237, 3292489, 6584983};

However, it would be nice if serialization worked for any reasonable size of the object graph.

A: 

I'm guessing... serialize fewer objects at a time?

2 main questions:

  • what objects are they?
    • POCO?
    • DataTable?
  • what type of serialization is it?
    • xml?
      • XmlSerializer?
      • DataContractSerializer?
    • binary?
      • BinaryFormatter?
      • SoapFormatter?
    • other?
      • json?
      • bespoke?

Serialization needs to have some consideration of what the data volume is; for example, some serialization frameworks support streaming of both the objects and the serialized data, rather than relying on a complete object graph or temporary storage.
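For example, a minimal sketch of the streaming idea: objects are written one at a time with BinaryWriter and read back lazily, so neither side ever needs the complete graph or a temporary buffer. (Record is a hypothetical stand-in for the real objects, not anything from the question.)

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical record type standing in for the real objects.
class Record
{
    public int Id;
    public double Value;
}

static class StreamingStore
{
    // Write records one at a time; nothing ever holds the whole graph.
    public static void Save(Stream s, IEnumerable<Record> records)
    {
        var w = new BinaryWriter(s);
        foreach (var r in records)
        {
            w.Write(r.Id);
            w.Write(r.Value);
        }
        w.Flush();
    }

    // Read back lazily, yielding one record at a time.
    // Assumes a seekable stream (Length/Position are available).
    public static IEnumerable<Record> Load(Stream s)
    {
        var r = new BinaryReader(s);
        while (s.Position < s.Length)
            yield return new Record { Id = r.ReadInt32(), Value = r.ReadDouble() };
    }
}
```

The trade-off is that you lose BinaryFormatter's automatic handling of types and references, so this only works cleanly for flat, homogeneous data.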

Another option is to serialize homogeneous sets of data rather than full graphs - i.e. serialize all the "customers" separately from the "orders"; this would usually reduce volumes, at the expense of having more complexity.

So: what is the scenario here?

Marc Gravell
I've updated the question to (hopefully) cover your 2 questions.
Wilka
A: 

Sounds like you ran up against an internal limitation in the framework. You could write your own serialization using BinaryReader/Writer or DataContractSerializer or whatever, but it's not ideal, I know.

annakata
A: 

Dude, you have reached the end of .net!

I haven't hit this limit, but here are a few pointers:

  1. use [XmlIgnore] to skip some of the objects - maybe you don't need to serialize everything

  2. you could use the serializer manually (i.e. not with attributes, but by implementing Serialize()) and partition the models into more files.

Bogdan Gavril
XmlIgnore only works for XML Serialization.
John Saunders
+1  A: 

Have you thought about the fact that Int32.MaxValue is 2,147,483,647 - over 2 billion?

You'd need 16GB of memory just to store the pointers (assuming a 64-bit machine), let alone the objects themselves. Half that on a 32-bit machine, though squeezing 8GB of pointer data into the maximum of 3GB or so of usable space would be a good trick.
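A quick back-of-the-envelope check of that figure (the 8-byte reference size is the x64 assumption):

```csharp
using System;

// Int32.MaxValue objects would need roughly this many reference slots:
long count = (long)int.MaxValue + 1;      // ~2^31 objects
long bytes64 = count * 8;                 // 8 bytes per reference on x64
Console.WriteLine(bytes64 / (1L << 30));  // GiB of pointer storage alone
```

That works out to 16 GiB before a single object's own data is counted.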

I strongly suspect that your problem is not the number of objects, but that the serialization framework is going into some kind of infinite loop because you have referential loops in your data structure.

Consider this simple class:

public class Node
{
    public string Name { get; set; }
    public IList<Node> Children { get; }
    public Node Parent { get; set; }
    // ...
}

This simple class can't be serialised, because the presence of the Parent property means that serialisation will go into an infinite loop.

Since you're already implementing ISerializable, you are 75% of the way to solving this - you just need to ensure you remove any cycles from the object graph you are storing, to store an object tree instead.

One technique that is often used is to store the name (or id) of a referenced object instead of the actual reference, resolving the name back to the object on load.

Bevan
Question states BinaryFormatter, which handles references/recursion correctly.
Marc Gravell
The Int32.MaxValue message is misleading; I'll add a bit more detail to my question to expand on that.
Wilka
A: 

Do you need to fetch all the data at the same time? Thirteen million objects is a lot of information to handle at once.

You could implement a paging mechanism and fetch the data in smaller chunks. And it might increase the responsiveness of the application, since you wouldn't have to wait for all those objects to finish serializing.

dthrasher
It needs most of the data at once (some could maybe be shaved off). It's needed for statistical analysis. I'm not too concerned with the memory usage of this part of the program (it's running on 64-bit with a decent amount of RAM). Swapping things out to disk would probably make the analysis very slow.
Wilka
+1  A: 

Depending on the structure of the data, maybe you can serialize/deserialize subgraphs of your large object graph? If the data can be partitioned somehow, you could get away with it, creating only a small amount of duplicated serialized data.

qbeuek
+2  A: 

I tried reproducing the problem, but the code just takes forever to run even when each of the 13+ million objects is only 2 bytes. So I suspect you could not only fix the problem, but also significantly improve performance if you pack your data a little better in your custom ISerializable implementations.

Don't let the serializer see so deep into your structure; cut it off at the point where your object graph blows up into hundreds of thousands of array elements or more (because presumably, if you have that many objects, they're pretty small or you wouldn't be able to hold them in memory anyway). Take this example, which allows the serializer to see classes B and C, but manually manages the collection of class A:

class Program
{
    static void Main(string[] args)
    {
        C c = new C(8, 2000000);
        System.Runtime.Serialization.Formatters.Binary.BinaryFormatter bf = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
        System.IO.MemoryStream ms = new System.IO.MemoryStream();
        bf.Serialize(ms, c);
        ms.Seek(0, System.IO.SeekOrigin.Begin);
        for (int i = 0; i < 3; i++)
            for (int j = i; j < i + 3; j++)
                Console.WriteLine("{0}, {1}", c.all[i][j].b1, c.all[i][j].b2);
        Console.WriteLine("=====");
        c = null;
        c = (C)(bf.Deserialize(ms));
        for (int i = 0; i < 3; i++)
            for (int j = i; j < i + 3; j++)
                Console.WriteLine("{0}, {1}", c.all[i][j].b1, c.all[i][j].b2);
        Console.WriteLine("=====");
    }
}

class A
{
    byte dataByte1;
    byte dataByte2;
    public A(byte b1, byte b2)
    {
        dataByte1 = b1;
        dataByte2 = b2;
    }

    public UInt16 GetAllData()
    {
        return (UInt16)((dataByte1 << 8) | dataByte2);
    }

    public A(UInt16 allData)
    {
        dataByte1 = (byte)(allData >> 8);
        dataByte2 = (byte)(allData & 0xff);
    }

    public byte b1
    {
        get
        {
            return dataByte1;
        }
    }

    public byte b2
    {
        get
        {
            return dataByte2;
        }
    }
}

[Serializable()]
class B : System.Runtime.Serialization.ISerializable
{
    string name;
    List<A> myList;

    public B(int size)
    {
        myList = new List<A>(size);

        for (int i = 0; i < size; i++)
        {
            myList.Add(new A((byte)(i % 255), (byte)((i + 1) % 255)));
        }
        name = "List of " + size.ToString();
    }

    public A this[int index]
    {
        get
        {
            return myList[index];
        }
    }

    #region ISerializable Members

    public void GetObjectData(System.Runtime.Serialization.SerializationInfo info, System.Runtime.Serialization.StreamingContext context)
    {
        UInt16[] packed = new UInt16[myList.Count];
        info.AddValue("name", name);
        for (int i = 0; i < myList.Count; i++)
        {
            packed[i] = myList[i].GetAllData();
        }
        info.AddValue("packedData", packed);
    }

    protected B(System.Runtime.Serialization.SerializationInfo info, System.Runtime.Serialization.StreamingContext context)
    {
        name = info.GetString("name");
        UInt16[] packed = (UInt16[])(info.GetValue("packedData", typeof(UInt16[])));
        myList = new List<A>(packed.Length);
        for (int i = 0; i < packed.Length; i++)
            myList.Add(new A(packed[i]));
    }

    #endregion
}

[Serializable()]
class C
{
    public List<B> all;
    public C(int count, int size)
    {
        all = new List<B>(count);
        for (int i = 0; i < count; i++)
        {
            all.Add(new B(size));
        }
    }
}
BlueMonkMN
Packing the data like that seems like a very good idea. I may even be able to use a MemoryStream to do the packing - so a lot of the code won't need to change (it can just continue saving the current way). And maybe do it just for the 'popular' classes, to get the number of objects saved down to a sensible number.
Wilka
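The MemoryStream variant mentioned in the comment above might look something like this sketch (Item is a hypothetical stand-in for one of the 'popular' two-byte classes): the items are funnelled through a MemoryStream so GetObjectData can hand the serializer a single byte[] instead of millions of object references.

```csharp
using System;
using System.IO;

// Hypothetical two-byte item, standing in for a 'popular' small class.
struct Item
{
    public byte B1, B2;
}

static class Packer
{
    // Pack the items through a MemoryStream so the serializer only ever
    // sees one byte[] - suitable for info.AddValue in GetObjectData.
    public static byte[] Pack(Item[] items)
    {
        var ms = new MemoryStream(items.Length * 2);
        var w = new BinaryWriter(ms);
        foreach (var it in items)
        {
            w.Write(it.B1);
            w.Write(it.B2);
        }
        w.Flush();
        return ms.ToArray();
    }

    // Rebuild the items from the packed bytes on deserialization.
    public static Item[] Unpack(byte[] data)
    {
        var r = new BinaryReader(new MemoryStream(data));
        var items = new Item[data.Length / 2];
        for (int i = 0; i < items.Length; i++)
            items[i] = new Item { B1 = r.ReadByte(), B2 = r.ReadByte() };
        return items;
    }
}
```

This keeps the existing save path intact: only the classes that contribute the bulk of the object count need a Pack/Unpack pair in their GetObjectData and deserialization constructor.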