I have a very large graph stored in a single-dimensional array (about 1.1 GB), which I am able to hold in memory on my machine running Windows XP with 2 GB of RAM and 2 GB of virtual memory. I can generate the entire data set in memory, but when I try to serialize it to disk using the BinaryFormatter, the file reaches about 50 MB and then I get an out of memory exception. The code I use to write it is the same code I use for all of my smaller data sets:

StateInformation[] diskReady = GenerateStateGraph();
BinaryFormatter bf = new BinaryFormatter();
using (Stream file = File.OpenWrite(@"C:\temp\states.dat"))
{
    bf.Serialize(file, diskReady);
}

The search algorithm is very lightweight, and I am able to perform searches on this graph with no problems once it is in memory.

I really have 3 questions:

  1. Is there a more reliable way to write a large data set to disk? I suppose you can define "large" as a data set whose size approaches the amount of available memory, though I am not sure how accurate that definition is.

  2. Should I move to a more database-centric approach?

  3. Can anyone point me to some literature on reading portions of a large data set from a disk file in C#?

+1  A: 

My experience with larger sets of information like this is to write them to disk manually, rather than using the built-in serialization.

This may not be practical depending on how complex your StateInformation class is, but if it is fairly simple you could write/read the binary data manually using a BinaryReader and BinaryWriter instead. These allow you to read/write most value types directly to the stream, in a predetermined order dictated by your code.

This option should allow you to read/write your data quickly, although it becomes awkward if you later want to add fields to StateInformation or remove them, since you'll have to manage upgrading your files.
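
As a rough sketch, assuming StateInformation exposes something like a string Name and a decimal Value (the real members aren't shown in the question), the write and read sides would look like:

StateInformation[] diskReady = GenerateStateGraph();
using (var writer = new BinaryWriter(File.OpenWrite(@"C:\temp\states.dat")))
{
    writer.Write(diskReady.Length);          // record count up front
    foreach (StateInformation si in diskReady)
    {
        writer.Write(si.Name);               // length-prefixed string
        writer.Write(si.Value);              // decimal
    }
}

using (var reader = new BinaryReader(File.OpenRead(@"C:\temp\states.dat")))
{
    var states = new StateInformation[reader.ReadInt32()];
    for (int i = 0; i < states.Length; i++)
    {
        states[i] = new StateInformation
        {
            Name = reader.ReadString(),
            Value = reader.ReadDecimal()
        };
    }
}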

Ian
There are about 600,000 states, and `StateInformation` basically contains a bunch of standard data types (strings and decimal values). Each `StateInformation` ranges in size from about one to three kilobytes. Also, it will never need to be changed once created; the data set is complete.
NickLarsen
Sounds like this would be a good option then. Jon Hanna's answer is similar, though more of a halfway house: it serializes one object at a time rather than manually writing out the member values.
Ian
Yes, though that said, mine is mostly a halfway house because it's the only full example that can be given with the information in the question. Where I say it can be considerably optimised with a specialised format, that specialisation becomes your solution. Whether such specialisation is even worth the effort depends on knowledge of the type.
Jon Hanna
A: 

See memory-mapped files in .NET 4.0.
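
A minimal sketch of the idea, reading just a window of an existing large file (the offset and window size here are placeholders):

using System.IO;
using System.IO.MemoryMappedFiles;

long offset = 0;                    // placeholder: start of the region you need
long window = 64L * 1024 * 1024;    // placeholder: a 64 MB view

using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\temp\states.dat", FileMode.Open))
using (MemoryMappedViewStream view = mmf.CreateViewStream(offset, window))
{
    // view is an ordinary Stream limited to that region of the file,
    // so only the mapped pages need to be resident at any one time.
    var reader = new BinaryReader(view);
    // ... read values from the mapped region ...
}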

saurabh
+1  A: 

Write the entries to the file yourself. One simple solution would be:

StateInformation[] diskReady = GenerateStateGraph();
BinaryFormatter bf = new BinaryFormatter();
using (Stream file = File.OpenWrite(@"C:\temp\states.dat"))
{
  foreach (StateInformation si in diskReady)
    using (MemoryStream ms = new MemoryStream())
    {
      bf.Serialize(ms, si);  // serialize one entry at a time, not the whole array
      byte[] ser = ms.ToArray();
      int len = ser.Length;
      // write the length as a 4-byte little-endian prefix
      file.WriteByte((byte)(len & 0xFF));
      file.WriteByte((byte)((len >> 8) & 0xFF));
      file.WriteByte((byte)((len >> 16) & 0xFF));
      file.WriteByte((byte)((len >> 24) & 0xFF));
      file.Write(ser, 0, len);
    }
}

No more than the memory for a single serialized StateInformation object is needed at a time; to deserialise, you read four bytes, reconstruct the length, create a buffer of that size, fill it, and deserialise from it.
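
For example, the read side might look like this (a sketch matching the length-prefixed layout above):

using System.Collections.Generic;

BinaryFormatter bf = new BinaryFormatter();
var states = new List<StateInformation>();
using (Stream file = File.OpenRead(@"C:\temp\states.dat"))
{
  byte[] lenBuf = new byte[4];
  while (file.Read(lenBuf, 0, 4) == 4)
  {
    // reassemble the little-endian length prefix
    int len = lenBuf[0] | (lenBuf[1] << 8) | (lenBuf[2] << 16) | (lenBuf[3] << 24);
    byte[] ser = new byte[len];
    int read = 0;
    while (read < len)
      read += file.Read(ser, read, len - read);
    using (var ms = new MemoryStream(ser))
      states.Add((StateInformation)bf.Deserialize(ms));
  }
}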

All of the above could be seriously optimised for speed, memory use and disk size if you create a more specialised format, but it demonstrates the principle.

Jon Hanna
This worked, though in researching the problem we found that the major issue is with 32-bit Windows and how it manages memory. This became apparent when we moved the application to a machine with 4 GB of physical memory and 4 GB of virtual memory, and the exact same out-of-memory exception occurred at the same point in the process.
NickLarsen
If 2 GB of memory is too small to work with a collection of items that each take less than around half a gig, the issue isn't the amount of memory available. There are techniques to get at more memory than that, but doing so will never be as efficient as not using that much memory in the first place.
Jon Hanna
A: 

What is contained in StateInformation? Is it a class or a struct?

If you simply want an easy-to-use container format that serializes readily to disk, create a typed DataSet, store the information in it, and use the DataSet's WriteXml() method to persist it. You can then create an empty DataSet and use ReadXml() to load the contents back into memory.
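
As a sketch (the column names here are hypothetical; a typed DataSet generated from an XSD works the same way):

using System.Data;

var table = new DataTable("States");
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Value", typeof(decimal));
var ds = new DataSet();
ds.Tables.Add(table);
// ... add one row per StateInformation ...
ds.WriteXml(@"C:\temp\states.xml", XmlWriteMode.WriteSchema);

var ds2 = new DataSet();
ds2.ReadXml(@"C:\temp\states.xml");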

If StateInformation is a struct containing only value types, you can look at MemoryMappedFile to store/use the contents of the array by referencing the file directly, treating it as memory. This approach is quite a bit more complicated than the DataSet, but has its own set of advantages.
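
A sketch of that approach, using a hypothetical blittable StateRecord struct as a stand-in for StateInformation:

using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

struct StateRecord              // hypothetical: value-type fields only, no strings
{
    public long NodeId;
    public decimal Value;
}

// ...

long index = 12345;             // placeholder: which record to fetch
int size = Marshal.SizeOf(typeof(StateRecord));
using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\temp\states.dat", FileMode.Open))
using (MemoryMappedViewAccessor accessor = mmf.CreateViewAccessor())
{
    StateRecord record;
    accessor.Read(index * size, out record);    // random access by index, no full load
}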

Michael