+2  Q: 

C# serialized data

I have been using BinaryFormatter to serialise data to disk, but it doesn't seem very scalable. I've created a 200 MB data file but am unable to read it back in (the error is "End of Stream encountered before parsing was completed"). It tries for about 30 minutes to deserialise and then gives up. This is on a fairly decent quad-CPU box with 8 GB of RAM.

I'm serialising a fairly large complicated structure.

htCacheItems is a Hashtable of CacheItems. Each CacheItem has several simple members (strings, ints, etc.) and also contains a Hashtable and a custom implementation of a linked list. The sub-Hashtable points to CacheItemValue structures, each of which is currently a simple DTO containing a key and a value. The linked-list items are equally simple.

The data file that fails contains about 400,000 CacheItemValues.

Smaller datasets work (though they take longer than I'd expect to deserialise and use a hell of a lot of memory).

    public virtual bool Save(String sBinaryFile)
    {
        bool bSuccess = false;
        FileStream fs = new FileStream(sBinaryFile, FileMode.Create);

        try
        {
            BinaryFormatter formatter = new BinaryFormatter();
            formatter.Serialize(fs, htCacheItems);
            bSuccess = true;
        }
        catch (Exception e)
        {
            bSuccess = false;
        }
        finally
        {
            fs.Close();
        }
        return bSuccess;
    }

    public virtual bool Load(String sBinaryFile)
    {
        bool bSuccess = false;

        FileStream fs = null;
        GZipStream gzfs = null;

        try
        {
            fs = new FileStream(sBinaryFile, FileMode.OpenOrCreate);

            if (sBinaryFile.EndsWith("gz"))
            {
                gzfs = new GZipStream(fs, CompressionMode.Decompress);
            }

            //add the event handler
            ResolveEventHandler resolveEventHandler = new ResolveEventHandler(AssemblyResolveEventHandler);
            AppDomain.CurrentDomain.AssemblyResolve += resolveEventHandler;

            BinaryFormatter formatter = new BinaryFormatter();
            htCacheItems = (Hashtable)formatter.Deserialize(gzfs != null ? (Stream)gzfs : (Stream)fs);

            //remove the event handler
            AppDomain.CurrentDomain.AssemblyResolve -= resolveEventHandler;

            bSuccess = true;
        }
        catch (Exception e)
        {
            Logger.Write(new ExceptionLogEntry("Failed to populate cache from file " + sBinaryFile + ". Message is " + e.Message));
            bSuccess = false;
        }
        finally
        {
            if (fs != null)
            {
                fs.Close();
            }
            if (gzfs != null)
            {
                gzfs.Close();
            }
        }
        return bSuccess;
    }

The resolveEventHandler is just a workaround because I'm serialising the data in one application and loading it in another (http://social.msdn.microsoft.com/Forums/en-US/netfxbcl/thread/e5f0c371-b900-41d8-9a5b-1052739f2521).

The question is, how can I improve this? Is data serialisation always going to be inefficient? Am I better off writing my own routines?

+1  A: 

Something that could help is cascading serialization.

You call mainHashtable.serialize(), which returns an XML string, for example. This method calls serialize() on every item in your Hashtable, and so on down the object graph.

You do the same in reverse with a static method in every class, called 'unserialize(String xml)', which unserializes your objects and returns an object or a list of objects. You get the point?

Of course, you need to implement these methods in every class you want to be serializable.

Take a look at the ISerializable interface, which represents exactly what I'm describing. IMO, this interface looks too "Microsoft" (no use of DOM, etc.), so I created my own, but the principle is the same: cascade.
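
As a rough illustration of the cascade idea, here is a minimal sketch. It assumes two simplified classes shaped like the ones in the question, and the method names ToXml and FromXml are made up for this example. It builds System.Xml.Linq elements rather than raw strings so that children nest without re-parsing. It is not the ISerializable mechanism itself, just the same principle written by hand.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical leaf type: it only knows how to write and read itself.
public class CacheItemValue
{
    public string Key { get; set; }
    public float Value { get; set; }

    public XElement ToXml()
    {
        // XAttribute formats the float in a culture-invariant, round-trippable way.
        return new XElement("value",
            new XAttribute("key", Key),
            new XAttribute("v", Value));
    }

    public static CacheItemValue FromXml(XElement xml)
    {
        return new CacheItemValue
        {
            Key = (string)xml.Attribute("key"),
            Value = (float)xml.Attribute("v")
        };
    }
}

// Hypothetical parent type: it writes its own fields, then asks each child to write itself.
public class CacheItem
{
    public int Id { get; set; }
    public Dictionary<string, CacheItemValue> Values { get; } =
        new Dictionary<string, CacheItemValue>();

    public XElement ToXml()
    {
        var xml = new XElement("item", new XAttribute("id", Id));
        foreach (CacheItemValue value in Values.Values)
        {
            xml.Add(value.ToXml()); // the "cascade" step
        }
        return xml;
    }

    public static CacheItem FromXml(XElement xml)
    {
        var item = new CacheItem { Id = (int)xml.Attribute("id") };
        foreach (XElement child in xml.Elements("value"))
        {
            CacheItemValue value = CacheItemValue.FromXml(child);
            item.Values.Add(value.Key, value);
        }
        return item;
    }
}
```

Calling item.ToXml().ToString() gives the XML text, and CacheItem.FromXml(XElement.Parse(text)) rebuilds the object. The top-level Hashtable would follow the same pattern one level up.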

Clement Herreman
Thanks, I'll check it out.
Gordon Carpenter-Thompson
+2  A: 

Serialization is tricky, particularly when you want to have some degree of flexibility when it comes to versioning.

Usually there's a trade-off between portability and flexibility of what you can serialize. For example, you might want to use Protocol Buffers (disclaimer: I wrote one of the C# ports) as a pretty efficient solution with good portability and versioning - but then you'll need to translate whatever your natural data structure is into something supported by Protocol Buffers.

Having said that, I'm surprised that binary serialization is failing here - at least in that particular way. Can you get it to fail with a large file with a very, very simple piece of serialization code? (No resolution handlers, no compression etc.)

Jon Skeet
The file isn't compressed in this instance; the compression was something I put in to try to speed up loading. I can't easily turn off the resolution handlers because the data was generated by a separate utility (which took around 8 hours to run). I'll take a look at Protocol Buffers and see if that helps. Thanks.
Gordon Carpenter-Thompson
+2  A: 

I would personally try to avoid the need for the assembly-resolve; that has a certain smell about it. If you must use BinaryFormatter, then I'd simply put the DTOs into a separate library (dll) that can be used in both applications.

If you don't want to share the dll, then IMO you shouldn't be using BinaryFormatter - you should be using a contract-based serializer, such as XmlSerializer or DataContractSerializer, or one of the "protocol buffers" implementations (and to repeat Jon's disclaimer: I wrote one of the others).
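
To make the contract-based option a little more concrete, here is a hedged sketch using DataContractSerializer. The class shapes, the urn:example:cache namespace and the CacheFile helper are assumptions for illustration only. The contract is identified by the Name and Namespace given in the attributes rather than by the assembly the class lives in, so each application can declare its own copy of these classes without an AssemblyResolve handler.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;

// Hypothetical DTOs; the explicit Name/Namespace pair is the contract both apps agree on.
[DataContract(Name = "CacheItemValue", Namespace = "urn:example:cache")]
public class CacheItemValue
{
    [DataMember(Order = 1)]
    public string Key { get; set; }

    [DataMember(Order = 2)]
    public float Value { get; set; }
}

[DataContract(Name = "CacheItem", Namespace = "urn:example:cache")]
public class CacheItem
{
    [DataMember(Order = 1)]
    public int Id { get; set; }

    // Settable on purpose: the serializer does not run constructors or
    // field initializers for [DataContract] types when deserializing.
    [DataMember(Order = 2)]
    public Dictionary<string, CacheItemValue> Values { get; set; }
}

// Hypothetical helper wrapping the serializer calls.
public static class CacheFile
{
    public static void Save(Stream stream, Dictionary<string, CacheItem> items)
    {
        var serializer = new DataContractSerializer(typeof(Dictionary<string, CacheItem>));
        serializer.WriteObject(stream, items);
    }

    public static Dictionary<string, CacheItem> Load(Stream stream)
    {
        var serializer = new DataContractSerializer(typeof(Dictionary<string, CacheItem>));
        return (Dictionary<string, CacheItem>)serializer.ReadObject(stream);
    }
}
```

In practice the stream would be a FileStream from File.Create or File.OpenRead. The output is XML, so expect larger files than a binary format would produce.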

200MB does seem pretty big, but I wouldn't have expected it to fail. One possible cause here is the object tracking it does for the references; but even then, this surprises me.

I'd love to see a simplified object model to see if it is a "fit" for any of the above.


Here's an example that attempts to mirror your setup from the description using protobuf-net. Oddly enough there seems to be a glitch working with the linked-list, which I'll investigate; but the rest seems to work:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using ProtoBuf;
    [ProtoContract]
    class CacheItem
    {
        [ProtoMember(1)]
        public int Id { get; set; }
        [ProtoMember(2)]
        public int AnotherNumber { get; set; }
        private readonly Dictionary<string, CacheItemValue> data
            = new Dictionary<string,CacheItemValue>();
        [ProtoMember(3)]
        public Dictionary<string, CacheItemValue> Data { get { return data; } }

        //[ProtoMember(4)] // commented out while I investigate...
        public ListNode Nodes { get; set; }
    }
    [ProtoContract]
    class ListNode // I'd probably expose this as a simple list, though
    {
        [ProtoMember(1)]
        public double Head { get; set; }
        [ProtoMember(2)]
        public ListNode Tail { get; set; }
    }
    [ProtoContract]
    class CacheItemValue
    {
        [ProtoMember(1)]
        public string Key { get; set; }
        [ProtoMember(2)]
        public float Value { get; set; }
    }
    static class Program
    {
        static void Main()
        {
            // invent 400k CacheItemValue records
            Dictionary<string, CacheItem> htCacheItems = new Dictionary<string, CacheItem>();
            Random rand = new Random(123456);
            for (int i = 0; i < 400; i++)
            {
                string key;
                CacheItem ci = new CacheItem {
                    Id = rand.Next(10000),
                    AnotherNumber = rand.Next(10000)
                };
                while (htCacheItems.ContainsKey(key = rand.NextString())) {}
                htCacheItems.Add(key, ci);
                for (int j = 0; j < 1000; j++)
                {
                    while (ci.Data.ContainsKey(key = rand.NextString())) { }
                    ci.Data.Add(key,
                        new CacheItemValue {
                            Key = key,
                            Value = (float)rand.NextDouble()
                        });
                    int tail = rand.Next(1, 50);
                    ListNode node = null;
                    while (tail-- > 0)
                    {
                        node = new ListNode
                        {
                            Tail = node,
                            Head = rand.NextDouble()
                        };
                    }
                    ci.Nodes = node;
                }
            }
            Console.WriteLine(GetChecksum(htCacheItems));
            using (Stream outfile = File.Create("raw.bin"))
            {
                Serializer.Serialize(outfile, htCacheItems);
            }
            htCacheItems = null;
            using (Stream inFile = File.OpenRead("raw.bin"))
            {
                htCacheItems = Serializer.Deserialize<Dictionary<string, CacheItem>>(inFile);
            }
            Console.WriteLine(GetChecksum(htCacheItems));
        }
        static int GetChecksum(Dictionary<string, CacheItem> data)
        {
            int chk = data.Count;
            foreach (var item in data)
            {
                chk += item.Key.GetHashCode()
                    + item.Value.AnotherNumber + item.Value.Id;
                foreach (var subItem in item.Value.Data.Values)
                {
                    chk += subItem.Key.GetHashCode()
                        + subItem.Value.GetHashCode();
                }
            }
            return chk;
        }
        static string NextString(this Random random)
        {
            const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 ";
            int len = random.Next(4, 10);
            char[] buffer = new char[len];
            for (int i = 0; i < len; i++)
            {
                buffer[i] = alphabet[random.Next(0, alphabet.Length)];
            }
            return new string(buffer);
        }
    }
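
Following the comment on ListNode in the code above, one way to sidestep the linked-list member might be to flatten the chain into a plain list before serializing and rebuild it afterwards. This is a sketch under that assumption, not a fix for the glitch itself. The ListNodeFlattening helper is a hypothetical name, and ListNode is repeated here without attributes so the snippet stands alone.

```csharp
using System.Collections.Generic;

// Same shape as the ListNode above, minus the serialization attributes.
public class ListNode
{
    public double Head { get; set; }
    public ListNode Tail { get; set; }
}

// Hypothetical helper: converts between the node chain and a flat list.
public static class ListNodeFlattening
{
    // Walks the chain iteratively, so a long list does not deepen the call stack.
    public static List<double> ToList(ListNode node)
    {
        var values = new List<double>();
        for (ListNode current = node; current != null; current = current.Tail)
        {
            values.Add(current.Head);
        }
        return values;
    }

    // Rebuilds the chain back to front so the first value ends up at the head.
    public static ListNode FromList(IList<double> values)
    {
        ListNode node = null;
        for (int i = values.Count - 1; i >= 0; i--)
        {
            node = new ListNode { Head = values[i], Tail = node };
        }
        return node;
    }
}
```

The flat List<double> could then be exposed as the serialized member in place of Nodes. Whether that suits the real linked-list type depends on what each node actually holds.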
Marc Gravell
I've expanded the detail to give an overview of the structures that I want to serialise. Thanks.
Gordon Carpenter-Thompson