I have a solution where I need to read objects into memory very quickly; however, the binary stream might be cached in memory in compressed form to save time on disk IO.

I've tinkered around with different solutions. Obviously XmlTextWriter and XmlTextReader weren't so good, and neither was the built-in binary serialization. Protobuf-net is excellent but still a little bit too slow. Here are some stats:

File size, XML: 217 KB
File size, binary: 87 KB
Compressed binary: 26 KB
Compressed XML: 26 KB

Deserialize with XML (XmlTextReader): 8.4 sec
Deserialize with binary (protobuf-net): 6.2 sec
Deserialize with binary, without string interning (protobuf-net): 5.2 sec
Deserialize with binary from memory: 5.9 sec
Time to decompress binary file into memory: 1.8 sec

Serialize with XML (XmlTextWriter): 11 sec
Serialize with binary (protobuf-net): 4 sec
Serialize with binary, length-prefixed (protobuf-net): 3.8 sec

That got me thinking, it seems (correct me if I'm wrong) that the major culprit of deserialization is the actual byte conversion rather than the IO. If that's the case, then it should be a candidate for using the new Parallel extensions.

Since I'm a bit of a novice when it comes to binary IO, I'd appreciate some input before I commit time to a solution, though :)

For simplicity's sake, say we want to deserialize a list of objects with no optional fields. My first idea was simply to store each object with a length prefix, read the byte[] of each into a list of byte[], and use PLINQ to do the byte[] -> object deserialization.
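Something like this minimal sketch of the idea, where Record is a hypothetical protobuf-net [ProtoContract] type standing in for the real object, and records are stored as [int32 length][payload]:

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using ProtoBuf;

    // Sequential read of length-prefixed records, then parallel
    // deserialization with PLINQ.
    public static List<Record> DeserializeAll(Stream input)
    {
        var blobs = new List<byte[]>();
        using (var reader = new BinaryReader(input))
        {
            while (input.Position < input.Length)      // assumes a seekable stream
            {
                int length = reader.ReadInt32();       // length prefix
                blobs.Add(reader.ReadBytes(length));   // raw record bytes
            }
        }

        // The CPU-bound part runs in parallel; AsOrdered preserves record order.
        return blobs.AsParallel()
                    .AsOrdered()
                    .Select(b => Serializer.Deserialize<Record>(new MemoryStream(b)))
                    .ToList();
    }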

However, with that method I still need to read the byte[]s single-threaded. So perhaps one could read the whole binary stream into memory instead (how large a binary file is feasible for that, by the way?) and store, at the beginning of the binary file, how many objects there are plus each object's length and offset. Then I should be able to just create ArraySegments or something and do the chunking in parallel too.
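A sketch of that header idea, assuming a hypothetical layout of [int32 count][count x (int32 offset, int32 length)][payloads...] and the same hypothetical Record type:

    using System;
    using System.IO;
    using System.Threading.Tasks;
    using ProtoBuf;

    public static Record[] DeserializeFromBuffer(byte[] buffer)
    {
        int count = BitConverter.ToInt32(buffer, 0);
        var results = new Record[count];

        Parallel.For(0, count, i =>
        {
            int offset = BitConverter.ToInt32(buffer, 4 + i * 8);
            int length = BitConverter.ToInt32(buffer, 8 + i * 8);

            // A MemoryStream over a slice of the shared buffer avoids copying.
            using (var ms = new MemoryStream(buffer, offset, length))
            {
                results[i] = Serializer.Deserialize<Record>(ms);
            }
        });

        return results;
    }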

So what do you guys think, is it feasible?

A: 

When I deserialize a list of objects from an XML file larger than 1 MB, it takes less than 2 seconds with this code:

    using System.Collections.Generic;
    using System.IO;
    using System.Xml;
    using System.Xml.Serialization;

    public static List<T> FromXML<T>(this string s) where T : class
    {
        var ls = new List<T>();
        var xml = new XmlSerializer(typeof(List<T>));
        // Ensure both readers are disposed when done.
        using (var sr = new StringReader(s))
        using (var xmltxt = new XmlTextReader(sr))
        {
            if (xml.CanDeserialize(xmltxt))
            {
                ls = (List<T>)xml.Deserialize(xmltxt);
            }
        }
        return ls;
    }
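Usage is then just this, with Person standing in for whatever your element type is:

    var people = xmlString.FromXML<Person>();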

Try this and see if it's better for the XML case?

Florim Maxhuni
The XML serialization works like that, but part of the overhead is probably the large number of relatively small objects, so object creation becomes an issue. Nevertheless, XML serialization is almost never faster than binary, and it's much more verbose, which costs more time in file IO.
MattiasK
A: 

A binary file can be read simultaneously by several threads. To do that it must be opened with the appropriate access/share modifiers, and then each thread can get its own offset and length in that file. Thus reading in parallel is not a problem.
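A minimal sketch of the open mode, assuming path, offset and length come from the chunking step below; each thread opens its own handle:

    // FileShare.Read lets several threads read the same file concurrently,
    // as long as each thread uses its own FileStream.
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        fs.Seek(offset, SeekOrigin.Begin);   // jump to this thread's chunk
        // ... read 'length' bytes here and deserialize them ...
    }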

Let us assume that you stick to a simple binary format: each object is prefixed with its length. Knowing that, you can "scroll" through the file and know the offset at which to put each deserializing thread.

The deserializing algorithm can look like this (a boundary-scan sketch follows the list):

1) analyze the file: divide it into several relatively large chunks, where each chunk border coincides with an object border

2) spawn the necessary number of deserializer threads and "instruct" them with the appropriate offset and length to read

3) combine the results of all deserializer threads into one list
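A possible sketch of step 1, assuming the simple [int32 length][payload] format from above:

    using System.Collections.Generic;
    using System.IO;

    // Walk the length prefixes so that every chunk border coincides
    // with an object border. Returns the start offset of each chunk.
    public static List<long> FindChunkOffsets(string path, int chunkCount)
    {
        var offsets = new List<long> { 0 };
        using (var fs = File.OpenRead(path))
        using (var reader = new BinaryReader(fs))
        {
            long targetChunkSize = fs.Length / chunkCount;
            while (fs.Position < fs.Length)
            {
                int length = reader.ReadInt32();        // length prefix
                fs.Seek(length, SeekOrigin.Current);    // skip the payload
                if (offsets.Count < chunkCount &&
                    fs.Position >= targetChunkSize * offsets.Count)
                {
                    offsets.Add(fs.Position);           // next chunk starts here
                }
            }
        }
        return offsets;
    }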

Vadmyst
Hi, I'm working on a solution like this, but rather than reading from the disk in parallel I decided to retrieve the whole file into a byte buffer first and then deserialize/serialize from there. It seems much faster; if you read from disk in parallel you'd be limited by the speed of the disk. I'll post some info on my solution here when I'm done, thanks
MattiasK
Indeed, for small data sizes it is more feasible to cache it in memory. I proposed file reading as it is the more general solution. Memory caching can be added as an optimization (you never know how much data someone will want to deserialize :) )
Vadmyst
A: 

That got me thinking, it seems (correct me if I'm wrong) that the major culprit of deserialization is the actual byte conversion rather than the IO.

Don't assume where the time is being spent; get yourself a profiler and find out.
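Short of a full profiler, even a crude manual split will tell you whether the IO or the byte conversion dominates. A sketch, with the same hypothetical protobuf-net Record type as above:

    var sw = System.Diagnostics.Stopwatch.StartNew();
    byte[] raw = File.ReadAllBytes(path);            // pure IO
    Console.WriteLine("IO: {0} ms", sw.ElapsedMilliseconds);

    sw.Restart();
    var list = Serializer.Deserialize<List<Record>>(new MemoryStream(raw));  // pure conversion
    Console.WriteLine("Deserialize: {0} ms", sw.ElapsedMilliseconds);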

Paolo
+1  A: 

I do things like this quite a lot, and nothing really beats using BinaryReader to read things in. As far as I know, there is no faster way than BinaryReader.ReadInt32 to read in a 32-bit integer.

You may also find that the overhead of making it parallel and joining the results back together is too much. If you really want to go the parallel route, I would advise using multiple threads to read multiple files, rather than multiple threads to read one file in multiple blocks.
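For example, with a hypothetical per-file Load method:

    // One worker per file; each Load does its own sequential BinaryReader pass.
    var results = fileNames.AsParallel()
                           .Select(f => Load(f))
                           .ToList();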

You could also play around with the block size to match the disk's block size, but there are so many levels of abstraction between your application and the disk that this could well be a waste of time.

Nick R
+1. I do this a lot too and BinaryReader is the way to go.
zebrabox