I am developing an app that utilizes very large lookup tables to speed up mathematical computations. The largest of these tables is an int[] that has ~10 million entries. Not all of the lookup tables are int[]. For example, one is a Dictionary with ~200,000 entries. Currently, I generate each lookup table once (which takes several minutes) and serialize it to disk (with compression) using the following snippet:

    int[] lut = GenerateLUT();
    lut.Serialize("lut");

where Serialize is defined as follows:

    public static void Serialize(this object obj, string file)
    {
        using (FileStream stream = File.Open(file, FileMode.Create))
        using (var gz = new GZipStream(stream, CompressionMode.Compress))
        {
            var formatter = new BinaryFormatter();
            formatter.Serialize(gz, obj);
        }
    }

The annoyance is that when launching the application, deserialization of these lookup tables takes very long (upwards of 15 seconds). This kind of delay will annoy users, as the app is unusable until all the lookup tables are loaded. Currently the deserialization is as follows:

     var lut1 = (Dictionary<string, int>) Deserialize("lut1");
     var lut2 = (int[]) Deserialize("lut2");
     ...

where Deserialize is defined as:

    public static object Deserialize(string file)
    {
        using (FileStream stream = File.Open(file, FileMode.Open))
        using (var gz = new GZipStream(stream, CompressionMode.Decompress))
        {
            var formatter = new BinaryFormatter();
            return formatter.Deserialize(gz);
        }
    }

At first, I thought the gzip compression might be causing the slowdown, but removing it only shaved a few hundred milliseconds off the serialization/deserialization routines.

Can anyone suggest a way of speeding up the load times of these lookup tables upon the app's initial startup?

A: 

I guess the obvious suggestion is to load them in the background. Once the app has started, the user has opened their project, and selected whatever operation they want, there won't be much of that 15 seconds left to wait.
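A minimal sketch of that background loading, assuming the tables live in static fields and a reset event signals completion (`LoadLut` is a hypothetical stand-in for the `Deserialize("lut1")` call from the question):

```csharp
using System.Threading;

public static class LutLoader
{
    public static volatile int[] Lut1;
    public static readonly ManualResetEvent Loaded = new ManualResetEvent(false);

    // Hypothetical stand-in for the slow (int[])Deserialize("lut1") call.
    static int[] LoadLut() { return new int[10]; }

    public static void BeginLoad()
    {
        var t = new Thread(() =>
        {
            Lut1 = LoadLut();   // the slow deserialization runs off the UI thread
            Loaded.Set();       // computation code waits on this before first use
        }) { IsBackground = true };
        t.Start();
    }
}
```

Call `BeginLoad()` from the app's startup path, and have the first computation `Loaded.WaitOne()` before touching the tables.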

Draemon
I agree with this, but it's still somewhat of a workaround, IMO. Regarding my app, the GUI is simple enough that a user will be ready to perform a computation in less than 5 seconds. So at the moment, I'm striving for load times of 5 seconds or less (where the lookup tables would be loaded in the background in under 5 seconds).
snazzer
+2  A: 

First, deserializing in a background thread will prevent the app from "hanging" while this happens. That alone may be enough to take care of your problem.

However, serialization and deserialization (especially of large dictionaries) tend to be very slow in general. Depending on the data structure, writing your own serialization code can dramatically speed this up, particularly if there are no shared references in the data structures.
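For a flat int[] with no shared references, hand-rolled serialization can be as simple as writing the element count followed by the raw bytes in one block copy, avoiding BinaryFormatter's per-element overhead. A sketch (the file layout is illustrative, not a standard format):

```csharp
using System;
using System.IO;

public static class RawIntArrayIO
{
    public static void Save(string file, int[] data)
    {
        // Copy the whole array into a byte buffer in one shot.
        byte[] bytes = new byte[data.Length * sizeof(int)];
        Buffer.BlockCopy(data, 0, bytes, 0, bytes.Length);
        using (var w = new BinaryWriter(File.Open(file, FileMode.Create)))
        {
            w.Write(data.Length);  // element count header
            w.Write(bytes);        // raw payload
        }
    }

    public static int[] Load(string file)
    {
        using (var r = new BinaryReader(File.Open(file, FileMode.Open)))
        {
            int count = r.ReadInt32();
            byte[] bytes = r.ReadBytes(count * sizeof(int));
            int[] data = new int[count];
            Buffer.BlockCopy(bytes, 0, data, 0, bytes.Length);
            return data;
        }
    }
}
```

A GZipStream can still be layered between the file and the reader/writer if the compression is worth keeping.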

That being said, depending on the usage pattern of this, a database might be a better approach. You could always make something that was more database oriented, and build the lookup table in a lazy fashion from the DB (ie: a lookup is lookup in the LUT, but if the lookup doesn't exist, load it from the DB and save it in the table). This would make startup instantaneous (at least in terms of the LUT), and probably still keep lookups fairly snappy.
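The lazy-fill idea could be sketched like this, with `fetchFromStore` as a hypothetical stand-in for the database query:

```csharp
using System;
using System.Collections.Generic;

public class LazyLut
{
    readonly Dictionary<string, int> cache = new Dictionary<string, int>();
    readonly Func<string, int> fetchFromStore;

    public LazyLut(Func<string, int> fetchFromStore)
    {
        this.fetchFromStore = fetchFromStore;
    }

    public int Lookup(string key)
    {
        int value;
        if (!cache.TryGetValue(key, out value))
        {
            value = fetchFromStore(key);  // slow path: query the backing store
            cache[key] = value;           // cache so the next lookup is in-memory
        }
        return value;
    }
}
```

Startup cost drops to nothing, at the price of a slower first lookup per key.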

Reed Copsey
A: 

Just how much data are we talking about here? In my experience, it takes about 20 seconds to read a gigabyte from disk into memory. So if you're reading upwards of half a gigabyte, you're almost certainly running into hardware limitations.

If data transfer rate isn't the problem, then the actual deserialization is taking time. If you have enough memory, you can load all of the tables into memory buffers (using File.ReadAllBytes()) and then deserialize from a memory stream. That will allow you to determine how much time reading is taking, and how much time deserialization is taking.
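One way to split that measurement, sketched with the formatter passed in as a delegate so any serializer (the question's BinaryFormatter included) can be timed the same way:

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class LoadProfiler
{
    public static object TimedLoad(string file, Func<Stream, object> deserialize)
    {
        var sw = Stopwatch.StartNew();
        byte[] raw = File.ReadAllBytes(file);  // phase 1: pure disk read
        long readMs = sw.ElapsedMilliseconds;

        sw.Restart();
        object result;
        using (var ms = new MemoryStream(raw))
            result = deserialize(ms);          // phase 2: pure deserialization
        long parseMs = sw.ElapsedMilliseconds;

        Console.WriteLine("read: {0} ms, deserialize: {1} ms", readMs, parseMs);
        return result;
    }
}
```

If phase 1 dominates, it's a hardware problem; if phase 2 dominates, a faster serialization format is the fix.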

If deserialization is taking a lot of time, you could, if you have multiple processors, spawn multiple threads to do the deserialization in parallel. With such a system, you could potentially be deserializing one or more tables while loading the data for another. That pipelined approach could make your entire load/deserialization time almost as fast as the load alone.

Jim Mischel
The total data on disk of the lookup tables is less than 100 megabytes, so I think data transfer limitations can be ruled out.
snazzer
A: 

Another option is to put your tables into, well, tables: real database tables. Even an engine like Access should yield pretty good performance, because you have an obvious index for every query. Now the app only has to read in data when it's actually about to use it, and even then it's going to know exactly where to look inside the file.

This might make the app's actual performance a bit lower, because you have to do a disk read for every calculation. But it would make the app's perceived performance much better, because there's never a long wait. And, like it or not, the perception is probably more important than the reality.

Joel Coehoorn
A: 

Why zip them?

Disk is bigger than RAM.

A straight binary read should be pretty quick.

Mike Dunlavey