MyClass provides access to a single file. It must be able to CheckHeader(), ReadSomeData(), UpdateHeader(WithInfo), etc.

But since the file that this class represents is very complex, it requires special design considerations.

The file contains a potentially huge folder-like tree structure with various node types, and is block/cell based to handle fragmentation better. It is usually smaller than 20 MB. The format is not of my design.

How would you design such a class?

  • Read the ~20 MB stream into memory?
  • Put a copy in a temp dir and keep its path as a property?
  • Keep a copy of big things in memory and expose them as read-only properties?
  • GetThings() from the file with exception-throwing code?

This class (or classes) will be used only by me at first, but if it turns out well enough I might open-source it.

(This is a question about design, but the platform is .NET and the class is for offline registry access on XP.)

+3  A: 

It depends on what you need to do with this data. If you only need to process it linearly one time, then it might be faster to just take the hit of loading the large file into memory.

If however you need to do various things with the file beyond a single, linear parsing, I would parse the data into a lightweight database such as SQLite and then operate on that. This way all of your file's structure is preserved and all subsequent operations on the file will be faster.
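A minimal sketch of that idea, assuming the System.Data.SQLite wrapper and a hypothetical Node type (the Path and Data members are invented for illustration):

using System.Collections.Generic;
using System.Data.SQLite;

public class Node { public string Path; public byte[] Data; } // invented for illustration

public static class HiveCache
{
   // Dump parsed registry nodes into SQLite so later queries
   // don't have to re-parse the binary file.
   public static void SaveNodes(IEnumerable<Node> nodes, string dbPath)
   {
      using (var connection = new SQLiteConnection("Data Source=" + dbPath))
      {
         connection.Open();

         using (var create = new SQLiteCommand(
            "CREATE TABLE IF NOT EXISTS nodes (path TEXT PRIMARY KEY, data BLOB)",
            connection))
         {
            create.ExecuteNonQuery();
         }

         foreach (var node in nodes)
         {
            using (var insert = new SQLiteCommand(
               "INSERT INTO nodes (path, data) VALUES (@path, @data)", connection))
            {
               insert.Parameters.AddWithValue("@path", node.Path);
               insert.Parameters.AddWithValue("@data", node.Data);
               insert.ExecuteNonQuery();
            }
         }
      }
   }
}

All subsequent lookups then become SQL queries instead of file traversals.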

lfalin
Considering files in a temporary directory, what would you do if they were suddenly deleted? Would you take measures everywhere to fail gracefully, or is it not worth the hassle? (Is it safe to assume files won't get randomly deleted at run-time in an application-specific temp dir?)
Camilo Martin
+1  A: 

Registry access is quite complex. You are basically reading a large binary tree, so the class design should rely heavily on the stored data structures; only then can you choose an appropriate class design. To stay flexible you should model the primitives such as REG_SZ, REG_EXPAND_SZ, DWORD, SubKey, .... Don Syme has a nice section about binary parsing with binary combinators in his book Expert F#. The basic idea is that your objects know by themselves how to deserialize from a binary representation. When you have a stream of bytes which is structured like this

<Header>
   <Node1/>
   <Node2>
      <Directory1/>
   </Node2>
</Header>

you start with a BinaryReader to read the binary objects byte by byte. Since you know that the first thing must be the header, you can pass the reader to the Header object:

public class Header
{
   public static Header Deserialize(BinaryReader reader)
   {
      Header header = new Header();

      int magic = reader.ReadByte(); // tag byte identifies the next entry
      if (magic == 0xf4)       // we have a node entry
         header.Insert(Node.Read(reader));
      else if (magic == 0xf3)  // directory entry
         header.Insert(DirectoryEntry.Read(reader));
      else
         throw new NotSupportedException("Invalid data");

      return header;
   }
}

To stay performant you can, for example, delay parsing the data until specific properties of an instance are actually accessed.
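A minimal sketch of such deferred parsing, assuming a hypothetical ValueEntry record type (.NET 4's Lazy<T> does the bookkeeping):

using System;
using System.Text;

// Hypothetical node type: the raw bytes are kept, but decoding into a
// string is deferred until the Name property is read for the first time.
public class ValueEntry
{
   private readonly byte[] _raw;
   private readonly Lazy<string> _name;

   public ValueEntry(byte[] raw)
   {
      _raw = raw;
      _name = new Lazy<string>(() => ParseName(_raw));
   }

   public string Name
   {
      get { return _name.Value; } // parsing happens here, once
   }

   private static string ParseName(byte[] raw)
   {
      // Illustrative only: assume a UTF-16 name fills the record.
      return Encoding.Unicode.GetString(raw);
   }
}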

Since the registry in Windows can get quite big, it is not always possible to read it completely into memory at once. You will need to chunk it. One solution that Windows applies is that the whole file is allocated in paged pool memory, which can span several gigabytes, but only the parts that are actually accessed are paged in from disk. That allows Windows to deal with a very large registry file in an efficient manner. You will need something similar for your reader as well. Lazy parsing is one aspect, and the ability to jump around in the file without having to read the data in between is crucial for staying performant.

More info about the paged pool and the registry can be found here: http://blogs.technet.com/b/markrussinovich/archive/2009/03/26/3211216.aspx

Your API design will depend on how you read the data to stay efficient (e.g. use a memory-mapped file and read from different mapped regions). .NET 4 ships with a quite good memory-mapped file implementation, but wrappers around the OS APIs exist as well.
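A minimal sketch of reading from a mapped region with the .NET 4 classes (the file path and record layout are just examples):

using System;
using System.IO.MemoryMappedFiles;

class MmfDemo
{
   static void Main()
   {
      // Map the hive file once, then read small windows of it on demand
      // instead of loading the whole file into a byte array.
      using (var file = MemoryMappedFile.CreateFromFile(@"C:\temp\software.hiv"))
      using (var accessor = file.CreateViewAccessor(0, 4096)) // offset, length
      {
         byte magic = accessor.ReadByte(0);    // positions are relative to the view
         int blockSize = accessor.ReadInt32(1);
         Console.WriteLine("magic={0:x2} blockSize={1}", magic, blockSize);
      }
   }
}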

Yours, Alois Kraus

To support delayed loading from a memory-mapped file it would make sense not to read the byte array into the object and parse it later, but to go one step further and store only the offset and length of the chunk within the memory-mapped file. Later, when the object is actually accessed, you can read and deserialize the data. This way you can traverse the whole file and build a tree of objects which contain only the offsets and a reference to the memory-mapped file. That should save huge amounts of memory.
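A minimal sketch of such a node, with a hypothetical LazyNode type that holds only its offset and length until first use:

using System.IO.MemoryMappedFiles;
using System.Text;

// Hypothetical: a tree node that remembers where its record lives in the
// mapped file instead of holding the record's bytes.
public class LazyNode
{
   private readonly MemoryMappedFile _file;
   private readonly long _offset;
   private readonly int _length;
   private string _name; // decoded on first access

   public LazyNode(MemoryMappedFile file, long offset, int length)
   {
      _file = file;
      _offset = offset;
      _length = length;
   }

   public string Name
   {
      get
      {
         if (_name == null)
         {
            // Read just this record's bytes out of the mapped file.
            using (var view = _file.CreateViewAccessor(_offset, _length))
            {
               var buffer = new byte[_length];
               view.ReadArray(0, buffer, 0, _length);
               _name = Encoding.Unicode.GetString(buffer); // illustrative decoding
            }
         }
         return _name;
      }
   }
}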

Alois Kraus
It's awesome to find helpful answers such as yours, thanks for the support! :) My first draft is a class that has getters such as GetHeader, the header has a GetHash, etc., and these don't get cached, so the user of the class can choose whether to cache or not - do you think this makes sense? I want to make it easy to maintain/refactor later because I'm afraid I'll have to at some point.
Camilo Martin
Actually I think that is what you mean, but just to make sure: my current idea is that the objects won't keep any cache besides their serialized binary representation, and will return component objects from exception-throwing getters. So this means keeping a 1-50 MB byte array as a read-only property (would such a size benefit from memory-mapped files?) and creating the objects from it.
Camilo Martin
Reading the byte arrays into the objects is problematic since you would already be using quite a lot of memory. It would be better to store only the offset relative to the start of the memory-mapped file and read the data only when the object is actually accessed.
Alois Kraus
Thanks for the insight, now I'm ready to design this and other classes that deal with complex files, and the question is answered. :) The concept of "objects that know how to deserialize themselves" is the killer point here, and in conjunction with memory-mapped files it will help in writing much cleaner (and more optimizable) code too.
Camilo Martin
I just hope the `Deserialize()` approach will work as well with `Serialize()` (it's my first time working with a complex binary file like this).
Camilo Martin
The granularity of your object model is defined by the binary structure. If you have a container object and some other header data you need to update, you either need to hold on to the references (memory consumption) or make the tree traversable. It could be solved by a linked list where you can always navigate to your enclosing parent object.
Alois Kraus
This makes sense, and from what I've been reading it seems the registry works with linked lists, so I'll look further into it. I think a registry API should be as simple as a filesystem API for the user, though, so I'll try to make it that simple where possible.
Camilo Martin