views:

226

answers:

4

I was wondering as to the practicalities of storing an in memory tree structure as directory tree for persistence purposes. In my case he target filesystem will be ZFS, and once the structure has been created it will be accessed by multiple processes infrequently.

How performant is using a Directory Tree as a mechanism of persistence for Data Trees?

+1  A: 

If I understand it correctly you're talking about building a tree structure that would give an in-code representation of your filesystem, so I suspect you'll incur overhead at the start where you're reading in your tree structure, but subsequent lookups and traversals of the tree would likely be quicker than hitting disk storage each time.

theraccoonbear
+3  A: 

In order to read and write your tree, you will be calling the filesystem several times per node. This is much more expensive than any sane code you could devise to walk a memory image.

Whether it is a sensible approach depends on what your usage pattern is expected to be. If in a typical invocation of your code you expect to read in the entire tree structure, work on it, then write it out in full - you are better off marshalling it up into a single file. If, however, you expect to read/work on/mutate just a few nodes, without reading in most of the tree, the difference in performance between walking the directory structure and doing multiple seeks/reads to traverse a tree stored in a single file will be much smaller, and it may well become worth doing the former for simplicity / clarity / avoiding reinventing wheels. Moreover, if multiple processes are doing this simultaneously, locking nodes and subtrees becomes much easier with the directory-based approach.

Be aware that for some commonly used filesystems the time to open a directory entry depends on the total number of entries in the directory.

EDIT: I've done similar things with ext3 for a site's CGI backend; not reinventing the wheel made prototyping quicker and maintenance simpler, reads/writes/locking scaled pretty well, but very frequent changes - on the order of hundreds per second - to the directory structure itself worked poorly on real storage; in the end I restructured things so that the sections of directory tree to which directory entries would very frequently be added/removed ended up on a tmpfs volume - for me this set of state could (expensively) be reconstructed from that stored in less volatile storage following a reboot. I have little experience of ZFS, and don't know your intended usage pattern, so don't know if this would be a problem for you. Were I now doing this for a very heavily used site, I would probably roll my own named lock library instead.

moonshadow
@moonshadow: Thanks for the Edit. I don't think I will be anywhere need the limits of what you described! So I think I will use this model for the foreseeable future. Thanks. :)
_ande_turner_
+2  A: 

Most filesystems are optimised for access to an open file, so opening/closing a file takes a significant time. If each leaf of your tree is small, reading/writing the whole structure would take many times longer than necessary.

Also, most filesystems have a minimal allocation block, usually around 2-8KB. if your leaves are much smaller than that, you'll be wasting a lot of space.

In short, the smaller your leaves, the worse the idea.

Javier
+1  A: 

Possible issues:

  • It may make inefficient use of disk space (in many file systems a directory is a file and as such occupies an entire block on the disk...)
  • It will be slow to read/write because you make many file-system accesses
  • The file system may/will impose limits on the length of each item name and/or characters you can use for names
  • It will be easy for other processes to corrupt your data and/or require considerable locking cost
  • When using solid-state ``disks'' this may result in more writes than another methods and shorten the life of the media

Bottom line: it may not be worth it.

dmckee