I have an HDF5 file with a one-dimensional (N x 1) dataset of compound elements - actually a time series. The data is first collected offline into the HDF5 file and then analyzed. During analysis most of the data turns out to be uninteresting, and only some parts of it are interesting. Since the datasets can be quite big, I would like to get rid of the uninteresting elements while keeping the interesting ones. For instance, keep elements 0-100, 200-300 and 350-400 of a 500-element dataset and dump the rest. But how?

Does anybody have experience with how to accomplish this in HDF5? Apparently it could be done in several ways, at least:

  • (The obvious solution) Create a fresh new file, write the necessary data there element by element, and then delete the old file.
  • Or, within the old file, create a fresh new dataset, write the necessary data there, unlink the old dataset using H5Gunlink(), and get rid of the unclaimed free space by running the file through h5repack (a rough sketch of this approach follows the list).
  • Or, move the interesting elements within the existing dataset towards the start (e.g. move elements 200-300 to positions 101-201 and elements 350-400 to positions 202-252). Then call H5Dset_extent() to shrink the dataset, and perhaps run the file through h5repack to release the free space.
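
For concreteness, here is a rough sketch of the second option, written against the 1.6-era C API to match the H5Gunlink() call above. The dataset names and the way the kept ranges are passed in are just illustration, and error checking is omitted:

    /* Copy the interesting element ranges of a 1-D compound dataset into
     * a fresh dataset in the same file, then unlink the old one.  The
     * file still has to go through h5repack afterwards to give the
     * space back to the filesystem. */
    #include <hdf5.h>
    #include <stdlib.h>

    /* A range of elements to keep: start index and element count. */
    typedef struct { hsize_t start; hsize_t count; } range_t;

    void copy_kept_ranges(hid_t file, const char *src_name,
                          const char *dst_name,
                          const range_t *ranges, size_t nranges)
    {
        hid_t src   = H5Dopen(file, src_name);
        hid_t dtype = H5Dget_type(src);      /* reuse the compound type */

        /* Total number of elements that survive. */
        hsize_t total = 0;
        for (size_t i = 0; i < nranges; i++)
            total += ranges[i].count;

        /* New 1-D dataset of exactly the kept size. */
        hid_t dst_space = H5Screate_simple(1, &total, NULL);
        hid_t dst = H5Dcreate(file, dst_name, dtype, dst_space, H5P_DEFAULT);

        size_t elem_size = H5Tget_size(dtype);
        hsize_t written = 0;
        for (size_t i = 0; i < nranges; i++) {
            hsize_t count = ranges[i].count;
            void *buf = malloc(count * elem_size);

            /* Read one kept slab from the source dataset... */
            hid_t src_space = H5Dget_space(src);
            H5Sselect_hyperslab(src_space, H5S_SELECT_SET,
                                &ranges[i].start, NULL, &count, NULL);
            hid_t mem_space = H5Screate_simple(1, &count, NULL);
            H5Dread(src, dtype, mem_space, src_space, H5P_DEFAULT, buf);

            /* ...and append it to the destination dataset. */
            hid_t out_space = H5Dget_space(dst);
            H5Sselect_hyperslab(out_space, H5S_SELECT_SET,
                                &written, NULL, &count, NULL);
            H5Dwrite(dst, dtype, mem_space, out_space, H5P_DEFAULT, buf);
            written += count;

            H5Sclose(out_space);
            H5Sclose(mem_space);
            H5Sclose(src_space);
            free(buf);
        }

        H5Sclose(dst_space);
        H5Dclose(dst);
        H5Tclose(dtype);
        H5Dclose(src);

        /* Drop the old dataset; the space is only reclaimed later by
         * running h5repack on the file. */
        H5Gunlink(file, src_name);
    }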

Since the files can be quite big even after the uninteresting elements have been removed, I'd rather not rewrite them (it would take a long time), but that seems to be required to actually release the free space. Any hints from HDF5 experts?

+2  A: 

HDF5 (at least the version I am used to, 1.6.9) does not really support deletion. More precisely, you can delete a dataset, but the space it occupied is not freed, so you still end up with a huge file. As you said, you can use h5repack, but it's a waste of time and resources.

Something you can do is to keep a parallel dataset of boolean flags telling you which elements are "alive" and which ones have been removed. This does not make the file smaller, but at least it gives you a fast way to perform deletion.
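
A minimal sketch of such a mask, assuming one flag byte per element (1 = alive, 0 = removed); the dataset name is a placeholder and error checking is omitted:

    /* A parallel "alive" mask: a second 1-D dataset with one flag per
     * element of the time series.  Deleting a range of elements then
     * just means clearing the corresponding flags. */
    #include <hdf5.h>
    #include <stdlib.h>
    #include <string.h>

    /* Create the mask alongside an existing n-element dataset, all alive. */
    void create_alive_mask(hid_t file, const char *mask_name, hsize_t n)
    {
        hid_t space = H5Screate_simple(1, &n, NULL);
        hid_t mask  = H5Dcreate(file, mask_name, H5T_NATIVE_UCHAR,
                                space, H5P_DEFAULT);
        unsigned char *ones = malloc(n);
        memset(ones, 1, n);                         /* 1 = alive */
        H5Dwrite(mask, H5T_NATIVE_UCHAR, H5S_ALL, H5S_ALL, H5P_DEFAULT, ones);
        free(ones);
        H5Dclose(mask);
        H5Sclose(space);
    }

    /* "Delete" elements [start, start+count) by clearing their flags. */
    void mark_range_dead(hid_t file, const char *mask_name,
                         hsize_t start, hsize_t count)
    {
        hid_t mask   = H5Dopen(file, mask_name);
        hid_t fspace = H5Dget_space(mask);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET,
                            &start, NULL, &count, NULL);
        hid_t mspace = H5Screate_simple(1, &count, NULL);
        unsigned char *zeros = calloc(count, 1);    /* 0 = removed */
        H5Dwrite(mask, H5T_NATIVE_UCHAR, mspace, fspace, H5P_DEFAULT, zeros);
        free(zeros);
        H5Sclose(mspace);
        H5Sclose(fspace);
        H5Dclose(mask);
    }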

An alternative is to define a hyperslab on your array, copy the relevant data and then delete the old array, or to always access the data through the hyperslab selection and redefine it as you need (I've never done it, though, so I'm not sure it's possible, but it should be).
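
For example, the three ranges from the question could be read in one go through a union of hyperslab selections, leaving the dataset itself untouched. This is only a sketch (1.6-style API, placeholder names, no error checking), and the caller is assumed to provide a buffer big enough for the 253 selected elements:

    /* Read only the interesting elements (0-100, 200-300, 350-400)
     * through a union of hyperslab selections. */
    #include <hdf5.h>

    void read_interesting(hid_t file, const char *name,
                          hid_t memtype, void *buf)
    {
        hid_t dset   = H5Dopen(file, name);
        hid_t fspace = H5Dget_space(dset);

        hsize_t starts[3] = { 0, 200, 350 };
        hsize_t counts[3] = { 101, 101, 51 };

        /* First slab replaces the selection, the others are OR-ed in. */
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET,
                            &starts[0], NULL, &counts[0], NULL);
        for (int i = 1; i < 3; i++)
            H5Sselect_hyperslab(fspace, H5S_SELECT_OR,
                                &starts[i], NULL, &counts[i], NULL);

        /* Contiguous memory layout for the 253 selected elements. */
        hsize_t total  = 101 + 101 + 51;
        hid_t   mspace = H5Screate_simple(1, &total, NULL);
        H5Dread(dset, memtype, mspace, fspace, H5P_DEFAULT, buf);

        H5Sclose(mspace);
        H5Sclose(fspace);
        H5Dclose(dset);
    }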

Finally, you can use HDF5's mounting facility to keep your datasets in an "attached" HDF5 file that you mount onto your root HDF5 file. When you want to delete the stuff, copy the interesting data into another mounted file, unmount the old file and remove it, then remount the new file in the proper place. This solution can be messy (you end up with multiple files around), but it lets you free space and operate only on subparts of your data tree, instead of repacking everything.
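
Roughly, the mounting approach could look like the following; the file and group names are made up, the mount point /data has to exist as a group in the main file, and error checking is omitted:

    /* Keep the bulky time series in a separate child file and mount it
     * under a group of the main file.  "Deleting" then means writing a
     * trimmed copy into a new child file, unmounting the old one and
     * removing it on the filesystem. */
    #include <hdf5.h>

    int main(void)
    {
        hid_t root  = H5Fopen("main.h5", H5F_ACC_RDWR, H5P_DEFAULT);
        hid_t child = H5Fopen("series_v1.h5", H5F_ACC_RDWR, H5P_DEFAULT);

        /* The group /data in main.h5 becomes the mount point for the
         * child file's root group. */
        H5Fmount(root, "/data", child, H5P_DEFAULT);

        /* ... work with the data as /data/<dataset>, copying the
         * interesting parts into series_v2.h5 ... */

        H5Funmount(root, "/data");
        H5Fclose(child);
        /* delete series_v1.h5 on the filesystem, then mount series_v2.h5 */
        H5Fclose(root);
        return 0;
    }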

Stefano Borini
Thanks for your thoughts. It could indeed be done in many ways. Hopefully real deletion capabilities will be added to HDF5 someday; having to mess around like this just to delete stuff is almost ridiculous... But otherwise it's a good file format :-)
Joonas Pulakka
Deletion and packing are not easy. HDF5 is like a filesystem: even if you free the blocks by marking them as deleted, packing the file while it is "live" is hard and a performance bottleneck. You face the same issue with the encrypted filesystem on MacOSX, so I would not expect a solution to your problem at the HDF5 level any time soon.
Stefano Borini
True, it's certainly not easy, and maybe not of primary interest to the HDF5 developers. But it should certainly be possible - most databases can do it, I think.
Joonas Pulakka
Well, in PostgreSQL you have the VACUUM command, which does exactly that, and it is normally invoked by hand once in a while.
Stefano Borini