tags:

views:

742

answers:

4

Hello, It appears to me that one way of storing data in a B-tree as a file can be done efficiently with C using binary file with a sequence (array) of structs, with each struct representing a node. One can thus connect the individual nodes with approach that will be similar to creating linked lists using arrays. But then the problem that props up would be deletion of a node, as erasing only a few bytes in the middle in a huge file is not possible.

One way of deleting could be to keep track of 'empty' nodes until a threshold cutoff is reached and then make another file that will discard the empty nodes. But this is tedious.

Is there a better approach from the simplicity/efficiency point of view for deleting, or even representing a B-tree in a file?

TIA, -Sviiya

+1  A: 

I did a very quick search and dug up this: http://people.csail.mit.edu/jaffer/WB C source: http://cvs.savannah.gnu.org/viewvc/wb/wb/c/ - it seems to offer disk-based B-tree style databases - although taking a look at "delete.c" it seemed to imply if you delete a node everything down from it would be taken out - if that's the correct behaviour then it sounds like something that might help?

Also - B-trees are often used in filesystems - could you not take a look at some filesystem code?

My own inclination is that of a file-system - if you have a B-tree of fixed-size, whenever you "delete" a node rather than attempting to remove the reference, just set the value to whatever means nothing in your code. Then, have a clean-up thread running that checks if anyone has the file open for reading and if all's quiet blocks the file and tidies up.

Ninefingers
Thanks for the reference, Ninefingers. :) Will certainly have to read it.Because deletion can be frequent, accounting for them should be efficient. I would expect some of these operations could perhaps be delayed, but I would need to read the code to see if there is a better option.I also intend to use it for a filesystem later, but then the implementation would be different as the size would be constant. So the design will need to take that into account.
Hmm I agree. That code claims to do what you need and a cursory glance at viewcvs suggests it might - without sitting down and rebuilding your problem though it's hard to tell... I think filesystems do just "zero" elements they wish to delete and assign to any zero element but I could have that wrong. Either way, if this doesn't answer it please do open up the question again!
Ninefingers
The questions does answer what I was looking for, and I already found out about file truncating and hence the problem of deleting data from the middle is circumvented. Thanks. :)
+1  A: 

You can use Berkley DB as well. It works well with C programs and implements B+ tree.

Jack
Yep, but I want to write my own code to get the real feel. :)
Agree. Writing on your own is fine to get the real feel.BBD is very sophisticated database and provides many features that normal code wouldn't. In case of actual product deployment, I would choose BDB. Reinventing wheel would be difficult here.
Jack
+1  A: 

For implementing B-Trees in a file, you can use the file offset instead of pointers. Also, you can implement a "file memory manager", so that you can re-use deleted items in the file.

In order to fully recover the deleted blocks in a B-Tree file, you will have to recreate the B-Tree in a new file. Also remember the most OSes have no methods for truncating files. A portable method for truncating a file is to write a new file and destroy the old.

Another suggestion is to partition the file into B-Tree partition and data (item) partition. A B-Tree partition will contain the pages. The leaf pages will contain offsets to the data items. The data partition will be a section in the file containing data items. You may end up creating more than one of each partition and the partitions may be interleaved.

I spent much time playing with a file based B-Tree, until I gave up and decided to let a database program (or server) handle the data for me.

Thomas Matthews
Sounds interesting. This exercise of mine is to get some exposure to low level coding. I am mainly interested in Linux based systems and it supports file truncation. :)
A: 

The primary reason B-Trees were invented was to be able to store the index in files so that the number of disk access is very small.

Usually B-Tree node sizes are selected so as to be a multiple of the hard disk block size. In order to not waste the space within the node, the concept of 'fullness' was introduced and there are variants of the B-Tree which guarantee 2/3 fullness etc. Unused memory within a node is not such a big deal in normal scenarios.

You could do a few things to reduce amount of node fragementation due to deletion of nodes (not keys):

  • Use a fixed set of node sizes.
  • Keep the data separate from the index.
  • Work with offsets (or pointers) to the nodes instead relying on an implicit array like structure.
  • Maintain a list of deleted nodes (i.e. pointers), and allocate any new nodes needed from there. Having a fixed node sizes helps here.

To reduce the fragmentation within the node, you could pick an implementation which guarantees a reasonable 'fullness' factor. I believe B*-Trees give you 2/3rd fullness.

It is a tough problem to have optimal memory usage. Even the most powerful databases of today need to defrag themselves sometime.

Good luck and have fun!

Moron