views:

376

answers:

2

I am in the preparation phase of the design a graph-based (or key-value) database library for C++ which many here will find similar to projects such as http://neo4j.org/.

Since this is a very early phase of design, my requirements are simple, unrefined and (I admit) probably still rather naive:

  • A directed acyclic graph
    • A tree-like structure with few roots and many leaves
    • Branches may contain references to other branches
    • But no cycles
    • The graph is represented by key-value pair where keys and values are simple types (integers) for the most part but some may refer to more complex types such as strings
  • Queries
    • Simple queries will usually return edges. I.e. What edges starting from this root, corresponds with (key / value / key-value tupple)?
    • Queries using strings of keys (key, key, key, value)
  • Access patterns and performance
    • Fast lookups should be emphasized
    • Addition of edges
    • But no removal of edges/nodes from the graph. I.e. The graph will grow but it will never shrink.
    • Optimizations may be performed on the graph in order to optimize the memory layout for cache usage
    • Size of the graph is in the order of around 1 MB - 2 GB and should for the most part fit into primary memory

Given these rough requirements as a challenge, what would your primary concerns be with regard to:

  • Memory storage: Layout, allocation
    • E.g. Pool of fixed size blocks?
    • Memory assignment by a clustering algorithm?
  • Fast queries
  • Dynamic reorganization
    • How to handle addition of edges/nodes?
    • Updates for optimization (e.g. improving memory layout)
  • Concurrent access
    • E.g. Handling changes to memory layout by an optimizing thread?

I am looking for good starting points from which to work, hence I'm more than happy to receive references to existing work. Most importantly: What should I be thinking about that I am not thinking about?

+2  A: 

A fast key/value database you might be interested in for reference is TinyCDB http://www.corpit.ru/mjt/tinycdb.html

Matt H
+3  A: 

But no cycles

This is a heavy requirement if you need fast edge inserts. Verifying that a new edge does not introduce a cycle is O(v+e) in the worst case (where v is the number of vertices and e is the number of edges). It probably also rules out concurrent insertion of edges. Consider to make this requirement optional.

Another option is to distinguish between two insert operations: CheapInsert and ExpensiveInsert. Let each vertice have a "rank" integer and only allow cheap inserts of edges from lower to higher rank vertices. The expensive insert will not have this constraint and will automatically rewrite the ranks if necessary. Clients can inspect and change the rank of any vertice (as long as the low-to-high rule isn't broken). This way they can implement their own insert strategy, possibly by exploiting some specific properties of the graph to avoid expensive inserts.

Wim Coenen
This is food for thought, thank you. I only added this "requirement" because I'd thought it might simplify some algorithms related to queries, but I never considered the implications for insertion.
Rehno Lindeque