I am in the preparation phase of the design a graph-based (or key-value) database library for C++ which many here will find similar to projects such as http://neo4j.org/.
Since this is a very early phase of design, my requirements are simple, unrefined and (I admit) probably still rather naive:
- A directed acyclic graph
- A tree-like structure with few roots and many leaves
- Branches may contain references to other branches
- But no cycles
- The graph is represented by key-value pair where keys and values are simple types (integers) for the most part but some may refer to more complex types such as strings
- Queries
- Simple queries will usually return edges. I.e. What edges starting from this root, corresponds with (key / value / key-value tupple)?
- Queries using strings of keys (key, key, key, value)
- Access patterns and performance
- Fast lookups should be emphasized
- Addition of edges
- But no removal of edges/nodes from the graph. I.e. The graph will grow but it will never shrink.
- Optimizations may be performed on the graph in order to optimize the memory layout for cache usage
- Size of the graph is in the order of around 1 MB - 2 GB and should for the most part fit into primary memory
Given these rough requirements as a challenge, what would your primary concerns be with regard to:
- Memory storage: Layout, allocation
- E.g. Pool of fixed size blocks?
- Memory assignment by a clustering algorithm?
- Fast queries
- Dynamic reorganization
- How to handle addition of edges/nodes?
- Updates for optimization (e.g. improving memory layout)
- Concurrent access
- E.g. Handling changes to memory layout by an optimizing thread?
I am looking for good starting points from which to work, hence I'm more than happy to receive references to existing work. Most importantly: What should I be thinking about that I am not thinking about?