Suppose that I have a very large undirected, unweighted graph (starting at hundreds of millions of vertices, ~10 edges per vertex), non-distributed and processed by single thread only and that I want to do breadth-first searches on it. I expect them to be I/O-bound, thus I need a good-for-BFS disk page layout, disk space is not an issue. The searches can start on every vertex with equal probability. Intuitively that means minimizing the number of edges between vertices on different disk pages, which is a graph partitioning problem.
The graph itself looks like a spaghetti, think of random set of points randomly interconnected, with some bias towards shorter edges.
The problem is, how does one partition graph this large? The available graph partitioners I have found work with graphs that fit into memory only. I could not find any descriptions nor implementations of any streaming graph partitioning algorithms.
OR, maybe there is an alternative to partitioning graph for getting a disk layout that works well with BFS?
Right now as an approximation I use the fact that the vertices have spatial coordinates attached to them and put the vertices on disk in Hilbert sort order. This way spatially close vertices land on the same page, but the presence or absence of edge between them is completely ignored. Can I do better?
As an alternative, I can split graph into pieces using the Hilbert sort order for vertices, partition the subgraphs, stitch them back and accept poor partitioning on the seams.
Some things I have looked into already:
- http://stackoverflow.com/questions/1526479/how-to-store-a-large-directed-unweighted-graph-with-billions-of-nodes-and-vertice
- http://neo4j.org/ - I found zero information on how does it do graph layout on disk
Partitioning implementations (unless I'm mistaken, all of them need to fit graph into memory):
- http://glaros.dtc.umn.edu/gkhome/views/metis
- http://www.sandia.gov/~bahendr/chaco.html
- http://staffweb.cms.gre.ac.uk/~c.walshaw/jostle/
- http://www.cerfacs.fr/algor/Softs/MESHPART/
EDIT: info on how the graphs looks like and that BFS can start everywhere. EDIT: idea on partitioning subgraphs