views:

129

answers:

4

I have a huge directed graph: It consists of 1.6 million nodes and 30 million edges. I want the users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database. But that solution is not very efficient and elegant, I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).

It was suggested to me to use a GraphDB like neo4j or AllegroGraph. However the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j on the other hand has only a very low level API (and the python interface is not mature yet). Both of them seem to be more suited for problems, where nodes and edges are frequently added or removed to a graph. For a simple search on a graph, these GraphDBs seem to be too complex.

One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.

Another idea would be, to have a server process, storing the whole graph (500MB to 1GB) in memory. The clients could then query the server process and could transverse the graph very quickly, since the graph is stored in memory. Is there an easy possibility to write such a server (preferably in Python) using some existing framework?

Which technology would you use to store and query such a huge readonly graph?

+1  A: 

LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.

Brian Agnew
A: 

I have a directed graph for which I (mis)used Lucene.

Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.

It performs well enough, and query times for fetching in and outbound links from a node would be acceptable to a user using it as a web based tool. But for computationally intensive, batch calculations where I am doing many 100000s queries I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene so I'm working on a second Berkeley DB based implementation so that I can do a side by side comparison of the two. If I get a chance to post the results here I will do.

However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.

Joel
Creative solution, but wouldn't a relational DB of edges do just as well? Or am I missing some free functionality you get from using lucene?
Ranieri
Yes, it probably would. I used Lucene just because I was using already it using at the time, and I wanted a self-contained, portable, solution that could run entirely in process with my application (like bdb).
Joel
A: 

Correct me if I'm wrong, but since each node is list of the linked nodes, seems to me a DB with a schema is more of a burden than an advantage. It also sound like Google App Engine would be right up your alley:

  • It's optimized for reading - and there's memcached if you want it even faster
  • it's distributed - so the size doesn't affect efficiency

Of course if you somehow rely on Relational DB to find the path, it won't work for you...

And I just noticed that the q is 4 months old

Nick
+1  A: 

There is also OrientDB a open source document-graph dbms with commercial friendly license (Apache 2). Simple API, SQL like language, ACID Transactions and the support for Gremlin graph language.

The SQL has extensions for trees and graphs. Example:

select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')

To return all the Accounts with at least one friend that live in New Zealand. And for friend means recursively up to the 7th level of deep.

Lvca