views: 417
answers: 7
I'm developing an application in which I need a structure to represent a huge graph (between 1,000,000 and 6,000,000 nodes and 100 to 600 edges) in memory. The edge representation will contain some attributes of the relation.

I have tried a memory-mapped representation, arrays, dictionaries, and strings to represent that structure in memory, but they always run into the memory limit and crash.

I would like to get some advice on how I can represent this, or something similar.

By the way, I'm using Python.

+8  A: 
  1. If that is 100-600 edges/node, then you are talking about 3.6 billion edges.
  2. Why does this have to be all in memory?
  3. Can you show us the structures you are currently using?
  4. How much memory are we allowed (what is the memory limit you are hitting?)

If the only reason you need this in memory is that you need to be able to read and write it fast, then use a database. Databases read and write extremely fast; often they can serve reads without going to disk at all.
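
For instance, a minimal sketch of this route using Python's built-in sqlite3 module (the schema is just an assumption about what the edge attributes might look like):

    import sqlite3

    # ":memory:" for testing; use a file path for a persistent store.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER, weight REAL)")
    conn.execute("CREATE INDEX idx_src ON edges (src)")  # fast adjacency lookups

    # Add an edge, then update its weight.
    conn.execute("INSERT INTO edges VALUES (?, ?, ?)", (1, 2, 0.5))
    conn.execute("UPDATE edges SET weight = ? WHERE src = ? AND dst = ?",
                 (0.9, 1, 2))

    # Fetch the adjacency list of node 1.
    neighbors = conn.execute(
        "SELECT dst, weight FROM edges WHERE src = ?", (1,)).fetchall()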

tster
+1: this isn't an answer, but all these questions have to be answered before we can proceed
Claudiu
ok... each node has between 100 and 600 edges. I have to keep this in memory (RAM) because it has to be constantly accessed. It's also going to be updated constantly: changing the weights of the relations, adding and removing nodes and edges. All of this must have good performance (response time).
Harph
I have tried many different structures like dictionaries, lists, and object combinations for representing the adjacency matrix. I have also tried to map this into a file, using an ext4 partition, but random writes take a lot of time and use a lot of memory because of the block writes. Right now I'm testing this on my own machine with 4 GB. When I implement it for real, I will use a server with more memory (I hope less than 1 TB).
Harph
Use a database if it's read/write performance you are worried about.
tster
Yeah... but what kind of database, or which one, are you talking about?
Harph
+2  A: 

You appear to have very few edges considering the number of nodes - suggesting that most of the nodes aren't strictly necessary. So, instead of actually storing all of the nodes, why not use a sparse structure and only insert them when they're in use? This should be pretty easy to do with a dictionary; just don't insert a node until you use it for an edge.

The edges can be stored using an adjacency list on the nodes.

Of course, this only applies if you really mean 100-600 edges in total. If you mean that many per node, that's a completely different story.
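
A minimal sketch of that sparse approach, assuming integer node ids and a per-edge attribute dictionary (names are just placeholders):

    from collections import defaultdict

    # A node exists only once an edge touches it.
    graph = defaultdict(dict)

    def add_edge(src, dst, weight):
        # Adjacency list on the node: dst -> edge attributes.
        graph[src][dst] = {"weight": weight}

    add_edge(42, 7, 0.5)  # only nodes 42 and 7 consume memory so far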

Michael Madsen
+2  A: 

I doubt you'll be able to use an in-memory structure unless you have a LOT of memory at your disposal:

Assume you are talking about 600 directed edges from each node, with a node being 4 bytes (an integer key) and a directed edge being JUST the destination node's key (4 bytes each).

Then the raw data for each node is 4 + 600 * 4 = 2,404 bytes; multiplied by 6,000,000 nodes, that's over 14.4 GB.

That's without any other overheads or any additional data in the nodes (or edges).
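
The same back-of-envelope figure, worked in Python under those assumptions:

    NODES = 6000000           # upper bound from the question
    EDGES_PER_NODE = 600      # upper bound
    KEY_BYTES = 4             # one 4-byte integer key

    per_node = KEY_BYTES + EDGES_PER_NODE * KEY_BYTES  # 2,404 bytes
    total = per_node * NODES                           # 14,424,000,000 bytes
    print(total / 1e9)                                 # ~14.4 GB, before overhead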

Cade Roux
A: 

Sounds like you need a database and an iterator over the results. Then you wouldn't have to keep it all in memory at the same time, but you could still have access to all of it.

Colorado
He certainly needs some kind of datastore that isn't memory, but what's "an iterator over the results"? Graph algorithms generally aren't going to be satisfied with a simple iterator.
Nick Bastin
I might mean "generator", as in `(grape[stem] for grape in bunch)` or a function that uses `yield`.
Colorado
A: 

If you do decide to use some kind of database after all, I suggest looking at Neo4j and its Python bindings. It's a graph database capable of handling large graphs. Here's a presentation from this year's PyCon.

miles82
A: 

Assuming you mean 600 edges per node, you could try something like this:

import os.path
import cPickle  # Python 2; on Python 3, use the pickle module instead

class LazyGraph:
    """Keeps each node in its own pickle file inside `folder`."""

    def __init__(self, folder):
        self.folder = folder

    def get_node(self, id):
        # Load one node from disk on demand.
        with open(os.path.join(self.folder, str(id)), 'rb') as f:
            return cPickle.load(f)

    def set_node(self, id, node):
        # Write the node back to its file.
        with open(os.path.join(self.folder, str(id)), 'wb') as f:
            cPickle.dump(node, f, -1)  # use highest pickle protocol

Use arrays (or numpy arrays) to hold the actual node ids, as they are faster; a sketch follows below.

Note, this will be very very slow.

You could use threading to pre-fetch nodes (assuming you knew which order you were processing them in), but it won't be fun.
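
To illustrate the array suggestion, here is a sketch of what a pickled node value might look like, with the exact layout being just an assumption:

    import numpy as np

    # One node's adjacency data: neighbor ids plus a parallel weight array.
    # ~600 entries at 4 + 4 bytes each is far smaller than a list of Python objects.
    neighbor_ids = np.array([7, 19, 42], dtype=np.int32)
    weights = np.array([0.5, 1.0, 0.25], dtype=np.float32)
    node = (neighbor_ids, weights)  # this tuple is what set_node would pickle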

wisty
+1  A: 

Depending on your hardware resources, an all-in-memory solution for a graph this size is probably out of the question. Two possible options from a graph-specific DB point of view are:

  • Neo4j - claims to easily handle billions of nodes, and it's been in development a long time.
  • FlockDB - newly released by Twitter, this is a distributed graph database.

Since you're using Python, have you looked at NetworkX? Out of interest, if you have tried it, how far did you get loading a graph of this size?

Binary Nerd
Thanks for your answer... I have tried NetworkX but it doesn't fit my requirements (it uses a lot of memory). Neo4j is too expensive. I will try FlockDB.
Harph
I have read about FlockDB... it looks pretty cool, but I have had a lot of problems installing it. I think FlockDB is in an alpha version. It doesn't have good documentation or support.
Harph