I need to be able to manipulate a large (10^7 nodes) graph in python. The data corresponding to each node/edge is minimal, say, a small number of strings. What is the most efficient, in terms of memory and speed, way of doing this?

A dict of dicts is more flexible and simpler to implement, but I intuitively expect a list of lists to be faster. The list option would also require that I keep the data separate from the structure, while dicts would allow for something of the sort:


What would you suggest?

+2  A: 

A dictionary may also contain overhead, depending on the actual implementation. A hashtable usually contain some prime number of available nodes to begin with, even though you might only use a couple of the nodes.

Judging by your example, "Property", would you be better of with a class approach for the final level and real properties? Or is the names of the properties changing a lot from node to node?

I'd say that what "efficient" means depends on a lot of things, like:

  • speed of updates (insert, update, delete)
  • speed of random access retrieval
  • speed of sequential retrieval
  • memory used

I think that you'll find that a data structure that is speedy will generally consume more memory than one that is slow. This isn't always the case, but most data structures seems to follow this.

A dictionary might be easy to use, and give you relatively uniformly fast access, it will most likely use more memory than, as you suggest, lists. Lists, however, generally tend to contain more overhead when you insert data into it, unless they preallocate X nodes, in which they will again use more memory.

My suggestion, in general, would be to just use the method that seems the most natural to you, and then do a "stress test" of the system, adding a substantial amount of data to it and see if it becomes a problem.

You might also consider adding a layer of abstraction to your system, so that you don't have to change the programming interface if you later on need to change the internal data structure.

Lasse V. Karlsen

Yes, I should have been a bit clearer on what I mean by efficiency. In this particular case I mean it in terms of random access retrieval.

Loading the data in to memory isn't a huge problem. That's done once and for all. The time consuming part is visiting the nodes so I can extract the information and measure the metrics I'm interested in.

I hadn't considered making each node a class (properties are the same for all nodes) but it seems like that would add an extra layer of overhead? I was hoping someone would have some direct experience with a similar case that they could share. After all, graphs are one of the most common abstractions in CS.


Making a class-based structure would probably have more overhead than the dict-based structure, since in python classes actually use dicts when they are implemented.

Matthew Schinckel
... except if you use `__slots__`, which is what you'd probably want to do here.
Daniel Pryden
+1  A: 

As I understand it, random access is in constant time for both Python's dicts and lists, the difference is that you can only do random access of integer indexes with lists. I'm assuming that you need to lookup a node by its label, so you want a dict of dicts.

However, on the performance front, loading it into memory may not be a problem, but if you use too much you'll end up swapping to disk, which will kill the performance of even Python's highly efficient dicts. Try to keep memory usage down as much as possible. Also, RAM is amazingly cheap right now; if you do this kind of thing a lot, there's no reason not to have at least 4GB.

If you'd like advice on keeping memory usage down, give some more information about the kind of information you're tracking for each node.

Peter Burns
Having 4GiB of RAM requires a 64-bit OS in order for them to be truly and efficiently used.
+17  A: 
Ryan Cox
+1  A: 

As already mentioned, NetworkX is very good, with another option being igraph. Both modules will have most (if not all) the analysis tools you're likely to need, and both libraries are routinely used with large networks.