I am looking for good recommendations for scalable and/or parallel large graph analysis libraries in various languages. The problems I am working on involve significant computational analysis of graphs/networks with 1-100 million nodes and 10 million to 1+ billion edges. The largest SMP computer I am using has 256 GB memory, but I also have access to an HPC cluster with 1000 cores, 2 TB aggregate memory, and MPI for communication.

I am primarily looking for scalable, high-performance graph libraries that could be used in either single or multi-threaded scenarios, but parallel analysis libraries based on MPI or a similar protocol for communication and/or distributed memory are also of interest for high-end problems. Target programming languages include C++, C, Java, and Python.

My research to date has come up with the following possible solutions for these languages:

Other topics here on SO that I've looked at have discussed graph libraries in C++, Java, Python, and other languages. However, none of these topics focused significantly on scalability.

Does anyone have recommendations they can offer based on experience with any of the above or other library packages when applied to large graph analysis problems? Performance, scalability, and code stability/maturity are my primary concerns. Most of the specialized algorithms will be developed by my team with the exception of any graph-oriented parallel communication or distributed memory frameworks (where the graph state is distributed across a cluster).
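For a rough sense of what these sizes mean in memory, here is a back-of-envelope sketch of a CSR-style (compressed sparse row) representation at the high end of the stated range. The index and weight widths are my own illustrative assumptions, not any particular library's layout:

```python
# Back-of-envelope memory estimate for a CSR graph at the largest scale
# mentioned above: 100 million nodes, 1 billion edges.
# Byte widths are assumptions for illustration, not measurements.

def csr_bytes(nodes, edges, index_bytes=8, weight_bytes=8):
    """Offsets array: one index per node (+1); targets: one index per
    edge; optional weights: one value per edge."""
    offsets = (nodes + 1) * index_bytes
    targets = edges * index_bytes
    weights = edges * weight_bytes
    return offsets + targets + weights

total = csr_bytes(100_000_000, 1_000_000_000)
print(f"{total / 2**30:.1f} GiB")  # ≈ 15.6 GiB
```

Even with 8-byte indices and weights, the static structure fits comfortably on the 256 GB SMP machine; it is the dynamic/mutable representations and algorithm working sets that tend to blow the budget.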

+3  A: 

You don't seem to be getting a lot of answers wrt C, C++, Java or Python, so I'll jump in.

We have a parallel programming language, PARLANSE. PARLANSE is intended to handle irregular but parallel computations... like graph processing. It is designed for SMP, not distributed, processing. We mostly use it for program analyses, where the graph structures are syntax trees, symbol tables, control and data flow graphs, etc.

As one way to measure performance, we run a graphsearch program that just visits all the nodes in a connected graph in parallel, marking each as visited. Here are runs with 1-8 processors on an Intel i7:

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run -p1 graphsearch
Parallel GraphSearch of size 1000000 with neighbor density: 5.308947...Runtime: 23.919828 seconds

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run -p2 graphsearch
Parallel GraphSearch of size 1000000 with neighbor density: 5.308947...Runtime: 12.680212 seconds

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run -p3 graphsearch
Parallel GraphSearch of size 1000000 with neighbor density: 5.308947...Runtime: 9.032862 seconds

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run -p4 graphsearch
Parallel GraphSearch of size 1000000 with neighbor density: 5.308947...Runtime: 7.935573 seconds

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run -p5 graphsearch
Parallel GraphSearch of size 1000000 with neighbor density: 5.308947...Runtime: 6.079331 seconds

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run -p6 graphsearch
Parallel GraphSearch of size 1000000 with neighbor density: 5.308947...Runtime: 5.360442 seconds

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run -p7 graphsearch
Parallel GraphSearch of size 1000000 with neighbor density: 5.308947...Runtime: 4.952811 seconds

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run -p8 graphsearch
Parallel GraphSearch of size 1000000 with neighbor density: 5.308947...Runtime: 4.626209 seconds

The speedup isn't linear, but it isn't too bad either. Note that the i7 doesn't have 8 equal-power processors; rather, it has 4 hyperthreaded cores.
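For reference, here is the speedup implied by the timings above (simple arithmetic on the posted numbers, baselined against the 1-processor run):

```python
# Speedup and parallel efficiency from the runtimes posted above.
base = 23.919828  # 1-processor runtime in seconds
runtimes = {2: 12.680212, 3: 9.032862, 4: 7.935573,
            5: 6.079331, 6: 5.360442, 7: 4.952811, 8: 4.626209}
for p, t in sorted(runtimes.items()):
    print(f"p={p}: speedup {base / t:.2f}, efficiency {base / t / p:.2f}")
```

The 8-way run comes out to roughly 5.2x, which is respectable given that only 4 of the 8 hardware threads are full cores.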

It is worth noting that the times reported include, at each node visit, a loop of 10,000 iterations to simulate work at that node. If the actual work per node is only around 100 machine instructions, the total execution time should be considerably lower. (One of these days, when I remember, I'll run that experiment and post the result here.)

The PARLANSE code follows at the end.

The parallelism comes from the pure-parallel operator (||, which runs all subcomputations in parallel, and the partial-order operator (;|, which runs children according to a time sequence defined by the embedded partial order, formed by naming the subcomputations and providing ordering constraints with "before" operators (>>. This program is only intended as a performance test, so it doesn't do anything useful, and its performance is limited somewhat by the partial-order operators.

PARLANSE presently works in 32-bit images. The million-node graph this program processes uses about 2 GB of the 4 GB available, so this may be too small for what you have in mind. We are in the early planning stages of a 64-bit implementation.
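For readers unfamiliar with PARLANSE, the visited-marking scheme in the code below can be sketched in ordinary Python. This is an illustrative analogue, not a translation: the Node class, the thread pool, and the work-list draining are my own scaffolding, but the double-checked visited flag (the "timing splinter" case) mirrors the PARLANSE logic:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Node:
    def __init__(self):
        self.visited = False
        self.lock = threading.Lock()
        self.neighbors = []

def visit(graph, pool, i, pending):
    node = graph[i]
    if node.visited:                     # fast path: no lock taken
        return
    with node.lock:
        if node.visited:                 # "timing splinter": lost the race
            return
        node.visited = True              # mark as visited under the lock
    for j in node.neighbors:             # fan out to neighbors in parallel
        pending.append(pool.submit(visit, graph, pool, j, pending))

graph = [Node() for _ in range(1000)]
for i in range(len(graph) - 1):          # spanning chain keeps it connected
    graph[i].neighbors.append(i + 1)

with ThreadPoolExecutor(max_workers=4) as pool:
    pending = []
    pending.append(pool.submit(visit, graph, pool, 0, pending))
    while pending:                       # drain until no task spawns more work
        pending.pop().result()

print(sum(n.visited for n in graph), "nodes visited")
```

A task never blocks on its children, so workers cannot deadlock; the main thread simply keeps waiting on futures until no task has added more.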

`GraphSearch.par -- Tests performance of parallel graph search
 Copyright (C) 2008-2010 Semantic Designs; All Rights Reserved.

 Generates a random graph from a seed so it is repeatable.
 Explores graph from fixed starting place, marking each
 node as visited.  Uses parallelism, partial orders,
 and teams to manage neighbor visits depending on node
 outdegree.

Note: Graphsize when large may cause paging.
  Compiling with -d aggravates this due to large activation record sizes'

(define Parallel ~t)
(define Seed 0)
(define GraphSize 1000000)
(define GraphDensity 5.5) ; must be float
(define NodeSimulatedWork 10000)

(includeunique `Timer.par')
(includeunique `Random.par')

[random_state Random:GeneratorState]

(define randomreset (action (procedure natural)
  (;; (= random_state (Random:create 2 1 1 ?))
  );;
)action
)define

(define random (lambda (function float void)
(Random:random random_state)
   )lambda
)define

(define GraphNode
   (structure
 [visited boolean] 
 [owned semaphore]
 [neighbors (array natural 1 dynamic)]
)structure
)define

[graph (array GraphNode 1 dynamic)]

(define VisitGraph
   (action (procedure [node natural])
   (;;
  (ifthenelse graph:node:visited
     (return) ; nothing to do, fast path
     (;; `Visit the node'
     (consume (lock graph:node:owned))
     (ifthenelse graph:node:visited
         (;; `Ooops, timing splinter, somebody else got here first'
         (consume (unlock graph:node:owned))
         (return)
         );;
         (;;
(compileifthen ~f
         (;; (Console:Put (. `Processing '))
             (Console:PutNatural node)
             (Console:PutNewline)
         );;
)compileifthen
         (= graph:node:visited ~t) ; mark as visited
         (consume (unlock graph:node:owned))
         );;
     )ifthenelse
     (do [i natural] 1 NodeSimulatedWork 1 (;; `Simulate work on node' );; )do
(compileifthen (~ Parallel) ; parallelism is unsafe?
     (;;  (do [i integer] +1 (upperbound graph:node:neighbors 1) +1
         (VisitGraph graph:node:neighbors:i) ; sequential ick
          )do
          (return)
     );;
)compileifthen
     (case (coerce natural (upperbound graph:node:neighbors 1))
        1  (VisitGraph graph:node:neighbors:1)
        2  (||            (VisitGraph graph:node:neighbors:1)
                  (VisitGraph graph:node:neighbors:2)
           )||
        3  (;| a          (VisitGraph graph:node:neighbors:1)
           b          (VisitGraph graph:node:neighbors:2)
           c (>> a b) (VisitGraph graph:node:neighbors:3)
           );|
        4  (;| a          (VisitGraph graph:node:neighbors:1)
           b          (VisitGraph graph:node:neighbors:2)
           c (>> a b) (VisitGraph graph:node:neighbors:3)
           d (>> a)   (VisitGraph graph:node:neighbors:4)
           );|
        5  (;| a          (VisitGraph graph:node:neighbors:1)
           b          (VisitGraph graph:node:neighbors:2)
           c (>> a b) (VisitGraph graph:node:neighbors:3)
           d (>> a)   (VisitGraph graph:node:neighbors:4)
           e (>> b d) (VisitGraph graph:node:neighbors:5)
           );|
        else  
(compileifthenelse ~f ; draft doesnt work?
         (do [i integer] +1 (upperbound graph:node:neighbors 1) +1
             (VisitGraph graph:node:neighbors:i) ; sequential ick
         )do
         (local [workers team]
          (;; (do [i integer] +1 (upperbound graph:node:neighbors 1) +1
              (consume (draft workers VisitGraph graph:node:neighbors:i))
              )do
              (consume (wait (@ (event workers)))) ; wait for terminated
          );;
         )local
)compileifthenelse
       )case
      );;
  )ifthenelse
  );;
   )action
)define

(define main
   (action (procedure void)
  (local (|| [timer Timer:Timer]
     )||
     (;; (compileifthenelse Parallel
        (Console:Put (. `Parallel GraphSearch of size '))
        (Console:Put (. `Sequential GraphSearch of size '))
     )compileifthenelse
     (Console:PutNatural GraphSize)
     (Console:PutSpace)
     (;; `Construct a random graph'
         (randomreset Seed)
         (resize graph 1 GraphSize)
         (do [victim natural] 1 GraphSize 1
          (;; (= graph:victim:visited ~f) ; mark "unvisited"
              (consume (addresource graph:victim:owned 1))
              (ifthen (~= victim GraphSize)
             (;; `Ensure there is a spanning tree for the graph'
                 (resize graph:victim:neighbors 1 1)
                 (= graph:victim:neighbors:1 (++ victim)) 
             );;
              )ifthen
          );;
         )do
(compileifthen ~f
         (do [victim natural] 1 GraphSize 1
          (;; (Console:PutNatural victim)
              (Console:PutInteger (upperbound graph:victim:neighbors 1))
              (Console:PutNewline)
          );;
         )do
)compileifthen
         (do [i natural] 1 (coerce natural (* (+ (* (random) .2) .9) (coerce float GraphSize) (- GraphDensity 1.0))) 1
         (local (;; [victim natural]
                [new_neighbor_count natural]
                [new_neighbor natural]
            );;
           (;;
; (Console:PutNatural i)(Console:PutCharacter ":")
               (= victim (++ (coerce natural (floor (* (random) (coerce float GraphSize))))))
; (Console:PutNatural victim)(Console:PutSpace)
               (= new_neighbor_count (++ (coerce natural (upperbound graph:victim:neighbors 1))))
; (Console:PutNatural new_neighbor_count)(Console:PutSpace)
               (= new_neighbor (coerce natural (++ (floor (* (random) (coerce float GraphSize))))))
; (Console:PutNatural new_neighbor)(Console:PutNewline)
               (resize graph:victim:neighbors 1 new_neighbor_count)
               (= graph:victim:neighbors:new_neighbor_count new_neighbor)
           );;
         )local
         )do
     );;
     (local (|| [neighbor_count natural]
            (= [total_neighbors natural] 0)
         )||
         (;; (do [victim natural] 1 GraphSize 1
             (;; (= neighbor_count (coerce natural  (upperbound graph:victim:neighbors 1)))
             (+= total_neighbors neighbor_count)
(compileifthen ~f
             (;; 
                (Console:PutNatural victim)
                (Console:PutCharacter ":")
                (Console:PutNatural neighbor_count)
                (Console:PutNewline)
             );;
)compileifthen
             );;
         )do
         (Console:Put (. `with neighbor density: '))
         (Console:PutFloatDecimalFormat (/ (coerce float total_neighbors) (coerce float GraphSize)))
         );;
     )local
     (Console:Put (. `...'))
     (Timer:Reset (. timer))
     (VisitGraph 1)
     (Timer:Stop (. timer))
     (Timer:PrintElapsedTime (. timer) (. `Runtime: '))
     (Console:PutNewline)
     );;
  )local
   )action
)define
Ira Baxter
Thanks for the post. My problem space and development environment currently prevents me from considering unconventional languages, but parallel programming languages are certainly relevant to the larger domain of graph algorithms and scaling. Would PARLANSE be appropriate for massively multithreaded hardware architectures like the Cray XMT? (Latter is not something I'm using; just curious.)
Joel Hoff
PARLANSE is designed for shared address space, irregular computation. It tries to manufacture tons of medium size grains, so that each processor/thread has lots of work to steal. I think the Cray T3D offers the shared address space.
Ira Baxter
+1  A: 

Not exactly what you asked for, but it might be interesting anyway.

See (or contact) Stanford PPL; the Liszt and/or OptiML Scala DSLs might be available to you. They are still in research mode, I guess, but... Also see the Scala Days keynote address by Kunle Olukotun: A Domain Specific Language Approach to Heterogeneous Parallel Programming Using Scala.

Also have a look at the Akka actor concurrency libraries (no specific graph libraries there, though). Akka has systems in production but changes fast.

oluies
Thanks, those are some interesting options I had not heard of. My solution options are limited to graph libraries in the near term, but these type of general parallel processing frameworks oriented around exploiting fine-to-medium grain parallelism opportunities within problems may be viable solutions in the longer term as the problem characteristics change in my area.
Joel Hoff
+1  A: 

This is not really an answer, but I'm not sure whether hyperlinks work in comments. This article might interest you if you haven't seen it already.

High Performance Mark
Thanks for posting the article from the High Scalability blog. I actually came across that late last week, though it only cropped up after an extensive web search for possible solutions to this problem. It's impressive to see the broad range of radically different solutions that are being pursued at the extremely high-end of graph/network data problems, e.g., 300 billion edges in one case. That article I think is a good short summary of the current state-of-the-art frontier for large graphs.
Joel Hoff
Yes, hyperlinks work in comments. See [this question](http://meta.stackoverflow.com/questions/37758/inline-links-in-comments) for the syntax.
ire_and_curses
+2  A: 

I'll throw a tentative answer to my own question into the ring: the LEMON library is looking increasingly promising so far. It is implemented in C++ and is available for Windows, Linux, Solaris, and Mac OS X.

Although it does not include any support for parallelism or multithreading, it has a clean, easy-to-use API with several options for graph implementations with different performance tradeoffs, including a static graph variant. The code is stable and builds cleanly on the three platforms I've tested: Solaris, Linux, and MacOS. The static graph variant appears to be highly optimized for both space and computational efficiency.

So far I've tested LEMON on graphs of up to 1 million nodes and 100 million edges with no performance issues for fully dynamic directed graphs (LEMON's ListDigraph class) and expandable directed graphs (SmartDigraph class). Additionally, I've tested a static directed graph (StaticDigraph class) with 100 million nodes and 1 billion edges with good results. The latter, when accompanied by the storage equivalent of node/edge weight data, required about 55-60 GB of RAM, and edge traversals are extremely fast on this form of graph due to a tight implementation within the LEMON library.
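For a sense of the per-edge cost those figures imply, here is the back-of-envelope arithmetic (round numbers from the test above, amortizing node and weight storage over the edges; this says nothing about StaticDigraph's actual internal layout):

```python
# Per-edge memory cost implied by the quoted figures:
# ~55-60 GB serving 100M nodes, 1B edges, plus node/edge weight data.
nodes, edges = 100_000_000, 1_000_000_000
ram_low, ram_high = 55e9, 60e9
print(ram_low / edges, "to", ram_high / edges, "bytes per edge")
```

Around 55-60 bytes per edge, all storage amortized, is quite lean once weights are included; a naive pointer-based adjacency list with per-edge heap allocations would typically cost several times that.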

LEMON also includes bindings for Python and Perl, but I've not tested these yet. I don't know whether they are in sync with the most current release of the LEMON C++ source.

It turns out that MTGL is still under initial development and has yet to see an official first release; my guess is that the source archives have been made publicly available to support some type of quasi-beta-test phase. This makes it too unstable to be a viable solution for my group's needs just now, but the API looks promising, as it is much simpler than that of the Boost Graph Library. Also, it looks like the project's goals include supporting the needs of more conventional SMP servers and workstations, not just massively multithreaded hardware architectures. MTGL is worth keeping an eye on as it progresses closer to an official release.

I plan on providing an additional, more-detailed update next week after more testing with LEMON.

Joel Hoff
+1: This seems to be the answer which best fits OP's requirements -- which is perhaps not a surprise. Certainly not as surprising as the suggestion to use PARLANSE getting 2 upvotes.
High Performance Mark
Raghava
@Raghava - Yes, I have been actively using LEMON for the past several months with very good results in performance and code reliability for a complex algorithm in network analysis. In doing performance optimization on the latter, the bottlenecks I've encountered have all been with my algorithm, not with the LEMON library. I'm very pleased with it overall in most respects, including ease-of-use and decent documentation.
Joel Hoff
@Joel: Thank you for the reply. Can you provide some numbers on performance? I saw the numbers on scalability part. Did you run some general algorithms like shortest path, transitive closure? And as some other poster mentioned, if it is not parallel, was there any difference between using LEMON over LEDA or BGL?
Raghava
@Raghava - My graph traversal tests were not formal benchmarks but rather an informal combination of shortest-path tests (breadth, depth, and Dijkstra). I've not determined what would be a useful way to present this data. I did not compare performance with any other graph library but rather simply determined whether or not the time to traverse a very large graph suggested any kind of bottleneck. Since the traversals for some 100-million edge graphs all completed within periods of 1-10 minutes, I did not see a bottleneck. This verdict has been confirmed by later project work.
Joel Hoff
@Raghava - To clarify, I don't know how many times the various algorithms I used to test graph traversal performance visited each edge, so I don't know what the edges/second rate would be. My interest was simply to expose a large graph to various algorithmic search processes and see whether any obvious performance problems arose.
Joel Hoff
@Joel: Thank you for sharing the numbers. That certainly looks good :).
Raghava
A: 

Just to throw it out... C# and LINQ: DryadLINQ.

Some sample programs in DryadLINQ

BotNet detection PPT that can process a graph of 8.6 billion edges in 7 minutes

Also, the Dryad part is C++, and it is believed they will be supporting a Dryad-only (C++) API for those who want it.

GalacticJello
A: 

Have you considered a distributed framework, to distribute your work across many machines? You might investigate CloudIQ Platform from Appistry. It has a module called CloudIQ Engine which takes applications written in C/C++, Java, or .NET, and deploys them across any number of machines.

It has built-in process-flow support, and all jobs submitted to the Engine are monitored for execution, so if the machine a job is running on dies, the job can be automatically restarted on another machine in the platform.

If you are running a large number of independent jobs, this platform can abstract many of the complexities of a distributed framework. If you want your job to run faster, add more machines. The platform will automatically distribute your applications to the new machines and start sending work to them. All on the fly.

Brett McCann
Thanks for the suggestion, but this is too general a framework to meet my needs so far as I can tell (tried to get past the marketing fluff on the website but couldn't find any technical details of relevance, which is rarely a good sign). Most graph algorithms by their very nature are difficult to distribute, and either low-level or graph-oriented distributed frameworks are probably the most useful. These fall into three main categories: (1) message-passing frameworks, (2) distributed memory, or (3) distributed graph frameworks in which the global state is layered across either #1 or #2.
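To make category (3) concrete, here is a toy Python sketch of one BSP-style superstep over a vertex-partitioned graph. The `owner` partition function and the message buckets are purely illustrative; real systems use MPI ranks and tuned partitioners:

```python
def owner(v, nranks):
    """Toy modulo partition assigning each vertex to an owning rank."""
    return v % nranks

def superstep(local_adj, frontier, nranks):
    """One BSP-style step of a distributed traversal: expand the locally
    owned frontier, then bucket discovered vertices by owning rank --
    these buckets are the 'messages' exchanged between ranks."""
    outboxes = [set() for _ in range(nranks)]
    for u in frontier:
        for v in local_adj.get(u, ()):
            outboxes[owner(v, nranks)].add(v)
    return outboxes

# Vertex 0 (owned locally) has neighbors 1, 2, 3 spread over 2 ranks.
print(superstep({0: [1, 2, 3]}, [0], 2))
```

The whole difficulty of distributed graph processing lives in how irregular and heavy that message exchange is relative to local work, which is why generic job-distribution frameworks rarely help.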
Joel Hoff
+1  A: 

Java-based HyperGraphDB is distributed, but I've never tried it: http://www.kobrix.com/hgdb.jsp

I've had good luck with Neo4j ( http://www.neo4j.org/ ) (also Java-based). Tight, transactional API, disk-based I/O, can be run as a REST server, Python bindings, etc. It is not distributed, but their website advertises the ability to handle billions of nodes.

gilesc
The graph database tools are interesting, though not quite what I'm searching for just now. In [this answer](http://stackoverflow.com/questions/3010805/scalable-parallel-large-graph-analysis-library/3036506#3036506 "title") a blog article was mentioned discussing several radical approaches to solving very high-end graph problems, such as over 100 billion edges. One of those experiments involved a relational database, Netezza. What size graphs have you tested with Neo4j, and what was the processing model you employed?
Joel Hoff
+1  A: 

You can try the Combinatorial BLAS

It takes a sparse linear algebra view of graphs and emphasizes scalability through a 2D distribution of the graph's sparse adjacency matrix. I achieved almost 400 MegaTEPS (millions of edges traversed per second) on a betweenness centrality benchmark. It is able to handle graphs with hundreds of millions of vertices/edges. Your main concerns will be (a) casting your problem into the language of linear algebra, and (b) code maturity. I can try to help you with both of those if you contact me. A comparison of various HPC graph libraries exists in Section 1.2 of my dissertation.
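The linear-algebra view can be illustrated with a small sketch: one BFS level is effectively a sparse matrix-vector product over a boolean (OR, AND) semiring, with already-visited vertices masked out. This pure-Python CSR version is illustrative only; Combinatorial BLAS itself works on distributed, 2D-partitioned matrices:

```python
def bfs_levels(indptr, indices, n, source):
    """Level-synchronous BFS over a CSR adjacency structure. Each pass
    over the frontier computes the masked boolean-semiring product
    y = A^T x: the union of the frontier's neighbor lists, minus
    vertices already visited."""
    visited = [False] * n
    visited[source] = True
    frontier, levels = [source], {source: 0}
    level = 0
    while frontier:
        level += 1
        nxt = []
        for u in frontier:                            # nonzeros of x
            for v in indices[indptr[u]:indptr[u + 1]]:  # row u of A
                if not visited[v]:                    # mask
                    visited[v] = True
                    nxt.append(v)
                    levels[v] = level
        frontier = nxt
    return levels

# Tiny digraph 0->1, 0->2, 1->2 in CSR form.
indptr, indices = [0, 2, 3, 3], [1, 2, 2]
print(bfs_levels(indptr, indices, 3, 0))  # {0: 0, 1: 1, 2: 1}
```

Once a traversal is phrased this way, the parallelism and distribution questions reduce to well-studied sparse matrix kernels, which is the central idea behind this style of library.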

Good luck in your search.

Aydin Buluc
Thanks for the suggestion of your library, as well as pointing out the relevant comparison section from your thesis. I actually have selected LEMON for my current projects now on the basis of extensive scaling and performance testing, but I've not yet published the full test results here on SO. There will be ongoing and changing needs within my group over the next several years, however, so alternative answers are most welcome.
Joel Hoff
LEMON looks like a good attempt and your experiments show good performance. It doesn't really satisfy the "Scalable/Parallel" part of the title of your post though. In that sense, I don't know what makes it an innovation over BGL or LEDA.
Aydin Buluc
@Aydin Although LEMON does not provide any support for parallelism, this was not a strict requirement for my current needs. It is, however, quite scalable. I posted some numbers for memory performance; my experiments for edge traversals were not formal benchmarks, so I've not quite determined yet what would be the right way to present them here. I've never used LEDA, and for my most immediate project I needed an open-source solution. However, LEMON is *much* easier to use than BGL; in trying to test the latter, I found it to have such an appallingly arcane API as to be almost unusable.
Joel Hoff
+1  A: 

The mcl network analysis software, of which I am the author, is a command-line driven set of tools for graph analysis written in C. It is probably not as mature nor as comprehensive as some of the other packages you found, but it is very focused on scalability, using parallelization where possible, and I am interested in further improving and extending it in this direction. Currently its emphasis is on computing widespread, basic graph traits such as shortest paths, betweenness centrality, and the clustering coefficient, using both CPU and machine parallelisation (threads and job dispatching, respectively). Internally it uses a sparse matrix implementation library that I currently use in a very straightforward way (i.e. the data structures are not hidden from the algorithms).

MCL was originally just a clustering algorithm. To give an indication of scale, the clustering algorithm is nowadays routinely run on datasets with 3M nodes and 300M edges, taking roughly 4 hours on a 4-CPU machine, where the computationally biggest step is computing products of a graph incidence matrix with itself.
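The matrix-product heart of the MCL clustering algorithm can be sketched in a few lines: alternate expansion (squaring a column-stochastic matrix) with inflation (elementwise powering followed by column renormalization). This toy dense version is illustrative only; the real mcl operates on pruned sparse matrices:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def normalize_columns(M):
    """Rescale each column to sum to 1 (column-stochastic)."""
    n = len(M)
    for j in range(n):
        s = sum(M[i][j] for i in range(n))
        for i in range(n):
            M[i][j] /= s
    return M

def inflate(M, r):
    """Raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in M])

def mcl(adj, r=2.0, iters=30):
    n = len(adj)
    # Add self-loops, then make the matrix column-stochastic.
    M = [[float(adj[i][j] + (i == j)) for j in range(n)] for i in range(n)]
    M = normalize_columns(M)
    for _ in range(iters):
        M = inflate(matmul(M, M), r)   # expansion, then inflation
    return M

# Two triangles joined by a single bridge edge (2-3).
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
M = mcl(adj)
# Each column's mass flows to an attractor; columns sharing an
# attractor belong to the same cluster.
clusters = {j: max(range(6), key=lambda i: M[i][j]) for j in range(6)}
print(clusters)
```

On this example the iteration severs the bridge, recovering the two triangles as clusters; at scale, the dominant cost is exactly the matrix self-product named above.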

Stijn van Dongen
A: 

RGL is a Ruby library much influenced by the Boost Graph Library: http://rgl.rubyforge.org/rgl/index.html

arikan
A: 

No one has mentioned graph-tool yet (http://projects.forked.de/graph-tool/).

It would seem to merit some consideration, as it implements some algorithms using OpenMP directives on top of the Boost Graph Library (see the download page for some info on the parallelism).

It looks like you've decided on LEMON already, but I'm including it here in case others run into this question in the future.

dgleich