I'm currently implementing PageRank on Disco. As it's an iterative algorithm, the results of one iteration are used as input to the next.

I have a large file which represents all the links, with each row representing a page and the values in the row representing the pages to which it links.

For Disco, I break this file into N chunks, then run MapReduce for one round. As a result, I get a set of (page, rank) tuples.

I'd like to feed this rank to the next iteration. However, now my mapper needs two inputs: the graph file, and the pageranks.

  1. I would like to "zip" together the graph file and the page ranks, so that each line represents a page, its rank, and its outlinks.
  2. Since the graph file is split into N chunks, I need to split the pagerank vector into N parallel chunks and zip each region of the pagerank vector to the corresponding graph chunk.
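To make the plan concrete, here is a minimal sketch of step 1, assuming a hypothetical file format where each graph line is "page link1 link2 ..." and the rank vector is a dict mapping page to rank (both names and formats are illustrative, not Disco-specific):

```python
def zip_chunk(graph_lines, ranks):
    """Join each graph line with its page's current rank,
    yielding lines of the form "page rank link1 link2 ..."."""
    for line in graph_lines:
        parts = line.split()
        page, outlinks = parts[0], parts[1:]
        # Pages nobody has ranked yet default to 0.0 (assumption).
        rank = ranks.get(page, 0.0)
        yield " ".join([page, str(rank)] + outlinks)

chunk = ["a b c", "b c"]
ranks = {"a": 0.5, "b": 0.3}
for line in zip_chunk(chunk, ranks):
    print(line)
```

Step 2 would then amount to running `zip_chunk` once per chunk, with the rank dict restricted to (or simply shared across) that chunk's pages.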

This all seems more complicated than necessary. Since PageRank is the quintessential MapReduce algorithm, I suspect I'm missing something about Disco that could really simplify the approach.

Any thoughts?

+1  A: 

Looks like you'll want to use an init_map for the first pass and then an iter_map for each subsequent iteration.

See: http://discoproject.org/doc/faq.html#id7

Can you output Python objects that include the outlinks, instead of just the (page, rank) tuples?
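That is, make each iteration's output self-describing: alongside the rank shares, the map re-emits each page's outlink list, so the reduce can rebuild full (page, rank, outlinks) records and no zipping is needed between rounds. A sketch in plain Python (not Disco's actual API; the record layout and damping constant are assumptions):

```python
DAMPING = 0.85  # standard PageRank damping factor

def pr_map(record):
    """record is (page, rank, outlinks). Emits the outlink list forward
    plus this page's rank share to each of its targets."""
    page, rank, outlinks = record
    yield page, ("links", outlinks)  # carry structure to the next round
    share = rank / len(outlinks) if outlinks else 0.0
    for target in outlinks:
        yield target, ("rank", share)

def pr_reduce(page, values, n_pages):
    """Rebuild a full (page, rank, outlinks) record for the next iteration."""
    outlinks, total = [], 0.0
    for kind, val in values:
        if kind == "links":
            outlinks = val
        else:
            total += val
    rank = (1 - DAMPING) / n_pages + DAMPING * total
    return page, rank, outlinks
```

Because the reduce output has the same shape as the map input, the next iteration can consume it directly.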

Another option would be to have the outlinks keyed by page somewhere (a dict, memcache, Kyoto Cabinet, etc.) and look them up from the mapping function. If you're chaining things with Disco, I don't think you'll want to zip things together in the middle of the workflow.
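With a side store, only the (page, rank) pairs need to travel between iterations. A minimal sketch, using an in-memory dict standing in for memcache or Kyoto Cabinet (the graph data is a made-up example):

```python
# Side store: outlinks keyed by page. In practice this would be a
# memcache/Kyoto Cabinet lookup rather than a module-level dict.
OUTLINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def pr_map(item):
    """item is just (page, rank); the graph structure is looked up,
    not zipped into the input."""
    page, rank = item
    links = OUTLINKS.get(page, [])
    share = rank / len(links) if links else 0.0
    for target in links:
        yield target, share
```

The trade-off is a lookup per mapped page instead of a pre-processing pass over N chunk pairs.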

Aaron Croyle