I'm currently implementing PageRank on Disco. Since it's an iterative algorithm, the results of one iteration are used as input to the next.
I have a large file which represents all the links, where each row represents a page and the values in that row are the pages it links to.
For Disco, I break this file into N chunks, then run MapReduce for one round. As a result, I get a set of (page, rank) tuples.
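For concreteness, the logic of one round looks roughly like this (a plain-Python sketch of the map/reduce functions, not actual Disco API; the line format, `DAMPING`, and passing the current ranks in as a parameter are all assumptions):

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor

def map_fn(line, ranks):
    # line format (assumed): "page out1 out2 ..."; ranks maps page -> current rank
    page, *outlinks = line.split()
    share = ranks[page] / len(outlinks) if outlinks else 0.0
    for target in outlinks:
        yield target, share

def reduce_fn(pairs, num_pages):
    # Sum the incoming rank shares per page and apply the damping factor
    totals = defaultdict(float)
    for page, share in pairs:
        totals[page] += share
    for page, total in totals.items():
        yield page, (1 - DAMPING) / num_pages + DAMPING * total
```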
I'd like to feed these ranks into the next iteration. However, my mapper now needs two inputs: the graph file and the page ranks.
- I would like to "zip" together the graph file and the page ranks, such that each line represents a page, its rank, and its outlinks.
- Since the graph file is split into N chunks, I need to split the pagerank vector into N parallel chunks and zip each region of the pagerank vector to the corresponding graph chunk.
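The zip step I have in mind, as a plain-Python sketch (assuming each graph chunk is a list of lines and `ranks` is a dict built from the previous round's (page, rank) tuples; the function name and default rank are made up for illustration):

```python
def zip_ranks_into_chunks(graph_chunks, ranks, default=0.0):
    """For each graph chunk, prepend each page's current rank to its line,
    producing 'page rank out1 out2 ...' lines for the next iteration's input."""
    zipped = []
    for chunk in graph_chunks:
        out_lines = []
        for line in chunk:
            page, *outlinks = line.split()
            rank = ranks.get(page, default)  # pages with no rank yet get the default
            out_lines.append(" ".join([page, repr(rank)] + outlinks))
        zipped.append(out_lines)
    return zipped
```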
This all seems more complicated than necessary, and since this is a pretty straightforward operation (PageRank being the quintessential MapReduce algorithm), I suspect I'm missing something about Disco that could really simplify the approach.
Any thoughts?