It should be possible, since PageRank is a form of eigenvalue problem, and that is why MapReduce was introduced. But there seem to be problems in an actual implementation: for example, does every slave computer have to maintain a copy of the matrix?

+4  A: 

PREAMBLE:

Given the right partitioning of the data, one can achieve parallel computing results without a complete dataset on every machine.

Take for example the following loop:

for (int i = 0; i < m.length; i++)
{
    for (int j = 0; j < m[i].length; j++)
    {
        m[i][j]++; // each element update is independent of all the others
    }
}

And given a matrix of the following layout:

       j=0   j=1   j=2
 i=0  [   ] [   ] [   ]
 i=1  [   ] [   ] [   ]
 i=2  [   ] [   ] [   ]

Parallel constructs exist such that each column j can be sent to a separate computer and the columns computed in parallel (a sketch follows the next example). The difficult part of parallelization comes when you've got loops that contain dependencies:

for (int i = 0; i < m.length; i++)
{
    for (int j = 0; j < m[i].length; j++)
    {
        // For obvious reasons, matrix index verification code removed
        m[i][j] = m[i/2][j] + m[i][j+7];
    }
}

Obviously a loop like the one above becomes extremely problematic (notice the matrix indices). But techniques do exist for restructuring these types of loops into effective parallel algorithms.
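
For the first, dependency-free loop, here is a minimal single-machine sketch of the one-column-per-worker idea using plain Java threads. The class name and pool sizing are illustrative; a real distributed version would ship each column to a different machine rather than to a thread:

import java.util.concurrent.*;

public class ColumnParallel
{
    public static void main(String[] args) throws InterruptedException
    {
        int[][] m = new int[3][3];
        ExecutorService pool = Executors.newFixedThreadPool(m[0].length);

        for (int j = 0; j < m[0].length; j++)
        {
            final int col = j;
            // Each worker owns column 'col'; no two workers touch the
            // same cell, so the increments need no locking.
            pool.execute(() -> {
                for (int i = 0; i < m.length; i++)
                {
                    m[i][col]++;
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}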

ANSWER:

It is possible that Google developed a solution to compute an eigenvalue without maintaining a copy of the matrix on all slave computers. Or they used something like Monte Carlo or some other approximation algorithm to arrive at a "close enough" calculation.
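
For illustration, here is a minimal sketch of the kind of Monte Carlo approximation alluded to above: simulate many random walks that follow an outlink with probability d and stop otherwise; the distribution of walk endpoints approximates the PageRank vector. The graph, damping factor, and walk count are all illustrative, and dangling nodes are handled by simply stopping the walk:

import java.util.*;

public class RandomSurfer
{
    public static void main(String[] args)
    {
        int[][] links = {{1, 2}, {2}, {0}};   // outlinks per node (toy graph)
        double d = 0.85;                       // damping factor
        int walks = 100_000;
        long[] visits = new long[links.length];
        Random rng = new Random(42);

        for (int w = 0; w < walks; w++)
        {
            int node = rng.nextInt(links.length);          // random start
            // Continue the walk with probability d; stop at dangling nodes.
            while (rng.nextDouble() < d && links[node].length > 0)
            {
                node = links[node][rng.nextInt(links[node].length)];
            }
            visits[node]++;                                 // count walk endpoints
        }
        for (int n = 0; n < visits.length; n++)
        {
            System.out.printf("node %d: %.3f%n", n, (double) visits[n] / walks);
        }
    }
}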

In fact, I'd go so far as to say that Google will have gone to great lengths to make every calculation required for their PageRank algorithm as efficient as possible. When you're running racks of commodity machines connected by ordinary Ethernet, you can't be transferring large datasets (hundreds of gigabytes); the commodity NICs simply don't have the bandwidth.

With that said, Google is good at surprising the programmer community and their implementation could be entirely different.

POSTAMBLE:

Some good resources for parallel computing include OpenMP and MPI. The two approach parallel computing from very different paradigms, stemming in part from the machine model each targets (shared-memory vs. distributed-memory computing).

Gavin Miller
"It is entirely possible for an eigenvalue to be calculated without maintaining a copy of the matrix on all slave computers." ??? How do you come to that conclusion? PageRank's matrices are sparse.
Jason S
@Jason - My meaning and how I wrote it were not the same. I made an edit to that effect. Thanks for pointing that out.
Gavin Miller
+1  A: 

I suspect it is intractable for most matrices except those w/ special structures (e.g. sparse matrices or ones w/ certain block patterns). There's way too much coupling between matrix coefficients and eigenvalues.

PageRank uses a very sparse matrix of a special form, and any conclusions from calculating its eigenvalues almost certainly don't extend to general matrices. (edit: here's another reference that looks interesting)

Jason S
+4  A: 

This looks pertinent:

http://www.gg.caltech.edu/~zhukov/papers/ParallelPageRank-paper.pdf

So does this:

http://plato.asu.edu/slides/golub.pdf

duffymo
Section 2.3 (Parallel Implementation) describes the solution; good find!
Gavin Miller
The implementation uses the PETSc library with MPI, not MapReduce. Also, the link does not work (anymore), but this summary by a co-author does: http://www.stanford.edu/~dgleich/projects/ppagerank/index.html
bubaker
+1  A: 

I can answer my own question now. The PageRank algorithm takes advantage of the sparse matrix: it converges to the dominant eigenvector through repeated self-multiplication (power iteration). Thus, in PageRank practice, the Map/Reduce procedure is valid: you can perform the matrix multiplication in the Map step and combine the partial products back into a sparse result in the Reduce step. But for general matrix eigenvalue finding, it is still a tricky problem.
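
To make that concrete, here is a minimal single-machine sketch of the map/reduce pattern (the toy matrix and names are mine, not Google's implementation): the "map" emits one partial product per nonzero entry, keyed by row, and the "reduce" sums the partials per row to form the next rank vector.

import java.util.*;
import java.util.stream.*;

public class SparseMatVec
{
    // One nonzero entry A[row][col] = value of the link matrix.
    record Entry(int row, int col, double value) {}

    public static void main(String[] args)
    {
        // Toy 3-node, column-stochastic link matrix, stored sparsely.
        List<Entry> matrix = List.of(
            new Entry(0, 1, 0.5), new Entry(0, 2, 1.0),
            new Entry(1, 0, 1.0), new Entry(2, 1, 0.5));
        double[] rank = {1.0 / 3, 1.0 / 3, 1.0 / 3};

        for (int iter = 0; iter < 20; iter++)
        {
            final double[] p = rank;
            // "Map": each nonzero entry emits (row, value * p[col]).
            // "Reduce": the partial products are summed per row key.
            Map<Integer, Double> next = matrix.stream()
                .collect(Collectors.groupingBy(Entry::row,
                         Collectors.summingDouble(e -> e.value() * p[e.col()])));
            rank = new double[p.length];
            for (var kv : next.entrySet())
            {
                rank[kv.getKey()] = kv.getValue();
            }
        }
        System.out.println(Arrays.toString(rank)); // ~ [0.4, 0.4, 0.2]
    }
}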

+1  A: 

This video explains MapReduce and PageRank: http://nl.youtube.com/watch?v=BT-piFBP4fE. The matrix used is indeed very sparse.

tuinstoel
+6  A: 

PageRank solves the dominant eigenvector problem by iteratively finding the steady-state discrete flow condition of the network.

If the NxN matrix A describes the link weight (amount of flow) from node j to node i, then

p_{k+1} = A p_k

In the limit where p has converged to a steady state (p_{k+1} = p_k), this is an eigenvector problem with eigenvalue 1.

The PageRank algorithm doesn't require the matrix to be held in memory, but it is inefficient on dense (non-sparse) matrices. For dense matrices, MapReduce is the wrong solution -- you need locality and broad exchange among nodes -- so you should instead look at LAPACK, MPI, and friends.
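
As an illustration of not holding the matrix in memory, here is a sketch of a single power-iteration step that streams the matrix as an edge list from disk; only the two rank vectors live in RAM. The "src dst weight" file format is an assumption of this sketch, not any particular library's format:

import java.io.*;
import java.nio.file.*;

public class StreamingStep
{
    // One step of p_{k+1} = A p_k, streaming A from an edge-list file
    // with one "src dst weight" triple per line (hypothetical format).
    static double[] step(Path edges, double[] p) throws IOException
    {
        double[] next = new double[p.length];
        try (BufferedReader in = Files.newBufferedReader(edges))
        {
            String line;
            while ((line = in.readLine()) != null)
            {
                String[] f = line.trim().split("\\s+");
                int src = Integer.parseInt(f[0]);
                int dst = Integer.parseInt(f[1]);
                double w = Double.parseDouble(f[2]);
                next[dst] += w * p[src]; // accumulate flow from src into dst
            }
        }
        return next;
    }
}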

You can see a working PageRank implementation in the Wukong library (Hadoop streaming for Ruby) or in the Heritrix PageRank submodule. (The Heritrix code runs independently of the Heritrix crawler.)

(disclaimer: I am an author of wukong.)

mrflip
+1  A: 

The Apache Hama project has an interesting implementation of the Jacobi eigenvalue algorithm. It runs on Hadoop. Note that the rotations happen in the scan of the matrix, not in the map/reduce steps.
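
For reference, the heart of the Jacobi algorithm is a plane rotation that zeroes one off-diagonal element of a symmetric matrix at a time; repeating this on the largest off-diagonal entry drives the matrix toward a diagonal of its eigenvalues. A minimal standalone sketch (not Hama's actual code) of one such rotation:

public class JacobiRotation
{
    // Apply one Jacobi rotation J^T * A * J that zeroes A[p][q] of a
    // symmetric matrix A, in place.
    static void rotate(double[][] a, int p, int q)
    {
        if (a[p][q] == 0.0) return;
        double theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q]);
        // Smaller-magnitude root of t^2 + 2*theta*t - 1 = 0.
        double t = (theta >= 0 ? 1.0 : -1.0)
                   / (Math.abs(theta) + Math.sqrt(theta * theta + 1.0));
        double c = 1.0 / Math.sqrt(t * t + 1.0);
        double s = t * c;
        int n = a.length;
        for (int k = 0; k < n; k++)   // columns: A <- A * J
        {
            double akp = a[k][p], akq = a[k][q];
            a[k][p] = c * akp - s * akq;
            a[k][q] = s * akp + c * akq;
        }
        for (int k = 0; k < n; k++)   // rows: A <- J^T * A
        {
            double apk = a[p][k], aqk = a[q][k];
            a[p][k] = c * apk - s * aqk;
            a[q][k] = s * apk + c * aqk;
        }
    }
}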

dspiteself