views:

104

answers:

1

I am attempting to understand the concepts behind Google PageRank, and am attempting to implement a similar (though rudimentary) version in Python. I have spent the last few hours familiarizing myself with the algorithm, however it's still not all that clear.

I've located a particularly interesting website that outlines the implementation of PageRank in Python. However, I can't quite seem to understand the purpose of all of the functions shown on this page. Could anyone clarify what exactly the functions are doing, particularly pageRankeGenerator?

+1  A: 

I'll try to give a simple explanation (definition) of the PageRank algorithm from my personal notes.

Let us say that pages T1, T2, ... Tn are pointing to page A, then

PR(A) = (1-d) + d * (PR(T1) / C(T1) + ... + PR(Tn) / C(Tn))

where

  • PR(Ti) is the PageRank of Ti
  • C(Ti) is the number of outgoing links from page Ti
  • d is the dumping factor in the range 0 < d < 1, usually set to 0.85

Every PR(x) can have start value 1 and we adjust the page ranks by repeating the algorithm ~10-20 times for each page.

Example for pages A, B, C:

   A <--> B
   ^     /
    \   v
      C

Round 1
A = 0.15 + 0.85 (1/2 + 1/1) = 1.425
B = 0.15 + 0.85 (1/1) = 1
C = 0.15 + 0.85 (1/2) = 0.575

round's sum = 3

Round 2
A = 0.15 + 0.85 (1/2 + 0.575) = 1.06375
B = 0.15 + 0.85 (1.425) = 1.36125
C = 0.15 + 0.85 (1/2) = 0.575

round's sum = 3

Round 3
A = 0.15 + 0.85 (1.36125/2 + 0.575) = 1.217
B = 0.15 + 0.85 (1.06375) = 1.054
C = 0.728

round's sum = 3

...

Nick D
Alright this makes more sense. I take it the "Rounds" are the same thing as the convergence steps? (i.e., limit the number of rounds). Also, many papers i've been reading make use of eigenvalues, an outgoing link matrix (A), and a incoming link matrix (A^T). Where do these fit in above, and are they even necessary?
Louis
@Louis, yes "rounds" are the convergence steps. My linear algebra is a bit rusty, but I think eigenvalues are just the calculated page ranks. In my example I used a formula for one page. If we rewrite it as formula for n pages we get n-dimensional (or matrix) representation. IMO matrix representation is a bit more difficult to grasp the first time.
Nick D
One more question: Why do you not propogate the updated values to round 2? For instance, in Round 1 you found B = 1, C = 0.575. However in round 2, you have A = 0.15 + 0.85 (1/2 + 0.575). Why did you still use 1/2 instead of 1, as was found in round 1? I'm working on a class, and Round 1 is correct, however subsequent rounds do not match what you have shown. http://pastebin.com/raw.php?i=mQXjXyRS
Louis
@Louis I use the updated values of B and C divided by their number of outgoing links. On round 2: (1/2 + 0.575/1) = (P1(B)/C(B) + P1(C)/C(C)).
Nick D