ansaurus

Question

How to optimize my PageRank calculation?

Answer 1

A:

Do you have enough RAM to hold the sparse matrix (fromid, toid) in some form? That would allow big optimizations (with big algorithmic changes). At least, caching in memory the (fromid, numlinks) that you now do with a select count(*) in your innermost loop should help (I'd imagine that cache, being O(N) in space if you're dealing with N URLs, would be more likely to fit in memory).

Alex Martelli 2010-03-21 01:06:48

Answer 2

+1 A:

If you have a very large database (e.g. # records ~ # pages in the WWW) using the database in a manner similar to what's suggested in the book makes sense, because you're not going to be able to keep all that data in memory.

If your dataset is small enough, you can (probably) improve your second version by not doing so many queries. Try replacing your first loop with something like this:

for urlid, in self.con.execute('select rowid from urllist'):
    inlinks[urlid] = []
    numoutlinks[urlid] = 0
    pagerank[urlid] = 1.0

for src, dest in self.con.execute('select fromid, toid from link'):
    inlinks[dest].append(src)
    numoutlinks[src] += 1

This version does exactly 2 queries instead of O(n^2) queries.

allyourcode 2010-03-21 01:23:30

Answer 3

+1 A:

unutbu 2010-03-21 01:27:03

Answer 4

A:

I'm answering my own question, since in the end it came out that a mixture of all answers worked best for me:

    def calculatepagerank4(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")

    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()

    inlinks={}
    numoutlinks={}
    pagerank={}

    for (urlid,) in self.con.execute("select rowid from urllist"):
        inlinks[urlid]=[]
        numoutlinks[urlid]=0
        # Initialize pagerank vector with 1.0
        pagerank[urlid]=1.0

    for src,dest in self.con.execute("select distinct fromid, toid from link"):
        inlinks[dest].append(src)
        numoutlinks[src]+=1          

    for i in range(iterations):
        print "Iteration %d" % i

        for urlid in pagerank:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr

    args=((pagerank[urlid],urlid) for urlid in pagerank)
    self.con.executemany("update pagerank set score=? where urlid=?" , args)
    self.dbcommit()

So I replaced the first two loops as suggested by allyourcode, but in addition also used executemany() as in the solution from ˜unutbu. But unlike ˜unutbu I use a generator expression for args, to not waste too much memory, although using a list comprehension was a little bit faster. In the end the routine was 100 times faster than the routine suggested in the book:

>>> cProfile.run("crawler.calculatepagerank4()")
     33512 function calls in 1.377 CPU seconds
Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.004    0.004    1.377    1.377 <string>:1(<module>)
     2    0.000    0.000    0.073    0.036 searchengine.py:27(dbcommit)
     1    0.693    0.693    1.373    1.373 searchengine.py:286(calculatepagerank4
 10432    0.011    0.000    0.011    0.000 searchengine.py:321(<genexpr>)
 23065    0.009    0.000    0.009    0.000 {method 'append' of 'list' objects}
     2    0.073    0.036    0.073    0.036 {method 'commit' of 'sqlite3.Connectio
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler
     6    0.379    0.063    0.379    0.063 {method 'execute' of 'sqlite3.Connecti
     1    0.209    0.209    0.220    0.220 {method 'executemany' of 'sqlite3.Conn
     1    0.000    0.000    0.000    0.000 {range}

One should also be aware of the following problems:

If you use string formating with %f instead of using a placeholder ? for constructing the SQL statement you will loose precision (e.g. I got 2.9796095721920315 using ? but 2.9796100000000001 using %f.
Duplicate links from one page to another are treated as only one link in the default PageRank algorithm. However the solution from the book didn't take that into account.
The whole algorithm from the book is flawed: The reason is, that in each iteration, the pagerank score is not stored in a second table. But this means the outcome of an iteration depends on the order of the pages looped through and this might change the result after several iterations quite drastically. To fix this problem one either has to use an additional table/dictionary to store the pagerank for the next iteration or to use a completely different algorithm like Power Iteration.

asmaier 2010-03-24 22:25:33

ansaurus

tags:

views:

answers:

How to optimize my PageRank calculation?

related questions