ansaurus

Question

Kindly review the python code to boost its performance

Answer 1

+15 A:

It's definitely your code, but since you choose to hide it from us it's impossible for us to help any further. All I can tell you based on the very scarce info you choose to supply is that unpickling a dict (in the right way) is much faster, and indexing into it (assuming that's what you mean by "searches it for query") is blazingly fast. It's from this data that I deduce that the cause of your slowdown must be something else you do, or do wrong, in your code.

Edit: now that you have posted you code I notice, just at a glance, that a lot of nontrivial code is running at module top level. Really horrible practice, and detrimental to performance: put all your nontrivial code into a function, and call that function -- that by itself will give you a tens-of-percent speedup, at zero cost of complication. I must have mentioned that crucial fact at least 20 times just in my Stack Overflow posts, not to mention "Python in a Nutshell" etc -- surely if you care about performance you cannot blithely ignore such easily available and widespread information?!

More easily-fixed wastes of runtime:

import pickle

use cPickle (if you're not on Python 2.6 or 2.7, but rather on 3.1, then there may be other causes of performance issues -- I don't know how finely tuned 3.1 is at this time compared with the awesome performance of 2.6 and 2.7).

All of your global statements are useless except the one in load (not a serious performance hit, but redundant and useless code should be eliminated on principle). You only need a global if you want to bind a module-global variable from within a function, and load is the only one where you're doing that.

More editing:

Now we get to more important stuff: the values in InvertedIndex appear to be lists of docs, so to know how many times a doc appears in one you have to loop throughout it. Why not make each of the values be instead a dict from doc to number of occurrences? No looping (and no looping where you now do the len(set(...)) of a value in InvertedIndex -- just the len will then be equivalent and you save the set(...) operation which implicitly has to loop to perform its job). This is a big-O optimization, not "just" the kind of speedup by 20% or so that things I've mentioned so far may account for -- i.e., this is more important stuff, as I said. Use the right data structures and algorithms and a lot of minor inefficiencies may become relatively unimportant; use the wrong ones, and there's no saving your code's performance as the input size grows, no matter how cleverly you micro-optimize the wrong data structures and algorithms;-).

And more: you repeat your computations a lot, "from scratch" each time -- e.g., look how many times you call TF(term, doc) for each given term and doc (and each call has the crucial inefficiency I just explained too). As the quickest way to fix this huge inefficiency, use memoization -- e.g., with the memoized decorator found here.

OK, it's getting late for me and I'd better head off to bed -- I hope some or all of the above suggestions were of some use to you!

Alex Martelli 2010-09-27 04:21:31

@Alex: Isn't it better anyway, to completely get rid of all calls to global variables, where possible? In Python I have no working experience. But in PHP I also see this often but remember clearly that my boss would cut my head of, if he would see me using global variables (even when nessesary).

erikb 2010-09-27 04:41:30

@erikb, yep, removing globals would indeed be better, but it requires substantially more refactoring effort than the changes I've mentioned so far, and mostly for code maintainability advantages, not so much performance (unless all the globals could be turned into local variables of a single function and passed as arguments to all other functions requiring them, but that's even _more_ code refactoring for the resulting minor performance gain -- I've been focusing mostly on performance in this answer because that was the OP's crucial current problem.

Alex Martelli 2010-09-27 04:45:35

@Alex: why is function level code executed faster than module level code? Forgive me, I'm unfamiliar with the inner workings of Python.

Rafe Kettler 2010-09-27 04:48:26

@Rafe Kettler: module-level *variables* are stored in a python dict(ionary); while function-level *variables* are internally compiled into C's array indexing. Dictionary retrieval is slightly slower than array indexing, though they are both O(1) level.

Lie Ryan 2010-09-27 06:06:27

claws 2010-09-27 07:44:38

@claws - That is what they're for. But this means poorly organized code will perform poorly in Python, just like poorly indented code won't work.

Omnifarious 2010-09-27 12:44:32

@Lie: thanks so much! This is a Protip for sure.

Rafe Kettler 2010-09-27 13:51:40

Answer 2

+4 A:

aaronasterling 2010-09-27 04:26:35

I think it might also have some learning effects to mention, that even small changes in this function can improve the code. Like replacing `idx=idx+1` with `idx+=1`, replacing the `while` with a meaningful `for` loop and last but not least a change like yours to bring it all together in a understandable oneliner.

erikb 2010-09-27 04:53:37

claws 2010-09-27 08:08:05

@erikb: `replacing the while with a meaningful for loop` does this also contribute to performance improvement? :O

claws 2010-09-27 09:06:33

@claws: Yes, having some basic programming skills certainly increases the average performance of your programs. My suggestions should work in all languages like C or of a higher level. About Assembly I don't know, because I never had to use it. To your book question: Nearly all good programming books can help you there. Look for example [here](http://stackoverflow.com/questions/194812/list-of-freely-available-programming-books)

erikb 2010-09-27 09:12:20

@claws: I recommend The Python Cookbook (http://oreilly.com/catalog/9780596001674) by our own great Alex Martelli. Thanks a lot, Alex.

pillmuncher 2010-09-27 11:34:45

Answer 3

+1 A:

Does Python cache function results? I don't think so. In this case it might be a bad idea to run a looping function like TF(term,doc) many times in getValues4Term(). When you put the result in a variable, you probably already get a huge speed upgrade. That combined with

for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)

might already be the biggest speed problem, you have.

erikb 2010-09-27 04:45:35

Good point! Can you tell me what should I look into to do the caching?

claws 2010-09-27 09:12:00

Alex describes already [how to use memoization](http://stackoverflow.com/questions/3801072/kindly-review-the-python-code-to-boost-its-performance/3801100#3801100). That works fine for python. A solution that works in all languages is to store the result of `TF(term,doc)` in a variable in the beginning of your function and then just use that variable instead of `TF(term,doc)` calls.

erikb 2010-09-27 10:49:23

Answer 4

+1 A:

It's nice that you post your code in response to people asking you to, but unless they run it and profile it, the best they can do is guess. I can guess too, but even if guesses are "good" or "educated", they are not a good way to find performance problems.

I would rather refer you to a technique that will pinpoint the problem. That will work better than guessing or asking others to guess. Once you've found out for yourself where the problem is exactly, you can decide whether to use memoization or whatever to fix it.

Usually there's more than one problem. If you repeat the process of finding and removing performance problems, you'll approach a real optimum.

Mike Dunlavey 2010-09-27 16:50:53

Answer 5

+2 A:

This is more micro-optimization, in the spirit of @aaronasterling, above. Still, I think these observations are worth considering.

Use appropriate datatypes

stopwords should be a set. You can't repeatedly search a list repeatedly and expect it to be fast.

Use more sets. They are iterable, like lists, but when you have to search them, they are way faster than lists.

List comprehensions

resultSquares = [k*k for k in resultDocVector]
qSquares = [k*k for k in qVector]
TFs = [TF(term,doc) for term in docs[doc]]

Generators

Turn this:

for qword in qwords:
    qword = qword.lower()
    if qword in stopwords:
        continue
    qterm = stemmer.stem(qword,0,len(qword)-1) 
    qterms1.append(qterm)
qterms = list(set(qterms1))

into this:

qworditer = (qword.lower() for qword in qwords if qword not in stopwords)
qtermiter = (stemmer.stem(qword,0,len(qword)-1) for qword in qworditer)
qterms1 = set([qterm for qterm in qtermiter])

Use generators and reduce():

Turn this:

prev = set()
i = 0
for qterm in qterms:
    if InvertedIndex.has_key(qterm):
        if i == 0:
            prev = set(InvertedIndex[qterm])
            i = i+1
            continue
        prev = prev.intersection(set(InvertedIndex[qterm]))

results = list(prev)

Into this:

qtermiter = (set(InvertedIndex[qterm]) for qterm in qterms if qterm in InvertedIndex)
results = reduce(set.intersection, qtermiter)

Use list comprehensions

Instead of this:

i = 0
docComponents = []
for doc in results:
        docComponents.append([])

i = 0    
for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)#[TermFrequency, DocumentFrequency]
        docComponents[i].append(vals)
    i = i+1

Write this:

docComponents = [getValues4Term(term,doc) for doc in results for term in docs[doc]]

This code makes no sense:

for doc in results:
        resultDocVectors.append([])

for i in range(0,len(results)):
    for j in range(0,len(docs[doc])):
        tf = docComponents[i][j][0]['natural']#0:TermFrequency
        idf = docComponents[i][j][1]['idf'] #1:DocumentFrequency        
        resultDocVectors[i].append(tf*idf)

The value len(docs[doc]) depends on doc, and the value of doc is whatever was last reached in the loop for doc in results.

Use collections.defaultdict

Instead of this:

relevances = {}
fh = open("qrels.txt")
lines = fh.readlines()
for line in lines:
    cols = line.strip().split()
    if relevances.has_key(cols[0]):#queryID
        relevances[cols[0]].append(cols[2])#docID
    else:
        relevances[cols[0]] = [cols[2]]

Write this (assuming your file has only three fields per line):

from collections import defaultdict
relevances = defaultdict(list)
with open("qrels.txt") as fh:
    lineiter = (line.strip().split() for line in fh)
    for queryID, _, docID in lineiter:
        relevances[queryID].append(docID)

As many other people have said, memoize your computations.

2010-10-21: An update on the stopwords thing.

from datetime import datetime

stopwords = ['a' , 'a\'s' , 'able' , 'about' , 'above' , 'according' , 'accordingly' , 'across' , 'actually' , 'after' , 'afterwards' , 'again' , 'against' , 'ain\'t' , 'all' , 'allow' , 'allows' , 'almost' , 'alone' , 'along' , 'already' , 'also' , 'although' , 'always' , 'am' , 'among' , 'amongst' , 'an' , 'and' , 'another' , 'any' , 'anybody' , 'anyhow' , 'anyone' , 'anything' , 'anyway' , 'anyways' , 'anywhere' , 'apart' , 'appear' , 'appreciate' , 'appropriate' , 'are' , 'aren\'t' , 'around' , 'as' , 'aside' , 'ask' , 'asking' , 'associated' , 'at' , 'available' , 'away' , 'awfully' , 'b' , 'be' , 'became' , 'because' , 'become' , 'becomes' , 'becoming' , 'been' , 'before' , 'beforehand' , 'behind' , 'being' , 'believe' , 'below' , 'beside' , 'besides' , 'best' , 'better' , 'between' , 'beyond' , 'both' , 'brief' , 'but' , 'by' , 'c' , 'c\'mon' , 'c\'s' , 'came' , 'can' , 'can\'t' , 'cannot' , 'cant' , 'cause' , 'causes' , 'certain' , 'certainly' , 'changes' , 'clearly' , 'co' , 'com' , 'come' , 'comes' , 'concerning' , 'consequently' , 'consider' , 'considering' , 'contain' , 'containing' , 'contains' , 'corresponding' , 'could' , 'couldn\'t' , 'course' , 'currently' , 'd' , 'definitely' , 'described' , 'despite' , 'did' , 'didn\'t' , 'different' , 'do' , 'does' , 'doesn\'t' , 'doing' , 'don\'t' , 'done' , 'down' , 'downwards' , 'during' , 'e' , 'each' , 'edu' , 'eg' , 'eight' , 'either' , 'else' , 'elsewhere' , 'enough' , 'entirely' , 'especially' , 'et' , 'etc' , 'even' , 'ever' , 'every' , 'everybody' , 'everyone' , 'everything' , 'everywhere' , 'ex' , 'exactly' , 'example' , 'except' , 'f' , 'far' , 'few' , 'fifth' , 'first' , 'five' , 'followed' , 'following' , 'follows' , 'for' , 'former' , 'formerly' , 'forth' , 'four' , 'from' , 'further' , 'furthermore' , 'g' , 'get' , 'gets' , 'getting' , 'given' , 'gives' , 'go' , 'goes' , 'going' , 'gone' , 'got' , 'gotten' , 'greetings' , 'h' , 'had' , 'hadn\'t' , 'happens' , 'hardly' , 'has' , 'hasn\'t' , 'have' , 'haven\'t' , 'having' , 'he' , 'he\'s' , 'hello' , 'help' , 'hence' , 'her' , 'here' , 'here\'s' , 'hereafter' , 'hereby' , 'herein' , 'hereupon' , 'hers' , 'herself' , 'hi' , 'him' , 'himself' , 'his' , 'hither' , 'hopefully' , 'how' , 'howbeit' , 'however' , 'i' , 'i\'d' , 'i\'ll' , 'i\'m' , 'i\'ve' , 'ie' , 'if' , 'ignored' , 'immediate' , 'in' , 'inasmuch' , 'inc' , 'indeed' , 'indicate' , 'indicated' , 'indicates' , 'inner' , 'insofar' , 'instead' , 'into' , 'inward' , 'is' , 'isn\'t' , 'it' , 'it\'d' , 'it\'ll' , 'it\'s' , 'its' , 'itself' , 'j' , 'just' , 'k' , 'keep' , 'keeps' , 'kept' , 'know' , 'knows' , 'known' , 'l' , 'last' , 'lately' , 'later' , 'latter' , 'latterly' , 'least' , 'less' , 'lest' , 'let' , 'let\'s' , 'like' , 'liked' , 'likely' , 'little' , 'look' , 'looking' , 'looks' , 'ltd' , 'm' , 'mainly' , 'many' , 'may' , 'maybe' , 'me' , 'mean' , 'meanwhile' , 'merely' , 'might' , 'more' , 'moreover' , 'most' , 'mostly' , 'much' , 'must' , 'my' , 'myself' , 'n' , 'name' , 'namely' , 'nd' , 'near' , 'nearly' , 'necessary' , 'need' , 'needs' , 'neither' , 'never' , 'nevertheless' , 'new' , 'next' , 'nine' , 'no' , 'nobody' , 'non' , 'none' , 'noone' , 'nor' , 'normally' , 'not' , 'nothing' , 'novel' , 'now' , 'nowhere' , 'o' , 'obviously' , 'of' , 'off' , 'often' , 'oh' , 'ok' , 'okay' , 'old' , 'on' , 'once' , 'one' , 'ones' , 'only' , 'onto' , 'or' , 'other' , 'others' , 'otherwise' , 'ought' , 'our' , 'ours' , 'ourselves' , 'out' , 'outside' , 'over' , 'overall' , 'own' , 'p' , 'particular' , 'particularly' , 'per' , 'perhaps' , 'placed' , 'please' , 'plus' , 'possible' , 'presumably' , 'probably' , 'provides' , 'q' , 'que' , 'quite' , 'qv' , 'r' , 'rather' , 'rd' , 're' , 'really' , 'reasonably' , 'regarding' , 'regardless' , 'regards' , 'relatively' , 'respectively' , 'right' , 's' , 'said' , 'same' , 'saw' , 'say' , 'saying' , 'says' , 'second' , 'secondly' , 'see' , 'seeing' , 'seem' , 'seemed' , 'seeming' , 'seems' , 'seen' , 'self' , 'selves' , 'sensible' , 'sent' , 'serious' , 'seriously' , 'seven' , 'several' , 'shall' , 'she' , 'should' , 'shouldn\'t' , 'since' , 'six' , 'so' , 'some' , 'somebody' , 'somehow' , 'someone' , 'something' , 'sometime' , 'sometimes' , 'somewhat' , 'somewhere' , 'soon' , 'sorry' , 'specified' , 'specify' , 'specifying' , 'still' , 'sub' , 'such' , 'sup' , 'sure' , 't' , 't\'s' , 'take' , 'taken' , 'tell' , 'tends' , 'th' , 'than' , 'thank' , 'thanks' , 'thanx' , 'that' , 'that\'s' , 'thats' , 'the' , 'their' , 'theirs' , 'them' , 'themselves' , 'then' , 'thence' , 'there' , 'there\'s' , 'thereafter' , 'thereby' , 'therefore' , 'therein' , 'theres' , 'thereupon' , 'these' , 'they' , 'they\'d' , 'they\'ll' , 'they\'re' , 'they\'ve' , 'think' , 'third' , 'this' , 'thorough' , 'thoroughly' , 'those' , 'though' , 'three' , 'through' , 'throughout' , 'thru' , 'thus' , 'to' , 'together' , 'too' , 'took' , 'toward' , 'towards' , 'tried' , 'tries' , 'truly' , 'try' , 'trying' , 'twice' , 'two' , 'u' , 'un' , 'under' , 'unfortunately' , 'unless' , 'unlikely' , 'until' , 'unto' , 'up' , 'upon' , 'us' , 'use' , 'used' , 'useful' , 'uses' , 'using' , 'usually' , 'uucp' , 'v' , 'value' , 'various' , 'very' , 'via' , 'viz' , 'vs' , 'w' , 'want' , 'wants' , 'was' , 'wasn\'t' , 'way' , 'we' , 'we\'d' , 'we\'ll' , 'we\'re' , 'we\'ve' , 'welcome' , 'well' , 'went' , 'were' , 'weren\'t' , 'what' , 'what\'s' , 'whatever' , 'when' , 'whence' , 'whenever' , 'where' , 'where\'s' , 'whereafter' , 'whereas' , 'whereby' , 'wherein' , 'whereupon' , 'wherever' , 'whether' , 'which' , 'while' , 'whither' , 'who' , 'who\'s' , 'whoever' , 'whole' , 'whom' , 'whose' , 'why' , 'will' , 'willing' , 'wish' , 'with' , 'within' , 'without' , 'won\'t' , 'wonder' , 'would' , 'would' , 'wouldn\'t' , 'x' , 'y' , 'yes' , 'yet' , 'you' , 'you\'d' , 'you\'ll' , 'you\'re' , 'you\'ve' , 'your' , 'yours' , 'yourself' , 'yourselves' , 'z' , 'zero']
print len(stopwords)
dictfile = '/usr/share/dict/american-english-huge'
with open(dictfile) as f:
    words = [line.strip() for line in f]

print len(words)

s = datetime.now()
total = sum(1 for word in words if word in stopwords)
e = datetime.now()
elapsed = e - s
print elapsed, total

s = datetime.now()
stopwords_set = set(stopwords)
total = sum(1 for word in words if word in stopwords_set)
e = datetime.now()
elapsed = e - s
print elapsed, total

I get these results:

# Using list
>>> print elapsed, total
0:00:06.902529 542

# Using set
>>> print elapsed, total
0:00:00.050676 542

Same number of results, but one runs almost 140 times faster. Granted, you probably don't have so many words to compare against your stopwords, and 6 seconds is negligible against your 30+ minute runtime. It does underline, though, that using appropriate data structures speeds up your code.

hughdbrown 2010-10-21 03:42:13

ansaurus

tags:

views:

answers:

Kindly review the python code to boost its performance

related questions