views: 498
answers: 5

Hello,

I'm doing an Information Retrieval task. I built a simple search engine. The InvertedIndex is a Python dictionary object which is serialized (pickled, in Python terminology) to a file. The size of this file is just 6.5 MB.

So my code just unpickles it, searches it for the query, and ranks the matching documents by TF-IDF score. That doesn't sound like anything huge, right?

It started running 30 minutes ago and is still running. Private Bytes and Virtual Size usage of the pythonw.exe process running my 100-line Python script are 88 MB and 168 MB respectively.

When I tried it with a smaller index it was fast. Is it Python or my code? Why is it so slow?

stopwords = ['a' , 'a\'s' , 'able' , 'about' , 'above' , 'according' , 'accordingly' , 'across' , 'actually' , 'after' , 'afterwards' , 'again' , 'against' , 'ain\'t' , 'all' , 'allow' , 'allows' , 'almost' , 'alone' , 'along' , 'already' , 'also' , 'although' , 'always' , 'am' , 'among' , 'amongst' , 'an' , 'and' , 'another' , 'any' , 'anybody' , 'anyhow' , 'anyone' , 'anything' , 'anyway' , 'anyways' , 'anywhere' , 'apart' , 'appear' , 'appreciate' , 'appropriate' , 'are' , 'aren\'t' , 'around' , 'as' , 'aside' , 'ask' , 'asking' , 'associated' , 'at' , 'available' , 'away' , 'awfully' , 'b' , 'be' , 'became' , 'because' , 'become' , 'becomes' , 'becoming' , 'been' , 'before' , 'beforehand' , 'behind' , 'being' , 'believe' , 'below' , 'beside' , 'besides' , 'best' , 'better' , 'between' , 'beyond' , 'both' , 'brief' , 'but' , 'by' , 'c' , 'c\'mon' , 'c\'s' , 'came' , 'can' , 'can\'t' , 'cannot' , 'cant' , 'cause' , 'causes' , 'certain' , 'certainly' , 'changes' , 'clearly' , 'co' , 'com' , 'come' , 'comes' , 'concerning' , 'consequently' , 'consider' , 'considering' , 'contain' , 'containing' , 'contains' , 'corresponding' , 'could' , 'couldn\'t' , 'course' , 'currently' , 'd' , 'definitely' , 'described' , 'despite' , 'did' , 'didn\'t' , 'different' , 'do' , 'does' , 'doesn\'t' , 'doing' , 'don\'t' , 'done' , 'down' , 'downwards' , 'during' , 'e' , 'each' , 'edu' , 'eg' , 'eight' , 'either' , 'else' , 'elsewhere' , 'enough' , 'entirely' , 'especially' , 'et' , 'etc' , 'even' , 'ever' , 'every' , 'everybody' , 'everyone' , 'everything' , 'everywhere' , 'ex' , 'exactly' , 'example' , 'except' , 'f' , 'far' , 'few' , 'fifth' , 'first' , 'five' , 'followed' , 'following' , 'follows' , 'for' , 'former' , 'formerly' , 'forth' , 'four' , 'from' , 'further' , 'furthermore' , 'g' , 'get' , 'gets' , 'getting' , 'given' , 'gives' , 'go' , 'goes' , 'going' , 'gone' , 'got' , 'gotten' , 'greetings' , 'h' , 'had' , 'hadn\'t' , 'happens' , 'hardly' , 'has' , 'hasn\'t' , 'have' , 'haven\'t' , 'having' , 'he' , 'he\'s' , 'hello' , 'help' , 'hence' , 'her' , 'here' , 'here\'s' , 'hereafter' , 'hereby' , 'herein' , 'hereupon' , 'hers' , 'herself' , 'hi' , 'him' , 'himself' , 'his' , 'hither' , 'hopefully' , 'how' , 'howbeit' , 'however' , 'i' , 'i\'d' , 'i\'ll' , 'i\'m' , 'i\'ve' , 'ie' , 'if' , 'ignored' , 'immediate' , 'in' , 'inasmuch' , 'inc' , 'indeed' , 'indicate' , 'indicated' , 'indicates' , 'inner' , 'insofar' , 'instead' , 'into' , 'inward' , 'is' , 'isn\'t' , 'it' , 'it\'d' , 'it\'ll' , 'it\'s' , 'its' , 'itself' , 'j' , 'just' , 'k' , 'keep' , 'keeps' , 'kept' , 'know' , 'knows' , 'known' , 'l' , 'last' , 'lately' , 'later' , 'latter' , 'latterly' , 'least' , 'less' , 'lest' , 'let' , 'let\'s' , 'like' , 'liked' , 'likely' , 'little' , 'look' , 'looking' , 'looks' , 'ltd' , 'm' , 'mainly' , 'many' , 'may' , 'maybe' , 'me' , 'mean' , 'meanwhile' , 'merely' , 'might' , 'more' , 'moreover' , 'most' , 'mostly' , 'much' , 'must' , 'my' , 'myself' , 'n' , 'name' , 'namely' , 'nd' , 'near' , 'nearly' , 'necessary' , 'need' , 'needs' , 'neither' , 'never' , 'nevertheless' , 'new' , 'next' , 'nine' , 'no' , 'nobody' , 'non' , 'none' , 'noone' , 'nor' , 'normally' , 'not' , 'nothing' , 'novel' , 'now' , 'nowhere' , 'o' , 'obviously' , 'of' , 'off' , 'often' , 'oh' , 'ok' , 'okay' , 'old' , 'on' , 'once' , 'one' , 'ones' , 'only' , 'onto' , 'or' , 'other' , 'others' , 'otherwise' , 'ought' , 'our' , 'ours' , 'ourselves' , 'out' , 'outside' , 'over' , 'overall' , 'own' , 'p' , 'particular' , 
'particularly' , 'per' , 'perhaps' , 'placed' , 'please' , 'plus' , 'possible' , 'presumably' , 'probably' , 'provides' , 'q' , 'que' , 'quite' , 'qv' , 'r' , 'rather' , 'rd' , 're' , 'really' , 'reasonably' , 'regarding' , 'regardless' , 'regards' , 'relatively' , 'respectively' , 'right' , 's' , 'said' , 'same' , 'saw' , 'say' , 'saying' , 'says' , 'second' , 'secondly' , 'see' , 'seeing' , 'seem' , 'seemed' , 'seeming' , 'seems' , 'seen' , 'self' , 'selves' , 'sensible' , 'sent' , 'serious' , 'seriously' , 'seven' , 'several' , 'shall' , 'she' , 'should' , 'shouldn\'t' , 'since' , 'six' , 'so' , 'some' , 'somebody' , 'somehow' , 'someone' , 'something' , 'sometime' , 'sometimes' , 'somewhat' , 'somewhere' , 'soon' , 'sorry' , 'specified' , 'specify' , 'specifying' , 'still' , 'sub' , 'such' , 'sup' , 'sure' , 't' , 't\'s' , 'take' , 'taken' , 'tell' , 'tends' , 'th' , 'than' , 'thank' , 'thanks' , 'thanx' , 'that' , 'that\'s' , 'thats' , 'the' , 'their' , 'theirs' , 'them' , 'themselves' , 'then' , 'thence' , 'there' , 'there\'s' , 'thereafter' , 'thereby' , 'therefore' , 'therein' , 'theres' , 'thereupon' , 'these' , 'they' , 'they\'d' , 'they\'ll' , 'they\'re' , 'they\'ve' , 'think' , 'third' , 'this' , 'thorough' , 'thoroughly' , 'those' , 'though' , 'three' , 'through' , 'throughout' , 'thru' , 'thus' , 'to' , 'together' , 'too' , 'took' , 'toward' , 'towards' , 'tried' , 'tries' , 'truly' , 'try' , 'trying' , 'twice' , 'two' , 'u' , 'un' , 'under' , 'unfortunately' , 'unless' , 'unlikely' , 'until' , 'unto' , 'up' , 'upon' , 'us' , 'use' , 'used' , 'useful' , 'uses' , 'using' , 'usually' , 'uucp' , 'v' , 'value' , 'various' , 'very' , 'via' , 'viz' , 'vs' , 'w' , 'want' , 'wants' , 'was' , 'wasn\'t' , 'way' , 'we' , 'we\'d' , 'we\'ll' , 'we\'re' , 'we\'ve' , 'welcome' , 'well' , 'went' , 'were' , 'weren\'t' , 'what' , 'what\'s' , 'whatever' , 'when' , 'whence' , 'whenever' , 'where' , 'where\'s' , 'whereafter' , 'whereas' , 'whereby' , 'wherein' , 'whereupon' , 'wherever' , 'whether' , 'which' , 'while' , 'whither' , 'who' , 'who\'s' , 'whoever' , 'whole' , 'whom' , 'whose' , 'why' , 'will' , 'willing' , 'wish' , 'with' , 'within' , 'without' , 'won\'t' , 'wonder' , 'would' , 'would' , 'wouldn\'t' , 'x' , 'y' , 'yes' , 'yet' , 'you' , 'you\'d' , 'you\'ll' , 'you\'re' , 'you\'ve' , 'your' , 'yours' , 'yourself' , 'yourselves' , 'z' , 'zero']
import PorterStemmer
import math
import pickle

def TF(term,doc):
    #Term Frequency: No. of times `term` occurred in `doc`
    global InvertedIndex
    idx = InvertedIndex[term].index(doc)
    count = 0
    while (idx < len(InvertedIndex[term])) and InvertedIndex[term][idx] == doc:
        count= count+1
        idx = idx+1
    return count

def DF(term):
    #Document Frequency: No. of documents containing `term`
    global InvertedIndex
    return len(set(InvertedIndex[term]))

def avgTF(term, doc):
    global docs
    TFs = []
    for term in docs[doc]:
        TFs.append(TF(term,doc))
    return sum(TFs)/len(TFs)

def maxTF(term, doc):
    global docs
    TFs = []    
    for term in docs[doc]:
        TFs.append(TF(term,doc))
    return max(TFs)

def getValues4Term(term, doc):
    TermFrequency = {}
    TermFrequency['natural'] = TF(term,doc)
    TermFrequency['log'] = 1+math.log( TF(term,doc) )
    TermFrequency['aug'] = 0.5+float(0.5*TF(term,doc)/maxTF(term,doc))
    TermFrequency['bool'] = 1 if TF(term,doc)>0 else 0
    TermFrequency['log_avg'] = float(1+math.log( TF(term,doc) ))/(1+math.log( avgTF(term,doc) ))

    DocumentFrequency = {}
    DocumentFrequency['no'] = 1
    DocumentFrequency['idf'] = math.log( len(docs)/DF(term) )
    DocumentFrequency['probIDF'] = max( [0, math.log( float(len(docs)-DF(term))/DF(term) )] )
    return [TermFrequency, DocumentFrequency]

def Cosine(resultDocVector, qVector, doc):
    #`doc` parameter is the document number corresponding to resultDocVector
    global qterms,docs
    # Defining Cosine similarity : cos(a) = A.B/|A||B|

    dotProduct = 0
    commonTerms_q_d = set(qterms).intersection(docs[doc]) #commonTerms in both query & document
    for cmnTerm in commonTerms_q_d:
       dotProduct =  dotProduct + resultDocVector[docs[doc].index(cmnTerm)] * qVector[qterms.index(cmnTerm)]

    resultSquares = []
    for k in resultDocVector:
        resultSquares.append(k*k)

    qSquares = []
    for k in qVector:
        qSquares.append(k*k)

    denominator = math.sqrt(sum(resultSquares)) * math.sqrt(sum(qSquares))
    return dotProduct/denominator

def load():
    #load index from a file
    global InvertedIndex, docIDs, docs
    PICKLE_InvertedIndex_FILE = open("InvertedIndex.db", 'rb')
    InvertedIndex = pickle.load(PICKLE_InvertedIndex_FILE)
    PICKLE_InvertedIndex_FILE.close()

    PICKLE_docIDs_FILE = open("docIDs.db", 'rb')
    docIDs = pickle.load(PICKLE_docIDs_FILE)
    PICKLE_docIDs_FILE.close()

    PICKLE_docs_FILE = open("docs.db", 'rb')
    docs = pickle.load(PICKLE_docs_FILE)
    PICKLE_docs_FILE.close()
########################
docs = []
docIDs = []
InvertedIndex = {}
load()

stemmer = PorterStemmer.PorterStemmer()
#<getting results for a query
query = 'Antarctica exploration'
qwords = query.strip().split()
qterms = []
qterms1 = []
for qword in qwords:
    qword = qword.lower()
    if qword in stopwords:
        continue
    qterm = stemmer.stem(qword,0,len(qword)-1) 
    qterms1.append(qterm)
qterms = list(set(qterms1))


#getting posting lists for each qterms & merging them
prev = set()
i = 0
for qterm in qterms:
    if InvertedIndex.has_key(qterm):
        if i == 0:
            prev = set(InvertedIndex[qterm])
            i = i+1
            continue
        prev = prev.intersection(set(InvertedIndex[qterm]))

results = list(prev)
#</getting results for a query

#We've got the results. Now lets rank them using Cosine similarity.
i = 0
docComponents = []
for doc in results:
        docComponents.append([])

i = 0    
for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)#[TermFrequency, DocumentFrequency]
        docComponents[i].append(vals)
    i = i+1
#Normalization = {}

# forming vectors for each document in the result
i = 0 #document iterator
j = 0 #term iterator
resultDocVectors = []#contains document vector for each result.

for doc in results:
        resultDocVectors.append([])

for i in range(0,len(results)):
    for j in range(0,len(docs[doc])):
        tf = docComponents[i][j][0]['natural']#0:TermFrequency
        idf = docComponents[i][j][1]['idf'] #1:DocumentFrequency        
        resultDocVectors[i].append(tf*idf)

#forming vector for query
qVector = []
qTF = []
qDF = []
for qterm in qterms:
    count = 0
    idx = qterms1.index(qterm)
    while idx < len(qterms1) and qterms1[idx] == qterm:
        count= count+1
        idx = idx+1
    qTF.append(count)
qVector = qTF    


#computing Cosine similarities of all resultDocVectors w.r.t qVector
i = 0
CosineVals = []
for resultDocVector in resultDocVectors:
    doc = results[i]
    CosineVals.append(Cosine(resultDocVector, qVector, doc))
    i = i+1

#ranking as per Cosine Similarities
#this is not "perfect" sorting, as it may not give 100% correct results when multiple docs have the same cosine similarity.
CosineValsCopy = CosineVals
CosineVals.sort()
sortedCosineVals = CosineVals
CosineVals = CosineValsCopy
rankedResults = []
for cval in sortedCosineVals:    
    rankedResults.append(results[CosineVals.index(cval)])
rankedResults.reverse()

#<Evaluation of the system:>

#parsing qrels.txt & getting relevances
# qrels.txt contains columns of the form:
#       qid  iter  docno  rel
#2nd column `iter` can be ignored.
relevances = {}
fh = open("qrels.txt")
lines = fh.readlines()
for line in lines:
    cols = line.strip().split()
    if relevances.has_key(cols[0]):#queryID
        relevances[cols[0]].append(cols[2])#docID
    else:
        relevances[cols[0]] = [cols[2]]
fh.close()

#precision = no. of relevant docs retrieved/total no. of docs retrieved
no_of_relevant_docs_retrieved = set(rankedResults).intersection( set(relevances[queryID]) )
Precision = no_of_relevant_docs_retrieved/len(rankedResults)

#recall = no. of relevant docs retrieved/ total no. of relevant docs
Recall = no_of_relevant_docs_retrieved/len(relevances[queryID])
+15  A: 

It's definitely your code, but since you chose to hide it from us it's impossible for us to help any further. All I can tell you, based on the very scarce info you chose to supply, is that unpickling a dict (in the right way) is much faster than that, and indexing into it (assuming that's what you mean by "searches it for query") is blazingly fast. From this I deduce that the cause of your slowdown must be something else you do, or do wrong, in your code.

Edit: now that you have posted your code I notice, just at a glance, that a lot of nontrivial code is running at module top level. Really horrible practice, and detrimental to performance: put all your nontrivial code into a function, and call that function -- that by itself will give you a tens-of-percent speedup, at zero cost of complication. I must have mentioned that crucial fact at least 20 times just in my Stack Overflow posts, not to mention "Python in a Nutshell" etc -- surely if you care about performance you cannot blithely ignore such easily available and widespread information?!
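
(For illustration, a minimal sketch of that change -- nothing new here, just the question's existing top-level code moved inside a function and called once; the name main is arbitrary:)

def main():
    load()
    # ... all of the query parsing, ranking and evaluation code that is
    # currently at module top level goes here, unchanged; its variables
    # become fast local variables instead of module globals ...

if __name__ == '__main__':
    main()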

More easily-fixed wastes of runtime:

import pickle

Use cPickle instead (if you're not on Python 2.6 or 2.7 but rather on 3.1, there may be other causes of performance issues -- I don't know how finely tuned 3.1 is at this time compared with the awesome performance of 2.6 and 2.7).
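
(A sketch of that swap for Python 2.x; cPickle exposes the same interface as pickle:)

try:
    import cPickle as pickle   # C implementation, much faster (Python 2.x)
except ImportError:
    import pickle              # fall back to the pure-Python module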

All of your global statements are useless except the one in load (not a serious performance hit, but redundant and useless code should be eliminated on principle). You only need a global if you want to bind a module-global variable from within a function, and load is the only one where you're doing that.

More editing:

Now we get to more important stuff: the values in InvertedIndex appear to be lists of docs, so to know how many times a doc appears in one you have to loop through it. Why not make each value a dict mapping doc to number of occurrences instead? No looping (and no looping where you now do the len(set(...)) of a value in InvertedIndex -- a plain len will then be equivalent, and you save the set(...) operation, which implicitly has to loop to do its job). This is a big-O optimization, not "just" the kind of 20%-or-so speedup that the things I've mentioned so far may account for -- i.e., this is more important stuff, as I said. Use the right data structures and algorithms and a lot of minor inefficiencies may become relatively unimportant; use the wrong ones, and there's no saving your code's performance as the input size grows, no matter how cleverly you micro-optimize the wrong data structures and algorithms;-).
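
(A sketch of that structure change, with made-up data, not the question's actual index:)

# Current structure: one list entry per occurrence of the term in a doc.
#   InvertedIndex['whale'] = [3, 3, 3, 7, 12, 12]
# Suggested structure: map each doc to its occurrence count.
#   InvertedIndex['whale'] = {3: 3, 7: 1, 12: 2}
# TF and DF then become constant-time lookups instead of loops:

def TF(term, doc):
    return InvertedIndex[term].get(doc, 0)

def DF(term):
    return len(InvertedIndex[term])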

And more: you repeat your computations a lot, "from scratch" each time -- e.g., look how many times you call TF(term, doc) for each given term and doc (and each call has the crucial inefficiency I just explained too). As the quickest way to fix this huge inefficiency, use memoization -- e.g., with the memoized decorator found here.
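
(The decorator itself is behind the link in the original answer; as a minimal hand-rolled illustration of the same idea -- the names _tf_cache and TF_memo are hypothetical:)

_tf_cache = {}

def TF_memo(term, doc):
    # memoization: compute TF(term, doc) only once per (term, doc) pair
    key = (term, doc)
    if key not in _tf_cache:
        _tf_cache[key] = TF(term, doc)
    return _tf_cache[key]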

OK, it's getting late for me and I'd better head off to bed -- I hope some or all of the above suggestions were of some use to you!

Alex Martelli
@Alex: Isn't it better anyway to completely get rid of all uses of global variables, where possible? I have no working experience in Python, but in PHP I also see this often, and I remember clearly that my boss would cut my head off if he saw me using global variables (even when necessary).
erikb
@erikb, yep, removing globals would indeed be better, but it requires substantially more refactoring effort than the changes I've mentioned so far, and mostly for code-maintainability advantages, not so much performance (unless all the globals could be turned into local variables of a single function and passed as arguments to all other functions requiring them, but that's even _more_ code refactoring for the resulting minor performance gain). I've been focusing mostly on performance in this answer because that was the OP's crucial current problem.
Alex Martelli
@Alex: why is function level code executed faster than module level code? Forgive me, I'm unfamiliar with the inner workings of Python.
Rafe Kettler
@Rafe Kettler: module-level *variables* are stored in a Python dict(ionary), while function-level *variables* are compiled down to array indexing in C. Dictionary retrieval is slightly slower than array indexing, though both are O(1).
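
(An illustrative sketch of that difference, using the standard dis module:)

import dis

x = 1              # module-level name: looked up in the module's dict
def f():
    y = 2          # local name: stored in a fixed slot of the frame
    return x + y

dis.dis(f)         # x is fetched with LOAD_GLOBAL, y with LOAD_FAST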
Lie Ryan
claws
@claws - That is what they're for. But this means poorly organized code will perform poorly in Python, just like poorly indented code won't work.
Omnifarious
@Lie: thanks so much! This is a Protip for sure.
Rafe Kettler
+4  A: 
aaronasterling
I think it would also be instructive to mention that even small changes in this function can improve the code: replacing `idx=idx+1` with `idx+=1`, replacing the `while` with a meaningful `for` loop, and, last but not least, a change like yours that brings it all together in an understandable one-liner.
erikb
claws
@erikb: `replacing the while with a meaningful for loop` does this also contribute to performance improvement? :O
claws
@claws: Yes, having some basic programming skills certainly increases the average performance of your programs. My suggestions should work in C and in any higher-level language; about assembly I don't know, because I've never had to use it. As for your book question: nearly all good programming books can help you there. Look, for example, [here](http://stackoverflow.com/questions/194812/list-of-freely-available-programming-books)
erikb
@claws: I recommend The Python Cookbook (http://oreilly.com/catalog/9780596001674) by our own great Alex Martelli. Thanks a lot, Alex.
pillmuncher
+1  A: 

Does Python cache function results? I don't think so. If not, it's a bad idea to run a looping function like TF(term,doc) many times in getValues4Term(); just putting the result in a variable probably gives you a big speedup already. That repeated computation, combined with

for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)

might already be the biggest speed problem you have.
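
(A sketch of that change applied to the question's getValues4Term -- same formulas, but each helper is called exactly once and its result reused:)

def getValues4Term(term, doc):
    tf = TF(term, doc)
    max_tf = maxTF(term, doc)
    avg_tf = avgTF(term, doc)
    df = DF(term)

    TermFrequency = {}
    TermFrequency['natural'] = tf
    TermFrequency['log'] = 1 + math.log(tf)
    TermFrequency['aug'] = 0.5 + float(0.5 * tf / max_tf)
    TermFrequency['bool'] = 1 if tf > 0 else 0
    TermFrequency['log_avg'] = float(1 + math.log(tf)) / (1 + math.log(avg_tf))

    DocumentFrequency = {}
    DocumentFrequency['no'] = 1
    DocumentFrequency['idf'] = math.log(len(docs) / df)
    DocumentFrequency['probIDF'] = max([0, math.log(float(len(docs) - df) / df)])
    return [TermFrequency, DocumentFrequency]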

erikb
Good point! Can you tell me what I should look into to do the caching?
claws
Alex already describes [how to use memoization](http://stackoverflow.com/questions/3801072/kindly-review-the-python-code-to-boost-its-performance/3801100#3801100). That works fine for Python. A solution that works in all languages is to store the result of `TF(term,doc)` in a variable at the beginning of your function and then use that variable instead of repeated `TF(term,doc)` calls.
erikb
+1  A: 

It's nice that you posted your code in response to people asking you to, but unless they run it and profile it, the best they can do is guess. I can guess too, but even if guesses are "good" or "educated", they are not a good way to find performance problems.

I would rather refer you to a technique that will pinpoint the problem. That will work better than guessing or asking others to guess. Once you've found out for yourself exactly where the problem is, you can decide whether to use memoization or whatever else to fix it.

Usually there's more than one problem. If you repeat the process of finding and removing performance problems, you'll approach a real optimum.
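
(The technique the answer refers to isn't reproduced here. As one generic, standard-library way to measure where the time actually goes in Python, cProfile can be pointed at a single query run -- a sketch, assuming the top-level code has been wrapped in a main() function as suggested earlier:)

import cProfile
import pstats

cProfile.run('main()', 'search.prof')    # profile one full query run
pstats.Stats('search.prof').sort_stats('cumulative').print_stats(20)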

Mike Dunlavey
+2  A: 

This is more micro-optimization, in the spirit of @aaronasterling, above. Still, I think these observations are worth considering.

Use appropriate datatypes

stopwords should be a set. You can't repeatedly search a list and expect it to be fast.

Use more sets. They are iterable, like lists, but when you have to search them, they are way faster than lists.
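
(For example, one line of setup is enough -- a sketch; the `in` tests elsewhere don't need to change:)

stopwords = set(stopwords)   # hash-based membership test instead of a linear scan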


List comprehensions

resultSquares = [k*k for k in resultDocVector]
qSquares = [k*k for k in qVector]
TFs = [TF(term,doc) for term in docs[doc]]

Generators

Turn this:

for qword in qwords:
    qword = qword.lower()
    if qword in stopwords:
        continue
    qterm = stemmer.stem(qword,0,len(qword)-1) 
    qterms1.append(qterm)
qterms = list(set(qterms1))

into this:

qworditer = (qword.lower() for qword in qwords)
qtermiter = (stemmer.stem(qword, 0, len(qword) - 1) for qword in qworditer if qword not in stopwords)
qterms1 = set(qtermiter)

Use generators and reduce():

Turn this:

prev = set()
i = 0
for qterm in qterms:
    if InvertedIndex.has_key(qterm):
        if i == 0:
            prev = set(InvertedIndex[qterm])
            i = i+1
            continue
        prev = prev.intersection(set(InvertedIndex[qterm]))

results = list(prev)

Into this:

qtermiter = (set(InvertedIndex[qterm]) for qterm in qterms if qterm in InvertedIndex)
results = reduce(set.intersection, qtermiter)
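
(One caveat: if none of the query terms appear in InvertedIndex, qtermiter is empty and reduce raises a TypeError, so that case needs a guard.)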

Use list comprehensions

Instead of this:

i = 0
docComponents = []
for doc in results:
        docComponents.append([])

i = 0    
for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)#[TermFrequency, DocumentFrequency]
        docComponents[i].append(vals)
    i = i+1

Write this:

docComponents = [[getValues4Term(term, doc) for term in docs[doc]] for doc in results]

This code makes no sense:

for doc in results:
        resultDocVectors.append([])

for i in range(0,len(results)):
    for j in range(0,len(docs[doc])):
        tf = docComponents[i][j][0]['natural']#0:TermFrequency
        idf = docComponents[i][j][1]['idf'] #1:DocumentFrequency        
        resultDocVectors[i].append(tf*idf)

The value `len(docs[doc])` depends on `doc`, and the value of `doc` is whatever was last assigned in the loop `for doc in results`.
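
(What was presumably intended -- a guess at the fix, not something from the answer -- is to index by the current result document:)

resultDocVectors = []
for i, doc in enumerate(results):
    resultDocVectors.append([])
    for j in range(len(docs[doc])):
        tf = docComponents[i][j][0]['natural']   # 0: TermFrequency
        idf = docComponents[i][j][1]['idf']      # 1: DocumentFrequency
        resultDocVectors[i].append(tf * idf)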


Use collections.defaultdict

Instead of this:

relevances = {}
fh = open("qrels.txt")
lines = fh.readlines()
for line in lines:
    cols = line.strip().split()
    if relevances.has_key(cols[0]):#queryID
        relevances[cols[0]].append(cols[2])#docID
    else:
        relevances[cols[0]] = [cols[2]]

Write this (the question's qrels.txt has four whitespace-separated columns -- qid, iter, docno, rel -- so unpack all four):

from collections import defaultdict
relevances = defaultdict(list)
with open("qrels.txt") as fh:
    lineiter = (line.strip().split() for line in fh)
    for queryID, _iter, docID, _rel in lineiter:
        relevances[queryID].append(docID)

As many other people have said, memoize your computations.


2010-10-21: An update on the stopwords thing.

from datetime import datetime

stopwords = ['a' , 'a\'s' , 'able' , 'about' , 'above' , 'according' , 'accordingly' , 'across' , 'actually' , 'after' , 'afterwards' , 'again' , 'against' , 'ain\'t' , 'all' , 'allow' , 'allows' , 'almost' , 'alone' , 'along' , 'already' , 'also' , 'although' , 'always' , 'am' , 'among' , 'amongst' , 'an' , 'and' , 'another' , 'any' , 'anybody' , 'anyhow' , 'anyone' , 'anything' , 'anyway' , 'anyways' , 'anywhere' , 'apart' , 'appear' , 'appreciate' , 'appropriate' , 'are' , 'aren\'t' , 'around' , 'as' , 'aside' , 'ask' , 'asking' , 'associated' , 'at' , 'available' , 'away' , 'awfully' , 'b' , 'be' , 'became' , 'because' , 'become' , 'becomes' , 'becoming' , 'been' , 'before' , 'beforehand' , 'behind' , 'being' , 'believe' , 'below' , 'beside' , 'besides' , 'best' , 'better' , 'between' , 'beyond' , 'both' , 'brief' , 'but' , 'by' , 'c' , 'c\'mon' , 'c\'s' , 'came' , 'can' , 'can\'t' , 'cannot' , 'cant' , 'cause' , 'causes' , 'certain' , 'certainly' , 'changes' , 'clearly' , 'co' , 'com' , 'come' , 'comes' , 'concerning' , 'consequently' , 'consider' , 'considering' , 'contain' , 'containing' , 'contains' , 'corresponding' , 'could' , 'couldn\'t' , 'course' , 'currently' , 'd' , 'definitely' , 'described' , 'despite' , 'did' , 'didn\'t' , 'different' , 'do' , 'does' , 'doesn\'t' , 'doing' , 'don\'t' , 'done' , 'down' , 'downwards' , 'during' , 'e' , 'each' , 'edu' , 'eg' , 'eight' , 'either' , 'else' , 'elsewhere' , 'enough' , 'entirely' , 'especially' , 'et' , 'etc' , 'even' , 'ever' , 'every' , 'everybody' , 'everyone' , 'everything' , 'everywhere' , 'ex' , 'exactly' , 'example' , 'except' , 'f' , 'far' , 'few' , 'fifth' , 'first' , 'five' , 'followed' , 'following' , 'follows' , 'for' , 'former' , 'formerly' , 'forth' , 'four' , 'from' , 'further' , 'furthermore' , 'g' , 'get' , 'gets' , 'getting' , 'given' , 'gives' , 'go' , 'goes' , 'going' , 'gone' , 'got' , 'gotten' , 'greetings' , 'h' , 'had' , 'hadn\'t' , 'happens' , 'hardly' , 'has' , 'hasn\'t' , 'have' , 'haven\'t' , 'having' , 'he' , 'he\'s' , 'hello' , 'help' , 'hence' , 'her' , 'here' , 'here\'s' , 'hereafter' , 'hereby' , 'herein' , 'hereupon' , 'hers' , 'herself' , 'hi' , 'him' , 'himself' , 'his' , 'hither' , 'hopefully' , 'how' , 'howbeit' , 'however' , 'i' , 'i\'d' , 'i\'ll' , 'i\'m' , 'i\'ve' , 'ie' , 'if' , 'ignored' , 'immediate' , 'in' , 'inasmuch' , 'inc' , 'indeed' , 'indicate' , 'indicated' , 'indicates' , 'inner' , 'insofar' , 'instead' , 'into' , 'inward' , 'is' , 'isn\'t' , 'it' , 'it\'d' , 'it\'ll' , 'it\'s' , 'its' , 'itself' , 'j' , 'just' , 'k' , 'keep' , 'keeps' , 'kept' , 'know' , 'knows' , 'known' , 'l' , 'last' , 'lately' , 'later' , 'latter' , 'latterly' , 'least' , 'less' , 'lest' , 'let' , 'let\'s' , 'like' , 'liked' , 'likely' , 'little' , 'look' , 'looking' , 'looks' , 'ltd' , 'm' , 'mainly' , 'many' , 'may' , 'maybe' , 'me' , 'mean' , 'meanwhile' , 'merely' , 'might' , 'more' , 'moreover' , 'most' , 'mostly' , 'much' , 'must' , 'my' , 'myself' , 'n' , 'name' , 'namely' , 'nd' , 'near' , 'nearly' , 'necessary' , 'need' , 'needs' , 'neither' , 'never' , 'nevertheless' , 'new' , 'next' , 'nine' , 'no' , 'nobody' , 'non' , 'none' , 'noone' , 'nor' , 'normally' , 'not' , 'nothing' , 'novel' , 'now' , 'nowhere' , 'o' , 'obviously' , 'of' , 'off' , 'often' , 'oh' , 'ok' , 'okay' , 'old' , 'on' , 'once' , 'one' , 'ones' , 'only' , 'onto' , 'or' , 'other' , 'others' , 'otherwise' , 'ought' , 'our' , 'ours' , 'ourselves' , 'out' , 'outside' , 'over' , 'overall' , 'own' , 'p' , 'particular' , 
'particularly' , 'per' , 'perhaps' , 'placed' , 'please' , 'plus' , 'possible' , 'presumably' , 'probably' , 'provides' , 'q' , 'que' , 'quite' , 'qv' , 'r' , 'rather' , 'rd' , 're' , 'really' , 'reasonably' , 'regarding' , 'regardless' , 'regards' , 'relatively' , 'respectively' , 'right' , 's' , 'said' , 'same' , 'saw' , 'say' , 'saying' , 'says' , 'second' , 'secondly' , 'see' , 'seeing' , 'seem' , 'seemed' , 'seeming' , 'seems' , 'seen' , 'self' , 'selves' , 'sensible' , 'sent' , 'serious' , 'seriously' , 'seven' , 'several' , 'shall' , 'she' , 'should' , 'shouldn\'t' , 'since' , 'six' , 'so' , 'some' , 'somebody' , 'somehow' , 'someone' , 'something' , 'sometime' , 'sometimes' , 'somewhat' , 'somewhere' , 'soon' , 'sorry' , 'specified' , 'specify' , 'specifying' , 'still' , 'sub' , 'such' , 'sup' , 'sure' , 't' , 't\'s' , 'take' , 'taken' , 'tell' , 'tends' , 'th' , 'than' , 'thank' , 'thanks' , 'thanx' , 'that' , 'that\'s' , 'thats' , 'the' , 'their' , 'theirs' , 'them' , 'themselves' , 'then' , 'thence' , 'there' , 'there\'s' , 'thereafter' , 'thereby' , 'therefore' , 'therein' , 'theres' , 'thereupon' , 'these' , 'they' , 'they\'d' , 'they\'ll' , 'they\'re' , 'they\'ve' , 'think' , 'third' , 'this' , 'thorough' , 'thoroughly' , 'those' , 'though' , 'three' , 'through' , 'throughout' , 'thru' , 'thus' , 'to' , 'together' , 'too' , 'took' , 'toward' , 'towards' , 'tried' , 'tries' , 'truly' , 'try' , 'trying' , 'twice' , 'two' , 'u' , 'un' , 'under' , 'unfortunately' , 'unless' , 'unlikely' , 'until' , 'unto' , 'up' , 'upon' , 'us' , 'use' , 'used' , 'useful' , 'uses' , 'using' , 'usually' , 'uucp' , 'v' , 'value' , 'various' , 'very' , 'via' , 'viz' , 'vs' , 'w' , 'want' , 'wants' , 'was' , 'wasn\'t' , 'way' , 'we' , 'we\'d' , 'we\'ll' , 'we\'re' , 'we\'ve' , 'welcome' , 'well' , 'went' , 'were' , 'weren\'t' , 'what' , 'what\'s' , 'whatever' , 'when' , 'whence' , 'whenever' , 'where' , 'where\'s' , 'whereafter' , 'whereas' , 'whereby' , 'wherein' , 'whereupon' , 'wherever' , 'whether' , 'which' , 'while' , 'whither' , 'who' , 'who\'s' , 'whoever' , 'whole' , 'whom' , 'whose' , 'why' , 'will' , 'willing' , 'wish' , 'with' , 'within' , 'without' , 'won\'t' , 'wonder' , 'would' , 'would' , 'wouldn\'t' , 'x' , 'y' , 'yes' , 'yet' , 'you' , 'you\'d' , 'you\'ll' , 'you\'re' , 'you\'ve' , 'your' , 'yours' , 'yourself' , 'yourselves' , 'z' , 'zero']
print len(stopwords)
dictfile = '/usr/share/dict/american-english-huge'
with open(dictfile) as f:
    words = [line.strip() for line in f]

print len(words)

s = datetime.now()
total = sum(1 for word in words if word in stopwords)
e = datetime.now()
elapsed = e - s
print elapsed, total

s = datetime.now()
stopwords_set = set(stopwords)
total = sum(1 for word in words if word in stopwords_set)
e = datetime.now()
elapsed = e - s
print elapsed, total

I get these results:

# Using list
>>> print elapsed, total
0:00:06.902529 542

# Using set
>>> print elapsed, total
0:00:00.050676 542

Same number of results, but one runs almost 140 times faster. Granted, you probably don't have so many words to compare against your stopwords, and 6 seconds is negligible against your 30+ minute runtime. It does underline, though, that using appropriate data structures speeds up your code.

hughdbrown