Profiling shows this is the slowest segment of my code for a little word game I wrote:

def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference

def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]

"distance" is called over 5 million times, majority of which is from getchildren, which is supposed to get all words in the wordlist that differ from "word" by exactly 1 letter.

I'm fairly new to Python (started learning it 3 days ago) so comments on naming conventions or other style things also appreciated.

Edit: Forgot to mention that wordlist is pre-filtered to only have words containing the same number of letters as "word" so it's guaranteed that "word1" and "word2" have the same number of chars.

Results:

Thanks everyone; with combinations of different suggestions I got the program running twice as fast now (on top of the optimizations I did on my own before asking, so approximately a 4x speed increase from my initial implementation).

I tested with 2 sets of inputs, which I'll call A and B.

Optimization1: From

for i in range(len(word1)):
    if word1[i] != word2[i]:
        difference += 1
return difference

to

for x, y in zip(word1, word2):
    if x != y:
        difference += 1
return difference

Got execution time from 11.92 to 9.18 for input A, and 79.30 to 74.59 for input B

Optimization2: Added a separate method for checking whether two words differ by exactly one letter, in addition to the distance method (which I still needed elsewhere for the A* heuristic)

def is_neighbors(word1, word2):
    different = False
    for c1, c2 in zip(word1, word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different

Got execution time from 9.18 to 8.83 for input A, and 74.59 to 70.14 for input B

Optimization3: Big winner here was to use izip instead of zip

Got execution time from 8.83 to 5.02 for input A, and 70.14 to 41.69 for input B
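
For reference, the izip version of distance is a one-line change plus an import; izip (the lazy zip from itertools in Python 2) yields the pairs on demand instead of building a throwaway list on every call:

from itertools import izip

def distance(word1, word2):
    difference = 0
    # izip produces the (x, y) pairs lazily instead of allocating a list
    for x, y in izip(word1, word2):
        if x != y:
            difference += 1
    return difference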

I could probably do better writing it in a lower level language, but I'm happy with this for now. Thanks everyone!

Edit again: More results. Using Mark's method of checking the case where the first letter doesn't match got it down from 5.02 -> 3.59 and 41.69 -> 29.82

Building on that and incorporating izip instead of range, I ended up with this:

from itertools import izip

def is_neighbors(word1, word2):
    if word1[0] != word2[0]:
        # first letters differ: the rest must match exactly
        return word1[1:] == word2[1:]
    different = False
    for x, y in izip(word1[1:], word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different

Which squeezed a little bit more, bringing the times down from 3.59 -> 3.38 and 29.82 -> 27.88

Even more results!

Trying Sumudu's suggestion that I generate a list of all strings that are 1 letter off from "word" and then check which ones are in the wordlist, instead of using the is_neighbors function, I ended up with this:

def one_letter_off_strings(word):
    import string
    dif_list = []
    for i in xrange(len(word)):
        # substitute every other lowercase letter at position i
        dif_list.extend(word[:i] + l + word[i+1:]
                        for l in string.ascii_lowercase if l != word[i])
    return dif_list

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return ( w for w in oneoff if w in wordlist )

This ended up being slower (3.38 -> 3.74 and 27.88 -> 34.40), but it seemed promising. At first I thought the part I'd need to optimize was "one_letter_off_strings", but profiling showed otherwise; the slow part was in fact

( w for w in oneoff if w in wordlist )

I wondered if there'd be any difference if I switched "oneoff" and "wordlist" and did the comparison the other way, when it hit me that I was looking for the intersection of the 2 lists. I replaced that with:

return set(oneoff) & set(wordlist)

Bam! 3.74 -> 0.23 and 34.40 -> 2.25

This is truly amazing; the total speed difference from my original naive implementation: 23.79 -> 0.23 and 180.07 -> 2.25, so approximately 80 to 100 times faster than the original implementation.
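
The reason the intersection wins so decisively is that Python sets are hash tables: a membership test costs roughly constant time regardless of the set's size, while "w in wordlist" on a plain list scans linearly. A minimal sketch of the three variants, using oneoff and wordlist from above:

# linear scan of wordlist for each of the ~25 * len(word) candidates
slow = [w for w in oneoff if w in wordlist]

# one-time conversion to a set makes each membership test roughly O(1)
wordset = set(wordlist)
faster = [w for w in oneoff if w in wordset]

# the intersection does the same hash lookups in a single C-level operation
fastest = set(oneoff) & wordset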

If anyone is interested, I made a blog post describing the program and the optimizations made, including one that isn't mentioned here (because it's in a different section of code).

The Great Debate:

Ok, Unknown and I are having a big debate which you can read in the comments of his answer. He claims that the original method (using is_neighbors instead of the sets) would be faster if it was ported to C. I spent 2 hours trying to get a C module I wrote to build and link, without much success, after trying to follow this and this example; it looks like the process is a little different on Windows? I don't know, but I gave up on that. Anyway, here's the full code of the program; the text file comes from the 12dict word list, using the "2+2lemma.txt" file. Sorry if the code's a little messy, this was just something I hacked together. Also, I forgot to strip out commas from the wordlist, so that's actually a bug that you can leave in for the sake of the same comparison, or fix by adding a comma to the list of chars in cleanentries.

from itertools import izip

def unique(seq):
    # order-preserving de-duplication
    seen = {}
    result = []
    for item in seq:
        if item in seen:
            continue
        seen[item] = 1
        result.append(item)
    return result

def cleanentries(li):
    return unique([w.strip('[]') for w in li if w != "->"])

def distance(word1, word2):
    difference = 0
    for x, y in izip(word1, word2):
        if x != y:
            difference += 1
    return difference

def is_neighbors(word1, word2):
    # quick path: if the first letters differ, the rest must match exactly
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x, y in izip(word1[1:], word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different

def one_letter_off_strings(word):
    import string
    dif_list = []
    for i in xrange(len(word)):
        # substitute every other lowercase letter at position i
        dif_list.extend(word[:i] + l + word[i+1:]
                        for l in string.ascii_lowercase if l != word[i])
    return dif_list

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return set(oneoff) & set(wordlist)

def AStar(start, goal, wordlist):
    import Queue
    closedset = []
    openset = [start]
    pqueue = Queue.PriorityQueue(0)
    g_score = {start: 0}    # distance from start along optimal path
    h_score = {start: distance(start, goal)}
    f_score = {start: h_score[start]}
    pqueue.put((f_score[start], start))
    parent_dict = {}
    while openset:
        # pop the open node with the lowest f-score
        x = pqueue.get(False)[1]
        if x == goal:
            return reconstruct_path(parent_dict, goal)
        openset.remove(x)
        closedset.append(x)
        for y in getchildren(x, wordlist):
            if y in closedset:
                continue
            temp_g_score = g_score[x] + 1
            temp_is_better = False
            appended = False
            if y not in openset:
                openset.append(y)
                appended = True
                h_score[y] = distance(y, goal)
                temp_is_better = True
            elif temp_g_score < g_score[y]:
                temp_is_better = True
            if temp_is_better:
                parent_dict[y] = x
                g_score[y] = temp_g_score
                f_score[y] = g_score[y] + h_score[y]
                if appended:
                    pqueue.put((f_score[y], y))
    return None


def reconstruct_path(parent_dict, node):
    if node in parent_dict:
        p = reconstruct_path(parent_dict, parent_dict[node])
        p.append(node)
        return p
    else:
        return []

wordfile = open("2+2lemma.txt")
wordlist = cleanentries(wordfile.read().split())
wordfile.close()
words = []
while True:
    userentry = raw_input("Hello, enter the 2 words to play with separated by a space:\n ")
    words = [w.lower() for w in userentry.split()]
    if len(words) == 2 and len(words[0]) == len(words[1]):
        break
print "You selected %s and %s as your words" % (words[0], words[1])
wordlist = [ w for w in wordlist if len(words[0]) == len(w)]
answer = AStar(words[0], words[1], wordlist)
if answer is not None:
    print "Minimum number of steps is %s" % (len(answer))
    reply = raw_input("Would you like the answer(y/n)? ")
    if reply.lower() == "y":
        answer.insert(0, words[0])
        print "\n".join(answer)
    else:
        print "Good luck!"
else:
    print "Sorry, there's no answer to yours"
reply = raw_input("Press enter to exit")

I left the is_neighbors method in even though it's not used. This is the method that is proposed to be ported to C. To use it, just replace getchildren with this:

def getchildren(word, wordlist):
    return ( w for w in wordlist if is_neighbors(word, w))

As for getting it to work as a C module, I didn't get that far, but this is what I came up with:

#include "Python.h"

static PyObject *
py_is_neighbor(PyObject *self, PyObject *args)
{
    int length;
    const char *word1, *word2;
    /* "ss#" fills word1, then word2 plus its length */
    if (!PyArg_ParseTuple(args, "ss#", &word1, &word2, &length))
        return NULL;

    int i;
    int different = 0;
    for (i = 0; i < length; i++)
    {
        if (word1[i] != word2[i])
        {
            if (different)
            {
                /* second mismatch: the words are not neighbors */
                return Py_BuildValue("i", 0);
            }
            different = 1;
        }
    }
    return Py_BuildValue("i", different);
}

PyMethodDef methods[] = {
    {"isneighbor", py_is_neighbor, METH_VARARGS, "Returns whether words are neighbors"},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC
initisneighbor(void)
{
    /* the init function must be named init<modulename> exactly */
    Py_InitModule("isneighbor", methods);
}
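
For anyone who wants to try building the module, the usual route is a small distutils script; this is only a sketch, and it assumes the C source above is saved as isneighbor.c:

# setup.py -- a sketch; assumes the C source above is in isneighbor.c
from distutils.core import setup, Extension

setup(name="isneighbor",
      version="1.0",
      ext_modules=[Extension("isneighbor", ["isneighbor.c"])])

Then build with "python setup.py build_ext --inplace". On Windows the compiler has to match the one your Python was built with (MinGW users can pass --compiler=mingw32).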

I profiled this using:

python -m cProfile "Wordgame.py"

And the time recorded was the total time of the AStar method call. The fast input set was "verse poets" and the slow input set was "poets verse". Timings will obviously vary between machines, so if anyone does end up trying this, please give a comparison of results for the program as-is, as well as with the C module.

+2  A: 
Unknown
Won't "i in word1" return the characters, rather than the indexes?
Mark Ransom
Yes, I consider it more pythonic to iterate over a string that way instead of doing range(len(string)).
Unknown
Hmm, for the first point you're right, although I'd have to rename the function. As for iterating the string instead of the indexes, how will I compare whether that letter in word1 matches the corresponding letter in word2?
Davy8
I suppose you have a point there: you could do "for i, c in enumerate(word1): c != word2[i]". That might look a bit clunky, so instead I would suggest that you use "for i in xrange(len(word1))", because xrange generates numbers on demand instead of building a list that takes up memory.
Unknown
I read your explanation a few times but I didn't quite follow it; would you mind explaining it a little more in depth? Also, I'm not terribly familiar with C and I've been working in high-level languages for the past few years (closest has been learning C++ in high school 6 years ago), so if you could provide me with a code sample I could try it out and let you know the results. That being said, I don't have any idea how sets are implemented under the hood in Python, but they seem to be super optimized, and I'd be surprised if a not too complicated C module could do better. But gotta profile.
Davy8
Also, I don't know if you're counting breaking on a difference of 2 or more in your optimization, but from Mark's answer I already break on a difference of 2 or more, and actually, if the difference is in the first letter, it breaks after the first letter. Even in spite of all those optimizations, changing to using sets just blew them out of the water. I'm interested in trying your idea though
Davy8
After reading over what you said a few times, I realize why my results appear to go against your numbers (after all, I can't possibly imagine that comparing char by char in Python can be 10 times slower than comparing char by char in C). You're comparing the cost of calling distance to the cost of finding the 1-letter-off words and doing a set intersection on them in the getchildren method. That isn't the right comparison to make. If you look at the previous algorithm, the distance method is called from the getchildren method for each word in the wordlist.
Davy8
That means that for each invocation of getchildren, the distance (or is_neighbors) method was called len(wordlist) times. The filtered wordlist for 5-letter words is about 4500 words, and getchildren is called about 1100 times, and that's where the 5 million calls come from. Sumudu's method bypasses that altogether; creating the list and finding the intersection only occurs about 1100 times. So even though the operation may be slower, that's dwarfed by it only being called 1/4500 as often.
Davy8
In an apples-to-apples comparison, checking char by char with all the shortcuts is 597 times faster than generating the sets + getting the intersection; however, that doesn't make up for it being called 4500 times more often.
Davy8
@Davy8, it is the right comparison to make. How do you think finding the intersection is done? How do you think a filtered wordlist is made? It's done by comparing each of the 5 million words to each of the words in your set. There is no magic shortcut.
Unknown
There's only 4500 total words (because the wordlist is pre-filtered for 5-letter words before even being passed to the algorithm), being intersected with 125 one-letter-off words. It seemed that you were saying intersection is log(smaller), or in other words log(125, 2). Compare this to, again assuming what you said, comparing words that break after L/2 letters; I'll round this down to 2, even though for a 5-letter word it's more likely to be 3. This comparison is done 4500 times, so 9000. log(125,2) is about 6.9, and log(4500,2) is about 12. Lemme know if I misinterpreted your numbers.
Davy8
Furthermore, ignoring the theoretical and looking at the results. How would you implement that is_neighbor method with char-by-char comparison in C? Would it be significantly different from my Python implementation shown above? I have trouble believing that the implementation in Python could be 10 times less efficient than whatever implementation I could come up with in C. Do you believe that it's possible for a C implementation of a char-by-char string comparison to be 10x faster than the same thing in Python? My gut tells me no, but I haven't bothered to figure out how to try it yet.
Davy8
@ Davy8 I have to go now, but I guarantee you that implementing my method in C will be faster than using sets.
Unknown
Spent about half an hour relearning pointers and writing it in C, and then another 2 hours unsuccessfully trying to get it to build and integrate into Python, finding out about an hour in that the 2 tutorials I looked at apparently only work on *nix. It keeps telling me "error: none", but nothing happens. Anyway, since I can't get that working, back to the theoretical. The word list is presorted, so going with what you said then yields 1517. Compare to 9000, which assumes the best case that each comparison terminates after 2 chars. Assuming the set operation needs to check char by char as well for
Davy8
all 5 chars in the 5-letter word, that still yields 7585. It should be able to do better than char-by-char comparison since the list is sorted, so it should be faster than that even. Let me know if I missed something.
Davy8
@Davy8 it's not 9000, because that would mean each comparison is searching 200% of a word. This is not the case. It's more like 4500/2, which is a little bigger, but I believe that differences of 2+ will be found more often before the half-way mark, and without the overhead of creating the 150 different words and the binary search. Where is the dictionary you used?
Unknown
@ Davy8 also note that at a wordlist of 3000, with 5 letter words, the value would be an estimated 1500 comparisons while the set method would be 150* log(1500,2) = 1582
Unknown
Your assumption is false: looking up a string in a set isn't O(log n), it's O(1), because sets are based on a hash table. Check out http://blog.tplus1.com/index.php/2008/03/17/pythons-hash-lookup-is-insanely-good/
Mark Ransom
@Unknown I'm counting char comparisons, and by definition you can't find a 2-char difference sooner than counting 2 chars. And the 7585 number is 125 * log(4500,2) * 5 (chars/word), which is an overestimate (again using your numbers), compared to 4500 words of which at least 2 chars need to be compared, yielding 9000 char comparisons. That aside though, in my own testing just now, following the link Mark gave, it appears as if searching a set for a string is O(1). Searching 10000 times for the first element of a set and a list of size 9000 took 20ms each. Searching for the 4500th element took 18ms vs 1.8s.
Davy8
The test code is as follows:

l = ["foo_%s" % i for i in range(900000)]
s = set(l)
def searchset(term, seq): return term in seq
def searchlist(term, seq): return term in seq
for i in xrange(10000): searchset("foo_4500", s)
for i in xrange(10000): searchlist("foo_4500", l)

Timing was done using cProfile, only counting time spent in the searchset and searchlist methods. The first test was searching for "foo_0" since that's the first element in the list; times were identical. Searching for an arbitrary string in a set of any size seems constant.
Davy8
Hmm, I thought if I code-formatted in a question and pasted it into a comment it'd work, but I guess not. Anyway, searching a set of size 900,000 for element #85714 takes 21ms, whereas the same search in a list takes 45.85 seconds.
Davy8
@Davy8, linear searching a list like that is far too different from what we are doing. One difference is that the numbers are very close together, while most of your words are very different. In your Python tests, searching for the presence of a string requires iterating almost the whole word. Compare with testing if a word differs by 2+: you are likely to detect this very soon. As for Mark's remark about using a hash table, it does diminish my point a bit, but think about this: the complexity is O(1) with respect to the number of words in the dictionary, but it is O(n) where n is the
Unknown
word length, because you need to feed every char into a hash function. I am not going to go further into the details of the operations of a hash table (which has many other hidden costs besides being O(n) for key length, e.g. collided buckets, which bring performance closer to O(n)), but I am sure my point still stands for higher values of L (letter length) and lower values of D (words in the dictionary).
Unknown
A: 

I don't know if it will significantly affect your speed, but you could start by turning the list comprehension into a generator expression. It's still iterable so it shouldn't be much different in usage:

def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]

to

def getchildren(word, wordlist):
    return ( w for w in wordlist if distance(word, w) == 1 )

The main problem is that a list comprehension constructs the entire list in memory and can take up quite a bit of space, whereas the generator creates the values on the fly, so there is no need to store the whole thing.

Also, following on Unknown's answer, this may be a more "pythonic" way of writing distance():

def distance(word1, word2):
    difference = 0
    for x, y in zip(word1, word2):
        if x != y:
            difference += 1
    return difference

But it's unclear what's intended when len(word1) != len(word2); zip will only return as many character pairs as the shortest word has. (Which could turn out to be an optimization...)
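
For example, zip simply stops at the end of the shorter input:

zip("hello", "help")    # [('h', 'h'), ('e', 'e'), ('l', 'l'), ('l', 'p')] -- the trailing 'o' is never compared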

Dan Olson
+8  A: 

Your distance function is calculating the total distance, when you really only care about whether the distance is 1. In the majority of cases you'll know it's >1 within a few characters, so you could return early and save a lot of time.

Beyond that, there might be a better algorithm, but I can't think of it.

Edit: Another idea.

You can make 2 cases, depending on whether the first character matches. If it doesn't match, the rest of the word has to match exactly, and you can test for that in one shot. Otherwise, do it similarly to what you were doing. You could even do it recursively, but I don't think that would be faster.

def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    same = True
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if same:
                same = False
            else:
                return False
    return not same

Edit 2: I've deleted the check to see if the strings are the same length, since you say it's redundant. Running Ryan's tests on my own code and on the is_neighbors function provided by MizardX, I get the following:

  • Original distance(): 3.7 seconds
  • My DifferentByOne(): 1.1 seconds
  • MizardX's is_neighbors(): 3.7 seconds

Edit 3: (Probably getting into community wiki territory here, but...)

Trying your final definition of is_neighbors() with izip instead of zip: 2.9 seconds.

Here's my latest version, which still times at 1.1 seconds:

def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if different:
                return False
            different = True
    return different
Mark Ransom
"if word1[0] != word2[0]: return word1[1:] == word2[1:]" is likely to make it slower, because slicing an immutable string creates a new copy of the string in memory.
Unknown
That seems slower than the original code, with a really basic benchmark. Using the timeit module, distance("abc", "abd") took 4.03 seconds, DifferentByOne("abc", "abd") took 5.02
dbr
Is indexing each character also a slicing operation? That would probably be slow, too.
Mark Ransom
Those benchmarks repeated the code 1 million times, using "import timeit; t = timeit.Timer("code").timeit()" on Python 2.5.1
dbr
Seeing all the responses with zip or izip, I see I still have some learning to do. I'm still thinking in C/C++.
Mark Ransom
Wow, I didn't think that "if word1[0] != word2[0]: return word1[1:] == word2[1:]" would make such a big difference, but it looks like it does. Well done sir. 5.02 -> 3.59 and 41.69 -> 29.82
Davy8
It's a classic optimization technique - peel off the common case and give it a faster treatment. You would expect the optimized case to happen 25/26 of the time, on average.
Mark Ransom
Oh, and I can see that I pasted the wrong code into my answer. I actually did test one version with izip instead of range.
Mark Ransom
+4  A: 
from itertools import izip

def is_neighbors(word1,word2):
    different = False
    for c1,c2 in izip(word1,word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different

Or maybe in-lining the izip code:

def is_neighbors(word1,word2):
    different = False
    next1 = iter(word1).next
    next2 = iter(word2).next
    try:
        while 1:
            if next1() != next2():
                if different:
                    return False
                different = True
    except StopIteration:
        pass
    return different

And a rewritten getchildren:

def iterchildren(word, wordlist):
    return ( w for w in wordlist if is_neighbors(word, w) )
  • izip(a,b) returns an iterator over pairs of values from a and b.
  • zip(a,b) returns a list of pairs from a and b.
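
Concretely, in Python 2:

from itertools import izip
zip("ab", "xy")     # [('a', 'x'), ('b', 'y')] -- the whole list of pairs is built up front
izip("ab", "xy")    # <itertools.izip object>  -- pairs are produced one at a time, on demand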
MizardX
what's the difference between izip and zip?
Davy8
Cool, seems a lot faster
Davy8
+1  A: 

For such a simple function that has such a large performance implication, I would probably make a C library and call it using ctypes. One of reddit's founders claims they made the website 2x as fast using this technique.
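
A minimal sketch of what the ctypes route might look like, assuming the comparison loop is compiled into a plain shared library; the library and function names here are hypothetical:

import ctypes

# hypothetical C function: int is_neighbor(const char *w1, const char *w2, int len)
# compiled into isneighbor.so (or isneighbor.dll on Windows)
lib = ctypes.CDLL("./isneighbor.so")
lib.is_neighbor.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
lib.is_neighbor.restype = ctypes.c_int

def is_neighbors(word1, word2):
    # Python 2 str is a byte string, so it passes straight through as char*
    return bool(lib.is_neighbor(word1, word2, len(word1)))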

You can also use psyco on this function, but beware that it can eat up a lot of memory.

Jason Baker
A: 

Try this:

def distance(word1, word2):
  return sum([not c1 == c2 for c1, c2 in zip(word1,word2)])

Also, do you have a link to your game? I like being destroyed by word games

Chris Cameron
Not yet, but I could post it up somewhere. The code probably all fits on one printed page so it's fairly simple
Davy8
Looks like this is actually slightly slower, but it is shorter
Davy8
LOL - I'm a lot of help. Try it again with izip instead of zip, use tuple syntax instead of list, hopefully it improves.
Chris Cameron
+3  A: 

How often is the distance function called with the same arguments? A simple-to-implement optimization would be to use memoization.
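
A minimal memoization sketch, assuming the distance function from the question; whether it helps depends entirely on how often the same pair of words recurs:

_distance_cache = {}

def distance_memo(word1, word2):
    # cache results keyed on the argument pair
    key = (word1, word2)
    if key not in _distance_cache:
        _distance_cache[key] = distance(word1, word2)
    return _distance_cache[key]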

You could probably also create some sort of dictionary with frozensets of letters and lists of words that differ by one, and look up values in that. This data structure could either be stored and loaded through pickle or generated from scratch at startup.
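
One way to realize that idea, as a sketch: precompute each word's neighbor set once, reusing the question's one_letter_off_strings, then dump the map with pickle so later runs can skip the work:

import cPickle as pickle

def build_neighbor_map(wordlist):
    wordset = set(wordlist)
    neighbors = {}
    for word in wordlist:
        # every dictionary word exactly one letter away from this word
        neighbors[word] = set(one_letter_off_strings(word)) & wordset
    return neighbors

# neighbors = build_neighbor_map(wordlist)
# pickle.dump(neighbors, open("neighbors.pkl", "wb"), -1)
# later: neighbors = pickle.load(open("neighbors.pkl", "rb"))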

Short-circuiting the evaluation will only give you gains if the words you are using are very long, since the Hamming distance algorithm you're using is basically O(n), where n is the word length.

I did some experiments with timeit for some alternative approaches that may be illustrative.

Timeit Results

Your Solution

d = """\
def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference
"""
t1 = timeit.Timer('distance("hello", "belko")', d)
print t1.timeit() # prints 6.502113536776391

One Liner

d = """\
from itertools import izip
def hamdist(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in izip(s1,s2))
"""
t2 = timeit.Timer('hamdist("hello", "belko")', d)
print t2.timeit() # prints 10.985101179

Shortcut Evaluation

d = """\
def distance_is_one(word1, word2):
    diff = 0
    for i in xrange(len(word1)):
        if word1[i] != word2[i]:
            diff += 1
        if diff > 1:
            return False
    return diff == 1
"""
t3 = timeit.Timer('distance_is_one("hello", "belko")', d)
print t3.timeit() # prints 6.63337
Ryan
In a question about optimisation, you're the only person (so far) to provide any kind of benchmark... :/
dbr
I could look into the memorization stuff. It's more involved than I'd like to try for tonight, maybe tomorrow, but +1 for it anyway since it seems like it could help.
Davy8
It's spelled "memoization"; there's no "r" in memoize.
S.Lott
+4  A: 

People are mainly going about this by trying to write a quicker function, but there might be another way...

"distance" is called over 5 million times

Why is this? Perhaps a better way to optimise is to try to reduce the number of calls to distance, rather than shaving milliseconds off distance's execution time. It's impossible to tell without seeing the full script, but optimising a specific function is generally unnecessary.

If that is impossible, perhaps you could write it as a C module?

dbr
Excellent point. The data could certainly be organized in a way that would eliminate a number of the checks.
Dan Olson
I thought about it, but really 90-95% of the calls are from the getchildren function shown above, so unless there's a way to have that call distance fewer times, I'm not sure.
Davy8
+1 anyway though
Davy8
lol, well I guess I did end up eliminating the call to distance altogether
Davy8
A: 

First thing to occur to me:

from operator import ne

def distance(word1, word2):
    return sum(map(ne, word1, word2))

which has a decent chance of going faster than other functions people have posted, because it has no interpreted loops, just calls to Python primitives. And it's short enough that you could reasonably inline it into the caller.
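
For example:

from operator import ne
map(ne, "hello", "belko")         # [True, False, False, True, False]
sum(map(ne, "hello", "belko"))    # 2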

For your higher-level problem, I'd look into the data structures developed for similarity search in metric spaces, e.g. this paper or this book, neither of which I've read (they came up in a search for a paper I have read but can't remember).

Darius Bacon
+14  A: 

If your wordlist is very long, might it be more efficient to generate all possible 1-letter differences from 'word', then check which ones are in the list? I don't know any Python, but there should be a suitable data structure for the wordlist that allows log-time lookups.

I suggest this because if your words are reasonable lengths (~10 letters), then you'll only be looking for 250 potential words, which is probably faster if your wordlist is larger than a few hundred words.

Sumudu Fernando
Wow... I think I'm going to have to accept this... you just gave it a 1000% speed boost... Well, with the slight modification of converting the lists to sets and doing an intersection instead of checking what's in the list, but the general idea got me there. I'm glad I tried it out; I was skeptical at first because I thought that generating the lists would take a long time and was worried about figuring out how to optimize that part. Turns out creating the list was really fast, and it was checking whether each one was in the list that took long, and changing that to using sets turbo-charged it.
Davy8
Optimization has always been first and foremost about selecting the proper algorithm, then fine-tuning once you have it. Congrats for thinking out of the box.
Mark Ransom
Interesting find, but I think this is more of a Python-specific optimization (that sets are faster than array access). Think about the cost of searching among 25L candidates (L for letters) as opposed to matching L/2 (assuming you will find 2+ differences by half the word). In a close-to-the-metal language like C, I believe you will find that the set search will be less optimal until log(25) > L/2 - log(L). I graphed the function, and this is true for L > 17 letters long. http://imgur.com/27AUC.png
Unknown
I kind of disappeared from SO for a while, so I didn't see this till now. As I noted in my answer, I don't know Python, but you should be using a wordlist that supports log-time lookups, which is apparently set (like in C++). I disagree with Unknown's analysis because he is not considering the number of words (W) in the wordlist. With my method, the worst case is 25L * log(W), but checking each word in the wordlist doesn't take advantage of the set structure, and takes W * L/2 (using his approximation). So if the word list is long enough, my method is faster (as I noted at the start).
Sumudu Fernando
Just realised that the worst case for mine needs another factor of L, due to the string comparison that happens during the set lookup. Davy8 has mentioned the number of words is around 4500 while the length of each word is around 5, so this is still okay. Unknown: If I were doing this in C++, I would still do it my way with std::set.
Sumudu Fernando
A: 

For this snippet:

for x,y in zip (word1, word2):
    if x != y:
        difference += 1
return difference

I'd use this one:

return sum(1 for i in xrange(len(word1)) if word1[i] != word2[i])

The same pattern would follow all around the provided code...

This is exactly what the OP got rid of.
SilentGhost