ansaurus

Question

Integer comparison in python slows everything down to a crawl

Answer 1

+4 A:

The reason you get the exception is that if the letter count matches for more than one letter, you are trying to remove the word more than once

The reason it is so slow is that you have loops inside loops inside loops.

If you would write a sentence or two about what the function is supposed to do, it would be much easier to refactor it. In the mean time, this would stop you checking whether a word needs to be removed once you have already removed it.

def extract_by_letters(letters, dictionary):
    for word in dictionary[:]:  # bad idea to change this while you iterate over it
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                dictionary.remove(word)
                break
    return dictionary

If dictionary is a set you should get some speed up. If dictionary is a list this should give a huge speedup

gnibbler 2010-02-22 09:52:42

Either you have never tried your set idea or you have access to VoodooPython ... my standard-issue Python 2.6 cracks it: "RuntimeError: Set changed size during iteration". Write out 100 lines: "I must not mutate a container that I'm iterating over" :-)

John Machin 2010-02-22 10:43:17

This code doesn't do what the code in the question does. It is supposed to remove all words where the count of a given letter in a word does not match the count of the same letter in letters. This will stop after it removes one word I would think.

Justin Peel 2010-02-22 16:20:30

@John Machin, Justin - Yeah I was worried about refactoring too much. I noticed the mutating the container bug, but thought I would wait for an explanation of what the code is *supposed* to do. Still waiting...I think Justin is correct, so I will fix that, and I think he is using lists which means `word in dictionary` and `dictionary.remove` which are quite slow operations are being called 1000's of times

gnibbler 2010-02-22 19:26:27

+1 for catching the multiple attempts to remove

CurtainDog 2010-02-23 00:16:51

Answer 2

+2 A:

Try building the output instead of deleting from it:

def extract_by_letters(letters, dictionary):
    d = []
    for word in dictionary:
       for letter in letters:
           if word.count(letter)>0:
               d.append(word)
               break
    return d

Or, you could use regular expressions:

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    d=[]
    for word in dictionary:
       if regex.search(word):
           d.append(word)
    return d

Or, perhaps the simplest way:

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    return [word for word in dictionary if regex.search(word)]

This last one takes no noticeable time to scan /usr/share/dict/words on my Mac, which is a list of 234936 words.

Andrew McGregor 2010-02-22 11:03:39

The original algorithm seems to be comparing the count of the letters to the count in the word, you seem to just be looking for whether the letter is in the word.

gnibbler 2010-02-22 19:37:40

Hmm. True. That said, some combination of what's in the answers now will probably fix the OP's problem.

Andrew McGregor 2010-02-22 21:19:50

Answer 3

+2 A:

Here's a function that should offer a major speed-up:

def extract_by_letters(letters,dictionary):
    letdict = zip(set(letters),[letters.count(let) for let in set(letters)])
    outarr = []
    for word in dictionary:
        goodword = True
        for letter in letdict:
            if word.count(letter) != letdict[letter]:
                goodword = False
                break
        if goodword:
            outarr.append(word)
    return outarr

Here are the optimizations I did:

Made a dictionary of the letters with their corresponding frequencies. This way, you aren't using letters.count over and over again when you only need to do this process once and store the results.
Rather than removing the words from dictionary, I add them to an array that is returned from the function. If you have a huge dictionary, chances are that only a few words will match. Also, if the dictionary variable is an array (which I suspect), then every time you called remove it had to first search for the word in dictionary (linearly starting at the beginning) and then remove it. It is faster to remove by popping using the index of the word to be removed.
Breaking out of the loop checking the letter counts whenever a mismatch is found. This prevents us from doing needless checks when we already have our answer.

I wasn't sure if you had repeated letters in the letters variable or not so I made sure that it could handle that by using letdict. If you had letters repeated in your letters variable before, then you were checking the counts of those letters in the word repeatedly.

Justin Peel 2010-02-22 16:43:24

+1 I *think* this is probably what the OP means to do, ie not modify the dictionary as it is iterated over.

gnibbler 2010-02-22 19:41:51

Answer 4

A:

You're trying to find all anagrams of 'letters'?

def anagrams(letters, words):
    letters = sorted(letters)
    result = []
    for word in words:
        if sorted(word.strip()) == letters:
            result.append(word)
    return result

Paul Hankin 2010-02-22 20:12:33

Answer 5

+1 A:

import pprint
from collections import defaultdict

# This is a best approximation to what Bryan is trying to do.
# However the results are meaningless because the list is being
# mutated during iteration over it. So I haven't shown the output.

def extract_by_letters_0(letters, input_list):
    dictionary = input_list.copy()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               if word in dictionary: #I cant leave this line out
                   dictionary.remove(word)
    return dictionary

# This avoids the mutation.
# The results are anagrams PLUS letters that don't occur
# in the query. E.g. "same" produces "samehood" but not "sameness"
# ("sameness" has 3*"s" and 2*"e" instead of 1 of each)

def extract_by_letters_1(letters, input_list):
    dictionary = set(input_list)
    ripouts = set()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               ripouts.add(word)
    return dictionary - ripouts

def anagram_key(strg):
    return ''.join(sorted(list(strg)))

def check_anagrams(str1, str2):
    return sorted(list(str1)) == sorted(list(str2))

# Advice: try algorithms like this out on a SMALL data set first.
# Get it working correctly. Use different test cases. Have test code
# however primitive that check your results.
# Then if it runs slowly, helpers
# don't have to guess what you are doing.

raw_text = """
twas brillig and the slithy toves
did gyre and gimble in the wabe
same mesa seam sameness samehood
"""

lexicon = sorted(set(raw_text.split()))
print "\nlexicon:", lexicon
#
# Assuming we want anagrams:
#
# Build an anagram dictionary
#
anagram_dict = defaultdict(set)
for word in lexicon:
    anagram_dict[anagram_key(word)].add(word)

print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# now purge trivial entries

temp = {}
for k, v in anagram_dict.iteritems():
    if len(v) != 1:
        temp[k] = v
anagram_dict = temp
print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# Test cases

tests = "sam same mesa sameness samehood xsame samex".split()
default_set = frozenset()
for test in tests:
    print
    results = extract_by_letters_1(test, lexicon)
    print test, [(result, check_anagrams(test, result)) for result in results]
    # In the following statement, you can use set([test]) as the default
    # if that produces a more useful or orthogonal result.
    results = anagram_dict.get(anagram_key(test), default_set)
    print test, [(result, check_anagrams(test, result)) for result in results]

Output:

lexicon: ['and', 'brillig', 'did', 'gimble', 'gyre', 'in', 'mesa', 'same', 'samehood', 'sameness', 'seam', 'slithy', 'the', 'toves', 'twas', 'wabe']

anagram_dict (len == 14):
defaultdict(<type 'set'>, {'abew': set(['wabe']), 'eht': set(['the']), 'egry': set(['gyre']), 'begilm': set(['gimble']), 'hilsty': set(['slithy']), 'aems': set(['mesa', 'seam', 'same']), 'bgiillr': set(['brillig']), 'ddi': set(['did']), 'eostv': set(['toves']), 'adehmoos': set(['samehood']), 'in': set(['in']), 'adn': set(['and']), 'aeemnsss': set(['sameness']), 'astw': set(['twas'])})

anagram_dict (len == 1):
{'aems': set(['mesa', 'same', 'seam'])}

sam [('mesa', False), ('samehood', False), ('seam', False), ('same', False)]
sam []

same [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
same [('mesa', True), ('seam', True), ('same', True)]

mesa [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
mesa [('mesa', True), ('seam', True), ('same', True)]

sameness [('sameness', True)]
sameness []

samehood [('samehood', True)]
samehood []

xsame []
xsame []

samex []
samex []

John Machin 2010-02-23 00:02:43

anagrams is an interesting guess. Perhaps the OP has just not considered checking `words` for characters that are not in `letters`. +1 for pointing out that tests would be a good idea - that would help to avoid pointing fingers at the wrong problems

gnibbler 2010-02-23 00:29:31

ansaurus

tags:

views:

answers:

Integer comparison in python slows everything down to a crawl

related questions