views:

237

answers:

5

The following code is causing me enormous headaches:

def extract_by_letters(letters, dictionary):

    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               if word in dictionary: #I cant leave this line out
                   dictionary.remove(word)

    return dictionary

First of all: the 'if word in dictionary' line. Why can't I leave this out? I get an error saying ValueError: list.remove(x): x not in list

But it is, obviously.

Second: Dictionary is a file of about 50,000 words separated by linebreaks. The above code takes about 2 minutes to run... Wayyy too long. I played with the code for a bit, and I found that the line:

if word.count(letter) != letters.count(letter):

seems to be causing all my problems. If I take that line out (which totally screws up the output), the function takes about 2 seconds to go through the dictionary.

I thought it was the count functions, but it's not.

if I change the if statement to something like:

print word.count(letter) 
print letters.count(letter)

the function takes about 3 seconds to run.

I'm convinced it's the comparison. Any other suggestions? Is there a better way to do this?

Thanks in advance!

+4  A: 

The reason you get the exception is that if the letter count matches for more than one letter, you are trying to remove the word more than once

The reason it is so slow is that you have loops inside loops inside loops.

If you would write a sentence or two about what the function is supposed to do, it would be much easier to refactor it. In the mean time, this would stop you checking whether a word needs to be removed once you have already removed it.

def extract_by_letters(letters, dictionary):
    for word in dictionary[:]:  # bad idea to change this while you iterate over it
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                dictionary.remove(word)
                break
    return dictionary

If dictionary is a set you should get some speed up. If dictionary is a list this should give a huge speedup

gnibbler
Either you have never tried your set idea or you have access to VoodooPython ... my standard-issue Python 2.6 cracks it: "RuntimeError: Set changed size during iteration". Write out 100 lines: "I must not mutate a container that I'm iterating over" :-)
John Machin
This code doesn't do what the code in the question does. It is supposed to remove all words where the count of a given letter in a word does not match the count of the same letter in letters. This will stop after it removes one word I would think.
Justin Peel
@John Machin, Justin - Yeah I was worried about refactoring too much. I noticed the mutating the container bug, but thought I would wait for an explanation of what the code is *supposed* to do. Still waiting...I think Justin is correct, so I will fix that, and I think he is using lists which means `word in dictionary` and `dictionary.remove` which are quite slow operations are being called 1000's of times
gnibbler
+1 for catching the multiple attempts to remove
CurtainDog
+2  A: 

Try building the output instead of deleting from it:

def extract_by_letters(letters, dictionary):
    d = []
    for word in dictionary:
       for letter in letters:
           if word.count(letter)>0:
               d.append(word)
               break
    return d

Or, you could use regular expressions:

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    d=[]
    for word in dictionary:
       if regex.search(word):
           d.append(word)
    return d

Or, perhaps the simplest way:

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    return [word for word in dictionary if regex.search(word)]

This last one takes no noticeable time to scan /usr/share/dict/words on my Mac, which is a list of 234936 words.

Andrew McGregor
The original algorithm seems to be comparing the count of the letters to the count in the word, you seem to just be looking for whether the letter is in the word.
gnibbler
Hmm. True. That said, some combination of what's in the answers now will probably fix the OP's problem.
Andrew McGregor
+2  A: 

Here's a function that should offer a major speed-up:

def extract_by_letters(letters,dictionary):
    letdict = zip(set(letters),[letters.count(let) for let in set(letters)])
    outarr = []
    for word in dictionary:
        goodword = True
        for letter in letdict:
            if word.count(letter) != letdict[letter]:
                goodword = False
                break
        if goodword:
            outarr.append(word)
    return outarr

Here are the optimizations I did:

  1. Made a dictionary of the letters with their corresponding frequencies. This way, you aren't using letters.count over and over again when you only need to do this process once and store the results.

  2. Rather than removing the words from dictionary, I add them to an array that is returned from the function. If you have a huge dictionary, chances are that only a few words will match. Also, if the dictionary variable is an array (which I suspect), then every time you called remove it had to first search for the word in dictionary (linearly starting at the beginning) and then remove it. It is faster to remove by popping using the index of the word to be removed.

  3. Breaking out of the loop checking the letter counts whenever a mismatch is found. This prevents us from doing needless checks when we already have our answer.

I wasn't sure if you had repeated letters in the letters variable or not so I made sure that it could handle that by using letdict. If you had letters repeated in your letters variable before, then you were checking the counts of those letters in the word repeatedly.

Justin Peel
+1 I *think* this is probably what the OP means to do, ie not modify the dictionary as it is iterated over.
gnibbler
A: 

You're trying to find all anagrams of 'letters'?

def anagrams(letters, words):
    letters = sorted(letters)
    result = []
    for word in words:
        if sorted(word.strip()) == letters:
            result.append(word)
    return result
Paul Hankin
+1  A: 
import pprint
from collections import defaultdict

# This is a best approximation to what Bryan is trying to do.
# However the results are meaningless because the list is being
# mutated during iteration over it. So I haven't shown the output.

def extract_by_letters_0(letters, input_list):
    dictionary = input_list.copy()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               if word in dictionary: #I cant leave this line out
                   dictionary.remove(word)
    return dictionary

# This avoids the mutation.
# The results are anagrams PLUS letters that don't occur
# in the query. E.g. "same" produces "samehood" but not "sameness"
# ("sameness" has 3*"s" and 2*"e" instead of 1 of each)

def extract_by_letters_1(letters, input_list):
    dictionary = set(input_list)
    ripouts = set()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               ripouts.add(word)
    return dictionary - ripouts

def anagram_key(strg):
    return ''.join(sorted(list(strg)))

def check_anagrams(str1, str2):
    return sorted(list(str1)) == sorted(list(str2))

# Advice: try algorithms like this out on a SMALL data set first.
# Get it working correctly. Use different test cases. Have test code
# however primitive that check your results.
# Then if it runs slowly, helpers
# don't have to guess what you are doing.

raw_text = """
twas brillig and the slithy toves
did gyre and gimble in the wabe
same mesa seam sameness samehood
"""

lexicon = sorted(set(raw_text.split()))
print "\nlexicon:", lexicon
#
# Assuming we want anagrams:
#
# Build an anagram dictionary
#
anagram_dict = defaultdict(set)
for word in lexicon:
    anagram_dict[anagram_key(word)].add(word)

print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# now purge trivial entries

temp = {}
for k, v in anagram_dict.iteritems():
    if len(v) != 1:
        temp[k] = v
anagram_dict = temp
print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# Test cases

tests = "sam same mesa sameness samehood xsame samex".split()
default_set = frozenset()
for test in tests:
    print
    results = extract_by_letters_1(test, lexicon)
    print test, [(result, check_anagrams(test, result)) for result in results]
    # In the following statement, you can use set([test]) as the default
    # if that produces a more useful or orthogonal result.
    results = anagram_dict.get(anagram_key(test), default_set)
    print test, [(result, check_anagrams(test, result)) for result in results]

Output:

lexicon: ['and', 'brillig', 'did', 'gimble', 'gyre', 'in', 'mesa', 'same', 'samehood', 'sameness', 'seam', 'slithy', 'the', 'toves', 'twas', 'wabe']

anagram_dict (len == 14):
defaultdict(<type 'set'>, {'abew': set(['wabe']), 'eht': set(['the']), 'egry': set(['gyre']), 'begilm': set(['gimble']), 'hilsty': set(['slithy']), 'aems': set(['mesa', 'seam', 'same']), 'bgiillr': set(['brillig']), 'ddi': set(['did']), 'eostv': set(['toves']), 'adehmoos': set(['samehood']), 'in': set(['in']), 'adn': set(['and']), 'aeemnsss': set(['sameness']), 'astw': set(['twas'])})

anagram_dict (len == 1):
{'aems': set(['mesa', 'same', 'seam'])}

sam [('mesa', False), ('samehood', False), ('seam', False), ('same', False)]
sam []

same [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
same [('mesa', True), ('seam', True), ('same', True)]

mesa [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
mesa [('mesa', True), ('seam', True), ('same', True)]

sameness [('sameness', True)]
sameness []

samehood [('samehood', True)]
samehood []

xsame []
xsame []

samex []
samex []
John Machin
anagrams is an interesting guess. Perhaps the OP has just not considered checking `words` for characters that are not in `letters`. +1 for pointing out that tests would be a good idea - that would help to avoid pointing fingers at the wrong problems
gnibbler