views:

54

answers:

3

I have 10 arbitrary letters and need to check the max length match from words file

  1. I started to learn RE just some time ago, and can't seem to find suitable pattern

    • first idea that came was using set: [10 chars] but it also repeats included chars and I don't know how to avoid that
  2. I stared to learn Python recently but before RE and maybe RE is not needed and this can be solved without it

    • using "for this in that:" iterator seems inappropriate, but maybe itertools can do it easily (with which I'm not familiar)

I guess solution is known even to novice programmers/scripters, but not to me Thanks

A: 

I think this code will do what you are looking for:

>>> words = open('file.txt')
>>> max(len(word) for word in set(words.split()))

If you require more sophisticated tokenising, for example if you're not using Latin text, would should use NLTK:

>>> import nltk
>>> words = open('file.txt')
>>> max(len(word) for word in set(nltk.word_tokenize(words)))
Tim McNamara
You're just finding the longest words in the word file. I think he's trying to find anagrams.
Seamus Campbell
Ah, I see.. /me checks question
Tim McNamara
A: 

I assume you are trying to find out what is the longest word that can be made from your 10 arbitrary letters.

You can keep your 10 arbitrary letters in a dict along with the frequency they occur.

e.g., your 4 (using 4 instead of 10 for simplicity) arbitrary letters are: e, w, l, l. This would be in a dict as: {'e':1, 'w':1, 'l':2}

Then for each word in the text file, see if all of the letters for that word can be found in your dict of arbitrary letters. If so, then that is one of your candidate words.

So: we wall well

all of the letters in well would be found in your dict of arbitrary letters so save it and its length for comparison against other words.

mlaw
Thanks for all answers, and yes that is the case I was trying to solveWord file has 400,000 words and I was looking for suitable iterator algorithm. This is the list: "l = open('dict.dat','r').readlines()"
otrov
+2  A: 

I'm guessing this is something like finding possible words given a set of Scrabble tiles, so that a character can be repeated only as many times as it is repeated in the original list.

The trick is to efficiently test each character of each word in your word file against a set containing your source letters. For each character, if found in the test set, remove it from the test set and proceed; otherwise, the word is not a match, and go on to the next word.

Python has a nice function all for testing a set of conditions based on elements in a sequence. all has the added feature that it will "short-circuit", that is, as soon as one item fails the condition, then no more tests are done. So if your first letter of your candidate word is 'z', and there is no 'z' in your source letters, then there is no point in testing any more letters in the candidate word.

My first shot at writing this was simply:

matches = []
for word in wordlist:
    testset = set(letters)
    if all(c in testset for c in word):
        matches.append(word)

Unfortunately, the bug here is that if the source letters contained a single 'm', a word with several 'm's would erroneously match, since each 'm' would separately match the given 'm' in the source testset. So I needed to remove each letter as it was matched.

I took advantage of the fact that set.remove(item) returns None, which Python treats as a Boolean False, and expanded my generator expression used in calling all. For each c in word, if it is found in testset, I want to additionally remove it from testset, something like (pseudo-code, not valid Python):

all(c in testset and "remove c from testset" for c in word)

Since set.remove returns a None, I can replace the quoted bit above with "not testset.remove(c)", and now I have a valid Python expression:

all(c in testset and not testset.remove(c) for c in word)

Now we just need to wrap that in a loop that checks each word in the list (be sure to build a fresh testset before checking each word, since our all test has now become a destructive test):

for word in wordlist:
    testset = set(letters)
    if all(c in testset and not testset.remove(c) for c in word):
        matches.append(word)

The final step is to sort the matches by descending length. We can pass a key function to sort. The builtin len would be good, but that would sort by ascending length. To change it to a descending sort, we use a lambda to give us not len, but -1 * len:

matches.sort(key=lambda wd: -len(wd))

Now you can just print out the longest word, at matches[0], or iterate over all matches and print them out.

(I was surprised that this brute force approach runs so well. I used the 2of12inf.txt word list, containing over 80,000 words, and for a list of 10 characters, I get back the list of matches in about 0.8 seconds on my little 1.99GHz laptop.)

Paul McGuire
+1 Great answer!
katrielalex
+1 Thanks for the fine answer and good lecture, but unfortunately I can't make it match anything. Here is my simple code (with 6 letters instead 10): http://pastebin.com/EXU9yFTKI never used all() function, but I'm not sure if I'm clear again about the task, so I'll try one more time to explain: Let's say you get 10 letters and your task is to find longest word that you can make with those letters. Sometimes longest word will be 8 letters sometimes more or lessI'll try to make it work for me later and post my result
otrov
You are using readlines() to get the content of dict.dat. This function appends '\n' to each line. Try read().splitlines() instead.
Paul McGuire
:) PERFECT! I have one more hero - Paul McGuire. Thanks man, I'm very pleased that it works now, thanks to your generous assistance and elegant and fast solution packed in couple of code lines :)
otrov
If I see further, it because I stand on the toes of giants. Glad it worked out! :)
Paul McGuire
For the record and for fun: I run the script on my old Nokia E61i (approx as P1: 64MB RAM, 220 MHz ARM) with PYS60 and it took something less than 2 min to output result :)
otrov