views:

945

answers:

5

Let's say that we have a dictionary of about 250.000 words. Algorithm should take in 12 letters as an array or a string and find the permutation (or is it variation or combination?) that matches longest word from a dictionary. So, it is not a real permutation, as not all of the letters are used.

Of course, one can always brute-force it, but I wonder what would be the most elegant way to do this?

Answers using languages other than PHP will also be accepted if they do not use any language-specific functions as a shortcut for the main problem.

Note: Words are stored in the database, but I could pull them into memory for speed. Although I'm not sure PHP's indexing is better than that of an MySQL database?

+1  A: 

If you are trying to find the longest matching word, I would start by trying to sort the dictionary by word length, so you can focus the most effort on the longest words

rikh
A: 

My idea:

pseudocode:

int_32 letter_mask
int_32 permutation_match_mask
if(((letter_mask XOR permutation_match_mask) AND letter_mask)  == 0)
        YOU_HAVE_HIT;

well this works when you have non repetive letters in lettermask, but if you have more letters (as you probably have) than you can extend leter and permutationmatchmask

EDIT

Another idea

Sort words in vocabulary by alphabeticaly order.

if there are 12 letteres and all of them are different, than there are exactly 4095 posible cobinations (just sum i= 1->12 binomial(12 over i) ) (for letters ABCD, there are (ABCD,ABC,ABD,ACD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D) And as I said there are 4095 in 12 different letters and even less if some of letters are same.

Complexity 4095*Log2(250000) what is aproximetly 75000. Well it is worth to try.

Go for exact search on each combination.

ralu
Care to explain in little more detail?
Milan Babuškov
It is brute force algorhitm and need to check each word separatly to find HIT.For instance:You have letters "ABFR"and 2 words in dictionary"FOO" and "BAR"bar is represented whit "11000000000000000100000000000000"binary (we have A at 1, B at 2 and R at 18)"arbf" has "11000100000000000100000000000000"binarywhen you do upper logical evaluantion which need in this case just few instructions it yeilds YOU_HAVE_HITBut as I told it is just fast brute force implementation.
ralu
+3  A: 

I'd go with a slightly modified version of the answer to the anagram question here

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor."

Start with your complete input, alphabetically sorted. If its not found, remove one letter, do the search again. Do this for every letter. Then remove two letters... and so on.

Worst case: No 'anagram' found at all. You will have to test all possible input combinations, which will give you around 2^n lookups where n is the number of input characters (in your example: 12) However, the speed of the algorithm does not depend on the size of the dictionary at run time (of course, sorting the words alphabetically does) which in my opinion is the most important thing here.

HerdplattenToni
Presumably a typo, but just to check: did you *mean* 'acfoor' or **abfoor** ?
David Thomas
I actually copied that part from the original answer, but you are right of course
HerdplattenToni
+3  A: 

You should calculate the signature of every word, you do it only once and save it into your database along with the word.

The table should be something like this:

   word varchar(12), 
   a int,
   b int, 
   c int,
    ...
   w int,
   z int;

and the fields from a to z have to contains the number of letter contained in the word, for example anagram would have a record like:

word,    a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
anagram, 3,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0

once you have the twelve letters you have to calculate the signature of the set and use it to create a select like this:

select word, length(word) as wordlen
from dictionary
where
a <= 4 and
b <= 0 and
c <= 1 and
d <= 2 and
e <= 0 and
f <= 0 and
 ....
z <= 0
order by wordlen desc;

in order to have all the word that can be created using the letter set you have.

No permutation, no combination and the though work (compiling the dictionary) is done only once and offline.

Just another hint, strip from the database all the words that are longer than twelve chars

Eineki
I like this, although it seems to require to scan through entire set when searching for longest word. I was hoping for an algorithm that would allow me to use indexes.
Milan Babuškov
You can index each of the a-z fields. Though that may take up a lot of space.
DisgruntledGoat
Can you smash the characters used with an amount into another column (called "hash") and index that. For "anagram" you would have a hash column of "a3g1m1n1r1" (or you could do the "aaagmnr" hash as well)
null
If you use the signature/hash of a word (aaegmnr for manager) you can't retrieve easily the subwords like, for example, game o gamer with just a sql query
Eineki
Okay, I see... it's why you have the "... a <= 4 ..." part of the query. So if that's the case, then could you just hash the characters available, so that way you can index that? So the hash would be... "agmnr"?
null
+1  A: 

Eric Lippert has written an informative blog post about anagram searching. The examples all use c#, but the techniques are usable in any language.

The trick to efficiently searching for anagrams in a dictionary is to realize that all anagrams have the same letters, just in different order. If you "canonicalize" each word so that its letters are uppercase and in alphabetical order, then checking whether one word is an anagram of another is as simple as comparing their canonical forms

With this technique, you can easily look up anagrams from a hash table or balanced tree.

Sean Reilly
Isn't that the same as HerdplattenToni's answer?
Milan Babuškov
It's certainly similar, but the blog post includes lots of practical tuning advice, and it's worth mentioning.
Sean Reilly