ansaurus

Question

Answer 1

+1 A:

If you are trying to find the longest matching word, I would start by trying to sort the dictionary by word length, so you can focus the most effort on the longest words

rikh 2009-09-28 10:22:13

Answer 2

A:

My idea:

pseudocode:

int_32 letter_mask
int_32 permutation_match_mask
if(((letter_mask XOR permutation_match_mask) AND letter_mask)  == 0)
        YOU_HAVE_HIT;

well this works when you have non repetive letters in lettermask, but if you have more letters (as you probably have) than you can extend leter and permutationmatchmask

EDIT

Another idea

Sort words in vocabulary by alphabeticaly order.

if there are 12 letteres and all of them are different, than there are exactly 4095 posible cobinations (just sum i= 1->12 binomial(12 over i) ) (for letters ABCD, there are (ABCD,ABC,ABD,ACD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D) And as I said there are 4095 in 12 different letters and even less if some of letters are same.

Complexity 4095*Log2(250000) what is aproximetly 75000. Well it is worth to try.

Go for exact search on each combination.

ralu 2009-09-28 10:55:03

Care to explain in little more detail?

Milan Babuškov 2009-09-28 11:01:43

It is brute force algorhitm and need to check each word separatly to find HIT.For instance:You have letters "ABFR"and 2 words in dictionary"FOO" and "BAR"bar is represented whit "11000000000000000100000000000000"binary (we have A at 1, B at 2 and R at 18)"arbf" has "11000100000000000100000000000000"binarywhen you do upper logical evaluantion which need in this case just few instructions it yeilds YOU_HAVE_HITBut as I told it is just fast brute force implementation.

ralu 2009-09-28 11:17:35

Answer 3

+3 A:

I'd go with a slightly modified version of the answer to the anagram question here

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor."

Start with your complete input, alphabetically sorted. If its not found, remove one letter, do the search again. Do this for every letter. Then remove two letters... and so on.

Worst case: No 'anagram' found at all. You will have to test all possible input combinations, which will give you around 2^n lookups where n is the number of input characters (in your example: 12) However, the speed of the algorithm does not depend on the size of the dictionary at run time (of course, sorting the words alphabetically does) which in my opinion is the most important thing here.

HerdplattenToni 2009-09-28 12:17:25

Presumably a typo, but just to check: did you *mean* 'acfoor' or **abfoor** ?

David Thomas 2009-09-28 18:28:10

I actually copied that part from the original answer, but you are right of course

HerdplattenToni 2009-09-29 07:59:21

Answer 4

+3 A:

You should calculate the signature of every word, you do it only once and save it into your database along with the word.

The table should be something like this:

   word varchar(12), 
   a int,
   b int, 
   c int,
    ...
   w int,
   z int;

and the fields from a to z have to contains the number of letter contained in the word, for example anagram would have a record like:

word,    a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
anagram, 3,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0

once you have the twelve letters you have to calculate the signature of the set and use it to create a select like this:

select word, length(word) as wordlen
from dictionary
where
a <= 4 and
b <= 0 and
c <= 1 and
d <= 2 and
e <= 0 and
f <= 0 and
 ....
z <= 0
order by wordlen desc;

in order to have all the word that can be created using the letter set you have.

No permutation, no combination and the though work (compiling the dictionary) is done only once and offline.

Just another hint, strip from the database all the words that are longer than twelve chars

Eineki 2009-09-28 12:37:14

I like this, although it seems to require to scan through entire set when searching for longest word. I was hoping for an algorithm that would allow me to use indexes.

Milan Babuškov 2009-09-28 13:59:36

You can index each of the a-z fields. Though that may take up a lot of space.

DisgruntledGoat 2009-09-28 16:33:45

Can you smash the characters used with an amount into another column (called "hash") and index that. For "anagram" you would have a hash column of "a3g1m1n1r1" (or you could do the "aaagmnr" hash as well)

null 2009-09-28 18:30:11

If you use the signature/hash of a word (aaegmnr for manager) you can't retrieve easily the subwords like, for example, game o gamer with just a sql query

Eineki 2009-09-28 18:41:57

Okay, I see... it's why you have the "... a <= 4 ..." part of the query. So if that's the case, then could you just hash the characters available, so that way you can index that? So the hash would be... "agmnr"?

null 2009-09-28 20:01:13

Answer 5

+1 A:

Eric Lippert has written an informative blog post about anagram searching. The examples all use c#, but the techniques are usable in any language.

The trick to efficiently searching for anagrams in a dictionary is to realize that all anagrams have the same letters, just in different order. If you "canonicalize" each word so that its letters are uppercase and in alphabetical order, then checking whether one word is an anagram of another is as simple as comparing their canonical forms

With this technique, you can easily look up anagrams from a hash table or balanced tree.

Sean Reilly 2009-09-29 15:16:11

Isn't that the same as HerdplattenToni's answer?

Milan Babuškov 2009-09-29 18:10:46

It's certainly similar, but the blog post includes lots of practical tuning advice, and it's worth mentioning.

Sean Reilly 2009-09-29 21:54:00

ansaurus

tags:

views:

answers:

Algorithm to find longest anagram

related questions