views:

6459

answers:

14

What would be the best strategy to generate anagrams.

An anagram is a type of word play, the result of rearranging the letters
of a word or phrase to produce a new  word or phrase, using all the original
letters exactly once; 
ex.
  • Eleven plus two is anagram of Twelve plus one
  • A decimal point is anagram of I'm a dot in place
  • Astronomers is anagram of Moon starers

At first it looks straightforwardly simple, just to jumble the letters and generate all possible combinations. But what would be the efficient approach to generate only the words in dictionary.

I came across this page, Solving anagrams in Ruby.

But what are your ideas?

+3  A: 

I haven't looked at it in any detail, but this UNIX anagram finder utility includes a PDF with the source (public domain, C++) and a part-by-part discussion.

dF
A: 

Off the top of my head, the solution that makes the most sense would be to pick a letter out of the input string randomly and filter the dictionary based on words that start with that. Then pick another, filter on the second letter, etc. In addition, filter out words that can't be made with the remaining text. Then when you hit the end of a word, insert a space and start it over with the remaining letters. You might also restrict words based on word type (e.g. you wouldn't have two verbs next to each other, you wouldn't have two articles next to each other, etc).

Cody Brocious
+11  A: 

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "acfoor."

Then when the input anagram comes in, sort its letters too, then look it up. It's as fast as a hashtable lookup!

For multiple words, you could do combinations of the sorted letters, sorting as you go. Still much faster than generating all combinations.

(see comments for more optimizations and details)

Jason Cohen
It seems like this (along with Jimmy's answer) would only work for a single word anagram -- how can this be applied to anagrams of phrases?
Cody Brocious
As I said in the post, for multiple words you could examine all pairs, triples, etc., in each case combining the letters and sorting (and use mergesort so that op is faster!) and test against that combo. You could be smarter still, e.g. bitfields of chars used at all and...
Jason Cohen
...obviously the total number of characters, so when I say "test all triples" there are massive categories you can prune. For example, store words first by length, then hash. So in your pairs/triples you're already able to skip combos with wrong numbers of characters.
Jason Cohen
sorting the characters first doesn't help at all. It might give you one or two, but you need to test all combinations and then reject them. One way would be to generate all possible triplets and then compare them to the first three letters of all words from a dictionary.
Mats Fredriksson
sorting does help -- it's the simplest way (aside from say, using .NET HashSet<T> or Python set()) to map a ordered list of letters to an unordered list.
Jimmy
ok, fair enough, it speeds up things in that the anagrams of "foobar" and "barfoo" will resolve to the same result set, but if you are going to get all anagrams from just one sentence, then sorting doesn't help you since you need to consider all characters available.
Mats Fredriksson
its a matter of reverse lookup, too. once you have your big list of letters ("astronomers"), you find a list of sorted substrings ("mno" + "aat" + "sors", or "mnoo"+"aerrsst" for example) so you can look it up in the lookup table you gener
Jimmy
@Jason, bitwise operations won't work because a letter may appear more than once in the String. If you use OR, these duplicate letters won't be counted, and if you use addition, there will be collisions.
Zach Langley
Inefficient for multiple words. See my answer.
FogleBird
+1  A: 

How I see it:

you'd want to build a table that maps unordered sets of letters to lists words i.e. go through the dictionary so you'd wind up with, say

lettermap[set(a,e,d,f)] = { "deaf", "fade" }

then from your starting word, you find the set of letters:

 astronomers => (a,e,m,n,o,o,r,r,s,s,t)

then loop through all the partitions of that set ( this might be the most technical part, just generating all the possible partitions), and look up the words for that set of letters.

edit: hmmm, this is pretty much what Jason Cohen posted.

edit: furthermore, the comments on the question mention generating "good" anagrams, like the examples :). after you build your list of all possible anagrams, run them through WordNet and find ones that are semantically close to the original phrase :)

Jimmy
A: 
  1. As Jason suggested, prepare a dictionary making hashtable with key being word sorted alphabetically, and value word itself (you may have multiple values per key).
  2. Remove whitespace and sort your query before looking it up.

After this, you'd need to do some sort of a recursive, exhaustive search. Pseudo code is very roughly:

function FindWords(solutionList, wordsSoFar, sortedQuery)
  // base case
  if sortedQuery is empty
     solutionList.Add(wordsSoFar)
     return

  // recursive case

  // InitialStrings("abc") is {"a","ab","abc"}
  foreach initialStr in InitalStrings(sortedQuery)
    // Remaining letters after initialStr
    sortedQueryRec := sortedQuery.Substring(initialStr.Length)
    words := words matching initialStr in the dictionary
    // Note that sometimes words list will be empty
    foreach word in words
      // Append should return a new list, not change wordSoFar
      wordsSoFarRec := Append(wordSoFar, word) 
      FindWords(solutionList, wordSoFarRec, sortedQueryRec)

In the end, you need to iterate through the solutionList, and print the words in each sublist with spaces between them. You might need to print all orderings for these cases (e.g. "I am Sam" and "Sam I am" are both solutions).

Of course, I didn't test this, and it's a brute force approach.

dbkk
+5  A: 

See this assignment from the University of Washington CSE department.

Basically, you have a data structure that just has the counts of each letter in a word (an array works for ascii, upgrade to a map if you want unicode support). You can subtract two of these letter sets; if a count is negative, you know one word can't be an anagram of another.

hazzen
working with the counts makes it simple combination problem. you have a map for the search phrase, and match it to combinations of word maps with the same sum of counts. This is an elegant solution.
Osama ALASSIRY
+2  A: 

Pre-process:

Build a trie with each leaf as a known word, keyed in alphabetical order.

At search time:

Consider the input string as a multiset. Find the first sub-word by traversing the index trie as in a depth-first search. At each branch you can ask, is letter x in the remainder of my input? If you have a good multiset representation, this should be a constant time query (basically).

Once you have the first sub-word, you can keep the remainder multiset and treat it as a new input to find the rest of that anagram (if any exists).

Augment this procedure with memoization for faster look-ups on common remainder multisets.

This is pretty fast - each trie traversal is guaranteed to give an actual subword, and each traversal takes linear time in the length of the subword (and subwords are usually pretty darn small, by coding standards). However, if you really want something even faster, you could include all n-grams in your pre-process, where an n-gram is any string of n words in a row. Of course, if W = #words, then you'll jump from index size O(W) to O(W^n). Maybe n = 2 is realistic, depending on the size of your dictionary.

Tyler
+1  A: 

I've used the following way of computing anagrams a couple of month ago:

  • Compute a "code" for each word in your dictionary: Create a lookup-table from letters in the alphabet to prime numbers, e.g. starting with ['a', 2] and ending with ['z', 101]. As a pre-processing step compute the code for each word in your dictionary by looking up the prime number for each letter it consists of in the lookup-table and multiply them together. For later lookup create a multimap of codes to words.

  • Compute the code of your input word as outlined above.

  • Compute codeInDictionary % inputCode for each code in the multimap. If the result is 0, you've found an anagram and you can lookup the appropriate word. This also works for 2- or more-word anagrams as well.

Hope that was helpful.

+2  A: 

The book Programming Pearls by Jon Bentley covers this kind of stuff quite nicely. A must-read.

Don't know why you were modded down but Column 2 of Programming Pearls walks through an implementation of a program that finds all sets of anagrams given a dictionary of words. Definately worth a look.Compile and run the code as follows:./sign <somedictionary| sort | ./squash
vinc456
http://www.cs.bell-labs.com/cm/cs/pearls/code.html#col02
vinc456
+1  A: 

A while ago I have written a blog post about how to quickly find two word anagrams. It works really fast: finding all 44 two-word anagrams for a word with a textfile of more than 300,000 words (4 Megabyte) takes only 0.6 seconds in a Ruby program.

Two Word Anagram Finder Algorithm (in Ruby)

It is possible to make the application faster when it is allowed to preprocess the wordlist into a large hash mapping from words sorted by letters to a list of words using these letters. This preprocessed data can be serialized and used from then on.

martinus
+2  A: 

One of the seminal works on programmatic anagrams was by Michael Morton (Mr. Machine Tool), using a tool called Ars Magna. Here is a light article based on his work.

plinth
+1  A: 

http://github.com/offby1/anagrams

One algorithm, lots of languages.

offby1
A: 

I've been writing a series of blog post on anagram algorithms that are used in xworder.com: http://bbzippo.wordpress.com/2009/11/27/inside-xworder-the-road-to-full-anagrams/ bbzippo.wordpress.com/2009/12/06/inside-xworder-the-road-to-full-anagrams-2/ and more to come.

bbzippo
+9  A: 

Most of these answers are horribly inefficient and/or will only give one-word solutions (no spaces). My solution will handle any number of words and is very efficient.

What you want is a trie data structure. Here's a complete Python implementation. You just need a word list saved in a file named words.txt You can try the Scrabble dictionary word list here:

http://www.isc.ro/lists/twl06.zip

MIN_WORD_SIZE = 4 # min size of a word in the output

class Node(object):
    def __init__(self, letter='', final=False, depth=0):
        self.letter = letter
        self.final = final
        self.depth = depth
        self.children = {}
    def add(self, letters):
        node = self
        for index, letter in enumerate(letters):
            if letter not in node.children:
                node.children[letter] = Node(letter, index==len(letters)-1, index+1)
            node = node.children[letter]
    def anagram(self, letters):
        tiles = {}
        for letter in letters:
            tiles[letter] = tiles.get(letter, 0) + 1
        min_length = len(letters)
        return self._anagram(tiles, [], self, min_length)
    def _anagram(self, tiles, path, root, min_length):
        if self.final and self.depth >= MIN_WORD_SIZE:
            word = ''.join(path)
            length = len(word.replace(' ', ''))
            if length >= min_length:
                yield word
            path.append(' ')
            for word in root._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
        for letter, node in self.children.iteritems():
            count = tiles.get(letter, 0)
            if count == 0:
                continue
            tiles[letter] = count - 1
            path.append(letter)
            for word in node._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
            tiles[letter] = count

def load_dictionary(path):
    result = Node()
    for line in open(path, 'r'):
        word = line.strip().lower()
        result.add(word)
    return result

def main():
    print 'Loading word list.'
    words = load_dictionary('words.txt')
    while True:
        letters = raw_input('Enter letters: ')
        letters = letters.lower()
        letters = letters.replace(' ', '')
        if not letters:
            break
        count = 0
        for word in words.anagram(letters):
            print word
            count += 1
        print '%d results.' % count

if __name__ == '__main__':
    main()

When you run the program, the words are loaded into a trie in memory. After that, just type in the letters you want to search with and it will print the results. It will only show results that use all of the input letters, nothing shorter.

It filters short words from the output, otherwise the number of results is huge. Feel free to tweak the MIN_WORD_SIZE setting. Keep in mind, just using "astronomers" as input gives 233,549 results if MIN_WORD_SIZE is 1. Perhaps you can find a shorter word list that only contains more common English words.

Also, the contraction "I'm" (from one of your examples) won't show up in the results unless you add "im" to the dictionary and set MIN_WORD_SIZE to 2.

The trick to getting multiple words is to jump back to the root node in the trie whenever you encounter a complete word in the search. Then you keep traversing the trie until all letters have been used.

FogleBird
+1 for an excellent answer.
viksit