views:

1294

answers:

8

Eg if input string is helloworld I want the output to be like:

do
he
we
low
hell
hold
roll
well
word
hello
lower
world
...

all the way up to the longest word that is an anagram of a substring of helloworld. Like in Scrabble for example. The input string can be any length, but rarely more than 16 chars.

I've done a search and come up with structures like a trie, but I am still unsure of how to actually do this.

+1  A: 

What you want is an implementation of a power set.

Also look at Eric Lipparts blog, he blogged about this very thing a little while back

EDIT:

Here is an implementation I wrote of getting the powerset from a given string...

private IEnumerable<string> GetPowerSet(string letters)
{
  char[] letterArray = letters.ToCharArray();
  for (int i = 0; i < Math.Pow(2.0, letterArray.Length); i++)
  {
    StringBuilder sb = new StringBuilder();
    for (int j = 0; j < letterArray.Length; j++)
    {
      int pos = Convert.ToInt32(Math.Pow(2.0, j));
      if ((pos & i) == pos)
      {
        sb.Append(letterArray[j]);
      }
    }
    yield return new string(sb.ToString().ToCharArray().OrderBy(c => c).ToArray());
  }
}

This function gives me the powersets of chars that make up the passed in string, I then can use these as keys into a dictionary of anagrams...

Dictionary<string,IEnumerable<string>>

I created my dictionary of anagrams like so... (there are probably more efficient ways, but this was simple and plenty quick enough with the scrabble tournament word list)

wordlist = (from s in fileText.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
                let k = new string(s.ToCharArray().OrderBy(c => c).ToArray())
                group s by k).ToDictionary(o => o.Key, sl => sl.Select(a => a));
Tim Jarvis
+4  A: 

The structure used to hold your dictionary of valid entries will have a huge impact on efficiency. Organize it as a tree, root being the singular zero letter "word", the empty string. Each child of root is a single first letter of a possible word, children of those being the second letter of a possible word, etc., with each node marked as to whether it actually forms a word or not.

Your tester function will be recursive. It starts with zero letters, finds from the tree of valid entries that "" isn't a word but it does have children, so you call your tester recursively with your start word (of no letters) appended with each available remaining letter from your input string (which is all of them at that point). Check each one-letter entry in tree, if valid make note; if children, re-call tester function appending each of remaining available letters, and so on.

So for example, if your input string is "helloworld", you're going to first call your recursive tester function with "", passing the remaining available letters "helloworld" as a 2nd parameter. Function sees that "" isn't a word, but child "h" does exist. So it calls itself with "h", and "elloworld". Function sees that "h" isn't a word, but child "e" exists. So it calls itself with "he" and "lloworld". Function sees that "e" is marked, so "he" is a word, take note. Further, child "l" exists, so next call is "hel" with "loworld". It will next find "hell", then "hello", then will have to back out and probably next find "hollow", before backing all the way out to the empty string again and then starting with "e" words next.

John Pirie
I like this explanation. I can (almost) get my head around it. Need to do a pen and paper analysis I think.
20th Century Boy
Looks like it's pretty close to Norman Ramsey's DAWG (other post); should have known there was a formal definition for it.
John Pirie
Nope...I've thought about this and decided I don't understand the tester function. I understand how to build the tree. How does it get from "hello" to "hollow"? Recursion was never my strong point!
20th Century Boy
It won't get from "hello" to "hollow" directly. The call that received params of "h" and "elloworld" will subsequently make calls to all child nodes, which will be all 2-letter combinations which start words ("he" "ho"
John Pirie
How can this be done using a SQL database? The size of dictionary I want to use can't be handled by the language I'm using (PHP).
Joe Philllips
A: 

Like Tim J, Eric Lippert's blog posts where the first thing to come to my mind. I wanted to add that he wrote a follow-up about ways to improve the performance of his first attempt.

Lucas
+2  A: 

A simple-minded approach is to generate all the "substrings" and, for each of them, check whether it's an element of the set of acceptable words. E.g., in Python 2.6:

import itertools
import urllib

def words():
  f = urllib.urlopen(
    'http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt')
  allwords = set(w[:-1] for w in f)
  f.close()
  return allwords

def substrings(s):
  for i in range(2, len(s)+1):
    for p in itertools.permutations(s, i):
      yield ''.join(p)

def main():
  w = words()
  print '%d words' % len(w)
  ss = set(substrings('weep'))
  print '%d substrings' % len(ss)
  good = ss & w
  print '%d good ones' % len(good)
  sgood = sorted(good, key=lambda w:(len(w), w))
  for aword in sgood:
    print aword

main()

will emit:

38617 words
31 substrings
5 good ones
we
ewe
pew
wee
weep

Of course, as other responses pointed out, organizing your data purposefully can greatly speed-up your runtime -- although the best data organization for a fast anagram finder could well be different... but that will largely depend on the nature of your dictionary of allowed words (a few tens of thousands, like here -- or millions?). Hash-maps and "signatures" (based on sorting the letters in each word) should be considered, as well as tries &c.

Alex Martelli
This works for a small test string, but a 16-character one yields 20922789888000 possible substrings. For long test strings, you want to be bound by valid entries, rather than possible letter combinations.
John Pirie
+1  A: 

The data structure you want is called a Directed Acyclic Word Graph (dawg), and it is described by Andrew Appel and Guy Jacobsen in their paper "The World's Fastest Scrabble Program" which unfortunately they have chosen not to make available free online. An ACM membership or a university library will get it for you.

I have implemented this data structure in at least two languages---it is simple, easy to implement, and very, very fast.

Norman Ramsey
I found it on the first page of a Google search :-)
20th Century Boy
There is perhaps a faster one... see http://www.ericsink.com/downloads/faster-scrabble-gordon.pdf
Steve Powell
A: 

I believe the Ruby code in the answers to this question will also solve your problem.

las3rjock
+2  A: 

I couldn't resist my own implementation. It creates a dictionary by sorting all the letters alphabetically, and mapping them to the words that can be created from them. This is an O(n) start-up operation that eliminates the need to find all permutations. You could implement the dictionary as a trie in another language to attain faster speedups.

The "getAnagrams" command is also an O(n) operation which searches each word in the dictionary to see if it is a subset of the search. Doing getAnagrams("radiotelegraphically")" (a 20 letter word) took approximately 1 second on my laptop, and returned 1496 anagrams.

# Using the 38617 word dictionary at 
# http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt
# Usage: getAnagrams("helloworld")

def containsLetters(subword, word):
    wordlen = len(word)
    subwordlen = len(subword)

    if subwordlen > wordlen:
        return False

    word = list(word)
    for c in subword:
        try:
            index = word.index(c)
        except ValueError:
            return False
        word.pop(index)
    return True

def getAnagrams(word):
    output = []
    for key in mydict.iterkeys():
        if containsLetters(key, word):
            output.extend(mydict[key])

    output.sort(key=len)
    return output

f = open("dict.txt")
wordlist = f.readlines()
f.close()

mydict = {}
for word in wordlist:
    word = word.rstrip()
    temp = list(word)
    temp.sort()
    letters = ''.join(temp)

    if letters in mydict:
        mydict[letters].append(word)
    else:
        mydict[letters] = [word]

An example run:

>>> getAnagrams("helloworld")
>>> ['do', 'he', 'we', 're', 'oh', 'or', 'row', 'hew', 'her', 'hoe', 'woo', 'red', 'dew', 'led', 'doe', 'ode', 'low', 'owl', 'rod', 'old', 'how', 'who', 'rho', 'ore', 'roe', 'owe', 'woe', 'hero', 'wood', 'door', 'odor', 'hold', 'well', 'owed', 'dell', 'dole', 'lewd', 'weld', 'doer', 'redo', 'rode', 'howl', 'hole', 'hell', 'drew', 'word', 'roll', 'wore', 'wool','herd', 'held', 'lore', 'role', 'lord', 'doll', 'hood', 'whore', 'rowed', 'wooed', 'whorl', 'world', 'older', 'dowel', 'horde', 'droll', 'drool', 'dwell', 'holed', 'lower', 'hello', 'wooer', 'rodeo', 'whole', 'hollow', 'howler', 'rolled', 'howled', 'holder', 'hollowed']
Unknown
This certainly gets the job done! But John Pirie's method is possibly more efficient? I'll play around with both.
20th Century Boy
@ 20th Century Boy: John Pirie describes some kind of trie, which I already suggested as a possible replacement to a hash table. However, it sounds incomplete. He describes walking the trie with H-e, which gives the word "he", but neglects the permutation "eh" (I think that's a word).
Unknown
@Unknown: I think "eh" would get detected after all the "h" words, ie when the function backtracks to the top of the tree and moves onto the branch starting with "e".
20th Century Boy
A: 

Did you ever complete this code? I have been searching for the exact same thing.

Yes I did. I wrapped it up in a Qt GUI so it's a nice little standalone app. I'll post a link up shortly when I find some web space.
20th Century Boy