I'm making a boggle-like word game. The user is given a grid of letters like this:

O V Z W X
S T A C K
Y R F L Q

The user picks out a word using any adjacent chains of letters, like the word "STACK" across the middle line. The letters used are then replaced by the machine e.g. (new letters in lowercase):

O V Z W X
z e x o p
Y R F L Q

Notice you can now spell "OVeRFLoW" by using the new letters. My problem is: What algorithm can I use to pick new letters that maximizes the number of long words the user can spell? I want the game to be fun and involve spelling e.g. 6 letter words sometimes but, if you pick bad letters, games involve the user just spelling 3 letter words and not getting a chance to find larger words.

For example:

  • You could just randomly pick new letters from the alphabet. This does not work well.

  • Likewise, I found picking randomly but using the letter frequencies from Scrabble didn't work well. This works better in Scrabble I think as you are less constrained about the order you use the letters in.

  • I tried having a set of lists, each representing one of the dice from the Boggle game, with each new letter picked from a random side of a random die (I also wonder whether I can legally use this data in a product). I didn't notice this working well. I imagine the Boggle dice sides were chosen in some sensible manner, but I cannot find how this was done.

Some ideas I've considered:

  • Make a table of how often letter pairs occur together in the dictionary. For the sake of argument, say E is seen next to A 30% of the time. When picking a new letter, I would randomly pick a letter based on the frequency of this letter occurring next to a randomly chosen adjacent letter on the grid. For example, if the neighboring letter was E, the new letter would be "A" 30% of the time. This should mean there are lots of decent pairs to use scattered around the map. I could maybe improve this by making probability tables of a letter occurring between two other letters.

  • Somehow do a search for what words can be spelt on the current grid, taking the new letters to be wildcards. I would then replace the wildcards with letters that allowed the biggest words to be spelt. I'm not sure how you would do this efficiently however.
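
The pair-table idea above can be sketched in a few lines of Python. This is only a rough illustration: the word list is a toy stand-in for a real dictionary, and the fallback letters for unseen neighbours are an arbitrary choice.

```python
import random
from collections import Counter

def build_pair_table(words):
    """Count how often each ordered letter pair occurs in dictionary words."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs

def pick_letter(pairs, neighbour):
    """Sample a new letter weighted by how often it follows `neighbour`."""
    candidates = [(b, n) for (a, b), n in pairs.items() if a == neighbour]
    if not candidates:
        return random.choice("etaoinshrdlu")  # fallback: common English letters
    letters, weights = zip(*candidates)
    return random.choices(letters, weights=weights)[0]

pairs = build_pair_table(["stack", "overflow", "steer", "stare"])
print(pick_letter(pairs, "s"))  # a letter that follows 's' in the sample words
```

The "letter between two other letters" refinement would just condition the table on a pair of neighbours instead of one.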

Any other ideas are appreciated. I wonder if there is a common way to solve this problem, and what approaches other word games use.

Edit: Thanks for the great answers so far! I forgot to mention, I'm really aiming for low memory/CPU requirements if possible. I'm probably going to use the SOWPODS dictionary (about 250,000 words) and my grid will be about 6 x 6.

A: 

I do not know about a precanned algorithm for this, but...

There is a dictionary file on UNIX, and I imagine there is something similar available on other platforms (maybe even in the Java libraries? - google it). Anyway, use the files the spell checker uses.

After they spell a word and it drops out, you have existing letters and blank spaces.

1) From each existing letter, go right, left, up, down (you will need to understand recursive algorithms). As long as the string you have built so far is found at the start of words or backwards from the end of words in the dictionary file, continue. When you come across a blank space, count the frequency of the letters you need next. Use the most frequent letters.

It will not guarantee a word, as you have not checked the corresponding ending or beginning, but I think it would be much easier to implement than an exhaustive search and would still get pretty good results.
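
A rough Python illustration of the walk described above, under simplifying assumptions: a toy word list stands in for the dictionary file, only forward prefixes are checked (the backwards-from-word-endings pass is omitted), and visited cells are not tracked.

```python
from collections import Counter

# Toy stand-ins for the spell-checker dictionary mentioned above.
WORDS = ["stack", "stare", "overflow", "track"]
PREFIXES = {w[:i] for w in WORDS for i in range(1, len(w) + 1)}

def tally_blank_needs(grid):
    """Walk right/left/up/down from each letter. While the string built so
    far is a valid word prefix, keep going; on hitting a blank (None), tally
    every letter the dictionary would accept next."""
    rows, cols = len(grid), len(grid[0])
    needed = Counter()

    def walk(r, c, prefix):
        for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            ch = grid[nr][nc]
            if ch is None:
                for letter in "abcdefghijklmnopqrstuvwxyz":
                    if prefix + letter in PREFIXES:
                        needed[letter] += 1
            elif prefix + ch in PREFIXES:
                # A real version should also track visited cells.
                walk(nr, nc, prefix + ch)

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is not None and grid[r][c] in PREFIXES:
                walk(r, c, grid[r][c])
    return needed

grid = [["s", None, "a"],
        ["o", "t",  "r"]]
print(tally_blank_needs(grid).most_common())
```

The most frequent letters in the tally are the ones to place in the blanks.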

Jeff
Could you give a short example? I'm not sure how this would work.
BobbyJim
A: 

I think this will get you a step closer to your destination: http://en.wikipedia.org/wiki/Levenshtein_distance

Trevoke
+6  A: 

Here's a simple method:

Write a fast solver for the game using the same word list that the player will use. Generate say 100 different possible boards at random (using letter frequencies is probably a good idea here, but not essential). For each board calculate all the words that can be generated and score the board based on the number of words found or the count weighted by word length (i.e. the total sum of word lengths of all words found). Then just pick the best scoring board from the 100 possibilities and give that to the player.

Also instead of always picking the highest scoring board (i.e. the easiest board) you could have different score thresholds to make the game more difficult for experts.
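
A minimal sketch of this generate-and-score loop in Python, assuming Boggle-style adjacency (diagonals included, no cell reuse) and a toy word list; `solve` is the fast solver and `best_fill` is the candidate loop described above.

```python
import random
import string

WORDS = {"stack", "track", "cat", "act", "rat"}
PREFIXES = {w[:i] for w in WORDS for i in range(1, len(w) + 1)}

def solve(board):
    """Return every dictionary word spellable via chains of adjacent cells."""
    rows, cols = len(board), len(board[0])
    found = set()

    def dfs(r, c, prefix, used):
        prefix += board[r][c]
        if prefix not in PREFIXES:
            return  # prune: no word starts this way
        if prefix in WORDS:
            found.add(prefix)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and (nr, nc) not in used):
                    dfs(nr, nc, prefix, used | {(nr, nc)})

    for r in range(rows):
        for c in range(cols):
            dfs(r, c, "", {(r, c)})
    return found

def best_fill(board, blanks, candidates=100):
    """Try `candidates` random fills of the blank cells; keep the fill whose
    board scores highest on total word length."""
    best, best_score = None, -1
    for _ in range(candidates):
        trial = [row[:] for row in board]
        for (r, c) in blanks:
            trial[r][c] = random.choice(string.ascii_lowercase)
        score = sum(len(w) for w in solve(trial))
        if score > best_score:
            best, best_score = trial, score
    return best
```

For the difficulty-threshold variant, sort the candidates by score and pick from a lower percentile instead of the maximum.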

Mark Byers
Thanks. This is probably the most bullet-proof idea in that you could e.g. guarantee (most of the time) that there will always be a certain number of large words to pick. My board will be 6x6 and using a trie takes too much memory so I'm not sure how I could use this efficiently though.
BobbyJim
Using a word prefix list (trie) is going to give the best performance if you have the memory. If you store the trie compressed, you can probably fit a full trie in a few MBs, I'd guess. If not, you can still probably keep a word prefix list of up to length 5 in memory, then switch to binary (or interpolated) search of the full word list to check for matches longer than 5. Alternatively... count the prefixes up to length 5 and assume that lots of small partial words gives a good chance of a long word, without explicitly checking for the long words.
Mark Byers
If you're daring you could use a DAWG that is stored in an array. There is an excellent video lecture from Stanford on that, found here: http://www.youtube.com/watch?v=TJ8SkcUSdbU The short story is she managed to store 250,000 words in 0.32 MB.
Niki Yoshiuchi
If you can push this solving work to a separate thread, the solver can work ahead while the user is playing the game. This will give your solver more time to work.
Mr. Berna
+1  A: 

A minor variation on the letter-pair approach: use the frequency of letter pairs in long words - say 6 letters or longer - since that's your objective. You could also develop a weighting that included all adjacent letters, not just a random one.
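
One possible reading of the "all adjacent letters" weighting, sketched in Python with a toy long-word list: score each candidate letter by summing its pair frequency with every filled neighbour, rather than conditioning on one random neighbour. The +1 smoothing keeping unseen letters possible is an added assumption.

```python
import random
from collections import Counter

# Build a symmetric pair table from long words only (6+ letters),
# since long words are the objective.
LONG_WORDS = [w for w in ("overflow", "stacked", "letters", "strings")
              if len(w) >= 6]
PAIRS = Counter()
for w in LONG_WORDS:
    for a, b in zip(w, w[1:]):
        PAIRS[(a, b)] += 1
        PAIRS[(b, a)] += 1

def weight(letter, neighbours):
    """Score a candidate letter against every filled neighbouring cell."""
    return sum(PAIRS[(letter, n)] for n in neighbours)

def pick(neighbours):
    letters = "abcdefghijklmnopqrstuvwxyz"
    weights = [weight(l, neighbours) + 1 for l in letters]  # +1 smoothing
    return random.choices(letters, weights=weights)[0]
```

With a real dictionary, `LONG_WORDS` would be the subset of SOWPODS at the chosen length cutoff.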

Carl Manaster
Nice idea about using the 6-letter-long words! I considered using trigrams (only considering the frequencies of 3-letter sequences) but your idea sounds closer to what I really want.
BobbyJim
A: 

This wordgame I slapped up a while back, which behaves very similarly to what you describe, uses English frequency tables to select letters, but decides first whether to generate a vowel or consonant, allowing me to ensure a given rate of vowels on the board. This seems to work reasonably well.
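
A sketch of that two-stage pick in Python. The frequency values below are illustrative placeholders, and treating the 0.559 figure (mentioned in the comments) as the probability of generating a vowel is only one interpretation of the rate.

```python
import random

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
# Illustrative relative frequencies; moonshadow derived real values
# from ebook statistics.
VOWEL_FREQ = {"a": 8.2, "e": 12.7, "i": 7.0, "o": 7.5, "u": 2.8}
CONS_FREQ = {"t": 9.1, "n": 6.7, "s": 6.3, "h": 6.1, "r": 6.0, "d": 4.3,
             "l": 4.0, "c": 2.8, "m": 2.4, "w": 2.4, "f": 2.2, "g": 2.0,
             "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0, "k": 0.8, "j": 0.15,
             "x": 0.15, "q": 0.10, "z": 0.07}

def new_letter(vowel_rate=0.559):
    """First decide vowel vs consonant, then sample within that class."""
    freq = VOWEL_FREQ if random.random() < vowel_rate else CONS_FREQ
    letters, weights = zip(*freq.items())
    return random.choices(letters, weights=weights)[0]
```

Fixing the vowel rate up front is what guarantees the board never starves of vowels, whatever the per-letter frequencies say.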

moonshadow
Thanks. What did you use for the vowel/consonant rate? My feeling is that every local 2x2 grid should probably contain at least one vowel. Otherwise, you might get 'trapped' groups of consonants in the corners that you cannot use in words. Did you just use regular letter frequency tables, and not e.g. paired letter frequencies?
BobbyJim
@Bobby: because the board mutates after each word, the player can "chip away" at clumps of difficult letters over time - one might think of that as part of the game strategy. The vowel/consonant rate is hardwired to 0.559 - I obtained that value and the letter frequencies by gathering stats on some ebooks I had lying around :)
moonshadow
OK, thanks. I've actually tested my game with the falling behavior but I've found players tend to ignore the bottom letters when the letters there aren't very good and they spend all their time at the top. I was thinking about letters falling in from all directions somehow. Or make it a requirement to dispose of old letters. Also, falling letters make it hard to e.g. fix the number of vowels in local grid positions. I might be over thinking this though. :) I would just quite like it if e.g. every grid had at least one long word in it so experts could show off.
BobbyJim
+2  A: 

You should look up n-gramming, and Markovian Models.

Your first idea is very loosely related to Markovian algorithms. Basically, if you have a large text corpus, say of 1000 words, you can analyse each letter and build a table giving the probability of a certain letter following the current letter.

For example, say the letter Q in my 1000 words (4000 letters in total) is used only 40 times. Then I calculate which letters probably follow it using my Markov hash table.

For example, QU happens 100% of the time, so I know that should Q be randomly chosen by your application, I need to make sure that the letter U is also included. Then, the letter "I" follows 50% of the time, "A" 25% of the time and "O" 25% of the time.

It's actually really complicated to explain, and I bet there are other explanations out there which are much better than this.

But the idea is that, given a legitimately large text corpus, you can create a chain of X letters which is probably consistent with the English language and thus should be easy for users to make words out of. You can choose how far to look ahead via the n-gram size; the higher the number, the easier you could make your game. For example, an n-gram of two would probably make it very hard to create words over 6 letters, but an n-gram of 4 would be very easy.

The Wikipedia article explains it really badly, so I wouldn't follow that.

Take a look at this Markov generator:

http://www.haykranen.nl/projects/markov/demo/
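
The next-letter table this answer describes can be sketched in a few lines of Python. The corpus here is a toy list, and the fallback for unseen contexts is an arbitrary choice:

```python
import random
from collections import Counter, defaultdict

CORPUS = ["change", "changing", "chant", "channel", "chance"]
N = 3  # context length: an (N+1)-gram model

# Map each N-letter context to counts of the letters that follow it.
table = defaultdict(Counter)
for word in CORPUS:
    for i in range(len(word) - N):
        table[word[i:i + N]][word[i + N]] += 1

def next_letter(context):
    """Sample the next letter given the last N letters of the chain."""
    counts = table.get(context[-N:])
    if not counts:
        return random.choice("etaoins")  # fallback for unseen contexts
    letters, weights = zip(*counts.items())
    return random.choices(letters, weights=weights)[0]

print(next_letter("chan"))  # one of 'g', 't', 'n', 'c' in this tiny corpus
```

This also answers the "CHAN" question in the comments: the table keyed on "han" here would be keyed on "chan" with N=4.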

Laykes
Thanks, sounds interesting. Could you elaborate a bit more about the n-gram of 4 idea? Would I e.g. pick an adjacent chain of 4 letters, say "C-H-A-N", near my random letter location, then ask a table to pick a letter that usually follows the 3 letters "CHAN" e.g. "G" as in "CHANGING"?
BobbyJim
I've always been scared of Markov chains. The main wiki article is confusing but this one is quite good: http://en.wikipedia.org/wiki/Examples_of_Markov_chains
BobbyJim
n-gramming is just where you break something down into grams of N letters. For example, the 1-grams of the word BOGGLE are: B, O, G, G, L, E. The 2-grams (commonly called bigrams) are: BO, OG, GG, GL, LE. The 3-grams (commonly called trigrams) are: BOG, OGG, GGL, GLE. The 4-grams are: BOGG, OGGL, GGLE. You can see how, if you use a Markov chain with a particular n-gram size, you can group commonly occurring character sequences. Incidentally, as you increase the n-gram size you will find the game becomes easier.
Laykes
A: 

You might look at this Java implementation of the Jumble algorithm to find sets of letters that permute to multiple dictionary words:

$ java -jar dist/jumble.jar | sort -nr | head
11 Orang Ronga angor argon goran grano groan nagor orang organ rogan 
10 Elaps Lepas Pales lapse salep saple sepal slape spale speal 
9 ester estre reest reset steer stere stree terse tsere 
9 caret carte cater crate creat creta react recta trace 
9 Easter Eastre asteer easter reseat saeter seater staree teaser 
9 Canari Carian Crania acinar arnica canari carina crania narica 
8 leapt palet patel pelta petal plate pleat tepal 
8 laster lastre rastle relast resalt salter slater stelar 
8 Trias arist astir sitar stair stria tarsi tisar 
8 Trema armet mater metra ramet tamer terma trame 
...
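
The core of the Jumble trick is cheap to reproduce in any language: sort each word's letters to get a canonical key, and words sharing a key are mutual anagrams. A Python sketch with a toy word list (the real tool runs over a full dictionary):

```python
from collections import defaultdict

WORDS = ["angor", "argon", "groan", "orang", "organ", "caret",
         "carte", "cater", "crate", "react", "recta", "trace", "stack"]

# Group words by their sorted letters: anagram sets share one key.
groups = defaultdict(list)
for w in WORDS:
    groups["".join(sorted(w))].append(w)

# Letter sets that permute to many words, most fruitful first.
ranked = sorted(groups.values(), key=len, reverse=True)
for words in ranked[:2]:
    print(len(words), " ".join(words))
# prints:
# 7 caret carte cater crate react recta trace
# 5 angor argon groan orang organ
```

Letter sets near the top of this ranking are good candidates to feed into the grid, since they permute to many words.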
trashgod