views:

499

answers:

6

I created an anagram creating application, by creating an anagram field in my database, with a lower cased alphabetically stored string.

For example, suction becomes cinostu, ear becomes aer, and so on.

What I want to do now is create sub words from the original anagram searched for.

Example: How would you go about pulling the subset words from a search for arrest, ie 'rest' and 'stare' - http://bit.ly/wTQqN

+1  A: 

Include a space on the end of the original word. Each iteration where the space ends up in the middle of letters, you'll get two words. Then you can test those two words. If the space is on the beginning or end of the iteration pattern, trim it off and test that one word.

Mark Jones
This would probably work, but it seems like a long way around. I can't think of anything better at the moment, but I *feel* like there is an easier way... somehow. Good idea.
anonymous coward
I'm trying to figure out exactly what you mean. Do you mean arrest would come back as stare, or ear or rest. Or that one word would combine with another, always equaling the number of chars in the original anagram.
So "arrest" becomes "arrest " before you start. Note the new space at the end. Some iteration is going to return "r staer". Neither word makes sense. The next iteration would return "r stare". Aha! "stare"! Eventually, one will return "ar rest". Another will return "star re". Etc.
Mark Jones
A: 

This approach is slightly different from yours, but I believe it will be easy to implement programmatically. I'm not sure it's optimal performance wise, but I'll leave that to you :-)

First you need a dictionary of all the legal words you want to be able to match.

Create a "Dictionary" or "Words" table in your database, with the first column storing the actual word, the second column storing the word converted all to upper or lower case for easy comparison, and then one integer column for every letter in the alphabet A-Z.

Import your dictionary file into this table, and programmatically count the number of times each letter of the alphabet appears in this word, and store this number in the column for that letter.

Example Word: bookkeeper

Store the word "bookkeeper" in the word column, 1 in your "b", "p", and "r" columns, 2 in your "o" and "k" columns, and 3 in your "e" column.

Once you have your entire dictionary imported with letter counts, you can fairly easily determine all the possible sub-words in a given word using the following method:

  • Count the letters in your string.
  • Compose a SQL query that returns all the words form your dictionary table that do not use letters not found in your given word, or have more of any particular letter than exist in your word.

You could achieve this by making an in-memory array with 26 positions representing the alphabet

Example word: vehicle

SELECT Word FROM Dictionary WHERE NOT (
  (a >= 1) OR (b >= 1) OR (c >= 2) ... OR (z >= 1)
)

Thus any word in your dictionary that has an 'a' or a 'z' in it are ruled out, since the query will filter out any words where the 'a' or 'z' count is at least one, and any word that uses more than one 'c' is filtered out.

You can easily generate all the "OR" conditions programmatically by using an array of 26 integers, all starting at 1, and then go through your word, adding 1 to the appropriate array value of each letter you find.

UPDATE - final count example code

Forgive my code sample below - it is going to be in ASP (VBScript) - but you should be able to grasp and translate to PHP, or have a kind person do this for you if not.

Const AsciiCodeLowerCaseA = 97
InputWord = "Carrots"
LowerCaseInputWord = LCase(InputWord)

Dim LetterCount(26)

for i = 1 to 26
  LetterCount(i) = 1
next

for j = 1 to Len(InputWord)
  CurrentLetter = Mid(InputWord, j, 1)
  AsciiCode = Chr(CurrentLetter)
  AlphabetPos = AsciiCode - AsciiCodeLowerCaseA + 1
  LetterCount(AlphabetPos) = LetterCount(AlphabetPos) + 1
next

By converting each letter of the word to its ASCII value, then subtracting the ascii code for lower case 'a' and adding 1, you get the position of that letter in the alphabet from 1 to 26. You now add 1 to that position in the array.

It seems counterintuitive, but initialise all the letters to 1 in your array. When you build the SQL statement, you are eliminating all words with letter counts higher than your input word - so if a letter does not appear in your original word, you filter out words that have one or more of that letter. If the letter appears once, you filter out words that have two or more of that letter, and so on.

Bork Blatt
I have actually got up to having all the a-z's in my database. I was working along these lines. It was just the query I was struggling on. This may be the answer I was looking for. I will give this a try and let you know how I get on. Thanks Bork.
So I've created a 26 key array: $alpha_array = array("a" => 0, "b" => 0, "c" => 0) and so on. What I want to do is walk through my exploded input string..ie the one the user enters, and if a char exists I want to edit the $alpha_array and add one to the array in that instance.Then I can construct an SQL statement after this. Any ideas?
I'll modify my answer to try and explain this.
Bork Blatt
A: 

Hey Bork. been trying to adapt your code into PHP, and I have the following:

$LetterCount = array("a" => 1, "b" => 1, "c" => 1, "d" => 1, "e" => 0, "f" => 1, "g" => 1, "h" => 1, "i" => 1, "j" => 1, "k" => 1, "l" => 1, "m" => 1, "n" => 1, "o" => 1, "p" => 1, "q" => 1, "r" => 1, "s" => 1, "t" => 1, "u" => 1, "v" => 1, "w" => 1, "x" => 1, "y" => 1, "z" => 1);

$AsciiCodeLowerCaseA = 97;

for ($j = 1; $j < strlen($string); $j++) {
  $CurrentLetter = $string[$j];
  $AsciiCode = ord($CurrentLetter);
  $AlphabetPos = $AsciiCode - $AsciiCodeLowerCaseA + 1;
      $LetterCount[$AlphabetPos] = $LetterCount[$AlphabetPos] + 1;
}

I hardcoded the array declaration bit to save time.

Anyway, It didnt seem to work and gave me this error: Notice: Undefined offset: 1

Here is a screenshot of the errors I am getting, I have also added echos for each var or array in the loop to see if you can understand what is going on.

http://i42.tinypic.com/11ryz4g.png

I think it isnt identifying the aplhabet letter in the array correctly, and thus is adding numbers incorrectly to the end of the array.

Let me know what you think I should do.

+1  A: 

I haven't thought about this meaningfully yet, sorry (work to do!), but however you end up generating the words, don't forget that this will cache like a motherlover, so don't go re-generating these on the fly every time someone searches.

CS.

+1  A: 

Here's an approach I've used before that makes use of your list of alphabetically-sorted words.

1) Take your target word (arrest) and sort it (aerrst).

2) Then from the sorted word generate new strings where each letter is either included or excluded. For a word of N letters this gives 2**N possible strings. (I don't know PHP but can give you pseudocode or eg Python if you would like.)

For your target word we have: a, e, r, r, s, t, st, rs, rt, rst, rr, rs, rt, rst, rrs, rrt, rrst, er, er, es, et, est, ers, ert, erst, err, ers, ert, erst, errs, errt, errst, ae, ar, ar, as, at, ast, ars, art, arst, arr, ars, art, arst, arrs, arrt, arrst, aer, aer, aes, aet, aest, aers, aert, aerst, aerr, aers, aert, aerst, aerrs, aerrt, aerrst

3) Then check these strings against your sorted list. Those that appear in your sorted list correspond to the subset words you want.

eg aerrst corresponds to full anagrams (arrest, rarest, raster, ...)
eg aerst will be in your sorted list (stare, tears, ...)
eg rrs will not be in your sorted list

jonpalin
A: 

Andy,

I think you need to convert the ASCII code back into a character - you are indexing the array with letters, yet you are accessing it with ASCII values.

Here's your code, modified slightly:

$LetterCount = array("a" => 1, "b" => 1, "c" => 1, "d" => 1, "e" => 0, "f" => 1, "g" => 1, "h" => 1, "i" => 1, "j" => 1, "k" => 1, "l" => 1, "m" => 1, "n" => 1, "o" => 1, "p" => 1, "q" => 1, "r" => 1, "s" => 1, "t" => 1, "u" => 1, "v" => 1, "w" => 1, "x" => 1, "y" => 1, "z" => 1);

$AsciiCodeLowerCaseA = 97;

for ($j = **0**; $j < strlen($string); $j++) {
  $CurrentLetter = $string[$j];
  $AsciiCode = ord($CurrentLetter);
  $AlphabetPos = **chr($AsciiCode - $AsciiCodeLowerCaseA + 1);**
  $LetterCount[$AlphabetPos] = $LetterCount[$AlphabetPos] + 1;
}

Also I just noticed you're indexing the characters in the string from 1, but arrays are zero-idexed.

I think this might be much simpler as well (unless I'm missing something)

for($j = 0; $j < strlen($string); $j++) {
$LetterCount[$string[$j]]++;
}
Rob
Thanks Rob. Its working now... Takes 20 secs to execute the page regardless of the length of the number of letters the query is.Acceptable speed, will look to improve it now.Thanks again everyone!
Down to 1 sec now :) Sorted.