views:

490

answers:

7

Start with this:

[G|C] * [T] *

Write a program that generates this:

Cat
Cut
Cute
City <-- NOTE: this one is wrong, because City has an "ESS" sound at the start.
Caught
...
Gate
Gotti
Gut
...
Kit
Kite
Kate
Kata
Katie

Another Example, This:

[C] * [T] * [N]

Should produce this:

Cotton Kitten

Where should I start my research as I figure out how to write a program/script that does this?

+4  A: 

You can do this by using regular expressions against a dictionary containing phonetic versions of words.

Here's an example in Javascript:

     <html>
    <head>
        <title>Test</title>
        <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"&gt;&lt;/script&gt;
        <script>

            $.get('cmudict0.3',function (data) {
                matches = data.match(/^(\S*)\s+K.*\sT.*\sN$/mg);
                $('body').html('<p>'+matches.join('<br/> ')+'</p>');
            })

        </script>
    </head>
    <body>
    </body>
</html>

You'll need to download the list of all words from http://icon.shef.ac.uk/Moby/mpron.tar.Z and put it (uncompressed) in the same folder as the HTML file. I've only translated the [C] * [T] * [N] version into a regular expression and the output isn't very nice but it'll give you the idea. Here's a sample of the output:

CALTON K AE1 L T AH0 N
CAMPTON K AE1 M P T AH0 N
CANTEEN K AE0 N T IY1 N
CANTIN K AA0 N T IY1 N
CANTLIN K AE1 N T L IH0 N
CANTLON K AE1 N T L AH0 N
...
COTTERMAN K AA1 T ER0 M AH0 N
COTTMAN K AA1 T M AH0 N
COTTON K AA1 T AH0 N
COTTON(2) K AO1 T AH0 N
COULSTON K AW1 L S T AH0 N
COUNTDOWN K AW1 N T D AW2 N
..
KITSON K IH1 T S AH0 N
KITTELSON K IH1 T IH0 L S AH0 N
KITTEN K IH1 T AH0 N
KITTERMAN K IH1 T ER0 M AH0 N
KITTLESON K IH1 T L IH0 S AH0 N
...
Rich
Regular expressions can't handle sounds.
danben
@danben: I've updated my answer to reflect a correct understanding of the problem.
Rich
Ok, downvote removed.
danben
+1  A: 

One approach would be to transform an English pronounciation dictionary into a finite state machine and then search it using a regular expression or a simple wildcard mechanism. You could also compile a such a dictionary yourself by running an English word list through a program that produces phonetic transcriptions, e.g. like the ones found on these sites:

Finding a mechanism to map back from the phonetic transcription to standard spelling should be easy.

ferdystschenko
Regular expressions can't handle sounds.
danben
True. The original post, before the edit, didn't make it clear enough that the problem is about pronounciation and not just spelling.
ferdystschenko
Really? It said "consonant sounds" in the title, and provided the example of "kitten" as a match for `C*T*`.
danben
Actually, regexes *can* handle sounds if you transform your dictionary into one containing phonetic transcriptions and build a corresponding fsa with that.
ferdystschenko
I am glad you changed your answer to address the crux of the problem. However, you should understand that regular expressions / state machines have no way to match similar-sounding words on their own. What you are talking about is preprocessing the input into something that can then be matched dumbly.
danben
C'mon, this was obvious. You might call it preprocessing and dumb matching, but it's an excellent solution to the problem. In fact, it's the *only* reasonable solution I've seen so far. Even though most other answers involve some kind of pronounciation dictionary, they don't adress the actual matching problem when vowels are removed from the input.
ferdystschenko
+1  A: 

A phoneme is "the smallest segmental unit of sound employed to form meaningful contrasts between utterances." It is my understanding that this is the basis for pronunciation-based spelling correction systems. Misspelling newspaper as noospaypr might generate the proper correction despite a large edit distance between the two words, because the corresponding segments in each word (oo and ew, pa and pay, per and pr) may be converted to the same phoneme.

Unfortunately, a couple of minutes of me Googling didn't find any libraries that will perform the conversion for english words, but that is where I would start.

danben
+4  A: 

You need a word list or dictionary that uses something like the International Phonetic Alphabet, or some other standard phonetic way of writing words. It would need to have a list of English words and their corresponding phonetic spellings. I have no idea where you would get one because I don't think that the standard dictionary makers just hand out that sort of information.

Justin Peel
+2  A: 

You want moby pronounciation. It's part of the moby words project.

You'll find an explanation and links to the documents here: http://en.wikipedia.org/wiki/Moby_Project

Moby pronounciation is a list of about 170k words and their phonetic pronounciations.

From there it should be a relatively straight forward process to build the program.

ablerman
A: 

Here is an example of output from Microsoft Word. Note that this almost does exactly what I want. Word does not allow wildcard characters to be substituted for vowels, and it does not allow alternates (e.g., [C|G] instead of just [C] ... nevertheless, it does suggest based on consonant "sound-alikes").

Therefore this is indeed very close to what I am trying to program myself.

Example output from Microsoft Word 2007 spellcheck

dreftymac
To me, this looks like a simple similar-words output based on edit distance (usually implemented using an fsa ;-) ). If the output is really based on similar pronounciation, then there's surely some mapping to a pronounciation dictionary going on, like suggested in the other answers.
ferdystschenko
+1  A: 

You can somewhat do this using the steps I outline. I will outline the algorithm first followed by some (untested and quite possibly broken) java code.

Note: I will be using the apache commons-codec library.


Algorithm:

  1. Use a regular expression to represent your input pattern.
  2. From a lexicon of "valid known words" filter the subset that matches your regular expression. Let's call this the matched subset (MS)
  3. Use the Double Metaphone algorithm to encode these words from MS.
  4. Apply some phonetic filtering to prune MS to your needs.

To illustrate how steps 3 and 4 work, I will first show you the output of the Double Metaphone algorithm on the five words you have suggested as examples: Cute, Cat, Cut, Caught, City

Code A (illustrating Double Metaphone):

private static void doubleMetaphoneTest() {
    org.apache.commons.codec.language.DoubleMetaphone dm = new DoubleMetaphone();
    System.out.println("Cute\t"+dm.encode("Cute"));
    System.out.println("Cat\t"+dm.encode("Cat"));
    System.out.println("Cut\t"+dm.encode("Cut"));
    System.out.println("Caught\t"+dm.encode("Caught"));
    System.out.println("City\t"+dm.encode("City"));
}

Output of code A

Cute   KT
Cat    KT
Cut    KT
Caught KFT
City   ST

Now in your question, you have stated that City is not a right solution because it begins with an "ESS" sound. Double Metaphone will help you to identify exactly this kind of issue (although I am sure there will be cases where it will fail to help). Now you can apply step 4 in the algorithm using this principle.


In the following code, for step 4 (apply some phonetic filtering), I will assume that you already know that you only want the 'K' sound and not the 'S' sound.

Code B (prototype solution to entire question)

Note: This code is meant to illustrate the use of the DoubleMetaphone algorithm for your purpose. I haven't run the code. The regex may be broken or may be a really lame one or my use of Pattern Matcher may be wrong (It's 2AM now). If it is wrong please improve/correct it.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.codec.language.DoubleMetaphone;

public class GenerateWords {

    /**
     * Returns a set of words that conform to the input pattern
     * @param inputPattern a regular expression
     * @param lexicon a list of valid words
     */
    public static List<String> fetchMatchingWordsFromLexicon(String inputPattern, List<String> lexicon){
        /* E.g. for the case           [C] * [T] * [N] 
         * the regex is:
         *  [Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]+
         */
        Pattern p = Pattern.compile(inputPattern);
        List<String> result = new ArrayList<String>();
        for(String aWord:lexicon){
            Matcher m = p.matcher(aWord);
            if(m.matches()){
                result.add(aWord);
            }
        }
        return result; 
    }

    /**
     * Returns the subset of the input list that "phonetically" begins with the character specified.
     * E.g. The word 'cat' begins with 'K' and the word 'city' begins with 'S'
     * @param prefix
     * @param possibleWords
     * @return
     */
    public static List<String> filterWordsBeginningWithMetaphonePrefix(char prefix, List<String> possibleWords){
        List<String> result = new ArrayList<String>();
        DoubleMetaphone dm = new DoubleMetaphone();
        for(String aWord:possibleWords){
            String phoneticRepresentation = dm.encode(aWord); // this will always return in all caps
            // check if the word begins with the prefix char of interest
            if(phoneticRepresentation.indexOf(0)==Character.toUpperCase(prefix)){
                result.add(aWord);
            }
        }
        return result;
    }

    public static void main(String args[]){

        // I have not implemented this method to read a text file etc.
        List<String> lexicon = readLexiconFromFileIntoList();
        String regex = "[Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]+";
        List<String> possibleWords = fetchMatchingWordsFromLexicon(regex,lexicon);

        // your result
        List<String> result = filterWordsBeginningWithMetaphonePrefix('C', possibleWords);

        // print result or whatever

    }
}
hashable