You can get part of the way there with the steps below. I will outline the algorithm first, followed by some untested (and quite possibly broken) Java code.
Note: I will be using the Apache Commons Codec library (commons-codec).
Algorithm:
1. Use a regular expression to represent your input pattern.
2. From a lexicon of "valid known words", filter the subset that matches your regular expression. Let's call this the matched subset (MS).
3. Use the Double Metaphone algorithm to encode the words in MS.
4. Apply some phonetic filtering to prune MS to your needs.
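For step 1, the regular expression does not have to be written by hand. Here is a minimal sketch that builds one from the consonants of the pattern. The class and method names (`ConsonantPattern`, `toRegex`, `matches`) are hypothetical, and the sketch makes two assumptions that are mine rather than from the question: the input contains only plain ASCII letters, and trailing vowels are optional so that both "Cat" and "Cute" match the consonants "CT".

```java
import java.util.regex.Pattern;

public class ConsonantPattern {
    // Vowel class used between consonants (assumption: 'y' counts as a vowel)
    private static final String VOWELS = "[aeiouyAEIOUY]";

    /** Builds a regex: each consonant (possibly doubled), in order, separated by one or more vowels. */
    public static String toRegex(String consonants) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < consonants.length(); i++) {
            char c = consonants.charAt(i);
            if (i > 0) {
                regex.append(VOWELS).append('+'); // at least one vowel between consonants
            }
            regex.append('[')
                 .append(Character.toUpperCase(c))
                 .append(Character.toLowerCase(c))
                 .append("]+");
        }
        // Assumption: vowels after the last consonant are optional
        return regex.append(VOWELS).append('*').toString();
    }

    /** True if the whole word fits the consonant pattern. */
    public static boolean matches(String consonants, String word) {
        return Pattern.matches(toRegex(consonants), word);
    }
}
```

With this sketch, "Caught" does not match the consonants "CT" because "gh" is not in the vowel class; if such words should qualify, the pattern between consonants would have to be loosened.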
To illustrate how steps 3 and 4 work, I will first show you the output of the Double Metaphone algorithm on the five words you have suggested as examples: Cute, Cat, Cut, Caught, City
Code A (illustrating Double Metaphone):
// Requires: import org.apache.commons.codec.language.DoubleMetaphone;
private static void doubleMetaphoneTest() {
    DoubleMetaphone dm = new DoubleMetaphone();
    // encode() returns the primary Double Metaphone code of each word
    System.out.println("Cute\t" + dm.encode("Cute"));
    System.out.println("Cat\t" + dm.encode("Cat"));
    System.out.println("Cut\t" + dm.encode("Cut"));
    System.out.println("Caught\t" + dm.encode("Caught"));
    System.out.println("City\t" + dm.encode("City"));
}
Output of code A:
Cute KT
Cat KT
Cut KT
Caught KFT
City ST
In your question, you state that City is not a valid solution because it begins with an "ESS" sound. Double Metaphone helps you identify exactly this kind of issue: City encodes to ST, while the other four words all encode with a leading K. (I am sure there will be cases where it fails to help.) You can apply step 4 of the algorithm using this principle.
In the following code, for step 4 (apply some phonetic filtering), I will assume that you already know that you only want the 'K' sound and not the 'S' sound.
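If you do not know in advance which sound you want, one option is to group the candidates by the first character of their phonetic code and inspect the groups. The following is only a sketch: `PhoneticGroups` and `groupByLeadingSound` are hypothetical names, and the encoder is passed in as a `Function<String, String>` so the code does not depend on any particular library. With commons-codec on the classpath, `new DoubleMetaphone()::encode` should fit that parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class PhoneticGroups {
    /**
     * Groups words by the first character of their phonetic code.
     * The encoder is any function from a word to its code (assumed upper case).
     */
    public static Map<Character, List<String>> groupByLeadingSound(
            List<String> words, Function<String, String> encoder) {
        Map<Character, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            String code = encoder.apply(word);
            if (code == null || code.isEmpty()) {
                continue; // no phonetic code, nothing to group on
            }
            groups.computeIfAbsent(code.charAt(0), k -> new ArrayList<>()).add(word);
        }
        return groups;
    }
}
```

Fed with the codes from the output of code A, this would put Cute, Cat, Cut and Caught under 'K' and City under 'S'.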
Code B (prototype solution to the entire question):
Note: This code is meant to illustrate the use of the Double Metaphone algorithm for your purpose. I haven't run the code. The regex may be broken or naive, or my use of Pattern and Matcher may be wrong (it's 2 AM here). If it is wrong, please improve or correct it.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.codec.language.DoubleMetaphone;
public class GenerateWords {
    /**
     * Returns the list of words that conform to the input pattern
     * @param inputPattern a regular expression
     * @param lexicon a list of valid words
     * @return the words from the lexicon that match the pattern in full
     */
    public static List<String> fetchMatchingWordsFromLexicon(String inputPattern, List<String> lexicon){
        /* E.g. for the case [C] * [T] * [N]
         * the regex is:
         * [Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]+
         */
        Pattern p = Pattern.compile(inputPattern);
        List<String> result = new ArrayList<String>();
        for(String aWord:lexicon){
            Matcher m = p.matcher(aWord);
            // matches() succeeds only if the whole word fits the pattern
            if(m.matches()){
                result.add(aWord);
            }
        }
        return result;
    }
    /**
     * Returns the subset of the input list that "phonetically" begins with the character specified.
     * E.g. The word 'cat' begins with 'K' and the word 'city' begins with 'S'
     * @param prefix the Double Metaphone code character the words must start with, e.g. 'K'
     * @param possibleWords the candidate words to filter
     * @return the words whose Double Metaphone encoding starts with prefix
     */
    public static List<String> filterWordsBeginningWithMetaphonePrefix(char prefix, List<String> possibleWords){
        List<String> result = new ArrayList<String>();
        DoubleMetaphone dm = new DoubleMetaphone();
        char wanted = Character.toUpperCase(prefix);
        for(String aWord:possibleWords){
            String phoneticRepresentation = dm.encode(aWord); // this will always return in all caps
            // compare the first character of the encoding with the prefix of interest;
            // skip words that produce no encoding at all
            if(phoneticRepresentation != null && !phoneticRepresentation.isEmpty()
                    && phoneticRepresentation.charAt(0)==wanted){
                result.add(aWord);
            }
        }
        return result;
    }
    public static void main(String[] args){
        // readLexiconFromFileIntoList() is not implemented here; it should load the word list, e.g. from a text file
        List<String> lexicon = readLexiconFromFileIntoList();
        String regex = "[Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]+";
        List<String> possibleWords = fetchMatchingWordsFromLexicon(regex, lexicon);
        // your result: keep only the words whose phonetic code starts with the 'K' sound
        // (Double Metaphone encodes a hard C as 'K', so the prefix is 'K' rather than 'C')
        List<String> result = filterWordsBeginningWithMetaphonePrefix('K', possibleWords);
        // print result or whatever
    }
}
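Code B leaves `readLexiconFromFileIntoList()` unimplemented. One possible way to fill it in is sketched below, under assumptions that are mine: the lexicon is a UTF-8 text file with one word per line, and the method takes the file path as a parameter (Code B calls it without arguments, so the call site would need adjusting). The `LexiconReader` class and the `cleanLines` helper are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LexiconReader {
    /** Trims each line and drops blank ones; assumes one word per line. */
    public static List<String> cleanLines(List<String> rawLines) {
        List<String> words = new ArrayList<>();
        for (String line : rawLines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Reads a UTF-8 word list from disk and returns the cleaned words. */
    public static List<String> readLexiconFromFileIntoList(Path lexiconFile) {
        try {
            return cleanLines(Files.readAllLines(lexiconFile, StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Rethrow unchecked so callers such as main() need no throws clause
            throw new UncheckedIOException(e);
        }
    }
}
```

From `main`, this could be called as `LexiconReader.readLexiconFromFileIntoList(java.nio.file.Paths.get("words.txt"))`, where `words.txt` is a placeholder file name.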