I have an algorithm that generates strings based on a list of input words. How do I separate out only the strings that sound like English words? i.e. discard RDLO while keeping LORD.

EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example, KEAL would be accepted.

+4  A: 

It's quite easy to generate English-sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, and so on, and grade candidates based on that.
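A minimal sketch of that grading idea (the bigram list and the scoring are purely illustrative; in practice you'd derive them from statistics over a real corpus):

# Grade a candidate by how many of its letter pairs appear in a small
# whitelist of common English bigrams. Toy data, not a tuned model.
COMMON_BIGRAMS = {'TH', 'HE', 'IN', 'ER', 'AN', 'RE', 'ND', 'OR', 'EA', 'TI',
                  'AR', 'TE', 'NG', 'AL', 'IT', 'AS', 'IS', 'HA', 'ET', 'SE'}

def bigram_score(word):
    word = word.upper()
    pairs = [word[i:i+2] for i in range(len(word) - 1)]
    if not pairs:
        return 0.0
    return float(sum(1 for p in pairs if p in COMMON_BIGRAMS)) / len(pairs)

With this toy list LORD scores 1/3 (it contains OR) while RDLO scores 0, so even a crude grading separates the two; where to draw the accept/reject line is something you'd have to tune.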

William Keller
A: 

You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.

Ryan Bigg
+21  A: 

You can build a Markov chain from a huge English text.

Afterwards you can feed words into the Markov chain and check how high the probability is that the word is English.

See here: http://en.wikipedia.org/wiki/Markov_chain

At the bottom of the page you can see the Markov text generator. What you want is exactly the reverse of it.

In a nutshell: the Markov chain stores, for each character, the probabilities of which character will follow it. You can extend this idea to the previous two or three characters if you have enough memory.
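A minimal sketch of the reverse direction in Python, assuming an order-1 chain (one preceding character); the corpus filename is only a placeholder:

from collections import defaultdict
import math

def train_markov(text):
    """Estimate P(next char | current char) from a training corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    text = ''.join(c for c in text.upper() if c.isalpha() or c == ' ')
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    probs = {}
    for a, following in counts.items():
        total = float(sum(following.values()))
        probs[a] = dict((b, n / total) for b, n in following.items())
    return probs

def avg_log_prob(word, probs, floor=1e-6):
    """Average log-probability per transition; higher = more English-like."""
    word = word.upper()
    total = 0.0
    for a, b in zip(word, word[1:]):
        total += math.log(probs.get(a, {}).get(b, floor))
    return total / max(len(word) - 1, 1)

# 'english_corpus.txt' is a placeholder for any large English text.
# model = train_markov(open('english_corpus.txt').read())
# avg_log_prob('LORD', model) should come out higher than avg_log_prob('RDLO', model)

Extending it to two or three preceding characters just means keying the counts on the last two or three characters instead of one.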

Nils Pipenbrinck
+1  A: 

Do they have to be real English words, or just strings that look like they could be English words?

If they just need to look like possible English words, you could do some statistical analysis on real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.

Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).
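A rough sketch of the dictionary variant (the word-list path and the handful of suffixes are placeholders; a real check would want proper stemming):

def load_word_set(path):
    # e.g. /usr/share/dict/words, or any word list you have available
    with open(path) as f:
        return set(line.strip().upper() for line in f)

def in_dictionary(word, words):
    word = word.upper()
    if word in words:
        return True
    # Crude allowance for plurals and a few other variations.
    for suffix in ('S', 'ES', 'ED', 'ING'):
        if word.endswith(suffix) and word[:-len(suffix)] in words:
            return True
    return False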

Kevin ORourke
A: 

That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard, though! You'll probably need to manually write out a list of them. For example, "TR" is OK but not "TD", etc.
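A small sketch of that rule: reject any word containing a consonant pair that isn't on a hand-written whitelist (the list here is deliberately tiny and incomplete):

VOWELS = set('AEIOUY')   # treating Y as a vowel is a simplification
# Consonant pairs allowed to appear together; 'TD' is deliberately absent.
OK_CONSONANT_PAIRS = {'TR', 'TH', 'ST', 'CH', 'SH', 'PL', 'PR', 'BL', 'BR',
                      'CR', 'CL', 'DR', 'FR', 'FL', 'GR', 'GL', 'SP', 'SL',
                      'SN', 'SM', 'SW', 'TW', 'WH', 'WR', 'CK', 'NG', 'NT',
                      'ND', 'RD', 'RT', 'LD', 'MP', 'NK'}

def passes_pair_rules(word):
    word = word.upper()
    for a, b in zip(word, word[1:]):
        if a not in VOWELS and b not in VOWELS and a + b not in OK_CONSONANT_PAIRS:
            return False
    return True

With this toy list LORD and KEAL pass, while RDLO is rejected on its DL pair.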

Iain
+1  A: 

I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.

Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.

Soundex is very easy to implement; see Wikipedia for a description of the algorithm.

An example implementation of what you want to do would be:

def soundex(name, length=4):
    """Compute the Soundex code for a word (simplified implementation)."""
    # Code digit for each letter A-Z; 0 marks vowels and other ignored letters.
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''   # first alphabetic character, kept verbatim in the code

    for c in name.upper():
        if c.isalpha():
            if not fc: fc = c
            d = digits[ord(c)-ord('A')]
            # Skip adjacent duplicate codes (e.g. the two Ls in "LLOYD").
            if not sndx or (d != sndx[-1]):
                sndx += d

    # Replace the first digit with the first letter, drop vowel codes,
    # then pad or truncate to the requested length.
    sndx = fc + sndx[1:]
    sndx = sndx.replace('0','')
    return (sndx + (length * '0'))[:length]

real_words = load_english_dictionary()
# A set makes the membership test below O(1).
soundex_cache = set(soundex(word) for word in real_words)

# 'candidate' is the string under test.
if soundex(candidate) in soundex_cache:
    print "keep"
else:
    print "discard"

Obviously you'll need to provide an implementation of load_english_dictionary.

EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.

Russ
A: 

I would probably evaluate each word using a SOUNDEX algorithm against a database of English words. If you're doing this on SQL Server it should be pretty easy to set up a database containing a list of most English words (using a freely available dictionary), and MS SQL Server has SOUNDEX available as a built-in function.

Obviously you can implement this yourself if you want, in any language, but it might be quite a task.

This way you'd get an evaluation of how much each word sounds like an existing English word, if any, and you could set up a threshold for how weak a match you'd still accept. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance limits based on testing.

ulrikj
+2  A: 

Do you really mean phonetics/sound? Do you want to detect on the basis of sound?

I assume not, and therefore I would say:

Theoretically, there are two different basic approaches for such a task:

A) The algorithmic approach: for that you need a list, as close to complete as possible, of all existing English words, which you can then search. You could let your script query LEO, for example.

B) The heuristic approach: for that you would need to define rules about what English words look like and assign a probability to each rule. For example: the word consists of only consonants -> p = .01 that it is an English word; the word contains three consonants in a row -> p = .12 that it is an English word; and so on. Then you would combine these rules for the word in question and decide, according to a given cut-off point, say 90%, whether to treat it as an English word, as sketched below.

Another approach would be to use Google to search for the word on English-language sites only, and to set the cut-off point at a certain number of search results.

Such heuristic approaches can be improved by collecting the results, evaluating them, and adjusting your probability figures accordingly.
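A very rough sketch of approach B, using only the two example rules above; the probability figures and the way they are multiplied together are illustrative, not calibrated:

import re

# Each rule: (pattern, probability that a word matching it is still English).
RULES = [
    (re.compile(r'^[^AEIOUY]+$'), 0.01),   # the word consists of only consonants
    (re.compile(r'[^AEIOUY]{3}'), 0.12),   # three consonants in a row
]

def english_probability(word):
    word = word.upper()
    p = 1.0
    for pattern, penalty in RULES:
        if pattern.search(word):
            p *= penalty
    return p

def accept(word, cutoff=0.9):
    return english_probability(word) >= cutoff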

tharkun
A: 

I'd suggest that a few simple rules plus standard pairs and triplets would be good.

For example, English-sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some diphthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be English. You'd find on closer inspection that you will probably strip out a lot of words that do sound like English as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.

You won't avoid all false negatives (e.g. I don't think you could manage to come up with a rule that includes 'rhythm' without explicitly coding in that 'rhythm' is a word), but it will provide a method of filtering.

I'm also assuming that you want strings that could be English words (they sound reasonable when pronounced) rather than strings that are definitely words with an English meaning.

workmad3
+12  A: 

The easy way is with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian):

from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')

>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]

>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
e-satis
+2  A: 

You should research "pronounceable" password generators, since they're trying to accomplish the same task.

A Perl solution would be Crypt::PassGen, which you can train with a dictionary (so you could train it on various languages if you need to). It walks through the dictionary and collects statistics on 1-, 2-, and 3-letter sequences, then builds new "words" based on relative frequencies.

Andrew Barnett
+1  A: 

Metaphone and Double Metaphone are similar to SOUNDEX, except that they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much for other languages or proper names).

One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.
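A minimal sketch of matching candidates against a dictionary by Metaphone code; it assumes a third-party implementation such as the jellyfish package (not something this answer specifies), and any other Metaphone implementation would do:

import jellyfish   # assumed third-party package providing metaphone()

def build_metaphone_index(words):
    """Map each Metaphone code to the dictionary words that produce it."""
    index = {}
    for word in words:
        index.setdefault(jellyfish.metaphone(word), []).append(word)
    return index

def sounds_like_a_word(candidate, index):
    return jellyfish.metaphone(candidate) in index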

Andrew Barnett
+2  A: 
Josh Millard
A: 

I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm

Russell
Index of coincidence? That would show if there's a distribution similar to standard English with a simple substitution cipher applied, but that's not what the question is about.
David Thornley