views:

329

answers:

3

I'm tinkering with a domain name finder and want to favour those words which are easy to pronounce.

Example: nameoic.com (bad) versus namelet.com (good).

I was thinking something based on Soundex might be appropriate, but it doesn't look like I can use it to produce any sort of comparative score.
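To illustrate that limitation, here is a quick check with PHP's built-in soundex(). It returns a short key meant for matching similar-sounding words, not a number that ranks them, so on its own it can't say which of the two example names is easier to pronounce.

```php
<?php
// soundex() maps each word to a 4-character key (one letter plus three digits).
// Keys can be compared for equality, but they carry no notion of "easier" or "harder".
foreach (array('namelet', 'nameoic') as $name) {
    echo $name . ' => ' . soundex($name) . "\n";
}
```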

PHP code for the win.

+4  A: 

I think the problem could be boiled down to parsing the word into a candidate set of phonemes, then using a predetermined list of phoneme pairs to determine how pronounceable the word is.

For example: "skill" phonetically is "/s/k/i/l/". "/s/k/", "/k/i/" and "/i/l/" should all have high pronounceability scores, so the word should score highly.

"skpit" phonetically is "/s/k/p/i/t/". "/k/p/" should have a low pronounceability score, so the word should score low.

Jeffrey Kemp
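
The pair-scoring idea above could be sketched roughly as follows. This is only an illustration under strong assumptions: letters stand in for phonemes (no real phonetic parsing is done), the $pairScores table and the pairScore() function are hypothetical, and the scores in the table are hand-picked guesses rather than measured data. The word's score is taken to be that of its worst pair, so that a single awkward pair such as "kp" drags the whole word down.

```php
<?php
// Hypothetical pair table: values are illustrative guesses, not real data.
// Letters are used as a crude stand-in for phonemes.
$pairScores = array(
    'sk' => 1.0,
    'ki' => 1.0,
    'il' => 1.0,
    'll' => 1.0,
    'pi' => 1.0,
    'it' => 1.0,
    'kp' => 0.0,
);

// Score a word by its least pronounceable adjacent pair.
// Pairs missing from the table get the neutral $default score.
function pairScore($word, array $pairScores, $default = 0.5) {
    $word = strtolower(preg_replace('/[^a-z]/i', '', $word));
    $len = strlen($word);

    // No pairs to inspect: treat a single letter as fine, an empty string as not.
    if ($len < 2) return $len === 1 ? 1.0 : 0.0;

    $worst = 1.0;
    for ($i = 0; $i < $len - 1; $i++) {
        $pair = substr($word, $i, 2);
        $score = isset($pairScores[$pair]) ? $pairScores[$pair] : $default;
        $worst = min($worst, $score);
    }

    return $worst;
}

echo pairScore('skill', $pairScores) . "\n"; // every pair is in the table with a high score
echo pairScore('skpit', $pairScores) . "\n"; // contains the low-scoring pair "kp"
```

Averaging the pair scores instead of taking the minimum would be a gentler alternative; which works better would depend on the table used.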
+8  A: 

Here is a function which should work for most common words. It returns a score between 0 and 1, where 1 means perfect pronounceability according to its rules.

The function is far from perfect (for example, it doesn't quite like words such as Tsunami [0.857]), but it should be fairly easy to tweak for your needs.

<?php
// Score: 1
echo pronounceability('namelet') . "\n";

// Score: 0.71428571428571
echo pronounceability('nameoic') . "\n";

function pronounceability($word) {
    static $vowels = array('a', 'e', 'i', 'o', 'u', 'y');

    static $composites = array('mm', 'll', 'th', 'ing');

    if (!is_string($word)) return false;

    // Remove non letters and put in lowercase
    $word = preg_replace('/[^a-z]/i', '', $word);
    $word = strtolower($word);

    // Special case
    if ($word == 'a') return 1;

    $len = strlen($word);

    // Let's not parse an empty string
    if ($len == 0) return 0;

    $score = 0;
    $pos = 0;

    while ($pos < $len) {
        // Check for an allowed composite starting at this position
        foreach ($composites as $comp) {
            $complen = strlen($comp);

            // <= so that a composite ending the word (e.g. "-ing") also matches
            if (($pos + $complen) <= $len) {
                $check = substr($word, $pos, $complen);

                if ($check == $comp) {
                    $score += $complen;
                    $pos += $complen;
                    continue 2;
                }
            }
        }

        // Is it a vowel? If so, check that the previous letter wasn't a vowel too.
        if (in_array($word[$pos], $vowels)) {
            if (($pos - 1) >= 0 && !in_array($word[$pos - 1], $vowels)) {
                $score += 1;
                $pos += 1;
                continue;
            }
        } else { // Not a vowel: check if the next letter is one, or if this is the end of the word
            if (($pos + 1) < $len && in_array($word[$pos + 1], $vowels)) {
                $score += 2;
                $pos += 2;
                continue;
            } elseif (($pos + 1) == $len) {
                $score += 1;
                break;
            }
        }

        $pos += 1;
    }

    return $score / $len;
}
Andrew Moore
Yeah, it sort of works. I notice 'wptmimi' scores the same as 'goodbye' (both .57). I'm going to use it and say anything less than .5 is not pronounceable.
Mike Blandford
+1  A: 

Use a Markov model (on letters, not words, of course). The probability of a word is a pretty good proxy for ease of pronunciation. You'll have to normalize for length, since longer words are inherently less probable.
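
A minimal sketch of that idea, assuming a first-order (letter-bigram) model with add-one smoothing. The function names trainBigrams() and markovScore() are made up for this example, and the seven-word training list is only a placeholder; a usable model would need to be trained on a large dictionary. The score is the average log-probability per letter transition, which is one simple way to normalize for length.

```php
<?php
// Count letter-to-letter transitions in a word list.
// '^' and '$' are added as start and end markers.
function trainBigrams(array $words) {
    $counts = array();
    foreach ($words as $w) {
        $w = '^' . strtolower($w) . '$';
        for ($i = 0; $i < strlen($w) - 1; $i++) {
            $a = $w[$i];
            $b = $w[$i + 1];
            if (!isset($counts[$a][$b])) $counts[$a][$b] = 0;
            $counts[$a][$b]++;
        }
    }
    return $counts;
}

// Average log-probability per transition (closer to 0 is better).
// Add-one smoothing over an assumed 27 possible next symbols (a-z plus '$')
// keeps unseen transitions from producing log(0).
function markovScore($word, array $counts, $alphabetSize = 27) {
    $w = '^' . strtolower($word) . '$';
    $n = strlen($w) - 1; // number of transitions
    $logp = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $a = $w[$i];
        $b = $w[$i + 1];
        $row = isset($counts[$a]) ? $counts[$a] : array();
        $seen = isset($row[$b]) ? $row[$b] : 0;
        $logp += log(($seen + 1) / (array_sum($row) + $alphabetSize));
    }
    return $logp / $n; // dividing by length normalizes for longer words
}

// Placeholder corpus, far too small for real use.
$model = trainBigrams(array('name', 'let', 'table', 'melon', 'tame', 'mane', 'lemon'));

echo markovScore('namelet', $model) . "\n";
echo markovScore('nameoic', $model) . "\n";
```

With this toy corpus 'namelet' scores higher (less negative) than 'nameoic', because every transition in 'namelet' was seen in training while 'eo', 'oi' and 'ic' were not.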

Kevin Peterson