ansaurus

Question

Ruby, Count syllables

Answer 1

A:

To begin with, it seems you should decrement len for the suffixes that should be excluded.

# (?:ing|es|ed) is an alternation; [ing,es,ed] would match any single one of those characters
len -= 1 if /(?:ing|es|ed)$/.match(word)

You could also check out Lingua::EN::Readability.

It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level.
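To show where a syllable count feeds into such a measure, here is a minimal sketch of the standard Flesch-Kincaid grade-level formula, 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. It is not the library's implementation: `rough_syllables` and `flesch_kincaid_grade` are hypothetical names, and both the syllable count and the sentence splitting are crude assumptions.

```ruby
# Crude syllable estimate (an assumption): one syllable per run of vowels,
# with a minimum of one per word.
def rough_syllables(word)
  count = word.downcase.scan(/[aeiouy]+/).size
  count.zero? ? 1 : count
end

# Flesch-Kincaid grade level from raw text.
# Sentences are split naively on '.', '!' and '?'.
def flesch_kincaid_grade(text)
  sentences = text.split(/[.!?]+/).map(&:strip).reject(&:empty?)
  words = text.scan(/[A-Za-z']+/)
  return nil if sentences.empty? || words.empty?

  syllables = words.sum { |w| rough_syllables(w) }
  0.39 * (words.size.to_f / sentences.size) +
    11.8 * (syllables.to_f / words.size) - 15.59
end

puts flesch_kincaid_grade("The cat sat. The dog ran.")
```

Any error in the syllable counter goes straight into the second term, which is why a better heuristic matters for these scores.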

PS. I think I know where you got the function from. DS.

Jonas Elfström 2009-08-13 13:43:46

Yeah, I got the function from there; it seems to be the only method that gets close.

Eef 2009-08-13 14:19:01

Answer 2

+2 A:

The function I gave you before is based upon these simple rules outlined here:

Each vowel (a, e, i, o, u, y) in a word counts as one syllable subject to the following sub-rules:

Ignore final -ES, -ED, -E (except for -LE)

Words of three letters or less count as one syllable

Consecutive vowels count as one syllable.

Here's the code:

def new_count(word)
  word.downcase!
  return 1 if word.length <= 3
  word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
  word.sub!(/^y/, '')
  word.scan(/[aeiouy]{1,2}/).size
end

Obviously, this isn't perfect either, but all you'll ever get with something like this is a heuristic.
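Note that new_count modifies the string it is given (downcase! and sub! work in place), which is why callers need to pass a copy. A non-destructive variant might look like the sketch below; `count_syllables_copy` is a hypothetical name, and the rules are otherwise the same as above.

```ruby
# Same heuristic as new_count, but it works on copies so the caller's
# string is left untouched.
def count_syllables_copy(word)
  w = word.downcase
  return 1 if w.length <= 3

  # Drop final -es, -ed and -e unless preceded by a vowel or 'l'
  # (-ed is dropped unconditionally, as in the original regex).
  w = w.sub(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
  # Ignore a leading 'y', which acts as a consonant.
  w = w.sub(/^y/, '')
  # Each run of one or two vowels counts as one syllable.
  w.scan(/[aeiouy]{1,2}/).size
end

%w[candles makes used themselves Yellow].each do |word|
  puts "#{word}: #{count_syllables_copy(word)}"
end
```

With this version the caller no longer needs word.dup.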

EDIT:

I changed the code slightly to handle a leading 'y' and fixed the regex to handle 'les' endings better (such as in "candles").

Here's a comparison using the text in the question:

# used to get rid of any punctuation
# (gsub rather than gsub!, because gsub! returns nil when nothing was replaced)
text = text.gsub(/\W+/, ' ')

words = text.split(' ')

words.each do |word|
  old = count_syllables(word.dup)
  new = new_count(word.dup)
  puts "#{word}: \t#{old}\t#{new}" if old != new
end

The output is:

logorrhoea:     3 4
used:   2 1
makes:  2 1
themselves:     3 2

So it appears to be an improvement.

Pesto 2009-08-13 13:47:01

The problem with simply assuming all three-letter words are monosyllabic is that it catches words like "aid" but misses words like "ion". The algorithm can be improved by teaching it the diphthongs, i.e., which two-vowel clusters are pronounced as one.

zbrimhall 2009-08-13 17:56:06

For what the OP is doing, though, it's probably unnecessary. How many three-letter, three-syllable words are likely to be encountered? Anything is going to be a heuristic; the goal is to find an algorithm that is close enough without being too intensive in coding or running time.

Pesto 2009-08-13 18:01:45

Answer 3

+1 A:

One thing you ought to do is teach your algorithm about diphthongs. If I'm reading your code correctly, it would incorrectly flag "aid" as having two syllables.
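A minimal sketch of that idea, assuming a small and deliberately incomplete list of vowel pairs that are treated as one sound. `DIPHTHONGS` and `count_with_diphthongs` are hypothetical names, and the list would need tuning against real text.

```ruby
# Hypothetical, incomplete list of vowel pairs pronounced as a single sound.
DIPHTHONGS = %w[ai au ay ea ee ei ey oa oi oo ou oy].freeze

def count_with_diphthongs(word)
  word.downcase.scan(/[aeiouy]+/).sum do |cluster|
    # A lone vowel or a known pair is one syllable; any other cluster
    # is counted as one syllable per vowel.
    if cluster.length == 1 || DIPHTHONGS.include?(cluster)
      1
    else
      cluster.length
    end
  end
end

%w[aid ion boat].each do |word|
  puts "#{word}: #{count_with_diphthongs(word)}"
end
```

This only illustrates the diphthong check. It does not handle silent endings such as a final "e", so it would have to be combined with the suffix rules.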

You can also add "es" and the like to your special-case endings (you already have "ing") and just not count them as syllables, but that might still result in some miscounts.

Finally, for best accuracy, you should convert your input to a spelling scheme or alphabet that has a definite relationship to the word's pronunciation. With your "themselves" example, the algorithm has no reliable way to know that the "e" in "ves" is dropped. However, if you respelled it as "themselvz", or taught the algorithm the IPA and fed it [ðəmsɛlvz], it becomes very clear that the word is only pronounced with two syllables. That, of course, assumes you have control over the input, and is probably more work than just counting the syllables yourself.
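As a rough illustration of the respelling approach, the sketch below looks words up in a tiny hand-made table and counts vowel runs in the respelled form. `RESPELLINGS` and `count_from_respelling` are hypothetical, and apart from "themselvz" the respellings are invented for the example; a real table would have to come from a pronunciation dictionary.

```ruby
# Hypothetical respelling table: each entry is spelled the way it sounds.
RESPELLINGS = {
  'themselves' => 'themselvz',
  'used'       => 'yoozd'
}.freeze

# Returns the syllable count for a known word, or nil if the word
# has no respelling in the table.
def count_from_respelling(word)
  respelled = RESPELLINGS[word.downcase]
  return nil unless respelled

  # In a phonetic respelling every run of vowels is one syllable.
  respelled.scan(/[aeiouy]+/).size
end

%w[themselves used candles].each do |word|
  puts "#{word}: #{count_from_respelling(word).inspect}"
end
```

Words missing from the table return nil, so a caller could fall back to the spelling-based heuristic for them.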

zbrimhall 2009-08-13 17:43:58
