Hey,

I am using Ruby to calculate the Gunning Fog Index of some content that I have, and I have successfully implemented the algorithm described here:

Gunning Fog Index
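For context, the index is usually given as 0.4 × (average words per sentence + percentage of "complex" words), where a complex word has three or more syllables. Here is a minimal sketch of that formula. The names `fog_index` and `naive_counter` are made up for illustration, the sentence and word splitting is deliberately crude, and the usual exclusions (proper nouns, common suffixes) are ignored:

```ruby
# Simplified Gunning Fog Index:
#   0.4 * (words per sentence + 100 * complex words / words)
# `syllable_counter` is any callable that returns a syllable count for a word
# (a hypothetical hook, so different counting heuristics can be swapped in).
def fog_index(text, syllable_counter)
  sentences = text.split(/[.!?]+/).map(&:strip).reject(&:empty?)
  words = text.scan(/[A-Za-z']+/)
  return 0.0 if sentences.empty? || words.empty?

  # A "complex" word is one with three or more syllables.
  complex = words.count { |w| syllable_counter.call(w) >= 3 }
  0.4 * (words.size.to_f / sentences.size + 100.0 * complex / words.size)
end

# Crude stand-in counter: one syllable per run of vowels, at least one per word.
naive_counter = ->(word) { [word.downcase.scan(/[aeiouy]+/).size, 1].max }

puts fog_index("The cat sat. The dog ran.", naive_counter)
```

Since the complex-word count feeds straight into the score, any over-counting of syllables inflates the index.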

I am using the method below to count the number of syllables in each word:

Tokenizer = /([aeiouy]{1,3})/

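def count_syllables(word)
  len = 0

  # Count a trailing "ing" as one syllable and strip it off.
  if word[-3..-1] == 'ing'
    len += 1
    word = word[0...-3]
  end

  # Each run of one to three vowels counts as one syllable.
  got = word.scan(Tokenizer)
  len += got.size

  # Drop a silent final "e", except in "-le" endings.
  if got.size > 1 && got[-1] == ['e'] &&
     word[-1].chr == 'e' &&
     word[-2].chr != 'l'
    len -= 1
  end

  len
end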

It sometimes counts words with only 2 syllables as having 3. Can anyone give any advice, or is anyone aware of a better method?
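One way to see where an extra syllable can come from is to print the vowel groups that the scan returns. This is only a quick sketch using the word "themselves" as a sample. `VOWEL_GROUPS` is a stand-in name for the same pattern as Tokenizer:

```ruby
# Same pattern as Tokenizer, under a different name so this runs on its own.
VOWEL_GROUPS = /([aeiouy]{1,3})/

# Because the pattern has a capture group, scan returns an array of arrays.
groups = "themselves".scan(VOWEL_GROUPS)
p groups       # => [["e"], ["e"], ["e"]]
p groups.size  # => 3
```

The last group is the "e" in "ves". The silent-e check in the method only fires when the word itself ends in "e", so a plural ending in "es" keeps its extra syllable.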

EDIT:

text = "The word logorrhoea is often used pejoratively to describe prose that is highly abstract and contains little concrete language. Since abstract writing is hard to visualize, it often seems as though it makes no sense and all the words are excessive. Writers in academic fields that concern themselves mostly with the abstract, such as philosophy and especially postmodernism, often fail to include extensive concrete examples of their ideas, and so a superficial examination of their work might lead one to believe that it is all nonsense."

# used to get rid of any punctuation
# (gsub rather than gsub!, because gsub! returns nil when nothing is replaced)
text = text.gsub(/\W+/, ' ')

word_array = text.split(' ')

word_array.each do |word|
    puts word if count_syllables(word) > 2
end

"themselves" is being counted as 3 but it's only 2

Cheers

Eef

A: 

To begin with, it seems you should decrement len for the suffixes that should not be counted as a syllable.

len -= 1 if word =~ /(?:ing|es|ed)$/

You could also check out Lingua::EN::Readability.

It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level.

PS. I think I know where you got the function from. DS.

Jonas Elfström
Yeah, I got the function from there; it seems to be the only method that gets close.
Eef
+2  A: 

The function I gave you is based on the simple rules outlined here:

Each vowel (a, e, i, o, u, y) in a word counts as one syllable subject to the following sub-rules:

  • Ignore final -ES, -ED, -E (except for -LE)
  • Words of three letters or less count as one syllable
  • Consecutive vowels count as one syllable.

Here's the code:

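def new_count(word)
  # Work on a lowercased copy so the caller's string is not modified.
  word = word.downcase
  # Words of three letters or less count as one syllable.
  return 1 if word.length <= 3
  # Ignore final -es, -ed and -e (except after "l" or a vowel).
  word = word.sub(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
  # A leading "y" acts as a consonant.
  word = word.sub(/^y/, '')
  # Each run of one or two vowels counts as one syllable.
  word.scan(/[aeiouy]{1,2}/).size
end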

Obviously, this isn't perfect either, but all you'll ever get with something like this is a heuristic.

EDIT:

I changed the code slightly to handle a leading 'y' and fixed the regex to handle 'les' endings better (such as in "candles").

Here's a comparison using the text in the question:

# used to get rid of any punctuation
# (gsub rather than gsub!, because gsub! returns nil when nothing is replaced)
text = text.gsub(/\W+/, ' ')

words = text.split(' ')

words.each do |word|
  old = count_syllables(word.dup)
  new = new_count(word.dup)
  puts "#{word}: \t#{old}\t#{new}" if old != new
end

The output is:

logorrhoea:     3 4
used:   2 1
makes:  2 1
themselves:     3 2

So it appears to be an improvement.

Pesto
The problem with simply assuming all three-letter words are monosyllabic is that it gets words like "aid" right but miscounts words like "ion". The algorithm can be improved by teaching it the diphthongs, i.e., which two-vowel clusters are pronounced as one syllable.
zbrimhall
For what the OP is doing, though, it's probably unnecessary. How many three-letter, multi-syllable words are likely to be encountered? Anything is going to be a heuristic; the goal is to find an algorithm that is close enough without being too intensive in coding or running time.
Pesto
+1  A: 

One thing you ought to do is teach your algorithm about diphthongs. If I'm reading your code correctly, it would incorrectly flag "aid" as having two syllables.
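A rough sketch of what a diphthong-aware count could look like: vowel pairs on a list are counted once, and any other vowel is counted on its own. The `DIPHTHONGS` list and the name `count_with_diphthongs` are made up for illustration. The list is short and far from complete, and English spelling will break it in plenty of places:

```ruby
# Hypothetical, deliberately short list of vowel pairs treated as one sound.
DIPHTHONGS = %w[ai au ay ea ee ei eu ey ie oa oi oo ou oy ue ui].freeze

def count_with_diphthongs(word)
  count = 0
  # Walk through every run of consecutive vowels in the word.
  word.downcase.scan(/[aeiouy]+/) do |run|
    i = 0
    while i < run.length
      # Consume two letters when they form a listed pair, otherwise one.
      i += DIPHTHONGS.include?(run[i, 2]) ? 2 : 1
      count += 1
    end
  end
  count
end

puts count_with_diphthongs("aid")  # prints 1
puts count_with_diphthongs("ion")  # prints 2
```

Silent endings are not handled here, so this would need to be combined with the suffix rules.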

You can also add "es" and the like to your special-case endings (you already have "ing") and just not count it as a syllable, but that might still result in some miscounts.

Finally, for best accuracy, you should convert your input to a spelling scheme or alphabet that has a definite relationship to the word's pronunciation. With your "themselves" example, the algorithm has no reliable way to know that the "e" in "ves" is silent. However, if you respelled it as "themselvz", or taught the algorithm the IPA and fed it [ðəmsɛlvz], it becomes very clear that the word is only pronounced with two syllables. That, of course, assumes you have control over the input, and is probably more work than just counting the syllables yourself.
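To illustrate the idea, here is a minimal sketch that counts syllables in an IPA transcription by counting runs of vowel symbols. `IPA_VOWELS` and `count_ipa_syllables` are made-up names, and the vowel set is a small, incomplete assumption. It also treats any run of adjacent vowel symbols as a single syllable, which is wrong for vowels in hiatus:

```ruby
# Illustrative, incomplete set of IPA vowel symbols (plus the length mark).
IPA_VOWELS = 'aeiouyɑæɐəɛɜɪɔʊʌɒː'
IPA_VOWEL_RUN = /[#{Regexp.escape(IPA_VOWELS)}]+/

# Each run of vowel symbols is taken as one syllable nucleus.
def count_ipa_syllables(transcription)
  transcription.scan(IPA_VOWEL_RUN).size
end

puts count_ipa_syllables('ðəmsɛlvz')  # prints 2
```

The hard part is getting the transcription in the first place, which this sketch does not attempt.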

zbrimhall