I'm implementing a readability test and have written a simple algorithm for detecting syllables. I detect sequences of vowels and count them in each word; for example, the word "should" contains one sequence of vowels, 'ou'. Before counting, I remove suffixes like -les, -e, -ed (for example, the word "like" has one syllable but two sequences of vowels, so stripping the suffix first gives the right count).
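To make the approach concrete, here is a minimal sketch of what I mean, in Python. The names `SUFFIXES`, `strip_suffix` and `count_syllables` are just illustrative, and treating "y" as a vowel is an assumption of this sketch, not a fixed part of the method.

```python
import re

# Suffixes stripped before counting. Illustrative only, taken from the
# examples above; a real list would be longer.
SUFFIXES = ("les", "ed", "e")

# A "vowel sequence" is one or more consecutive vowels.
# Assumption: "y" is counted as a vowel here.
VOWEL_RUN = re.compile(r"[aeiouy]+")


def strip_suffix(word):
    """Remove the first matching suffix, if the word is longer than it."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word


def count_syllables(word):
    """Estimate syllables as the number of vowel sequences in the stem."""
    stem = strip_suffix(word.lower())
    return len(VOWEL_RUN.findall(stem))
```

With these assumptions, `count_syllables("should")` and `count_syllables("like")` both give 1, as expected, but `count_syllables("x-ray")` also gives 1 and `count_syllables("12345")` gives 0.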
But consider these words and sequences:
- x-ray (it contains two syllables)
- I'm (one syllable; maybe I could just remove all apostrophes from the text?)
- goin'
- I'd've
- n' (for example Pork n' Beans)
- 3rd (how should this be treated?)
- 12345
What should I do with special characters? Remove them all? That would be fine for most words, but not for "n'" and "x-ray". And how should I treat digits?
These are special cases, but I'd be very glad to hear about any experience or ideas on this subject.