I need to find a fairly efficient way to detect syllables in a word. E.g.,

invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

where V is a vowel and C is a consonant. For example:

pronunciation (5 syllables: pro-nun-ci-a-tion; CCV-CVC-CV-V-CVC)
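To make the V/C notation concrete, here is a minimal sketch that maps a word to its letter-level pattern. It assumes the English vowel letters a, e, i, o, u and treats everything else (including y) as a consonant; the function name `cv_pattern` and the `vowels` parameter are only illustrative. It works on letters, not sounds, so its output can differ from a pattern written by ear.

```python
ENGLISH_VOWELS = frozenset("aeiou")  # assumption: 'y' is treated as a consonant


def cv_pattern(word, vowels=ENGLISH_VOWELS):
    """Return the word's letters as a string of 'V' (vowel) and 'C' (consonant)."""
    return "".join(
        "V" if ch in vowels else "C"
        for ch in word.lower()
        if ch.isalpha()  # skip hyphens, apostrophes, digits
    )


print(cv_pattern("invisible"))      # VCCVCVCCV
print(cv_pattern("pronunciation"))  # CCVCVCCVVCVVC
```

For another language the vowel set would have to be replaced with that language's vowel letters.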

I've tried a few methods: regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite state automaton (which did not produce anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and text-to-speech synthesis.

I would appreciate tips on alternative ways to solve this problem, besides the approaches I have already tried.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.

+13  A: 

Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it is backed by a small exceptions dictionary for the cases where the algorithm does not work.
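For a rough idea of how pattern-based hyphenation in this style works, here is a minimal sketch. Each pattern is a letter sequence with digits between letters; every pattern that matches a substring of the word contributes its digits, the highest digit at each position wins, and an odd value allows a break. The five patterns below are made up for illustration and are not taken from any real pattern file. The sketch also leaves out the exceptions dictionary and the minimum prefix/suffix lengths that TeX enforces.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'i2s' into letters 'is' and weights [0, 2, 0]."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)  # digit sits in the gap before letters[pos]
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns):
    """Return the pieces of `word` after applying Liang-style patterns."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."      # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[k]: gap before padded[k]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, w in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], w)
    pieces, current = [], ""
    for i, ch in enumerate(word):
        # word[i] is padded[i + 1]; an odd score before it permits a break
        if i > 0 and scores[i + 1] % 2 == 1:
            pieces.append(current)
            current = ""
        current += ch
    pieces.append(current)
    return pieces


# Hypothetical toy patterns, chosen only to demonstrate the mechanism.
TOY_PATTERNS = ["n1v", "s1i", "i1b", "1s", "i2s"]
print("-".join(hyphenate("invisible", TOY_PATTERNS)))  # in-vis-i-ble
```

Note how "1s" alone would allow a break before the "s", but the higher even value from "i2s" suppresses it. That competition between patterns is what lets a compact pattern set encode both rules and their exceptions.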

Jason
I like that you've cited a thesis on the subject; it's a little hint to the original poster that this might not be an easy question.
Karl
Yes, I am aware that this is not a simple question, although I haven't worked much on it. I did underestimate the problem, though: I thought I would work on other parts of my app and return to this 'simple' problem later. Silly me :)
I read the dissertation and found it very helpful. The problem with the approach was that I did not have any patterns for the Albanian language, although I found some tools that could generate those patterns. Anyway, for my purpose I wrote a rule-based app, which solved the problem...
... My approach is a bit slow (~20 sec on a 50K-word file), but I think the results are reasonably accurate (I don't have any useful stats yet).
Glad to hear it was helpful. Keep us updated.
Jason
+1  A: 

What is this "given language"? I think what constitutes a syllable depends on the language you are looking at. Some languages might not even have this concept! (Look at Chinese: even "words" are not well defined there.)

For some languages it might be easier to find the syllables. In Japanese, for example, you have only a fixed list of syllables, and extracting them from a text is unambiguous.

unbeknown
My spell-check application is for the Albanian language. However, tips for English would be fine; that is why I included the syllabification rules in my post.
+1  A: 

Perl has Lingua::Phonology::Syllable. You might try that, or try looking into its algorithm. I saw a few other older modules there, too.

I don't understand why a regular expression gives you only a count of syllables. You should be able to get the syllables themselves using capture parentheses. Assuming you can construct a regular expression that works, that is.
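As a rough illustration of getting the pieces rather than just a count, here is a deliberately naive sketch. It assumes English vowel letters (a, e, i, o, u) and attaches every run of consonants to the vowel group that follows it, with trailing consonants attached to the last group. The name `split_chunks` is made up, and the output is only a crude approximation of real syllables.

```python
import re

# Leading consonants, then one or more vowels, then (only at the end of
# the word) any trailing consonants. Assumes lowercase English letters.
CHUNK = re.compile(r"[^aeiou]*[aeiou]+(?:[^aeiou]*$)?")


def split_chunks(word):
    """Return the syllable-like chunks matched by the pattern."""
    return CHUNK.findall(word.lower())


print(split_chunks("invisible"))  # ['i', 'nvi', 'si', 'ble']
print(split_chunks("strength"))   # ['strength']
```

The number of chunks is right for "invisible", but the boundaries are not, and a word with no matching vowel letter returns an empty list. That gap between counting and correct splitting is where the real work is.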

skiphoppy
A: 

Java has quite a few NLP libraries, which I think will do the job. Some of them are:

LingPipe

Stanford Java NLP

OpenNLP

systemsfault
-1: I don't believe any of those tools contain pronunciation data.
Chris S
+5  A: 

I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here: http://code.google.com/p/hyphenator/

That is, unless you're the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)

Sean
+3  A: 

Here are a few related links I came across while working on a similar problem:

  • A simple Python program which the author estimates to be about 85% reliable (and Perl version).

  • Some discussion of Haiku, with code.

  • A Python forum discussion listing some phonetic rules.

  • PyHyphen (the hyphenation library of OpenOffice and Firefox, wrapped for Python), which includes a 'syllables' method.

(I have not yet studied or experimented with these in any detail.)

Kilo
Also, could someone add the tags: 'nlp' and 'natural-language' to this question?
Kilo
+1  A: 

Why calculate it? Every online dictionary has this info. For example, http://dictionary.reference.com/browse/invisible gives in·vis·i·ble.

Chris S
Maybe it has to work for words that don't appear in dictionaries, such as names?
Wouter Lievens