tags:
views: 181
answers: 1

Hello, I am working on a natural language processing project that aims to build libraries for the Arabic language. We are working on a POS tagger, and now I am thinking about the grammar phase. Since Arabic, like many other languages, has a complicated grammar, it is very hard to build a context-free grammar (CFG) for it. For this reason I had an idea for an algorithm that builds a CFG (with probabilities, i.e. a PCFG) for any language from a tagged corpus using unsupervised learning. To explain the algorithm, suppose I have these three tagged statements as input:

1. Verb Noun
2. Verb Noun Subject
3. Verb Noun Subject Adverb

The algorithm gives:

1. A --> Verb Noun
2. B --> A Subject
3. C --> B Adverb
We repeat this procedure for each statement, so we end up with a specific PCFG. The main power of the algorithm lies in the fact that it sees the whole statement, so the probabilities are conditional and specific. After that, the CKY algorithm can be applied to choose the best tree for new statements using those probabilities. Do you expect this algorithm to work well, and is it worth continuing to improve it?
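To make the idea concrete, here is a minimal sketch of the left-to-right rule induction described above: the first two symbols of each statement are folded into a new nonterminal, then one tag is merged in at a time. The function name `induce_rules` and the `X1`, `X2`, ... nonterminal names are my own illustration, not part of the question; rule frequencies are counted so they could later be normalized into PCFG probabilities.

```python
from collections import Counter
from itertools import count

def induce_rules(tagged_sentences):
    """Left-branching rule induction: merge the first two symbols of a
    sentence into a new nonterminal, then fold in one tag at a time.
    A nonterminal is reused when the same right-hand side has been seen
    before; rule frequencies are counted so they can be normalized into
    PCFG probabilities."""
    nonterminal_for = {}                 # RHS tuple -> nonterminal name
    fresh = (f"X{i}" for i in count(1))  # generator of new nonterminal names
    rule_counts = Counter()

    for tags in tagged_sentences:
        node = tags[0]
        for tag in tags[1:]:
            rhs = (node, tag)
            if rhs not in nonterminal_for:
                nonterminal_for[rhs] = next(fresh)
            node = nonterminal_for[rhs]
            rule_counts[(node, rhs)] += 1
    return rule_counts

# The three example statements from the question:
sents = [["Verb", "Noun"],
         ["Verb", "Noun", "Subject"],
         ["Verb", "Noun", "Subject", "Adverb"]]
for (lhs, rhs), n in sorted(induce_rules(sents).items()):
    print(f"{lhs} -> {' '.join(rhs)}  (count {n})")
```

On the three example inputs this produces the three rules from the question (`X1 -> Verb Noun`, `X2 -> X1 Subject`, `X3 -> X2 Adverb`), with `X1 -> Verb Noun` counted three times, which is where the probability estimates would come from.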

+1  A: 

Hello Hani. I did something similar for my M.Sc. thesis - learning CFG rules (without probabilities) using a partial grammar and POS tagging. Please see my answer to this question for a list of references on learning PCFGs. One approach is learning lexicalized grammars, which include word information along with the node name.

It is hard to answer your question without more context: what would you consider a good algorithm? One that gives a good enough language model? One that minimizes some statistical measure? One that is efficient enough?

Given Arabic's rich morphology, maybe you can add morphological information to your grammar - e.g. gender and number agreement features.
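One way to sketch such agreement features, assuming a simple dict-based representation rather than any particular parsing library: attach gender/number features to each category and only let a rule apply when the features unify (no clashing values). The `unify` helper and the feature names below are illustrative assumptions, not part of the answer above.

```python
def unify(a, b):
    """Merge two feature dicts; return None if any feature clashes."""
    merged = dict(a)
    for key, val in b.items():
        if key in merged and merged[key] != val:
            return None  # agreement failure, e.g. gender mismatch
        merged[key] = val
    return merged

# Hypothetical Arabic-style agreement: a noun and its adjective must
# agree in gender and number.
noun     = {"gender": "f", "number": "sg"}
good_adj = {"gender": "f", "number": "sg"}
bad_adj  = {"gender": "m", "number": "sg"}

print(unify(noun, good_adj))  # features agree: merged dict
print(unify(noun, bad_adj))   # None: gender clash, rule blocked
```

A rule like `NP -> Noun Adj` would then fire only when `unify` succeeds, which prunes trees that violate agreement before probabilities are even consulted.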

Yuval F