Hi folks,

I am working on a feature that applies (grammatical) segmentation rules to Latin-script languages (currently only English).

I am currently at the stage of splitting user input into sentences.

e.g.:

"I am working in language translation". "I have used Google MT API for this"

In the example above, I split the text at the full stop (.). That is the normal case, but there are a number of other characters that can end a sentence ( . ! ? etc. ).
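To make the current approach concrete, here is a minimal sketch of that kind of splitting in Python. The function name `naive_split` is made up for illustration, and the rule (split on whitespace that follows `.`, `!` or `?`) is only an assumption about what "breaking on the dot" means here; it is not one of the SRX rules.

```python
import re

# Whitespace that comes right after a sentence-ending character.
# Deliberately naive: it knows nothing about abbreviations or numbers.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text):
    """Split text into sentences at whitespace following '.', '!' or '?'."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


if __name__ == "__main__":
    sample = "I am working in language translation. I have used Google MT API for this."
    for sentence in naive_split(sample):
        print(sentence)

    # Abbreviations show why extra rules are needed: "Dr." is cut off
    # as if it were a sentence of its own.
    print(naive_split("Dr. Smith arrived. He sat down."))
```

This handles the simple example, but it wrongly splits after abbreviations such as "Dr.", which is exactly where exception rules (or a trained model) come in.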

I have the following SRX rules for segmentation.

My questions are:

1) Is there a reference I can use to work out my language segmentation rules?

2) Or is there a forum on language segmentation where I can discuss this properly?

Please let me know if anybody knows of one.

Thanks a lot.

A: 

There seems to be a good amount of literature about this in linguistics journals...

This is a nice report about the problem; I hope it helps: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1068&context=ircs_reports

nico

+1  A: 

You probably want to take a look at Reynar and Ratnaparkhi's paper A Maximum Entropy Approach to Identifying Sentence Boundaries (1997).

Abstract

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains.

Their resulting sentence segmenter is known as MxTerminator and is available here.
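To give a feel for what "classify each occurrence" means, here is a rough Python sketch of just the first step: listing every `.`, `!` and `?` together with the text around it. This is not MxTerminator's code, the name `boundary_candidates` is hypothetical, and the two context tokens are only illustrative features; a trained classifier would still have to decide which candidates are real boundaries.

```python
import re


def boundary_candidates(text):
    """List (offset, char, token_before, token_after) for each '.', '!' or '?'.

    token_before / token_after are the nearest whitespace-delimited
    fragments on each side of the punctuation mark ("" if there is none).
    """
    candidates = []
    for match in re.finditer(r"[.!?]", text):
        before = text[:match.start()].split()
        after = text[match.end():].split()
        prev_tok = before[-1] if before else ""
        next_tok = after[0] if after else ""
        candidates.append((match.start(), match.group(), prev_tok, next_tok))
    return candidates


if __name__ == "__main__":
    for candidate in boundary_candidates("Dr. Smith left."):
        print(candidate)
```

A classifier trained on annotated text could then learn, for example, that a period preceded by "Dr" is rarely a sentence boundary.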

dmcer