Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

Here's my experience with BreakIterator:

Using the example here: I have the following Japanese:

今日はパソコンを買った。高性能のマックは早い!とても快適です。

In ASCII (converted using native2ascii), it looks like this:

\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002

Here's the part of that sample that I changed:

static void sentenceExamples() {
  // Requires java.text.BreakIterator and java.util.Locale
  Locale currentLocale = new Locale("ja", "JP");
  BreakIterator sentenceIterator =
     BreakIterator.getSentenceInstance(currentLocale);
  String someText = "今日はパソコンを買った。高性能のマックは早い!とても快適です。";
  // ... the rest of the method is unchanged from the sample
}

When I look at the boundary indices, I see this:

0|13|24|32

But those indices don't correspond to any sentence terminators.

+3  A: 

You want to look into the internationalized BreakIterator classes. They are a good starting point for finding sentence boundaries.

GaryF
This doesn't appear to work with my Japanese test. I tried it with both Japanese characters and ASCII escapes (converted using native2ascii).
Mike Sickler
@Mike - what sentence terminator are you testing? Cursory testing with \u3002 worked for me.
McDowell
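For anyone retracing the Japanese test discussed in this thread, here is a minimal sketch of how the boundary offsets can be listed and turned into substrings. The class and method names (SentenceBoundaries, boundaries, sentences) are made up for illustration; only BreakIterator and Locale come from the JDK. The text is the escaped string from the question, including its leading \ufeff.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the sample referenced in the question.
final class SentenceBoundaries {

    /** Returns every boundary offset the sentence iterator reports, starting with 0. */
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            offsets.add(b);
        }
        return offsets;
    }

    /** Cuts the text between each pair of consecutive boundary offsets. */
    static List<String> sentences(String text, Locale locale) {
        List<Integer> offsets = boundaries(text, locale);
        List<String> result = new ArrayList<>();
        for (int i = 1; i < offsets.size(); i++) {
            result.add(text.substring(offsets.get(i - 1), offsets.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        // The question's text as ASCII escapes; the first character is \ufeff.
        String text = "\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002"
                + "\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01"
                + "\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002";
        System.out.println(boundaries(text, Locale.JAPAN));
        for (String sentence : sentences(text, Locale.JAPAN)) {
            System.out.println(sentence);
        }
    }
}
```

Note that a boundary is an offset between characters, not the index of a terminator. Counting the escaped string by hand, \ufeff occupies index 0, the first \u3002 sits at index 12, \uff01 at index 23 and the final \u3002 at index 31, so the reported 0|13|24|32 are the positions immediately after each terminator. Without the leading \ufeff every offset would presumably shift down by one.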
+1  A: 

You wrote:

I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

A basic problem here is that sentence terminators depend on context. Consider:

How did Dr. Jones compute 5! without recursion?

This should be recognized as a single sentence, but if you just split on possible sentence terminators you will get three sentences.
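To make that concrete, here is a small sketch of the naive approach: split after every '.', '!' or '?' that is followed by whitespace. NaiveSplitter is a made-up name for illustration, and the regular expression is only one assumed way of "splitting on terminators".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, deliberately naive splitter used only to show the failure mode.
final class NaiveSplitter {

    /** Splits wherever '.', '!' or '?' is followed by whitespace, ignoring context. */
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        for (String piece : split("How did Dr. Jones compute 5! without recursion?")) {
            System.out.println(piece);
        }
        // Prints three lines:
        // How did Dr.
        // Jones compute 5!
        // without recursion?
    }
}
```

The abbreviation "Dr." and the factorial "5!" each trigger a break, which is why a terminator list alone is not enough.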

So this is a more complex problem than it first appears. It can be approached with machine learning techniques. You could, for instance, look into the OpenNLP project, in particular the SentenceDetectorME class.

Fabian Steeg
Thanks a lot. Now I understand why there isn't a library that does what I need. Cheers!
Mike Sickler
The library that I linked to in my answer handles that example.
Bill the Lizard
Bill, yes I tried it too, cool. Won't work for Japanese, though.
Fabian Steeg