views:

248

answers:

3

Hi all,

I have written this piece of code that splits a string and stores it in a string array:-

String[] sSentence = sResult.split("[a-z]\.\s+");

However, I've added the [a-z] because I wanted to deal with some of the abbreviation problem. But then my result shows up as so:-

Furthermore when Everett tried to instruct them in basic mathematics they proved unresponsiv

I see that I loose the pattern specified in the split function. Its okay for me to loose the period, but loosing the last letter of the word disturbs its meaning. Could some one help me with this and in addition also could someone help me with dealing with abbreviations? Like because I split the string based on periods, I do not want to loose the abbreviations.

Thanks in advance

+2  A: 

It will be difficult to get a regular expression to work in all cases, but to fix your immediate problem you can use a lookbehind:

String sResult = "This is a test. This is a T.L.A. test.";
String[] sSentence = sResult.split("(?<=[a-z])\\.\\s+");

Result:

This is a test
This is a T.L.A. test.

Note that there are abbrevations that do not end with capital letters, such as abbrev., Mr., etc... And there are also sentences that don't end in periods!

Mark Byers
Thank you for your reply.
rookie
This will fail in 9.3% of sentences. And sentences that ... use ellipsis. And sentences with typo.s in them. And so on. Whatever you do, your code will make mistakes, viewed from the human perspective.
Stephen C
+2  A: 

If you can, use a natural language processing tool, such as LingPipe. There are many subtleties which will be very hard to catch using regular expressions, e.g., (e.g. :-)), Mr., abbreviations, ellipsis (...), et cetera.

There is a very easy to follow tutorial on Sentence Detection in the LingPipe website.

JG
Hi, I checked out the tutorial. It seemed perfect, however I can't seem to figure out how to use it with eclipse. Could you help me out please?
rookie
+4  A: 

Parsing sentences is far from being a trivial task, even for latin languages like English. A naive approach like the one you outline in your question will fail often enough that it will prove useless in practice.

A better approach is to use a BreakIterator configured with the right Locale.

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
  System.out.println(source.substring(start,end));
}

Yields the following result:

  1. This is a test.
  2. This is a T.L.A. test.
  3. Now with a Dr. in it.
Julien Silland