Possible Duplicate:
PHP - How to split a paragraph into sentences.

I have a block of text that I would like to split into sentences. I thought of looking for '.', '!', and '?' characters, but I realized this runs into problems, such as when people use acronyms or end a sentence with something like "!?". What would be the best way to handle this? I figured there would be some regex that could do it, but I'm open to a non-regex solution if that fits the problem better.

+1  A: 

Unfortunately there is no perfect solution for this, for the very reasons you stated. If it is content that you can control, or where you can force a specified delimiter after every sentence, that would be ideal. Beyond that, all you can really do is look for (\.|!|\?)+ (note that the ? has to be escaped) and maybe even throw in a \s after that, since most people put 1 or 2 spaces between the end of one sentence and the start of the next.
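As a rough illustration of that idea, here is a minimal sketch in Python rather than PHP (the function name `split_naive` is made up for this example). It splits on whitespace that follows a run of '.', '!' or '?', and it deliberately shows the abbreviation problem instead of solving it.

```python
import re

# Whitespace that directly follows ., ! or ? marks a sentence boundary.
# The lookbehind keeps the punctuation attached to the sentence it ends.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")

def split_naive(text):
    """Naive splitter: breaks wherever ., ! or ? is followed by whitespace."""
    return [part for part in SENTENCE_BREAK.split(text.strip()) if part]

print(split_naive("Is it done!? Yes. It works."))
# ['Is it done!?', 'Yes.', 'It works.']

print(split_naive("Prof. Knuth wrote it."))
# ['Prof.', 'Knuth wrote it.']  <- the abbreviation is wrongly treated as a sentence
```

The same pattern should carry over to PHP's preg_split, but that is untested here.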

Crayon Violent
A: 

I think the biggest problem is the possible existence of acronyms! For example, you must write something like Prof.&nbsp;Knuth in a JavaDoc summary sentence so that the javadoc generator doesn't think the first sentence ends after "Prof.". This is a problem I don't know how anyone can reliably handle. The only approximate solution I can imagine is the use of an abbreviation dictionary.
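A minimal sketch of the abbreviation-dictionary idea, in Python. The `ABBREVIATIONS` set and the function name `split_with_abbreviations` are invented for illustration, and a real list would need to be far longer. A word that ends in '.', '!' or '?' closes the sentence unless it appears in the dictionary.

```python
import re

# Hypothetical, deliberately tiny dictionary; not a complete list.
ABBREVIATIONS = {"prof.", "dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

ENDS_WITH_TERMINATOR = re.compile(r"[.!?]+$")

def split_with_abbreviations(text):
    """Split on words ending in ., ! or ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if ENDS_WITH_TERMINATOR.search(word) and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_with_abbreviations("Prof. Knuth wrote it. Did you read it?"))
# ['Prof. Knuth wrote it.', 'Did you read it?']
```

This is still only approximate: a sentence that genuinely ends in "etc." will be glued to the next one, and any abbreviation missing from the dictionary still causes a false split.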

splash
There are no acronyms (words formed out of the initials of other words, e.g. ASAP) in your example, only an abbreviation (a word represented by a leading subset of the usual letters).
dmckee
+2  A: 

Regex isn't the best solution for this problem. You'd be better served by creating a parsing library, something where you can easily create logic blocks to distinguish one thing from another. You'll need to come up with a set of rules for breaking up the text into the chunks you'd like to see.

"Are you sure?" he asked.

Doesn't that mess things up when using regex? However, with a parser you could actually see

<start quote><capitalization>are you sure<question><end quote>he asked<period>

and simple rules could then say "that's one sentence."
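A very small sketch of that tokenize-then-apply-rules approach, in Python. The token kinds and the names `tokenize` and `split_by_tokens` are made up for this example, and it assumes straight double quotes only. The single rule is that terminal punctuation closes a sentence only when it appears outside a quotation.

```python
import re

# Three token kinds: a double quote, a run of terminal punctuation, or a word.
TOKEN = re.compile(r'(?P<quote>")|(?P<end>[.!?]+)|(?P<word>[^\s".!?]+)')

def tokenize(text):
    """Return (kind, value) pairs, where kind is 'quote', 'end' or 'word'."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

def split_by_tokens(text):
    """Group tokens into sentences; punctuation inside quotes does not end one."""
    sentences, current, in_quote = [], [], False
    for kind, value in tokenize(text):
        current.append(value)
        if kind == "quote":
            in_quote = not in_quote  # toggles on each opening or closing quote
        elif kind == "end" and not in_quote:
            sentences.append(current)
            current = []
    if current:  # leftover tokens with no closing punctuation
        sentences.append(current)
    return sentences

for sentence in split_by_tokens('"Are you sure?" he asked. He left.'):
    print(sentence)
# ['"', 'Are', 'you', 'sure', '?', '"', 'he', 'asked', '.']
# ['He', 'left', '.']
```

One rule is obviously not enough for real text (abbreviations and multi-sentence quotations both break it), but further rules can be added as separate checks on the token stream rather than folded into one ever-growing regex.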

wheaties
Or, annoyingly, you could get things like `"Are you sure"? he asked.` which are semantically correct but look oh so wrong. Nouns that contain punctuation are a problem too: `Which? recommend buying....`
Callum Rogers
Actually the ? should be inside the quotes.
Crayon Violent