Is there a simple trick to isolate the first sentence in a large string of text? (Perhaps using regular expressions.)

Searching for the first full stop "." doesn't work, as acronyms such as "U.S.A." will screw things up.

(There probably is no right answer.)

A: 

With a plain text string there's no guaranteed way to do it. If the string carries some kind of marker, for example a \n at the end of each line or sentence, you can use that to determine where the sentence ends; other than that, you just have to guess.

kyndigs
+1  A: 

Usually you would look for the first full stop that does not follow a capital letter. But this won't work with some abbreviations.

There is no magical solution… you could make a list of all abbreviations and ignore them when followed by a full stop.
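
A minimal Python sketch of that idea, assuming full stops are only considered at the end of a whitespace-delimited token. The `ABBREVIATIONS` set and the `first_sentence` name are hypothetical, chosen only for illustration:

```python
import re

# Hypothetical abbreviation list; extend it for your own text.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Mr.", "Dr."}

def first_sentence(text):
    """Return text up to the first token-final full stop that neither
    follows a capital letter nor ends a listed abbreviation."""
    for match in re.finditer(r"\S+", text):
        token = match.group()
        if not token.endswith("."):
            continue
        # Skip listed abbreviations such as "e.g.".
        if token in ABBREVIATIONS:
            continue
        # Skip a full stop that directly follows a capital letter ("U.S.A.").
        if len(token) >= 2 and token[-2].isupper():
            continue
        return text[:match.end()]
    # No acceptable full stop found: fall back to the whole text.
    return text

print(first_sentence("The U.S.A. is big, e.g. in area. Next one."))
```

It still guesses wrong whenever a sentence really does end in a capital letter or in a listed abbreviation.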

Benoit
+1  A: 

No. There is no simple trick. To do this properly, you need to do a syntactic analysis of the text. Nobody can do that. At least not yet. At least not 100% of the time. Mainly because it also entails a semantic analysis of the text. You see, contrary to what the type of linguists that taught you grammar in school think, what makes up a sentence is pretty hard to sum up in a set of rules a computer could follow without understanding the text.

Spend the next couple of years looking up computational linguistics. Maybe by then there will be a shortcut?

But you can get close.

I'd probably try to look for the first period, question mark or exclamation mark followed by whitespace.

/^(.*?)[.?!]\s/

(The (.*?) is a non-greedy match, to make sure you really do find only the first sentence.)
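
For illustration, here is the same pattern tried out in Python. This is only a sketch: the `first_sentence` helper is a made-up name, and falling back to the whole text when nothing matches is an assumption, not part of the regex itself.

```python
import re

# The pattern from above: lazily capture everything up to the first
# '.', '?' or '!' that is followed by whitespace.
SIMPLE = re.compile(r"^(.*?)[.?!]\s")

def first_sentence(text):
    """Return the captured first sentence (without its final punctuation),
    or the whole text if the pattern does not match."""
    m = SIMPLE.match(text)
    return m.group(1) if m else text

print(first_sentence("Hello world. Second sentence."))
# Stops too early: the full stop after "U.S.A" is followed by a space.
print(first_sentence("Visit the U.S.A. next year. Then go home."))
```

Note that the capture group excludes the terminating punctuation, and that a text consisting of a single sentence with no trailing whitespace does not match at all.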

Daren Thomas
This regex would fail if the sentence contained an acronym such as U.S.A. mid-sentence. :)
pauldoo
@pauldoo, you're right. I was only guarding (with the `\s`) against the first two punctuation characters :(
Daren Thomas
+2  A: 

Would you pay for this being done? If so, see Amazon's Mechanical Turk, which farms tasks out to real people at a rate of, let's say, $0.01 per update. At least it beats the hell out of doing two years of computational linguistics. ;-)

PurplePilot
A: 

As said before, there is no easy solution.

A more enhanced version of the regex could be: /^(.*?(?<!\b\w)[.?!])\s+[A-Z0-9]/. It does not stop at mid-sentence acronyms (but also not if they are at the end of a sentence...), and the next sentence has to start with an upper-case letter or a digit.

If you know a list of acronyms that you don't want your regex to stop at, you might add them like: /^(.*?(?<!\b\w|U\.S\.A|eg)[.?!])\s+[A-Z0-9]/.
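
A possible Python adaptation, as a sketch only. Python's `re` module rejects a single lookbehind whose alternatives differ in length, so each acronym gets its own lookbehind here. The `ABBREVIATIONS` list and the helper names are hypothetical, and the extra `\b` in front of each acronym is an addition of this sketch (so that a word like "leg" is not mistaken for "eg"):

```python
import re

# Hypothetical list of acronyms, written without their final full stop.
ABBREVIATIONS = ["U.S.A", "eg", "etc"]

def build_pattern(abbreviations):
    # One fixed-width negative lookbehind per acronym, since Python's re
    # does not accept alternatives of different lengths in one lookbehind.
    guards = "".join(rf"(?<!\b{re.escape(a)})" for a in abbreviations)
    return re.compile(r"^(.*?(?<!\b\w)" + guards + r"[.?!])\s+[A-Z0-9]")

PATTERN = build_pattern(ABBREVIATIONS)

def first_sentence(text, pattern=PATTERN):
    """Return the first sentence including its punctuation,
    or the whole text if the pattern does not match."""
    m = pattern.match(text)
    return m.group(1) if m else text

print(first_sentence("He likes cities, eg. Paris and Rome. Then he left."))
```

As noted above, an acronym at the very end of a sentence is still skipped, so "He moved to the U.S.A. Then he left." comes back unsplit.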

If you know what language you are going to use, there might be some Natural Language Processing (NLP) toolkit for it, but this would go beyond the scope of this question.

MaoPU