I'm after a regex (PHP/Perl compatible) to get the first sentence out of some text. I realize this could get huge if it covered every case, but I'm just after something that will be "good enough" for the moment. Has anyone got something off the shelf for this?

+4  A: 

well, /^[^.]+/ is the simplest one

stereofrog
Yeah, I had that (well, I had strpos($content,'.')), but it doesn't cope well with things like "e.g." or an email address, not to mention questions ending in "?".
Tim
good points, how about '/^.+?[.?!]+(?=\s|$)/'
stereofrog
I ended up going with /^.{150,}?[.?!]+(?=\s|$)/ as I wanted to ensure really short sentences were missed. Thanks also to Chris Lutz who did a good job of explaining the complexities of the problem.
Tim
P.S. And if there is no match, I'm doing a preg_replace('/\s+?(\S+)?$/', '', substr($content, 0, $max_length = 200)).'...'; to break on the last space within the first 200 chars.
Tim
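For readers who want to try the approach Tim settled on, here is a rough Python sketch of it. The function name `first_sentence` and its parameters `min_len` and `max_len` are made up for illustration; the pattern and the fallback are ported from the comments above, and as in the original (no /s modifier) the dot does not match newlines.

```python
import re

def first_sentence(text, min_len=150, max_len=200):
    """Sketch of the regex-plus-fallback idea from the comments above."""
    # Lazily take at least min_len characters, then the first run of
    # sentence-ending punctuation followed by whitespace or end of text.
    pattern = re.compile(r"^.{" + str(min_len) + r",}?[.?!]+(?=\s|$)")
    match = pattern.match(text)
    if match:
        return match.group(0)
    # Fallback: cut at max_len, drop the trailing partial word,
    # and append an ellipsis.
    truncated = re.sub(r"\s+?(\S+)?$", "", text[:max_len])
    return truncated + "..."
```

Note that the minimum length only hides short false breaks such as "Mr." near the start of the text; an abbreviation appearing after the first 150 characters will still end the match early.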
+1  A: 

If by "sentence" you mean "line", then simply match the first ^.* from the chunk of text. By default, the dot does not match newline characters.

If it's really the first sentence, do something like this: ^[^.!?]*

Bart Kiers
First sentence, not first line...
gnud
I was editing... :)
Bart Kiers
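A quick way to see the difference between the two patterns in this answer is to run both on the same text. This is only a small Python illustration; the helper names `first_line` and `up_to_terminator` are invented here.

```python
import re

def first_line(text):
    # Without DOTALL, "." stops at the first newline.
    return re.match(r"^.*", text).group(0)

def up_to_terminator(text):
    # Stops before the first ".", "!" or "?", so the terminator is not included.
    return re.match(r"^[^.!?]*", text).group(0)
```

For "Mr. Smith left." the second helper returns just "Mr", which is the same abbreviation problem raised elsewhere in this thread.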
A: 

I know you just want anything that works for now, but this mailing list post came up with /^[^\.]*\.\s/, and the subsequent post came up with ([\s\S]+?)\.( |\r|\n).

These patterns only match sentences ending in a period, so it's up to you to modify them to also match other punctuation such as exclamation marks and question marks.

Jorge Israel Peña
What about sentences that end with a `!` or `?`?
Bart Kiers
That's what I said in my post, heh.
Jorge Israel Peña
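As a rough illustration of the modification suggested in this answer, the Python sketch below puts the first period-only pattern next to a variant that also accepts "!" and "?". The variant and the helper `match_or_none` are assumptions made for this example, not something taken from the mailing list post.

```python
import re

# Period-only pattern from the answer (the backslash inside [] is not needed).
PERIOD_ONLY = re.compile(r"^[^.]*\.\s")
# Assumed variant: any of ".", "!" or "?" may end the sentence.
ANY_TERMINATOR = re.compile(r"^[^.!?]*[.!?]\s")

def match_or_none(pattern, text):
    # Return the matched text (including the trailing whitespace) or None.
    m = pattern.match(text)
    return m.group(0) if m else None
```

Both patterns require whitespace after the terminator, so a text consisting of a single sentence with nothing after it does not match at all.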
A: 
/\A(.+?)[.?!] /s

This matches everything up to one of those punctuation marks followed by a space. That's what a sentence is, isn't it? With the /s modifier, the dot also matches newlines.

SilentGhost
So much for Mr. Regex and his sentence matching.
Chris Lutz
I beg your pardon?
SilentGhost
Test your regex against that - it will parse as two sentences because of Mr.
Chris Lutz
What about e.g. *e.g.*?
Gumbo
It will work for the majority of cases.
SilentGhost
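The objection in the comments is easy to reproduce. Below is a minimal Python port of the pattern from this answer, with `re.DOTALL` standing in for the /s modifier; the function name `first_sentence` is chosen only for this sketch.

```python
import re

# \A anchors at the start; DOTALL lets "." cross newlines like /s does.
PATTERN = re.compile(r"\A(.+?)[.?!] ", re.DOTALL)

def first_sentence(text):
    m = PATTERN.match(text)
    # Drop the single trailing space that the pattern consumes.
    return m.group(0)[:-1] if m else None
```

Run against "So much for Mr. Regex and his sentence matching." this returns "So much for Mr.", which is the failure Chris Lutz points out. A text with no space after its only terminator yields no match.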
A: 

This works in .NET:

/(?<=^\s*)(?!\s)("(\<'.*?'\>|.)*"|.)*?((?<='*"*)|[.?!]+|$)(?=\ \ |\n\n|$)/s

Handles quotation marks (American-style), including nested quotes "like this 'and this.' Yes, with punctuation.", and sentences ending with multiple punctuation marks. Also ignores leading whitespace. It requires two spaces, two end-of-lines, or an end-of-file after each sentence, though.

Handles the following well:

So much for Mr. Regex and his sentence matching, as he says "this sentence, isn't it wonderful? One says, 'It's almost as if this was crafted purely for example.'" This part shouldn't match, though.

strager
So much for Mr. Regex and his sentence matching.
Chris Lutz
@Lutz, So much for him.
strager
did you test it? I don't think PHP supports variable-length look-behind.
SilentGhost
@SilentGhost, Oh, I wasn't testing on PHP. *doh*
strager
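Since variable-length look-behind is not available in PHP (nor in Python's `re`), the two-space idea can be approximated without it. The sketch below is a heavily simplified take, not a port of the .NET pattern: it skips leading whitespace and stops at two spaces, a blank line, or the end of the text, and it makes no attempt to handle quotation marks. The names `BOUNDARY` and `first_sentence_two_space` are invented for this example.

```python
import re

# Skip leading whitespace, then lazily take text up to a boundary:
# two spaces, a blank line, or the end of the text.
BOUNDARY = re.compile(r"\s*(\S.*?)(?= {2}|\n\n|\Z)", re.DOTALL)

def first_sentence_two_space(text):
    m = BOUNDARY.match(text)
    return m.group(1) if m else None
```

Because the boundary is the double space rather than the punctuation, "Mr." followed by a single space does not end the match, which is the main attraction of this convention when the input actually follows it.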
+3  A: 

It isn't just a regex, but I wrote a Python function to do this: Separating sentences. Natural language processing is notoriously difficult, so there are cases this doesn't treat right, but it does handle some tricky cases well.

Ned Batchelder
This is definitely the right approach - defining English grammar rules, rather than trying to build a regex that can only be convoluted and inaccurate.
Peter Boughton
+2  A: 

What you need, in the end, is natural language parsing, which is extremely difficult to do, and probably impossible for regular expressions (even super-souped up PCRE ones) alone. Consider this sentence:

So much for Mr. Regex and his sentence matching.

Every answer given thus far will parse that as two sentences, and this isn't even that much of an edge case - it's quite reasonable to imagine a block of text beginning with "Dear Mr. Adams:" or something like that. You can tack on lookbehinds to check what the word before the punctuation mark was, but that's going to get unmaintainable, since you have to check for every possible abbreviation. You have to check for Mr. and e.g. and co. and St. and for so many other ones that you'll never think of. You might end up with a "pretty good" practical solution after a while, but it's going to be ugly, and one day it will fail.
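To make the abbreviation check described above concrete, here is an illustrative Python sketch. It does the check in code rather than with lookbehinds, and the `ABBREVIATIONS` set is hypothetical and deliberately incomplete, which is exactly the maintenance problem being described.

```python
import re

# Hypothetical, deliberately incomplete list of abbreviations (lowercase,
# without the final period).
ABBREVIATIONS = {"mr", "mrs", "dr", "st", "co", "e.g", "i.e"}

# Candidate boundary: terminating punctuation followed by whitespace or end.
CANDIDATE = re.compile(r"[.?!]+(?=\s|$)")

def first_sentence(text):
    for m in CANDIDATE.finditer(text):
        words = text[:m.start()].split()
        last_word = words[-1].lower() if words else ""
        # A lone period after a known abbreviation is not treated as an end.
        if m.group(0) == "." and last_word in ABBREVIATIONS:
            continue
        return text[:m.end()].strip()
    # No usable boundary found: return the whole text.
    return text.strip()
```

This gets "So much for Mr. Regex and his sentence matching." right, but it now fails in the other direction: in "I live on Main St. It is nice." the period after "St" really does end the sentence, and the sketch runs straight past it.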

Chris Lutz
My solution seems to work and isn't too ugly. It assumes two spaces after each sentence, though. It also doesn't handle quotes.
strager
The two spaces after each sentence is nice, if people follow it (I for one hate it, and never do that, so maybe I'm just biased). But your regex is the exact point at which I would stop and say "This isn't a job for regular expressions."
Chris Lutz
I have to agree regular expressions aren't the right tool for the job. But it's good enough for quick'n'dirty, and if this has to be done only once but a thousand times, regexp with human correction is IMO more efficient than a full language parser (unless there's a parser out there already which is used).
strager
We can write a "good-enough" regex for a subset of data, but then we'd need some sample data to look at.
Chris Lutz
Thanks for the final confirmation Chris - I spent some time last night with a sample paragraph, unknowingly proving to myself that this isn't possible, after leaping on it (there's always a mad rush to answer regex questions :-) before going to sleep (d'oh). At least my ReGex skills were put to the test in private...
Dave Everitt