I'm after a regex (PHP/Perl compatible) to get the first sentence out of some text. I realize this could get huge if it has to cover every case, but I'm just after something that will be "good enough" for the moment. Has anyone got something off the shelf for this?
If by "sentence" you mean "line", then simply match ^.* against the chunk of text. By default the dot does not match newline characters, so the match stops at the end of the first line.
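To make the newline behaviour concrete, here is a small sketch in Python (my choice of language, since its `re` module is close enough to PCRE for this pattern; the sample text is made up):

```python
import re

text = "First line of text.\nSecond line."

# Without re.DOTALL, "." stops at the first newline,
# so ^.* captures only the first line.
first_line = re.match(r"^.*", text).group(0)

# With re.DOTALL (the /s modifier in PHP/Perl), "." also matches
# newlines, so the same pattern swallows the whole text.
everything = re.match(r"^.*", text, re.DOTALL).group(0)

print(repr(first_line))
print(repr(everything))
```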
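If you really want the first sentence, do something like this: ^[^.!?]* (note that this matches up to, but not including, the first period, exclamation mark, or question mark).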
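A rough Python sketch of that character-class approach, with one change that is my own assumption rather than part of the answer: an optional `[.!?]` is appended so the terminator is kept. The function name `first_sentence_simple` is made up for illustration.

```python
import re

def first_sentence_simple(text):
    """Return everything up to and including the first '.', '!' or '?'.

    If the text contains no terminator, the whole text is returned.
    """
    # [^.!?]* consumes non-terminator characters (newlines included),
    # and the optional [.!?] keeps the terminator if one is present.
    match = re.match(r"[^.!?]*[.!?]?", text)
    return match.group(0).strip()

print(first_sentence_simple("Hello world! How are you?"))
print(first_sentence_simple("3.14 is pi."))  # cut short at the decimal point
```

The second call shows the obvious weakness: any period ends the "sentence", including the one inside a number.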
I know you just want anything that works for now, but this mailing list post came up with /^[^\.]*\.\s/, and the subsequent post came up with ([\s\S]+?)\.( |\r|\n).
Though these patterns only seem to match periods, you can modify them to also match other types of punctuation, such as exclamation marks and question marks.
/\A(.+?)[.?!] /s
matches everything up to the first of those punctuation marks that is followed by a space. That's what a sentence is, isn't it? The /s modifier makes the dot match newlines too, so the sentence can span several lines.
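Here is a sketch of the same idea in Python, where `re.DOTALL` plays the role of `/s`. Two things are my own assumptions and not part of the pattern above: the terminator is moved inside the capture group so it is returned, and the trailing space is relaxed to "any whitespace or end of text" so a lone sentence still matches.

```python
import re

# Lazy match from the start of the text up to the first terminator
# that is followed by whitespace or by the end of the text.
FIRST_SENTENCE = re.compile(r"\A(.+?[.?!])(?:\s|\Z)", re.DOTALL)

def first_sentence(text):
    """Return the first sentence, or None if no terminator is found."""
    match = FIRST_SENTENCE.search(text)
    return match.group(1) if match else None

print(first_sentence("Is this it? Yes."))
print(first_sentence("Version 3.14 is out. Nice."))
print(first_sentence("One line\nbreaks here. Next."))
```

Requiring whitespace after the terminator is what lets "3.14" through, since the period there is followed by a digit.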
This works in .NET:
/(?<=^\s*)(?!\s)("(\<'.*?'\>|.)*"|.)*?((?<='*"*)|[.?!]+|$)(?=\ \ |\n\n|$)/s
Handles quotation marks (American-style), including nested quotes "like this 'and this.' Yes, with punctuation.", and sentences ending with multiple punctuation marks. Also ignores leading whitespace. It does require two spaces, two end-of-lines, or an end-of-file after the sentence, though.
Handles the following well:
So much for Mr. Regex and his sentence matching, as he says "this sentence, isn't it wonderful? One says, 'It's almost as if this was crafted purely for example.'" This part shouldn't match, though.
It isn't just a regex, but I wrote a Python function to do this: Separating sentences. Natural language processing is notoriously difficult, so there are cases this doesn't treat right, but it does handle some tricky cases well.
What you need, in the end, is natural language parsing, which is extremely difficult to do and probably impossible with regular expressions alone (even super souped-up PCRE ones). Consider this sentence:
So much for Mr. Regex and his sentence matching.
Every answer given thus far will parse that as two sentences, and this isn't even that much of an edge case - it's quite reasonable to imagine a block of text beginning with "Dear Mr. Adams:" or something like that. You can tack on lookbehinds to check what the word before the punctuation mark was, but that's going to get unmaintainable, since you have to check for every possible abbreviation. You have to check for Mr. and e.g. and co. and St. and for so many other ones that you'll never think of. You might end up with a "pretty good" practical solution after a while, but it's going to be ugly, and one day it will fail.
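For what it's worth, the "check the word before the punctuation mark" idea can be sketched outside the regex itself, which at least keeps the abbreviation list readable. This is only an illustration in Python: the `ABBREVIATIONS` set is a hypothetical and deliberately incomplete list, and the function name is made up.

```python
import re

# Hypothetical, deliberately incomplete list of abbreviations
# (lower-case, without the trailing period).
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "st", "co", "e.g", "i.e"}

# A candidate sentence end: terminator(s) followed by whitespace or end of text.
CANDIDATE_END = re.compile(r"[.?!]+(?=\s|\Z)")

def first_sentence_with_abbreviations(text):
    """Return the first sentence, skipping periods that follow a known abbreviation."""
    text = text.lstrip()
    for match in CANDIDATE_END.finditer(text):
        # Look at the word immediately before the candidate terminator.
        words = text[:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if match.group(0) == "." and last_word in ABBREVIATIONS:
            continue  # e.g. the period in "Mr." does not end the sentence
        return text[:match.end()]
    return text  # no usable terminator: treat the whole text as one sentence

print(first_sentence_with_abbreviations(
    "So much for Mr. Regex and his sentence matching. This part shouldn't match."
))
```

It still fails on any abbreviation missing from the list, and on a sentence that genuinely ends with one, which is exactly the maintenance problem described above.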