ansaurus

Question

First Sentence Regex

Answer 1

+4 A:

well, /^[^.]+/ is the simplest one

stereofrog 2009-10-14 21:26:09

Yeah, I had that (well, I had strpos($content, '.')), but it does not cope well with "e.g." or an email address, not to mention questions ending in "?".

Tim 2009-10-14 21:29:34

good points, how about '/^.+?[.?!]+(?=\s|$)/'

stereofrog 2009-10-14 21:33:27
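A minimal sketch of how that pattern behaves, written in Python rather than PHP on the assumption that the regex syntax carries over unchanged. The helper name `first_sentence` and the sample strings (including the email address) are made up for illustration.

```python
import re

# Pattern from the comment above: lazily consume text up to the first run of
# ".", "?" or "!" that is followed by whitespace or the end of the string.
FIRST_SENTENCE = re.compile(r"^.+?[.?!]+(?=\s|$)")


def first_sentence(text):
    """Return the first sentence-like chunk of text, or None if nothing matches."""
    match = FIRST_SENTENCE.search(text)
    return match.group(0) if match else None


# The dot inside the (hypothetical) address is not followed by whitespace,
# so it is skipped.
print(first_sentence("Mail me at tim@example.com today. Thanks!"))
# Question marks end a sentence too.
print(first_sentence("Is this a question? Yes."))
# Abbreviations followed by a space still cut the sentence short: "See e.g."
print(first_sentence("See e.g. the manual. It helps."))
```

The lookahead fixes the email and question cases, but an abbreviation followed by a space still looks like a sentence end.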

I ended up going with /^.{150,}?[.?!]+(?=\s|$)/ as I wanted to ensure really short sentences were missed. Thanks also to Chris Lutz who did a good job of explaining the complexities of the problem.

Tim 2009-10-15 06:43:00

P.S. If there is no match, I'm doing preg_replace('/\s+?(\S+)?$/', '', substr($content, 0, $max_length = 200)).'...'; to cut the text at 200 chars and break on the last space before that point.

Tim 2009-10-15 06:45:07
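Put together, the two steps Tim describes might look roughly like this. It is a Python sketch of the same idea, not his actual PHP; the function name `excerpt` and the test strings are assumptions made for illustration.

```python
import re

# Tim's final pattern: at least 150 characters before the closing punctuation.
LONG_SENTENCE = re.compile(r"^.{150,}?[.?!]+(?=\s|$)")
# Tim's fallback pattern: trailing whitespace plus an optional partial word.
TRAILING_WORD = re.compile(r"\s+?(\S+)?$")


def excerpt(content, max_length=200):
    """Return the first long-enough sentence, or a truncated teaser."""
    match = LONG_SENTENCE.search(content)
    if match:
        return match.group(0)
    # No match: cut at max_length, drop the trailing partial word, add an ellipsis.
    truncated = content[:max_length]
    return TRAILING_WORD.sub("", truncated) + "..."


print(excerpt("alpha beta gamma delta", max_length=13))  # alpha beta...
```

Note that the 150-character minimum does not skip short sentences so much as absorb them: a leading "Hi." simply becomes part of the longer match.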

Answer 2

+1 A:

If by "sentence" you mean "line", simply match ^.* against the chunk of text. By default the dot does not match newline characters.

If it's really the first sentence, do something like this: ^[^.!?]*

Bart Kiers 2009-10-14 21:27:08

First sentence, not first line...

gnud 2009-10-14 21:27:34

I was editing... :)

Bart Kiers 2009-10-14 21:29:03

Answer 3

A:

I know you just want anything that works for now, but this mailing list post came up with /^[^\.]*\.\s/, and the subsequent post came up with ([\s\S]+?)\.( |\r|\n).

Though these patterns seem to match only periods, it's up to you to modify them to also match other types of punctuation, such as exclamation marks and question marks.

Jorge Israel Peña 2009-10-14 21:30:18

What about sentences that end with a `!` or `?`?

Bart Kiers 2009-10-14 21:32:11

That's what I said in my post, heh.

Jorge Israel Peña 2009-10-14 21:56:02

Answer 4

A:

/\A(.+?)[.?!] /s

This matches everything up to one of those punctuation marks followed by a space. That's what a sentence is, isn't it? With the /s modifier, the dot also matches newlines.

SilentGhost 2009-10-14 21:34:01

So much for Mr. Regex and his sentence matching.

Chris Lutz 2009-10-14 21:38:43

I beg your pardon?

SilentGhost 2009-10-14 21:41:18

Test your regex against that - it will parse as two sentences because of Mr.

Chris Lutz 2009-10-14 21:43:10
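The failure Chris describes is easy to reproduce. The sketch below uses Python's `re` module, with `re.DOTALL` standing in for the `/s` modifier; the second sentence in the sample text is added here purely for illustration.

```python
import re

# SilentGhost's pattern; re.DOTALL plays the role of /s so "." matches newlines.
PATTERN = re.compile(r"\A(.+?)[.?!] ", re.DOTALL)

text = "So much for Mr. Regex and his sentence matching. It fails here."
match = PATTERN.search(text)

# The lazy group stops at the first "punctuation plus space", which is "Mr. ".
print(repr(match.group(1)))  # 'So much for Mr'
```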

What about e.g. *e.g.*?

Gumbo 2009-10-14 21:43:31

It will work for the majority of cases.

SilentGhost 2009-10-14 21:46:02

Answer 5

A:

This works in .NET:

/(?<=^\s*)(?!\s)("(\<'.*?'\>|.)*"|.)*?((?<='*"*)|[.?!]+|$)(?=\ \ |\n\n|$)/s

Handles quotation marks (American-style) (and quotes "like this 'and this.' Yes, with punctuation.") and sentences ending with multiple punctuation marks. Also ignores preceding whitespace. Requires two spaces, two end-of-lines, or an end-of-file after sentences, though.

Handles the following well:

So much for Mr. Regex and his sentence matching, as he says "this sentence, isn't it wonderful? One says, 'It's almost as if this was crafted purely for example.'" This part shouldn't match, though.

strager 2009-10-14 21:38:22

So much for Mr. Regex and his sentence matching.

Chris Lutz 2009-10-14 21:40:01

@Lutz, So much for him.

strager 2009-10-14 21:51:26

Did you test it? I don't think PHP supports variable-length look-behind.

SilentGhost 2009-10-14 21:56:48

@SilentGhost, Oh, I wasn't testing on PHP. *doh*

strager 2009-10-14 22:02:20
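The same restriction exists outside PHP. As a side illustration, Python's built-in `re` module also rejects variable-length look-behind, so a quick compile check shows which construct in the .NET pattern is the problem. The helper name `compiles` is made up for this sketch, and it says nothing about PHP's own error reporting.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# (?<=^\s*) can match any amount of whitespace, so its width is not fixed.
print(compiles(r"(?<=^\s*)\S"))  # False
# A look-behind of exactly two whitespace characters is fine.
print(compiles(r"(?<=\s\s)\S"))  # True
```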

Answer 6

+3 A:

It isn't just a regex, but I wrote a Python function to do this: Separating sentences. Natural language processing is notoriously difficult, so there are cases this doesn't treat right, but it does handle some tricky cases well.

Ned Batchelder 2009-10-14 21:45:37

This is definitely the right approach - defining English grammar rules, rather than trying to build a regex that can only be convoluted and inaccurate.

Peter Boughton 2009-10-14 21:56:23

Answer 7

+2 A:

What you need, in the end, is natural language parsing, which is extremely difficult to do, and probably impossible for regular expressions (even super-souped-up PCRE ones) alone. Consider this sentence:

So much for Mr. Regex and his sentence matching.

Every answer given thus far will parse that as two sentences, and this isn't even that much of an edge case - it's quite reasonable to imagine a block of text beginning with "Dear Mr. Adams:" or something like that. You can tack on lookbehinds to check what the word before the punctuation mark was, but that's going to get unmaintainable, since you have to check for every possible abbreviation. You have to check for Mr. and e.g. and co. and St. and for so many other ones that you'll never think of. You might end up with a "pretty good" practical solution after a while, but it's going to be ugly, and one day it will fail.

Chris Lutz 2009-10-14 21:50:53
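To make the abbreviation problem concrete, here is a rough sketch of the "check the word before the punctuation mark" idea, done in plain Python rather than with lookbehinds. The abbreviation list is hypothetical and deliberately incomplete, and the function name `first_sentence` is an assumption; this is exactly the kind of "pretty good" heuristic the answer warns about, not a real sentence splitter.

```python
import re

# Hypothetical, deliberately incomplete list of abbreviations (without the
# trailing period). A real list would never be complete either.
ABBREVIATIONS = {"mr", "mrs", "dr", "st", "co", "e.g", "i.e"}

# A candidate sentence end: punctuation followed by whitespace or end of text.
CANDIDATE_END = re.compile(r"[.?!]+(?=\s|$)")


def first_sentence(text):
    """Return text up to the first candidate end that is not an abbreviation."""
    text = text.lstrip()
    for match in CANDIDATE_END.finditer(text):
        words = text[:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if match.group(0) == "." and last_word in ABBREVIATIONS:
            continue  # a period after a known abbreviation is not a sentence end
        return text[:match.end()]
    return text  # no usable sentence end found


print(first_sentence("So much for Mr. Regex and his sentence matching. Next one."))
# The heuristic fails as soon as an abbreviation really does end a sentence:
print(first_sentence("I live on Main St. It is nice."))
```

The first call returns the full "Mr. Regex" sentence, but the second returns both sentences as one, because "St." is always treated as an abbreviation.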

My solution seems to work and isn't too ugly. It assumes two spaces after each sentence, though. It also doesn't handle quotes.

strager 2009-10-14 21:52:34

The two spaces after each sentence is nice, if people follow it (I for one hate it, and never do that, so maybe I'm just biased). But your regex is the exact point at which I would stop and say "This isn't a job for regular expressions."

Chris Lutz 2009-10-14 21:56:54

I have to agree that regular expressions aren't the right tool for the job. But they're good enough for quick'n'dirty work, and if this is a one-off job over a thousand texts, a regexp with human correction is IMO more efficient than a full language parser (unless there's already a parser out there that can be used).

strager 2009-10-14 22:06:59

We can write a "good-enough" regex for a subset of data, but then we'd need some sample data to look at.

Chris Lutz 2009-10-14 22:26:25

Thanks for the final confirmation, Chris. I spent some time last night with a sample paragraph, unknowingly proving to myself that this isn't possible, after leaping on it (there's always a mad rush to answer regex questions :-) before going to sleep (d'oh). At least my regex skills were put to the test in private...

Dave Everitt 2009-10-15 07:36:20
