ansaurus

Question

Answer 1

A:

I tried this in Notepad++ and came up with:

.*$

Then enable the multiline option:

re.MULTILINE
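
As a rough sketch of what this pattern does in Python (the sample text is made up for illustration): with re.MULTILINE, `.*$` returns whole lines rather than sentences, so it only helps when each sentence sits on its own line. Empty matches are filtered out here.

```python
import re

# Hypothetical sample text: three sentences spread over two lines.
text = "First sentence. Second one\ncontinues here. Third!"

# With re.MULTILINE, $ matches at the end of every line,
# so '.*$' captures each line as a whole, not each sentence.
lines = [m for m in re.findall(r'.*$', text, re.MULTILINE) if m]

print(lines)
```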

Cheers

Ars 2010-08-23 15:37:43

Answer 2

A:

Try the other way around: Split the text at sentence boundaries.

lines = re.split(r'\s*[!?.]\s*', text)

If that doesn't work, add a \ before the . in the character class.
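
A small sketch of how this split behaves, using a made-up sample string. Note that the delimiters are consumed and a trailing empty string is left over; the lookbehind variant is one possible alternative (an assumption, not part of the answer above) that keeps the punctuation.

```python
import re

# Hypothetical sample text for illustration.
text = 'OMG is this a question ! Is this a sentence ? My name is.'

# Splitting on the punctuation drops it and leaves an empty tail.
parts = re.split(r'\s*[!?.]\s*', text)
sentences = [p for p in parts if p]

# Alternative sketch: split on the whitespace that follows a
# delimiter, so each sentence keeps its punctuation.
kept = re.split(r'(?<=[.!?])\s+', text)

print(parts)
print(sentences)
print(kept)
```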

Aaron Digulla 2010-08-23 15:38:31

Answer 3

A:

Something like this works:

## pattern: uppercase letter, then anything that is not in (.!?), then one of them
>>> pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
>>> pat.findall('OMG is this a question ! Is this a sentence ? My. name is.')
['OMG is this a question !', 'Is this a sentence ?', 'My.']

Notice that "name is." is not in the result because it does not start with an uppercase letter.

Your problem comes from the use of the ^ and $ anchors; they work on the whole text.

THC4k 2010-08-23 15:38:51

Thanks a lot. I adapted it to re.findall since I have to process a text file. Is there a way to keep the '\n' character out of the results? In sentences that carry over to a new line, the \n shows up between the words on different lines.

sarevok 2010-08-23 16:07:42

@sarevok: You can replace the \n with a space before splitting it, using `text.replace('\n', ' ')`.
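
A minimal sketch of the newline cleanup, assuming lines wrap between words with no extra space (the sample text is invented). Replacing '\n' with an empty string would glue the two words together, so this version collapses each run of whitespace into a single space before matching.

```python
import re

# Hypothetical text where a sentence wraps onto a new line.
text = "OMG is this a question ! My\nname is Bob."

# Replacing the newline with nothing joins the neighbouring words.
glued = text.replace('\n', '')

# Collapsing whitespace runs (including newlines) into one space
# keeps the words apart.
normalized = re.sub(r'\s+', ' ', text)

sentences = re.findall(r'[A-Z][^.!?]*[.!?]', normalized)

print(glued)
print(sentences)
```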

THC4k 2010-08-23 17:20:14

@THC4k: Thanks once again :)

sarevok 2010-08-26 12:04:30

Answer 4

+1 A:

There are two issues in your regex:

1. Your expression is anchored by ^ and $, which are the "start of line" and "end of line" anchors, respectively. That means your pattern tries to match an entire line of your text.
2. You are searching for \s+ before your punctuation character, which requires one or more whitespace characters. If there is no whitespace before the punctuation, the expression will not match.
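
Both issues can be illustrated with a short sketch. The `anchored` pattern below is a hypothetical stand-in that has the two problems described, not the asker's actual regex, and the sample text is made up.

```python
import re

# Hypothetical sample: two sentences on line one, one on line two.
text = "Is this a question? Yes it is.\nThis one has a space before the mark !"

# Stand-in pattern with both issues: ^...$ forces a whole-line match,
# and \s+ demands whitespace right before the punctuation.
anchored = re.compile(r'^[A-Z].*\s+[.!?]$', re.MULTILINE)

# Relaxed sketch: no anchors, no required whitespace.
relaxed = re.compile(r'[A-Z][^.!?]*[.!?]')

anchored_matches = anchored.findall(text)
relaxed_matches = relaxed.findall(text)

print(anchored_matches)
print(relaxed_matches)
```

Only the second line satisfies the anchored pattern, because it is the only line that ends in whitespace followed by punctuation.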

Daniel Vandersluis 2010-08-23 15:39:08

Upvoted for actually explaining both things that were problems, and not just handing out a fixed regex.

cincodenada 2010-08-23 15:58:44

Answer 5

A:

Edited: now it will work with multiline sentences too.

>>> t = "OMG is this a question ! Is this a sentence ? My\n name is."
>>> re.findall(r"[A-Z].*?[.!?]", t, re.MULTILINE | re.DOTALL)
['OMG is this a question !', 'Is this a sentence ?', 'My\n name is.']

One thing left to explain: re.DOTALL makes . match newlines as well, as described in the Python re module documentation.
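
To see the effect of the flag in isolation, here is a short comparison on the same string with and without re.DOTALL. Without it, . stops at the newline, so the sentence that wraps across lines is not found.

```python
import re

t = "OMG is this a question ! Is this a sentence ? My\n name is."
pattern = r"[A-Z].*?[.!?]"

# Without DOTALL, '.' cannot cross the newline after "My".
without_dotall = re.findall(pattern, t)

# With DOTALL, '.' also matches '\n', so the last sentence is found.
with_dotall = re.findall(pattern, t, re.DOTALL)

print(without_dotall)
print(with_dotall)
```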

cji 2010-08-23 15:39:39

Answer 6

A:

You can try:

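import re

# Read the whole input file ('a' is the file name used here)
with open('a') as p:
    process = p.read()
print(process)

# Collect every run of non-delimiters that ends in a delimiter
regexMatch = re.findall(r'[^.!?]+[.!?]', process)
print(regexMatch)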

The regex used here is [^.!?]+[.!?], which matches one or more characters that are not sentence delimiters, followed by a sentence delimiter.
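
A quick sketch of this pattern on an in-memory string (the sample text is invented). Since whitespace counts as a non-delimiter, every match after the first starts with the space that followed the previous sentence, so stripping the results is one way to tidy them up.

```python
import re

# Hypothetical sample text for illustration.
text = "Hello world. How are you? Fine!"

# Each match runs up to and including the next delimiter.
raw = re.findall(r'[^.!?]+[.!?]', text)

# Leading whitespace belongs to the match, so strip it afterwards.
sentences = [s.strip() for s in raw]

print(raw)
print(sentences)
```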

codaddict 2010-08-23 15:40:43

Regex to find all sentences of text ?