views: 125
answers: 4
I have to cut a Unicode string which is actually an article (it contains sentences): I want to cut this article string after the Xth sentence in Python.

A good indicator of a sentence ending is that it ends with a full stop (".") and the next word starts with a capital letter, such as:

myarticle = "Hi, this is my first sentence. And this is my second. Yet this is my third."

How can this be achieved?

Thanks

A: 

Try this:

import re
'.'.join(re.split(r'\.(?=\s*[A-Z])', myarticle)[:2]) + '.'

It cuts your string after the second sentence ([:2]).

However, there are some problems (as always when dealing with natural language). Most notably, it will only recognize a sentence that starts with 'A-Z'. That may be true for English, but not for other languages.
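As a sketch only, the [:2] slice generalizes to the first N sentences; cut_after is a hypothetical helper name, not part of any library:

import re

def cut_after(article, n):
    # Split on a full stop that is followed by optional whitespace
    # and a capital letter, keep the first n pieces, and restore
    # the trailing full stop.
    parts = re.split(r'\.(?=\s*[A-Z])', article)
    return '.'.join(parts[:n]) + '.'

print(cut_after(myarticle, 2))
# Hi, this is my first sentence. And this is my second.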

Felix Schwarz
+1, but only because I can't vote +2 :). Regular expressions are so powerful.
luc
This fails with "1st sentence. 2nd sentence."…
EOL
@EOL: Is it just the problem with the final '.'? I don't think that's worth a -1. It is still a nice one-line solution, even if other, longer solutions might be better.
luc
re.split(r'\.(?=\s+[A-Z])', myarticle) will not break "U.S.A.", but it requires a space after the full stop.
luc
@luc: Yes, there is a double '.' when there are N or fewer sentences in the input text. This is perfectly avoidable with a relatively short one-liner (as in my answer, for instance). Regular expressions *are* indeed powerful, so it's best to use their power. :)
EOL
+12  A: 

Consider downloading the Natural Language Toolkit (NLTK). Its sentence tokenizer will not break on abbreviations like "U.S.A." and will not fail to split sentences that end in "?!".

>>> import nltk  # you may need nltk.download('punkt') once, for the tokenizer's model
>>> paragraph = u"Hi, this is my first sentence. And this is my second. Yet this is my third."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> sentences
[u'Hi, this is my first sentence.', u'And this is my second.', u'Yet this is my third.']

Your code becomes much more readable. To access the second sentence, you use the indexing notation you're used to.

>>> sentences[1]
u"And this is my second."
Tim McNamara
+1, for the added bonus of not breaking "U.S.A.".
EOL
Seems great, but I am not able to import tokenize_sents from nltk. Which module do I need? Regards.
Hellnar
@Hellnar sorry about this, from memory I had the incorrect function. Try again with `nltk.sent_tokenize()`.
Tim McNamara
thanks Tim, now works great!
Hellnar
+1  A: 

If there can be punctuation marks other than the usual '.', you should probably try this:

re.split(r'\W(?=[A-Z])', myarticle)

This returns the list of sentences. Of course, it does not correctly handle the cases mentioned by Paul.
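
A quick illustration of both the benefit and the limitation; the sample strings are invented for illustration:

>>> import re
>>> re.split(r'\W(?=[A-Z])', 'Is this real?! Yes it is.')
['Is this real?!', 'Yes it is.']
>>> re.split(r'\W(?=[A-Z])', 'The WWF breaks this.')
['The', 'WWF breaks this.']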

xmoleslo
This does not work with something like "The WWF breaks this."
EOL
+2  A: 

Here is a more robust solution:

myarticle = """This is a sentence.
   And another one.
   And a 3rd one."""

N = 3  # 3 sentences

print ''.join(sentence+'.' for sentence in re.split('\.(?=\s*(?:[A-Z]|$))', myarticle, maxsplit=N)[:-1])

This solution has a few advantages over some of the other possibilities mentioned before:

  1. It works even when there are exactly N sentences in your text. Some other answers yield a double '.' at the end in that case. This is avoided here by taking into account the fact that the last sentence is not followed by an uppercase letter, but by the end of the text ($).

  2. This works even when there are fewer than N sentences in the text.

  3. The number of splits is capped by the maxsplit argument to re.split(), so no unnecessary splitting is done, which makes it quite efficient.
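
Wrapped into a reusable helper (first_sentences is just an illustrative name, not from any library), the same idea reads:

import re

def first_sentences(text, n):
    # Split on '.' followed by optional whitespace and either an
    # uppercase letter or the end of the text; keep at most n pieces
    # and re-attach the full stops.
    pieces = re.split(r'\.(?=\s*(?:[A-Z]|$))', text, maxsplit=n)[:-1]
    return ''.join(piece + '.' for piece in pieces)

print(first_sentences(myarticle, 3))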

Hope this helps!

EOL