views: 125
answers: 4
I have to cut a Unicode string which is actually an article (it contains sentences): I want to cut this article string after the Xth sentence in Python.

A good indicator of a sentence ending is that it ends with a full stop (".") and the next word starts with a capital letter, such as:

myarticle = "Hi, this is my first sentence. And this is my second. Yet this is my third."

How can this be achieved?

Thanks

A: 

Try this:

import re
'.'.join(re.split(r'\.(?=\s*[A-Z])', myarticle)[:2]) + '.'

It cuts your string after the second sentence ([:2]).

However, there are some problems (as always when dealing with natural language). Most notably, it will only recognize a sentence that starts with 'A-Z'. That may be true for English, but not for other languages.
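As a sketch only, the [:2] slice generalizes to the first N sentences; cut_after is a hypothetical helper name, not part of any library:

import re

def cut_after(article, n):
    # Split on a full stop that is followed by optional whitespace
    # and a capital letter, keep the first n pieces, and restore
    # the trailing full stop.
    parts = re.split(r'\.(?=\s*[A-Z])', article)
    return '.'.join(parts[:n]) + '.'

print(cut_after(myarticle, 2))
# Hi, this is my first sentence. And this is my second.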

Felix Schwarz
+1, but only because I can't vote +2 :). Regular expressions are so powerful.
luc
This fails with "1st sentence. 2nd sentence."…
EOL
@EOL: Is it just the problem with the final '.'? I don't think that's worth a -1. It is still a nice one-line solution, even if other, longer solutions might be better.
luc
re.split(r'\.(?=\s+[A-Z])', myarticle) will not break "U.S.A.", but it requires a space after the full stop.
luc
@luc: Yes, there is a double '.' when there are N or fewer sentences in the input text. This is perfectly avoidable with a relatively short one-liner (as in my answer, for instance). Regular expressions *are* indeed powerful, so it's best to use their power. :)
EOL
+12  A: 

Consider downloading the Natural Language Toolkit (NLTK). Its sentence tokenizer will not break on abbreviations like "U.S.A." and will not fail to split sentences that end in "?!".

>>> import nltk  # you may need nltk.download('punkt') once, for the tokenizer's model
>>> paragraph = u"Hi, this is my first sentence. And this is my second. Yet this is my third."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> sentences
[u'Hi, this is my first sentence.', u'And this is my second.', u'Yet this is my third.']

Your code becomes much more readable. To access the second sentence, you use the indexing notation you're used to.

>>> sentences[1]
u"And this is my second."
Tim McNamara
+1, for the added bonus of not breaking "U.S.A.".
EOL
Seems great, but I am not able to import tokenize_sents from nltk. Which module do I need? Regards.
Hellnar
@Hellnar sorry about this, from memory I had the incorrect function. Try again with `nltk.sent_tokenize()`.
Tim McNamara
thanks Tim, now works great!
Hellnar
+1  A: 

If there can be punctuation marks other than the usual '.', you should probably try this:

re.split(r'\W(?=[A-Z])', myarticle)

This returns the list of sentences. Of course, it does not correctly handle the cases mentioned by Paul.
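
A quick illustration of both the benefit and the limitation; the sample strings are invented for illustration:

>>> import re
>>> re.split(r'\W(?=[A-Z])', 'Is this real?! Yes it is.')
['Is this real?!', 'Yes it is.']
>>> re.split(r'\W(?=[A-Z])', 'The WWF breaks this.')
['The', 'WWF breaks this.']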

xmoleslo
This does not work with something like "The WWF breaks this."
EOL
+2  A: 

Here is a more robust solution:

myarticle = """This is a sentence.
   And another one.
   And a 3rd one."""

N = 3  # 3 sentences

print ''.join(sentence+'.' for sentence in re.split('\.(?=\s*(?:[A-Z]|$))', myarticle, maxsplit=N)[:-1])

This solution has a few advantages over some of the other possibilities mentioned before:

  1. It works even when there are exactly N sentences in your text. Some other answers yield a double '.' at the end in that case. This is avoided here by taking into account the fact that the last sentence is not followed by an uppercase letter, but by the end of the text ($).

  2. This works even when there are fewer than N sentences in the text.

  3. The number of splits is capped by the maxsplit argument to re.split(), so no unnecessary splitting is done, which makes it quite efficient.
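
Wrapped into a reusable helper (first_sentences is just an illustrative name, not from any library), the same idea reads:

import re

def first_sentences(text, n):
    # Split on '.' followed by optional whitespace and either an
    # uppercase letter or the end of the text; keep at most n pieces
    # and re-attach the full stops.
    pieces = re.split(r'\.(?=\s*(?:[A-Z]|$))', text, maxsplit=n)[:-1]
    return ''.join(piece + '.' for piece in pieces)

print(first_sentences(myarticle, 3))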

Hope this helps!

EOL