ansaurus

Question

Converting regex statement for sentence extraction to Ruby

Answer 1

+1 A:

This should work in Ruby 1.9, or in Ruby 1.8 if you compiled it with the Oniguruma regex engine (which is standard in Ruby 1.9):

result = text.split(/(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/)

The difference is that your code passes a literal string to split(), while this code passes a literal regex.

It won't work with the classic Ruby regex engine (the default in Ruby 1.8), because that engine doesn't support lookbehind.

I also modified the regular expression. I replaced (\s|\r\n) with \s+. My regex also splits sentences that have multiple spaces between them (typing two spaces after a sentence is common in many cultures) and/or multiple line breaks between them (delimiting paragraphs).
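As a rough illustration of the \s+ change, here is a minimal sketch run against a made-up sample string (the text and the constant name SENTENCE_BOUNDARY are assumptions, not from the question). It uses a non-capturing group, since String#split adds the contents of any capturing group to the returned array.

```ruby
# Hypothetical sample: two spaces after the first sentence,
# a blank line (two line breaks) before the third.
text = "He said hello.  She waved back!\n\n\"Are you coming?\" he asked."

# Split on whitespace that follows sentence-ending punctuation
# (optionally followed by a closing quote) and precedes a capital letter.
SENTENCE_BOUNDARY = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

sentences = text.split(SENTENCE_BOUNDARY)
sentences.each { |s| puts s.inspect }
```

The whitespace after the closing quote in the last sentence is not treated as a boundary, because the next word starts with a lowercase letter.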

When working with Unicode text, a further improvement would be to replace a-z with \p{Ll}\p{Lo}, A-Z with \p{Lu}\p{Lt}\p{Lo}, and 0-9 with \p{N} in the various character classes in your regex. The character class with punctuation symbols can be expanded similarly. That'll need a bit more research because there's no Unicode property for end-of-sentence punctuation.
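A sketch of what that substitution might look like, assuming a UTF-8 source file and string (the default in current Ruby versions). The sample text and the constant names are made up for illustration, and the punctuation class is left as the ASCII-only [.?!].

```ruby
# ASCII-only version, for comparison.
ASCII_BOUNDARY = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# Same pattern with the letter and digit ranges replaced by Unicode properties.
UNICODE_BOUNDARY = /(?:(?<=[\p{Ll}\p{Lo}\p{N})][.?!])|(?<=[\p{Ll}\p{Lo}\p{N}][.?!]"))\s+(?="?[\p{Lu}\p{Lt}\p{Lo}])/

# Hypothetical sample whose second and third sentences start with non-ASCII capitals.
text = "Das ist schön.  Über uns scheint die Sonne!\nÉtait-ce vrai?"

ascii_only = text.split(ASCII_BOUNDARY)    # finds no boundary: Ü and É are not in A-Z
sentences  = text.split(UNICODE_BOUNDARY)  # splits before Ü and before É

puts ascii_only.length
puts sentences.inspect
```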

Jan Goyvaerts 2010-05-02 02:58:23

Hi, thanks for the Oniguruma lead. I am trying to use the gem so I do not have to re-compile my ruby 1.8: http://oniguruma.rubyforge.org/. This seems to be working but I get nil if I do: reg = Oniguruma::ORegexp.new( '((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])' ) and then reg.scan(text). Should this way work?

DavidP6 2010-05-04 02:01:29
