I found this regex at http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation for sentence boundary disambiguation, but I am not able to use it in a Ruby split statement. I'm not too good with regex, so maybe I am missing something? This is the regex:

((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])

and this is what I tried in Ruby, but no go:

text.split("((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])")
+1  A: 

This should work in Ruby 1.9, or in Ruby 1.8 if you compiled it with the Oniguruma regex engine (which is standard in Ruby 1.9):

result = text.split(/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/)

The difference is that your code passes a literal string to split(), while this code passes a literal regex.
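To make the string-versus-regex difference concrete, here is a minimal sketch. The sample text and variable names are made up for illustration, and the pattern is a simplified version of the one above (quote handling left out). It assumes Ruby 1.9 or later, since it uses lookbehind.

```ruby
# Hypothetical sample text, not taken from the question.
text = "Hello there. How are you? Fine."

# Simplified boundary pattern, kept in a single-quoted string so the
# backslashes reach the regex engine unchanged.
pattern_source = '(?<=[a-z0-9)][.?!])\s+(?=[A-Z])'

# A String argument is treated as a literal delimiter. The text never
# contains this exact character sequence, so nothing is split.
as_string = text.split(pattern_source)

# A Regexp argument is interpreted as a pattern.
as_regex = text.split(Regexp.new(pattern_source))

p as_string  # => ["Hello there. How are you? Fine."]
p as_regex   # => ["Hello there.", "How are you?", "Fine."]
```

Writing the pattern as a /.../ literal, as in the answer, is equivalent to the Regexp.new call here and is the more usual style.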

It won't work using the classic Ruby regex engine (which is standard in Ruby 1.8) because it doesn't support lookbehind.

I also modified the regular expression. I replaced (\s|\r\n) with \s+. My regex also splits sentences that have multiple spaces between them (typing two spaces after a sentence is common in many cultures) and/or multiple line breaks between them (delimiting paragraphs).
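A small sketch of what that change does in practice, assuming Ruby 1.9 or later. The sample text and the helper names (behind, ahead, single_ws, multi_ws, with_group) are hypothetical. The first two patterns use a non-capturing group (?:...), which is an adjustment made for this sketch and not part of the regex above. The last one keeps the capturing group to show its side effect on split.

```ruby
# Hypothetical sample: two spaces after the first sentence and a blank
# line after the second.
text = "First sentence.  Second one!\n\nThird here?"

# Lookarounds shared by the variants below.
behind = '(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))'
ahead  = '(?="?[A-Z])'

# Original whitespace part: exactly one whitespace character.
single_ws = Regexp.new(behind + '(?:\s|\r\n)' + ahead)
# Modified whitespace part: one or more whitespace characters.
multi_ws  = Regexp.new(behind + '\s+' + ahead)
# Same as multi_ws, but with the capturing group from the answer.
with_group = Regexp.new('((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])')

p text.split(single_ws)   # no split: the gaps are longer than one character
p text.split(multi_ws)    # => ["First sentence.", "Second one!", "Third here?"]
p text.split(with_group)  # empty strings from the capture group appear too
```

String#split also returns the text matched by capture groups, and the group here always matches an empty string, so the capturing variant puts an extra "" between sentences. Switching to (?:...) or calling reject(&:empty?) on the result avoids that.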

When working with Unicode text, a further improvement would be to replace a-z with \p{Ll}\p{Lo}, A-Z with \p{Lu}\p{Lt}\p{Lo}, and 0-9 with \p{N} in the various character classes in your regex. The character class with punctuation symbols can be expanded similarly. That'll need a bit more research because there's no Unicode property for end-of-sentence punctuation.
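As a rough sketch of that substitution, assuming Ruby 2.0 or later with a UTF-8 source file and UTF-8 strings: the constant names and sample text below are hypothetical, the punctuation class is left as [.?!], and a non-capturing group is used so that split returns only the sentences.

```ruby
# ASCII-only classes, as in the regex above.
ASCII_BOUNDARY = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# Same structure with Unicode properties substituted for the ASCII ranges.
UNICODE_BOUNDARY = /(?:(?<=[\p{Ll}\p{Lo}\p{N})][.?!])|(?<=[\p{Ll}\p{Lo}\p{N}][.?!]"))\s+(?="?[\p{Lu}\p{Lt}\p{Lo}])/

# Hypothetical sample with non-ASCII letters at the sentence boundaries.
text = "Das ist schön. Österreich ist groß! Égalité?"

ascii_result   = text.split(ASCII_BOUNDARY)
unicode_result = text.split(UNICODE_BOUNDARY)

p ascii_result    # not split: "Ö", "ß" and "É" are outside A-Z and a-z
p unicode_result  # => ["Das ist schön.", "Österreich ist groß!", "Égalité?"]
```

Because \p{Lo} appears on both sides of the boundary, scripts without letter case are accepted before and after the punctuation, which may over-split or under-split depending on the language.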

Jan Goyvaerts
Hi, thanks for the Oniguruma lead. I am trying to use the gem so that I do not have to recompile my Ruby 1.8: http://oniguruma.rubyforge.org/. This seems to be working, but I get nil if I do: reg = Oniguruma::ORegexp.new( '((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])' ) and then reg.scan(text). Should this work?
DavidP6