This should work in Ruby 1.9, or in Ruby 1.8 if you compiled it with the Oniguruma regex engine (which is standard in Ruby 1.9):
result = text.split(/(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/)
The difference is that your code passes a literal string to split(), while this code passes a literal regex.
It won't work using the classic Ruby regex engine (which is standard in Ruby 1.8) because it doesn't support lookbehind.
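To illustrate the string-versus-regex point, here is a minimal sketch with a made-up sample string (the variable names are mine, not from your code). It only shows how split() treats the two kinds of argument:

```ruby
text = "He left. She stayed."

# A String pattern is matched literally: split looks for the three
# characters backslash, "s", "+" and finds none, so nothing is split.
by_string = text.split('\s+')
# => ["He left. She stayed."]

# A Regexp pattern is interpreted by the regex engine:
# \s+ matches each run of whitespace.
by_regex = text.split(/\s+/)
# => ["He", "left.", "She", "stayed."]
```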
I also modified the regular expression. I replaced (\s|\r\n) with \s+. My regex also splits sentences that have multiple spaces between them (typing two spaces after a sentence is common in many cultures) and/or multiple line breaks between them (delimiting paragraphs).
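As a rough check of the \s+ behavior, the sketch below runs the regex on an invented sample containing a double space, a blank line, and a quoted sentence. It assumes Ruby 1.9 or later. The constant name and sample text are mine, and I wrote the outer group as non-capturing (?:...) because split() also returns whatever a capturing group matched, which here would add empty strings to the result:

```ruby
# Assumed name; the pattern is the one from this answer with a non-capturing outer group.
SENTENCE_BREAK = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# Two spaces after the first sentence, a blank line after the second,
# and a quoted sentence followed by a single space.
sample = "It rained (a lot).  We stayed in.\n\n\"Really?\" Yes."

sentences = sample.split(SENTENCE_BREAK)
# => ["It rained (a lot).", "We stayed in.", "\"Really?\"", "Yes."]
```

The whitespace between sentences is consumed by the match, so none of the pieces carry leading or trailing spaces or line breaks.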
When working with Unicode text, a further improvement would be to replace a-z with \p{Ll}\p{Lo}, A-Z with \p{Lu}\p{Lt}\p{Lo}, and 0-9 with \p{N} in the various character classes in your regex. The character class with punctuation symbols can be expanded similarly. That'll need a bit more research because there's no Unicode property for end-of-sentence punctuation.
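Here is a sketch of that substitution, assuming Ruby 1.9 or later and UTF-8 strings. The constant names and the Cyrillic sample are made up for illustration, and the punctuation class is left as [.?!] since expanding it is the part that needs more research:

```ruby
# encoding: utf-8

# ASCII-only version, for comparison.
ASCII_BREAK   = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# Same pattern with the letter and digit ranges swapped for Unicode properties.
UNICODE_BREAK = /(?:(?<=[\p{Ll}\p{Lo}\p{N})][.?!])|(?<=[\p{Ll}\p{Lo}\p{N}][.?!]"))\s+(?="?[\p{Lu}\p{Lt}\p{Lo}])/

russian = "Привет.  Как дела? Хорошо."

ascii_parts   = russian.split(ASCII_BREAK)
# => ["Привет.  Как дела? Хорошо."]   (no Cyrillic letter is in a-z or A-Z)

unicode_parts = russian.split(UNICODE_BREAK)
# => ["Привет.", "Как дела?", "Хорошо."]
```

Note that \p{Lo} appears on both sides because scripts without case (such as Chinese or Hebrew) have neither lowercase nor uppercase letters, so for those scripts the pattern can only check that a letter is present.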