I know this regex divides a text into sentences. Can someone help me understand how?

/(?<!\..)([\?\!\.])\s(?!.\.)/
+11  A: 

You can use YAPE::Regex::Explain to decipher Perl regular expressions:

use strict;
use warnings;
use YAPE::Regex::Explain;

my $re = qr/(?<!\..)([\?\!\.])\s(?!.\.)/;
print YAPE::Regex::Explain->new($re)->explain();

__END__

The regular expression:

(?-imsx:(?<!\..)([\?\!\.])\s(?!.\.))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
----------------------------------------------------------------------
    \.                       '.'
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
  )                        end of look-behind
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [\?\!\.]                 any character of: '\?', '\!', '\.'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
    \.                       '.'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
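
If Perl isn't at hand, a rougher analogue exists in Python: compiling a pattern with the re.DEBUG flag prints its parse tree. The sketch below is Python, not part of the Perl module shown above; dump_parse_tree is a hypothetical helper name, and the exact output format differs between Python versions, so treat it as an illustration only.

```python
import contextlib
import io
import re

# The pattern from the question, unchanged.
PATTERN = r"(?<!\..)([\?\!\.])\s(?!.\.)"


def dump_parse_tree(pattern: str) -> str:
    """Return whatever re.DEBUG prints while compiling `pattern`.

    Hypothetical helper: re.DEBUG writes to stdout, so the output is
    captured into a string instead of going to the terminal.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


if __name__ == "__main__":
    # The two negative look-arounds show up as ASSERT_NOT nodes.
    print(dump_parse_tree(PATTERN))
```

The dump is terser than the YAPE::Regex::Explain table, but the same four pieces are visible: a negative assertion, a capture group, a whitespace match, and a second negative assertion.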
toolic
+1 for having code explain the code! :-)
Sean Vieira
+1 for introducing a cool tool!
Sebastián Grignoli
+2  A: 
(?<!       # Negative look-behind: the text just before must NOT be
  \.       #   a literal "."
  .        #   followed by any 1 character
)          # (End look-behind group)
(          # Start a group (capture it to $1)
  [\?\!\.] #   containing any one of the characters in the set "?!."
)          # End group $1
\s         # followed by a whitespace character " ", \t, etc.
(?!        # Negative look-ahead: the text just after must NOT be
  .        #   any 1 character
  \.       #   followed by a literal "."
)          # (End look-ahead group)
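
The commented breakdown above can also be written as a pattern that actually compiles. Here is a minimal sketch in Python using re.VERBOSE (an assumption on my part: the thread is about Perl, where the /x modifier plays the same role). The names VERBOSE_RE, COMPACT_RE and SAMPLE are made up for the example. Note that the (?<! and (?! tokens have to stay in one piece; only the parts around them can be spread out.

```python
import re

# Commented form of the pattern. Whitespace and "#" comments are ignored
# outside character classes when re.VERBOSE is set.
VERBOSE_RE = re.compile(
    r"""
    (?<! \. . )   # look-behind: NOT preceded by "." plus any one character
    ( [?!.] )     # capture one terminator: "?", "!" or "."
    \s            # exactly one whitespace character
    (?! . \. )    # look-ahead: NOT followed by any one character plus "."
    """,
    re.VERBOSE,
)

# The compact pattern from the question, for comparison.
COMPACT_RE = re.compile(r"(?<!\..)([\?\!\.])\s(?!.\.)")

SAMPLE = "It rained. We stayed in! Why not? Fine."

if __name__ == "__main__":
    # Both forms find the same terminators: ".", "!" and "?".
    # The final "." is not matched because no whitespace follows it.
    print(VERBOSE_RE.findall(SAMPLE))
    print(COMPACT_RE.findall(SAMPLE))
```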
Sean Vieira
+2  A: 

The first part (?<!\..) is a negative look-behind. It specifies a pattern that, if found immediately before the current position, invalidates the match. In this case it looks for two characters: a period followed by any one character.

The second part is a standard capture group, which could be better expressed as ([?!.]) (you don't need the escapes inside the class brackets). It matches a sentence-ending punctuation character.

The next part is a single whitespace character (only one, oddly): \s

And the last part is a negative look-ahead: (?!.\.). Again it is guarding against the case of a single character followed by a period.

This should work reasonably well, but I don't think I would recommend it. I don't see what the coder was getting at by checking only that a period wasn't the second-most-recent character, or the second one to come.

If you are splitting on terminal punctuation, why not guard against the whole class appearing two back or two ahead? Instead, it relies only on periods not being there. A more consistent expression would be:

/(?<![?!.].)([?!.])\s(?!.[?!.])/
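
To see where the two expressions actually differ, here is a small sketch in Python (an assumption: the thread is about Perl, but both patterns use only syntax that Python's re module also accepts). split_sentences is a hypothetical helper that puts a newline where the whitespace was, and the sample text has no newlines of its own.

```python
import re

# Pattern from the question and the generalized variant suggested above.
ORIGINAL = r"(?<!\..)([\?\!\.])\s(?!.\.)"
GENERALIZED = r"(?<![?!.].)([?!.])\s(?!.[?!.])"

TEXT = "Stop!!! Who goes there? A friend."


def split_sentences(pattern: str, text: str) -> list:
    """Keep the captured terminator and break at the matched whitespace.

    Hypothetical helper; assumes `text` contains no newlines already.
    """
    return re.sub(pattern, r"\1\n", text).splitlines()


if __name__ == "__main__":
    # Original: only a "." two characters back blocks a split.
    print(split_sentences(ORIGINAL, TEXT))
    # -> ['Stop!!!', 'Who goes there?', 'A friend.']

    # Generalized: the "!" two characters back now blocks the split too.
    print(split_sentences(GENERALIZED, TEXT))
    # -> ['Stop!!! Who goes there?', 'A friend.']
```

So one visible consequence of widening the class is that runs of three or more terminators ("!!!", "?!?") stop being split points, just as "..." already was under the original pattern.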
Axeman
I suppose the two-back requirement stops it from splitting after "G.E.D.", although it does nothing at all to protect against "Dr." or "Mr." The after-space bit is more mysterious and possibly a bit cargo-culty.
hobbs
@hobbs: Well I was thinking about abbreviations, but I also thought that it doesn't guard against spaced out abbreviations that you sometimes see like "U. S. A."
Axeman
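
For what it's worth, the abbreviation cases raised in these comments are easy to check. A minimal sketch in Python (assumed here only because the pattern behaves the same way in Python's re for these inputs); split_sentences and the sample sentences are made up for the illustration.

```python
import re

# The pattern from the question, unchanged.
PATTERN = r"(?<!\..)([\?\!\.])\s(?!.\.)"


def split_sentences(text: str) -> list:
    """Hypothetical helper: break `text` wherever PATTERN matches.

    Assumes `text` contains no newlines already.
    """
    return re.sub(PATTERN, r"\1\n", text).splitlines()


if __name__ == "__main__":
    # Dotted abbreviation: the look-behind sees ".D" and blocks the split.
    print(split_sentences("She earned her G.E.D. last year."))
    # -> ['She earned her G.E.D. last year.']

    # Title: nothing blocks it, so the text is split after "Dr."
    print(split_sentences("Dr. Smith arrived."))
    # -> ['Dr.', 'Smith arrived.']

    # Spaced abbreviation: the look-ahead blocks "U." and "S.",
    # but the last letter "A." is still treated as a sentence end.
    print(split_sentences("Made in the U. S. A. by hand."))
    # -> ['Made in the U. S. A.', 'by hand.']
```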
A: 

Portions:

  • ([\?\!\.])\s: split on an ending character (., !, or ?) that is followed by a whitespace character (space, tab, newline)
  • (?<!\..): the two characters before this ending character must not be a . followed by any character
  • (?!.\.): after the whitespace character, any character directly followed by a . isn't allowed

The look-ahead (?!...) and look-behind (?<!...) assertions mainly seem to prevent splitting on (whitespaced?) abbreviations (q. e. d. etc.).
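
One way to see what each portion contributes is to drop the assertions one at a time and compare the results. The sketch below does that in Python (an assumption; the same patterns should behave identically in Perl for this input). VARIANTS, split_with and the sample text are names invented for the example.

```python
import re

# The core pattern, then each assertion added on its own, then both.
VARIANTS = {
    "core only": r"([?!.])\s",
    "with look-behind": r"(?<!\..)([?!.])\s",
    "with look-ahead": r"([?!.])\s(?!.\.)",
    "full pattern": r"(?<!\..)([?!.])\s(?!.\.)",
}

TEXT = "We tried e.g. grep and it held, q. e. d. That was it."


def split_with(pattern: str, text: str) -> list:
    """Hypothetical helper: keep the terminator, break at the whitespace.

    Assumes `text` contains no newlines already.
    """
    return re.sub(pattern, r"\1\n", text).splitlines()


if __name__ == "__main__":
    for name, pattern in VARIANTS.items():
        print(f"{name}: {split_with(pattern, TEXT)}")
    # core only:        splits after "e.g.", "q.", "e." and "d."
    # with look-behind: "e.g." is kept whole, "q. e. d." is still broken up
    # with look-ahead:  "q. e. d." is kept whole, "e.g." is still split
    # full pattern:     only the split after "q. e. d." remains
```

On this input the look-behind protects the unspaced abbreviation and the look-ahead protects the spaced one, which fits the reading above.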

Wrikken
+5  A: 

There is also the Regular Expression Analyzer, which does much the same as what toolic already suggested, but is completely web-based.

tanascius