In the example below, the regex (".*?") was first used to remove all dialogue. The next step is to remove every remaining sentence that starts with a lowercase letter, so that only sentences starting with an uppercase letter remain.

Example:

exclaimed Wade. Indeed, below them were villages, of crude huts made of timber and stone and mud. Rubble work walls, for they needed little shelter here, and the people were but savages.

asked Arcot, his voice a bit unsteady with suppressed excitement.

replied Morey without turning from his station at the window. Below them now, less than half a mile down on the patchwork of the Nile valley, men were standing, staring up, collecting in little groups, gesticulating toward the strange thing that had materialized in the air above them.

In the example above, only the following should be deleted:

exclaimed Wade.
asked Arcot, his voice a bit unsteady with suppressed excitement.
replied Morey without turning from his station at the window.

A regex or simple Perl or Python code would be appreciated. I'm using version 7 of TextPipe.

Thanks.

+3  A: 

This should work for the example you posted:

import re

text = re.sub(r'(^|(?<=[.!?])\s+)[a-z].*?[.!?](?=\s|$)', r'\1', text)
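To try the substitution on the text from the question, here is a minimal sketch. The names drop_lowercase_sentences and sample are only illustrative; the pattern itself is the one shown above.

```python
import re

# A sentence that starts with a lowercase letter, located either at the very
# start of the text or right after sentence-ending punctuation plus whitespace.
LOWERCASE_SENTENCE = re.compile(r'(^|(?<=[.!?])\s+)[a-z].*?[.!?](?=\s|$)')


def drop_lowercase_sentences(text):
    # Group 1 holds the whitespace in front of the sentence (or nothing at the
    # start of the text), so putting it back keeps the paragraph breaks.
    return LOWERCASE_SENTENCE.sub(r'\1', text)


sample = (
    "exclaimed Wade. Indeed, below them were villages, of crude huts made of "
    "timber and stone and mud. Rubble work walls, for they needed little "
    "shelter here, and the people were but savages.\n"
    "\n"
    "asked Arcot, his voice a bit unsteady with suppressed excitement.\n"
    "\n"
    "replied Morey without turning from his station at the window. Below them "
    "now, less than half a mile down on the patchwork of the Nile valley, men "
    "were standing, staring up, collecting in little groups, gesticulating "
    "toward the strange thing that had materialized in the air above them."
)

result = drop_lowercase_sentences(sample)
print(result)
```

The removed sentences leave their surrounding whitespace behind (a leading space and some extra blank lines), so a follow-up pass to tidy whitespace is probably needed.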
Max Shawabkeh
This might have trouble with a sentence that contains, say, "foo.com".
dreeves
Detecting the end of sentences reliably is not easy. We do what we can, and often it's good enough (e.g. it works on the example in the question). However, I've added a requirement of space/end-of-line after the period/question mark/exclamation mark.
Max Shawabkeh
A: 

This works for me in Perl on your example:

$s = "exclaimed Wade. Indeed, ...";

do {
  $prev = $s;
  $s =~ s/(^\s*|[.!?]\s+)[a-z][^.!?]*[.!?]\s*/$1/gs;
} until ($s eq $prev);

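Without the do loop it had trouble removing multiple consecutive lowercase sentences: each match consumes the punctuation that ends the removed sentence, so the sentence right after it no longer has a preceding [.!?] for the pattern to latch onto in the same pass.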
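For comparison, here is a rough Python port of the same idea, showing what a single pass leaves behind. The function names and the short sample string are made up for illustration; only the pattern is taken from the Perl above.

```python
import re

# Same pattern as the Perl substitution. No dot is used, so the /s flag has
# no Python counterpart to worry about here.
PATTERN = re.compile(r'(^\s*|[.!?]\s+)[a-z][^.!?]*[.!?]\s*')


def single_pass(text):
    # One global substitution, like s///g without the surrounding loop.
    return PATTERN.sub(r'\1', text)


def remove_until_stable(text):
    # Repeat the substitution until nothing changes, like the do/until loop.
    while True:
        reduced = single_pass(text)
        if reduced == text:
            return reduced
        text = reduced


# Hypothetical input with two lowercase sentences in a row.
sample = "Kept one. dropped a. dropped b. Kept two."

print(single_pass(sample))          # Kept one. dropped b. Kept two.
print(remove_until_stable(sample))  # Kept one. Kept two.
```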

Note that doing this perfectly is pretty much AI-complete. See this question for examples of the kind of sentences that you'll never get right: http://stackoverflow.com/questions/2024338/latex-sometimes.

Of course you could use LaTeX's heuristic for what's a sentence-ending period and get it right most of the time.
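As a rough illustration of that heuristic (a period directly after an uppercase letter is not treated as ending a sentence), here is a Python sketch that compares it with a naive split. The pattern names, the helper and the sample string are assumptions for this example, and it only approximates what LaTeX does.

```python
import re

# Naive rule: any ., ! or ? followed by whitespace ends a sentence.
NAIVE_BREAK = re.compile(r'(?<=[.!?])\s+')

# LaTeX-like rule: the same, except when the punctuation directly follows an
# uppercase ASCII letter (initials such as "E.B.").
LATEX_LIKE_BREAK = re.compile(r'(?<=[^A-Z][.!?])\s+')


def split_sentences(text, breaker):
    # Split at the whitespace chosen by the given rule, dropping empty pieces.
    return [s for s in breaker.split(text.strip()) if s]


sample = "I love E.B. White. he wrote essays."

naive = split_sentences(sample, NAIVE_BREAK)
latex_like = split_sentences(sample, LATEX_LIKE_BREAK)

print(naive)       # ['I love E.B.', 'White.', 'he wrote essays.']
print(latex_like)  # ['I love E.B. White.', 'he wrote essays.']
```

The trade-off is that a sentence which really does end in an uppercase letter (an acronym, say) will be glued to the one after it.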

dreeves
Ironically, this won't work with sentences containing periods.
Max Shawabkeh
I don't understand. I did test it with sentences that contain strings like "abc.def" and it works. Arguably my code does the wrong thing when a sentence contains "abc.Def". (But maybe it depends on the corpus whether that's more likely to be an intra-word or a sentence-ending period.) Or do you mean sentences with periods like "I love E.B. White."? That's a tough one.
dreeves
A: 

Why not use a module like Lingua::EN::Sentence? It makes it very easy to get pretty good sentences from arbitrary English text.

#!perl

use strict;
use warnings;

use Lingua::EN::Sentence qw( get_sentences );

my $text = <<END;

exclaimed Wade. Indeed, below them were villages, of crude huts made of timber and stone and mud. Rubble work walls, for they needed little shelter here, and the people were but savages.

asked Arcot, his voice a bit unsteady with suppressed excitement.

replied Morey without turning from his station at the window. Below them now, less than half a mile down on the patchwork of the Nile valley, men were standing, staring up, collecting in little groups, gesticulating toward the strange thing that had materialized in the air above them.
END


my $sentences = matching_sentences( qr/^[^a-z]/, $text );

print map "$_\n", @$sentences;

sub matching_sentences {
    my $re   = shift;
    my $text = shift;

    my $s = get_sentences( $text );

    @$s = grep /$re/, @$s;

    return $s;
}

Results:

Indeed, below them were villages, of crude huts made of timber and stone and mud.
Rubble work walls, for they needed little shelter here, and the people were but savages.
Below them now, less than half a mile down on the patchwork of the Nile valley, men were standing, staring up, collecting in little groups, gesticulating toward the strange thing that had materialized in the air above them.
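
If Python is preferred, the same split-then-filter approach might look like the sketch below. Note that get_sentences here is a naive regex stand-in written for this example, not a port of Lingua::EN::Sentence, so it will get abbreviations and initials wrong.

```python
import re

# Naive stand-in for a real sentence splitter: break after ., ! or ? when
# followed by whitespace.
SENTENCE_BREAK = re.compile(r'(?<=[.!?])\s+')


def get_sentences(text):
    return [s for s in SENTENCE_BREAK.split(text.strip()) if s]


def matching_sentences(pattern, text):
    # Keep only the sentences that match the given pattern.
    return [s for s in get_sentences(text) if re.search(pattern, s)]


text = (
    "exclaimed Wade. Indeed, below them were villages, of crude huts made of "
    "timber and stone and mud. Rubble work walls, for they needed little "
    "shelter here, and the people were but savages.\n"
    "\n"
    "asked Arcot, his voice a bit unsteady with suppressed excitement.\n"
    "\n"
    "replied Morey without turning from his station at the window. Below them "
    "now, less than half a mile down on the patchwork of the Nile valley, men "
    "were standing, staring up, collecting in little groups, gesticulating "
    "toward the strange thing that had materialized in the air above them."
)

# Same filter as the Perl version: the sentence must not start with a-z.
kept = matching_sentences(r'^[^a-z]', text)
print("\n".join(kept))
```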
daotoad