views:

263

answers:

6

Hello:

I'm tyring to remove lowercase sentence fragments from standard text files using regular expresions or a simple Perl oneliner.

These are commonly referred to as speech or attribution tags, for example - he said, she said, etc.

This example shows before and after using manual deletion:

  1. Original:

"Ah, that's perfectly true!" exclaimed Alyosha.

"Oh, do leave off playing the fool! Some idiot comes in, and you put us to shame!" cried the girl by the window, suddenly turning to her father with a disdainful and contemptuous air.

"Wait a little, Varvara!" cried her father, speaking peremptorily but looking at them quite approvingly. "That's her character," he said, addressing Alyosha again.

"Where have you been?" he asked him.

"I think," he said, "I've forgotten something... my handkerchief, I think.... Well, even if I've not forgotten anything, let me stay a little."

He sat down. Father stood over him.

"You sit down, too," said he.


  1. All lower case sentence fragments manually removed:

"Ah, that's perfectly true!"

"Oh, do leave off playing the fool! Some idiot comes in, and you put us to shame!"

"Wait a little, Varvara!" "That's her character,"

"Where have you been?"

"I think," "I've forgotten something... my handkerchief, I think.... Well, even if I've not forgotten anything, let me stay a little."

He sat down. Father stood over him.

"You sit down, too,"


I've changed straight quotes " to balanced and tried: ” (...)+[.]

Of course, this removes some fragments but deletes some text in balanced quotes and text starting with uppercase letters. [^A-Z] didn't work in the above expression.

I realize that it may be impossible to achieve 100% accuracy but any useful expression, perl, or python script would be deeply appreciated.

Cheers,

Aaron

+3  A: 

Here's a Python snippet that should do:

 thetext="""triple quoted paste of your sample text"""
 y=thetext.split('\n')
 for line in y:
    m=re.findall('(".*?")',line)
    if m:
        print ' '.join(m)
    else:
        print line
Vicki Laidler
A: 

The Text::Balanced module is what you seem to be after if you're looking to use Perl. The following should be able to extract all the quoted speech in your example (not pretty, but gets the job done).

It also works for Dennis' test cases.

The advantage of the code below is that the quotes are grouped by paragraph, which may or may not be useful for later analysis

Script

use strict;
use warnings;
use Text::Balanced qw/extract_quotelike extract_multiple/;

my %quotedSpeech;

{
    local $/ = '';
    while (my $text = <DATA>) { # one paragraph at a time

        while (my $speech = extract_multiple(
                            $text,
                            [sub{extract_quotelike($_[0])},],
                            undef,
                            1))
        {   push @{$quotedSpeech{$.}}, $speech; }
    }
}

# Print total number of paragraphs in DATA filehandle

print "Total paragraphs: ", (sort {$a <=> $b} keys %quotedSpeech)[-1];

# Print quotes grouped by paragraph:

foreach my $paraNumber (sort {$a <=> $b} keys %quotedSpeech) {
    print "\n\nPara ",$paraNumber;
    foreach my $speech (@{$quotedSpeech{$paraNumber}}) {
        print "\t",$speech,"\n";
    }
}
# How many quotes in paragraph 8?
print "Number of quotes in Paragraph 8: ", scalar @{$quotedSpeech{8}};

__DATA__

"Ah, that's perfectly true!" exclaimed Alyosha.

"Oh, do leave off playing the fool! Some idiot comes in, and you put us to shame!" cried the girl by the window, suddenly turning to her father with a disdainful and contemptuous air.

"Wait a little, Varvara!" cried her father, speaking peremptorily but looking at them quite approvingly. "That's her character," he said, addressing Alyosha again.

"Where have you been?" he asked him.

"I think," he said, "I've forgotten something... my handkerchief, I think.... Well, even if I've not forgotten anything, let me stay a little."

He sat down. Father stood over him.

"You sit down, too," said he.

He said, "It doesn't always work."

"Secondly," I said, "it fails for three quoted phrases..." He completed my thought, "with two unquoted ones."

I replied, "That's right." dejectedly.

Output

Total paragraphs: 10

Para 1  "Ah, that's perfectly true!"


Para 2  "Oh, do leave off playing the fool! Some idiot comes in, and you put us
to shame!"


Para 3  "Wait a little, Varvara!"
        "That's her character,"


Para 4  "Where have you been?"


Para 5  "I think,"
        "I've forgotten something... my handkerchief, I think.... Well, even if
I've not forgotten anything, let me stay a little."


Para 7  "You sit down, too,"


Para 8  "It doesn't always work."


Para 9  "Secondly,"
        "it fails for three quoted phrases..."
        "with two unquoted ones."


Para 10 "That's right."
Zaid
`perl script.pl textfile` produces no output.
Dennis Williamson
@Dennis: That's because you need to run the script as `perl script.pl "text"` with the way it's written right now.
Zaid
Several of the OP's examples do not work then.
Dennis Williamson
@Dennis: See the updated code, which works for your failed cases as well.
Zaid
Your new version is better, but it prints newlines between parts of a multiple-phrase input (the ones that start with "Wait" and "Secondly", for example).
Dennis Williamson
@Dennis: This is a non-issue; it's just formatting. I will post when I can.
Zaid
@Dennis (et al): Posted
Zaid
A: 

I am not entirely sure which editor are you using, if you are using something editor that supports atomic grouping (e.g. EditorPad Pro) You can use the regular expression below to do the search and replace:

Search for

(".+?"|^[A-Z].+\r\n)(.(?!"))* 
Note: you should replace \r\n with \n or \r according to your line breaks

Replace with

\1

Here is a bit explanation for the regular expression:

The first capturing group is for characters between quotes and lines starting with Capital Letters. The second capturing group is for any characters that is after a quote but before another quote.

Peter
I don't see any atomic groups in that regex--just two capturing groups and a negative lookahead.
Alan Moore
A: 

This works for all cases shown in the question:

sed -n '/"/!{p;b}; s/\(.*\)"[^"]*/\1" /;s/\(.*"\)\([^"]*\)\(".*"\)/\1 \3/;p' textfile

It fails for cases such as these:

He said, "It doesn't always work."

"Secondly," I said, "it fails for three quoted phrases..." He completed my thought, "with two unquoted ones."

I replied, "That's right." dejectedly.
Dennis Williamson
A: 

If I understand what you are after... passing each line through a regex like this should work...

You can use the perl debugger to play around with this. Hop into the perl debugger with just a perl -de 42 on the command line in linux/mac. (The "42" is just a valid expression - it could be anything, but why not choose the meaning of life?)

anyways

open FILE, "<", "filename.txt" or die $!;
while (my $line = <FILE>) {
   @fixed_text = $line =~ m{(?:(" .+? ")) | (?:\A .* [^"] .* \z)}xmsg;
  for my $new_line (@fixed_text) {
    print qq($new_line );
  }
  print qq(\n);
}

NOTE: Sorry I had to edit it - didn't see you wanted lines without any quotes at all...

Yes, Regex and Perl is amazing. It should be 100% accurate and get all of your instances, acept in the case where a quote extends across paragraphs

davehamptonusa
A: 

Hello:

I forgot to mention that I'm running Perl and Python on a pc. I'm using Textpipe for regex.

I get the following error message when running the Perl script. The example file "text" is in the same BIN directory. I've had to make some changes to linux perl scripts before. Should I be using different syntax? Thanks to everyone for their contributions. I'll try the python script next

C:\PERL\BIN>perl script.pl "text" Name "main::DATA" used only once: possible typo at script.pl line 10. readline() on unopened filehandle DATA at script.pl line 10.

cheers,

Aaron

Aaron
This should have been posted as a comment to your original question or edited into it. Also, you appear to have two accounts on StackOverflow. The Perl script is written to read the data from within the same file that the script is in from a section with a `__DATA__` header. It would have to be modified to read from a filename given on the command line.
Dennis Williamson