ansaurus

Question

How to remove lowercase sentence fragments from text?

Answer 1

+3 A:

Here's a Python snippet that should do it:

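import re

thetext = """triple quoted paste of your sample text"""
for line in thetext.split('\n'):
    # Collect every double-quoted phrase on the line
    m = re.findall('(".*?")', line)
    if m:
        print(' '.join(m))
    else:
        # No quotes on this line: keep it unchanged
        print(line)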

Vicki Laidler 2010-03-13 21:07:30

Answer 2

A:

The Text::Balanced module is what you seem to be after if you're looking to use Perl. The following should be able to extract all the quoted speech in your example (not pretty, but gets the job done).

It also works for Dennis' test cases.

The advantage of the code below is that the quotes are grouped by paragraph, which may or may not be useful for later analysis.

Script

use strict;
use warnings;
use Text::Balanced qw/extract_quotelike extract_multiple/;

my %quotedSpeech;

{
    local $/ = '';
    while (my $text = <DATA>) { # one paragraph at a time

        while (my $speech = extract_multiple(
                            $text,
                            [sub{extract_quotelike($_[0])},],
                            undef,
                            1))
        {   push @{$quotedSpeech{$.}}, $speech; }
    }
}

# Print total number of paragraphs in DATA filehandle

print "Total paragraphs: ", (sort {$a <=> $b} keys %quotedSpeech)[-1];

# Print quotes grouped by paragraph:

foreach my $paraNumber (sort {$a <=> $b} keys %quotedSpeech) {
    print "\n\nPara ",$paraNumber;
    foreach my $speech (@{$quotedSpeech{$paraNumber}}) {
        print "\t",$speech,"\n";
    }
}
# How many quotes in paragraph 8?
print "Number of quotes in Paragraph 8: ", scalar @{$quotedSpeech{8}};

__DATA__

"Ah, that's perfectly true!" exclaimed Alyosha.

"Oh, do leave off playing the fool! Some idiot comes in, and you put us to shame!" cried the girl by the window, suddenly turning to her father with a disdainful and contemptuous air.

"Wait a little, Varvara!" cried her father, speaking peremptorily but looking at them quite approvingly. "That's her character," he said, addressing Alyosha again.

"Where have you been?" he asked him.

"I think," he said, "I've forgotten something... my handkerchief, I think.... Well, even if I've not forgotten anything, let me stay a little."

He sat down. Father stood over him.

"You sit down, too," said he.

He said, "It doesn't always work."

"Secondly," I said, "it fails for three quoted phrases..." He completed my thought, "with two unquoted ones."

I replied, "That's right." dejectedly.

Output

Total paragraphs: 10

Para 1  "Ah, that's perfectly true!"


Para 2  "Oh, do leave off playing the fool! Some idiot comes in, and you put us
to shame!"


Para 3  "Wait a little, Varvara!"
        "That's her character,"


Para 4  "Where have you been?"


Para 5  "I think,"
        "I've forgotten something... my handkerchief, I think.... Well, even if
I've not forgotten anything, let me stay a little."


Para 7  "You sit down, too,"


Para 8  "It doesn't always work."


Para 9  "Secondly,"
        "it fails for three quoted phrases..."
        "with two unquoted ones."


Para 10 "That's right."
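
For comparison, the same paragraph-grouping idea can be sketched in Python. This is only a rough sketch and not a port of Text::Balanced: it assumes paragraphs are separated by blank lines and that every straight double quote opens and closes inside one paragraph. The function name `quotes_by_paragraph` is made up for illustration.

```python
import re

def quotes_by_paragraph(text):
    """Map 1-based paragraph number -> list of double-quoted phrases.

    Assumption: paragraphs are separated by blank lines and each quote
    opens and closes inside a single paragraph.
    """
    grouped = {}
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        found = re.findall(r'"[^"]*"', paragraph)
        if found:  # paragraphs without any quotes get no entry
            grouped[number] = found
    return grouped

sample = '''"Wait a little, Varvara!" cried her father. "That's her character," he said.

He sat down. Father stood over him.

"You sit down, too," said he.'''

for number, phrases in sorted(quotes_by_paragraph(sample).items()):
    print("Para %d\t%s" % (number, "\n\t".join(phrases)))
```

As in the Perl output above, a paragraph with no quoted speech (the second one in this sample) simply has no entry.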

Zaid 2010-03-13 21:09:25

`perl script.pl textfile` produces no output.

Dennis Williamson 2010-03-14 00:48:37

@Dennis: That's because you need to run the script as `perl script.pl "text"` with the way it's written right now.

Zaid 2010-03-14 04:27:59

Several of the OP's examples do not work then.

Dennis Williamson 2010-03-14 04:43:13

@Dennis: See the updated code, which works for your failed cases as well.

Zaid 2010-03-14 07:11:05

Your new version is better, but it prints newlines between parts of a multiple-phrase input (the ones that start with "Wait" and "Secondly", for example).

Dennis Williamson 2010-03-14 21:23:32

@Dennis: This is a non-issue; it's just formatting. I will post when I can.

Zaid 2010-03-14 21:35:15

@Dennis (et al): Posted

Zaid 2010-03-15 11:38:56

Answer 3

A:

I'm not sure which editor you are using. If it is one that supports atomic grouping (e.g. EditPad Pro), you can use the regular expression below for the search and replace:

Search for

(".+?"|^[A-Z].+\r\n)(.(?!"))* 
Note: you should replace \r\n with \n or \r according to your line breaks.

Replace with

\1

Here is a brief explanation of the regular expression:

The first capturing group matches the characters between quotes, or a whole line that starts with a capital letter. The second capturing group matches any characters that come after a quote but before another quote.
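
To see what the two groups do, the pattern can be tried outside the editor. The following is a minimal sketch using Python's `re` module, assuming `\n` line endings; Python's regex flavor is not the editor's, so results there may differ. The name `strip_unquoted` is illustrative only.

```python
import re

# The answer's pattern with \r\n swapped for \n. MULTILINE lets ^ match
# at the start of every line, not just at the start of the text.
PATTERN = re.compile(r'(".+?"|^[A-Z].+\n)(.(?!"))*', re.MULTILINE)

def strip_unquoted(text):
    # Keep group 1 (a quoted phrase, or a whole line starting with a
    # capital letter) and drop the trailing characters matched by group 2.
    return PATTERN.sub(r"\1", text)

print(strip_unquoted('"I think," he said, "I\'ve forgotten something."\n'))
```

Note that in this sketch any line beginning with a capital letter is matched whole by the second alternative, so a line such as `He said, "It doesn't always work."` passes through unchanged.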

Peter 2010-03-13 21:49:23

I don't see any atomic groups in that regex--just two capturing groups and a negative lookahead.

Alan Moore 2010-03-15 13:51:20

Answer 4

A:

This works for all cases shown in the question:

sed -n '/"/!{p;b}; s/\(.*\)"[^"]*/\1" /;s/\(.*"\)\([^"]*\)\(".*"\)/\1 \3/;p' textfile

It fails for cases such as these:

He said, "It doesn't always work."

"Secondly," I said, "it fails for three quoted phrases..." He completed my thought, "with two unquoted ones."

I replied, "That's right." dejectedly.

Dennis Williamson 2010-03-14 01:37:05

Answer 5

A:

If I understand what you are after... passing each line through a regex like this should work...

You can use the perl debugger to play around with this. Hop into the perl debugger with just a perl -de 42 on the command line in linux/mac. (The "42" is just a valid expression - it could be anything, but why not choose the meaning of life?)

Anyway:

open my $fh, "<", "filename.txt" or die $!;
while (my $line = <$fh>) {
    # Collect the matches for this line
    my @fixed_text = $line =~ m{(?:(" .+? ")) | (?:\A .* [^"] .* \z)}xmsg;
    for my $new_line (@fixed_text) {
        print qq($new_line );
    }
    print qq(\n);
}

NOTE: Sorry I had to edit it - didn't see you wanted lines without any quotes at all...

Yes, regex and Perl are amazing. It should be 100% accurate and get all of your instances, except in the case where a quote extends across paragraphs.

davehamptonusa 2010-03-14 02:24:41

Answer 6

A:

Hello:

I forgot to mention that I'm running Perl and Python on a pc. I'm using Textpipe for regex.

I get the following error message when running the Perl script. The example file "text" is in the same BIN directory. I've had to make some changes to Linux Perl scripts before. Should I be using different syntax? Thanks to everyone for their contributions. I'll try the Python script next.

C:\PERL\BIN>perl script.pl "text"
Name "main::DATA" used only once: possible typo at script.pl line 10.
readline() on unopened filehandle DATA at script.pl line 10.

cheers,

Aaron

Aaron 2010-03-14 19:45:11

This should have been posted as a comment to your original question or edited into it. Also, you appear to have two accounts on StackOverflow. The Perl script is written to read the data from within the same file that the script is in from a section with a `__DATA__` header. It would have to be modified to read from a filename given on the command line.
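
As an illustration of that kind of change, here is a minimal Python sketch (not a modification of the Perl script) that reads from a filename given on the command line instead of from embedded data. The function names and the quote regex are assumptions made for the example.

```python
import re
import sys

def quoted_only(line):
    """Return the double-quoted parts of a line, or the line itself if it has none."""
    parts = re.findall(r'"[^"]*"', line)
    return " ".join(parts) if parts else line

def process_file(path):
    """Yield the filtered version of each line in the file at `path`."""
    with open(path) as handle:
        for line in handle:
            yield quoted_only(line.rstrip("\n"))

# Run as: python script.py textfile
if __name__ == "__main__" and len(sys.argv) == 2:
    for result in process_file(sys.argv[1]):
        print(result)
```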

Dennis Williamson 2010-03-14 21:28:04
