ansaurus

Question

How can I extract paragraphs and selected lines with Perl?

Answer 1

+4 A:

You can use the range operator in scalar context to extract the whole paragraph:

while (<INFILE>) {
    chomp;
    if (/AceView summary/ .. /Please quote/) {
        print "$_\n";
    }

    print "$_\n" if /^The closest human gene/;
}
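A minimal, self-contained sketch of the same flip-flop idea, for anyone who wants to try it without the real file. The sample lines and the `extract_lines` helper are made up for illustration and are not part of the original data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every line from the one matching
# /AceView summary/ through the one matching /Please quote/ (inclusive),
# plus any line that starts with "The closest human gene".
sub extract_lines {
    my @lines = @_;
    my @out;
    for (@lines) {
        # In scalar context, .. is the flip-flop operator: it is false until
        # the left pattern matches, then stays true up to and including the
        # line on which the right pattern matches.
        if (/AceView summary/ .. /Please quote/) {
            push @out, $_;
            next;    # do not add the same line a second time below
        }
        push @out, $_ if /^The closest human gene/;
    }
    return @out;
}

# Made-up sample input, for illustration only.
my @sample = (
    'Header line',
    'AceView summary',
    'Some summary text.',
    'Please quote: (reference)',
    'Unrelated line',
    'The closest human genes, according to BlastP, are ...',
);

print "$_\n" for extract_lines(@sample);
```

Note that the range includes both marker lines, so the "Please quote" line is printed too.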

eugene y 2010-04-14 10:50:51

Answer 2

+1 A:

Off the top of my head, I'd do the extraction part of this with a simple state machine. Start with $state = 0, set it to 1 on /AceView summary/ and back to 0 on /Please quote/. Then push $_ onto your output array if $state == 1.

But I like Eugene's answer better. This is Perl, there are many ways to skin your proverbial cat...
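The state machine described above could be sketched roughly like this. The `extract_paragraph` helper and the sample lines are hypothetical, and the order of the checks is one possible choice rather than anything the answer prescribes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper implementing the state machine:
# 0 = outside the paragraph, 1 = inside it.
sub extract_paragraph {
    my @lines = @_;
    my $state = 0;
    my @output;
    for my $line (@lines) {
        $state = 1 if $line =~ /AceView summary/;    # enter the paragraph
        $state = 0 if $line =~ /Please quote/;       # leave the paragraph
        push @output, $line if $state == 1;
    }
    return @output;
}

# Made-up sample input, for illustration only.
my @sample = (
    'Header line',
    'AceView summary',
    'Some summary text.',
    'Please quote: (reference)',
    'Unrelated line',
);

print "$_\n" for extract_paragraph(@sample);
```

Because the state is reset before the push, this version keeps the "AceView summary" line but drops the "Please quote" line; moving the reset after the push would keep both.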

crazyscot 2010-04-14 11:06:40

Answer 3

+4 A:

If I understand correctly, you get that information from http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=mouse&c=gene&a=fiche&l=1700049G17Rik which returns one of the most horrible hodge-podges of HTML I have seen (maybe tied for first with the crap the Medicare plan finder spews).

However, it is still no match for HTML::TokeParser::Simple:

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

# Parse a locally saved copy of the page
my $parser = HTML::TokeParser::Simple->new('ace.html');
my ($summary, $closest_human);

# Find the <span class="hh3"> whose text is 'AceView summary'
while ( my $tag = $parser->get_tag('span') ) {
    # Guard against spans without a class attribute (avoids an undef warning)
    next unless ($tag->get_attr('class') || '') eq 'hh3';
    next unless $parser->get_text('/span') eq 'AceView summary';
    # The summary is the text up to the next <span>
    $summary = $parser->get_text('span');
    $summary =~ s/^\s+//;
    $summary =~ s/\s*Please quote:.*\z//;
    last;
}

# Find the <b> element that introduces the closest human genes
while ( my $tag = $parser->get_tag('b') ) {
    $closest_human = $parser->get_text('/b');
    next unless $closest_human eq 'The closest human genes';
    # Append the rest of the line, up to the next <br>
    $closest_human .= $parser->get_text('br');
    last;
}

print "=== Summary ===\n\n$summary\n\n";
print "=== Closest Human Gene ==\n\n$closest_human\n";

Output (snipped):

=== Summary ===

Note that this locus is complex: it appears to produce several proteins with no
sequence overlap.
Expression: According to AceView, this gene is well expressed, 
... 
Please see the Jackson Laboratory Mouse Genome Database/Informatics site
MGI_1920680 for in depth functional annotation of this gene.

=== Closest Human Gene ==

The closest human genes, according to BlastP, are the AceView genes
ZNF780AandZNF780B (e=10^-15,), ZNF766 (e=2 10^-15,), ZNF607andZNF781andZFP30
(e=2 10^-14,).

Sinan Ünür 2010-04-14 15:02:05
