I have a text file from which I need to:

  1. extract the whole paragraph under the section "AceView summary", up to (but not including) the line that starts with "Please quote";
  2. extract the line that starts with "The closest human gene";
  3. store both in an array with two elements.

The text looks like this (also on pastebin):

  AceView: gene:1700049G17Rik, a comprehensive annotation of human, mouse and worm genes with mRNAs or ESTsAceView.

  <META NAME="title"
 CONTENT="
AceView: gene:1700049G17Rik a comprehensive annotation of human, mouse and worm genes with mRNAs or EST">

<META NAME="keywords"
 CONTENT="
AceView, genes, Acembly, AceDB, Homo sapiens, Human,
 nematode, Worm, Caenorhabditis elegans , WormGenes, WormBase, mouse,
 mammal, Arabidopsis, gene, alternative splicing variant, structure,
 sequence, DNA, EST, mRNA, cDNA clone, transcript, transcription, genome,
 transcriptome, proteome, peptide, GenBank accession, dbest, RefSeq,
 LocusLink, non-coding, coding, exon, intron, boundary, exon-intron
 junction, donor, acceptor, 3'UTR, 5'UTR, uORF, poly A, poly-A site,
 molecular function, protein annotation, isoform, gene family, Pfam,
 motif ,Blast, Psort, GO, taxonomy, homolog, cellular compartment,
 disease, illness, phenotype, RNA interference, RNAi, knock out mutant
 expression, regulation, protein interaction, genetic, map, antisense,
 trans-splicing, operon, chromosome, domain, selenocysteine, Start, Met,
 Stop, U12, RNA editing, bibliography">
<META NAME="Description" 
 CONTENT= "
AceView offers a comprehensive annotation of human, mouse and nematode genes
 reconstructed by co-alignment and clustering of all publicly available
 mRNAs and ESTs on the genome sequence. Our goals are to offer a reliable
 up-to-date resource on the genes, their functions, alternative variants,
 expression, regulation and interactions, in the hope to stimulate
 further validating experiments at the bench
">


<meta name="author"
 content="Danielle Thierry-Mieg and Jean Thierry-Mieg,
 NCBI/NLM/NIH, [email protected]">




   <!--
    var myurl="av.cgi?db=mouse" ;
    var db="mouse" ;
    var doSwf="s" ;
    var classe="gene" ;
  //-->

However, I am stuck on the following script logic. What is the right way to achieve this?

   #!/usr/bin/perl -w

   my  $INFILE_file_name = $file;      # input file name

    open ( INFILE, '<', $INFILE_file_name )
        or croak "$0 : failed to open input file $INFILE_file_name : $!\n";


    my @allsum;

    while ( <INFILE> ) {
        chomp;

        my $line = $_;

        my @temp1 = ();
        if ( $line =~ /^ AceView summary/ ) {
            print "$line\n";
            push @temp1, $line;
        }
        elsif( $line =~ /Please quote/) {
            push @allsum, [@temp1];
             @temp1 = ();
        }
        elsif ($line =~ /The closest human gene/) {

            push @allsum, $line;
        }

    }

    close ( INFILE );           # close input file
    # Do something with @allsum

I need to process many files like this one.

+4  A: 

You can use the range operator in scalar context to extract the whole paragraph:

while (<INFILE>) {
    chomp;
    if (/AceView summary/ .. /Please quote/) {
        print "$_\n";
    }

    print "$_\n" if /^The closest human gene/;
}
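
As written, the loop above prints the "Please quote" line as well, and it prints rather than stores. A minimal sketch of one way to drop the closing line and return the two-element result the question asks for is shown here. It relies on the documented behaviour that the scalar range operator returns a sequence number with "E0" appended on the last line of the range. The helper name extract_sections is hypothetical, and the heading line itself is kept, as in the loop above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_sections() is a hypothetical helper for illustration.
# It takes the lines of one file and returns a two-element list:
# (summary paragraph, "The closest human gene" line).
sub extract_sections {
    my @lines = @_;
    my ( @summary, $closest );

    for my $line (@lines) {
        chomp $line;

        # In scalar context, .. returns a sequence number while the range
        # is active; on the closing line that number has "E0" appended.
        if ( my $seq = ( $line =~ /AceView summary/ .. $line =~ /Please quote/ ) ) {
            push @summary, $line unless $seq =~ /E0$/;
        }
        $closest = $line if $line =~ /^The closest human gene/;
    }
    return ( join( "\n", @summary ), $closest );
}

# Usage: pass the files to process on the command line.
# Assumption: every file contains a "Please quote" line, because the
# range operator may otherwise stay "open" into the next call.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "$0: cannot open $file: $!\n";
    my @allsum = extract_sections(<$fh>);
    close $fh;
    print "$_\n" for grep { defined } @allsum;
}
```

To skip the heading line too, one could also ignore the line on which the sequence number equals 1.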
eugene y
+1  A: 

Off the top of my head, I'd do the extraction part of this with a simple state machine. Start with $state = 0, set it to 1 on /AceView summary/ and back to 0 on /Please quote/. Then push $_ onto your output array whenever $state == 1.
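
A rough sketch of that state machine, assuming the lines have already been read into a list. The name extract_summary is made up for illustration.

```perl
use strict;
use warnings;

# extract_summary() is a hypothetical name, not from the question.
# It returns the lines from the "AceView summary" heading up to,
# but not including, the "Please quote" line.
sub extract_summary {
    my @lines = @_;
    my $state = 0;    # 0 = outside the summary, 1 = inside it
    my @out;

    for my $line (@lines) {
        chomp $line;
        $state = 1 if $line =~ /AceView summary/;
        # Reset before the push so the "Please quote" line is not stored.
        $state = 0 if $line =~ /Please quote/;
        push @out, $line if $state == 1;
    }
    return @out;
}
```

Because the state lives in a local variable, each call starts from a clean state, which is convenient when processing many files.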

But I like Eugene's answer better. This is Perl, there are many ways to skin your proverbial cat...

crazyscot
+4  A: 

If I understand correctly, you get that information from http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=mouse&c=gene&a=fiche&l=1700049G17Rik, which returns one of the most horrible hodge-podges of HTML I have seen (maybe tied for first with the crap the Medicare plan finder spews).

However, it is still no match for HTML::TokeParser::Simple:

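#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new('ace.html');
my ($summary, $closest_human);

# Find the <span class="hh3"> heading whose text is "AceView summary".
while ( my $tag = $parser->get_tag('span') ) {
    my $class = $tag->get_attr('class');
    # Some spans have no class attribute; avoid an "uninitialized" warning.
    next unless defined $class and $class eq 'hh3';
    next unless $parser->get_text('/span') eq 'AceView summary';
    # The summary is the text up to the next <span>.
    $summary = $parser->get_text('span');
    $summary =~ s/^\s+//;
    # /s lets .* cross newlines, so everything from "Please quote:" on is dropped.
    $summary =~ s/\s*Please quote:.*\z//s;
    last;
}

# Find the <b> element that opens the "closest human genes" sentence.
while ( my $tag = $parser->get_tag('b') ) {
    my $text = $parser->get_text('/b');
    next unless $text eq 'The closest human genes';
    # Append the rest of the sentence, up to the next <br>.
    $closest_human = $text . $parser->get_text('br');
    last;
}

print "=== Summary ===\n\n$summary\n\n";
print "=== Closest Human Gene ==\n\n$closest_human\n";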

Output (snipped):

=== Summary ===

Note that this locus is complex: it appears to produce several proteins with no
sequence overlap.
Expression: According to AceView, this gene is well expressed, 
... 
Please see the Jackson Laboratory Mouse Genome Database/Informatics site MGI_192
0680 for in depth functional annotation of this gene.

=== Closest Human Gene ==

The closest human genes, according to BlastP, are the AceView genes ZNF780AandZN
F780B (e=10^-15,), ZNF766 (e=2 10^-15,), ZNF607andZNF781andZFP30 (e=2 10^-14,).
Sinan Ünür