views:

830

answers:

3

I have two following Fasta file:

file1.fasta

>0
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
>1
GTTAAGTTATATCAAACTAAATATACATACTATAAA
>2
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC

file2.qual

>0
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 15 40 40
>1
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40
40 40 40 40 40 40 40 40 40 40 40
>2
40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4
40 8 3 29 10 19 18 40 19 15 5

Note the line break in "qual" file for each fasta header - marked with ">". Number of file header ('>') is the same for both files. Number of numerical qualities = sequence length.

What I want to do is to append this two files yielding:

GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40
GTTAAGTTATATCAAACTAAATATACATACTATAAA  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC  40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4 40 8 3 29 10 19 18 40 19 15 5

But somehow my code below fail to do it correctly? Especially the second line of each entry in 'qual' file doesn't get printed.

use strict;
use Data::Dumper;        
use Carp;
use File::Basename;      

my $fastafile = $ARGV[0] || "reads/2039F.2.fasta"; 
my $base      = basename( $fastafile, ".fasta" );
my $qualfile  = "reads/" . $base . ".qual";
print "$qualfile\n";

open SEQ, '<', $fastafile or die $!; #Seq
open PRB, '<', $qualfile or die $!; #quality


while (my $seq = <SEQ>) {
     my $qual = <PRB>;
     chomp($seq);
     chomp($qual);

     if ($seq =~ /^>/ || $qual =~ /^>/) {
         next;
     }
     else {
         print "$seq\t$qual\n";      
     }

}

What's the correct way to do it?

+8  A: 

The problem is you are advancing through the file in parallel, so when the line is ">" in one file, it might not be ">" in the next.

The way you are reading the data is in pairs, like so:

1: >0 
2: >0
1: GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
1: >1
2: 40 40 40 40 40 40 40 40 15 40 40
1: GTTAAGTTATATCAAACTAAATATACATACTATAAA
2: >1
1: >2
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40
1: GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC
2: 40 40 40 40 40 40 40 40 40 40 40
1: EOF
2: >2
1: EOF
2: 40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4
1: EOF
2: 40 8 3 29 10 19 18 40 19 15 5

The same set of data applied your looping rules would do this:

1: GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
1: GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC
2: 40 40 40 40 40 40 40 40 40 40 40

So you need to either separate the looping logic out or find a way to make the files match.

Here is an attempt at separating the seeking, but I haven't tested it.

fileIO: {
  while( 1 ){ 
   my $seq; 
   my $qual  = q{};
   while( 1 ){ 
     $seq = <SEQ>; 
     last fileIO if not $seq;  # stop at end of file
     last if $seq !~ /^>/; 
  }
  while( 1 ){ 
     my $qual_in = <PRB>;
     last fileIO if not $qual_in; # stop at end of file 
     last if $qual_in =~ /^>/ and $qual ne q{}; 
     next if $qual_in =~ /^>/ and $qual eq q{}; 
     $qual .= $qual_in;
  }
  print "$seq \n $qual \n";

 }
}

Update

I re-factored the above code into a single function that will read a chunk from an arbitrary file handle as needed, it seems to work as needed. Note of course I experimented here a little with a trick I've been meaning to use for something practical.

use strict;
use warnings;

# 
#  readUntilNext( $fileHandle, \$scalar_ref ); 
#
#  returns 0 when nothing could be read from the fileHandle. 
#  otherwise returns 1; 
#

sub readUntilNext {
    my ($fh)            = shift;
    my ($output)        = shift;
    my ($output_buffer) = '';
    while (1) {
        my $line = <$fh>;
        if ( !$line ) { # No more data
            # No data to flush to user, return false.
            return 0 if $output_buffer eq q{};
            last;  # data to  flush to user, loop exit. 
        }
        if ( $line =~ /^>/ ) {
            # Didn't get anything, keep looking. 
            next if $output_buffer eq q{};
            # Got something, flush data to user. 
            last;
        }
        chomp($line);
        $output_buffer .= $line;
    }
    # Data to flush to user 
    # Write to the scalar-reference 
    $$output .= $output_buffer;
    return 1;
}

open my $m, '<', 'a.txt';
open my $n , '<', 'b.txt';
# Creates 2 scalar references every loop, and only loops as long 
# as both files have data. 
while ( readUntilNext( $m, \my $seq ) && readUntilNext( $n, \my $qual ) ) {
    print "$seq\t$qual\n";
}

And the above code, tested, does exactly what you want to do.

Note on that \my stuff

while( readUntilNext( $m, \my $seq ) ) { 
}

is fundamentally the same as

my $seq; 
while( readUntilNext( $m, \$seq ) ) { 
}

Except for the fact the former creates a new scalar every time, guaranteeing that the same value wont be visible to a sucessive loop;

so it becomes more like:

while( 1 ){ 
 my $seq; 
 last if not readUntilNext($m, \$seq);
 do { 
    # loop body here
 }
}
Kent Fredric
+3  A: 

Here is a solution not using perl, but plain shell commands:

prompt>grep -v '^>[0-9]' file1.fasta > tmp1
prompt>(tr '\012' ' ' < file2.qual; echo) | sed 's/>[0-9]* /\n/g' | sed 1d > tmp2
prompt>paste tmp1 tmp2
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40
GTTAAGTTATATCAAACTAAATATACATACTATAAA    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC    40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4 40 8 3 29 10 19 18 40 19 15 5
prompt>

I searched many years for the paste command (knowing "this is a super basic operation, someone must already have implemented something to solve this problem").

The second command line first translates all newlines to spaces, and the echo command is added to add a final newline to the input (because sed will ignore lines lacking EOL), thereby joining all the input lines into one single line which then the sed command splits up again (portability note: not all sed programs will work with arbitrary line lengths, but GNU sed does).

hlovdal
+3  A: 

You're missing the 2nd (and every subsequent) line of the quality scores and would also miss additional sequence lines. For this and code re-use purposes, the way to handle FASTA sequences is as whole entries/records:

local $/ = "\n>";
while (my $seq = <SEQ>) {
     my $qual = <PRB>;
     chomp($seq);  $seq =~ s/^>*.+\n//;  $seq =~ s/\n//g;
     chomp($qual);  $qual =~ s/^>*.+\n//;  $qual =~ s/\n/ /g;

     print "$seq\t$qual\n";      

}

You could also easily capture the FASTA header in the first replace.

bubaker