ansaurus

Question

How do I merge two FASTA files (one file with line break) in Perl?

Answer 1

+8 A:

The problem is you are advancing through the file in parallel, so when the line is ">" in one file, it might not be ">" in the next.

The way you are reading the data is in pairs, like so:

1: >0 
2: >0
1: GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
1: >1
2: 40 40 40 40 40 40 40 40 15 40 40
1: GTTAAGTTATATCAAACTAAATATACATACTATAAA
2: >1
1: >2
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40
1: GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC
2: 40 40 40 40 40 40 40 40 40 40 40
1: EOF
2: >2
1: EOF
2: 40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4
1: EOF
2: 40 8 3 29 10 19 18 40 19 15 5

The same set of data applied your looping rules would do this:

1: GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
1: GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC
2: 40 40 40 40 40 40 40 40 40 40 40

So you need to either separate the looping logic out or find a way to make the files match.

Here is an attempt at separating the seeking, but I haven't tested it.

fileIO: {
  while( 1 ){ 
   my $seq; 
   my $qual  = q{};
   while( 1 ){ 
     $seq = <SEQ>; 
     last fileIO if not $seq;  # stop at end of file
     last if $seq !~ /^>/; 
  }
  while( 1 ){ 
     my $qual_in = <PRB>;
     last fileIO if not $qual_in; # stop at end of file 
     last if $qual_in =~ /^>/ and $qual ne q{}; 
     next if $qual_in =~ /^>/ and $qual eq q{}; 
     $qual .= $qual_in;
  }
  print "$seq \n $qual \n";

 }
}

Update

I re-factored the above code into a single function that will read a chunk from an arbitrary file handle as needed, it seems to work as needed. Note of course I experimented here a little with a trick I've been meaning to use for something practical.

use strict;
use warnings;

# 
#  readUntilNext( $fileHandle, \$scalar_ref ); 
#
#  returns 0 when nothing could be read from the fileHandle. 
#  otherwise returns 1; 
#

sub readUntilNext {
    my ($fh)            = shift;
    my ($output)        = shift;
    my ($output_buffer) = '';
    while (1) {
        my $line = <$fh>;
        if ( !$line ) { # No more data
            # No data to flush to user, return false.
            return 0 if $output_buffer eq q{};
            last;  # data to  flush to user, loop exit. 
        }
        if ( $line =~ /^>/ ) {
            # Didn't get anything, keep looking. 
            next if $output_buffer eq q{};
            # Got something, flush data to user. 
            last;
        }
        chomp($line);
        $output_buffer .= $line;
    }
    # Data to flush to user 
    # Write to the scalar-reference 
    $$output .= $output_buffer;
    return 1;
}

open my $m, '<', 'a.txt';
open my $n , '<', 'b.txt';
# Creates 2 scalar references every loop, and only loops as long 
# as both files have data. 
while ( readUntilNext( $m, \my $seq ) && readUntilNext( $n, \my $qual ) ) {
    print "$seq\t$qual\n";
}

And the above code, tested, does exactly what you want to do.

Note on that \my stuff

while( readUntilNext( $m, \my $seq ) ) { 
}

is fundamentally the same as

my $seq; 
while( readUntilNext( $m, \$seq ) ) { 
}

Except for the fact the former creates a new scalar every time, guaranteeing that the same value wont be visible to a sucessive loop;

so it becomes more like:

while( 1 ){ 
 my $seq; 
 last if not readUntilNext($m, \$seq);
 do { 
    # loop body here
 }
}

Kent Fredric 2009-04-10 06:46:15

Answer 2

+3 A:

Here is a solution not using perl, but plain shell commands:

prompt>grep -v '^>[0-9]' file1.fasta > tmp1
prompt>(tr '\012' ' ' < file2.qual; echo) | sed 's/>[0-9]* /\n/g' | sed 1d > tmp2
prompt>paste tmp1 tmp2
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40
GTTAAGTTATATCAAACTAAATATACATACTATAAA    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC    40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4 40 8 3 29 10 19 18 40 19 15 5
prompt>

I searched many years for the paste command (knowing "this is a super basic operation, someone must already have implemented something to solve this problem").

The second command line first translates all newlines to spaces, and the echo command is added to add a final newline to the input (because sed will ignore lines lacking EOL), thereby joining all the input lines into one single line which then the sed command splits up again (portability note: not all sed programs will work with arbitrary line lengths, but GNU sed does).

hlovdal 2009-04-10 15:54:07

Answer 3

+3 A:

You're missing the 2nd (and every subsequent) line of the quality scores and would also miss additional sequence lines. For this and code re-use purposes, the way to handle FASTA sequences is as whole entries/records:

local $/ = "\n>";
while (my $seq = <SEQ>) {
     my $qual = <PRB>;
     chomp($seq);  $seq =~ s/^>*.+\n//;  $seq =~ s/\n//g;
     chomp($qual);  $qual =~ s/^>*.+\n//;  $qual =~ s/\n/ /g;

     print "$seq\t$qual\n";      

}

You could also easily capture the FASTA header in the first replace.

bubaker 2009-04-11 21:26:07

ansaurus

tags:

views:

answers:

How do I merge two FASTA files (one file with line break) in Perl?

Update

related questions