So my data sample is in the following format.

jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   19856   19974
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   21455   21638
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   21727   21897
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   21980   22063
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   24670   24811
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   34741   34902
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   3649    3836
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   59253   59409
jgi|Xentr4|100173|gw1.779.90.1  scaffold_779    101746  101969
jgi|Xentr4|100173|gw1.779.90.1  scaffold_779    106436  107233

What I am attempting to do is, for each unique name in the first column, retrieve the minimum value of column 3 and the maximum value of column 4. The final output will look the same, a tab-delimited file, except that it will have the first two columns for each unique name, followed by the min and max values mentioned above as the third and fourth columns. I'm fairly new to programming and attempted to do this using hashes but failed miserably. I am now trying with arrays and regular expressions, as seen below.

open (IN, "POS2") || die "nope\n";
my $prev_qn = super;
my $prev_sn = ultra;
my $prev_start = non;
my $prev_end = nono;
while (<IN>) {
    chomp;
    push (@list, "$_");
}
close (IN);
foreach $v (@list) {
    $info = $v;
    ($query_name, $scaf_num, $start, $end) = split(/\t/, $info);
    unless ($info =~ m/^$prev_qn/) {
        push @ready, $info;
        $prev_qn = $query_name;
        $prev_sn = $scaf_num;
        $prev_start = $start;
        $prev_end = $end;
    }
    else {
        if ($start < $prev_start) {
            splice(@ready,2,1,$start);
        }
        if ($end > $prev_end) {
            splice(@ready,3,1,$end);
        }
        $prev_qn = $query_name;
        $prev_sn = $scaf_num;
        $prev_start = $start;
        $prev_end = $end;
    }

    foreach $z (@ready) {
        print "$z\n";
    }
}

the output this returns is below.

jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
21897
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
22063
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
24811
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
34902
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
3649
34902
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
3649
59409
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
3649
101969

So it seems clear that the script is doing the comparison fine, but it is not replacing the elements in the array as expected; it simply appends them beneath and replaces those. Additionally, it never prints past the first unique name. Any suggestions?

A: 

Does this do what you are looking for?

open (IN, "POS2") || die "nope\n";
my %data;

# Read data line by line
while (<IN>)
{
    chomp;
    my @fields = split /\t/;

    # Note $fields[0] is the name by which we want to group.
    if (defined $data{$fields[0]})
    {
        # If there is already an entry for this name, update it
        $data{$fields[0]} = [
            $fields[1],
            $data{$fields[0]}[1] < $fields[2] ? $data{$fields[0]}[1] : $fields[2],
            $data{$fields[0]}[2] > $fields[3] ? $data{$fields[0]}[2] : $fields[3]
        ];
    }
    else
    {
        # Otherwise, create a new one
        $data{$fields[0]} = [ $fields[1], $fields[2], $fields[3] ];
    }
}
close (IN);

# Output one row for each group
foreach my $name (keys %data)
{
    my ($stuff, $min, $max) = @{$data{$name}};
    print "$name\t$stuff\t$min\t$max\n";
}

I tried this and it outputs this:

jgi|Xentr4|100173|gw1.779.90.1  scaffold_779    101746  107233
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   3649    59409

Is that what you wanted?

Timwi
both of those worked for me, thank you guys so much, i had tried the hash strategy before but was getting bogged down in syntax, so these scripts will also help me learn how to set things up better
Adam
A: 

Here's one way to do it. Just supply the input file name as a command-line argument. The <> operator will open the file and supply the lines to your script.

use strict;
use warnings;

my %h;

while (my $line = <>){
    chomp $line;
    my ($k, $scaff, $mn, $mx) = split /\t/, $line;

    $h{$k} = { min => 9e99, max => -9e99 } unless exists $h{$k};
    $h{$k}{scaff} = $scaff;    # keep column 2 so it can be printed

    $h{$k}{min} = $mn if $mn < $h{$k}{min};
    $h{$k}{max} = $mx if $mx > $h{$k}{max};
}

for my $k (sort keys %h){
    print join("\t", $k, $h{$k}{scaff}, $h{$k}{min}, $h{$k}{max}), "\n";
}

I use a hash-of-hashes to store the min and max information, because it makes the code more declarative and because it's flexible. For example, suppose you decide that the output needs to preserve the order of the first appearance of any name from column 1. Just add another element to the hash-of-hashes structure to keep track of input line number whenever a name first appears:

$h{$k} = { min => 9e99, max => -9e99, line_n => $. } unless exists $h{$k};

Then use that new piece of info when sorting the output:

for my $k (sort { $h{$a}{line_n} <=> $h{$b}{line_n} } keys %h){
    # Same as above.
}
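
Put together, the order-preserving variant could look something like the sketch below. It is only an illustration: the collapse_ranges helper is a hypothetical name, it takes the lines as a list rather than reading them with <>, and so it uses its own counter in place of $. to record where each name first appeared. The sample rows are taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): returns one
# [name, scaffold, min, max] row per name, in order of first appearance.
sub collapse_ranges {
    my @lines = @_;
    my %h;
    my $n = 0;    # stands in for $. because the lines come from a list
    for my $line (@lines) {
        $n++;
        my ($k, $scaff, $mn, $mx) = split /\t/, $line;
        $h{$k} = { scaff => $scaff, min => $mn, max => $mx, line_n => $n }
            unless exists $h{$k};
        $h{$k}{min} = $mn if $mn < $h{$k}{min};
        $h{$k}{max} = $mx if $mx > $h{$k}{max};
    }
    my @rows;
    # Sort names by the line on which each was first seen
    for my $k (sort { $h{$a}{line_n} <=> $h{$b}{line_n} } keys %h) {
        push @rows, [ $k, $h{$k}{scaff}, $h{$k}{min}, $h{$k}{max} ];
    }
    return @rows;
}

my @rows = collapse_ranges(
    "jgi|Xentr4|100164|gw1.1441.2.1\tscaffold_1441\t18150\t18354",
    "jgi|Xentr4|100164|gw1.1441.2.1\tscaffold_1441\t3649\t3836",
    "jgi|Xentr4|100173|gw1.779.90.1\tscaffold_779\t101746\t101969",
    "jgi|Xentr4|100164|gw1.1441.2.1\tscaffold_1441\t59253\t59409",
    "jgi|Xentr4|100173|gw1.779.90.1\tscaffold_779\t106436\t107233",
);
print join("\t", @$_), "\n" for @rows;
```

With those five rows, the 100164 name should come out first with 3649 and 59409, followed by the 100173 name with 101746 and 107233. Seeding min and max from the first row seen is a small variation on the 9e99 sentinels; either approach should behave the same for data like this.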
FM
A: 

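You can do the following:

use strict;
use warnings;
use FileHandle;

my $file = FileHandle->new("input_file", "r")
    or die "Cannot open input_file: $!\n";
my @array = <$file>;
close $file;

my (%seen, @names);

foreach (@array) {
    chomp;
    next unless /\S/;    # skip blank lines
    my ($col1, $col2, $col3, $col4) = split /\s+/, $_;
    # First time we see a name: remember its order and start its range
    unless ($seen{$col1}) {
        push @names, $col1;
        $seen{$col1} = [$col2, $col3, $col4];
        next;
    }
    # Otherwise widen the stored range if this line extends it
    $seen{$col1}[1] = $col3 if $col3 < $seen{$col1}[1];
    $seen{$col1}[2] = $col4 if $col4 > $seen{$col1}[2];
}
print join("\t", $_, @{ $seen{$_} }), "\n" for @names;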
Divya Saxena