ansaurus

Question

Need to compare values in a file where 1st column repeats

Answer 1

A:

Does this do what you are looking for?

open (IN, "<", "POS2") || die "Cannot open POS2: $!\n";
my %data;

# Read data line by line
while (<IN>)
{
    chomp;
    my @fields = split /\t/;

    # Note $fields[0] is the name by which we want to group.
    if (defined $data{$fields[0]})
    {
        # If there is already an entry for this name, update it
        $data{$fields[0]} = [
            $fields[1],
            $data{$fields[0]}[1] < $fields[2] ? $data{$fields[0]}[1] : $fields[2],
            $data{$fields[0]}[2] > $fields[3] ? $data{$fields[0]}[2] : $fields[3]
        ];
    }
    else
    {
        # Otherwise, create a new one
        $data{$fields[0]} = [ $fields[1], $fields[2], $fields[3] ];
    }
}
close (IN);

# Output one row for each group
foreach my $name (keys %data)
{
    my ($stuff, $min, $max) = @{$data{$name}};
    print "$name\t$stuff\t$min\t$max\n";
}

I tried this and it outputs this:

jgi|Xentr4|100173|gw1.779.90.1  scaffold_779    101746  107233
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   3649    59409

Is that what you wanted?

Timwi 2010-08-18 01:06:00

Both of those worked for me, thank you guys so much. I had tried the hash strategy before but was getting bogged down in syntax, so these scripts will also help me learn how to set things up better.

Adam 2010-08-18 23:15:48

Answer 2

+1 A:

Here's one way to do it. Just supply the input file name as a command-line argument. The <> operator will open the file and supply the lines to your script.

use strict;
use warnings;

my %h;

while (my $line = <>){
    chomp $line;
    my ($k, $scaff, $mn, $mx) = split /\t/, $line;

    $h{$k} = { min => 9e99, max => -9e99 } unless exists $h{$k};

    $h{$k}{min} = $mn if $mn < $h{$k}{min};
    $h{$k}{max} = $mx if $mx > $h{$k}{max};
}

for my $k (sort keys %h){
    print join("\t", $k, $h{$k}{min}, $h{$k}{max}), "\n";
}

I use a hash-of-hashes to store the min and max information, because it makes the code more declarative and because it's flexible. For example, suppose you decide that the output needs to preserve the order of the first appearance of any name from column 1. Just add another element to the hash-of-hashes structure to keep track of input line number whenever a name first appears:

$h{$k} = { min => 9e99, max => -9e99, line_n => $. } unless exists $h{$k};

Then use that new piece of info when sorting the output:

for my $k (sort { $h{$a}{line_n} <=> $h{$b}{line_n} } keys %h){
    # Same as above.
}
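
Putting those two pieces together might look like the sketch below. This is not part of the answer above, just one way it could be assembled: the subroutine name summarize and the sample rows are made up for illustration, and it takes a list of lines instead of reading from <> so that it runs on its own. It also differs in two small ways: it seeds min and max from the first row seen rather than from the 9e99 sentinels, and it carries the scaffold column through to the output.

```perl
use strict;
use warnings;

# Hypothetical helper: takes tab-separated lines (name, scaffold, start, end)
# and returns one summary row per name, in order of first appearance.
sub summarize {
    my @lines = @_;
    my %h;
    my $n = 0;

    for my $line (@lines) {
        chomp $line;
        $n++;
        my ($k, $scaff, $mn, $mx) = split /\t/, $line;

        # First time this name is seen: record scaffold, values and position.
        $h{$k} //= { scaff => $scaff, min => $mn, max => $mx, line_n => $n };

        $h{$k}{min} = $mn if $mn < $h{$k}{min};
        $h{$k}{max} = $mx if $mx > $h{$k}{max};
    }

    # Sort names by the line on which they first appeared.
    return map  { join("\t", $_, @{ $h{$_} }{qw(scaff min max)}) }
           sort { $h{$a}{line_n} <=> $h{$b}{line_n} } keys %h;
}

# Made-up sample data, not taken from the question's file.
my @sample = (
    "geneB\tscaffold_2\t500\t900\n",
    "geneA\tscaffold_1\t100\t300\n",
    "geneB\tscaffold_2\t400\t700\n",
);

print "$_\n" for summarize(@sample);
```

With the sample rows, geneB is printed before geneA because it appears first in the input, and its range is widened to cover both of its rows.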

FM 2010-08-18 11:21:05

Answer 3

A:

You can do the following to keep only the first line for each value in column 1:

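use strict;
use warnings;
use FileHandle;

my $file = FileHandle->new("input_file", "r") or die "Cannot open input_file: $!\n";
my @array = <$file>;
close $file;

my (%seen, @newarray);

foreach (@array){
    # Column 1 is the key; \s already matches tabs.
    my ($col1) = split(/\s+/, $_);
    # Keep only the first line seen for each column-1 value.
    push(@newarray, $_) unless $seen{$col1}++;
}
print @newarray;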

Divya Saxena 2010-08-20 11:02:27
