tags:

views:

278

answers:

5

So basically my problem can be written in pseudo-code as follows:

split the line by =
using value before =, find the next line
check this the value after = matches previous
if not, then loop till end of file
collect all the values which match and using the line numbers, get the last 2 columns value
sum all the values for a given set with equal key=value pair.

The dataset I have is as follows:

3=5002, 0=10002, 5=1, 4=1, 7=1, 8=1, 9=0, 1=14002, 6=5, 200, 100
3=5002, 0=10002, 5=0, 4=1, 7=0, 8=0, 9=1, 1=14002, 6=5, 300, 10
3=5001, 0=10001, 5=0, 4=0, 7=0, 8=0, 9=0, 1=14001, 6=3, 1000, 80
3=5001, 0=10004, 5=1, 4=1, 7=2, 8=2, 9=1, 1=14001, 6=3, 10000, 1200
3=5003, 0=10004, 5=2, 4=0, 7=2, 8=2, 9=1, 1=14003, 6=8, 5000, 500
3=5003, 0=10004, 5=3, 4=1, 7=2, 8=1, 9=0, 1=14003, 6=8, 1000, 7

What I need to do is, take all values for 3, which are equal and get the summation of the last 2 columns and sum it for that value. Example:

3 = 5002, sum = 500, 110
5 = 0, sum = 1300, 90
8 = 2, sum = 15000, 1700

I have been able to parse the first 3, but am unable to do anything for the rest of the columns :-(

A: 

The way you explained your problem is not very clear. Based on my understanding, this would be my approach:

  • Create a 2d array containing the different comma delimited fields maintaining the row, column structure.

  • Analyze each column and create a hash mapping the each data value to the rows containing it.

IE: For column one you'd have a hash
3=5002 0,1
3=5001 2,3
3=5003 4,5

  • Then you go through each entry of the hash and sum the last member of the rows listed for the different data.

  • Repeat for each column excluding the last.

dborba
+1  A: 

How about splitting on ",". Then you can pull off the last element and pair it with each element from the list. For your first line you would end up with the following pairs:

3=5002, 200
0=10002, 200
5=1, 200
4=1, 200
7=1, 200
8=1, 200
9=0, 200
1=14002, 200
6=5, 200

Add each one of those pairs to a master list. Once you get that you can sort by the first element in the pair and sum.

mamboking
Add a reference to Text::CSV_XS, and this is the perfect response, IMO.
Tanktalus
@Tanktalus There is no need for Text::CSV_XS or Text::XSV in this case.
Sinan Ünür
+2  A: 

Based on my understanding, here are two possible methods. The first one uses composite keys to store values in a single level hash. The second one uses a multi-level hash:

Method 1:

#!/usr/bin/perl

use strict;
use warnings;

use List::Util qw( sum );

my %data;

while ( my $line = <DATA> ) {
    chomp $line;

    my @parts = split /, /, $line;
    last unless @parts;

    my $value = pop @parts;

    push @{ $data{$_} }, $value for @parts;
}

for my $col ( sort keys %data ) {
    printf("%12s:%9d\n", $col, sum @{ $data{$col} } );
}

__DATA__
3=5002, 0=10002, 5=1, 4=1, 7=1, 8=1, 9=0, 1=14002, 6=5, 200
3=5002, 0=10002, 5=0, 4=1, 7=0, 8=0, 9=1, 1=14002, 6=5, 300
3=5001, 0=10001, 5=0, 4=0, 7=0, 8=0, 9=0, 1=14001, 6=3, 1000
3=5001, 0=10004, 5=1, 4=1, 7=2, 8=2, 9=1, 1=14001, 6=3, 10000
3=5003, 0=10004, 5=2, 4=0, 7=2, 8=2, 9=1, 1=14003, 6=8, 5000
3=5003, 0=10004, 5=3, 4=1, 7=2, 8=1, 9=0, 1=14003, 6=8, 1000

C:\Temp> hj
  3=5001:    11000
  3=5002:      500
  3=5003:     6000
 0=10001:     1000
 0=10002:      500
 0=10004:    16000
 1=14001:    11000
 1=14002:      500
 1=14003:     6000
     4=0:     6000
     4=1:    11500
     5=0:     1300
     5=1:    10200
     5=2:     5000
     5=3:     1000
     6=3:    11000
     6=5:      500
     6=8:     6000
     7=0:     1300
     7=1:      200
     7=2:    16000
     8=0:     1300
     8=1:     1200
     8=2:    15000
     9=0:     2200
     9=1:    15300

Method: 2

#!/usr/bin/perl

use strict;
use warnings;

use List::Util qw( sum );

my %data;

while ( my $line = <DATA> ) {
    chomp $line;

    my @parts = split /, /, $line;
    last unless @parts;

    my $value = $parts[-1];

    for ( my $i = 0 ; $i < @parts - 2; ++$i ) {
        my @subparts = split /=/, $parts[$i];
        push @{ $data{$subparts[0]}->{$subparts[1]} }, $value;
    }
}

for my $k1 ( keys %data ) {
    for my $k2 ( keys %{ $data{$k1} } ) {
        printf(
            "%2d:%6d:%9d \n",
            $k1, $k2, sum @{ $data{$k1}->{$k2} }
        );
    }
}

__DATA__
3=5002, 0=10002, 5=1, 4=1, 7=1, 8=1, 9=0, 1=14002, 6=5, 200
3=5002, 0=10002, 5=0, 4=1, 7=0, 8=0, 9=1, 1=14002, 6=5, 300
3=5001, 0=10001, 5=0, 4=0, 7=0, 8=0, 9=0, 1=14001, 6=3, 1000
3=5001, 0=10004, 5=1, 4=1, 7=2, 8=2, 9=1, 1=14001, 6=3, 10000
3=5003, 0=10004, 5=2, 4=0, 7=2, 8=2, 9=1, 1=14003, 6=8, 5000
3=5003, 0=10004, 5=3, 4=1, 7=2, 8=1, 9=0, 1=14003, 6=8, 1000

C:\Temp> hjk
 3:  5003:     6000
 3:  5002:      500
 3:  5001:    11000
 7:     1:      200
 7:     0:     1300
 7:     2:    16000
 9:     1:    15300
 9:     0:     2200
 8:     1:     1200
 8:     0:     1300
 8:     2:    15000
 4:     1:    11500
 4:     0:     6000
 1: 14001:    11000
 1: 14003:     6000
 1: 14002:      500
 0: 10001:     1000
 0: 10004:    16000
 0: 10002:      500
 5:     1:    10200
 5:     3:     1000
 5:     0:     1300
 5:     2:     5000

NB: Add sort to taste.

Sinan Ünür
Just wondering why you are using the shebang `#!/usr/bin/perl` under Windows, Sinan.
Alan Haggai Alavi
@Alan because Windows does not pay attention to the interpreter path but I can still specify flags on the shebang line. By keeping it a relatively standard `/usr/bin/perl`, I can run the same script in cmd.exe using ActiveState or Strawberry perl or Cygwin perl in the Bash shell or in ArchLinux on those occasions when I dual boot. Finally, I can upload and test the script to any other *nix machine I have access to without having to make changes.
Sinan Ünür
@Sinan, I see. Nice idea. Using Arch Linux too. :-)
Alan Haggai Alavi
@Alan I must say, I am very fond of Arch. Everything just works. No fuss.
Sinan Ünür
A: 

I hope this is what you are looking for:

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS;

my %data;
my $csv = Text::CSV_XS->new();
while ( <DATA> ) {
    $csv->parse($_);
    my @fields = $csv->fields();
    $fields[0] =~ s/^3=//;
    $data{ $fields[0] } += $fields[9];
}

use Data::Dumper;
print Dumper \%data;

__DATA__
3=5002, 0=10002, 5=1, 4=1, 7=1, 8=1, 9=0, 1=14002, 6=5, 200
3=5002, 0=10002, 5=0, 4=1, 7=0, 8=0, 9=1, 1=14002, 6=5, 300
3=5001, 0=10001, 5=0, 4=0, 7=0, 8=0, 9=0, 1=14001, 6=3, 1000
3=5001, 0=10004, 5=1, 4=1, 7=2, 8=2, 9=1, 1=14001, 6=3, 10000
3=5003, 0=10004, 5=2, 4=0, 7=2, 8=2, 9=1, 1=14003, 6=8, 5000
3=5003, 0=10004, 5=3, 4=1, 7=2, 8=1, 9=0, 1=14003, 6=8, 1000
Alan Haggai Alavi
A: 

Well, it seems that everybody is trying to understand what you really want. I don't understand it, but it seems that you only want to capture the sum of all lines that contain a given key=value pair. Except, you don't actually care for the key.

Or something like that.

So, my question would be: can you provide the expected output for the example set of data?

Anyways, here's my try (the '#/' comments are just to help the syntax highlighter.)

#!/usr/bin/perl
use strict;
use warnings;
my %h;
my @ord_keys;
while (<DATA>) {
    chomp;
    my @cols = split /,\s*/; #/
    my $val = pop @cols;

    foreach my $k (@cols) {
        if (exists($h{$k})) {
            $h{$k} += $val;
        } else {
            push @ord_keys, $k;
            $h{$k} = $val;
        }
    }
}

foreach my $key (@ord_keys) {
    my ($k, $v) = split /=/, $key; #/
    print "$k = $v, sum = $h{$key}\n";
}

__DATA__
3=5002, 0=10002, 5=1, 4=1, 7=1, 8=1, 9=0, 1=14002, 6=5, 200
3=5002, 0=10002, 5=0, 4=1, 7=0, 8=0, 9=1, 1=14002, 6=5, 300
3=5001, 0=10001, 5=0, 4=0, 7=0, 8=0, 9=0, 1=14001, 6=3, 1000
3=5001, 0=10004, 5=1, 4=1, 7=2, 8=2, 9=1, 1=14001, 6=3, 10000
3=5003, 0=10004, 5=2, 4=0, 7=2, 8=2, 9=1, 1=14003, 6=8, 5000
3=5003, 0=10004, 5=3, 4=1, 7=2, 8=1, 9=0, 1=14003, 6=8, 1000

And the results:

3 = 5002, sum = 500
0 = 10002, sum = 500
5 = 1, sum = 10200
4 = 1, sum = 11500
7 = 1, sum = 200
8 = 1, sum = 1200
9 = 0, sum = 2200
1 = 14002, sum = 500
6 = 5, sum = 500
5 = 0, sum = 1300
7 = 0, sum = 1300
8 = 0, sum = 1300
9 = 1, sum = 15300
3 = 5001, sum = 11000
0 = 10001, sum = 1000
4 = 0, sum = 6000
1 = 14001, sum = 11000
6 = 3, sum = 11000
0 = 10004, sum = 16000
7 = 2, sum = 16000
8 = 2, sum = 15000
3 = 5003, sum = 6000
5 = 2, sum = 5000
1 = 14003, sum = 6000
6 = 8, sum = 6000
5 = 3, sum = 1000

Comments welcome.

Leonardo Herrera