I have a text file laid out like this:

1   a, b, c
2   c, b, c
2.5 a, c

I would like to reverse the keys (the numbers) and the values (the comma-separated lists), which are separated from each other by a tab character, to produce this:

a   1, 2.5
b   1, 2
c   1, 2, 2.5

(Notice how 2 isn't duplicated for c.)

I do not need this exact output format. The numbers in the input are ordered, while the values are not. In the output, both the keys and the values must be ordered.

How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.

+2  A: 

I would try Perl if it's available to you. Loop through the input a row at a time. Split each line on the tab, then split the right-hand part on the commas. Shove the values into a hash with the letters as keys, each value being another hash. The second hash plays the part of a set, eliminating duplicates.

Once you have read the whole input file, sort the keys of the hash, then loop through and spit out the results.
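
A minimal sketch of that approach (untested; it assumes the tab-and-comma layout from the question, sorts the keys case-insensitively, and sorts the numbers numerically):

use strict;
use warnings;

my %set;    # letter => { number => 1 }; the inner hash acts as a set

while (<>) {
    chomp;
    my ($num, $csv) = split /\t/, $_, 2;
    $set{$_}{$num} = 1 for split /\s*,\s*/, $csv;
}

for my $key (sort { lc($a) cmp lc($b) } keys %set) {
    print $key, "\t", join(", ", sort { $a <=> $b } keys %{ $set{$key} }), "\n";
}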

zimbu668
I have never coded in Perl. Can you provide an example to work off of?
strager
+1  A: 

Here's a small utility in PHP:

// load and parse the input file
$data = file("path/to/file");
foreach ($data as $line) {
    list($num, $values) = explode("\t", $line);
    $newData[$num] = explode(", ", trim($values));
}
unset($data);

// reverse the index/value association
foreach ($newData as $index => $values) {
    asort($values);
    foreach ($values as $value) {
        if (!isset($data[$value]))
            $data[$value] = array();
        if (!in_array($index, $data[$value]))
            array_push($data[$value], $index);
    }
}

// print out the result
foreach ($data as $index => $values) {
    echo "$index\t" . implode(", ", $values) . "\n";
}

Not really optimized or good-looking, but it works...

jcinacio
Works, except the resulting keys aren't sorted. Sorting should be trivial, though.
strager
+2  A: 

If you have Python (if you're on Linux you probably already do), I'd use a short Python script to do this. Note that we use sets to filter out duplicate items.

Edited to be closer to requester's requirements:

import csv
from decimal import Decimal

csv_reader = csv.reader(open('test.csv'), delimiter='\t')

maindict = {}
for row in csv_reader:
    value = row[0]
    # the second tab-separated field holds the comma-separated values
    for key in (k.strip() for k in row[1].split(',')):
        # one set per key filters out duplicates; Decimal sorts
        # numerically without adding .0 to the integers
        maindict.setdefault(key, set()).add(Decimal(value))

csv_writer = csv.writer(open('out.csv', 'w'), delimiter='\t')

# sort the keys case-insensitively
sorted_keys = [x[1] for x in sorted([(x.lower(), x) for x in maindict.keys()])]
for key in sorted_keys:
    csv_writer.writerow([key] + sorted(maindict[key]))
ChristopheD
Very good! One bug, though: the values aren't ordered numerically, and I want them to be. For example, I get this: 13 18 2 3 35 39 6.
strager
Change the last line to: csv_writer.writerow([key] + sorted([float(x) for x in (maindict[key])])). That should do the trick ;)
ChristopheD
I receive this error with the update: ValueError: invalid literal for float(): .
strager
Are you sure these are all numbers (and not some stray characters in there somewhere)?
ChristopheD
All of the first column are floats (integers, or an integer plus .5). Also, before the exception was thrown, the output added .0 to items which were integers, which isn't what I want.
strager
Another note: the keyword sort (the keywords are on the left in the output) should be case-insensitive.
strager
Try this: http://paste.pocoo.org/show/109142/
ChristopheD
Works great. Update your post with it and I'll give you the check. =]
strager
Actually, the left column still isn't sorted in a case-insensitive manner.
strager
The edited version above should be case-insensitive; I forgot about that.
ChristopheD
A: 

Here's an awk(1) and sort(1) answer:

Your data is basically a many-to-many data set, so the first step is to normalise it to one key and one value per line. We also swap the keys and values to indicate the new primary field, though this isn't strictly necessary, as the later stages do not depend on the order. We use a tab or [spaces],[spaces] as the field separator, so we split on the tab between the key and the values, and on the commas between the values. This leaves spaces embedded in a value intact, but trims them from before and after it:

awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'

Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab character as the separator (-t $'\t'). If you are using a Bourne/POSIX shell, you will need to use '[tab]' instead, where [tab] is a literal tab:

sort -t $'\t' -u -k 1f,1 -k 2n

Then, put it back in the form you want:

awk -F '\t' '{ 
    if (key != $1) { 
        if (key) printf "\n";
        key=$1;
        printf "%s\t%s", $1, $2
    } else {
        printf ", %s", $2
    }
  }
  END {printf "\n"}'

Pipe them all together and you should get your desired output; the combined pipeline is shown below. I tested with the GNU tools.
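
For example, assuming the input lives in a hypothetical file named input.txt, the three stages combine like this (bash, GNU tools):

awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }' input.txt |
  sort -t $'\t' -u -k 1f,1 -k 2n |
  awk -F '\t' '{
    if (key != $1) {
      if (key) printf "\n";
      key=$1;
      printf "%s\t%s", $1, $2
    } else {
      printf ", %s", $2
    }
  }
  END {printf "\n"}'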

camh
I will try this out tomorrow. The "convert the keys to lowercase" bit -- is that necessary? I want a case-insensitive compare/sort but still want to maintain the casing in the final result.
strager
If the keys are distinct, but should still be sorted case-insensitively, change the sort line to "sort -u -k 1f,1 -k 2n" and remove the tolower call in the first awk invocation.
camh
I made that edit to my answer.
camh
This almost works. I have spaces in some of my values, and this ignores them.
strager
You have spaces in all your values by the looks of it, but not all are significant. Is it only spaces after commas that should be ignored? Or spaces before commas too? It's all easily fixable once the rules are known; it's just the awk field separator.
camh
I've updated the answer to handle spaces within a value, but strip them from around the value. It properly handles the tab between keys and values, whereas before it just looked for whitespace.
camh
A: 
# use Modern::Perl;   # this would replace the next three lines
use strict;
use warnings;
use feature qw'say';


our %data;

while(<>){
  chomp;
  my($number,$csv) = split /\t/;
  my @csv = split m"\s*,\s*", $csv;
  push @{$data{$_}}, $number for @csv;
}

for my $number (sort { lc($a) cmp lc($b) } keys %data){
  # dedupe through an anonymous hash, then sort the values numerically
  my @unique = sort { $a <=> $b } keys %{{ map { ($_,undef) } @{$data{$number}} }};
  say $number, "\t", join ', ', @unique;
}
Brad Gilbert
A: 

Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:

use strict;
use warnings;
use Text::CSV;

my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });

open my $file, "<", "file/to/read.txt" or die "Can't open file: $!";

while(<$file>) {
  chomp;
  my ($first, $rest) = split /\t/, $_, 2;
  my @values;

  if($csv->parse($rest)) {
    @values = $csv->fields()
  } else {
    warn "Error: invalid CSV: $rest";
    next;
  }

  foreach(@values) {
    push @{ $hash{$_} }, $first;
  }
}

# this can be shortened; sort() defaults to string comparison (cmp),
# so the explicit comparators just make the intent obvious
foreach(sort { $a cmp $b } keys %hash) {
  print "$_\t", join(", ", sort { $a <=> $b } @{ $hash{$_} }), "\n";
}

Note that it prints to standard output; I recommend just redirecting that to a file. If you expand this program at all, make sure to use warn() to print any errors rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.

Chris Lutz