views:

306

answers:

7

System OSX or Linux

I'm trying to automate my work flow at work, each week I receive an excel file, which I convert to a csv.

An example is:

,,L1,,,L2,,,L3,,,L4,,,L5,,,L6,,,L7,,,L8,,,L9,,,L10,,,L11,
Title,r/t,needed,actual,Inst,needed,actual,Inst,needed,actual,Inst,needed,actual,Inst,neede d,actual,Inst,needed,actual,Inst,needed,actual,Inst,needed,actual,Inst,needed,actual,Inst,needed,actual,Inst,needed,actual,Inst
EXAMPLEfoo,60,6,6,6,0,0,0,0,0,0,6,6,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
EXAMPLEbar,30,6,6,12,6,7,14,6,6,12,6,6,12,6,8,16,6,7,14,6,7.5,15,6,6,12,6,8,16,6,0,0,6,7,14
EXAMPLE1,60,3,3,3,3,5,5,3,4,4,3,3,3,3,6,6,3,4,4,3,3,3,3,4,4,3,8,8,3,0,0,3,4,4
EXAMPLE2,120,6,6,3,0,0,0,6,8,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
EXAMPLE3,60,6,6,6,6,8,8,6,6,6,6,6,6,0,0,0,0,0,0,6,8,8,6,6,6,0,0,0,0,0,0,0,10,10
EXAMPLE4,30,6,6,12,6,7,14,6,6,12,6,6,12,3,5.5,11,6,7.5,15,6,6,12,6,0,0,6,9,18,6,0,0,6,6.5,13

And so you can get a picture of how it looks in excel: alt text

What I need to do, is create multiple csv files for each instance in row 1, so L1, L2, L3, L4...

And within that each csv file it needs to contain the title, r/t, needed

So for L1 an example out put would look like:

EXAMPLEfoo,60,6
EXAMPLEbar,30,6
EXAMPLE1,60,3
EXAMPLE2,120,6
EXAMPLE3,60,6
EXAMPLE4,30,6

And for L2:

EXAMPLEfoo,60,0
EXAMPLEbar,30,6
EXAMPLE1,60,3
EXAMPLE2,120,0
EXAMPLE3,60,6
EXAMPLE4,30,6

And so on.

I have tried playing around with sed and awk and hit google but I have found nothing that really solves the issue.

I'd imagine perl would be particular suited to this or maybe python, so I would be more than happy to accept suggestions from users.

So, any suggestions?

Thanks in advance.

A: 

Hi, is the number of entries in the first line constant? Is L1 the only one to miss the Inst column?

Questions regarding the question should be posted as comments, not as answers.
Ofri Raviv
This should be a comment instead of answer.
Space
+2  A: 

try this

#!/bin/bash
awk 'BEGIN{ OFS=FS="," }
NR==1{
 for(i=1;i<=NF;i++){
   if($i){ f[i]=$i }
 }
}
NR>2{ for(o in f){ print $1,$2, $o > "file_"f[o]".csv" } } ' file

output

$ cat file_L1.csv
EXAMPLEfoo,60,6
EXAMPLEbar,30,6
EXAMPLE1,60,3
EXAMPLE2,120,6
EXAMPLE3,60,6
EXAMPLE4,30,6

$ cat file_L2.csv
EXAMPLEfoo,60,0
EXAMPLEbar,30,6
EXAMPLE1,60,3
EXAMPLE2,120,0
EXAMPLE3,60,6
EXAMPLE4,30,6
ghostdog74
Thanks for the reply, I think it could just be me, as none of the examples done like this, yours included gives me an error along the lines of awk: syntax error at source line 7 context is NR>2{ for(o in f){ print $1,$2, $o > >>> "file_"f <<< [o]".csv" } }
S1syphus
i am not sure why. You may try a dos2unix (or unix2dos) to convert your newlines of your script to that suitable for OSX.
ghostdog74
A: 

In Python, slightly hacky and untested, but should do the job:

import csv
r = csv.reader(open(r'file.csv'), dialect='excel')
topline = r.next()
headerline = r.next()

lastcell = ''
for i, cell in enumerate(topline): #Copy cells forwards in the top line, so L1 for example goes across all cells
    if cell == '':
        topline[i] = lastcell
    else:
        lastcell = cell

for i in range(len(headerline)): #Copy the topline cells into the header line, so the headerline cells should be unique
    headerline[i] = '-'.join((topline[i], headerline[i]))

rows = [dict(zip(headerline, line)) for line in r]

# Rows should now consist of dicts of the form {'Title': 'EXAMPLEfoo', 'r/t': '60', 'L1-needed': '6' ...}

for lval in frozenset(topline): #Use frozenset to ensure we only have unique values.
    if lval != '': #Make sure we don't look at the blank value
        w = csv.writer(open(r'%s.csv' % lval, 'w'), dialect='excel')
        for row in rows:
            line = [row['Title'], row['r/t'], row['-'.join((lval, 'needed'))]]
            w.writerow(line)
me_and
Your "for cell in topline:" loop doesn't actually do anything; assigning to "cell" doesn't change the elements of "topline". You need to replace that with `for i, cell in enumerate(topline): if cell == '': topline[i] = lastcell` and so forth.And why are you using frozenset in the final loop, instead of just set?
Peter Milley
@Peter: Good spot; I always forget that chink in Python list enumeration. I'll update momentarily. I'm using frozenset because it seemed like a good idea at the time… AFAIK it makes no difference in this scenario.
me_and
+2  A: 

Perl "one-liner"

perl -MText::CSV_XS -e'$c=Text::CSV_XS->new({binary=>1,eol=>"\n"});%a=map{$i++;/^L\d+$/?($_=>$i):()}@{$c->getline(*ARGV)};open$b{$_},">$_"for keys%a;while($f=$c->getline(*ARGV)){$c->print($b{$_},[@$f[0,1,$a{$_}]])for keys%a}'

For ones which have problem with reading:

$ echo '$c=Te...' | perltidy
$c = Text::CSV_XS->new( { binary => 1, eol => "\n" } );
%a = map { $i++; /^L\d+$/ ? ( $_ => $i ) : () } @{ $c->getline(*ARGV) };
open $b{$_}, ">$_" for keys %a;
while ( $f = $c->getline(*ARGV) ) {
    $c->print( $b{$_}, [ @$f[ 0, 1, $a{$_} ] ] )
      for keys %a;
}
Hynek -Pichi- Vychodil
-1 for unreadability.
Seriously! tl;dr!
Dennis Williamson
+2  A: 

Using only AWK:

awk -F, -vOFS=, -vc=1 '
    NR == 1 {
        for (i=1; i<NF; i++) {
            if ($i != "") {
                g[c]=i;
                f[c++]=$i
            }
        }
    }
    NR>2 {
        for (i=1; i < c; i++) {
            print $1,$2, $g[i] > "output_"f[i]".csv"
        }
    }' data.csv

As a one-liner:

awk -F, -vOFS=, -vc=1 'NR == 1 {for (i=1; i<NF; i++) {if ($i != "") {g[c]=i; f[c++]=$i}}} NR>2 { for (i=1; i < c; i++) {print $1,$2, $g[i] > "file_"f[i]".csv" }}' data.csv

Example output:

$ cat file_L1.csv
EXAMPLEfoo,60,6
EXAMPLEbar,30,6
EXAMPLE1,60,3
EXAMPLE2,120,6
EXAMPLE3,60,6
EXAMPLE4,30,6
$ cat file_L2.csv
EXAMPLEfoo,60,0
EXAMPLEbar,30,6
EXAMPLE1,60,3
EXAMPLE2,120,0
EXAMPLE3,60,6
EXAMPLE4,30,6
$ cat file_L11.csv
EXAMPLEfoo,60,0
EXAMPLEbar,30,6
EXAMPLE1,60,3
EXAMPLE2,120,0
EXAMPLE3,60,0
EXAMPLE4,30,6
Dennis Williamson
i don't get the output as stated by OP.
ghostdog74
Oops, I forgot to take out a test print.
Dennis Williamson
The `NR == 1` loop builds an array of the positions and contents of the cells that contain `L1`, etc. the `NR>2` loop moves horizontally through each record outputting the correct data to the correct file. On my system `>` seems to append, perhaps the redirection should be changed to `>>`. Your script goes through the whole file 11 times. Mine does it only once.
Dennis Williamson
Thanks for the reply, I'll get through to everybody eventually I hope, but Dennis, yours looked most promising so I'll start here.What version of awk did you run this on?$ awk -F, -vOFS=, -vc=1 'NR == 1 {for (i=1; i<NF; i++) {if ($i != "") {g[c]=i; f[c++]=$i}}} NR>2 { for (i=1; i < c; i++) {print $1,$2, $g[i] > "file_"f[i]".csv" }}' data.csvawk: invalid -v optionI'm going to try upgrade from default OSx install which is awk version 20070501, I'll post back my results.
S1syphus
oh, you have edited. then that's all right.
ghostdog74
@S1syphus: Try changing it to `-v OFS=","` (with a space and quotes). I'm using GNU AWK (gawk) 3.1.6.
Dennis Williamson
Still no dice, could be because I am using the bell labs source, um I'll compile GNU and get back.
S1syphus
My edits didn't change the way it works, it just took out an extra output field (and added example output). Your edited answer is an improvement over mine and quite an improvement over your original. Well done! (But I'm disappointed there's no `{...}1`. ;-) By the way, where is that construct documented?)
Dennis Williamson
@S1syphus: I don't have a Mac handy, but the [man page for AWK](http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man1/awk.1.html) says that there's a `-v` option. You can always set the output field separator in a `BEGIN` clause like in **ghostdog74's** answer.
Dennis Williamson
Well I have to leave work in a minute, so Ill test it on my Linux box at home, see how it works out there, thanks for the help.
S1syphus
@dennis, its just a "true" condition, so at default, it prints the entire line. i don't think its documented (AFAIK, i may be wrong)
ghostdog74
@S1syphus: You say you're "using the bell labs source", but you also say "default OSx install" which seems contradictory. What happens if you run it with `/full/path/to/official/osx/awk ...` (Also, what does `type -a {,g,n}awk` show?)
Dennis Williamson
I used the awk installed by OSX at first, then on my second attempted upgraded to the bell source, now trying with GNU Awk.type -a {,g,n}awkawk is /usr/bin/awkawk is /usr/local/bin/awkgawk is /usr/local/bin/gawkTried with each, and it returns an unterminated string, http://pastebin.com/bRufiFs9. When I edit the file and terminate it, it runs but no files are created.
S1syphus
A: 

Have a look at perl module Text::CSV_XS - comma-separated values manipulation routines. I found this module very helpful while manipulating with CSV files.

Space
+1  A: 
use strict;
use warnings;

use Text::CSV;
my $csv = Text::CSV->new;

sub parse_line {
    $csv->parse(shift) or die $!;
    return $csv->fields;
}

my @metadata;
my @files  = parse_line(scalar <>);
my @header = parse_line(scalar <>); # Ignore.
for my $i (0 .. $#files){
    next unless length $files[$i];
    open(my $h, '>', "$files[$i].csv") or die $!;
    push @metadata, {column => $i, handle => $h};
}

while (my $line = <>){
    my @fields = parse_line($line);
    for my $m (@metadata){
        $csv->print($m->{handle}, [ @fields[0, 1, $m->{column}] ]);
        print {$m->{handle}} "\n";
    }
}
FM