views:

987

answers:

9

Hi all,

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it in an efficient way using only bash commands (I could write a ten-or-so-line Perl script to do that, but it would probably be slower to execute than native bash functions). So the output should look like

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I thought of a solution like this

cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

But it's slow and doesn't seem like the most efficient solution. I've seen a solution for vi in this post, but it's still too slow. Any thoughts/suggestions/brilliant ideas? :-)

A: 

APL :) is the ONLY way to go here

ennuikiller
But you didn't give the one-liner...
Jonathan Leffler
@Jonathan if I remember correctly it would be ⍉A
ennuikiller
+4  A: 

gawk

awk '
{
    # save every field of every line, indexed by (row, column)
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
# remember the widest row seen (this becomes the number of output rows)
NF>p { p = NF }
END {
    # print column j of the input as row j of the output
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file

output

$ more file
0 1 2
3 4 5
6 7 8
9 10 11

$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11

Performance against the Perl solution by Jonathan on a 10,000-line file

$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2

$  wc -l < file
10000

$ time perl test.pl file >/dev/null

real    0m0.480s
user    0m0.442s
sys     0m0.026s

$ time awk -f test.awk file >/dev/null

real    0m0.382s
user    0m0.367s
sys     0m0.011s

$ time perl test.pl file >/dev/null

real    0m0.481s
user    0m0.431s
sys     0m0.022s

$ time awk -f test.awk file >/dev/null

real    0m0.390s
user    0m0.370s
sys     0m0.010s
ghostdog74
And now to handle row and column labels too?
Jonathan Leffler
no requirement for that.
ghostdog74
OK - you're correct; your sample data doesn't match the question's sample data, but your code works fine on the question's sample data and gives the required output (give or take blank vs tab spacing). Mainly my mistake.
Jonathan Leffler
Interesting timings - I agree you see a performance benefit in awk. I was using MacOS X 10.5.8, which does not use 'gawk', and I was using Perl 5.10.1 (32-bit build). I gather that your data was 10,000 lines with 4 columns per line? Anyway, it doesn't matter a great deal; both awk and perl are viable solutions (and the awk solution is neater - the 'defined' checks in my Perl are necessary for warning-free runs under strict/warnings), neither is a slouch, and both are likely to be way faster than the original shell script solution.
Jonathan Leffler
Yes, my data is just the sample repeated until it reaches 10,000 lines.
ghostdog74
On my original 2.2 GB matrix, the perl solution is slightly faster than awk: 350.103s vs. 369.410s. I was using perl 5.8.8 (64-bit).
Thrawn
I am using gawk 3.1.6a, Perl 5.10.0.
ghostdog74
`mawk` should be even faster
Porges
A: 

A hackish Perl solution could look like this. It's nice because it doesn't load the whole file into memory; it writes intermediate temp files and then uses the all-wonderful paste

#!/usr/bin/perl
use warnings;
use strict;

my $counter;
open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
while (my $line = <INPUT>) {
    chomp $line;
    my @array = split ("\t",$line);
    open OUTPUT, ">temp$." or die ("unable to open output file!");
    print OUTPUT join ("\n",@array);
    close OUTPUT;
    $counter=$.;
}
close INPUT;

# paste files together
my $execute = "paste ";
foreach (1..$counter) {
    $execute .= "temp$_ ";    # temp1, temp2, ... tempN, one file per input row
}
$execute.="> $ARGV[1]";
system $execute;
Thrawn
Using paste and temp files is just extra, unnecessary work. You can do the manipulation in memory itself, e.g. with arrays/hashes (see the sketch after this comment thread).
ghostdog74
Yep, but wouldn't that mean keeping everything in memory? The files I'm dealing with are around 2-20gb in size.
Thrawn
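
For inputs small enough to fit in memory, that array-based idea amounts to reading every row, splitting it into fields, and emitting the fields column by column. A minimal sketch (written in Python purely for illustration; the command-line file argument is an assumption):

import sys

# Read the whole tab-separated file and split each line into fields.
with open(sys.argv[1]) as f:
    rows = [line.rstrip('\n').split('\t') for line in f]

# zip(*rows) yields one tuple per original column, i.e. the transpose.
for col in zip(*rows):
    print('\t'.join(col))
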
+1  A: 

The only improvement I can see to your own example is using awk, which will reduce the number of processes that are run and the amount of data that is piped between them:

/bin/rm output 2> /dev/null

cols=`head -n 1 input | wc -w` 
for (( i=1; i <= $cols; i++))
do
  awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done >> output
Simon C
+1  A: 

If you have sc installed, you can do:

psc -r < inputfile | sc -W% - > outputfile
Dennis Williamson
+3  A: 

A Python solution:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

The above is based on the following:

import sys

for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    print(' '.join(c))

This code does assume that every line has the same number of columns (no padding is performed).
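
If the rows could be ragged, one possible variant (a sketch assuming Python 3; itertools.zip_longest is not used in the answer above) pads short rows with empty strings instead of truncating columns:

import sys
from itertools import zip_longest

# Like zip(), but pads shorter rows so no column is silently dropped.
rows = (l.split() for l in sys.stdin if l.strip())
for c in zip_longest(*rows, fillvalue=''):
    print(' '.join(c))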

Stephan202
+2  A: 

Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.

#!/bin/perl -w
#
# SO 1729824

use strict;

my(%data);          # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
    my(@row) = split /\s+/;
    my($colnum) = 0;
    foreach my $val (@row)
    {
        $data{$rownum}{$colnum++} = $val;
    }
    $rownum++;
    $maxcol = $colnum if $colnum > $maxcol;
}

my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
    for (my $row = 0; $row < $maxrow; $row++)
    {
        printf "%s%s", ($row == 0) ? "" : "\t",
                defined $data{$row}{$col} ? $data{$row}{$col} : "";
    }
    print "\n";
}

With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.


Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:

Osiris JL: time gawk -f tr.awk xxx  > /dev/null

real    0m0.367s
user    0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null

real    0m0.138s
user    0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx  > /dev/null

real    0m1.891s
user    0m0.924s
sys 0m0.961s
Osiris-2 JL:

Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.

Jonathan Leffler
On my system, gawk outperforms perl. You can see my results in my edited post.
ghostdog74
Conclusion: different platform, different software versions, different results.
ghostdog74
+3  A: 

Pure BASH, no additional processes. A nice exercise:

declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line ; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
     array[$index]=${line[$COUNTER]}
     ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s\t" ${array[$COUNTER]}
  done
  printf "\n" 
done
fgm
A: 

Hi

I used fgm's solution (thanks fgm!), but needed to eliminate the tab characters at the end of each row, so I modified the script thus:

#!/bin/bash 
declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s" ${array[$COUNTER]}
    if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
    then
        printf "\t"
    fi
  done
  printf "\n" 
done
dtw