views:

987

answers:

9

Hi all,

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it in an efficient way using only bash commands (I could write a ten-or-so-line Perl script to do that, but it would probably be slower to execute than native bash functions). So the output should look like

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I thought of a solution like this

cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

But it's slow and doesn't seem like the most efficient solution. I've seen a solution for vi in this post, but it's still too slow. Any thoughts/suggestions/brilliant ideas? :-)

A: 

APL :) is the ONLY way to go here

ennuikiller
But you didn't give the one-liner...
Jonathan Leffler
@Jonathan if I remember correctly it would be ⍉A
ennuikiller
+4  A: 

gawk

awk '
{
    # save every field of every line, indexed by (row, column)
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
# remember the widest row seen (this becomes the number of output rows)
NF>p { p = NF }
END {
    # print column j of the input as row j of the output
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file

output

$ more file
0 1 2
3 4 5
6 7 8
9 10 11

$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11

Performance against the Perl solution by Jonathan on a 10,000-line file

$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2

$  wc -l < file
10000

$ time perl test.pl file >/dev/null

real    0m0.480s
user    0m0.442s
sys     0m0.026s

$ time awk -f test.awk file >/dev/null

real    0m0.382s
user    0m0.367s
sys     0m0.011s

$ time perl test.pl file >/dev/null

real    0m0.481s
user    0m0.431s
sys     0m0.022s

$ time awk -f test.awk file >/dev/null

real    0m0.390s
user    0m0.370s
sys     0m0.010s
ghostdog74
And now to handle row and column labels too?
Jonathan Leffler
no requirement for that.
ghostdog74
OK - you're correct; your sample data doesn't match the question's sample data, but your code works fine on the question's sample data and gives the required output (give or take blank vs tab spacing). Mainly my mistake.
Jonathan Leffler
Interesting timings - I agree you see a performance benefit in awk. I was using MacOS X 10.5.8, which does not use 'gawk', and I was using Perl 5.10.1 (32-bit build). I gather that your data was 10,000 lines with 4 columns per line? Anyway, it doesn't matter a great deal; both awk and perl are viable solutions (and the awk solution is neater - the 'defined' checks in my Perl are necessary for warning-free runs under strict/warnings), neither is a slouch, and both are likely to be way faster than the original shell script solution.
Jonathan Leffler
Yes, my data is just the sample repeated until it reaches 10,000 lines.
ghostdog74
On my original 2.2 GB matrix, the perl solution is slightly faster than awk: 350.103s vs. 369.410s. I was using perl 5.8.8 (64-bit).
Thrawn
I am using gawk 3.1.6a, Perl 5.10.0.
ghostdog74
`mawk` should be even faster
Porges
A: 

A hackish Perl solution could look like this. It's nice because it doesn't load the whole file into memory; it writes intermediate temp files and then uses the all-wonderful paste

#!/usr/bin/perl
use warnings;
use strict;

my $counter;
open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
while (my $line = <INPUT>) {
    chomp $line;
    my @array = split ("\t",$line);
    open OUTPUT, ">temp$." or die ("unable to open output file!");
    print OUTPUT join ("\n",@array);
    close OUTPUT;
    $counter=$.;
}
close INPUT;

# paste files together
my $execute = "paste ";
foreach (1..$counter) {
    $execute .= "temp$_ ";    # temp1, temp2, ... tempN, one file per input row
}
$execute.="> $ARGV[1]";
system $execute;
Thrawn
Using paste and temp files is just extra, unnecessary work. You can do the manipulation in memory itself, e.g. with arrays/hashes (see the sketch after this comment thread).
ghostdog74
Yep, but wouldn't that mean keeping everything in memory? The files I'm dealing with are around 2-20gb in size.
Thrawn
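
For inputs small enough to fit in memory, that array-based idea amounts to reading every row, splitting it into fields, and emitting the fields column by column. A minimal sketch (written in Python purely for illustration; the command-line file argument is an assumption):

import sys

# Read the whole tab-separated file and split each line into fields.
with open(sys.argv[1]) as f:
    rows = [line.rstrip('\n').split('\t') for line in f]

# zip(*rows) yields one tuple per original column, i.e. the transpose.
for col in zip(*rows):
    print('\t'.join(col))
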
+1  A: 

The only improvement I can see to your own example is using awk, which will reduce the number of processes that are run and the amount of data that is piped between them:

/bin/rm output 2> /dev/null

cols=`head -n 1 input | wc -w` 
for (( i=1; i <= $cols; i++))
do
  awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done >> output
Simon C
+1  A: 

If you have sc installed, you can do:

psc -r < inputfile | sc -W% - > outputfile
Dennis Williamson
+3  A: 

A Python solution:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

The above is based on the following:

import sys

for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    print(' '.join(c))

This code does assume that every line has the same number of columns (no padding is performed).
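
If the rows could be ragged, one possible variant (a sketch assuming Python 3; itertools.zip_longest is not used in the answer above) pads short rows with empty strings instead of truncating columns:

import sys
from itertools import zip_longest

# Like zip(), but pads shorter rows so no column is silently dropped.
rows = (l.split() for l in sys.stdin if l.strip())
for c in zip_longest(*rows, fillvalue=''):
    print(' '.join(c))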

Stephan202
+2  A: 

Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.

#!/bin/perl -w
#
# SO 1729824

use strict;

my(%data);          # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
    my(@row) = split /\s+/;
    my($colnum) = 0;
    foreach my $val (@row)
    {
        $data{$rownum}{$colnum++} = $val;
    }
    $rownum++;
    $maxcol = $colnum if $colnum > $maxcol;
}

my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
    for (my $row = 0; $row < $maxrow; $row++)
    {
        printf "%s%s", ($row == 0) ? "" : "\t",
                defined $data{$row}{$col} ? $data{$row}{$col} : "";
    }
    print "\n";
}

With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.


Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:

Osiris JL: time gawk -f tr.awk xxx  > /dev/null

real    0m0.367s
user    0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null

real    0m0.138s
user    0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx  > /dev/null

real    0m1.891s
user    0m0.924s
sys 0m0.961s
Osiris-2 JL:

Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.

Jonathan Leffler
On my system, gawk outperforms perl. You can see my results in my edited post.
ghostdog74
Conclusion: different platform, different software versions, different results.
ghostdog74
+3  A: 

Pure BASH, no additional processes. A nice exercise:

declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line ; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
     array[$index]=${line[$COUNTER]}
     ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s\t" ${array[$COUNTER]}
  done
  printf "\n" 
done
fgm
A: 

Hi

I used fgm's solution (thanks fgm!), but needed to eliminate the tab characters at the end of each row, so I modified the script thus:

#!/bin/bash 
declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s" ${array[$COUNTER]}
    if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
    then
        printf "\t"
    fi
  done
  printf "\n" 
done
dtw