An efficient way to transpose a file in Bash

Federico Giorgi picture Federico Giorgi · Nov 13, 2009 · Viewed 124.1k times · Source

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it in an efficient way using only bash commands (I could write a ten or so lines Perl script to do that, but it should be slower to execute than the native bash functions). So the output should look like

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I thought of a solution like this

cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

But it's slow and doesn't seem the most efficient solution. I've seen a solution for vi in this post, but it's still over-slow. Any thoughts/suggestions/brilliant ideas? :-)

Answer

ghostdog74 picture ghostdog74 · Nov 13, 2009
awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file

output

$ more file
0 1 2
3 4 5
6 7 8
9 10 11

$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11

Performance against Perl solution by Jonathan on a 10000 lines file

$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2

$  wc -l < file
10000

$ time perl test.pl file >/dev/null

real    0m0.480s
user    0m0.442s
sys     0m0.026s

$ time awk -f test.awk file >/dev/null

real    0m0.382s
user    0m0.367s
sys     0m0.011s

$ time perl test.pl file >/dev/null

real    0m0.481s
user    0m0.431s
sys     0m0.022s

$ time awk -f test.awk file >/dev/null

real    0m0.390s
user    0m0.370s
sys     0m0.010s

EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).

Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator which the OP had originally asked for so it'd handle empty fields and it coincidentally pretties-up the output a bit for this particular case.

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

The above solutions will work in any awk (except old, broken awk of course - there YMMV).

The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
    print ""
    if (ARGIND < NF) {
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

which uses almost no memory but reads the input file once per number of fields on a line so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line and it uses GNU awk for ENDFILE and ARGIND but any awk can do the same with tests on FNR==1 and END.