How to subset a file - select a number of rows or columns

jianfeng.mao · Jun 27, 2011 · Viewed 50.2k times

I would like your advice on how to subset a big file (millions of rows or lines).

For example,

(1) I have a big tab-delimited file with millions of rows. I want a subset of this file containing only rows 10000 to 100000.

(2) I have a big tab-delimited file with millions of columns. I want a subset of this file containing only columns 10000 to 100000.

I know there are tools like head, tail, cut, split, awk, and sed, and I can use them for simple subsetting, but I do not know how to apply them to this job.

Could you please give any advice? Thanks in advance.

Answer

Drakosha · Jun 27, 2011

Filtering rows is easy, for example with AWK:

cat largefile | awk 'NR >= 10000 && NR <= 100000 { print }'
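
The question also mentions sed, head, and tail, so here is a minimal sketch of the same row selection with those tools, run on a tiny generated file so the result is easy to check. The temporary input file and the scaled-down range (lines 5 to 10 of 20) are assumptions for illustration only; substitute your own file and line numbers.

```shell
# Hypothetical small input: the numbers 1..20, one per line.
input=$(mktemp)
seq 1 20 > "$input"

# awk: keep lines 5-10, the same pattern as the command above.
via_awk=$(awk 'NR >= 5 && NR <= 10' "$input")

# sed: print lines 5-10, then quit at line 10 so the rest is never read.
via_sed=$(sed -n '5,10p;10q' "$input")

# tail + head: start at line 5, then keep the next 6 lines (10 - 5 + 1).
via_tail_head=$(tail -n +5 "$input" | head -n 6)

printf '%s\n' "$via_sed"
rm -f "$input"
```

All three variants should select the same lines; which one is fastest on a multi-gigabyte file depends on the system, so it is worth timing them on your own data.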

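Filtering columns is easier with CUT. Tab is already cut's default delimiter, so no -d option is needed (a quoted '\t' is passed to cut as a literal backslash followed by t, which GNU cut rejects as not being a single character):

cat largefile | cut -f 10000-100000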

As Rahul Dravid mentioned, cat is not needed here, and as Zsolt Botykai added, you can improve performance by making awk exit as soon as it has passed the last wanted row:

awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -f 10000-100000 largefile
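
If you need a range of rows and a range of columns at the same time, the two commands can be chained. The following is a minimal sketch on a small generated file; the file contents, the cell naming (r<row>c<col>), and the scaled-down ranges (rows 3 to 5, columns 2 to 3) are assumptions for illustration. It relies on cut splitting on tab by default.

```shell
# Hypothetical sample: 8 rows by 4 tab-separated columns, cells named r<row>c<col>.
sample=$(mktemp)
for r in 1 2 3 4 5 6 7 8; do
    printf 'r%sc1\tr%sc2\tr%sc3\tr%sc4\n' "$r" "$r" "$r" "$r"
done > "$sample"

# Columns 2-3 of every row; cut splits on tab by default, so no -d is needed.
cols=$(cut -f 2-3 "$sample")

# Rows 3-5 first (awk exits after row 5), then columns 2-3 of those rows.
block=$(awk 'NR > 5 { exit } NR >= 3' "$sample" | cut -f 2-3)

printf '%s\n' "$block"
rm -f "$sample"
```

Selecting rows first means cut only has to split the lines you actually keep, which is likely to matter when each line has millions of fields.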