I would like to have your advice/help on how to subset a big file (millions of rows or lines).
For example,
(1) I have a big file (millions of rows, tab-delimited). I want a subset of this file with only rows 10000 to 100000.
(2) I have a big file (millions of columns, tab-delimited). I want a subset of this file with only columns 10000 to 100000.
I know there are tools like head, tail, cut, split, awk, and sed, and I can use them for simple subsetting, but I do not know how to do this particular job.
Could you please give any advice? Thanks in advance.
Filtering rows is easy, for example with AWK:
cat largefile | awk 'NR >= 10000 && NR <= 100000 { print }'
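Filtering columns is easier with CUT. Tab is already its default delimiter, so -d is not needed (and cut does not interpret '\t' as a tab; it would reject it as a multi-character delimiter):
cat largefile | cut -f 10000-100000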
As Rahul Dravid mentioned, cat is not needed here, and as Zsolt Botykai added, you can improve performance by making awk exit as soon as it passes the last wanted row:
awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -f 10000-100000 largefile
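If you need this often, the two commands above can be wrapped in small shell functions. This is only a sketch: the names subset_rows and subset_cols and the file sample.tsv are made up for illustration, and it assumes a POSIX shell with awk and cut available. The bounds are 1-based and inclusive, as in the commands above.

```shell
# Hypothetical helpers (names are illustrative, not standard tools).

# Print rows FIRST..LAST of FILE; awk stops reading once it passes LAST.
subset_rows() {  # usage: subset_rows FIRST LAST FILE
    awk -v first="$1" -v last="$2" 'NR > last { exit } NR >= first' "$3"
}

# Print tab-separated columns FIRST..LAST of FILE (tab is cut's default delimiter).
subset_cols() {  # usage: subset_cols FIRST LAST FILE
    cut -f "$1-$2" "$3"
}

# Small sample file (4 rows, 3 columns) so the result can be checked by eye.
printf '1\ta\tx\n2\tb\ty\n3\tc\tz\n4\td\tw\n' > sample.tsv

subset_rows 2 3 sample.tsv   # prints rows 2 and 3
subset_cols 2 3 sample.tsv   # prints columns 2 and 3 of every row
```

To take a block of rows and columns at once, pipe one into the other, for example: subset_rows 10000 100000 largefile | cut -f 10000-100000. Doing the row filter first means cut only has to split the rows you keep.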