split large csv text file based on column value

user788171 · Mar 31, 2012 · Viewed 14.9k times

I have CSV files with multiple columns, and the rows are sorted on one of those columns. For instance, I might have lines like this:

19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2
20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1

I would like to divide up the file based on the 3rd column, e.g. put PLXS and PCP entries into their own files called PLXS.csv and PCP.csv. Because the file happens to be pre-sorted, all of the PLXS entries are before the PCP entries and so on.

I generally end up doing things like this in C++ since that's the language I know best, but in this case my input CSV file is several gigabytes and too large to load into memory.

Can somebody show how this can be accomplished? Perl/Python/PHP/bash solutions are all okay; they just need to be able to handle the huge file without excessive memory usage.

Answer

Sean Summers · Apr 3, 2012

Here's an old-school one-liner for you (just replace the >> with > if you want each run to truncate the output files rather than append to them):

awk -F, '{print >> ($3".csv")}' input.csv
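
Since you mentioned Python as an option, here's a rough Python 3 sketch of the same idea (the input.csv name and the 3rd-column index are assumptions taken from your sample). It streams one row at a time, so memory use stays flat no matter how large the file gets:

import csv

writers = {}  # 3rd-column value (PLXS, PCP, ...) -> (file handle, csv writer)
with open('input.csv', newline='') as src:
    for row in csv.reader(src):
        key = row[2]                                  # symbol to split on
        if key not in writers:
            fh = open(key + '.csv', 'w', newline='')  # one output file per symbol
            writers[key] = (fh, csv.writer(fh))
        writers[key][1].writerow(row)
for fh, _ in writers.values():
    fh.close()

It keeps one handle open per distinct symbol, which is fine unless you have thousands of symbols and a tight open-file limit.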

Due to popular demand (and an itch I just had), I've also written a version that duplicates the header line into every output file (NR==1 stashes the header, and the p array remembers which output files have already been started):

awk -F, 'NR==1 {h=$0; next} {f=$3".csv"} !($3 in p) {p[$3]; print h > f} {print >> f}' input.csv

But you could also start with this shell loop, which writes just the header into each per-symbol file, and then finish with the first awk above (keeping the >> so the data lines are appended after the headers):

HDR=$(head -1 input.csv); for fn in $(tail -n +2 input.csv | cut -f3 -d, | sort -u); do echo "$HDR" > "$fn.csv"; done
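
And because your file is pre-sorted on that column, a Python variant never needs more than one output file open at a time. This is again only a sketch under the same assumptions, plus the assumption (as in the header-copying awk above) that the first line is a header:

import csv

current, out, writer = None, None, None
with open('input.csv', newline='') as src:
    reader = csv.reader(src)
    header = next(reader)                 # first line, copied into every output
    for row in reader:
        if row[2] != current:             # a new symbol starts a new output file
            if out is not None:
                out.close()
            current = row[2]
            out = open(current + '.csv', 'w', newline='')
            writer = csv.writer(out)
            writer.writerow(header)
        writer.writerow(row)
if out is not None:
    out.close()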

Most modern systems have the awk binary included, but if you don't have it, you can find an exe at Gawk for Windows.