Can sort | uniq or sort | awk count repeated values from one column?

blehman picture blehman · May 24, 2013 · Viewed 16.9k times · Source

Using the following data:

$cat p1.csv
R,3
R,4
S,1
S,2
S,3
R,2
T,4
R,3
ST,4
RST,2
RSTR,4

First sort based on column 2:

$cat p1.csv | sort -t "," -k2
S,1
R,2
RST,2
S,2
R,3
R,3
S,3
R,4
RSTR,4
ST,4
T,4

I want to count the number of 1's, 2's, 3's, and 4's from column two. Something like $cat p1.csv | sort -t "," -k2 | uniq -f2 -c .....Can uniq be applied to one column? The -f2 doesn't properly apply uniq to the correct field. The output should take the first instance of a unique value in column two and count the number of them. Hence, the data must first be sorted on column two. A correct output would look like:

1 S,1
3 R,2
3 R,3
4 R,4

Suggestions?

Answer

jaypal singh picture jaypal singh · May 24, 2013

Your question isn't quite clear so I just reverse engineered your output to your input (assuming there is a typo in your output since you mention to count number of 1's, 2's and 3's from column 2 and show 2 R,2). You'll probably need to explain your question a little better -

sort -t "," -k2 < p1.csv |   
awk -F, '!z[$2]++{ a[$2]=$0; } END {for (i in a) print z[i], a[i]}' |  
sort -k1

Explaination:

 - !z[$2]++ removes the duplicates based on column 2 as awk progresses thru 
   each line. 
 - a[$2]=$0 stores the non-duplicates lines in an array
 - END {..} looks at all the keys in array and pulls up values. For array a 
   it pulls up the first line it sees with unique column 2 (as your desired 
   output). For array z it pulls up number of lines seen with same column 2.

Test:

[jaypal:~/temp] cat file
R,3
R,4
S,1
S,2
S,3
R,2
T,4
R,3
ST,4
RST,2
RSTR,4

[jaypal:~/temp] sort -t "," -k2 < t |   
awk -F, '!z[$2]++{ a[$2]=$0; } END {for (i in a) print z[i], a[i]}' |  
sort -k1
1 S,1
3 R,2
3 R,3
4 R,4

Sort with -u option

For finding unique entries based on column, you can try sort with -u option (but it won't give you the counts though).

From the man page:

-u, --unique
              with -c, check for strict ordering; 
              without -c, output only the first of an equal run

You can try something like this -

sort -t, -k2 p1.csv | sort -u -t, -k2

Using Uniq

I am not sure Uniq can be performed on a column separated by a delimiter other than a blank. Atleast on my mac it doesn't. Here is the man page reference

-f num  Ignore the first num fields in each input line when doing comparisons.  
        A field is a string of non-blank characters separated
        from adjacent fields by blanks.  Field numbers are one based, 
        i.e., the first field is field one.

So if you can remove the , delimiter and run the following command you should get your desired result.

sort -k2 test | uniq -c -f1

Test:

[jaypal:~/temp] cat test
R 3
R 4
S 1
S 2
S 3
R 2
T 4
R 3
ST 4
RST 2
RSTR 4

[jaypal:~/temp] sort -k2 test | uniq -c -f1
   1 S 1
   3 R 2
   3 R 3
   4 R 4