how to remove duplicates in SAS data step

santosh315345 picture santosh315345 · Feb 14, 2014 · Viewed 37.7k times · Source

How to remove duplicates in SAS data step.

data uscpi;
      input year month cpi;
   datalines;
   1990  6 129.9
   1990  7 130.4
   1990  8 131.6
   1990  9 132.7
   1991  4 135.2
   1991  5 135.6
   1991  6 136.0
   1991  7 136.2
   ;
   run;

PROC SORT DATA = uscpi OUT = uscpi_dist NODUPKEY; 
 BY year ; 
 RUN; 

i can with proc step, but how to remove it in data step. Thanks in advance

Answer

user1509107 picture user1509107 · Feb 14, 2014

You can use the first. & last. automatic variables created by SAS when using by-group processing. They give more control on which row you consider as duplicate. Please read the manual to understand by group processing in a Data Step

 data uscpi_dedupedByYear;
 set uscpi_sorted;
 by year;
 if first.year; /*only keep the first occurence of each distinct year.  */
 /*if last.year; */ /*only keep the last occurence of each distinct year*/

 run;

A lot depends on who your input dataset is sorted. For ex: If your input dataset is sorted by year & month and you use if first.year; then you can see that it only keeps the earliest month in any given year. However, if your dataset is sorted by year & descending month then if first.year; retains last month in any given year.

This behaviour obviously differs from how nodupkey works.