How to remove duplicates in SAS data step.
data uscpi;
input year month cpi;
datalines;
1990 6 129.9
1990 7 130.4
1990 8 131.6
1990 9 132.7
1991 4 135.2
1991 5 135.6
1991 6 136.0
1991 7 136.2
;
run;
PROC SORT DATA = uscpi OUT = uscpi_dist NODUPKEY;
BY year ;
RUN;
i can with proc step, but how to remove it in data step. Thanks in advance
You can use the first.
& last.
automatic variables created by SAS when using by-group
processing. They give more control on which row you consider as duplicate.
Please read the manual to understand by group processing in a Data Step
data uscpi_dedupedByYear;
set uscpi_sorted;
by year;
if first.year; /*only keep the first occurence of each distinct year. */
/*if last.year; */ /*only keep the last occurence of each distinct year*/
run;
A lot depends on who your input dataset is sorted. For ex: If your input dataset is sorted by year & month and you use if first.year;
then you can see that it only keeps the earliest month in any given year. However, if your dataset is sorted by year & descending month
then if first.year;
retains last month in any given year.
This behaviour obviously differs from how nodupkey
works.