What is the function of an ID statement in Proc means in SAS?

RHelp · Jan 4, 2014

I am porting some SAS code to R and came across the following SAS snippet:

proc means data=A noprint;
by name date; 
id comp_no;
var price; 
id rep_dats act no;
output out= test(drop=_type_ _freq_)        
median=median n=num; 
run;

I know that the 'by' statement groups the data so that the statistics are computed at that level. But what is 'id' used for? Why are there two 'id' statements? I checked the SAS help but didn't really understand it. I also looked at the examples at http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p19dfq16fqt1t3n1eroiabnn6r3s.htm, but none of them illustrates the use of ID.

As I don't have access to SAS, I can't run this and see what the output looks like. Any clarification would be a great help. Thanks!

Answer

Yick Leung · Jan 4, 2014

The proc means procedure calculates simple summary statistics for a data set and can display them or write them to an output data set. By default, it analyzes every numeric variable (column) in the data set.

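When you use an ID statement together with by in a proc means, the ID variables are carried into the output data set with one value per by group. By default, that value comes from the observation with the greatest value of the first ID variable within the by group; any further ID variables only act as tiebreakers. So if you specify several variables, e.g. id A B;, the output shows the greatest value of A for that group, and B is taken from that same observation (so it is not necessarily the greatest B).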

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146733.htm
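Since the question is about porting the logic away from SAS, the selection rule can be sketched in plain Python. This is only an illustration, not SAS's implementation: it assumes the ID values are taken from the row whose ID tuple is largest within each by group, comparing the first ID variable first and using later ones only to break ties. The function name id_values and the sample rows are made up for the example, and missing values are not handled.

```python
def id_values(rows, by, ids):
    """Return {by-group: ID values} using the assumed rule:
    keep the ID tuple that compares largest within each group.
    Tuples compare left to right, so the first ID variable decides
    and later ID variables only break ties."""
    best = {}
    for row in rows:
        group = tuple(row[b] for b in by)
        candidate = tuple(row[i] for i in ids)
        if group not in best or candidate > best[group]:
            best[group] = candidate
    return best


# Hypothetical data: group "x" has a tie on A (30), broken by B.
rows = [
    {"grp": "x", "A": 10, "B": 1},
    {"grp": "x", "A": 30, "B": 2},
    {"grp": "x", "A": 30, "B": 5},
    {"grp": "y", "A": 7, "B": 9},
]

result = id_values(rows, by=["grp"], ids=["A", "B"])
print(result)
```

For group "x" the two rows with A = 30 tie on the first ID variable, so B decides between them; the row with A = 10 is never considered, whatever its B.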

By the way, I don't know what your data set looks like, but it seems your proc means is only summarizing the price variable.

For example, if you have a data set:

                        Obs    sex     A      B    C     D

                         1      M      20    50    1    34
                         2      F     500    45    3    45
                         3      M     200    23    7    32
                         4      M     120    67    5    44
                         5      F     400    98    2    59

then

proc means data=sorted;
by sex;
var A B;
id D C;
output out=means(drop =_type_ _freq_);
run;

will output:

                          sex     D    C    _STAT_       A          B

                           F     59    2     N          2.000     2.0000
                           F     59    2     MIN      400.000    45.0000
                           F     59    2     MAX      500.000    98.0000
                           F     59    2     MEAN     450.000    71.5000
                           F     59    2     STD       70.711    37.4767
                           M     44    5     N          3.000     3.0000
                           M     44    5     MIN       20.000    23.0000
                           M     44    5     MAX      200.000    67.0000
                           M     44    5     MEAN     113.333    46.6667
                           M     44    5     STD       90.185    22.1886

Note that 59 is the greatest value of D in group F, but the C shown (2) is not the greatest C in that group (that would be 3), because D was specified first. The same holds for group M: D = 44 is the group maximum, and C = 5 is simply the value on the same row as the greatest D.
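For anyone without SAS at hand, the table above can be reproduced with a short Python sketch using only the standard library. It is an emulation under assumptions, not a replacement for proc means: the summarize function is a hypothetical helper, the ID values are taken from the row with the largest (D, C) tuple in each group, and STD is computed as the sample standard deviation (divisor n - 1).

```python
import statistics

# The five observations from the example data set above.
data = [
    {"sex": "M", "A": 20, "B": 50, "C": 1, "D": 34},
    {"sex": "F", "A": 500, "B": 45, "C": 3, "D": 45},
    {"sex": "M", "A": 200, "B": 23, "C": 7, "D": 32},
    {"sex": "M", "A": 120, "B": 67, "C": 5, "D": 44},
    {"sex": "F", "A": 400, "B": 98, "C": 2, "D": 59},
]

# Statistics emitted per group, in the order shown in the table.
STATS = {
    "N": len,
    "MIN": min,
    "MAX": max,
    "MEAN": statistics.mean,
    "STD": statistics.stdev,  # sample standard deviation (n - 1)
}


def summarize(rows, by, analysis, ids):
    """Emulate: by <by>; var <analysis>; id <ids>; (hypothetical helper)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[by], []).append(row)

    out = []
    for key in sorted(groups):
        members = groups[key]
        # Assumed ID rule: row with the largest ID tuple in the group.
        id_row = max(members, key=lambda r: tuple(r[i] for i in ids))
        for stat_name, fn in STATS.items():
            rec = {by: key}
            rec.update({i: id_row[i] for i in ids})
            rec["_STAT_"] = stat_name
            rec.update({v: fn([m[v] for m in members]) for v in analysis})
            out.append(rec)
    return out


table = summarize(data, by="sex", analysis=["A", "B"], ids=["D", "C"])
for rec in table:
    print(rec)
```

Every row for group F carries D = 59 and C = 2, and every row for group M carries D = 44 and C = 5, matching the output listed above.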