I am porting some SAS code to R and came across the following SAS snippet:
proc means data=A noprint;
by name date;
id comp_no;
var price;
id rep_dats act no;
output out= test(drop=_type_ _freq_)
median=median n=num;
run;
I know that the 'by' statement groups the data so that the statistics are computed at that level. But what is 'id' used for, and why are there two 'id' statements? I checked the SAS help but didn't really understand it. I also looked at the examples at http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p19dfq16fqt1t3n1eroiabnn6r3s.htm, but none of them illustrates the use of ID.
As I don't have access to SAS, I can't run this and see what the output looks like. Any clarification would be a great help. Thanks!
The proc means procedure calculates simple summary statistics for a data set and can write them to an output data set. By default, it analyzes every numeric variable in the data set.
Using an ID statement together with by in a proc means adds the ID variables to the output, with one value per by group. That value is the greatest value of the first variable listed in ID within the by group. So if you list several variables, e.g. id A B; the output contains the greatest value of A for that group, and B is taken from the same observation rather than being its own maximum.
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146733.htm
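Since you can't run SAS, here is a rough Python sketch (standard library only) of what the snippet in the question would produce under the rule above. Several things here are assumptions, not confirmed behaviour: the rows are made up, the two ID statements are assumed to combine into one list in the order written, ties on the first ID variable are assumed to be broken by the later ones, and missing prices (None) are assumed to be excluded from the statistics but not from the ID selection.

```python
from itertools import groupby
from statistics import median

# Hypothetical rows: column names come from the question, values are invented.
data = [
    {"name": "X", "date": "2020-01-02", "comp_no": 1, "rep_dats": 10, "act": 0, "no": 1, "price": 5.0},
    {"name": "X", "date": "2020-01-02", "comp_no": 3, "rep_dats": 11, "act": 1, "no": 2, "price": 7.0},
    {"name": "X", "date": "2020-01-02", "comp_no": 2, "rep_dats": 99, "act": 9, "no": 9, "price": 9.0},
    {"name": "Y", "date": "2020-01-02", "comp_no": 2, "rep_dats": 20, "act": 0, "no": 1, "price": 4.0},
    {"name": "Y", "date": "2020-01-02", "comp_no": 5, "rep_dats": 21, "act": 1, "no": 3, "price": None},
]

BY_VARS = ("name", "date")
# Assumption: "id comp_no;" and "id rep_dats act no;" act like one combined list.
ID_VARS = ("comp_no", "rep_dats", "act", "no")


def by_key(row):
    """Grouping key, mirroring 'by name date;'."""
    return tuple(row[v] for v in BY_VARS)


test = []  # plays the role of the 'test' output data set
for key, group in groupby(sorted(data, key=by_key), key=by_key):
    group = list(group)
    # Row holding the greatest value of the first ID variable
    # (later ID variables only break ties, which is an assumption).
    id_row = max(group, key=lambda r: tuple(r[v] for v in ID_VARS))
    prices = [r["price"] for r in group if r["price"] is not None]
    rec = dict(zip(BY_VARS, key))
    rec.update({v: id_row[v] for v in ID_VARS})
    rec["median"] = median(prices) if prices else None  # median=median
    rec["num"] = len(prices)                            # n=num
    test.append(rec)

for rec in test:
    print(rec)
```

With these invented rows, group X gets comp_no 3 together with the rep_dats, act and no values from that same row, even though another row in the group has larger values for those three.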
By the way, I don't know what your data set looks like, but it seems your proc means is only summarizing the price variable.
For example, if you have a data set:
Obs sex A B C D
1 M 20 50 1 34
2 F 500 45 3 45
3 M 200 23 7 32
4 M 120 67 5 44
5 F 400 98 2 59
then, with the data already sorted by sex (which the by statement requires),
proc means data=sorted;
by sex;
var A B;
id D C;
output out=means(drop =_type_ _freq_);
run;
will output:
sex D C _STAT_ A B
F 59 2 N 2.000 2.0000
F 59 2 MIN 400.000 45.0000
F 59 2 MAX 500.000 98.0000
F 59 2 MEAN 450.000 71.5000
F 59 2 STD 70.711 37.4767
M 44 5 N 3.000 3.0000
M 44 5 MIN 20.000 23.0000
M 44 5 MAX 200.000 67.0000
M 44 5 MEAN 113.333 46.6667
M 44 5 STD 90.185 22.1886
Note that for variable D, 59 is the greatest value of D in group F, but 2 is not the greatest value of C in that group, because D was specified first. The same holds for group M, where C is just the value that was on the same row as the greatest value of D.
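If it helps to check the logic without SAS, the example above can be mimicked in plain Python. This is only a sketch of the behaviour described here, not SAS itself: the helper name summarize is made up, and breaking ties on D by the value of C is an assumption (the example data has no ties). STD is computed as the sample standard deviation, which matches the numbers in the table.

```python
from itertools import groupby
from statistics import mean, stdev

# The example data set from above.
rows = [
    {"sex": "M", "A": 20,  "B": 50, "C": 1, "D": 34},
    {"sex": "F", "A": 500, "B": 45, "C": 3, "D": 45},
    {"sex": "M", "A": 200, "B": 23, "C": 7, "D": 32},
    {"sex": "M", "A": 120, "B": 67, "C": 5, "D": 44},
    {"sex": "F", "A": 400, "B": 98, "C": 2, "D": 59},
]

# Default statistics written when the output statement names none.
STATS = {"N": len, "MIN": min, "MAX": max, "MEAN": mean, "STD": stdev}


def summarize(rows, by, id_vars, var_vars):
    """Hypothetical helper: one record per by group and statistic."""
    out = []
    ordered = sorted(rows, key=lambda r: r[by])  # 'by' needs sorted input
    for key, group in groupby(ordered, key=lambda r: r[by]):
        group = list(group)
        # Row with the greatest value of the first ID variable;
        # the other ID values are read from that same row.
        id_row = max(group, key=lambda r: tuple(r[v] for v in id_vars))
        for stat_name, fn in STATS.items():
            rec = {by: key}
            rec.update({v: id_row[v] for v in id_vars})
            rec["_STAT_"] = stat_name
            rec.update({v: fn([r[v] for r in group]) for v in var_vars})
            out.append(rec)
    return out


result = summarize(rows, by="sex", id_vars=("D", "C"), var_vars=("A", "B"))
for rec in result:
    print(rec)
```

Note that stdev raises an error for a group with a single row, so a real port would need to handle that case separately.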