R: Standardizing numeric variables in a dataframe while retaining factor variables

lambertj · Feb 6, 2018 · Viewed 8.2k times

I have a dataframe (dcc) loaded in R which I have narrowed down to complete cases.

str(dcc)

'data.frame':   41715 obs. of  9 variables:
 $ XCoord                  : num  661382 661412 661442 661472 661502 ...
 $ YCoord                  : num  648092 648092 648092 648092 648092 ...
 $ OBJECTID                : int  1 2 3 4 5 6 7 8 9 10 ...
 $ POINTID                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ GRID_CODE               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ APPL_COST_DIST_RIV_COAST: num  21350 21674 22185 22748 23448 ...
 $ APPL_DEM30              : int  785 793 792 769 765 777 784 789 781 751 ...
 $ APPL_DEM30_SLOPE        : num  19.7 13.3 18.6 23.2 21 ...
 $ APPL_SITE_NONSITE       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

I want to standardize the numeric and integer variables by subtracting the mean and dividing by the standard deviation. When I apply the following code, I inadvertently drop the factor variable APPL_SITE_NONSITE from the dataframe:

ind <- sapply(dcc, is.numeric)
dcc.s<-sapply(dcc[,ind], function(x) (x-mean(x))/sd(x))
dcc.s<-data.frame(dcc.s)

If I'm not mistaken, that happens because ind=FALSE for that variable. It seems like I need some combination of a for loop and if/else statement to standardize the numeric variables and leave the factor variable alone. I have tried a number of permutations, but keep getting errors. For example, the following code:

dcc.s <- for (i in 1:ncol(dcc)){ sapply(dcc[,i],
if (is.numeric(dcc[,i])==TRUE) {
function(x) (x-mean(x))/sd(x) }
 else {dcc[,i]})
}

returns the error:

Error in match.fun(FUN) : c("'if (is.numeric(dcc[, i]) == TRUE) {' is not a function, character or symbol", "' function(x) (x - mean(x))/sd(x)' is not a function, character or symbol", "'} else {' is not a function, character or symbol", "' dcc[, i]' is not a function, character or symbol", "'}' is not a function, character or symbol")

Perhaps this is a simple formatting error or a misplaced bracket, but I'm thoroughly stuck. I am open to other approaches if there is a more elegant way to do this. Any help would be much appreciated.

Answer

Onyambu · Feb 6, 2018

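You need to use rapply instead of sapply. With how="replace", rapply applies the function only to the columns whose class you name and leaves the others, such as factors, unchanged. A small example: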

> set.seed(1)
> df=data.frame(A=rnorm(10),b=1:10,C=as.factor(rep(1:2,5)))
> str(df)
'data.frame':   10 obs. of  3 variables:
 $ A: num  -0.626 0.184 -0.836 1.595 0.33 ...
 $ b: int  1 2 3 4 5 6 7 8 9 10
 $ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2

The code you need to use:

> D=rapply(df,scale,c("numeric","integer"),how="replace")
> D
             A          b C
1  -0.97190653 -1.4863011 1
2   0.06589991 -1.1560120 2
3  -1.23987805 -0.8257228 1
4   1.87433300 -0.4954337 2
5   0.25276523 -0.1651446 1
6  -1.22045645  0.1651446 2
7   0.45507643  0.4954337 1
8   0.77649606  0.8257228 2
9   0.56826358  1.1560120 1
10 -0.56059319  1.4863011 2
> str(D)
'data.frame':   10 obs. of  3 variables:
 $ A: num [1:10, 1] -0.9719 0.0659 -1.2399 1.8743 0.2528 ...
  ..- attr(*, "scaled:center")= num 0.132
  ..- attr(*, "scaled:scale")= num 0.781
 $ b: num [1:10, 1] -1.486 -1.156 -0.826 -0.495 -0.165 ...
  ..- attr(*, "scaled:center")= num 5.5
  ..- attr(*, "scaled:scale")= num 3.03
 $ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
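
One thing to watch in the str(D) output above: scale() returns a one-column matrix, so A and b come back as num [1:10, 1] with "scaled:center" and "scaled:scale" attributes rather than as plain vectors. If plain numeric columns are preferred, a minimal sketch is to hand rapply a small function that does the arithmetic directly. The name standardize is a hypothetical helper introduced here for illustration (it is not part of base R), and df is the toy data frame from above.

```r
set.seed(1)
df <- data.frame(A = rnorm(10), b = 1:10, C = as.factor(rep(1:2, 5)))

# Hypothetical helper: subtract the mean and divide by the standard deviation.
# It returns a plain numeric vector, whereas scale() returns a matrix.
standardize <- function(x) (x - mean(x)) / sd(x)

# Apply only to numeric and integer columns; how = "replace" leaves the
# factor column C untouched.
D2 <- rapply(df, standardize, classes = c("numeric", "integer"), how = "replace")

str(D2)
```

The standardized values should match those in D; only the storage differs, which can matter for functions that expect plain vector columns.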
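
For completeness, the for/if idea from the question can also be made to work. Two things get in the way of the original attempt: a for loop itself returns NULL, so its result cannot be assigned to dcc.s, and sapply expects a function as its second argument, while the else branch hands it a column instead. The following is only a sketch on the toy df (the names df.s and df.s2 are made up here); it tests each column and overwrites the numeric ones in a copy.

```r
set.seed(1)
df <- data.frame(A = rnorm(10), b = 1:10, C = as.factor(rep(1:2, 5)))

df.s <- df  # work on a copy so the factor column is carried over as-is

for (i in seq_along(df.s)) {
  # is.numeric() is TRUE for double and integer columns, FALSE for factors
  if (is.numeric(df.s[[i]])) {
    df.s[[i]] <- (df.s[[i]] - mean(df.s[[i]])) / sd(df.s[[i]])
  }
}

# Loop-free equivalent that reuses a logical index like ind from the question:
# assign the standardized columns back instead of subsetting them out.
ind <- sapply(df, is.numeric)
df.s2 <- df
df.s2[ind] <- lapply(df[ind], function(x) (x - mean(x)) / sd(x))

str(df.s)
```

Substituting dcc for df should give the same behaviour on the real data, assuming it has no missing values (the question says it was reduced to complete cases).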