Remove NAs when using mapply for ttest in R

code123 · Feb 22, 2015

I would like to run a column-wise t-test between two data frames in R, that is, t.test(df1$col1, df2$col1), t.test(df1$col2, df2$col2), and so on. The best option here seems to be the mapply or Map function. Something like:

mapply(t.test,tnav_DJF_histo.csv[,-1],tnav_DJF.csv[,-1])

This works perfectly, but if one of the data frame columns contains NAs, it fails with this error:

Error in t.test.default(dots[[1L]][[1L]], dots[[2L]][[1L]]) : 
  not enough 'y' observations

Question: how can I use na.rm to get the job done? For example, if a column in tnav_DJF.csv[,-1] has NAs but the matching column in tnav_DJF_histo.csv[,-1] has none, how can I tell mapply to ignore or skip the analysis for these columns?

Many thanks.

aez.

Answer

LyzandeR · Feb 22, 2015

You can do this with mapply and an anonymous function that checks each pair of columns for NAs first.

Example data:

df1 <- data.frame(a=runif(20), b=runif(20), c=rep(NA,20))
df2 <- data.frame(a=runif(20), b=runif(20), c=c(NA,1:18,NA))
# note that df1's third column contains only NAs

Solution:

Use mapply with an anonymous function as follows:

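# anonymous function that skips the test when either column is entirely NA
# SIMPLIFY = FALSE keeps the result a list even when no column is skipped
mapply(function(x, y) {
  if (all(is.na(x)) || all(is.na(y))) NULL else t.test(x, y, na.action = na.omit)
}, df1, df2, SIMPLIFY = FALSE)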

Output:

$a

    Welch Two Sample t-test

data:  x and y
t = 1.4757, df = 37.337, p-value = 0.1484
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0543192  0.3458648
sample estimates:
mean of x mean of y 
0.5217619 0.3759890 


$b

    Welch Two Sample t-test

data:  x and y
t = 1.1689, df = 37.7, p-value = 0.2498
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0815067  0.3041051
sample estimates:
mean of x mean of y 
0.5846343 0.4733351 


$c
NULL

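P.S. The t.test function has no na.rm argument. It does have an na.action argument, but that is documented for the formula interface; when given plain vectors, as here, t.test already drops NA values on its own. Even with na.action set to na.omit (as I have done), you will still get an error if all the elements of a column are NA.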
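To see the two cases side by side, here is a minimal sketch. The vectors x, y and all_na are made up for illustration and are not from the question's data. It shows t.test coping with a few NAs, and tryCatch capturing the error raised when one side has no usable values.

```r
# Illustrative vectors (not from the original post)
x <- c(1.2, 2.3, NA, 3.1, 2.8)
y <- c(2.0, NA, 2.9, 3.5, 4.1)

# Partial NAs: t.test drops them and uses the remaining values
res <- t.test(x, y)
res$estimate  # means of the non-missing values in x and y

# All NAs: the call stops with an error; tryCatch returns its message instead
all_na <- rep(NA_real_, 5)
msg <- tryCatch(t.test(x, all_na), error = function(e) conditionMessage(e))
msg
```

The captured message should be of the same "not enough observations" kind as the error quoted in the question.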

P.S.2 If only some of the elements of x or y are NA, t.test runs properly and omits those elements. If you would rather skip the t-test whenever a column contains even a single NA, change all to any in the function above.
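As a variation on the same idea, the sketch below counts the non-missing values instead of testing all or any, and returns only the p-value for each pair of columns. The helper name safe_p is hypothetical, the set.seed call is there only to make the example data reproducible, and the threshold of two values per side is an assumption based on what the default Welch test needs.

```r
set.seed(1)  # only so that the illustrative data are reproducible
df1 <- data.frame(a = runif(20), b = runif(20), c = rep(NA_real_, 20))
df2 <- data.frame(a = runif(20), b = runif(20), c = c(NA, 1:18, NA))

# safe_p is a hypothetical helper name, not part of base R
safe_p <- function(x, y) {
  # Skip the test unless each side has at least two non-missing values
  if (sum(!is.na(x)) < 2 || sum(!is.na(y)) < 2) return(NA_real_)
  t.test(x, y)$p.value
}

# Each call returns a single number, so mapply simplifies to a named vector
p_values <- mapply(safe_p, df1, df2)
p_values
```

Returning NA_real_ rather than NULL keeps one entry per column, so the result lines up with the column names and the skipped columns are easy to spot.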