I am trying to get a point biserial correlation between a continuous vocabulary score and syntactic productivity (dichotomous: productive vs not_productive).
I tried both the ltm package
> biserial.cor (lol$voc1_tvl, lol$synt, use = c("complete.obs"))
and the polycor package
> polyserial( lol$voc1_tvl, lol$synt, ML = FALSE, control = list(), std.err = FALSE, maxcor=.9999, bins=4)
The problem is that neither function gives me a p-value.
How can I run a point biserial correlation test and get the associated p-value, or alternatively calculate the p-value myself?
Since the point biserial correlation is just a particular case of the popular Pearson's product-moment coefficient, you can use cor.test to approximate (more on that later) the correlation between a continuous X and a dichotomous Y. For example, given the following data:
set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)
Running cor.test(x, y) will give you the information you want.
Pearson's product-moment correlation
data: x and y
t = -1.1971, df = 998, p-value = 0.2316
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.09962497 0.02418410
sample estimates:
cor
-0.03786575
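If you would rather pull the p-value out programmatically, or compute it yourself, here is a minimal sketch using only base R. It assumes the usual t transformation of r with n - 2 degrees of freedom, and the variable names (ct, t_stat, p_manual, tt) are my own. It also shows that the pooled-variance two-sample t-test gives the same p-value, since for a dichotomous Y the two tests are equivalent.

```r
set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)

# cor.test returns an "htest" object; the p-value is one of its components
ct <- cor.test(x, y)
ct$p.value

# The same p-value by hand: t = r * sqrt((n - 2) / (1 - r^2)), df = n - 2
r <- unname(ct$estimate)
n <- length(x)
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

# Equivalent test: two-sample t-test with equal variances assumed
tt <- t.test(x ~ y, var.equal = TRUE)

all.equal(p_manual, ct$p.value)
all.equal(tt$p.value, ct$p.value)
```

With the data from the question, something like cor.test(lol$voc1_tvl, as.numeric(lol$synt == "productive")) should work, assuming synt is stored as a factor or character vector with those labels. cor.test drops incomplete pairs on its own, so no use argument is needed.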
As an indication of the similarity between the coefficients, notice how the calculated correlation of -0.03786575 is similar to what ltm::biserial.cor gives you:
> library(ltm)
> biserial.cor(x, y, level = 2)
[1] -0.03784681
The difference lies in whether the standard deviation is computed by dividing by n or by n - 1: the biserial.cor result above equals the cor result multiplied by sqrt((n - 1)/n), so the gap shrinks as the sample grows.
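The n versus n - 1 distinction can be checked by hand. The sketch below uses base R only (the variable names are my own, and it does not call ltm, whose exact output may depend on the installed version). It evaluates the point biserial formula with both denominators: the version dividing by n reproduces cor exactly, and the two versions differ by a factor of sqrt((n - 1)/n), which is the size of the gap between the two outputs above.

```r
set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)

n <- length(x)
p <- mean(y == 1)                                  # proportion of 1s
diff_mu <- mean(x[y == 1]) - mean(x[y == 0])       # difference in group means

sd_pop  <- sqrt(mean((x - mean(x))^2))             # divides by n
sd_samp <- sd(x)                                   # divides by n - 1

r_pop  <- diff_mu * sqrt(p * (1 - p)) / sd_pop     # point biserial, n denominator
r_samp <- diff_mu * sqrt(p * (1 - p)) / sd_samp    # same formula, n - 1 denominator

all.equal(r_pop, cor(x, y))                        # identical to Pearson's r
all.equal(r_samp, cor(x, y) * sqrt((n - 1) / n))   # off by sqrt((n - 1) / n)
```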
As cgage noted, you can also use the polyserial() function, which in my example would yield:
> polyserial(x, y, std.err = TRUE)
Polyserial Correlation, 2-step est. = -0.04748 (0.03956)
Test of bivariate normality: Chisquare = 1.891, df = 5, p = 0.864
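Here, I believe the difference in the calculated correlation (-0.04748) arises because polyserial estimates a different quantity: it treats Y as a dichotomized version of an underlying normal variable and estimates the correlation with that latent variable, which is always larger in magnitude than the point biserial coefficient.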
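For a dichotomous Y, the textbook conversion from a point biserial to a biserial correlation is closed-form, so the size of the gap can be checked without polycor. The sketch below assumes Y is a dichotomized latent normal variable. The helper name biserial_from_pb is hypothetical, and this is not polycor's implementation, only the standard rescaling by sqrt(p(1 - p)) / dnorm(qnorm(p)).

```r
# Rescale a point biserial correlation to a biserial correlation.
# r_pb: point biserial (Pearson) correlation; p: proportion of cases with Y = 1.
biserial_from_pb <- function(r_pb, p) {
  r_pb * sqrt(p * (1 - p)) / dnorm(qnorm(p))
}

set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)

r_pb <- cor(x, y)
r_b  <- biserial_from_pb(r_pb, mean(y == 1))

# With p near 0.5 the factor is about sqrt(pi / 2), roughly 1.25
c(point_biserial = r_pb, biserial = r_b, factor = r_b / r_pb)
```

Applied to the correlation of -0.03786575 reported above with p close to 0.5, the factor of roughly 1.25 gives a value near -0.0475, in line with the polyserial output.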