Canonical Correlation Analysis in R

Question 1

Canonical Correlation Analysis in R

r analysis correlation

Jota · May 1, 2011 · Viewed 10.8k times · Source

Answer

Answer

Background

Canonical Correlation Analysis (CCA) is an exploratory data analysis (EDA) technique providing estimates of the correlation relationship between two sets of variables collected on the same experimental units. Typically, users will have two matrices of data, X and Y, where the rows represent the experimental units, nrow(X) == nrow(Y).

In R, the base package provides the function cancor() to enable CCA. This is limited to cases where the number of observations is greater than the number of variables (features), nrow(X) > ncol(X).

The R package CCA is one of several which provide extended CCA functionality. Package CCA offers a set of wrapper functions around cancor() which enable consideration of cases where the feature count exceeds the count of experimental units, ncol(X) > nrow(X). Gonzalez et al (2008) CCA: An R Package to Extend Canonical Correlation Analysis, describes the workings in some detail. Version 1.2 of package CCA (published 2014-07-02) is current at the time of writing.

Probably also worth mentioning that packages kinship and accuracy mentioned in an earlier answer are no longer hosted on CRAN.

Diagnosis

Before jumping to other packages or applying unknown methods to your (presumably hard-won!) data, it's arguably beneficial to try and diagnose what the data problem may be.

The matrices passed to any of the CCA routines mentioned here should ideally be numerically complete (no missing values). The matrices passed to any of the CCA routines mentioned here should ideally be numerically complete (no missing values). The number of canonical correlates estimated by the procedure will be equal to the minimum column rank of X and Y, that is <= min(ncol(X), ncol(Y)). Ideally, the columns of each matrix will be linearly independent (not linear combinations of others).

Example:

library(CCA)
data(nutrimouse)
X <- as.matrix(nutrimouse$gene[,1:10])
Y <- as.matrix(nutrimouse$lipid)

cc(X,Y) ## works

X[,1] <- 2 * X[,9] ## column 9 no longer provides unique information

cc(X,Y)

Error in chol.default(Bmat) :
  the leading minor of order 9 is not positive definite

This is the symptom seen in the original post. One simple test is to try fitting without that column

cc(X[,-9],Y) ## works

So, while this may be frustrating in the sense that you are removing data from the analyis, that data is not providing information anyway. Your analyses can only ever be as good as the data you provide.

Also, sometimes numerical instability can be addressed by using standarized (see ?scale) variables for one (or both) of the input matrices:

X <- scale(X)

While we're here, it's perhaps worth pointing out that regularized CCA is essentially a two-step process. A cross-validation is undertaken to estimate regularization parameters (using estim.regul()), and those parameters are then used to perform the regularized CCA (with rcc()).

The example provided in the paper (arguments used verbatim in the original post)

res.regul <- estim.regul(X, Y, plt = TRUE,
                               grid1 = seq(0.0001, 0.2, l=51),
                               grid2 = seq(0, 0.2, l=51))

calls for cross-validation on a 51*51 = 2601 cell grid. While this produces a nice graphic for the paper, these are not sensible settings for initial tests on your own data. As the authors state, "The computation is not very demanding. It lasted less than one hour and half on a 'current use' computer for the 51 x 51 grid". Things have improved a little since 2008, but the default 5 x 5 grid produced by

estim.regul(X,Y,plt=TRUE)

is an excellent choice for exploratory purposes. If you're going to make mistakes, you might as well make them as quickly as possible.

Question 2

I'm using R (and package CCA) and trying to perform regularized canonical correlation analysis with two variable sets (species abundances and food abundances stored as the two matrices Y and X, respectively) in which the number of units (N=15) is less than the number of variables in the matrices, which is >400 (most of them being potential "explanatory" variables, with only 12-13 "response" variables). Gonzalez et al. (2008, http://www.jstatsoft.org/v23/i12/paper) note that the package "includes a regularized version of CCA to deal with data sets with more variables than units", which is certainly what I have with only 15 "units." Thus, I'm trying to perform regularized canonical correlation analysis using the CCA package in order to look at the relationships in my variable sets. I have been following the process Gonzalez et al (2008) go through in their paper. However, I get to an error message Error in chol.default(Bmat) : the leading minor of order 12 is not positive definite and I do not know what it means or what to do about it. Here is the code, and any ideas or knowledge on the subject would be appreciated.

library(CCA)
correl <- matcor(X, Y)
img.matcor(correl, type = 2)
res.regul <- estim.regul(X, Y, plt = TRUE,
    grid1 = seq(0.0001, 0.2, l=51),
    grid2 = seq(0, 0.2, l=51))

Error in chol.default(Bmat) : the leading minor of order 12 is not positive definite

(Note: estim.regul() takes a long time (~30-40 min) to complete when you use the sample data nutrimouse from CCA).

Any advice? Does anyone know what to do about this error? Is it because some of my columns have an NA in them? Could it be due to columns with too many 0's? Thanks in advance for any help you can offer to this combined stats & R novice.

Canonical Correlation Analysis in R

Answer

Background

Diagnosis

Related questions