Building a covariance matrix in Python

sbm · Nov 5, 2015 · Viewed 10.4k times

Problem

I want to implement an algorithm from an unpublished paper by my supervisor and as part of that, I need to construct a covariance matrix C using some rules given in the paper. I'm coming from Matlab and wanted to take this opportunity to finally learn Python, hence my question: How do I do this in the most efficient (fast) way in Python (including numpy, scipy)?

Subproblem 1:

  • Option 1: I use two for loops, looping over all rows and all columns. I would assume that's the worst thing to do.
  • Option 2: Using list comprehension, I construct a list of euclidean pairs and then iterate over that list. That's what I'm doing now.

Is there any better way?

Subproblem 2

  • Option 1: I iterate over all elements in the matrix.
  • Option 2: I iterate only over the lower triangular part (without diagonal), then add the transpose (because covariance matrices are symmetric) and then add the diagonal.

I'm fairly convinced subproblem 1 is a no-brainer but I don't know about subproblem 2. I should probably also say that the matrix I'm dealing with is probably 2*10^4 x 2*10^4.

Thanks!

Edit

I prefer not to give the actual covariance matrix, but since people want an example, let's say we want to construct the covariance matrix of a stochastic process called a 'Brownian bridge'. Its structure is given by:

cov(X_s, X_t) = min{s, t} − s·t

for let's say s,t ∈ {1,...,100}. How would you build it?

Answer

Joe Kington · Nov 5, 2015

First off, for others who may come across this question in the future: If you did have data and were wanting to estimate a covariance matrix, as several people have noted, use np.cov or something similar.
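For completeness, here is a minimal sketch of the estimation case. The data is made up (random normal draws), and the names `data` and `est_cov` are just placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 observations (rows) of 3 variables (columns)
data = rng.normal(size=(500, 3))

# rowvar=False tells np.cov that each column is a variable
# (the default treats each row as a variable)
est_cov = np.cov(data, rowvar=False)
print(est_cov.shape)
```

The result is a 3 x 3 symmetric array, one row and column per variable.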

Building Arrays From Patterns

However, your question is about how to build a large matrix from pre-defined rules. To clear up some confusion in the comments: you're not estimating a covariance matrix from data, you're specifying one.

Which way is most efficient depends on the details of what you're doing. Most performance tricks in this case involve exploiting symmetry in the calculation you're performing. (For example, is one row going to be identical?)
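As a rough sketch of the "lower triangle, then mirror" idea from the question: the function `rule` below is a made-up symmetric rule standing in for whatever the paper specifies, and the size is kept tiny for illustration.

```python
import numpy as np

n = 5  # small size, for illustration only

def rule(i, j):
    # Hypothetical symmetric rule; replace with the real one
    return np.minimum(i, j) + 1.0

# Indices of the strictly lower triangle (k=-1 excludes the diagonal)
rows, cols = np.tril_indices(n, k=-1)

sym = np.zeros((n, n))
sym[rows, cols] = rule(rows, cols)  # fill the lower triangle in one shot
sym = sym + sym.T                   # mirror it into the upper triangle

# Fill the diagonal separately
d = np.arange(n)
sym[d, d] = rule(d, d)
```

Whether this beats computing the full matrix depends on how expensive the rule is. For a cheap rule that already vectorizes, evaluating every entry directly may be just as fast and is simpler.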

It's hard to say anything specific without knowing exactly what you're doing. Therefore, I'll focus on how to do this type of thing in general. (Note: I just noticed your edit. I'll include an example for a Brownian Bridge in just a bit...)

Constant (or simple) Row/Column

The most basic case is a constant row or column in the output array. It's easy to create the array and assign values to a column or row using slicing syntax:

import numpy as np

num_vars = 10**4
cov = np.zeros((num_vars, num_vars), dtype=float)

To set an entire column/row:

# Third column will be all 9's
cov[:,2] = 9

# Second row will be all 1's (will overwrite the 9 in col3)
cov[1,:] = 1

You can also assign arrays to columns/rows:

# 5th row will have random values
cov[4,:] = np.random.random(num_vars)

# 6th row will hold the squares 0, 1, 4, 9, ...
cov[5,:] = np.arange(num_vars)**2

Stacking Arrays

In many cases (but probably not this exact case), you'll want to build up your output from existing arrays. You can use vstack/hstack/column_stack/tile and many other similar functions for this.

A good example is if we're setting up a matrix for a linear inversion of a polynomial:

import numpy as np

num = 10
x = np.random.random(num) # Observation locations

# "Green's functions" for a second-order polynomial
# at our observed locations
A = np.column_stack([x**i for i in range(3)])

However, this will build up several temporary arrays (three, in this case). If we were working with a 10000-dimensional polynomial with 10^6 observations, the approach above would use too much RAM. Therefore, you might iterate over columns instead:

ndim = 2
A = np.zeros((x.size, ndim + 1), dtype=float)
for j in range(ndim + 1):
    A[:,j] = x**j

In most cases, don't worry about the temporary arrays. The column_stack-based example is the right way to go unless you're working with relatively large arrays.
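To show what such a matrix is for, here is a small sketch that builds the same A and uses it in a least-squares fit. The coefficients in `true_coeffs` are invented for the example, and np.vander is shown only as an alternative way to get the same columns:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10)  # observation locations

# Columns are x**0, x**1, x**2
A = np.column_stack([x**i for i in range(3)])

# np.vander with increasing=True produces the same column order
A_vander = np.vander(x, 3, increasing=True)

# Hypothetical coefficients, used to generate noise-free observations
true_coeffs = np.array([1.0, -2.0, 0.5])
y = A @ true_coeffs

# Recover the coefficients by least squares
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
```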

The most general approach

Without any more information, we can't exploit any sort of symmetry. The most general way is to just iterate through. Typically you'll want to avoid this approach, but sometimes it's unavoidable (especially if the calculation depends on a previous value).

Speed-wise this is identical to nested for loops, but it's easier (especially for >2D arrays) to use np.ndindex instead of multiple for loops:

import numpy as np

num_vars = 10**4
cov = np.zeros((num_vars, num_vars), dtype=float)
for i, j in np.ndindex(cov.shape):
    # calculate_value is a placeholder for your own rule for entry (i, j)
    cov[i, j] = calculate_value(i, j)

Vectorizing Index-based Calculations

In many cases, you can vectorize index-based calculations. In other words, operate directly on arrays of the indices of your output.

Let's say we had code that looked like:

import numpy as np

cov = np.zeros((10, 10), dtype=float)
for i, j in np.ndindex(cov.shape):
    cov[i,j] = i*j - i

We could replace that with:

i, j = np.mgrid[:10, :10]
cov = i*j - i
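If the full index grids themselves are a memory concern, broadcasting gives the same result from two 1-D index vectors. This is a sketch of that variation, not something the question requires; np.ogrid[:10, :10] would produce equivalent "open" grids.

```python
import numpy as np

n = 10
i = np.arange(n)[:, None]  # column vector, shape (n, 1)
j = np.arange(n)[None, :]  # row vector, shape (1, n)

# Broadcasting expands both operands to (n, n) on the fly,
# so no full (n, n) index arrays are stored
cov = i * j - i
```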

As another example, let's build up a 100 x 100 "inverted cone" of values:

# The complex numbers in "mgrid" give the number of increments
# mgrid[min:max:num*1j, min:max:num*1j] is similar to
# meshgrid(linspace(min, max, num), linspace(min, max, num))
y, x = np.mgrid[-5:5:100j, -5:5:100j]

# Our "inverted cone" is just the distance from 0
r = np.hypot(x, y)

Brownian Bridge

This is a good example of something that can be easily vectorized. If I'm reading your example correctly, you'd want something similar to:

import numpy as np

st = np.mgrid[1:101, 1:101]
s, t = st
cov = st.min(axis=0) - s * t
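An equivalent sketch that skips the index grids uses the outer forms of the two operations. The last few lines are only a back-of-the-envelope size estimate for the 2*10^4 x 2*10^4 case mentioned in the question, assuming a dense float64 array:

```python
import numpy as np

t = np.arange(1, 101)  # s and t both run over 1..100

# np.minimum.outer(t, t)[a, b] is min(t[a], t[b])
# np.outer(t, t)[a, b] is t[a] * t[b]
cov = np.minimum.outer(t, t) - np.outer(t, t)

# Rough size of a dense float64 matrix at the scale in the question
n_big = 2 * 10**4
approx_bytes = n_big**2 * np.dtype(np.float64).itemsize
print(approx_bytes / 1e9, "GB")
```

At that scale the result alone is a few gigabytes, so it is worth avoiding extra full-size temporaries such as the mgrid index arrays.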

Overall, I've only touched on a few general patterns. However, hopefully this gets you pointed in the right direction.