Separate mixture of gaussians in Python

Belegnar picture Belegnar · Jan 7, 2013 · Viewed 7.5k times · Source

There is a result of some physical experiment, which can be represented as a histogram [i, amount_of(i)]. I suppose that result can be estimated by a mixture of 4 - 6 Gaussian functions.

Is there a package in Python which takes a histogram as an input and returns the mean and variance of each Gaussian distribution in the mixture distribution?

Original data, for example:

Sample data

Answer

David Robinson picture David Robinson · Jan 7, 2013

This is a mixture of gaussians, and can be estimated using an expectation maximization approach (basically, it finds the centers and means of the distribution at the same time as it is estimating how they are mixed together).

This is implemented in the PyMix package. Below I generate an example of a mixture of normals, and use PyMix to fit a mixture model to them, including figuring out what you're interested in, which is the size of subpopulations:

# requires numpy and PyMix (matplotlib is just for making a histogram)
import random
import numpy as np
from matplotlib import pyplot as plt
import mixture

random.seed(010713)  # to make it reproducible

# create a mixture of normals:
#  1000 from N(0, 1)
#  2000 from N(6, 2)
mix = np.concatenate([np.random.normal(0, 1, [1000]),
                      np.random.normal(6, 2, [2000])])

# histogram:
plt.hist(mix, bins=20)
plt.savefig("mixture.pdf")

All the above code does is generate and plot the mixture. It looks like this:

enter image description here

Now to actually use PyMix to figure out what the percentages are:

data = mixture.DataSet()
data.fromArray(mix)

# start them off with something arbitrary (probably based on a guess from the figure)
n1 = mixture.NormalDistribution(-1,1)
n2 = mixture.NormalDistribution(1,1)
m = mixture.MixtureModel(2,[0.5,0.5], [n1,n2])

# perform expectation maximization
m.EM(data, 40, .1)
print m

The output model of this is:

G = 2
p = 1
pi =[ 0.33307859  0.66692141]
compFix = [0, 0]
Component 0:
  ProductDist: 
  Normal:  [0.0360178848449, 1.03018725918]

Component 1:
  ProductDist: 
  Normal:  [5.86848468319, 2.0158608802]

Notice it found the two normals quite correctly (one N(0, 1) and one N(6, 2), approximately). It also estimated pi, which is the fraction in each of the two distributions (you mention in the comments that's what you're most interested in). We had 1000 in the first distribution and 2000 in the second distribution, and it gets the division almost exactly right: [ 0.33307859 0.66692141]. If you want to get this value directly, do m.pi.

A few notes:

  • This approach takes a vector of values, not a histogram. It should be easy to convert your data into a 1D vector (that is, turn [(1.4, 2), (2.6, 3)] into [1.4, 1.4, 2.6, 2.6, 2.6])
  • We had to guess the number of gaussian distributions in advance (it won't figure out a mix of 4 if you ask for a mix of 2).
  • We had to put in some initial estimates for the distributions. If you make even remotely reasonable guesses it should converge to the correct estimates.