Labelling new data using trained Gaussian Mixture Model

Samo Jerom · Oct 8, 2014 · Viewed 8.2k times

I am not sure how to do prediction for some new data using a trained Gaussian Mixture Model (GMM). For example, I have some labelled data drawn from 3 different classes (clusters). For each class of data points, I fit a GMM (gm1, gm2 and gm3). Suppose we know the number of Gaussian mixture components for each class (e.g., k1=2, k2=1 and k3=3), or that it can be estimated (optimised) using the Akaike information criterion (AIC). Then, when I get a new dataset, how can I tell whether it more likely belongs to class 1, 2 or 3?
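(As an aside, here is a minimal sketch of how k could be chosen per class via AIC, using the AIC property of the fitted gmdistribution; maxK, the loop and the variable names are only illustrative, and X1 is the class-1 data generated in the script below.)

% Sketch: choose the number of components for one class by minimising AIC
maxK   = 5;
AICs   = zeros(1, maxK);
models = cell(1, maxK);
for k = 1:maxK
    models{k} = fitgmdist(X1, k, 'Replicates', 5);   % a few EM restarts per k
    AICs(k)   = models{k}.AIC;                       % AIC property of the fitted gmdistribution
end
[~, bestK] = min(AICs);
gm1 = models{bestK};                                 % keep the model with the lowest AIC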

Some Matlab script shows what I mean:

clc; clf; clear all; close all;

%% Create some artificial training data

% 1. Cluster 1 with a mixture of two Gaussians (k1 = 2)
rng default;  % For reproducibility
mu1                 = [1 2];
sigma1              = [3 .2; .2 2];
mu2                 = [-1 -2];
sigma2              = [2 0; 0 1];
X1                  = [mvnrnd(mu1,sigma1,200); mvnrnd(mu2,sigma2,100)];

options1            = statset('Display', 'final');
k1                  = 2;
gm1                 = fitgmdist(X1, k1, 'Options', options1);


% 2. Cluster 2 with a single Gaussian (k2 = 1)
mu3                 = [6 4];
sigma3              = [3 .1; .1 4];
X2                  = mvnrnd(mu3,sigma3,300);

options2            = statset('Display', 'final');
k2                  = 1;
gm2                 = fitgmdist(X2, k2, 'Options', options2);

% 3. Cluster 3 with a mixture of three Gaussians (k3 = 3)
mu4                 = [-5 -6];
sigma4              = [1 .1; .1 1];
mu5                 = [-5 -10];
sigma5              = [6 .1; .1 1];
mu6                 = [-2 -15];
sigma6              = [8 .1; .1 4];
X3                  = [mvnrnd(mu4,sigma4,200); mvnrnd(mu5,sigma5,300); mvnrnd(mu6,sigma6,100)];

options3            = statset('Display', 'final');
k3                  = 3;
gm3                 = fitgmdist(X3, k3, 'Options', options3);

% Display
figure,
scatter(X1(:,1),X1(:,2),10,'ko'); hold on;
ezcontour(@(x,y)pdf(gm1, [x y]), [-12 12], [-12 12]);
scatter(X2(:,1),X2(:,2),10,'ko');
ezcontour(@(x,y)pdf(gm2, [x y]), [-12 12], [-12 12]);
scatter(X3(:,1),X3(:,2),10,'ko');
ezcontour(@(x,y)pdf(gm3, [x y]), [-12 12], [-12 12]); hold off;

We can get the figure:

Trained GMM

Then we have got some new testing data for example:

%% Create some artificial testing data
mut1                = [6.1 3.8];
sigmat1             = [3.1 .1; .1 4.2];
mut2                = [5.8 4.5];
sigmat2             = [2.8 .1; .1 3.8];
Xt1                 = [mvnrnd(mut1,sigmat1,500); mvnrnd(mut2,sigmat2,100)];

figure,
scatter(Xt1(:,1),Xt1(:,2),10,'ko');
xlim([-12 12]); ylim([-12 12]);

Testing data

I made the testing data similar to the Cluster 2 data on purpose. After training the GMMs, can we somehow predict the label of the new testing data? Is it possible to get probabilities for each class, such as (p1 = 18%, p2 = 80% and p3 = 2%)? Since p2 = 80%, we could then make the hard classification that the new testing data is labelled as Cluster 2.

p.s.: I have found this post but it seems too theoretical to me (A similar post). If you can, please put some simple MATLAB script in your reply.

Thanks very much. A.


EDIT:

Since Amro replied with a solution to the problem, I have a few more questions.

  1. Amro created a new GMM using the entire dataset with some initialisation:

    % initial parameters of the new GMM (combination of the previous three)
    % (note PComponents is normalized according to proportion of data in each subset)
    S = struct('mu',[gm1.mu; gm2.mu; gm3.mu], ...
      'Sigma',cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma), ...
      'PComponents',[gm1.PComponents*n1, gm2.PComponents*n2, gm3.PComponents*n3]./n);
    
    % train the final model over all instances
    opts = statset('MaxIter',1000, 'Display','final');
    gmm = fitgmdist(X, k, 'Options',opts, 'Start',S);
    

    What Amro got is something like the figure below:

    Amro's result

    which may not be suitable for my data, because it splits my labelled cluster1 and mixes cluster2 with part of cluster1. This is what I am trying to avoid.

    What I present here is an artificial numerical example; in my real application, however, I deal with an image segmentation problem (for example, cluster1 is my background image and cluster2 is the object I want to separate). I therefore try to somehow 'force' separate GMMs to fit separate classes. If two clusters are far apart (for example, cluster1 and cluster3 in this example), there is no problem using Amro's method of combining all the data and fitting a single GMM. However, when we train on image data, the separation of background from object will never be perfect because of the limited resolution (the partial volume effect); it is therefore very likely that cluster1 overlaps cluster2 as shown. I think that mixing all the data and then fitting may cause problems for the subsequent prediction of new data, am I right?

    However, after a little bit of thinking, what I am trying to do now is:

    % Combine the Gaussian mixture components and form a new gmdistribution
    muAll               = [gm1.mu; gm2.mu; gm3.mu]; 
    sigmaAll            = cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma);
    
    gmAll               = gmdistribution(muAll, sigmaAll);
    
    pt1                 = posterior(gmAll, Xt1);
    

    What do you guys think? Or is it equivalent to Amro's method? If so, is there a method to keep my trained GMMs separate? (A sketch of this construction with the mixing proportions included is given after this list.)

  2. Also, I have a question about the rationale of using the posterior function. Essentially, I want to estimate the likelihood of my testing data given the GMM fit, so why do we calculate the posterior probability here? Or is it just a naming issue (in other words, 'posterior probability' = 'likelihood')?

  3. As far as I know, GMM has always been used as an unsupervised method. Someone even described GMM to me as a probabilistic version of k-means clustering. Is it legitimate to use it in such a 'supervised' style? Any recommended papers or references?
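Regarding question 1 above: what I have in mind is something like the sketch below (my own attempt, not from Amro's answer), where the three trained models are combined into one gmdistribution without any refitting and the mixing proportions are passed as the third argument to gmdistribution; weighting them by the subset sizes is just my assumption.

% Sketch: combine the trained models without refitting, keeping the
% mixing proportions weighted by the amount of training data per class
n1 = size(X1,1); n2 = size(X2,1); n3 = size(X3,1);
muAll    = [gm1.mu; gm2.mu; gm3.mu];
sigmaAll = cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma);
pAll     = [gm1.PComponents*n1, gm2.PComponents*n2, gm3.PComponents*n3] ./ (n1+n2+n3);
gmAll    = gmdistribution(muAll, sigmaAll, pAll);  % components stay exactly as trained
pt1      = posterior(gmAll, Xt1);                  % component-membership probabilities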

Thanks very much again for your reply! A.

Answer

Amro · Oct 9, 2014

Effectively you have trained three GMM models, not one, each being a mixture itself. Typically you would create one GMM with multiple components, where each component represents a cluster...

So what I would do in your case is create a new GMM model trained on the entire dataset (X1, X2, and X3), with the number of components equal to the total of all components from the three GMMs (that is, 2+1+3 = 6 Gaussian components). This model would be initialised using the parameters of the individually trained ones.

Here is the code to illustrate (I'm using the same variables you created in your example):

% number of instances in each data subset
n1 = size(X1,1);
n2 = size(X2,1);
n3 = size(X3,1);

% the entire dataset
X = [X1; X2; X3];
n = n1 + n2 + n3;
k = k1 + k2 + k3;

% initial parameters of the new GMM (combination of the previous three)
% (note PComponents is normalized according to proportion of data in each subset)
S = struct('mu',[gm1.mu; gm2.mu; gm3.mu], ...
  'Sigma',cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma), ...
  'PComponents',[gm1.PComponents*n1, gm2.PComponents*n2, gm3.PComponents*n3]./n);

% train the final model over all instances
opts = statset('MaxIter',1000, 'Display','final');
gmm = fitgmdist(X, k, 'Options',opts, 'Start',S);

% display GMM density function over training data
line(X(:,1), X(:,2), 'LineStyle','none', ...
    'Marker','o', 'MarkerSize',1, 'Color','k')
hold on
ezcontour(@(x,y) pdf(gmm,[x y]), xlim(), ylim())
hold off
title(sprintf('GMM over %d training instances',n))

trained_GMM


Now that we have trained one GMM model over the entire training dataset (with k=6 mixtures), we can use it to cluster new data instances:

cIdx = cluster(gmm, Xt1);

This is the same as manually computing the posterior probability of components, and taking the component with the largest probability as cluster index:

pr = posterior(gmm, Xt1);
[~,cIdx] = max(pr,[],2);

As expected almost 95% of the test data was clustered as belonging to the same component:

>> tabulate(cIdx)
  Value    Count   Percent
      1       27      4.50%
      2        0      0.00%
      3      573     95.50%

Here are the matching Gaussian parameters:

>> idx = 3;
>> gmm.mu(idx,:)
ans =
    5.7779    4.1731
>> gmm.Sigma(:,:,idx)
ans =
    2.9504    0.0801
    0.0801    4.0907

This indeed corresponds to the component in the upper-right side from the previous figure.

Similarly, if you inspect the other component idx=1, it will be the one just to the left of the previous one, which explains how 27 out of the 600 test instances were "misclassified", if you will... Here is how confident the GMM was about those instances:

>> pr(cIdx==1,:)
ans =
    0.9813    0.0001    0.0186    0.0000    0.0000    0.0000
    0.6926    0.0000    0.3074    0.0000    0.0000    0.0000
    0.5069    0.0000    0.4931    0.0000    0.0000    0.0000
    0.6904    0.0018    0.3078    0.0000    0.0000    0.0000
    0.6954    0.0000    0.3046    0.0000    0.0000    0.0000
    <... output truncated ...>
    0.5077    0.0000    0.4923    0.0000    0.0000    0.0000
    0.6859    0.0001    0.3141    0.0000    0.0000    0.0000
    0.8481    0.0000    0.1519    0.0000    0.0000    0.0000

Here are the test instances overlaid on top of the previous figure:

hold on
gscatter(Xt1(:,1), Xt1(:,2), cIdx)
hold off
title('clustered test instances')

clustered_test_data


EDIT:

My example above was meant to show how to use GMMs for clustering data (unsupervised learning). From what I understand now, what you want instead is to classify data using existing trained models (supervised learning). I guess I was confused by your use of the term "clusters" :)

Anyway, it should be easy now; just compute the class-conditional probability density of the test data under each model, and pick the model with the highest likelihood as the class label (no need to combine the models into one).

So continuing on your initial code, that would simply be:

p = [pdf(gm1,Xt1), pdf(gm2,Xt1), pdf(gm3,Xt1)];  % P(x|model_i)
[~,cIdx] = max(p,[],2);                          % argmax_i P(x|model_i)

cIdx is the class prediction (1, 2, or 3) of each instance in the test data.
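
If you also want class probabilities of the kind mentioned in the question (p1, p2, p3 summing to 1), one option (just a sketch, assuming class priors proportional to the training subset sizes n1, n2 and n3 defined above) is to normalise the class-conditional densities via Bayes' rule: P(class_i | x) is proportional to p(x | model_i) * P(class_i). This is also what distinguishes the likelihood pdf(gm_i, x) from a posterior probability:

prior = [n1, n2, n3] ./ n;                           % estimated class priors P(class_i)
lik   = [pdf(gm1,Xt1), pdf(gm2,Xt1), pdf(gm3,Xt1)];  % class-conditional densities p(x|model_i)
post  = bsxfun(@times, lik, prior);                  % unnormalised posteriors
post  = bsxfun(@rdivide, post, sum(post,2));         % each row now sums to 1
[~,cIdxPost] = max(post,[],2);                       % hard class labels, taking the priors into account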