Finite Mixture Modelling Model Specification, Estimation & Application

Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Finite mixture models

The finite mixture distribution is given by


K
X

H(y |x, ) =

k Fk (y |x, k )

k=1

with

Finite Mixture Modelling

K
X

Model Specification, Estimation & Application

k=1

Bettina Grun

k = 1

k > 0 k.

In the following it is assumed that the component specific density functions


fk exist and determine the mixture density h.

Department of Statistics and Mathematics


Research Seminar, November 23 2007

Finite mixture models

Finite mixture models

Types of applications:

semi-parametric tool to estimate general distribution functions

modeling unobserved heterogeneity

Special cases:

model-based clustering
5

mixtures of regression models

10

Finite mixture models

50

20

0
2

10

10

yn

40

30

Finite mixture models

10

Finite mixture models

Estimation
Maximum-Likelihood: Expectation-Maximization (EM) Algorithm (Dempster, Laird and Rubin, 1977)

50

40

30

yn

20

10

Bayesian: Gibbs sampling (Diebolt and Robert, 1994)

General method for ML estimation in models with unobserved


latent variables: The complete likelihood containing the observed
and unobserved data is easier to estimate.
Iterates between
E-step, which computes the expectation of the complete
likelihood, and
M-step, where the expected complete likelihood is maximized.

6
x

10

Markov Chain Monte Carlo algorithm


Applicable when the joint posterior distribution is not known
explicitly, but the conditional posterior distributions of each
variable/subsets of variables are known.

Missing data

EM algorithm: E-step

The component-label vectors zn = (znk )k=1,...,K are treated as missing


data. It holds that

Given the current parameter estimates (i) replace the missing data znk
by the estimated a-posteriori probabilities
(i)

(i)

znk = P(k|yn, xn, (i)) =

znk {0, 1} and

(i)

k fk (yn|xn, k )
K
X

(i)
(i)
u fk (yn|xn, u )

u=1

P
K
k=1 znk = 1 for all k = 1, . . . , K.

The conditional expectation of log Lc() at the ith step is given by


The complete log-likelihood is given by
log Lc() =

N
K X
X

znk [log k + log fk (yn|xn, k )] .

Q(; (i)) = E(i) [log Lc()|y , x]


=

(i)

znk [log k + log fk (yn|xn, k )] .

k=1 n=1

k=1 n=1

EM algorithm: M-step

K X
N
X

M-step: Mixtures of Gaussian distributions

The next parameter estimate is given by:


(i+1) = arg max Q(; (i)).

The solutions for the M-step are given in closed form:


PN

The estimates for the prior class probabilities are given by:
(i+1)

N
1 X
(i)
z .
N n=1 nk

The component specific parameter estimates are determined by:

(i+1)
= arg max
k
k

N
X

(i)

znk log(fk (yn|xn, k )).

n=1

weighted ML estimation of the component specific model.

n=1
(i+1)
= P
k
N

(i)

znk yn

(i)
nk
n=1 z

(i+1)
k
=

PN
(i)
(i+1)
(i+1) 0
nk (yn k
)(yn k
)
n=1 z
PN
(i)
nk
n=1 z

Estimation: EM algorithm
Advantages:

EM algorithm: Number of components


Information criteria: e.g. AIC, BIC, ICL

The likelihood is increased in each step EM algorithm converges


for bounded likelihoods.
Relatively easy to implement:
Different mixture models require only different M-steps.
Weighted ML estimation of the component specific model is
sometimes already available.
Disadvantages:
Standard errors have to be determined separately as the information
matrix is not required during the algorithm.
Convergence only to a local optimum
Slow convergence

Likelihood ratio test statistic: Comparison of nested models where the


smaller model is derived by fixing one parameter at the border of the
parameter space.
Regularity conditions are not fulfilled.
The asymptotic null distribution is not the usual 2-distribution with
degrees of freedom equal to the difference beween the number of
parameters under the null and alternative hypotheses.
distributional results for special cases
bootstrapping

variants such as Stochastic EM (SEM) or Classification EM (CEM)

Bayesian estimation

Estimation: Gibbs sampling

Determine the posterior density using Bayes theorem

Starting with Z 0 = (zn0)n=1,...,N repeat the following steps for i =


1, . . . , I0, . . . , I + I0.

p(|Y , X ) h(Y |X , )p(),


where p() is the prior and Y = (yn)n and X = (xn)n.
Standard prior distributions:
Proper priors: Improper priors give improper posteriors.
Independent priors for the component weights and the component
specific parameters.
Conjugate priors for the complete likelihood
Dirichlet distribution D(e0,1, . . . , e0,K ) for the component weights
which is the conjugate prior for the multinomial distribution.
Priors on the component specific parameters depend on the
underlying distribution family.
Invariant priors, e.g. the parameter for the Dirchlet prior is constant
over all components: e0,k e0.

1. Parameter simulation conditional on the classification Z (i1):


P
(i1)
(a) Sample 1, . . . , K from D(( N
+ e0,k )k=1,...,K ).
n=1 znk
(b) Sample component specific parameters from the complete-data
posterior p(1, . . . , K |Z (i1), Y )
(i) (i)
Store the actual values of all parameters (i) = (k , k )k=1,...,K .
2. Classification of each observation (yn, xn) conditional on knowing
(i):
(i)
Sample zn from the multinomial distribution with parameter equal to
the posterior probabilities.
After discarding the burn-in draws the draws I0 + 1, . . . , I + I0 can be
used to approximate all quantities of interest.

Example: Gaussian distribution

Estimation: Gibbs sampling

Assume an independence prior


1
p(k , 1
k ) fN (k ; b0 , B0 )fW (k ; c0 , C0 ).

1. Parameter simulation conditional on the classification Z (i1):


(i)

(i)

(i1)

(a) Sample 1 , . . . , K from D(( N


+ e0,k )k=1,...,K ).
n=1 znk
(i) in each group k from a Wishart
(b) Sample (1
)
k
W(ck (Z (i1)), Ck (Z (i1))) distribution.
(i)
(c) Sample k in each group k from a N (bk (Z (i1)), Bk (Z (i1)))
distribution.
P

Advantages:
Relatively easy to implement
Different mixture models differ only in the parameter simulation
step.
Parameter simulation conditional on the classification is sometimes
already available.
Disadvantages:

2. Classification of each observation yn conditional on knowing (i):

Might fail to escape the attraction area of one mode not all posterior
modes are visited.

(i)
P(znk
= 1|yn, (i)) k fN (yn; k , k )

Gibbs sampling: Number of components

Bayes factors

Label switching

The posterior distribution is invariant under a permutation of the


components with the same component-specific model.
Determine a unique labelling for component-specific inference:

Sampling schemes with a varying number of components


reversible-jump MCMC
inclusion of birth-and-death processes

Impose a suitable ordering constraint, e.g. s < t s, t {1, . . . , S}


with s < t.
Minimize the distance to the Maximum-A-Posteriori (MAP) estimate.
Fix the component membership for some observations.
Relabelling algorithms.

Initialization

Extensions and special cases


Model-based clustering:
Latent class analysis: multivariate discrete observations where the
marginal distributions in the components are independent.
mixtures of factor analyzers
mixtures of t-distributions

Construct a suitable parameter vector (0).


random
other estimation methods: e.g. moment estimators

Classify observations/assign a-posteriori probabilities to each observation.

Mixtures of regressions:
mixtures of generalized linear models
mixtures of generalized linear mixed models

random
cluster analysis results: e.g. hierarchical clustering, k-means

Covariates for the component sizes: concomitant variable models


Impose equality constraints between component-specific parameters

Software in R

Software: FlexMix

Model-based clustering:
mclust (Fraley and Raftery, 2002) for Gaussian mixtures:
specify different models depending on the structure of the
variance-covariance matrices (volume, shape, orientation)
k = k Dk diag(ak )Dk0
initialize EM algorithm with the solution from an agglomerative
hierarchical clustering algorithm

Clusterwise regression:
flexmix (Leisch, 2004)

See also CRAN Task View Cluster Analysis & Finite Mixture Models.

The function flexmix() provides the E-step and all data handling.
The M-step is supplied by the user similar to glm() families.
Multiple independent responses from different families
Currently bindings to several GLM families exist (Gaussian, Poisson,
Gamma, Binomial)
Weighted, hard (CEM) and random (SEM) classification
Components with prior probability below a user-specified threshold are
automatically removed during iteration

FlexMix Design

Example: Clustering

Primary goal is extensibility: ideal for trying out new mixture models.
No replacement of specialized mixtures like mclust(), but complement.
Usage of S4 classes and methods
Formula-based interface
Multivariate responses:
combination of univariate families: assumption of independence
(given x), each response may have its own model formula, i.e.,
a different set of regressors
multivariate families: if family handles multivariate response directly,
then arbitrary multivariate response distributions are possible

Example: Clustering

Example: Clustering
>
+
+
1
2
3
4
5

600

1000

700

500

400

100
0
500

1000
insulin

1500

Call:
stepFlexmix(diabetes_data ~ 1, model = FLXMCmvnorm(diag = FALSE),
k = 1:5, nrep = 10)

300

200

(mix <- stepFlexmix(diabetes_data ~ 1, k = 1:5,


model = FLXMCmvnorm(diag = FALSE),
nrep = 10))
: * * * * * * * * * *
: * * * * * * * * * *
: * * * * * * * * * *
: * * * * * * * * * *
: * * * * * * * * * *

sspg

500

500

sspg

> library("flexmix")
> data("diabetes", package = "mclust")
> diabetes_data <- as.matrix(diabetes[, 2:4])

100

200
glucose

300

1
2
3
4
5
400

iter converged k k0
logLik
AIC
BIC
ICL
2
TRUE 1 1 -2545.833 5109.666 5136.456 5136.456
12
TRUE 2 2 -2354.674 4747.347 4803.905 4811.644
24
TRUE 3 3 -2303.557 4665.113 4751.439 4770.353
36
TRUE 4 4 -2287.605 4653.210 4769.302 4793.502
60
TRUE 5 5 -2274.655 4647.309 4793.169 4822.905

500

> plot(mix)

Example: Clustering

Example: Clustering

AIC
BIC
ICL

5000

5100

4800

4900

prior size post>0 ratio


Comp.1 0.540
82
101 0.812
Comp.2 0.199
28
96 0.292
Comp.3 0.261
35
123 0.285

4700

log Lik. -2303.557 (df=29)


AIC: 4665.113
BIC: 4751.439

number of components

Example: Clustering

Example: Clustering

700

> table(cluster(getModel(mix)), diabetes$class)


chemical normal overt
1
10
72
0
2
1
0
27
3
25
4
6
> parameters(mix_best, component = 1, simplify = FALSE)
$center
glucose
insulin
sspg
91.00937 358.19098 164.14443

sspg

300

200

400

500

600

1000
500

$cov
glucose
insulin
sspg
glucose 58.21456
80.1404
16.8295
insulin 80.14039 2154.9810 347.6972
sspg
16.82950 347.6972 2484.1538
> plot(mix_best, mark = 2)

100

500

sspg

500

1000
insulin

1500

Cluster sizes:
1 2 3
82 28 35
convergence after 24 iterations
> summary(mix_best)
Call:
stepFlexmix(diabetes_data ~ 1, model = FLXMCmvnorm(diag = FALSE),
k = 3, nrep = 10)

> (mix_best <- getModel(mix))


Call:
stepFlexmix(diabetes_data ~ 1, model = FLXMCmvnorm(diag = FALSE),
k = 3, nrep = 10)

100

200
glucose

300

400

500

Example: Clustering

Example: Regression

Rootogram of posterior probabilities > 1e04


0.0

0.2

Comp. 1

0.4

0.6

0.8

1.0

Comp. 2

Comp. 3
25

20

15

10

Number of infected plants

0
0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4

0.6

0.8

1.0

50

100

150

200

250

Number of aphids

Example: Regression

> data("aphids", package = "mixreg")


> (mix <- stepFlexmix(n.inf ~ n.aphids, k = 2, data = aphids,
+
nrep = 10))
2 : * * * * * * * * * *
Call:
stepFlexmix(n.inf ~ n.aphids, data = aphids, k = 2, nrep = 10)
Cluster sizes:
1 2
23 28
convergence after 17 iterations

Example: Regression

> posterior(mix)[1:4,]
[,1]
[,2]
[1,] 0.9949732 0.005026814
[2,] 0.9949769 0.005023128
[3,] 0.2098020 0.790198026
[4,] 0.2050383 0.794961704
> predict(mix, newdata = data.frame(n.aphids = c(0, 300)))
$Comp.1
[,1]
1 3.458813
2 20.047842
$Comp.2
[,1]
1 0.8679776
2 1.5740946

300

Example: Regression

Example: Regression

25

20

> refit(mix)
Call:
refit(mix)
Number of components: 2

15

$Comp.1

Estimate Std. Error z value Pr(>|z|)


(Intercept) 3.4585759 1.3730364 2.5189
0.01177
n.aphids
0.0552974 0.0090624 6.1019 1.048e-09

10

Number of infected plants

$Comp.2

50

100

150

200

250

Estimate Std. Error z value Pr(>|z|)


(Intercept) 0.8679003 0.5017007 1.7299 0.08365
n.aphids
0.0023539 0.0035375 0.6654 0.50578
> plot(refit(mix))

300

Number of aphids

Example: Regression
Comp. 1

Applications
Comp. 2

Market segmentation: find groups of customers who share


(Intercept)

characteristics: e.g. groups of tourists with similar behaviours at


their destination
reactions: e.g. customers with similar price and other marketing
mix elasticities in choice models
account for heterogeneity between customers

n.aphids

develop segment-specific marketing strategies

Monographs

References

D. Bohning.Computer
Assisted Analysis of Mixtures and Applications:
Meta-Analysis, Disease Mapping, and Others. Chapman & Hall/CRC,
London, 1999.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from


incomplete data via the EM-algorithm. Journal of the Royal Statistical
Society B, 39:138, 1977.

S. Fruhwirth-Schnatter.
Finite Mixture and Markov Switching Models.

Springer Series in Statistics. Springer, New York, 2006.

J. Diebolt and C. P. Robert. Estimation of finite mixture distributions through


Bayesian sampling. Journal of the Royal Statistical Society B, 56:363375,
1994.

B. G. Lindsay. Mixture Models: Theory, Geometry, and Applications. The


Institute for Mathematical Statistics, Hayward, California, 1995.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and
Applications to Clustering. Marcel Dekker, New York, 1988.
G. J. McLachlan and D. Peel. Finite Mixture Models. Wiley, 2000.

C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis


and density estimation. Journal of the American Statistical Association, 97
(458):611631, 2002.
F. Leisch. FlexMix: A general framework for finite mixture models and
latent class regression in R. Journal of Statistical Software, 11(8), 2004.
URL http://www.jstatsoft.org/v11/i08/.

You might also like