19 Aos1924

Download as pdf or txt
Download as pdf or txt
You are on page 1of 22

The Annals of Statistics

2020, Vol. 48, No. 6, 3161–3182


https://doi.org/10.1214/19-AOS1924
© Institute of Mathematical Statistics, 2020

SINGULARITY, MISSPECIFICATION AND


THE CONVERGENCE RATE OF EM

B Y R AAZ DWIVEDI1,* , N HAT H O1,† , KOULIK K HAMARU2 ,


M ARTIN J. WAINWRIGHT1,2,3,‡ , M ICHAEL I. J ORDAN1,2,§ AND B IN Y U1,2,¶
1 Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, * [email protected];
[email protected]; ‡ [email protected]; § [email protected]; ¶ [email protected]
2 Department of Statistics, University of California, Berkeley, [email protected]
3 Voleon Group

A line of recent work has analyzed the behavior of the Expectation-


Maximization (EM) algorithm in the well-specified setting, in which the
population likelihood is locally strongly concave around its maximizing ar-
gument. Examples include suitably separated Gaussian mixture models and
mixtures of linear regressions. We consider over-specified settings in which
the number of fitted components is larger than the number of components in
the true distribution. Such mis-specified settings can lead to singularity in the
Fisher information matrix, and moreover, the maximum likelihood estimator
1
based on n i.i.d. samples in d dimensions can have a nonstandard O((d/n) 4 )
rate of convergence. Focusing on the simple setting of two-component mix-
tures fit to a d-dimensional Gaussian distribution, we study the behavior of
the EM algorithm both when the mixture weights are different (unbalanced
case), and are equal (balanced case). Our analysis reveals a sharp distinction
between these two cases: in the former, the EM algorithm converges geomet-
1
rically to a point at Euclidean distance of O((d/n) 2 ) from the true parameter,
whereas in the latter case, the convergence rate is exponentially slower, and
1
the fixed point has a much lower O((d/n) 4 ) accuracy. Analysis of this singu-
lar case requires the introduction of some novel techniques: in particular, we
make use of a careful form of localization in the associated empirical process,
and develop a recursive argument to progressively sharpen the statistical rate.

1. Introduction. The growth in the size and scope of modern data sets has presented the
field of statistics with a number of challenges, one of them being how to deal with various
forms of heterogeneity. Mixture models provide a principled approach to modeling hetero-
geneous collections of data (that are usually assumed i.i.d.). In practice, it is frequently the
case that the number of mixture components in the fitted model does not match the number of
mixture components in the data-generating mechanism. It is known that such mismatch can
lead to substantially slower convergence rates for the maximum likelihood estimate (MLE)
for the underlying parameters. In contrast, relatively less attention has been paid to the com-
putational implications of this mismatch. In particular, the algorithm of choice for fitting
finite mixture models is the Expectation-Maximization (EM) algorithm, a general framework
that encompasses various types of divide-and-conquer computational strategies. The goal of
this paper is to gain a fundamental understanding of the behavior of EM when used to fit
overspecified mixture models.

Statistical issues with overspecification. While density estimation in finite mixture mod-
els is relatively well understood [12, 26], characterizing the behavior of maximum likelhood

Received September 2019.


MSC2020 subject classifications. Primary 62F15, 62G05; secondary 62G20.
Key words and phrases. Mixture models, expectation-maximization, Fisher information matrix, empirical pro-
cess, nonasymptotic convergence guarantees, localization argument.
3161
3162 R. DWIVEDI ET AL.

for parameter estimation has remained challenging. The main difficulty for analyzing the
MLE in such settings arises from label switching between the mixtures [23, 25], and lack
of strong concavity in the likelihood. Such issues do not interfere with density estimation,
since the standard divergence measures like the Kullback–Leibler and Hellinger distances
remain invariant under permutations of labels, and strong concavity is not required. An im-
portant contribution to the understanding of parameter estimation in finite mixture models
was made by Chen [5]. He considered a class of overspecified finite mixture models; here,
the term “overspecified” means that the model to be fit has more mixture components than
1
the distribution generating the data. In an interesting contrast to the usual n− 2 convergence
rate for the MLE based on n samples, Chen showed that for estimating scalar location param-
eters in a certain class of overspecified finite mixture models, the corresponding rate slows
1
down to n− 4 . This theoretical result has practical significance, since methods that overspec-
ify the number of mixtures are often more feasible than methods that first attempt to estimate
the number of components, and then estimate the parameters using the estimated number of
components [24]. In subsequent work, Nguyen [21] and Heinrich et al. [14] have character-
ized the (minimax) convergence rates of parameter estimation rates for mixture models in
both exactly-fitted or overspecified settings in terms of the Wasserstein distance.

Computational concerns with mixture models. While the papers discussed above address
the statistical behavior of a global maximum of the log-likelihood, they do not consider the
associated computational issues of obtaining such a maximum. In general settings, noncon-
vexity of the log-likelihood makes it impossible to guarantee that the iterative algorithms used
in practice converge to the global optimum, or equivalently the MLE. Perhaps the most widely
used algorithm for computing the MLE is the expectation-maximization (EM) algorithm [8].
Early work on the EM algorithm [29] showed that its iterates converge asymptotically to a
local maximum of the log-likelihood function for a broad class of incomplete data models;
this general class includes the fitting of mixture models as a special case. The EM algorithm
has also been studied in the specific setting of Gaussian mixture models; here, we find re-
sults both for the population EM algorithm, which is the idealized version of EM based on
an infinite sample size, as well as the usual sample-based EM algorithm that is used in prac-
tice. For Gaussian mixture models, the population EM algorithm is known to exhibit various
convergence rates, ranging from linear to superlinear (quasi-Newton like) convergence if the
overlap between the mixture components tends to zero [20, 31]. It has also been noted in sev-
eral papers [20, 22] that the convergence of EM can be prohibitively slow when the mixtures
are not well separated.

Prior work on EM. Balakrishnan et al. [1] laid out a general theoretical framework for
analysis of the EM algorithm, and in particular how to prove nonasymptotic bounds on the
Euclidean distance between sample-based EM iterates and the true parameter. When applied
to the special case of two-component Gaussian location mixtures, assumed to be well spec-
ified and suitably separated, their theory guarantees that (1) population EM updates enjoy a
geometric rate of convergence to the true parameter when initialized in a sufficiently small
neighborhood around the truth, and (2) sample-based EM updates converge to an estimate at
1
Euclidean distance of order (d/n) 2 , based on n i.i.d. draws from a finite mixture model in Rd .
Further work in this vein has characterized the behavior of EM in a variety of settings for two
Gaussian mixtures, including convergence analysis with additional sparsity constraints [13,
28, 33], global convergence of population EM [30], guarantees of geometric convergence
under less restrictive conditions on the two mixture components [7, 16], analysis of EM
with unknown mixture weights, means and covariances for two mixtures [3] and the analysis
of EM to more than two Gaussian components [13, 32]. Other related work has provided
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3163

optimization-theoretic guarantees for EM by viewing it in a generalized surrogate function


framework [18], and analyzed the statistical properties of confidence intervals based on an
EM estimator [6].
An assumption common to all of this previous work is that there is no misspecification in
the fitting of the Gaussian mixtures; in particular, it is assumed that the data is generated from
a mixture model with the same number of components as the fitted model. A portion of our
recent work [9] has shown that EM retains its fast convergence behavior—albeit to a biased
estimate—in underspecified settings where the number of components in the fitted model are
less than that in the true model. However, as noted above, in practice, it is most common to
use overspecified mixture models. For these reasons, it is desirable to understand how the EM
algorithm behaves in the overspecified settings.

Our contributions. The goal of this paper is to shed some light on the nonasymptotic
performance of the EM algorithm for overspecified mixtures. We provide a comprehensive
study of overspecified mixture models when fit to a particularly simple (nonmixture) data-
generating mechanism; a multivariate normal distribution N (0, σ 2 Id ) in d dimensions with
known scale parameter σ > 0. This setting, despite its simplicity, suffices to reveal some
rather interesting properties of EM in the overspecified context. In particular, we obtain the
following results:
• Two-mixture unbalanced fit: For our first model class, we study a mixture of two
location-Gaussian distributions with unknown location, known variance and known unequal
weights for the two components. For this case, we establish that the population EM updates
converge at a geometric rate to the true parameter; as an immediate consequence, the sample-
1
based EM algorithm converges in O(log(n/d)) steps to a ball of radius (d/n) 2 . The fast
convergence rate of EM under the unbalanced setting provides an antidote to the pessimistic
belief that statistical estimators generically exhibit slow convergence for overspecified mix-
tures.
• Two-mixture balanced fit: In the balanced version of the problem in which the mixture
weights are equal to 12 for both components, we find that the EM algorithm behaves very
differently. Beginning with the population version of the EM algorithm, we show that it con-
verges to the true parameter from an arbitrary initialization. However, the rate of convergence
varies as a function of the distance of the current iterate from the true parameter value, be-
coming exponentially slower as the iterates approach the true parameter. This behavior is in
sharp contrast to well-specified settings [1, 7, 32], where the population updates converge at
a geometric rate. We also show that our rates for population EM are tight. By combining the
slow convergence of population EM with a novel localization argument, one involving the
empirical process restricted to an annulus, we show that the sample-based EM iterates con-
1 1 1
verge to a ball of radius (d/n) 4 around the true parameter after O((n/d) 2 ) steps. The n− 4
component of the Euclidean error matches known guarantees for the global maximum of the
MLE [5]. The localization argument in our analysis is of independent interest, because such
techniques are not required in analyzing the EM algorithm in well-specified settings when the
population updates are globally contractive. We note that ball-based localization methods are
known to be essential in deriving sharp statistical rates for M-estimators (e.g., [2, 17, 26]); to
the best of our knowledge, the use of an annulus-based localization argument in analyzing an
algorithm is novel.
Moreover, we show via extensive numerical experiments that the fast convergence of EM
for the unbalanced fit is a special case; and that the slow behavior of EM proven for the
1
balanced fit (in particular the rate of order n− 4 ) arises in several general (including more than
two components) overspecified Gaussian mixtures with known variance, known or unknown
weights and unknown location parameters.
3164 R. DWIVEDI ET AL.

Organization. The remainder of the paper is organized as follows. In Section 2, we pro-


vide illustrative simulations of EM in different settings in order to motivate the settings an-
alyzed later in the paper. We then provide a thorough analysis of the convergence rates of
EM when overfitting Gaussian data with two components in Section 3 and the key ideas of
the novel proof techniques in Section 4. We provide a thorough discussion of our results in
Section 5, exploring their general applicability, and presenting further simulations that sub-
stantiate the value of our theoretical framework. Detailed proofs of our results and discussion
of certain additional technical aspects of our results are provided in the Supplementary Ma-
terial [10].

Notation. For any two sequences an and bn , the notation an  bn or an = O(bn ) means
that an ≤ Cbn for some universal constant C. Similarly, the notation an  bn or an = (bn )
denotes that both the conditions, an  bn and bn  an , hold. Throughout this paper, π denotes
a variable and π denotes the mathematical constant “pi.”

Experimental settings. We summarize a few common aspects of the numerical experi-


ments presented in the paper. Population-level computations were done using numerical in-
tegration on a sufficiently fine grid. With finite samples, the stopping criteria for the con-
vergence of EM were: (1) the change in the iterates was small enough, or (2) the number
of iterations was too large (greater than 100,000). Experiments were averaged over several
repetitions (ranging from 25 to 400). In majority of the runs, for each case, criteria (1) led to
convergence. In our plots for sample EM, we report m e + 2 se on the y-axis, where m e , 
se ,
respectively, denote the mean and standard deviation across the experiments for the metric
under consideration, for example, the parameter estimation error. Furthermore, whenever a
slope is provided, it is the slope for the least-squares fit on the log-log scale for the quantity on
y-axis when fitted with the quantity reported on the x-axis. For instance, in Figure 1(b), we
plot |
θn − θ ∗ | on the y-axis value versus the sample size n on the x-axis, averaged over 400
experiments, accounting for the deviation across these experiments. Furthermore, the green
dotted line with legend π = 0.3 and the corresponding slope −0.48 denote the least-squares
fit and the respective slope for log |θn − θ ∗ | (green solid dots) with log n for the experiments
corresponding to the setting π = 0.3.

2. Motivating simulations and problem set-up. In this section, we explore a wide


range of behavior demonstrated by EM for certain settings of overspecified location Gaussian
mixtures. We begin with several simulations that illustrate fast and slow convergence of EM
for various settings, and serve as a motivation for the theoretical results derived later in the
paper. We provide basic background on EM in Section 2.3, and describe the problems to be
tackled.

2.1. Problem set-up. Let φ(·; μ, ) denote the density of a Gaussian random vector with
mean μ and covariance . Consider the two component Gaussian mixture model with density
     
(1) f x; θ ∗ , σ, π := πφ x; θ ∗ , σ 2 Id + (1 − π)φ x; −θ ∗ , σ 2 Id .
Given n samples from the distribution (1), suppose that we use the EM algorithm to fit a two-
component location Gaussian mixture with fixed weights and variance1 and special structure
on the location parameters—more precisely, we fit the model with density
   
(2) f (x; θ, σ, π) := πφ x; θ, σ 2 Id + (1 − π)φ x; −θ, σ 2 Id

1 Refer to Section 5 for a discussion for the case of unknown weights and variances.
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3165

using the EM algorithm, and take the solution2 as an estimate of θ ∗ . An important aspect of
the problem at hand is the signal strength, which is measured as the separation between the
means of mixture components relative to the spread in the components. For the model (1),
the signal strength is given by the ratio θ ∗ 2 /σ . When this ratio is large, we refer to it as the
strong signal case; otherwise, it corresponds to the weak signal case. Of particular interest
to us is the behavior of EM in the limit of weak signal when there is no separation; that is,
θ ∗ 2 = 0. For such cases, we call the fit (2) an unbalanced fit when π = 12 and a balanced fit
when π = 12 . Note that the setting of θ ∗ = 0 corresponds to the simplest case of overspecified
fit, since the true model has just one component (standard normal distribution irrespective of
the parameter π) but the fitted model has two (one extra) component (unless the EM estimate
is also 0). We now present the empirical behavior of EM for these models and defer the
derivation of EM updates to Section 2.3.

2.2. Numerical experiments: Fast to slow convergence of EM. We begin with a numer-
ical study of the effect of separation among the mixtures on the statistical behavior of the
estimates returned by EM. Our main observation is that weak or no separation leads to rel-
atively low accuracy estimates. Additional simulations for more general mixtures, including
more than two components, are provided in Section 5.3. Next, via numerical integration on
a grid with sufficiently small discretization width, we simulate the behavior of the popula-
tion EM algorithm width—an idealized version of EM in the limit of infinite samples—in
order to understand the effect of signal strength on EM’s algorithmic rate of convergence,
that is, the number of steps needed for population EM to converge to a desired accuracy. We
observe a slow down of EM on the algorithmic front when the signal strength approaches
zero.

2.2.1. Effect of signal strength on sample EM. In Figure 1, we show simulation results
for data generated from the model (1) in dimension d = 1 and noise variance σ 2 = 1, and
for three different values of the weight π ∈ {0.1, 0.3, 0.5}. In all cases, we fit a two-location
Gaussian mixture with fixed weights and variance as specified by equation (2). The two pan-
els show the estimation error of the EM solution as a function of n for two distinct cases
of the data-generating mechanism: (a) in the strong signal case, we set θ ∗ = 5 so that the
data has two well-separated mixture components, and (b) to obtain the limiting case of no
signal, we set θ ∗ = 0, so that the two mixture components in the data-generating distribu-
tion collapse to one, and we are simply fitting the data from a standard normal distribu-
tion.
In the strong signal case, it is well known [1, 7] that EM solutions have an estimation error
(measured by the Euclidean distance between the EM estimate and the true parameter θ ∗ ) that
1
achieves the classical (parametric) rate n− 2 ; the empirical results in Figure 1(a) are simply a
confirmation of these theoretical predictions. More interesting is the case of no signal (which
is the limiting case with weak signal), where the simulation results shown in panel (b) of
Figure 1 reveal a different story. In this case, whereas the EM solution (with random standard
1
normal initialization) has an error that decays as n− 2 when π = 1/2, its error decays at the
1
considerably slower rate n− 4 when π = 1/2. We return to these cases in further detail in
Section 3.

2 Strictly speaking, different initialization of EM may converge to different estimates. For the settings analyzed
theoretically in this work, the EM always converges toward the same estimate in the limit of infinite steps, and we
use a stopping criterion to determine the final estimate. See the discussion on experimental settings in Section 1
for more details.
3166 R. DWIVEDI ET AL.

θn − θ ∗ | in the EM solution versus the sample size n, focusing on the effect of signal
F IG . 1. Plots of the error |
strength on EM solution accuracy. The true data distribution is given by πN (θ ∗ , 1) + (1 − π)N (−θ ∗ , 1) and
we use EM to fit the model πN (θ, 1) + (1 − π)N (−θ, 1), generating the EM estimate  θn based on n samples.
1
(a) When the signal is strong, the estimation rate decays at the parametric rate n− 2 , as revealed by the −1/2
slope in a least-square fit of the log error based on the log sample size log n. (b) When there is no signal (θ ∗ = 0),
1
then depending on the choice of weight π in the fitted model, we observe two distinct scalings for the error: n− 2
1
when π = 0.5, and, n− 4 when π = 0.5, again as revealed by least-squares fits of the log error using the log sample
size log n.

2.2.2. Interesting behavior of population EM. The intriguing behavior of the sample EM
algorithm in the “no signal” case motivated us to examine the behavior of population EM
for this case. To be clear, while sample EM is the practical algorithm that can actually be
applied, it can be insightful for theoretical purposes to first analyze the convergence of the
population EM updates, and then leverage these findings to understand the behavior of sample
EM [1]. Our analysis follows a similar road-map. Interestingly, for the case with θ ∗ = 0, the
population EM algorithm behaves significantly differently for the unbalanced fit (π = 12 ) as
compared to the balanced fit (π = 12 ) (equation (2)). In Figure 2, we plot the distance of the
population EM iterate θ t to the true parameter value, θ ∗ = 0, on the vertical axis, versus the
iteration number t on the horizontal axis. With the vertical axis on a log scale, a geometric
convergence rate of the algorithm shows up as a negatively sloped line (disregarding transient
effects in the first few iterations).
For the unbalanced mixtures in panel (a), we see that EM converges geometrically quickly,
although the rate of convergence (corresponding to the slope of the line) tends toward zero
as the mixture weight π tends toward 1/2 from below. For π = 1/2, we obtain a balanced
mixture, and, as shown in the plot in panel (b), the convergence rate is now subgeometric. In
fact, the behavior of the iterates is extremely well characterized by the recursion θ → 1+θ
θ
2.
The theory to follow provides a precise characterization of the behavior seen in Fig-
ures 1(b) and 2. Furthermore, in Section 5, we provide further support for relevance of our
theoretical results in explaining the behavior of EM for other classes of overspecified mod-
els, including Gaussian mixture models with unknown weights as well as mixtures of linear
regressions.

2.3. EM updates for the model fit (2). In this section, we provide a quick introduction to
the EM updates. Readers familiar with the literature can skip directly to the main results in
Section 3. Recall that the two-component model fit is based on the density
   
(3) πφ x; θ, σ 2 Id + (1 − π)φ x; −θ, σ 2 Id .
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3167

F IG . 2. Behavior of the (numerically computed) population EM updates (8) when the underlying data distribu-
tion is N (0, 1). (a) Unbalanced mixture fits (2) with weights (π, 1 − π): We observe geometric convergence toward
θ ∗ = 0 for all π = 0.5 although the rate of convergence gets slower as π → 0.5. (b) Balanced mixture fits (2) with
weights (0.5, 0.5): We observe two phases of convergence. First, EM quickly converges to ball of constant ra-
dius and then it exhibits slow convergence toward θ ∗ = 0. Indeed, we see that during the slow convergence, the
population EM updates track the curve given by θ t+1 = θ t /(1 + (θ t )2 ) very closely, as predicted by our theory.

From now on, we assume that the data is drawn from the zero-mean Gaussian distribution
N (0, σ 2 Id ). Note that the model fit described above contains the true model with θ ∗ = 0
and it is referred to as an overspecified fit since for any nonzero θ , the fitted model has two
components.
The maximum likelihood estimate is obtained by solving the following optimization prob-
lem:

1 n
     
(4) θnMLE ∈ arg max log πφ xi ; θ, σ 2 Id + (1 − π)φ xi ; −θ, σ 2 Id .
θ ∈ n
i=1
In general, there is no closed-form expression for 
θnMLE . The EM algorithm circumvents
this problem via a minorization-maximization scheme. Indeed, population EM is a surrogate
method to compute the maximizer of the population log-likelihood
     
(5) L(θ ) := EX log πφ X; θ, σ 2 Id + (1 − π)φ X; −θ, σ 2 Id ,
where the expectation is taken over the true distribution. On the other hand, sample EM at-
tempts to estimate  θnMLE . We now describe the expressions for both the sample and population
EM updates for the model-fit (3).
Given any point θ , the EM algorithm proceeds in two steps: (1) compute a surrogate func-
tion Q(·; θ ) such that Q(θ
; θ ) ≤ L(θ
) and Q(θ ; θ ) = L(θ ); and (2) compute the maximizer
of Q(θ
; θ ) with respect to θ
. These steps are referred to as the E-step and the M-step, re-
spectively. In the case of two-component location Gaussian mixtures, it is useful to describe
a hidden variable representation of the mixture model. Consider a binary indicator variable
Z ∈ {0, 1} with the marginal distribution P(Z = 1) = π and P(Z = 0) = 1 − π, and define the
conditional distributions
   
(X | Z = 0) ∼ N −θ, σ 2 Id and (X | Z = 1) ∼ N θ, σ 2 Id .
These marginal and conditional distributions define a joint distribution over the pair (X, Z),
and by construction, the induced marginal distribution over X is a Gaussian mixture of the
form (3). For EM, we first compute the conditional probability of Z = 1 given X = x:
θ −x22
π exp(− )
(6) wθ (x) = wθπ (x) := 2σ 2
.
θ −x2 θ +x2
π exp(− 2σ 2 2 ) + (1 − π) exp(− 2σ 2 2 )
3168 R. DWIVEDI ET AL.

Then, given a vector θ , the E-step in the population EM algorithm involves computing the
minorization function θ
→ Q(θ
, θ ). Doing so is equivalent to computing the expectation
  1 

 


(7) Q θ
; θ = − E wθ (X)
X − θ

22 + 1 − wθ (X)
X + θ

22 ,
2
where the expectation is taken over the true distribution (here N (0, σ 2 Id )). In the M-step, we
maximize the function θ
→ Q(θ
; θ ). Doing so defines a mapping M : Rd → Rd , known as
the population EM operator, given by
   
(8) M(θ ) = arg max Q θ
, θ = E 2wθ (X) − 1 X .
θ
∈Rd
In this definition, the second equality follows by computing the gradient ∇θ
Q, and setting it
to zero. In summary, for the two-component location mixtures considered in this paper, the
population EM algorithm is defined by the sequence θ t+1 = M(θ t ), where the operator M is
defined in equation (8).
We obtain the sample EM update by simply replacing the expectation E in equations (7)
and (8) by the empirical average based on an observed set of samples. In particular, given a
set of i.i.d. samples {Xi }ni=1 , the sample EM operator Mn : Rd → Rd takes the form
1 n
 
(9) Mn (θ ) := 2wθ (Xi ) − 1 Xi .
n i=1
Overall, the sample EM algorithm generates the sequence of iterates given by θ t+1 = Mn (θ t ).
In the sequel, we study the convergence of EM both for the population EM algorithm in
which the updates are given by θ t+1 = M(θ t ), and the sample-based EM sequence given by
θ t+1 = Mn (θ t ). With this notation in place, we now turn to the main results of this paper.

3. Main results. In this section, we state our main results for the convergence rates of
the EM updates under the unbalanced and balanced mixture fit. We start with the easier case
of unbalanced mixture fit in Section 3.1 followed by the more delicate (and interesting) case
of the balanced fit in Section 3.2.

3.1. Behavior of EM for unbalanced mixtures. We begin with a characterization of both


the population and sample-based EM updates in the setting of unbalanced mixtures. In partic-
ular, we assume that the fitted two-components mixture model (3) has known weights π and
1 − π, where π ∈ (0, 1/2). The following result characterizes the behavior of the EM updates
for this setup.

T HEOREM 1. Suppose that we fit an unbalanced instance (i.e., π = 12 ) of the mixture


model (3) to N (0, σ 2 Id ) data. Then:
(a) The population EM operator (8) is globally strictly contractive, meaning that


 
(10a)
M(θ )
≤ 1 − ρ 2 /2 θ2 for all θ ∈ Rd ,
2
where ρ := |1 − 2π| ∈ (0, 1).
(b) There are universal constants c, c
such that given any δ ∈ (0, 1) and a sample size
2
n ≥ c σρ 4 (d + log(1/δ)), the sample EM sequence θ t+1 = Mn (θ t ) generated by the update (9)
satisfies the upper bound



t


θ

θ 0
1 −
ρ 2 t c
(θ 0 2 σ 2 + ρσ ) d + log(1/δ)
(10b) 2 2 + 2
,
2 ρ n
with probability at least 1 − δ.

See Appendix A.1 in the Supplementary Material [10] for the proof of this theorem.
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3169

Fast convergence of population EM. The bulk of the effort in proving Theorem 1 lies
in establishing the guarantee (10a) for the population EM iterates. Such a contraction bound
immediately yields the exponential fast convergence of the population EM updates θ t+1 =
M(θ t ) to θ ∗ = 0:


T


θ

1 θ 0 2
(11) 2 for T ≥ · log .
log (1−ρ12 /2)

Since the mixture weights (π, 1 − π) are bounded away from 1/2, we have that ρ = |1 − 2π| is
bounded away from zero, and thus population EM iterates converge in O(log(1/
)) steps to
an
-ball around θ ∗ = 0. This result is equivalent to showing that in the unbalanced instance
(π = 1/2), the log-likelihood is strongly concave around the true parameter.

Statistical rate of sample EM. Once the bounds (10a) and (11) have been established, the
proof of the statistical rate (10b) for sample EM utilizes the scheme laid out by Balakrishnan
et al. [1]. In particular, we prove a nonasymptotic uniform law of large numbers (Lemma 1
stated in Section 4.1) that allows for the translation from population to sample EM iterates.
Roughly speaking, Lemma 1 guarantees that for any radius r > 0, tolerance δ ∈ (0, 1), and
sufficiently large n, we have

 


d + log(1/δ)
(12) P sup
Mn (θ ) − M(θ )
2 ≤ c2 σ (σ r + ρ) ≥ 1 − δ.
θ2 ≤r n
This bound, when combined with the contractive behavior (10a) or equivalently the expo-
nentially fast convergence (11) of the population EM iterates allows us to establish the stated
bound (10b). (See, e.g., Theorem 2 in the paper [1].)
Putting the pieces together, we conclude that the sample EM updates converge to an esti-
1
mate of θ ∗ —that has Euclidean error of the order (d/n) 2 —after a relatively small number
of steps that are of the order log(n/d). Note that this theoretical prediction is verified by the
simulation study in Figure 1(b) for the univariate setting (d = 1) of the unbalanced mixture-
fit. In Figure 3, we present the scaling of the radius of the final EM iterate3 with respect to
the sample size n and the dimension d, averaged over 400 runs of sample EM for various
settings of (n, d). Linear fits on the log-log scale in these simulations suggest a rate close to
1
(d/n) 2 as claimed in Theorem 1.

Remark. We make two comments in passing. First, the value of θ 0 2 in the convergence
rate of sample EM updates in Theorem 1 can be assumed to be of constant order; this as-
sumption stems 0
√ from the fact the population EM operator maps any θ to a vector with norm
smaller than 2/π (cf. Lemma 5 in Appendix C.1). Second, when the weight parameter π
is assumed to be unknown in the model fit (3), the EM algorithm exhibits fast convergence
when π is initialized sufficiently away from 12 ; see Section 5.1 for more details.

From unbalanced to balanced fit. The bound (11) shows that the extent of unbalanced-
ness in the mixture weights plays a crucial role in the geometric rate of convergence for the
population EM. When the mixtures become more balanced, that is, weight π approaches 1/2
or equivalently ρ approaches zero, the number of steps T required to achieve
-accuracy
scales as O(log(θ 0 2 /
)/ρ 2 ) and in the limit ρ → 0, this bound degenerates to ∞ for any
finite
. Indeed, the bound (10a) from Theorem 1 simply states that the population EM op-
erator is nonexpansive for balanced mixtures (ρ = 0), and does not provide any particular

3 Refer to the discussion before Section 2 for details on the stopping rule for EM.
3170 R. DWIVEDI ET AL.

F IG . 3. Scaling of the Euclidean error  θn,d − θ ∗ 2 for EM estimates  θn,d computed using the unbalanced
(π = 2 ) mixture-fit (3). Here, the true data distribution is N (0, Id ), that is, θ ∗ = 0, and 
1 θn,d denotes the EM
iterate upon convergence when we fit a two-mixture model with mixture weights (0.3, 0.7) using n samples in d
dimensions. (a) Scaling with respect to d for n ∈ {1600, 12,800}. (b) Scaling with respect to n for d ∈ {1, 128}. We
ran experiments for several other pairs of (n, d) and the conclusions were the same. The empirical results here
1
show that that our theoretical upper bound of the order (d/n) 2 on the EM solution is sharp in terms of n and d.

rate of convergence for this case. It turns out that the EM algorithm is worse in the balanced
case, both in terms of the optimization speed and in terms of the statistical rate. This slower
statistical rate is in accord with existing results for the MLE in overspecified mixture mod-
els [5]; the novel contribution here is the rigorous analysis of the analogous behavior for the
EM algorithm.

3.2. Behavior of EM for balanced mixtures. In this section, we first provide a sharp char-
acterization of the algorithmic rate of convergence of the population EM update for the bal-
anced fit (see Section 3.2.1). We then provide sharp bound for the statistical rate for the
sample EM updates (cf. Section 3.2.2).

3.2.1. Slow convergence of population EM. We now analyze the behavior of the popu-
lation EM operator for the balanced fit. We show that it is globally convergent, albeit with a
contraction parameter that depends on θ , and degrades toward 1 as θ 2 → 0. Our statement
involves the constant p := P(|X| ≤ 1) + 12 P(|X| > 1), where X ∼ N (0, 1) denotes a standard
normal variate. (Note that p < 1.)

T HEOREM 2. Suppose that we fit a balanced instance (π = 12 ) of the mixture model (3)
to N (0, σ 2 Id ) data. Then the population EM operator (8) θ → M(θ ) has the following prop-
erties:
(a) For all nonzero θ , we have
M(θ )2 p
(13a) ≤ γup (θ ) := 1 − p + < 1.
θ2 θ 2
1 + 2σ 22
5σ 2
(b) For all nonzero θ such that θ 22 ≤ 8 , we have
M(θ )2 1
(13b) ≥ γlow (θ ) := .
θ2 2θ 22
1 + σ2
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3171

See Appendix A.2 in the Supplementary Material [10] for the proof of Theorem 2.
The salient feature of Theorem 2 is that the contraction coefficient γup (θ ) is not globally
bounded away from 1 and in fact satisfies limθ →0 γup (θ ) = 1. In conjunction with the lower
bound (13b), we see that

M(θ )2 θ 2
(14)  1 − 22 for small θ2 .
θ2 σ
This precise contraction behavior of the population EM operator is in accord with that of the
simulation study in Figure 2(b).
The preceding results show that the population EM updates should exhibit two phases of
behavior. In the first phase, up to a relatively coarse accuracy of the order σ , the
√ iterates
exhibit geometric convergence. Concretely, we are guaranteed to have θ T0 2 ≤ 2σ after
log(θ 0 2 /(2σ 2 ))
running the algorithm for T0 := log(2/(2−p))
2
steps. In the second phase, as the error de-
√ √
creases from 2σ to a given
∈ (0, 2σ ), the convergence rate becomes subgeometric;
concretely, we have

T +t
cσ 2
(15)
θ 0

for t ≥
log(σ/
).
2

2
Note that the conclusion (15) shows that for small enough
, the population EM takes
(log(1/
)/
2 ) steps to find
-accurate estimate of θ ∗ = 0. This rate is extremely slow com-
pared to the geometric rate O(log(1/
)) derived for the unbalanced mixtures in Theorem 1.
Hence, the slow rate establishes a qualitative difference in the behavior of the EM algorithm
between the balanced and unbalanced setting.
Moreover, the subgeometric rate of EM in the balanced case is also in stark contrast
with the favorable behavior of EM for the exact-fitted settings analyzed in past work.
Balakrishnan et al. [1] showed that when the EM algorithm is used to fit a two-component
Gaussian mixture with sufficiently large value of θσ 2 (known as the high signal-to-noise

ratio, or high SNR for short), the population EM operator is contractive, and hence geo-
metrically convergent, within a neighborhood of the true parameter θ ∗ . In a later work on the
two-component balanced mixture fit model, Daskalakis et al. [7] showed that the convergence
is in fact geometric for any nonzero value of the SNR. The model considered in Theorem 2
can be seen as the limiting case of weak signal for a two mixture model—which degener-
ates to the Gaussian distribution when the SNR becomes exactly zero. For such a limit, we
observe that the fast convergence of population EM sequence no longer holds.

3.2.2. Upper and lower bounds on sample EM. We now turn to the statements of upper
and lower bounds on the rate of the sample EM iterates for the balanced fit on Gaussian data.
We begin with an upper bound, which involves the previously defined function γup (θ ) :=
θ 22
1 − p + p/(1 + 2σ 2
).

T HEOREM 3. Consider the sample EM updates θ t = Mn (θ t−1 ) for the balanced instance
(π = 12 ) of the mixture model (3) based on n i.i.d. N (0, σ 2 Id ) samples. Then there exist uni-
versal constants {ck
}4k=1 such that for any scalars α ∈ (0, 14 ) and δ ∈ (0, 1), any sample size
n ≥ c1
(d + log(log(1/α)/δ)) and any iterate number t ≥ c2
log θσ 2d n + c3
( dn ) 2 −2α log( dn ) ×
0 2 1

log( α1 ), we have
  2 1

t


θ

θ
·


0
t−1  j
σ (d + log log(4/
) ) 4 −α
(16) 2 2 γup θ + c4 σ δ
,
j =0
n
with probability at least 1 − δ.
3172 R. DWIVEDI ET AL.

See Section 4 for a discussion of the techniques employed to prove this theorem. The
detailed proof is provided in Appendix A.3 in the Supplementary Material [10], where we
also provide some more details on the definitions of these constants.
As we show in our proofs, once the iteration number t satisfies the lower bound stated
in the theorem, the second term on the right-hand side of the bound (16) dominates the first
term; therefore, from this point onwards, the the sample EM iterates have Euclidean norm
1
of the order (d/n) 4 −α . Note that α ∈ (0, 14 ) can be chosen arbitrarily close to zero, so at the
expense of increasing the lower bound on the number of iterations t by a logarithmic factor
1
log(1/α), we can obtain rates arbitrarily close to (d/n) 4 .
We note that earlier studies of parameter estimation for overspecified mixtures, in both the
1
frequentist [5] and Bayesian settings [15, 21], have derived a rate of n− 4 for the global maxi-
mum of the log likelihood. To the best of our knowledge, Theorem 3 is the first nonasymptotic
algorithmic result that shows that such rates apply to the fixed points and dynamics of the EM
algorithm, which need not converge to the global optima.
The preceding discussion was devoted to an upper bound on sample EM for the balanced
fit. Let us now match this upper bound, at least in the univariate case d = 1, by showing that
1
any nonzero fixed point of the sample EM updates has Euclidean norm of the order n− 4 . In
particular, we prove the following lower bound.

T HEOREM 4. There are universal positive constants c, c


such that for any nonzero so-
lution 
θn to the sample EM fixed-point equation θ = Mn (θ ) for the balanced mixture fit, we
have
 1
(17) θn | ≥ cn− 4 ≥ c
.
P |

See Appendix A.4 in the Supplementary Material [10] for the proof of this theorem.
Since the iterative EM scheme converges only to one of its fixed points, the theorem shows
1
that one cannot obtain a high-probability bound for any radius smaller than n− 4 . As a conse-
1
quence, with constant probability, the radius of convergence n− 4 for sample EM convergence
in Theorem 3 for the univariate setting is tight.

4. New techniques for sharp analysis of sample EM. In this section, we highlight the
new proof techniques introduced in this work that are required to obtain the sharp charac-
terization of the sample EM updates in the balanced case (Theorem 3). We begin in Sec-
tion 4.1 by elaborating that a direct application of the previous frameworks leads to subop-
timal statistical rates for sample EM in the balanced fit. This suboptimality motivates the
development of new methods for analyzing the behavior of the sample EM iterates, based
on an annulus-based localization argument over a sequence of epochs, which we sketch out
in Sections 4.2 and 4.3. We remark that our novel techniques, introduced here for analyzing
EM with the balanced fit, are likely to be of independent interest. We believe that they can
potentially be extended to derive sharp statistical rates in other settings when the algorithm
under consideration does not exhibit an geometrically fast convergence.

4.1. A suboptimal guarantee. Let us recall the set-up for the procedure suggested by
Balakrishnan et al. [1], specializing to the case where the true parameter θ ∗ = 0, as in our
specific set-up. Using the triangle inequality, the norm of the sample EM iterates θ t+1 =
Mn (θ t ) can be upper bounded by a sum of two terms as follows:

t+1

 

   

 

(18)
θ
=
Mn θ t

Mn θ t − M θ t
+
M θ t
,
2 2 2 2
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3173

for all t ≥ 0. The first term on the right-hand side corresponds to the deviations between the
sample and population EM operators, and can be controlled via empirical process theory.
The second term corresponds to the behavior of the (deterministic) population EM operator,
as applied to the sample EM iterate θ t , and needs to be controlled via a result on population
EM.
Theorem 2 from Balakrishnan et al. [1] is based on imposing generic conditions on each
of these two terms, and then using them to derive a generic bound on the sample EM iterates.
In the current context, their theorem can be summarized as follows. For given tolerances
δ ∈ (0, 1),
> 0 and starting radius r > 0, suppose that there exists a function ε(n, δ) > 0,
decreasing in terms of the sample size n, and a contraction coefficient κ ∈ (0, 1) such that
M(θ )2 


(19a) sup ≤κ and P sup
Mn (θ ) − M(θ )
2 ≤ ε(n, δ) ≥ 1 − δ.
θ2 ≥
θ2 θ 2 ≤r

Then for a sample size n sufficiently large and


sufficiently small to ensure that
(i) ε(n, δ) (ii)
(19b)
≤ ≤ r,
1−κ
the sample EM iterates are guaranteed to converge to a ball of radius ε(n, δ)/(1 − κ) around
the true parameter θ ∗ = 0.
In order to apply this theorem to the current setting, we need to specify a choice of ε(n, δ)
for which the bound on the empirical process holds. The following auxiliary result provides
such control for us.

L EMMA 1. There exists universal positive constants c1 and c2 such that for any positive
radius r, any threshold δ ∈ (0, 1), and any sample size n ≥ c2 d log(1/δ), we have

 


d + log(1/δ)
(20) P sup
Mn (θ ) − M(θ )
2 ≤ c1 σ (σ r + ρ) ≥ 1 − δ,
θ2 ≤r n
where ρ = |1 − 2π| denotes the imbalance in the mixture fit (3).

The proof of this lemma is based on Rademacher complexity arguments; see Appendix B.1
in the Supplementary Material [10] for the details.
With the choice r = θ 0 2 ,√Lemma 1 guarantees that the second inequality in line (19a)
holds with ε(n, δ)  σ 2 θ 0 2 d/n. On the other hand, Theorem 2 implies that for any θ
such that θ2 ≥
, we have that population EM is contractive with parameter bounded above
by κ(
)  1 −
2 . In order to satisfy inequality (i) in equation (19b), we solve the equation
ε(n, δ)/(1 − κ(
)) =
. Tracking only the dependency on d and n, we obtain4

d/n  1
(21) 2
=
=⇒
= O (d/n) 6 ,

which shows that the Euclidean norm of the sample EM iterate is bounded by a term of order
1
(d/n) 6 .
1
While this rate is much slower than the classical (d/n) 2 rate that we established in the
1
unbalanced case, it does not coincide with the n− 4 rate that we obtained in Figure 1(b) for
balanced setting with d = 1. Thus, the proof technique based on the framework of Balakrish-
nan et al. [1] appears to be nonoptimal. The suboptimality of this approach necessitates the

4 Moreover, with this choice of


, inequality (ii) in equation (19b) is satisfied with a constant r, as long as n is
sufficiently large relative to d.
3174 R. DWIVEDI ET AL.

F IG . 4. Scaling of the Euclidean error  θn,d −θ ∗ 2 for EM estimates  θn,d computed using the balanced (π = 12 )
mixture-fit (2). Here, the true data distribution is N (0, Id ), that is, θ ∗ = 0, and 
θn,d denotes the EM iterate upon
convergence when we fit a balanced mixture with n samples in d dimensions. (a) Scaling with respect to d for
n ∈ {1600, 12,800}. (b) Scaling with respect to n for d ∈ {1, 128}. We ran experiments for several other pairs of
1
(n, d) and the conclusions were the same. Clearly, the empirical results suggest a scaling of order (d/n) 4 for the
final iterate of sample-based EM.

development of a more refined technique. Before sketching this technique, we now quantify
empirically the convergence rate of sample EM in terms of both dimension d and sample
size n for the balanced mixture fit. In Figure 4, we summarize the results of these experi-
ments. The two panels in the figure exhibit that the error in the sample EM estimate scales
1
as (d/n) 4 , thereby providing further numerical evidence that the preceding approach indeed
led to a suboptimal result.

4.2. Annulus-based localization over epochs. Let us try to understand why the preced-
ing argument led to a suboptimal bound. In brief, its “one-shot” nature contains two major
deficiencies. First, the tolerance parameter
is used both (a) for measuring the contractivity
of the updates, as in the first inequality in equation (19a), and (b) for determining the final
accuracy that we achieve. At earlier phases of the iteration, the algorithm will converge more
quickly than suggested by the worst-case analysis based on the final accuracy. A second de-
ficiency is that the argument uses the radius r only once, setting it to a constant to reflect
the initialization θ 0 at the start of the algorithm. This means that we failed to “localize” our
bound on the empirical process in Lemma 1. At later iterations of the algorithm, the norm
θ t 2 will be smaller, meaning that the empirical process can be more tightly controlled.
We note that ideas of localizing the radius r for an empirical process plays a crucial role in
obtaining sharp bounds on the error of M-estimation procedures [2, 17, 26, 27].
A novel aspect of the localization argument in our setting is the use of an annulus instead
of a ball. In particular, we analyze the iterates from the EM algorithm assuming that they lie
within a prespecified annulus, defined by an inner and an outer radius. On one hand, the outer
radius of the annulus helps to provide a sharp control on the perturbation bounds between the
population and sample operators. On the other hand, the inner radius of the annulus is used
to tightly control the algorithmic rate of convergence.
We now summarize our key arguments. The entire sequence of sample EM iterations is
broken up into a sequence of different epochs. During each epoch, we localize the EM iterates
to an annulus. In more detail:
• We index epochs by the integer  = 0, 1, 2, . . . , and associate them with a sequence
{α }≥0 of scalars in the interval [0, 14 ]. The input to epoch  is the scalar α , and the output
from epoch  is the scalar α+1 .
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3175

• The th epoch is defined to be the set of all iterations t of the sample EM algorithm such
that the sample EM iterate θ t lies in the following annulus:
α+1 α
d

d
(22) ≤
θ t − θ ∗
2 ≤ .
n n
We establish that the sample-EM operator is nonexpansive so that each epoch is well-defined
(and that subsequent iterations can only correspond to subsequent epochs).
• Upon completion of epoch  at iteration T , the EM algorithm returns an estimate θ T
such that θ T 2  (d/n)α+1 , where
1 1
(23) α+1 = α + .
3 6
Note that the new scalar α+1 serves as the input to epoch  + 1.
The recursion (23) is crucial in our analysis: it tracks the evolution of the exponent acting
upon the ratio d/n, and the rate (d/n)α+1 is the bound on the Euclidean norm of the sample
EM iterates achieved at the end of epoch .
A few properties of the recursion (23) are worth noting. First, given our initialization
α0 = 0, we see that α1 = 16 , which agrees with the outcome of our one-step analysis from
above. Second, as the recursion is iterated, it converges from below to the fixed point α ∗ = 14 .
1
Thus, our argument will allow us to prove a bound arbitrarily close to (d/n) 4 , as stated for-
mally in Theorem 3 to follow. Refer to Figures 5 and 6 for an illustration of the definition of
these annuli, epochs and the associated conclusions.

4.3. How does the key recursion (23) arise? Let us now sketch out how the key recur-
sion (23) arises. Consider epoch  specified by input α < 14 , and consider an iterate θ t in
the following annulus: θ t 2 ∈ [(d/n)α+1 , (d/n)α ]. We begin by proving that this initial
condition ensures that θ t 2 is less than level (d/n)α for all future iterations; for details,
see Lemma 4 stated in the Supplementary Material. Given this guarantee, our second step is

F IG . 5. Illustration of the annulus-based-localization argument part (I): Defining the epochs or equivalently the
annuli. (a) Outer radius for the th epoch is given by n−α (tracking dependency only on n). (b) For any given
epoch , we analyze the behavior of the EM sequence θ t+1 = Mn (θ t ), when θ t lies in the annulus around θ ∗ with
inner and outer radii given by n−α+1 and n−α , respectively. √We prove that EM iterates move from one epoch
to the next epoch (e.g., epoch  to epoch  + 1) after at most n iterations. Given the definition of α , we see
1
that the inner and outer radii − 4 Consequently, after at
√ of the aforementioned annulus converges linearly to n . −1/4+α
most log(1/α) epochs (or n log(1/α) iterations), the EM iterate lies in a ball of radius n around θ ∗ . We
illustrate the one-step dynamics in any given annulus in Figure 6.
3176 R. DWIVEDI ET AL.

F IG . 6. Illustration of the annulus-based-localization argument part (II): Dynamics of EM in the th epoch
or equivalently the annulus n−α+1 ≤ θ t − θ ∗ 2 ≤ n−α . For a given epoch , we analyze the behavior of the
EM sequence θ t+1 = Mn (θ t ), when θ t lies in the annulus with inner and outer radii given by n−α+1 , and
n−α , respectively. In this epoch, the population EM operator M(θ t ) contracts with a contraction coefficient
that depends on n−α+1 , which is the inner radius of the disc, while the perturbation error Mn (θ t ) − M(θ)2
between the sample and population EM operators depends
√ on n−α , which is the outer radius of the disc. Overall,
we prove that Mn is nonexpansive and after at most n steps, the sample EM updates move from epoch  to epoch
 + 1.

to make use of the inner radius of the considered annulus to apply Theorem 2 for the popu-
lation EM operator, for all iterations t such that θ t 2 ≥ (d/n)α+1 . Consequently, for these
iterations, we have


 t 
p
t


M θ
≤ 1 − p +
θ

2 θ 2 2
1 + 2σ 22
(24a)
α
  d
 1 − (d/n)2α+1 (d/n)α ≤ γ ,
n
2α+1
where γ := e−(d/n) . On the other hand, using the outer radii of the annulus and applying
Lemma 1 for this epoch, we obtain that
α

   


Mn θ t − M θ t

d d d α +1/2
(24b) 2 = ,
n n n
for all t in the epoch. Unfolding the basic triangle inequality (18) for T steps, we find that

t+T

   
 

θ

Mn θ t − M θ t
1 + γ
2 2 T −1 + γ
+ ··· + γ T θt 2
1

 t   
2α+1
≤ Mn θ − M θ t
2 + e−T (d/n) (d/n)α .
1 − γ
The second term decays exponentially in T , and our analysis shows that it is dominated by
the first term in the relevant regime of analysis. Examining the first term, we find that θ t+T
has Euclidean norm of the order
−2α+1 α +1/2

t+T

   

(25)
θ
 1
Mn θ t − M θ t
≈ d d
.
2
1 − γ 2

n 
n 
=:r

The epoch is said to be complete once θ t+T 2  ( dn )α+1 .


Disregarding constants, this con-
dition is satisfied when r = ( n )
d α+1
, or equivalently when
−2α+1 α +1/2 α+1
d d d
= .
n n n
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3177

Viewing this equation as a function of the pair (α+1 , α ) and solving for α+1 in terms of
α yields the recursion (23). Refer to Figure 6 for a visual illustration of the localization
argument summarized above for a given epoch.
Of course, the preceding discussion is informal, and there remain many details to be ad-
dressed in order to obtain a formal proof. We refer the reader to Appendix A.3 for the com-
plete argument.

5. Generality of results and future work. Thus far, we have characterized the behav-
ior of the EM algorithm for different settings of over-specified location Gaussian mixtures.
We established rigorous statistical guarantees of EM under two particular but representative
settings of over-specified location Gaussian mixtures: the balanced and unbalanced mixture-
fit. The log-likelihood for the unbalanced fit remains strongly log-concave5 (due to the fixed
weights and location parameters being sign flips), and hence the Euclidean error of the final
1
iterate of EM decays at the usual rate (d/n) 2 with n samples in d dimensions. However, in
the balanced case, the log-likelihood is no longer strongly log-concave and the error decays
1
at the slower rate (d/n) 4 . We view our results as the first step in understanding and possibly
improving the EM algorithm in nonregular settings. We now provide a detailed discussion
that sheds light on the general applicability of our results. In particular, we discuss the be-
havior of EM under the following settings: (i) overspecified mixture models with unknown
weight parameters (Section 5.1), (ii) overspecified mixture of linear regression (Section 5.2)
and (iii) more general settings with overspecified mixture models (Section 5.3). We conclude
the paper with a discussion of several future directions that arise from the previous settings
in Section 5.4.

5.1. When the weights are unknown. Our theoretical analysis so far assumed that the
weights were fixed, an assumption common to a number of previous papers in the area [1, 7,
18]. In Appendix C.2, we consider the case of unknown weights for the model fit (3). In this
context, our main contribution is to show that if the weights are initialized far away from 12 —
meaning that the initial mixture is highly unbalanced—then the EM algorithm converges
quickly, and the results from Theorem 1 are valid. (See Lemma 6 in Appendix C.2 for the
details.) On the other hand, if the initial mixture is not heavily imbalanced, we observe the
slow convergence of EM consistent with Theorems 2 and 3.

5.2. Slow rates for mixture of regressions. Thus far, we have considered the behavior of
the EM algorithm in application to parameter estimation in mixture models. Our findings turn
out to hold somewhat more generally, with Theorems 2 and 3 having analogues when the EM
algorithm is used to fit a mixture of linear regressions in overspecified settings. Concretely,
suppose that (Y1 , X1 ), . . . , (Yn , Xn ) ∈ R × Rd are i.i.d. samples generated from the model
(26) Yi = Xi θ ∗ + σ ξi for i = 1, . . . , n,
where {ξi }ni=1 are i.i.d. standard Gaussian variates, and the covariate vectors Xi ∈ Rd are also
i.i.d. samples from the standard multivariate Gaussian N (0, Id ). Of interest is to estimate the
parameter θ ∗ using these samples and EM is a popular method for doing so. When θ ∗ has
sufficiently large Euclidean norm, a setting referred to as the strong signal case, Balakrish-
1
nan et al. [1] showed that the estimate returned by EM is at a distance (d/n) 2 from the true
parameter θ ∗ with high probability. On the other hand, our analysis shows that when θ ∗ 2

5 Moreover, in Appendix D we differentiate the unbalanced and balanced fit based on the log-likelihood and the
Fisher matrix and provide a heuristic justification for the different rates between the two cases.
3178 R. DWIVEDI ET AL.

decays to zero—leading to an overspecified setting—the convergence of EM becomes slow.


In particular, the EM algorithm takes significantly more steps and returns an estimate that is
1
statistically worse, lying at Euclidean distance of the order (d/n) 4 from the true parameter.
While the EM operators in this case are slightly different when compared to the overspecified
Gaussian mixture analyzed before, the proof techniques remain similar. More concretely, we
first show that the convergence of population EM is slow (similar to Theorem 2) and then use
the annulus-based localization argument (similar to the proof of Theorem 3 from Section 4)
to derive a sharp rate. For completeness, we present these results formally in Lemma 7 and
Corollary 2 in Appendix E.

5.3. Slow rates for general mixtures. We now present several experiments that provide
1
numerical backing to the claim that the slow rate of order n− 4 is not merely an artifact of
the special balanced fit ((3) with π = 12 ). We demonstrate that the slow convergence of EM
is very likely to arise while fitting general overspecified location Gaussian mixtures with
unknown weights (and known covariance). We consider three settings: (A) fitting several
general overspecified location Gaussian mixture fits to Gaussian data (Figure 7), (B) fitting a
special three-component mixture fit to a two mixture of Gaussians (Figure 8), and (C) fitting
mixtures with unknown weights and location parameters when the number of components
in the fitted model is overspecified by two (Figure 9). We now turn to the details of these
settings.

General overspecified mixture fits on Gaussian data. First, we remark that the fast con-
vergence in the unbalanced fit (Theorem 1) was a joint result of the facts that (a) the weights
were fixed and unequal, and (b) the parameters were constrained to be a sign flip. If ei-
ther of these conditions is violated, the EM algorithm exhibits slow convergence on both
algorithmic and statistical fronts. Theorems 2, 3 and 4 provide rigorous details for the case
of equal and fixed weights (balanced fit). When the weights are unknown, EM can exhibit
slow rate (see Section 5.1 and Appendix C.2 for further details). When the weights are
fixed and unequal, but the location parameters are estimated freely—that is, with the model
πφ(x; θ1 , 1) + (1 − π)φ(x; θ2 , 1), as illustrated in Figure 7(a)—then the EM estimates have
error 6 of order n− 14 . In such cases, the parameter estimates approximately satisfy the relation


k πk θk,n ≈ 0, since the mean of the data is close to zero; moreover, for a two-components
mixture model, the location estimates become weighted sign flips of each other. The features
are the intuitive reason underlying the similarity of behavior of EM between this fit and the
balanced fit. Finally, when we fit a two mixture model with unknown weight parameter and
1
free location parameters, the final error also has a scaling of order n− 4 . Refer to Figure 7 for
a numerical validation of these results.

Overspecified fits for mixtures of Gaussian data. Using similar reasoning as above, let
us sketch out how our theoretical results also yield usable predictions for more general over-
specified models. Roughly speaking, whenever there are extra number of components to be
estimated, parameters of some of them are likely to end up satisfying certain form of local
constraint. More concretely, suppose that we are given data generated from a k-component
mixture, and we use the EM algorithm to fit the location parameters of a mixture model with

6 For more general cases, we measure the error of parameter estimation using the Wasserstein metric of second-
 to account for label-switching between the components. When the true model is standard Gaussian this
order W 2,n
 1
metric is simply the weighted Euclidean error: ( πk 2 ) 2 , where π and 
θk,n k θk,n , respectively, denote the mixture
weight and the location parameter of the kth component of the mixture.
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3179

F IG . 7. Plots of the Wasserstein error W 


2,n associated with EM fixed points versus the sample size for fit-
ting various kinds of location mixture models to standard normal N (0, 1) data. We fit mixture models with ei-
ther two or three components, with all location parameters estimated in an unconstrained manner. The lines
are obtained by a linear regression of the log error on the sample size n. (a) Fitting a two-mixture model
πN (θ1 , 1) + (1 − π)N (θ2 , 1) with three different fixed values of weights π ∈ {0.1, 0.3, 0.5} and two (uncon-
strained) location parameters, along with least-squares fits to the log errors. (b) Data plotted as red trian-
gles is obtained by fitting a two-component model with unknown mixture weights and two location parameters
πN (θ1 , 1) + (1 − π)N (θ2 , 1), whereas green circles correspond to results fitting a three-component mixture model
3 1 − 14 statistical rate for the error in parameter
i=1 3 N (θi , 1). In all cases, the EM solutions exhibit the slow n
estimation. Also see Figure 9.

k + 1 components. Loosely speaking, the EM estimates corresponding to a set of k − 1 com-


ponents are likely to converge quickly, leaving the two remaining components to fit a single
component in the true model. If the other components are far away, the EM updates for the
parameters of these two components are unaffected by them and start to behave like the bal-
anced case. See Figure 8 for a numerical illustration of this intuition in an idealized setting
where we use k + 1 = 3 components to fit data generated from a k = 2 component model. In
1
this idealized setting, the error for one of the parameter scales at the fast rate of order n− 2 ,

F IG . 8. Behavior of EM for an overspecified Gaussian mixture. True model: 12 N (θ1∗ , 1) + 12 N (θ2∗ , 1) where
θ1∗ = 0 and θ2∗ = 10. We fit a model 14 N (−θ1 , 1) + 14 N (θ1 , 1) + 12 N (θ2 , 1), where we initialize θ10 close to θ1∗
and θ20 close to θ2∗ . (a) Population EM updates: We observe that while θ1t converges slowly to θ1∗ = 0, the iterates
θ2t converge exponentially fast to θ2∗ = 10. (b) We plot the statistical error for the two parameters. While the strong
1 1
signal component has a parametric n− 2 rate, for the no signal component EM has the slower n− 4 rate, which is
in good agreement with the theoretical results derived in the paper. (We remark that the error floor for θ2t in panel
(a) arises from the finite precision inherent to numerical integration.)
3180 R. DWIVEDI ET AL.

F IG . 9. Plots of Wasserstein error when both weights and location parameters are unknown and estimated using
EM and the fitted multivariate mixture model is overspecified. (a) True model: N ([0, 0] , I2 ), and fitted model
3 2  3  4
i=1 wi N (θi , I2 ) and (b) true model: 5 N ([0, 0] , I2 ) + 5 N ([4, 4] , I2 ) and fitted model: i=1 wi N (θi , I2 ).
1
In both cases, once again we see the scaling of order n− 4 for the final error (similar to results in Figure 7 and 8).

1
and that of the parameter that is locally overfitted exhibits a slow rate of order n− 4 . Finally,
1
we see that the statistical error of order n− 4 also arises when we overspecify the number of
components by more than one. In particular, we observe in Figure 7(b) (green dashed dotted
1
line with solid circles) and Figure 9 (both curves) that a similar scaling of order n− 4 arises
when we overspecify the number of components by 2 and estimate the weight and location
parameters.
Besides formally analyzing EM in these general cases, several other future directions arise
from our work which we now discuss.

5.4. Future directions. In our current work, we assumed that only the location parame-
ters were unknown and that the scale parameters of the underlying model are known. Never-
theless in practice, this assumption is rather restrictive and it is natural to ask what happens
if the scale parameters were also unknown. We note that the MLE is known to have even
slower statistical rates for the estimation error with such higher-order mixtures; therefore, it
would be interesting to determine if the EM algorithm also suffers from a similar slow down
when the scale parameters are unknown. We refer the readers to a recent preprint [11], where
we establish that the EM algorithm can suffer from a further slow-down on the statistical and
computational ends when overspecified mixtures are fitted with an unknown scale parameter.
Another important direction is to analyze the behavior of EM under different models for
generating the data. While our analysis is focused on Gaussian mixtures, the nonstandard
1
statistical rate n− 4 also arises in other types of overspecified mixture models, such as those
involving mixtures with other exponential family members, or Student-t distributions, suit-
able for heavy-tailed data. We believe that the analysis of our paper can be generalized to a
broader class of finite mixture models that includes the aforementioned models.
A final direction of interest is whether the behavior of EM—slow versus fast conver-
gence—can be used as a statistic in a classical testing problem: testing the simple null of
a standard multivariate Gaussian versus the compound alternative of a two-component Gaus-
sian mixture. This problem is known to be challenging due to the break-down of the (gen-
eralized) likelihood ratio test, due the singularity of the Fisher information matrix; see the
papers [4, 19] for some past work on the problem. The results of our paper suggest an al-
ternative approach, which is based on monitoring the convergence rate of EM. If the EM
algorithm converges slowly for a balanced fit, then we may accept the null, whereas the op-
posite behavior can be used as an evidence for rejecting the null. We leave for future work
the analysis of such a testing procedure based on the convergence rates of EM.
SINGULAR MODELS AND SLOW CONVERGENCE OF EM 3181

Acknowledgments R. Dwivedi, N. Ho and K. Khamaru contributed equally to this work.


M. J. Wainwright was partially supported by Office of Naval Research Grant DOD ONR-
N00014-18-1-2640 and National Science Foundation Grant NSF-DMS-1612948. B. Yu was
partially supported by National Science Foundation Grant NSF-DMS-1613002. M. I. Jordan
was partially supported by Army Research Office Grant W911NF-17-1-0304.

SUPPLEMENTARY MATERIAL
Supplement to “Singularity, misspecification and the convergence rate of EM” (DOI:
10.1214/19-AOS1924SUPP; .pdf). In this supplemental material, we present the proofs for
all the technical results in the paper.

REFERENCES
[1] BALAKRISHNAN , S., WAINWRIGHT, M. J. and Y U , B. (2017). Statistical guarantees for the EM algorithm:
From population to sample-based analysis. Ann. Statist. 45 77–120. MR3611487 https://doi.org/10.
1214/16-AOS1435
[2] BARTLETT, P. L., B OUSQUET, O. and M ENDELSON , S. (2005). Local Rademacher complexities. Ann.
Statist. 33 1497–1537. MR2166554 https://doi.org/10.1214/009053605000000282
[3] C AI , T. T., M A , J. and Z HANG , L. (2019). CHIME: Clustering of high-dimensional Gaussian mixtures
with EM algorithm and its optimality. Ann. Statist. 47 1234–1267. MR3911111 https://doi.org/10.
1214/18-AOS1711
[4] C HEN , J. and L I , P. (2009). Hypothesis test for normal mixture models: The EM approach. Ann. Statist. 37
2523–2542. MR2543701 https://doi.org/10.1214/08-AOS651
[5] C HEN , J. H. (1995). Optimal rate of convergence for finite mixture models. Ann. Statist. 23 221–233.
MR1331665 https://doi.org/10.1214/aos/1176324464
[6] C HEN , Y. C. (2018). Statistical inference with local optima. Preprint. Available at arXiv:1807.04431.
[7] DASKALAKIS , C., T ZAMOS , C. and Z AMPETAKIS , M. (2017). Ten steps of EM suffice for mixtures of two
Gaussians. In Proceedings of the 2017 Conference on Learning Theory.
[8] D EMPSTER , A. P., L AIRD , N. M. and RUBIN , D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38. MR0501537
[9] DWIVEDI , R., H O , N., K HAMARU , K., WAINWRIGHT, M. J. and J ORDAN , M. I. (2018). Theoretical
guarantees for EM under misspecified Gaussian mixture models. In NeurIPS 31.
[10] DWIVEDI , R., H O , N., K HAMARU , K., WAINWRIGHT, M. J., J ORDAN , M. I. and Y U , B. (2020). Sup-
plement to “Singularity, misspecification and the convergence rate of EM.” https://doi.org/10.1214/
19-AOS1924SUPP
[11] DWIVEDI , R., H O , N., K HAMARU , K., WAINWRIGHT, M. J., J ORDAN , M. I. and Y U , B. (2019). Chal-
lenges with EM in application to weakly identifiable mixture models. Preprint. Available at arXiv:
1902.00194.
[12] G HOSAL , S. and VAN DER VAART, A. W. (2001). Entropies and rates of convergence for maximum likeli-
hood and Bayes estimation for mixtures of normal densities. Ann. Statist. 29 1233–1263. MR1873329
https://doi.org/10.1214/aos/1013203453
[13] H AO , B., S UN , W. W., L IU , Y. and C HENG , G. (2018). Simultaneous clustering and estimation of hetero-
geneous graphical models. J. Mach. Learn. Res. 18 Art. ID 217. MR3827105
[14] H EINRICH , P. and K AHN , J. (2018). Strong identifiability and optimal minimax rates for finite mixture
estimation. Ann. Statist. 46 2844–2870. MR3851757 https://doi.org/10.1214/17-AOS1641
[15] I SHWARAN , H., JAMES , L. F. and S UN , J. (2001). Bayesian model selection in finite mixtures by marginal
density decompositions. J. Amer. Statist. Assoc. 96 1316–1332. MR1946579 https://doi.org/10.1198/
016214501753382255
[16] K LUSOWSKI , J. M., YANG , D. and B RINDA , W. D. (2019). Estimating the coefficients of a mixture
of two linear regressions by expectation maximization. IEEE Trans. Inform. Theory 65 3515–3524.
MR3959002 https://doi.org/10.1109/TIT.2019.2891628
[17] KOLTCHINSKII , V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization.
Ann. Statist. 34 2593–2656. MR2329442 https://doi.org/10.1214/009053606000001019
[18] K UMAR , R. and S CHMIDT, M. (2017). Convergence rate of expectation-maximization. In 10th NIPS Work-
shop on Optimization for Machine Learning.
[19] L I , P., C HEN , J. and M ARRIOTT, P. (2009). Non-finite Fisher information and homogeneity: An EM ap-
proach. Biometrika 96 411–426. MR2507152 https://doi.org/10.1093/biomet/asp011
3182 R. DWIVEDI ET AL.

[20] M A , J., X U , L. and J ORDAN , M. I. (2000). Asymptotic convergence rate of the EM algorithm for Gaussian
mixtures. Neural Comput. 12 2881–2907.
[21] N GUYEN , X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Ann.
Statist. 41 370–400. MR3059422 https://doi.org/10.1214/12-AOS1065
[22] R EDNER , R. A. and WALKER , H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm.
SIAM Rev. 26 195–239. MR0738930 https://doi.org/10.1137/1026034
[23] R ICHARDSON , S. and G REEN , P. J. (1997). On Bayesian analysis of mixtures with an unknown number of
components. J. Roy. Statist. Soc. Ser. B 59 731–792. MR1483213 https://doi.org/10.1111/1467-9868.
00095
[24] ROUSSEAU , J. and M ENGERSEN , K. (2011). Asymptotic behaviour of the posterior distribution in overfitted
mixture models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 73 689–710. MR2867454 https://doi.org/10.
1111/j.1467-9868.2011.00781.x
[25] S TEPHENS , M. (2000). Dealing with label switching in mixture models. J. R. Stat. Soc. Ser. B. Stat.
Methodol. 62 795–809. MR1796293 https://doi.org/10.1111/1467-9868.00265
[26] VAN DE G EER , S. (2000). Empirical Processes in M-Estimation. Cambridge Univ. Press, Cambridge.
[27] WAINWRIGHT, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Se-
ries in Statistical and Probabilistic Mathematics 48. Cambridge Univ. Press, Cambridge. MR3967104
https://doi.org/10.1017/9781108627771
[28] WANG , Z., G U , Q., N ING , Y. and L IU , H. (2015). High-dimensional expectation-maximization algorithm:
Statistical optimization and asymptotic normality. In Advances in Neural Information Processing Sys-
tems 28.
[29] W U , C.-F. J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11 95–103.
MR0684867 https://doi.org/10.1214/aos/1176346060
[30] X U , J., H SU , D. and M ALEKI , A. (2016). Global analysis of expectation maximization for mixtures of two
Gaussians. In Advances in Neural Information Processing Systems 29.
[31] X U , L. and J ORDAN , M. I. (1996). On convergence properties of the EM algorithm for Gaussian mixtures.
Neural Comput. 8 129–151. https://doi.org/10.1162/neco.1996.8.1.129
[32] YAN , B., Y IN , M. and S ARKAR , P. (2017). Convergence of gradient EM on multi-component mixture of
Gaussians. In Advances in Neural Information Processing Systems 30.
[33] Y I , X. and C ARAMANIS , C. (2015). Regularized EM algorithms: A unified framework and statistical guar-
antees. In Advances in Neural Information Processing Systems 28.

You might also like