Université Paris-Sud
École Doctorale de mathématiques de la Région Paris-Sud
Laboratoire de Mathématiques d’Orsay
THESIS
presented to obtain the degree of
Doctor of Science of the Université Paris-Sud
Field: Mathematics
by
Emilie Devijver
High-dimensional mixture regression models, application to functional data.
Defended on 2 July 2015 before the examination committee:

Francis Bach          INRIA Paris-Rocquencourt   Examiner
Christophe Biernacki  Université de Lille 1      Referee
Yannig Goude          EDF R&D                    Examiner
Jean-Michel Loubes    Université de Toulouse     Referee
Pascal Massart        Université Paris-Sud       Thesis advisor
Jean-Michel Poggi     Université Paris-Sud       Thesis advisor
Thesis prepared under the supervision of Pascal Massart and Jean-Michel Poggi
at the Département de Mathématiques d'Orsay,
Laboratoire de Mathématiques (UMR 8625),
Bât. 425, Université Paris Sud
91405 Orsay Cedex
High-dimensional mixture regression models, application to functional data.

Mixture regression models are used to model the relationship between a response and predictors for data arising from several subpopulations. In this thesis, we study high-dimensional predictors and a high-dimensional response.
First, we derive an ℓ1 oracle inequality satisfied by the Lasso estimator. We focus on this estimator for its ℓ1-regularization properties.
We also propose two procedures to cope with this high-dimensional clustering problem. The first procedure uses the maximum likelihood estimator to estimate the unknown conditional density, restricted to the active variables selected by a Lasso-type estimator. The second procedure combines variable selection and rank reduction to decrease the dimension.
For each procedure, we derive an oracle inequality, which makes explicit the penalty needed to select a model close to the oracle.
We extend these procedures to the functional setting, in which the predictors and the response may be functions. For this purpose, we use a wavelet approach. For each procedure, we provide algorithms, and we apply and evaluate our methods on simulations and on real data. In particular, we illustrate the first method on an electricity load consumption dataset.
Keywords: mixture regression models, clustering, high dimension, variable selection, model selection, oracle inequality, functional data, electricity consumption, wavelets.
High-dimensional mixture regression models, application to functional data.
Finite mixture regression models are useful for modeling the relationship between a response and predictors arising from different subpopulations. In this thesis, we focus on high-dimensional predictors and a high-dimensional response.
First of all, we provide an ℓ1-oracle inequality satisfied by the Lasso estimator. We focus on this estimator for its ℓ1-regularization properties rather than as a variable selection procedure.
We also propose two procedures to deal with high-dimensional clustering. The first procedure estimates the unknown conditional mixture density by a maximum likelihood estimator, restricted to the relevant variables selected by an ℓ1-penalized maximum likelihood estimator.
The second procedure considers jointly variable selection and rank reduction to obtain lower-dimensional approximations of the parameter matrices.
For each procedure, we derive an oracle inequality, which specifies the penalty shape of the model selection criterion, depending on the complexity of the random model collection. We extend these procedures to the functional case, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms, and we apply and evaluate our methods both on simulations and on real datasets. In particular, we illustrate the first procedure on an electricity load consumption dataset.
Keywords: mixture regression models, clustering, high dimension, variable selection, model
selection, oracle inequality, functional data, electricity consumption, wavelets.
To my grandfather.
Contents
Abstract
Acknowledgements
Introduction
Notations
1 Two procedures
  1.1 Introduction
  1.2 Gaussian mixture regression models
  1.3 Two procedures
  1.4 Illustrative example
  1.5 Functional datasets
  1.6 Conclusion
  1.7 Appendices
2 An ℓ1-oracle inequality for the Lasso in finite mixture of multivariate Gaussian regression models
  2.1 Introduction
  2.2 Notations and framework
  2.3 Oracle inequality
  2.4 Proof of the oracle inequality
  2.5 Proof of the theorem according to T or T^c
  2.6 Some details

3 An oracle inequality for the Lasso-MLE procedure
  3.1 Introduction
  3.2 The Lasso-MLE procedure
  3.3 An oracle inequality for the Lasso-MLE model
  3.4 Numerical experiments
  3.5 Tools for proof
  3.6 Appendix: technical results

4 An oracle inequality for the Lasso-Rank procedure
  4.1 Introduction
  4.2 The model and the model collection
  4.3 Oracle inequality
  4.4 Numerical studies
  4.5 Appendices

5 Clustering electricity consumers using high-dimensional regression mixture models
  5.1 Introduction
  5.2 Method
  5.3 Typical workflow using the example of the aggregated consumption
  5.4 Clustering consumers
  5.5 Discussion and conclusion
List of Figures
1    Example of simulated data
2    Illustration of the Lasso estimator
3    Dimension jump
4    Slope heuristic

1.1  Number of FR and TR
1.2  Zoom in on the number of FR and TR
1.3  Slope graph for our Lasso-Rank procedure
1.4  Slope graph for our Lasso-MLE procedure
1.5  Boxplot of the Kullback-Leibler divergence
1.6  Boxplot of the ARI
1.7  Boxplot of the Kullback-Leibler divergence
1.8  Boxplot of the ARI
1.9  Boxplot of the Kullback-Leibler divergence
1.10 Boxplot of the ARI
1.11 Boxplot of the Kullback-Leibler divergence
1.12 Boxplot of the ARI
1.13 Boxplot of the ARI
1.14 Plot of the 70-sample of half-hour load consumption, on the two days
1.15 Plot of a week of load consumption
1.16 Summarized results for the model 1
1.17 Summarized results for the model 1

3.1  Boxplot of the Kullback-Leibler divergence
3.2  Boxplot of the Kullback-Leibler divergence
3.3  Boxplot of the ARI
3.4  Summarized results for the model 1
3.5  Summarized results for the model 1

5.1  Load consumption of a sample of 5 consumers over a week in winter
5.2  Projection of a load consumption for one day onto the Haar basis, level 4. By construction, we get s = A4 + D4 + D3 + D2 + D1. On the left side, the signal is considered with reconstruction of the dataset, the dotted line being preprocessing 1 and the dotted-dashed line preprocessing 2
5.3  We select the model m̂ using the slope heuristic
5.4  Minimization of the penalized log-likelihood. Interesting models are marked by red squares, the selected one by a green diamond
5.5  Representation of the regression matrix βk for preprocessing 1
5.6  Representation of the regression matrix βk for preprocessing 2
5.7  For the selected model, we represent β̂ in each cluster
5.8  For the selected model, we represent Σ in each cluster
5.9  Assignment boxplots per cluster
5.10 Clustering representation. Each curve is the mean in each cluster
5.11 Clustering representation. Each curve is the mean in each cluster
5.12 Saturday and Sunday load consumption in each cluster
5.13 Proportions in each cluster for models constructed by our procedure
5.14 Regression matrix in each cluster for the model with 2 clusters
5.15 Daily mean consumptions of the cluster centres along the year for 2 (top) and 5 clusters (bottom)
5.16 Daily mean consumptions of the cluster centres as a function of the daily mean temperature for 2 (left) and 5 clusters (right)
5.17 Average (over time) week of consumption for each centre of the two classifications (2 clusters on the top and 5 on the bottom)
5.18 Out-of-bag error of the random forest classifiers as a function of the number of trees
5.19 RMSE on Thursday prediction for each procedure over all consumers
5.20 Daily mean consumptions of the cluster centres as a function of the daily mean temperature for 5 clusters, clustering done by observing Thursday and Wednesday in summer
5.21 Daily mean consumptions of the cluster centres along the year for 3 clusters, clustering done on weekend observations
Introduction
Building classes to better model a sample is a classical method in statistics. The practical example developed in this thesis is the clustering of electricity consumers in Ireland. Usually, individuals are clustered in an estimation problem; here, to emphasize the predictive aspect, we cluster observations that share the same type of behaviour from one day to the next. This is a finer clustering, acting on homoscedastic regressors. In our example, consumers are clustered through a regression model of one day's consumption on the consumption of the previous day. In this thesis, we develop this framework from a methodological point of view, by proposing two clustering procedures based on regression mixture models that cope with high dimension; from a theoretical point of view, by justifying the model selection step of our procedures; and from a practical point of view, by working on the real dataset of electricity consumption in Ireland. In this introduction, we present the notions developed in this thesis.
Linear regression
Regression methods look for the link between two random variables X and Y. The variable X represents the regressors, the explanatory variables, while Y describes the response. The Gaussian linear model, the simplest regression model, assumes that Y depends linearly on X, up to a Gaussian noise. More formally, if X and Y are two random vectors, X ∈ R^p and Y ∈ R, studying a Gaussian linear model on (X, Y) amounts to finding β ∈ R^p such that

Y = \beta X + \epsilon    (1)

where ε ∼ N(0, σ²), σ² being known or to be estimated depending on the case. The study of this model is, to this day, fairly complete. Let (x, y) = ((x_1, y_1), ..., (x_n, y_n)) be a sample. If the sample is large enough, a consistent estimator (that is, one converging to the true value) is known: the least squares estimator (which here coincides with the maximum likelihood estimator). We denote this estimator (β̂, σ̂²), where

\hat{\beta} = (x^t x)^{-1} x^t y ; \qquad \hat{\sigma}^2 = \frac{\|y - x\hat{\beta}\|^2}{n - p}.

The distribution of this estimator is known: β̂ ∼ N(β, σ²(xᵗx)⁻¹) and (n − p)σ̂²/σ² ∼ χ²_{n−p}, which allows us to derive confidence intervals for each parameter.
This model can be generalized to a multivariate variable Y ∈ R^q. In this case, we observe a sample (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n, and we build an estimator (β̂, Σ̂) ∈ R^{q×p} × S_q^{++}, where

\hat{\beta} = (x^t x)^{-1} x^t y ; \qquad \hat{\Sigma} = \left( \frac{\langle (y - x\hat{\beta})_{l_1}, (y - x\hat{\beta})_{l_2} \rangle}{n - p} \right)_{1 \leq l_1 \leq q, \, 1 \leq l_2 \leq q} ;    (2)

and S_q^{++} is the set of symmetric positive definite matrices of size q.
This regression model is used, for example, to predict new values. If the model underlying our data is known, that is, if β̂ and Σ̂ have been determined from a sample ((x_1, y_1), ..., (x_n, y_n)), and a new x_{n+1} is observed, we can compute ŷ_{n+1} = β̂ x_{n+1}. In this case ŷ_{n+1} is called the predicted value.
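As a minimal numerical sketch of the estimators in (2) and of the prediction step (the sample size, dimensions and noise level below are arbitrary choices for the illustration, not values from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 5, 3                      # sample size, predictor and response dimensions
beta_true = rng.normal(size=(q, p))      # true regression matrix

x = rng.normal(size=(n, p))                                 # design matrix (n x p)
y = x @ beta_true.T + rng.normal(scale=0.5, size=(n, q))    # responses (n x q)

# Least squares estimator of the regression matrix: beta_hat = (x^t x)^{-1} x^t y
beta_hat = np.linalg.solve(x.T @ x, x.T @ y).T              # shape (q, p)

# Estimator of the noise covariance, as in (2)
residuals = y - x @ beta_hat.T
sigma_hat = residuals.T @ residuals / (n - p)               # shape (q, q)

# Prediction for a new observation x_{n+1}
x_new = rng.normal(size=p)
y_pred = beta_hat @ x_new
print(y_pred)
```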
If we are interested in the pair of random variables (X, Y), we can estimate the density of the pair, but we can also study the conditional density. It is this last quantity that we described with a linear model in (1). The covariates may also have different statuses: either they are fixed and deterministic, or they are random, in which case we work conditionally on the underlying distribution. In this thesis, we are interested in the conditional distribution, for fixed or random regressors.
However, this linear model assumption is very restrictive: if we observe a sample (x, y) = ((x_1, y_1), ..., (x_n, y_n)), we assume that each y_i depends on x_i in the same way, up to a noise term, for i ∈ {1, ..., n}. If the model suits the data, the noise will be small, meaning that the coefficients of the covariance matrix Σ will be small. However, many datasets cannot be well summarized by a linear model.
Mixture models for regression
To refine this model, we may choose to build several classes and let our estimators β̂ and Σ̂ depend on the class. More formally, we study a mixture of K Gaussian regression models: if we observe a sample (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n, and if observation i is assumed to belong to class k, then there exist β_k and Σ_k such that

y_i = \beta_k x_i + \epsilon_i

where ε_i ∼ N(0, Σ_k). In this thesis, we consider mixtures with a finite number of classes K.
Mixture models are traditionally studied in estimation problems (let us cite for instance McLachlan and Basford, [MB88], and McLachlan and Peel, [MP04], two founding books on mixture models, Fraley and Raftery, [FR00], for a state of the art, or Celeux and Govaert, [CG93], for the study of a mixture density with a clustering purpose).
The main idea is to estimate the unknown density s* by a mixture of classical densities (s_{θ_k})_{1≤k≤K}: we can then write

s^* = \sum_{k=1}^{K} \pi_k s_{\theta_k}, \qquad \sum_{k=1}^{K} \pi_k = 1.

In this thesis, we are interested in a mixture of Gaussian linear models with constant weights; the densities s_{θ_k} are therefore conditional Gaussian densities.
Several ideas emerge with mixture models, in regression or not. If the parameters of the model are known, we can compute, for each observation, the posterior probability of belonging to each class. Following the maximum a posteriori principle (denoted MAP), we can then assign a class to each observation. To do so, we compute the posterior probability of each observation belonging to each class, and we assign the observations to the most probable classes. Each observation can thus be characterized by the parameters of the conditional density associated with its class.
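As a hedged sketch of the MAP rule (the function and its arguments are illustrative, with parameters assumed given rather than estimated), the posterior probabilities and the resulting assignments can be computed as follows:

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_assignment(x, y, pi, beta, Sigma):
    """Posterior probabilities and MAP classes for a Gaussian mixture regression.

    x: (n, p) regressors, y: (n, q) responses,
    pi: (K,) weights, beta: (K, q, p) regression matrices, Sigma: (K, q, q) covariances.
    """
    n, K = x.shape[0], len(pi)
    log_tau = np.empty((n, K))
    for k in range(K):
        mean_k = x @ beta[k].T                       # conditional means in class k
        log_tau[:, k] = np.log(pi[k]) + np.array(
            [multivariate_normal.logpdf(y[i], mean_k[i], Sigma[k]) for i in range(n)])
    # normalize in log scale to obtain the posterior probabilities
    log_tau -= log_tau.max(axis=1, keepdims=True)
    tau = np.exp(log_tau)
    tau /= tau.sum(axis=1, keepdims=True)
    return tau, tau.argmax(axis=1)                   # posterior probabilities, MAP classes
```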
In supervised classification, the assignment of each observation is known, and we seek to understand how the classes are formed in order to classify a new observation; in semi-supervised classification, some assignments are known, and we seek to understand the model (knowing the assignments of the observations is often very expensive); in unsupervised classification, the assignments are not known at all.
In this thesis we place ourselves in the unsupervised setting, with a regression approach. This has already been considered; let us cite for instance Städler and coauthors ([SBG10]) or Meynet ([Mey13]), who work with this model for univariate responses Y.
We observe pairs (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n, and we want to build classes, to group together the observations (x_i, y_i), i ∈ {1, ..., n}, that share the same relationship between y_i and x_i. On some datasets, this approach seems natural, and the number of classes is known.
Figure 1 – Example of simulated data where the regression classes can be understood from the plot. Here one would like to build 3 classes, represented by different symbols: ∗ for class 1, ♦ for class 2, and another symbol for class 3. The data are one-dimensional, (X, Y) ∈ R × R. The observations x are drawn from a standard Gaussian distribution, and y is simulated according to a mixture of Gaussians with means β_k x and variance 0.25, where β = [−1, 0.1, 3].
Nevertheless, on an arbitrary dataset, the goal is mainly to better understand the relationship between the random variables Y and X by grouping the observations that share the same dependence between Y and X. The group structure is not necessarily clearly induced by the data, and we do not necessarily know how many classes it is worth building: in some cases, K will have to be estimated.
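A minimal simulation in the spirit of Figure 1 (the sample size is an arbitrary choice for the illustration): each observation is drawn from one of three regression lines with Gaussian noise, and only (x, y) is observed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 300, 3
beta = np.array([-1.0, 0.1, 3.0])        # one regression coefficient per class
pi = np.full(K, 1 / K)                   # balanced mixture weights

z = rng.choice(K, size=n, p=pi)          # latent class labels (hidden in practice)
x = rng.normal(size=n)                   # standard Gaussian regressors
y = beta[z] * x + rng.normal(scale=np.sqrt(0.25), size=n)   # mixture of regressions
```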
If we consider the mixture of K multivariate Gaussian regression models, we can describe this model with classical statistical tools.
If the random vectors Y are assumed independent conditionally on X, we consider the conditional density s_ξ^K, where

s_\xi^K : \mathbb{R}^q \to \mathbb{R}, \qquad y \mapsto s_\xi^K(y|x) = \sum_{k=1}^{K} \frac{\pi_k}{(2\pi)^{q/2} \det(\Sigma_k)^{1/2}} \exp\left( -\frac{1}{2} (y - \beta_k x)^t \Sigma_k^{-1} (y - \beta_k x) \right) ;

where the parameters to be estimated are ξ = (π_1, ..., π_K, β_1, ..., β_K, Σ_1, ..., Σ_K) ∈ Ξ_K, with

\Xi_K = \Pi_K \times (\mathbb{R}^{q \times p})^K \times (S_q^{++})^K, \qquad \Pi_K = \left\{ (\pi_1, \ldots, \pi_K) \in [0,1]^K ; \; \sum_{k=1}^{K} \pi_k = 1 \right\}.

From this conditional density, we can define the maximum likelihood estimator by

\hat{\xi}_K^{MLE} = \underset{\xi \in \Xi_K}{\mathrm{argmin}} \left\{ -\frac{1}{n} l(\xi, x, y) \right\} ;    (3)

where the log-likelihood is defined by

l(\xi, x, y) = \sum_{i=1}^{n} \log\left( s_\xi^K(y_i|x_i) \right).
As in most classical mixture models, in the regression setting the maximum likelihood estimator is not explicit. This complicates the theoretical analysis: we have access neither to the distribution of the estimators nor to that of the assignments. From a practical point of view, the EM algorithm is used to approximate this estimator. This algorithm, introduced by Dempster and coauthors in [DLR77], makes it possible to estimate the parameters of a mixture model. It alternates two steps until convergence of the parameters, or of a function of the parameters. Let us describe this result.
Denote by Z = (Z_1, ..., Z_n) the random vector of the assignments of the observations, and by 𝒵 the set of all possible partitions: Z_{i,k} = 1 if observation i comes from class k, and 0 otherwise. We also write, for i ∈ {1, ..., n}, f(z_i|x_i, y_i, ξ) for the probability that Z_i = z_i given (x_i, y_i) and given the parameter ξ. The complete log-likelihood is defined by

l_c(\xi, (x, y, z)) = \sum_{i=1}^{n} \left[ \log f(z_i|x_i, y_i, \xi) + \log s_\xi^K(y_i|x_i) \right].

Then the log-likelihood can be decomposed as

l(\xi, x, y) = Q(\xi|\tilde{\xi}) - H(\xi|\tilde{\xi})

where Q(ξ|ξ̃) is the expectation, over the latent variables Z, of the complete likelihood,

Q(\xi|\tilde{\xi}) = \sum_{z \in \mathcal{Z}} \sum_{i=1}^{n} l_c(\xi, x_i, y_i, z_i) f(z_i|x_i, y_i, \tilde{\xi})

and

H(\xi|\tilde{\xi}) = \sum_{z \in \mathcal{Z}} \sum_{i=1}^{n} \log f(z_i|x_i, y_i, \xi) \, f(z_i|x_i, y_i, \tilde{\xi}),

the expectation being taken over the latent variables Z. In the EM algorithm, we iterate the computation and the maximization of Q(ξ|ξ^{(ite)}) until convergence, where ξ^{(ite)} is the parameter estimate at iteration (ite) ∈ N* of the EM algorithm. Indeed, increasing ξ ↦ Q(ξ|ξ^{(ite)}) also increases the likelihood. Dempster, in [DLR77], therefore proposed the following algorithm.
Algorithm 1: EM algorithm
Data: x, y, K
Result: ξ̂_K^{MLE}
1. Initialize ξ^{(0)}.
2. Iterate until a stopping criterion is reached:
   — E-step: compute, for every ξ, Q(ξ|ξ^{(ite)});
   — M-step: compute ξ^{(ite+1)} such that ξ^{(ite+1)} ∈ argmax_ξ Q(ξ|ξ^{(ite)}).
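A compact numerical sketch of Algorithm 1 for a mixture of Gaussian regressions, assuming for simplicity a univariate regressor and response and one noise variance per class; the updates are the standard weighted least squares M-step, written here only to illustrate the E/M alternation, not the multivariate algorithm of the thesis.

```python
import numpy as np
from scipy.stats import norm

def em_mixture_regression(x, y, K, n_iter=100, seed=0):
    """EM for y_i = beta_k x_i + eps_i, eps_i ~ N(0, sigma_k^2), univariate x and y."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta = rng.normal(size=K)
    sigma2 = np.full(K, np.var(y))
    pi = np.full(K, 1 / K)
    for _ in range(n_iter):
        # E-step: posterior probabilities tau_{ik}
        dens = np.stack([pi[k] * norm.pdf(y, beta[k] * x, np.sqrt(sigma2[k]))
                         for k in range(K)], axis=1)
        tau = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        # M-step: weighted updates of the proportions, slopes and variances
        nk = tau.sum(axis=0)
        pi = nk / n
        beta = (tau * (x * y)[:, None]).sum(axis=0) / (tau * (x ** 2)[:, None]).sum(axis=0)
        resid2 = (y[:, None] - np.outer(x, beta)) ** 2
        sigma2 = (tau * resid2).sum(axis=0) / nk
    return pi, beta, sigma2, tau
```

In the multivariate setting studied in this thesis, the same alternation applies, with weighted multivariate least squares updates in the M-step.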
In the first step, the observations are assigned to classes, and in the second step the parameter estimates are updated. For the general study of this algorithm, we may cite the book by McLachlan and Krishnan, [MK97]. One problematic point is the initialization of this algorithm: even though in many cases (ours for instance) one can show that the algorithm converges to the desired value, it has to be initialized properly. We may cite Biernacki, Celeux and Govaert, [BCG03], who describe various initialization strategies, or Yang, Lai and Lin in [YLL12], who propose a robust EM algorithm for the initialization.
All these problems, classical in the study of mixture models, appear in our framework. From a theoretical point of view, an important result for mixture models is identifiability. Recall that a parametric model is said to be identifiable if different parameter values generate different probability distributions. Here, mixture models are not identifiable, since the labels of the classes can be switched without changing the density (the so-called label switching). For a detailed account of these questions, let us cite for instance Titterington, in [TSM85].
From a sample (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n, we can build a mixture regression model with K classes, estimating the parameters for instance with the estimator (3). With this model, we can obtain a clustering of the data by computing the posterior probabilities. Each observation (x_i, y_i) is then assigned to a class k̂_i, and, thanks to the estimator (3), we have access to the link β̂_{k̂_i} between x_i and y_i and to the noise Σ̂_{k̂_i} associated with this class. As long as the maximum likelihood estimator has good properties (for instance when the sample size n is large), and provided the initialization and convergence issues of the EM algorithm are handled, the parameters of the model can be well estimated.
This algorithm has been generalized by Städler and coauthors in [SBG10] for the estimation of the parameters of a mixture of Gaussian regression models with univariate response. In this thesis, we have generalized it to multivariate data.
Variable selection and the Lasso estimator
We have been interested in an unsupervised clustering problem in regression for high-dimensional data, meaning that the random vectors X ∈ R^p and Y ∈ R^q may be of large size, possibly larger than the number of observations n. This problem is currently widely studied. Indeed, with the improvement of computing resources, high-dimensional data are more and more common, and more and more explanatory variables are available to describe the same object.
In the linear model setting, if p > n, β̂ can no longer be computed with formula (2): the matrix xᵗx is no longer invertible. In fact, in the linear model we seek to estimate pq + q² parameters, which may exceed the number of observations n when p and q are large.
Let us first restrict ourselves to the linear model. The Lasso estimator, introduced by Tibshirani in [Tib96], and in parallel by Chen and coauthors in signal theory in [CDS98], is a classical estimator to cope with high dimension. We may also cite the Dantzig selector, introduced by Candès and Tao in [CT07]; the Ridge estimator, which uses an ℓ2 penalty rather than the ℓ1 penalty exploited by the Lasso, introduced by Hoerl and Kennard in [HK70]; the Elastic net, introduced by Zou and Hastie in [ZH05], defined with a double ℓ1 and ℓ2 penalization, and which therefore makes a compromise between the Ridge and the Lasso; the Fused Lasso, introduced by Tibshirani in [TSR+05]; the adaptive Lasso, introduced by Zou in [Zou06]; and the Group-Lasso estimator, introduced by Yuan and Lin in [YLL06] to obtain groupwise sparsity, among others.
In this thesis, we focus on the Lasso (or Group-Lasso) estimator, even though some theoretical results remain valid for any variable selection method.
The idea introduced in [Tib96] is to assume that the matrix β in the linear model is sparse, which reduces the number of parameters to estimate. Indeed, if there are few nonzero coefficients, denoting by J ⊂ P({1, ..., q} × {1, ..., p}) the set of indices of the nonzero coefficients of the regression matrix and by |J| its cardinality, the sample size can exceed the number of parameters to estimate, which is then |J| + q². In the linear model (1), the Lasso estimator is defined by

\hat{\beta}^{Lasso}(\lambda) = \underset{\beta \in \mathbb{R}^{q \times p}}{\mathrm{argmin}} \left\{ \|Y - \beta X\|_2^2 + \lambda \|\beta\|_1 \right\} ;    (4)

with λ > 0 a regularization parameter to be specified. It can equivalently be defined by

\tilde{\beta}^{Lasso}(A) = \underset{\beta \in \mathbb{R}^{q \times p}, \; \|\beta\|_1 \leq A}{\mathrm{argmin}} \; \|Y - \beta X\|_2^2.
Figure 2 – Illustration of the Lasso estimator (left) and of the Ridge estimator (right). β̂ represents the least squares estimator, the contour lines correspond to the least squares error, the ℓ1 ball corresponds to the Lasso problem and the ℓ2 ball to the Ridge problem. This figure comes from [Tib96].
This estimator has been extensively studied in recent years. Let us cite Friedman, Hastie and Tibshirani, [HTF01], who studied the regularization path of the Lasso; Bickel, Ritov and Tsybakov, [BRT09], who studied this estimator in comparison with the Dantzig selector; and Osborne, [OPT99], who studies the Lasso estimator and its dual form.
Note that the ℓ1 norm appears here as a convex relaxation of the ℓ0 penalty (where ‖β‖_0 = Card{j | β_j ≠ 0}), which aims at estimating coefficients by 0.
Thanks to the geometry of the ℓ1 ball, the Lasso tends to set coefficients to zero, as can be seen in Figure 2. The larger λ, the stronger the penalization, and the more coefficients of β̂^{Lasso}(λ) are zero. If λ = 0, we recover the maximum likelihood estimator.
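A hedged sketch of how β̂^{Lasso}(λ) in (4) can be approximated numerically, here by proximal gradient descent with soft-thresholding (ISTA); the step size and iteration count are illustrative choices, and this is not the path algorithm cited below.

```python
import numpy as np

def soft_threshold(a, t):
    """Proximal operator of t * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_ista(x, y, lam, n_iter=500):
    """Approximate argmin_B ||y - x B^t||_2^2 + lam * ||B||_1 for a (q x p) matrix B."""
    n, p = x.shape
    q = y.shape[1]
    B = np.zeros((q, p))
    step = 1.0 / (2 * np.linalg.norm(x, 2) ** 2)     # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -2 * (y - x @ B.T).T @ x              # gradient of the squared error in B
        B = soft_threshold(B - step * grad, step * lam)
    return B
```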
Let us summarize the main results obtained in recent years for the Lasso estimator. This estimator can easily be approximated by the LARS algorithm, introduced in [EHJT04]. The convex relaxation of the ℓ0 penalty by the ℓ1 penalty makes this estimator numerically easier to approach. It is known to be piecewise linear, and its values can be made explicit thanks to the Karush-Kuhn-Tucker conditions; see for instance [EHJT04] or [ZHT07].
From a theoretical point of view, under more or less strong assumptions, there exist oracle inequalities for the prediction error or for the ℓq error of the estimated coefficients, with a regularization parameter of order √(log(p)/n). We may cite for instance Bickel, Ritov and Tsybakov, [BRT09], who obtain an oracle inequality for the prediction risk in a general nonparametric regression model, and an oracle inequality for the ℓp estimation loss (1 ≤ p ≤ 2) in the linear model. The required assumptions and the results obtained in this direction are summarized by van de Geer and Bühlmann in [vdGB09].
It should also be noted that the Lasso estimator has good variable selection properties. Let us cite for instance Meinshausen and Bühlmann [MB06], Zhang and Huang, [ZH08], Zhao and Yu, [ZY06], or Meinshausen and Yu, [MY09], who show that the Lasso is consistent for variable selection under various more or less restrictive assumptions.
It therefore seems consistent to use the Lasso estimator to select the important variables.
In mixtures of Gaussian linear models, the definition of the Lasso estimator can be extended as

\hat{\xi}^{Lasso}(\lambda) = \underset{\xi \in \Xi_K}{\mathrm{argmin}} \left\{ -\frac{1}{n} l_\lambda(\xi, x, y) \right\} ;    (5)

where

l_\lambda(\xi, x, y) = l(\xi, x, y) - \lambda \sum_{k=1}^{K} \pi_k \|P_k \beta_k\|_1 ; \qquad \Xi_K = \Pi_K \times (\mathbb{R}^{q \times p})^K \times (S_q^{++})^K ;

and P_k is the Cholesky root of the inverse of the covariance matrix, i.e. P_k^t P_k = Σ_k^{-1}, for every k ∈ {1, ..., K}.
This definition was introduced by Städler and coauthors in [SBG10]. The penalty differs from that of the estimator (4). First, we do not penalize the conditional mean in each class, but a version reparametrized by the variance. Indeed, in mixture models it is important to have a good estimator of the variance, and to estimate it together with the mean, so as not to favour classes with too large a variance. Moreover, to obtain an estimator that is invariant under rescaling, the variance has to be taken into account in the ℓ1 penalty. Next, working with the Cholesky root of the inverse covariance matrix rather than with the covariance matrix itself yields a convex optimization problem, which eases the algorithmic part. Finally, we penalize the likelihood by the sum over the classes of the variance-reparametrized mean estimator, weighted by the class proportions, to take into account the difference in size between the classes.
Note that variable selection in mixture models has already been considered in estimation problems; let us cite for instance Raftery and Dean, [RD06], or Maugis and Michel in [MM11b].
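To make the penalty of (5) concrete, here is a short sketch computing Σ_k π_k ‖P_k β_k‖_1 from given mixture parameters; the Cholesky factor is taken of Σ_k^{-1}, as in the definition above, and all inputs are illustrative.

```python
import numpy as np

def mixture_lasso_penalty(pi, beta, Sigma):
    """Compute sum_k pi_k * ||P_k beta_k||_1 with P_k^t P_k = Sigma_k^{-1}.

    pi: (K,), beta: (K, q, p), Sigma: (K, q, q).
    """
    total = 0.0
    for k in range(len(pi)):
        inv_Sigma = np.linalg.inv(Sigma[k])
        # np.linalg.cholesky returns L with L L^t = inv_Sigma; L^t then plays the role of P_k
        P_k = np.linalg.cholesky(inv_Sigma).T
        total += pi[k] * np.abs(P_k @ beta[k]).sum()
    return total
```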
The estimator (5) can be approximated algorithmically with a generalization of the EM algorithm, introduced by Städler and coauthors in the univariate case and extended in this thesis.
In the framework of mixture models in regression, mainly two theoretical results are known for the Lasso estimator, valid for a real-valued Y and for a fixed and known number of classes K. For Y ∈ R and X ∈ R^p, Städler and coauthors, in [SBG10], showed that, under the restricted eigenvalue condition (denoted REC, stated below), the Lasso estimator satisfies an oracle inequality for fixed covariates. Let us recall the univariate Gaussian mixture regression setting. If Y, conditionally on X, belongs to class k, we write Y = β_k X + ε, with ε ∼ N(0, σ_k²). We further write φ_k = σ_k^{-1} β_k, and J for the set of indices of the nonzero coefficients of the regression matrix.

REC assumption. There exists κ ≥ 1 such that, for every φ ∈ (R^p)^K satisfying ‖φ_{J^c}‖_1 ≤ 6‖φ_J‖_1,

\|\varphi_J\|_2^2 \leq \kappa^2 \sum_{k=1}^{K} \varphi_k^t \hat{\Sigma}_x \varphi_k, \qquad \text{where } \hat{\Sigma}_x = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^t.

In the same setting, but without the REC assumption, Meynet, in [Mey13], proved an ℓ1 oracle inequality for the Lasso estimator.
In this thesis, we have been interested in the ℓ1-regularization properties of the Lasso estimator, in our framework of multivariate mixture regression models. The explanatory variables x = (x_1, ..., x_n) are fixed. Without loss of generality, we may assume that x_i ∈ [0,1]^p for every i ∈ {1, ..., n}. We assume that there exist positive reals (A_β, a_Σ, A_Σ, a_π) defining the parameter set

\tilde{\Xi}_K = \Big\{ \xi \in \Xi_K \;\Big|\; \text{for all } k \in \{1, \ldots, K\}, \; \max_{z \in \{1, \ldots, q\}} \sup_{x \in [0,1]^p} |[\beta_k x]_z| \leq A_\beta, \; a_\Sigma \leq m(\Sigma_k^{-1}) \leq M(\Sigma_k^{-1}) \leq A_\Sigma, \; a_\pi \leq \pi_k \Big\} ;    (6)

where m(A) and M(A) denote respectively the absolute value of the smallest and of the largest eigenvalue of the matrix A. For ξ = (π, β, Σ) ∈ Ξ_K, let

N_1^{[2]}(s_\xi^K) = \|\beta\|_1 = \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{z=1}^{q} |[\beta_k]_{z,j}|    (7)

be the penalty under consideration, and let KL_n be the Kullback-Leibler divergence with fixed design:

KL_n(s, t) = \frac{1}{n} \sum_{i=1}^{n} KL(s(.|x_i), t(.|x_i)) = \frac{1}{n} \sum_{i=1}^{n} E_s\left[ \log \frac{s(.|x_i)}{t(.|x_i)} \right].

Here is the theorem that we obtain.
ℓ1 oracle inequality for the Lasso. Let (x, y) = (x_i, y_i)_{1≤i≤n} be the observations, coming from an unknown conditional density s* = s_{ξ_0}, where ξ_0 ∈ Ξ̃_K, this set being defined by equation (6), the number of classes K being fixed. Write a ∨ b = max(a, b). Let N_1^{[2]}(s_ξ^K) be defined by (7). For λ ≥ 0, define the Lasso estimator, denoted ŝ^{Lasso}(λ), by

\hat{s}^{Lasso}(\lambda) = \underset{s_\xi^K \in S}{\mathrm{argmin}} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log(s_\xi^K(y_i|x_i)) + \lambda N_1^{[2]}(s_\xi^K) \right\} ;    (8)

with S = { s_ξ^K, ξ ∈ Ξ̃_K }. If

\lambda \geq \kappa \left( A_\Sigma \vee \frac{1}{a_\pi} \right) \left( 1 + 4(q+1) A_\Sigma \left( A_\beta^2 + \frac{1}{a_\Sigma} \right) \right) (1 + q \log(n)) \, K \sqrt{\frac{\log(2p+1) \log(n)}{n}},

with κ a positive constant, then the estimator (8) satisfies the following inequality:

E[KL_n(s^*, \hat{s}^{Lasso}(\lambda))] \leq (1 + \kappa^{-1}) \inf_{s_\xi^K \in S} \left( KL_n(s^*, s_\xi^K) + \lambda N_1^{[2]}(s_\xi^K) \right) + \lambda
+ \kappa' \, \frac{2q \, K e^{-1/2} \pi^{q/2} a_\pi \, p}{A_\Sigma^{q/2} \, n}
+ \kappa' \, K \frac{\log(n)^{3/2}}{\sqrt{n}} \left( A_\Sigma \vee \frac{1}{a_\pi} \right) \left( 1 + 4(q+1) A_\Sigma \left( A_\beta^2 + \frac{1}{a_\Sigma} \right) \right) \left( 1 + A_\beta + \frac{q}{a_\Sigma} \right)^2 ;

where κ' is a positive constant.
This theorem can be seen as an ℓ1 oracle inequality, but this is not the viewpoint we wish to develop here. Indeed, the proof of this theorem relies on a model selection theorem, which will be addressed in the corresponding part. Here, we rather see this theorem as a guarantee that the Lasso estimator behaves well for ℓ1 regularization. The particularity of this result is that it requires few assumptions: we work with fixed predictors, assumed (without loss of generality) to lie in [0,1]^p, and the parameters of our conditional densities are assumed bounded, in the sense that they belong to Ξ̃_K. This assumption is necessary, in particular to ensure that the likelihood is finite. It will appear again in the other theorems proved in this thesis. Note also that the bound on the regularization parameter λ is not the optimal, classical one obtained in other more general settings, but this is due to the fact that we make no assumption on the regressors. The article by van de Geer, [vdG13], yields this optimal bound under stronger assumptions on the design. Note that the design is assumed fixed here.
Since the Lasso estimator overestimates the parameters (let us cite for instance Fan and Peng, [FP04], or Zhang [Zha10]), we propose to use it for variable selection, and not for parameter estimation. Thus, for a regularization parameter λ ≥ 0 to be defined, we select the variables that are important to explain Y from X. More formally, let us define the notion of a variable that is active for the clustering.

Definition. A variable is active for the clustering if it is nonzero in at least one class: the variable with index (z, j) ∈ {1, ..., q} × {1, ..., p} is active if there exists k ∈ {1, ..., K} such that [β_k]_{z,j} ≠ 0.

Thus, for a given λ ≥ 0, we can compute ξ̂^{Lasso}(λ) and deduce the set J_λ of variables that are active for the clustering.
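A minimal sketch of how J_λ can be read off the estimated regression matrices (the numerical tolerance is an implementation detail, not part of the definition):

```python
import numpy as np

def active_variables(beta_hat, tol=1e-8):
    """Return the set J of indices (z, j) that are nonzero in at least one class.

    beta_hat: array of shape (K, q, p), the Lasso estimates of the regression matrices.
    """
    active_mask = (np.abs(beta_hat) > tol).any(axis=0)        # shape (q, p)
    return {(z, j) for z, j in zip(*np.nonzero(active_mask))}
```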
Re-estimation
Restricting ourselves to the variables indexed by J_λ selected by the Lasso estimator with regularization parameter λ ≥ 0, we work with a model of smaller dimension. Indeed, instead of (pq + q² + 1)K − 1 parameters to estimate, there remain (|J_λ| + q² + 1)K − 1. If we further assume that the covariance matrix is diagonal (which implies that the variables are uncorrelated), we obtain a model of dimension (|J_λ| + q + 1)K − 1, and the model dimension can become smaller than the number of observations, or at least reasonable. We can then use another estimator, restricted to the selected variables, which has good estimation properties in reasonable dimension (better than those of the Lasso estimator).
Re-estimating the parameters on the selected variables is not a new idea. We want to take advantage of variable selection by the Lasso estimator (or by another technique), but we also want to reduce the bias induced by this estimator. Let us cite for instance Belloni and Chernozhukov, [BC11], who obtain an oracle inequality showing that the maximum likelihood estimator computed on the variables selected by the Lasso performs better than the Lasso estimator, for a high-dimensional linear model. We may also cite Sun and Zhang, [SZ12], who estimate the noise and the regression matrix in a high-dimensional linear model by the least squares estimator after model selection.
In a first procedure, called the Lasso-MLE procedure, we propose to estimate the parameters, restricted to the active variables, by the maximum likelihood estimator, which has good properties for a sufficiently large sample.
In a second procedure, which we call the Lasso-Rank procedure, we also propose to use maximum likelihood with a low-rank constraint. Indeed, so far we have not taken the matrix structure of β into account. Since the covariance matrices (Σ_k)_{1≤k≤K} are assumed diagonal, we could have treated each coordinate of Y as q distinct and independent problems. By looking for a low-rank structure, we assume that few linear combinations of predictors suffice to explain the response. It is also a second way to reduce the dimension, in case variable selection by ℓ1 penalization is not sufficient. Some datasets are particularly suited to this low-rank dimension reduction. We may cite for instance fMRI image analysis (Harrison, Penny, Friston, [FHP03]), the analysis of EEG decoding data (Anderson, Stolz, Shamsunder, [ASS98]), the modeling of neural responses (Brown, Kass, Mitra, [BKM04]), or the analysis of genomic data (Bunea, She, Wegkamp, [BSW11]). From a more theoretical point of view, we may cite Izenman ([Ize75]), who introduced this method in the linear model, and Giraud ([Gir11]) or Bunea and coauthors ([Bun08]), who completed the theoretical and practical study of rank selection. In this thesis, we employ these methods in a mixture regression framework. Note that the Lasso estimator will have selected columns, or rows, or both, but rows or columns in which only some coefficients are inactive will not be selected, the matrix structure being required to estimate a parameter by low rank. We do not impose that all the conditional means have the same rank.
Denote by ξ̂_J^{MLE} and ξ̂_J^{Rank} the estimators associated with each procedure, with J the set of active variables:

\hat{\xi}_J^{MLE} = \underset{\xi \in \Xi_{(K,J)}}{\mathrm{argmin}} \left\{ -\frac{1}{n} l(\xi^{[J]}, x, y) \right\} ;    (9)

\hat{\xi}_J^{Rank} = \underset{\xi \in \check{\Xi}_{(K,J,R)}}{\mathrm{argmin}} \left\{ -\frac{1}{n} l(\xi^{[J]}, x, y) \right\} ;    (10)

where Ξ̌_{(K,J,R)} = { ξ = (π, β, Σ) ∈ Ξ_{(K,J)} | rank(β_k) = R_k for all k ∈ {1, ..., K} }, where ξ^{[J]} means that the variables indexed by J have been selected, and where Ξ_{(K,J)} is defined by

\Xi_{(K,J)} = \Pi_K \times (\mathbb{R}^{q \times p})^K \times (S_q^{++})^K, \qquad \Pi_K = \left\{ (\pi_1, \ldots, \pi_K) \in (0,1)^K \;\Big|\; \sum_{k=1}^{K} \pi_k = 1 \right\}.

Note that we re-estimate all the parameters of our model: the conditional means, the variances and the weights.
From a practical point of view, the generalization of the EM algorithm (Algorithm 1) makes it possible to compute the maximum likelihood estimator, with or without the low-rank constraint, in the Gaussian mixture regression case.
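As a hedged illustration of the rank constraint in (10), one simple way to enforce rank(β_k) = R_k is to truncate the singular value decomposition of an unconstrained update; this is only a sketch of that projection, not the full re-estimation procedure of the thesis.

```python
import numpy as np

def project_rank(beta_k, R_k):
    """Best rank-R_k approximation (in Frobenius norm) of the (q x p) matrix beta_k."""
    U, s, Vt = np.linalg.svd(beta_k, full_matrices=False)
    s[R_k:] = 0.0                         # keep only the R_k largest singular values
    return (U * s) @ Vt

# Example: project each cluster's regression matrix onto its prescribed rank
rng = np.random.default_rng(2)
beta_hat = rng.normal(size=(3, 4, 10))    # K=3 clusters, q=4, p=10 (illustrative sizes)
ranks = [1, 2, 2]
beta_low_rank = np.stack([project_rank(beta_hat[k], ranks[k]) for k in range(len(ranks))])
```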
Model selection
For a fixed regularization parameter λ ≥ 0, after re-estimating the parameters we obtain a model associated with our observations (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n which has a reasonable dimension and is well estimated, provided λ was well chosen. However, many choices had to be made to build this model. For the mixture model, the number of classes K may not be known beforehand, so it has to be chosen; for variable selection, building J_λ amounts to selecting a regularization parameter λ ≥ 0; in the low-rank re-estimation case, the vector of ranks of the components has to be selected.
In each of these three cases, various methods exist in the literature. For the Lasso regularization parameter λ, we may cite for instance the book by Bühlmann and van de Geer, [BvdG11], where a value of λ proportional to √(log(p)/n) is considered optimal.
For the choice of the number of classes K and of the vector of ranks R, many authors reduce the question to a model selection problem.
In this thesis, we view the parameter selection problem as a model selection problem, by building a collection of models with more or fewer classes and more or fewer active coefficients. It then remains to choose one model within this collection.
In full generality, denote by S = (S_m)_{m∈M} the collection of models under consideration, indexed by M. Note that, contrary to what one might expect, having too large a model collection can be harmful, for instance by selecting inconsistent (see Bahadur, [Bah58]) or suboptimal (see Birgé and Massart, [BM93]) estimators. This is what is called the model choice paradigm.
We consider a contrast function, denoted γ, such that s* = argmin_{t∈S} E(γ(t)). The associated loss function, denoted l, is defined by

l(s^*, t) = E(\gamma(t)) - E(\gamma(s^*)) \quad \text{for all } t \in S.

We also define the empirical contrast γ_n by

\gamma_n(t) = \frac{1}{n} \sum_{i=1}^{n} \gamma(t, x_i, y_i) \quad \text{for all } t \in S,

for a sample (x, y) = (x_i, y_i)_{1≤i≤n}. For the model m, we consider ŝ_m, the density minimizing the empirical contrast γ_n over S_m. It is this density that we use to represent this model. For example, we can take the log-likelihood as contrast and the Kullback-Leibler divergence as loss function.
The goal of model selection is to select the best estimator among the collection (ŝ_m)_{m∈M}. The best estimator can be defined as the estimator minimizing the risk with respect to the true density s*. This estimator, and the corresponding model, will be called the oracle in this thesis (see Donoho and Johnstone for instance, [DJ94]). We write

m_O = \underset{m \in M}{\mathrm{argmin}} \, [l(s^*, \hat{s}_m)].    (11)
Unfortunately, this quantity cannot be evaluated, since we do not have access to s*. The oracle is used in theory to assess our model selection: we want the risk associated with the estimator of the selected model to be as close as possible to that of the oracle. Part of the theoretical results in model selection are oracle inequalities, which guarantee the soundness of the model selection. These inequalities are of the form, for m̂ the index of the selected model,

E(l(s^*, \hat{s}_{\hat{m}})) \leq C_1 E(l(s^*, \hat{s}_{m_O})) + \frac{C_2}{n}    (12)

where (C_1, C_2) are absolute constants, with C_1 as close as possible to 1. The oracle inequality is said to be exact if C_1 = 1.
Let us now describe how to select a model. We minimize a penalized criterion, in order to reach a bias/variance trade-off. Indeed, the risk can be decomposed as follows:

l(s^*, \hat{s}_m) = \underbrace{l(s^*, s_m)}_{\text{bias}_m} + \underbrace{E(\gamma(s_m) - \gamma(\hat{s}_m))}_{\text{variance}_m} ;

where s_m ∈ argmin_{t∈S_m} [E(γ(t))] (it is one of the best approximations of s* within S_m). To minimize the bias, a complex model is needed, one that fits the data very closely; and to minimize the variance, models that are too complex must be avoided, so as not to overfit the data.
One way to account for this remark is to consider the empirical contrast penalized by the dimension: we penalize models that are too complex, which overfit our data. Let pen : M → R_+ be a penalty to be built; we select

\hat{m} = \underset{m \in M}{\mathrm{argmin}} \left\{ -\gamma_n(\hat{s}_m) + \mathrm{pen}(m) \right\}.
Akaike and Schwarz introduced this approach for likelihood estimation, see [Aka74] and [Sch78] respectively. They proposed the now classical AIC and BIC criteria, where the penalty is respectively

\mathrm{pen}_{AIC}(m) = D_m ; \qquad \mathrm{pen}_{BIC}(m) = \frac{\log(n) D_m}{2} ;

where D_m is the dimension of model m, and n is the size of the considered sample. These criteria are widely used today. Note that they rely on asymptotic approximations (AIC on Wilks' theorem, BIC on a Bayesian approach), and one should be wary of their behaviour in a non-asymptotic setting. Strictly speaking, they also assume a fixed model collection.
At the same time, Mallows, in [Mal73], studied this approach in the linear regression setting. He obtained

\mathrm{pen}_{Mallows}(m) = \frac{2 D_m \sigma^2}{n}

where σ² is the variance of the errors, assumed known.
Birgé and Massart, in [BM01], introduced the slope heuristic, a non-asymptotic methodology to select a model within a collection of models.
Let us describe the ideas of this heuristic. We look for a penalty close to m ∈ M ↦ l(s*, ŝ_m) − γ_n(ŝ_m). Since s* is unknown, we try to approximate this quantity:

l(s^*, \hat{s}_m) - \gamma_n(\hat{s}_m) = \underbrace{E(\gamma(\hat{s}_m)) - E(\gamma(s_m))}_{v_m} + \underbrace{E(\gamma(s_m)) - E(\gamma(s^*))}_{(1)} - \underbrace{(\gamma_n(\hat{s}_m) - \gamma_n(s_m))}_{\hat{v}_m} - \underbrace{(\gamma_n(s_m) - \gamma_n(s^*))}_{(2)} - \gamma_n(s^*)

where v_m can be seen as a variance term, and v̂_m as an empirical version of it. We define ∆_n(s_m) = (1) + (2), which corresponds to the difference between the bias term and its empirical version. If we choose pen(m) = v̂_m, we select a model that controls the bias but not the variance: we select a model that is too complex. This penalty is minimal: if we set pen(m) = κ v̂_m, then for κ < 1 we choose a model that is too complex, while for κ > 1 the model dimension is more reasonable.
In fact, the optimal penalty is twice the minimal penalty. Since v̂_m is the empirical version of v_m, we have v_m ≈ v̂_m. Since ∆_n(s_m) has zero expectation, its fluctuations can be controlled. We therefore want to choose pen(m) = 2v_m.
Thus, on a dataset, using the slope heuristic we can find the penalty that allows us to select a model: either we look for the largest dimension jump, or we look at the asymptotic slope of γ_n(s_m), which gives the minimal penalty, and it suffices to multiply it by two to obtain the optimal penalty.
Figures 3 and 4 illustrate these ideas.
Figure 3 – Illustration of the slope heuristic: κ is estimated by κ̂, the location of the largest dimension jump. We then select the model minimizing the log-likelihood penalized by 2κ̂.

Figure 4 – Illustration of the slope heuristic: κ is estimated by κ̂, the asymptotic slope of the log-likelihood as a function of the model dimension.
From a practical point of view, we use the Capushe package, developed by Baudry and coauthors in [BMM12] for the Matlab software.
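A hedged sketch of the dimension-jump calibration described above (independent of Capushe): for a grid of κ values we record the dimension of the selected model, estimate κ̂ at the largest jump, and then select with the penalty 2κ̂ D_m. The grid bounds are arbitrary choices for the illustration.

```python
import numpy as np

def slope_heuristic_dimension_jump(neg_loglik, dims, kappa_grid=None):
    """Select a model by the dimension-jump version of the slope heuristic.

    neg_loglik: (M,) empirical risks -gamma_n(s_hat_m); dims: (M,) model dimensions D_m.
    """
    neg_loglik, dims = np.asarray(neg_loglik, float), np.asarray(dims, float)
    if kappa_grid is None:
        kappa_grid = np.linspace(0.0, 2 * np.ptp(neg_loglik) / max(np.ptp(dims), 1.0), 200)
    # dimension of the model selected for each candidate kappa
    selected_dims = np.array([dims[np.argmin(neg_loglik + k * dims)] for k in kappa_grid])
    jump = np.argmax(np.abs(np.diff(selected_dims)))       # largest dimension jump
    kappa_hat = kappa_grid[jump + 1]
    m_hat = np.argmin(neg_loglik + 2 * kappa_hat * dims)   # optimal penalty = 2 * minimal
    return m_hat, kappa_hat
```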
From a theoretical point of view, we have obtained oracle inequalities which justify the model selection step of each of our procedures.
Let us state the general theorem from [Mas07] on which our theoretical results are based. We work with the log-likelihood as empirical contrast. We denote by KL the Kullback-Leibler divergence, defined by

KL(s, t) = \begin{cases} E_s\left[ \log \frac{s}{t} \right] & \text{if } s \ll t, \\ +\infty & \text{otherwise.} \end{cases}

First, we need a structural assumption. It is a condition on the bracketing entropy of the model S_m with respect to the Hellinger distance, defined by

(d_H(s, t))^2 = \frac{1}{2} \int (\sqrt{s} - \sqrt{t})^2.

A bracket [l, u] is a pair of functions such that, for all y, l(y) ≤ s(y) ≤ u(y). For ε > 0, the bracketing entropy H_{[.]}(ε, S, d_H) of a set S is defined as the logarithm of the minimal number of brackets [l, u] of width d_H(l, u) smaller than ε such that every density of S belongs to one of these brackets.
Let m ∈ M.

Assumption H_m. There exists an increasing function φ_m such that ̟ ↦ φ_m(̟)/̟ is decreasing on (0, +∞) and such that, for all ̟ ∈ R_+ and all s_m ∈ S_m,

\int_0^{\varpi} \sqrt{H_{[.]}(\epsilon, S_m(s_m, \varpi), d_H)} \, d\epsilon \leq \phi_m(\varpi) ;

where S_m(s_m, ̟) = {t ∈ S_m, d_H(t, s_m) ≤ ̟}. The model complexity D_m is then defined by n̟_m², with ̟_m the unique solution of

\frac{1}{\varpi} \phi_m(\varpi) = \sqrt{n} \, \varpi.    (13)

Note that the model complexity does not depend on the bracketing entropy of the global models S_m, but on that of smaller, localized sets. This is a weaker assumption.
For technical reasons, a separability assumption is also needed.

Assumption Sep_m. There exists a countable subset S_m' of S_m and a set Y_m' with λ(R^q \ Y_m') = 0, λ denoting the Lebesgue measure, such that for every t ∈ S_m there exists a sequence (t_l)_{l≥1} of elements of S_m' such that, for every y ∈ Y_m', log(t_l(y)) tends to log(t(y)) as l tends to infinity.

We also need an information-theoretic assumption on our model collection.

Assumption K. The family of positive numbers (w_m)_{m∈M} satisfies

\sum_{m \in M} e^{-w_m} \leq \Omega < +\infty.
We can then state the general model selection theorem.

Oracle inequality for a family of MLEs. Let (X_1, ..., X_n) be random variables with unknown density s*. We observe a realization (x_1, ..., x_n). Let {S_m}_{m∈M} be an at most countable collection of models, where, for every m ∈ M, the elements of S_m are probability densities and S_m satisfies assumption Sep_m. Consider moreover the collection of ρ-approximate maximum likelihood estimators (ŝ_m)_{m∈M}: for every m ∈ M,

-\frac{1}{n} \sum_{i=1}^{n} \ln(\hat{s}_m(x_i)) \leq \inf_{t \in S_m} \left( -\frac{1}{n} \sum_{i=1}^{n} \ln(t(x_i)) \right) + \rho.

Let {w_m}_{m∈M} be a family of positive numbers satisfying assumption K, and, for every m ∈ M, consider φ_m satisfying condition H_m, with ̟_m the unique positive solution of the equation φ_m(̟) = √n ̟². Assume moreover that assumption Sep_m holds for every m ∈ M.
Let pen : M → R_+ and consider the penalized log-likelihood criterion

\mathrm{crit}(m) = -\frac{1}{n} \sum_{i=1}^{n} \ln(\hat{s}_m(x_i)) + \mathrm{pen}(m).

Then there exist constants κ and C such that, whenever

\mathrm{pen}(m) \geq \kappa \left( \varpi_m^2 + \frac{w_m}{n} \right)

for every m ∈ M, there exists m̂ minimizing the criterion crit over M, and moreover

E_{s^*}\left( d_H^2(s^*, \hat{s}_{\hat{m}}) \right) \leq C \left( \inf_{m \in M} \left( \inf_{t \in S_m} KL(s^*, t) + \mathrm{pen}(m) \right) + \rho + \frac{\Omega}{n} \right)

where d_H is the Hellinger distance and KL the Kullback-Leibler divergence.
This theorem tells us that, if our model collection is well built (that is, satisfies assumptions H_m, K and Sep_m), we can find a penalty such that the model minimizing the penalized criterion satisfies an oracle inequality.
This approach has already been considered to select the number of classes of a mixture model. We may cite for instance Maugis and Michel, [MM11b], or Maugis and Meynet, [MMR12]. These authors view the problem of selecting the number of components and selecting variables as a model selection problem.
For the rank, Giraud, in [Gir11], and Bunea, in [Bun08], propose a penalty to choose the rank optimally. They obtain, for known and unknown variance, oracle inequalities for rank selection, where the penalty is proportional to the rank. Ma and Sun, in [MS14], obtain a minimax bound for these models, which amounts to saying that the constructed penalty is optimal for rank selection.
The estimation procedures described in the previous part depend on parameter choices (the number of classes, the Lasso regularization parameter, and the rank in the second procedure). The selection of these parameters can be recast as a model selection problem: by varying these parameters, we obtain a collection of models.
Let us begin by defining the model collections associated with each of our procedures.
For the Lasso-MLE procedure, for (K, J) ∈ 𝒦 × 𝒥,

S_{(K,J)} = \left\{ y \in \mathbb{R}^q \mapsto s_\xi^{(K,J)}(y|x) \right\}    (14)

s_\xi^{(K,J)}(y|x) = \sum_{k=1}^{K} \frac{\pi_k}{(2\pi)^{q/2} \det(\Sigma_k)^{1/2}} \exp\left( -\frac{1}{2} (y - \beta_k^{[J]} x)^t \Sigma_k^{-1} (y - \beta_k^{[J]} x) \right)

\xi = (\pi_1, \ldots, \pi_K, \beta_1^{[J]}, \ldots, \beta_K^{[J]}, \Sigma_1, \ldots, \Sigma_K) \in \Xi_{(K,J)}

\Xi_{(K,J)} = \Pi_K \times (\mathbb{R}^{q \times p})^K \times (S_q^{++})^K.
For the Lasso-Rank procedure, for (K, J, R) ∈ 𝒦 × 𝒥 × ℛ,

S_{(K,J,R)} = \left\{ y \in \mathbb{R}^q \mapsto s_\xi^{(K,J,R)}(y|x) \right\}    (15)

s_\xi^{(K,J,R)}(y|x) = \sum_{k=1}^{K} \frac{\pi_k}{(2\pi)^{q/2} \det(\Sigma_k)^{1/2}} \exp\left( -\frac{1}{2} (y - (\beta_k^{R_k})^{[J]} x)^t \Sigma_k^{-1} (y - (\beta_k^{R_k})^{[J]} x) \right)

\xi = (\pi_1, \ldots, \pi_K, (\beta_1^{R_1})^{[J]}, \ldots, (\beta_K^{R_K})^{[J]}, \Sigma_1, \ldots, \Sigma_K) \in \check{\Xi}_{(K,J,R)}

\check{\Xi}_{(K,J,R)} = \Pi_K \times \Psi_{(K,J,R)} \times (S_q^{++})^K

\Psi_{(K,J,R)} = \left\{ ((\beta_1^{R_1})^{[J]}, \ldots, (\beta_K^{R_K})^{[J]}) \in (\mathbb{R}^{q \times p})^K \;\Big|\; \text{for all } k \in \{1, \ldots, K\}, \; \mathrm{rank}(\beta_k) = R_k \right\} ;

where 𝒦 is the set of possible values for the number of components, 𝒥 is the set of possible sets of active variable indices, and ℛ is the set of possible values for the rank vectors.
To obtain theoretical results, we need to bound the parameters. We consider

S_{(K,J)}^B = \left\{ s_\xi^{(K,J)} \in S_{(K,J)} \;\Big|\; \xi \in \tilde{\Xi}_{(K,J)} \right\}    (16)

\tilde{\Xi}_{(K,J)} = \Pi_K \times ([-A_\beta, A_\beta]^{|J|})^K \times ([a_\Sigma, A_\Sigma]^q)^K    (17)

and

S_{(K,J,R)}^B = \left\{ s_\xi^{(K,J,R)} \in S_{(K,J,R)} \;\Big|\; \xi \in \tilde{\Xi}_{(K,J,R)} \right\}    (18)

\tilde{\Xi}_{(K,J,R)} = \Pi_K \times \tilde{\Psi}_{(K,J,R)} \times ([a_\Sigma, A_\Sigma]^q)^K    (19)

\tilde{\Psi}_{(K,J,R)} = \left\{ (\beta_1^{R_1}, \ldots, \beta_K^{R_K}) \in \Psi_{(K,J,R)} \;\Big|\; \text{for all } k \in \{1, \ldots, K\}, \; \beta_k^{R_k} = \sum_{l=1}^{R_k} \sigma_l u_l v_l^t, \; \sigma_l < A_\sigma \right\}.
Since we work in a regression setting, we define a tensorized version of the Kullback-Leibler divergence,

KL^{\otimes n}(s,t) = E\left[ \frac{1}{n} \sum_{i=1}^{n} KL\big( s(\cdot|x_i), t(\cdot|x_i) \big) \right];

and a tensorized version of the Hellinger distance,

\big( d_H^{\otimes n} \big)^2(s,t) = E\left[ \frac{1}{n} \sum_{i=1}^{n} d_H^2\big( s(\cdot|x_i), t(\cdot|x_i) \big) \right].

We also need the Jensen-Kullback-Leibler divergence, defined, for ρ ∈ (0,1), by

JKL_\rho(s,t) = \frac{1}{\rho} KL\big( s, (1-\rho)s + \rho t \big);

and its tensorized version

JKL_\rho^{\otimes n}(s,t) = E\left[ \frac{1}{n} \sum_{i=1}^{n} JKL_\rho\big( s(\cdot|x_i), t(\cdot|x_i) \big) \right].
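As a purely illustrative complement (not part of the thesis procedures), the following minimal Python sketch approximates the tensorized Kullback-Leibler divergence for two single-component conditional Gaussian densities, combining the closed-form KL between Gaussians with an average over a sample of covariates; all names and numerical values are hypothetical.

```python
import numpy as np

def kl_gaussian(mu1, Sigma1, mu2, Sigma2):
    """Closed-form KL( N(mu1, Sigma1) || N(mu2, Sigma2) )."""
    q = mu1.shape[0]
    inv2 = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    return 0.5 * (np.trace(inv2 @ Sigma1) + diff @ inv2 @ diff - q
                  + logdet2 - logdet1)

def tensorized_kl(beta1, Sigma1, beta2, Sigma2, x_sample):
    """Monte-Carlo approximation of the tensorized KL divergence between two
    single-component conditional Gaussian densities s(.|x) = N(beta1 x, Sigma1)
    and t(.|x) = N(beta2 x, Sigma2), averaging over a sample of covariates."""
    kls = [kl_gaussian(beta1 @ x, Sigma1, beta2 @ x, Sigma2) for x in x_sample]
    return np.mean(kls)

# toy usage: q = 2 responses, p = 3 covariates in [0, 1]^p
rng = np.random.default_rng(0)
x_sample = rng.uniform(0.0, 1.0, size=(100, 3))
beta1, beta2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
Sigma = np.eye(2)
print(tensorized_kl(beta1, Sigma, beta2, Sigma, x_sample))
```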
The number of classes is assumed to be unknown and is estimated here, in contrast with the ℓ1-oracle inequality for the Lasso. We assume that the covariates, although random, are bounded; to ease the reading, we assume that X ∈ [0,1]^p.
In this thesis, we then obtain the following two theorems.
Lasso-MLE oracle inequality. Let (x_i, y_i)_{i∈{1,…,n}} be observations arising from an unknown conditional density s^*. Let S_{(K,J)} be defined by (14). Consider J^L ⊂ J the subcollection of index sets constructed along the regularization path of the Lasso estimator. For (K,J) ∈ K × J^L, denote by S^B_{(K,J)} the model defined by (16).
Consider the maximum likelihood estimator

\hat{s}^{(K,J)} = \operatorname*{argmin}_{s_\xi^{(K,J)} \in S^B_{(K,J)}} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log\big( s_\xi^{(K,J)}(y_i|x_i) \big) \right\}.

Denote by D_{(K,J)} the dimension of the model S^B_{(K,J)}, D_{(K,J)} = K(|J| + q + 1) − 1. Let \bar{s}_\xi^{(K,J)} ∈ S^B_{(K,J)} be such that, for δ_{KL} > 0,

KL^{\otimes n}\big( s^*, \bar{s}_\xi^{(K,J)} \big) \leq \inf_{t \in S^B_{(K,J)}} KL^{\otimes n}(s^*, t) + \frac{\delta_{KL}}{n};

and let τ > 0 be such that \bar{s}_\xi^{(K,J)} ≥ e^{−τ} s^*. Let pen : K × J → R_+, and assume that there exists an absolute constant κ > 0 such that, for all (K,J) ∈ K × J,

\operatorname{pen}(K,J) \geq \kappa \frac{D_{(K,J)}}{n} \left[ B^2(A_\beta, A_\Sigma, a_\Sigma) - \log\left( \frac{D_{(K,J)}}{n} B^2(A_\beta, A_\Sigma, a_\Sigma) \wedge 1 \right) + (1 \vee \tau) \log\left( \frac{4epq}{(D_{(K,J)} - q^2) \wedge pq} \right) \right];

where the constants A_β, A_Σ, a_Σ are defined by (17). If we select the model indexed by (K̂, Ĵ), where

(\hat{K}, \hat{J}) = \operatorname*{argmin}_{(K,J) \in K \times J^L} \left\{ - \sum_{i=1}^{n} \log\big( \hat{s}^{(K,J)}(y_i|x_i) \big) + \operatorname{pen}(K,J) \right\},

then the estimator \hat{s}^{(\hat{K},\hat{J})} satisfies, for all ρ ∈ (0,1),

E\left[ JKL_\rho^{\otimes n}\big( s^*, \hat{s}^{(\hat{K},\hat{J})} \big) \right] \leq C\, E\left[ \inf_{(K,J) \in K \times J^L} \left( \inf_{t \in S_{(K,J)}} KL^{\otimes n}(s^*, t) + \operatorname{pen}(K,J) \right) \right] + \frac{4}{n},

for an absolute constant C.
This theorem provides a theoretical penalty for which the model minimizing the penalized criterion has good estimation properties. The constants, although not optimal, are explicit as functions of the bounds on the parameter space. The penalty is almost proportional (up to a logarithmic term) to the dimension of the model. The logarithmic term was studied in Meynet's thesis [Mey12]. In practice, we take the penalty proportional to the dimension.
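To fix ideas on the order of magnitude involved, recall that the model dimension is D_{(K,J)} = K(|J| + q + 1) − 1; with, say, K = 2 components, |J| = 4 relevant variables and a response of size q = 10 (values of the same order as in the simulations of Chapter 1), this gives

\[ D_{(2,J)} = 2\,(4 + 10 + 1) - 1 = 29, \]

so that in practice the penalty is taken proportional to 29/n, the proportionality constant being calibrated by the slope heuristic.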
Lasso-Rank oracle inequality. Let (x_i, y_i)_{i∈{1,…,n}} be observations arising from an unknown conditional density s^*. For (K,J,R) ∈ K × J × R, let S_{(K,J,R)} be defined by (15) and S^B_{(K,J,R)} by (18). Let J^L ⊂ J be a subcollection constructed along the regularization path of the Lasso estimator.
Let \bar{s}^{(K,J,R)} ∈ S^B_{(K,J,R)} be such that, for δ_{KL} > 0,

KL^{\otimes n}\big( s^*, \bar{s}^{(K,J,R)} \big) \leq \inf_{t \in S^B_{(K,J,R)}} KL^{\otimes n}(s^*, t) + \frac{\delta_{KL}}{n},

and such that there exists τ > 0 with

\bar{s}^{(K,J,R)} \geq e^{-\tau} s^*.   (20)

Consider the collection of estimators \{\hat{s}^{(K,J,R)}\}_{(K,J,R) \in K \times J \times R} in S^B_{(K,J,R)} satisfying

\hat{s}^{(K,J,R)} = \operatorname*{argmin}_{s_\xi^{(K,J,R)} \in S^B_{(K,J,R)}} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log\big( s_\xi^{(K,J,R)}(y_i|x_i) \big) \right\}.

Denote by D_{(K,J,R)} the dimension of the model S^B_{(K,J,R)}. Let pen : K × J × R → R_+ be such that, for all (K,J,R) ∈ K × J × R,

\operatorname{pen}(K,J,R) \geq \kappa \frac{D_{(K,J,R)}}{n} \left[ 2 B^2(A_\beta, A_\Sigma, a_\Sigma, A_\sigma) - \log\left( \frac{D_{(K,J,R)}}{n} B^2(A_\beta, A_\Sigma, a_\Sigma, A_\sigma) \wedge 1 \right) + \log\left( \frac{4epq}{(D_{(K,J,R)} - q^2) \wedge pq} \right) + \sum_{k=1}^{K} R_k \right],

with κ > 0 an absolute constant.
Then the estimator \hat{s}^{(\hat{K},\hat{J},\hat{R})}, with

(\hat{K}, \hat{J}, \hat{R}) = \operatorname*{argmin}_{(K,J,R) \in K \times J^L \times R} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log\big( \hat{s}^{(K,J,R)}(y_i|x_i) \big) + \operatorname{pen}(K,J,R) \right\},   (21)

satisfies, for all ρ ∈ (0,1),

E\left[ JKL_\rho^{\otimes n}\big( s^*, \hat{s}^{(\hat{K},\hat{J},\hat{R})} \big) \right] \leq C\, E\left[ \inf_{(K,J,R) \in K \times J^L \times R} \left( \inf_{t \in S_{(K,J,R)}} KL^{\otimes n}(s^*, t) + \operatorname{pen}(K,J,R) \right) \right] + \frac{4}{n},

for a constant C > 0.
This theorem provides a theoretical penalty for which the model minimizing the penalized criterion has good estimation properties.
These two theorems are not exact oracle inequalities, because of the constant C, but this constant is controlled. The results are non-asymptotic, which allows us to use them in our high-dimensional framework. They justify the use of the slope heuristic.
Functional data
This work on clustering data in regression was developed in full generality, but also more specifically for functional data. Functional data analysis has grown considerably, notably thanks to recent technical progress that allows data to be recorded on finer and finer grids. For a general treatment of this type of data, see for instance the book by Ramsay and Silverman [RS05] or the one by Ferraty and Vieu [FV06].
In this thesis, we choose to project the observed functions onto an orthonormal basis. Compared with a multivariate analysis of the discretized signal, this approach takes the functional aspect into account and, if the basis is well chosen, summarizes the signal in a few coefficients.
More precisely, we choose to project the functions under study onto wavelet bases. The signals are decomposed hierarchically in the time-scale domain. A real-valued function can then be described by an approximation of this function together with a set of details. For a general study of wavelets, see Daubechies [Dau92] or Mallat [Mal99].
Note that, by functional data, in our framework of mixture regression models, we mean either functional regressors and a vector response (typically, spectrometric data analysis, such as the classical Tecator dataset in which the fat content of a piece of meat is explained by its spectrometric curve), or vector regressors and a functional response, or functional regressors and a functional response (such as the regression of the electricity consumption of one day on that of the previous day).
Consider a sample of signals (f_i)_{1≤i≤n} observed on a time grid {t_1,…,t_T}. One can carry out the analysis on the sample (f_i(t_j))_{1≤i≤n, 1≤j≤T} (these observations being either the regressors, the response, or both, depending on the nature of the data), but one can also work with the projections of the sample onto an orthonormal basis B = {α_j}_{j∈N^*}. In that case, there exist (b_j)_{j∈N^*} such that, for every f and every t,

f(t) = \sum_{j=1}^{\infty} b_j \alpha_j(t).

One can choose a wavelet basis B = \{\phi, \psi_{l,h}\}_{l \geq 0,\ 0 \leq h \leq 2^l - 1}, where
— ψ is a real wavelet, satisfying ψ ∈ L¹ ∩ L², tψ ∈ L¹, and ∫_R ψ(t) dt = 0;
— for every t, ψ_{l,h}(t) = 2^{l/2} ψ(2^l t − h) for (l,h) ∈ Z²;
— φ is a scaling function associated with ψ;
— for every t, φ_{l,h}(t) = 2^{l/2} φ(2^l t − h) for (l,h) ∈ Z².
One can then write, for every f and every t,

f(t) = \sum_{h \in \mathbb{Z}} \beta_{L,h}(f)\, \phi_{L,h}(t) + \sum_{l \geq L} \sum_{h \in \mathbb{Z}} d_{l,h}(f)\, \psi_{l,h}(t),

where

\beta_{l,h}(f) = \langle f, \phi_{l,h} \rangle \quad \text{for all } (l,h) \in \mathbb{Z}^2, \qquad d_{l,h}(f) = \langle f, \psi_{l,h} \rangle \quad \text{for all } (l,h) \in \mathbb{Z}^2.

Then, instead of working with the sample (f_i)_{1≤i≤n}, one can work with the sample (x_i)_{1≤i≤n} = (β_{L,h}(f_i), (d_{l,h}(f_i))_{l≥L, h∈Z})_{1≤i≤n}. Since the basis is orthonormal, assuming that the (f_i)_{1≤i≤n} are Gaussian amounts to assuming that the (x_i)_{1≤i≤n} are Gaussian.
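As a practical illustration of this projection step (assuming the PyWavelets library as one possible implementation, whereas the text points to [MMOP07] for a Matlab implementation), a minimal sketch could look as follows:

```python
import numpy as np
import pywt

# toy sample: n signals observed on a dyadic time grid
rng = np.random.default_rng(1)
n, T = 5, 128
t = np.linspace(0.0, 1.0, T)
signals = np.array([np.sin(2 * np.pi * (i + 1) * t) + 0.1 * rng.normal(size=T)
                    for i in range(n)])

# decompose each signal: one approximation block + detail blocks
wavelet, level = "db4", 3
coeffs = [pywt.wavedec(f, wavelet, level=level) for f in signals]

# flatten (approximation, details) into one coefficient vector per signal;
# these vectors play the role of the x_i (or y_i) in the mixture regression
x = np.array([np.concatenate(c) for c in coeffs])
print(x.shape)  # (n, number of wavelet coefficients)

# the signal is recovered (up to numerical error) from its coefficients
f0_rec = pywt.waverec(coeffs[0], wavelet)
print(np.max(np.abs(f0_rec[:T] - signals[0])))
```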
From a practical standpoint, the decomposition of a signal on a wavelet basis is very efficient; see Mallat [Mal99] for a detailed description of wavelets and their use. The main interest is the time-scale decomposition of the signal, which makes the coefficients interpretable. Moreover, if the basis is well chosen, a signal can be summarized by a few coefficients, which reduces the dimension of the problem. From a more applied point of view, see Misiti and coauthors [MMOP07] for the implementation of the decomposition of a signal on a wavelet basis.
The main application of this thesis is the clustering of electricity consumers with a view to forecasting the aggregated consumption. If we forecast the consumption of each consumer and sum these forecasts, we sum the forecasting errors, so the total error may be large. If we forecast the total consumption directly, we risk errors by not accounting enough for the variations of each individual consumption. This explains the need for clustering, and the clustering in regression is done with prediction in mind. Of course, the forecasting of electricity consumption can be improved with much more specialized models. The goal of this procedure is to group together, in a preliminary step, consumers who behave similarly from one day to the next; these groups are then built with prediction in mind.
Outline of the thesis
This thesis is mainly centered on mixture regression models and on clustering problems for high-dimensional regression data. It is divided into 5 chapters, which can all be read independently.
The first chapter is devoted to the study of the main model. We describe the Gaussian mixture regression model, in which the response and the regressors are multivariate. We propose several approaches to estimate the parameters of this model, in particular in high dimension (for the response and for the regressors). In this framework, we define several estimators of the unknown parameters: an extension of the Lasso estimator, the maximum likelihood estimator, and the maximum likelihood estimator under a low-rank constraint. In each case, we sought to optimize the definitions of the estimators with algorithms in mind. We describe precisely the two procedures proposed in this thesis to cluster data in a high-dimensional regression framework and to estimate the underlying model. This is a methodological part, which describes precisely how our procedures work and explains how to implement them numerically. Numerical illustrations are provided to validate the use of our procedures in practice. For this we use simulated data, where the high-dimensional, functional and clustering aspects are highlighted, and also benchmark data, where the (unknown) true density no longer belongs to the model collection under consideration.
In a second chapter, we obtain a theoretical result for the Lasso estimator in Gaussian mixture regression models, viewed as an ℓ1-regularizer. Note that the penalty here differs from that of Chapter 1, this estimator being a direct extension of the Lasso estimator introduced by Tibshirani for the linear model. We establish an ℓ1-oracle inequality that compares the prediction risk of the Lasso estimator with the ℓ1-oracle. The important point of this oracle inequality is that, unlike the usual results on the Lasso estimator, we do not need any assumption on the non-collinearity of the variables. In return, the bound on the regularization parameter is not optimal, in the sense that optimality results have been proved for a smaller bound than the one we obtain, but under more restrictive assumptions. Note that all the constants are explicit, even though the optimality of these quantities is not guaranteed.
In Chapters 3 and 4, we propose a theoretical study of our clustering procedures. We theoretically justify the model selection step by establishing an oracle inequality in each case (corresponding respectively to the oracle inequality for the Lasso-MLE procedure and the oracle inequality for the Lasso-Rank procedure). First, we obtained a general model selection theorem that allows one to choose a model within a random subcollection, in a regression framework. This result, proved using concentration inequalities and metric entropy controls, generalizes an existing result to a random subcollection of models. This improvement allows us to obtain an oracle inequality for each of our procedures: indeed, we consider a random subcollection of models, described by the regularization path of the Lasso estimator, and this randomness requires some care in the concentration inequalities. This result provides a minimal penalty shape guaranteeing that the penalized maximum likelihood estimator is close to the ℓ0-oracle. By applying such a penalty in our procedures, we are guaranteed to obtain an estimator with a low prediction risk. The main assumption we make to obtain this result is the boundedness of the parameters of the mixture model. Note that the penalty is not proportional to the dimension: there is an additional logarithmic term, and one may wonder whether this term is necessary. We also illustrate this step, in each of these chapters, on simulated and benchmark datasets. It is important to stress that these results, both theoretical and practical, are possible because we re-estimate the parameters by the maximum likelihood estimator, restricted to the relevant variables, so that the high-dimensional issue no longer arises.
In Chapter 5, we focus on a real dataset. We apply the Lasso-MLE procedure for clustering high-dimensional regression data to understand how to cluster electricity consumers, with the aim of improving forecasting. This work was carried out in collaboration with Yannig Goude and Jean-Michel Poggi. The dataset used is a public Irish dataset of individual electricity consumptions recorded over one year. We also have access to explanatory data, such as the temperature, and to personal data for each consumer. We used the Lasso-MLE procedure in three different ways. A simple problem, which allowed us to calibrate the method, is to consider the consumption aggregated over individuals and to cluster the day-to-day transitions. The data are then fairly stable and the results interpretable (for instance, one expects transitions between weekdays to be clustered together). The second scheme is to cluster the consumers on their mean consumption; in order not to lose the temporal aspect, we considered mean days. Finally, to complete the analysis, we clustered the consumers on their behavior over two fixed days. The main difficulty of this scheme is the instability of the data. Nevertheless, the analysis of the results, through criteria that are classical in electricity consumption, or thanks to the explanatory variables available with this dataset, justifies the interest of our method for this dataset.
Throughout this manuscript, we thus illustrate the use of Gaussian mixture regression models, from a methodological point of view, implemented from a practical point of view, and justified from a theoretical point of view.
Perspectives
To further explore the results of our methods on real data, one could use a forecasting model within each cluster. The idea would then be to compare the forecast obtained in a more classical way with the aggregated forecast obtained using our clustering.
From a methodological point of view, variants of our procedures could be developed. For instance, one could consider relaxing the independence assumption between variables induced by the diagonal covariance matrix. The covariance matrix would then have to be assumed sparse, in order to reduce the underlying dimension, so that possibly correlated variables could be handled.
One could also improve the model selection criterion by orienting it more towards clustering. For instance, the ICL criterion, introduced in [BCG00] and developed in Baudry's thesis [Bau09], takes this clustering objective into account through the entropy.
From a theoretical point of view, other results could be considered.
The oracle inequalities we obtain give a minimal penalty leading to good results, but one could want to prove that its order of magnitude is the right one, by means of a minimax bound.
One could also be interested in confidence intervals. The theorem of van de Geer et al. [vdGBRD14], valid for convex losses, can be generalized to our setting, and one then obtains quite easily a confidence interval for the regression matrix. However, with prediction in mind, it could be more interesting to obtain a confidence interval for the response, which is a much harder problem.
With clustering in mind, one could also want to obtain results similar to oracle inequalities for a criterion other than the Kullback-Leibler divergence, more oriented towards clustering.
Notations
In this thesis, unless otherwise stated, random variables are denoted by capital letters, observations by lower-case letters, and observation vectors by bold letters. For a matrix A, we denote by [A]_{i,.} its ith row, by [A]_{.,j} its jth column, and by [A]_{i,j} its coefficient indexed by (i,j). For a vector B, we denote by [B]_j its jth component.
Usual notations
ᶜA : complement of the set A
ᵗA : transpose of A
Tr(A) : trace of a square matrix A
E(X) : expectation of the random variable X
Var(X) : variance of the random variable X
N : Gaussian distribution
N_q : multivariate Gaussian distribution of size q
χ² : chi-squared distribution
B : orthonormal basis
1_A : indicator function of a set A
⌊a⌋ : floor of a
a ∨ b : maximum of a and b
a ∧ b : minimum of a and b
f ≍ g : f is asymptotically equivalent to g
∆ : discriminant of a quadratic polynomial
I_q : identity matrix of size q
< a, b > : scalar product of a and b
P({1, …, p}) : set of subsets of {1, …, p}
S_q^{++} : set of positive definite matrices of size q
Acronyms
AIC : Akaike Information Criterion
ARI : Adjusted Rand Index
BIC : Bayesian Information Criterion
EM : Expectation-Maximization (algorithm)
EMV : Estimateur du Maximum de Vraisemblance (maximum likelihood estimator)
FR : False Relevant (variables)
LMLE : Lasso-Maximum Likelihood Estimator procedure
LR : Lasso-Rank procedure
MAPE : Mean Absolute Percentage Error
MLE : Maximum Likelihood Estimator
REC : Restricted Eigenvalue Condition
SNR : Signal-to-Noise Ratio
TR : True Relevant (variables)
Variables and observations
X : regressors, random variable of size p
x_i : ith observation of the variable X
x : vector of the observations
Y : response, random variable of size q
y_i : ith observation of the variable Y
y : vector of the observations
Z : random variable for the cluster assignment, vector of size K; Z_k = 1 if the variable Y, conditionally to X, belongs to the cluster k, 0 otherwise
z_i : observation of the variable Z for the observation y_i conditionally to X_i = x_i
F : functional regressor, random functional variable
f_i : ith observation of the variable F
G : functional response, random functional variable
g_i : ith observation of the variable G
ỹ : reparametrization of the observations y, array of size n × K × q
x̃ : reparametrization of the observations x, array of size n × K × p
ǫ : Gaussian noise variable
ǫ_i : ith observation of the variable ǫ
ŷ_i : prediction of the value of y_i from the observation x_i
Parameters
β : conditional mean, of size q × p × K
σ : variance, in univariate models, of size K
Σ : covariance matrix, in multivariate models, of size q × q × K
Φ : reparametrized conditional mean, of size q × p × K
P : reparametrized covariance matrix, of size q × q × K
π : proportion coefficients, of size K
τ̂ : a posteriori probabilities, matrix of size n × K
ξ : vector of all parameters before reparametrization: (π, β, Σ)
θ : vector of all parameters after reparametrization: (π, Φ, P)
λ : regularization parameter for the Lasso estimator
λ_{k,j,z} : Lasso regularization parameter cancelling the coefficient [Φ_k]_{z,j} in mixture models
Ω : parameter for the assumption K
w_m : weights for the assumption K
τ_{i,k}(θ) : probability for the observation i to belong to the cluster k, according to the parameter θ
κ : parameter for the slope heuristic
R : vector of ranks of the conditional means, of size K
R_k : rank of the conditional mean Φ_k in component k
ξ_0 : true parameter (Chapter 2)
Estimators
β̂ : estimator of the conditional mean in the linear model
σ̂² : estimator of the variance in the linear model
Σ̂ : estimator of the covariance matrix in the multivariate linear model
θ̂^Lasso(λ) : estimator of θ by the Lasso estimator, with regularization parameter λ
β̂^Lasso(λ) : estimator of β by the Lasso estimator, with regularization parameter λ
β̃^Lasso(A) : estimator of β by the Lasso estimator, with regularization parameter A, according to the dual formulation
ξ̂_K^Lasso(λ) : estimator of ξ_K by the Lasso estimator, with regularization parameter λ
ξ̂_K^EMV : estimator of ξ_K by the maximum likelihood estimator
ξ̂_J^EMV : estimator of ξ_K by the maximum likelihood estimator, restricted to J for the relevant variables
ξ̂_J^Rank : estimator of ξ_K by the low-rank estimator, restricted to J for the relevant variables
Σ̂_x : Gram matrix, according to the sample x
θ̂^Group-Lasso(λ) : estimator of θ by the Group-Lasso estimator, with regularization parameter λ
β̂^LR(λ) : estimator of β by the low-rank estimator, restricted to the variables detected by β̂^Lasso(λ)
P̂^LR(λ) : estimator of P by the low-rank estimator, restricted to the variables detected by β̂^Lasso(λ)
Sets of densities
H_(K,J) : set of conditional densities, with parameters θ, K clusters and J for relevant variables set
Ȟ_(K,J,R) : set of conditional densities, with parameters θ, K clusters, J for relevant variables set and vector of ranks R
S_(K,J) : set of conditional densities, with parameters ξ, K clusters and J for relevant variables set
S_(K,J,R) : set of conditional densities, with parameters ξ, K clusters, J for relevant variables set and vector of ranks R
S^B_(K,J) : subset of S_(K,J) with bounded parameters
S^B_(K,J,R) : subset of S_(K,J,R) with bounded parameters
Dimensions
p : number of regressors
q : response size
K : number of components
n : sample size
D_m : dimension of the model S_m
D_(K,J) : dimension of the model S_(K,J)
D_(K,J,R) : dimension of the model S_(K,J,R)
Sets of parameters
Θ_K : set of θ with K components
Θ_(K,J) : set of θ with K components and J for relevant variables set
Θ_(K,J,R) : set of θ with K components, J for relevant variables set, and R for the vector of ranks of the conditional means
Ξ_K : set of ξ with K components
Ξ_(K,J) : set of ξ with K components and J for relevant variables set
Ξ_(K,J,R) : set of ξ with K components, J for relevant variables set, and R for the vector of ranks of the conditional means
Ξ̃_(K,J,R) : subset of Ξ_(K,J,R) with bounded parameters
K : set of possible numbers of components
J : collection of possible sets of relevant variables
J^L : collection of sets of relevant variables, determined by the Lasso estimator
J̃ : collection of sets of relevant variables, determined by the Group-Lasso estimator
R : set of possible rank vectors
S : set of densities
M : model collection indices for the model collection constructed by our procedure
M^L : random model collection indices for the model collection constructed by our procedure, according to the Lasso estimator
M̌ : random model collection indices
M̃ : model collection indices for the Group-Lasso-MLE model collection
M̃^L : random model collection indices for the Group-Lasso-MLE model collection
Π_K : simplex of proportion coefficients
T_q : upper triangular matrices, of size q
J : set of relevant variables
J_λ : set of relevant variables detected by the Lasso estimator with regularization parameter λ
J̃ : set of relevant variables detected by the Group-Lasso estimator
Ψ_(K,J,R) : set of conditional means, with J for relevant columns and R for the vector of ranks, in a mixture with K components
Ψ̃_(K,J,R) : subset of Ψ_(K,J,R) with bounded parameters
F_J : set of conditional Gaussian densities, with bounded conditional means and bounded covariance coefficients, and relevant variables set defined by J
F_(J,R) : set of conditional Gaussian densities, with relevant variables set defined by J, vector of ranks defined by R, bounded covariance coefficients and bounded singular values
G_K : grid of regularization parameters, for models with K clusters
Functions
KL : Kullback-Leibler divergence
KL_n : Kullback-Leibler divergence for fixed covariates
KL^{⊗n} : tensorized Kullback-Leibler divergence
d_H : Hellinger distance
d_H^{⊗n} : tensorized Hellinger distance
JKL_ρ : Jensen-Kullback-Leibler divergence, with parameter ρ ∈ (0,1)
JKL_ρ^{⊗n} : tensorized Jensen-Kullback-Leibler divergence, with parameter ρ ∈ (0,1)
s_ξ : conditional density, with parameter ξ
s^K_ξ : conditional density, with parameter ξ and K components
s^{(K,J)}_ξ : conditional density, with parameter ξ, K components and J for relevant variables set
s^{(K,J,R)}_ξ : conditional density, with parameter ξ, K components, J for relevant variables set and R for the vector of ranks
s^* : true density
s^O : oracle conditional density
s_m : density for the model m
l : log-likelihood function
l_λ : penalized log-likelihood function for the Lasso estimator
l̃_λ : penalized log-likelihood function for the Group-Lasso estimator
l_c : complete log-likelihood function
γ : contrast function
l : loss function
γ_n : empirical contrast function
pen : penalty
l : lower function in a bracket
u : upper function in a bracket
ϕ : Gaussian density
ψ : wavelet function
φ : scaling function in the wavelet decomposition
ξ(x) : parameters in mixture regressions, defined from the regressors x
m(A) : smallest eigenvalue of the matrix A
M(A) : largest eigenvalue of the matrix A
H_{[.]}(ǫ, S, ||.||) : bracketing entropy of a set S, with brackets of ||.||-width smaller than ǫ
Indices
j : varying from 1 to p
z : varying from 1 to q
k : varying from 1 to K
i : varying from 1 to n
m : varying in M
m_O : index of the oracle
m̂ : selected index
Chapter 1
Two procedures
Contents
1.1  Introduction
1.2  Gaussian mixture regression models
     1.2.1  Gaussian mixture regression
     1.2.2  Clustering with Gaussian mixture regression
     1.2.3  EM algorithm
     1.2.4  The model collection
1.3  Two procedures
     1.3.1  Lasso-MLE procedure
     1.3.2  Lasso-Rank procedure
1.4  Illustrative example
     1.4.1  The model
     1.4.2  Sparsity and model selection
     1.4.3  Assessment
1.5  Functional datasets
     1.5.1  Functional regression model
     1.5.2  Two procedures to deal with functional datasets
            Projection onto a wavelet basis
            Our procedures
     1.5.3  Numerical experiments
            Simulated functional data
            Electricity dataset
            Tecator dataset
1.6  Conclusion
1.7  Appendices
     1.7.1  EM algorithms
            EM algorithm for the Lasso estimator
            EM algorithm for the rank procedure
     1.7.2  Group-Lasso MLE and Group-Lasso Rank procedures
            Context - definitions
            Group-Lasso-MLE procedure
            Group-Lasso-Rank procedure
In this chapter, we describe two procedures to cluster data in a regression context. Following Maugis and Meynet [MMR12], we propose two global model selection procedures to simultaneously select the number of clusters and the set of variables relevant for the clustering. They are especially suited to high-dimension, low sample size settings.
We take advantage of regression datasets to capture the relationship between the regressors and the responses, and cluster the data according to this relationship. This idea can be interesting for prediction, because observations sharing the same relationship are assigned to the same cluster.
In addition, we focus on functional datasets, for which the projection onto a wavelet basis leads to a sparse representation. We also illustrate these procedures on simulated and benchmark datasets.
1.1 Introduction
Owing to the increasing number of high-dimensional datasets, regression models for a multivariate response and high-dimensional predictors have become important tools.
The goal of this chapter is to describe two procedures to cluster regression datasets. We focus on model-based clustering: each cluster is represented by a parametric conditional distribution, the entire dataset being modeled by a mixture of these distributions. This provides a rigorous statistical framework and allows one to understand the role of each variable in the clustering process. The model considered is then the following: for i ∈ {1,…,n}, if (y_i, x_i) ∈ R^q × R^p belongs to component k, there exist an unknown q × p matrix of coefficients β_k and an unknown covariance matrix Σ_k such that

y_i = \beta_k x_i + \epsilon_i   (1.1)

where ǫ_i ∼ N_q(0, Σ_k). We will work with high-dimensional datasets, that is to say q × p could be larger than the sample size n, so we have to reduce the dimension. Two ways are considered here: coefficient sparsity and rank sparsity.
We can work with a sparse model if the matrix β can be estimated by a matrix with few nonzero coefficients. The well-known Lasso estimator, introduced by Tibshirani in 1996 in [Tib96] for linear models, is the solution chosen here. Indeed, the Lasso estimator is used for variable selection; see for example Meinshausen and Bühlmann [MB10] for stability selection results, and the book of Bühlmann and van de Geer [BvdG11] for an overview of the Lasso estimator.
If we look for rank sparsity for β, we have to assume that many regressors are linearly dependent. This approach dates back to the 1950s and was initiated by Anderson in [And51] for the linear model. Izenman, in [Ize75], introduced the term reduced-rank regression for this class of models. A number of important works followed; see for example Giraud [Gir11] and Bunea et al. [BSW12] for recent results. Nevertheless, the linear regression model used in those methods is appropriate for modeling the relationship between response and predictors when this relationship is the same for all observations, and it is inadequate when the regression coefficients differ across subgroups of the observations; we therefore consider mixture models here.
An important example of high-dimensional datasets is functional datasets (functional predictors and/or a functional response). They have been studied, for example, in the book of Ramsay and Silverman [RS05]. A lot of recent work has been done on regression models for functional datasets: see for example the article of Ciarleglio [CO14], which deals with a scalar response and functional regressors. One way to handle functional datasets is to project them onto a well-suited basis. We can cite for example the Fourier basis, splines, or, the one we consider here, wavelet bases. Indeed, wavelets are particularly well suited to handle many types of functional data, because they represent global and local attributes of functions and can deal with discontinuities. Moreover, in connection with the sparsity previously mentioned, a large class of functions can be well represented with few non-zero coefficients, for a suitable wavelet.
We propose here two procedures to cluster high-dimensional data or data described by a functional variable, explained by high-dimensional predictors or by predictor variables arising from sampling continuous curves. Note that we estimate the number of components, the parameters of each model, and the proportions. We assume we do not have any knowledge about the model, except that it can be well approximated by a sparse Gaussian mixture regression model. The high-dimensional problem is solved by using variable selection to detect the relevant variables. Since the structure of interest may often be contained in a subset of the available variables, and many attributes may be useless or even harmful for detecting a reasonable clustering structure, it is important to select the relevant clustering variables. Moreover, removing irrelevant variables yields simpler models and can largely enhance interpretability.
Our two procedures are mainly based on three recent works. Firstly, we cite the article of Städler et al. [SBG10], which studies the finite mixture regression model; even if we work on a multivariate version of it, the model considered in [SBG10] is adopted here. The second, the article of Meynet and Maugis [MMR12], deals with model-based clustering in density estimation. They propose a procedure, called the Lasso-MLE procedure, which determines the number of clusters, the set of variables relevant for the clustering, and a clustering of the observations, for high-dimensional data; we extend this procedure to conditional densities. Finally, we cite the article [Gir11] of Giraud, which suggests a low-rank estimator for the linear model; to take the matrix structure into account, we adapt this approach to our mixture models.
We consider finite mixtures of Gaussian regression models. We propose two different procedures, taking the matrix structure more or less into account. Both follow the same scheme. Firstly, an ℓ1-penalized likelihood approach is considered to determine potential sets of relevant variables. Introduced by Tibshirani in [Tib96], the Lasso is used to select variables. This allows one to efficiently construct a data-driven subcollection of models with reasonable complexity, even in high-dimensional situations, with different sparsities, by varying the regularization parameter in the ℓ1-penalized likelihood function. The second step of the procedures consists in estimating the parameters in a better way than by the Lasso estimator. Then, we select a model in the collection using the slope heuristic, developed by Birgé and Massart in [BM07]. The difference between the two procedures lies in the estimation of the parameters in each model. The first one, later called the Lasso-MLE procedure, uses the maximum likelihood estimator rather than the ℓ1-penalized maximum likelihood estimator; it avoids estimation problems due to the ℓ1-penalization shrinkage. The second one, called the Lasso-Rank procedure, deals with low-rank estimation: for each model in the collection, we construct a subcollection of models with conditional means estimated by matrices of various low ranks. It leads to sparsity both for the coefficients and for the rank, and considers the conditional mean with its matrix structure.
The chapter is organized as follows. Section 1.2 deals with Gaussian mixture regression models and describes the model collection that we consider. In Section 1.3, we describe the two procedures that we propose to solve the problem of clustering high-dimensional regression data. Section 1.4 presents an illustrative example, to highlight each choice involved in the two procedures. Section 1.5 addresses the functional data case, with a description of the projection proposed to convert these functions into coefficient data; we end this section with a study of simulated and benchmark data. Finally, a conclusion section ends this chapter.
1.2 Gaussian mixture regression models
We have to construct a statistical framework for the observations. Because we estimate the conditional density by a multivariate Gaussian density in each cluster, the model used is a finite Gaussian mixture regression model. Städler et al. [SBG10] describe this model when X is multidimensional and Y is scalar; we generalize it to the multivariate response case in this section. Moreover, we describe a collection of Gaussian mixture regression models with several sparsities.
1.2.1 Gaussian mixture regression
We observe n independent couples (x_i, y_i)_{1≤i≤n}, realizations of random variables (X_i, Y_i)_{1≤i≤n}, with Y_i ∈ R^q and X_i ∈ R^p for all i ∈ {1,…,n}, coming from a probability distribution with unknown conditional density denoted by s^*. We want to perform model-based clustering, so we assume that the data can be well approximated by a mixture conditional density s(y|x) = \sum_{k=1}^{K} \pi_k s_k(y|x), with K unknown. To get a Gaussian mixture regression model, we suppose that, if Y conditionally to X belongs to the cluster k,

Y = \beta_k X + \epsilon,

where ǫ ∼ N_q(0, Σ_k). We thus assume that s_k is a multivariate Gaussian conditional density. The random response variable Y ∈ R^q then depends on a set of explanatory variables, written X ∈ R^p, through a regression-type model. By considering a multivariate response, we can work with more general datasets: for example, we can explain a functional response by a functional regressor, as done with the electricity dataset in Section 1.5.3. The following assumptions are in order, for a mixture of K Gaussian regression models.
— the variables Y_i conditionally to X_i are independent, for all i ∈ {1,…,n};
— we let Y_i | X_i = x_i ∼ s^K_ξ(y|x_i)dy, with

s^K_\xi(y|x) = \sum_{k=1}^{K} \frac{\pi_k}{(2\pi)^{q/2} \det(\Sigma_k)^{1/2}} \exp\left( - \frac{(y - \beta_k x)^t \Sigma_k^{-1} (y - \beta_k x)}{2} \right),

\xi = (\pi_1, \ldots, \pi_K, \beta_1, \ldots, \beta_K, \Sigma_1, \ldots, \Sigma_K) \in \Xi_K = \Pi_K \times \left( \mathbb{R}^{q \times p} \right)^K \times \left( S_q^{++} \right)^K,

\Pi_K = \left\{ (\pi_1, \ldots, \pi_K);\ \pi_k > 0 \text{ for } k \in \{1, \ldots, K\} \text{ and } \sum_{k=1}^{K} \pi_k = 1 \right\},

where S_q^{++} is the set of symmetric positive definite matrices on R^q.
Then, we want to estimate the conditional density function s^K_ξ from the observations. For all k ∈ {1,…,K}, β_k is the matrix of regression coefficients, and Σ_k is the covariance matrix in the mixture component k. The π_k are the mixture proportions. Actually, for all k ∈ {1,…,K} and all z ∈ {1,…,q}, [\beta_k x]_z = \sum_{j=1}^{p} [\beta_k]_{z,j} x_j is the zth component of the conditional mean of the mixture component k for the conditional density s^K_ξ(.|x).
In order to have a scale-invariant maximum likelihood estimator and a convex optimization problem, we reparametrize the model described above, generalizing the reparametrization described in [SBG10].
For all k ∈ {1,…,K}, we define Φ_k = P_k β_k, in which ᵗP_k P_k = Σ_k^{-1} (P_k is given by the Cholesky decomposition of the positive definite matrix Σ_k^{-1}). Our hypotheses can now be rewritten:
— the variables Y_i conditionally to X_i are independent, for all i ∈ {1,…,n};
— we let Y_i | X_i = x_i ∼ h^K_θ(y|x_i)dy, for i ∈ {1,…,n}, with

h^K_\theta(y|x) = \sum_{k=1}^{K} \frac{\pi_k \det(P_k)}{(2\pi)^{q/2}} \exp\left( - \frac{(P_k y - \Phi_k x)^t (P_k y - \Phi_k x)}{2} \right),

\theta = (\pi_1, \ldots, \pi_K, \Phi_1, \ldots, \Phi_K, P_1, \ldots, P_K) \in \Theta_K = \Pi_K \times \left( \mathbb{R}^{q \times p} \right)^K \times (T_q)^K,

\Pi_K = \left\{ (\pi_1, \ldots, \pi_K);\ \pi_k > 0 \text{ for } k \in \{1, \ldots, K\} \text{ and } \sum_{k=1}^{K} \pi_k = 1 \right\},

where T_q is the set of lower triangular matrices with non-negative diagonal entries.
The log-likelihood of this model, according to the sample (x_i, y_i)_{1≤i≤n}, is

l(\theta, \mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \frac{\pi_k \det(P_k)}{(2\pi)^{q/2}} \exp\left( - \frac{(P_k y_i - \Phi_k x_i)^t (P_k y_i - \Phi_k x_i)}{2} \right) \right);

and the maximum log-likelihood estimator (later denoted by MLE) is

\hat{\theta}^{MLE} := \operatorname*{argmin}_{\theta \in \Theta_K} \left\{ - \frac{1}{n} l(\theta, \mathbf{x}, \mathbf{y}) \right\}.
This estimator is scale-invariant, and the optimization is convex in each cluster.
Since we deal with the p × q ≫ n case, this estimator has to be regularized to obtain accurate estimates. As a result, we propose the ℓ1-norm penalized MLE

\hat{\theta}^{Lasso}(\lambda) := \operatorname*{argmin}_{\theta \in \Theta_K} \left\{ - \frac{1}{n} l_\lambda(\theta, \mathbf{x}, \mathbf{y}) \right\};   (1.2)

where

- \frac{1}{n} l_\lambda(\theta, \mathbf{x}, \mathbf{y}) = - \frac{1}{n} l(\theta, \mathbf{x}, \mathbf{y}) + \lambda \sum_{k=1}^{K} \pi_k \|\Phi_k\|_1,

with \|\Phi_k\|_1 = \sum_{j=1}^{p} \sum_{z=1}^{q} |[\Phi_k]_{z,j}| and λ > 0 to be specified. This estimator is not the usual ℓ1-estimator, called the Lasso estimator, introduced by Tibshirani in [Tib96]: it penalizes the ℓ1-norm of the coefficients and small variances simultaneously, which has some close relations to the Bayesian Lasso (see Park and Casella [PC08]). Moreover, the reparametrization allows us to consider non-standardized data.
Notice that we restrict ourselves in this chapter to diagonal covariance matrices which depend on the cluster, that is to say, for all k ∈ {1,…,K}, Σ_k = Diag([Σ_k]_{1,1}, …, [Σ_k]_{q,q}). With the reparametrization described above, the restriction becomes, for all k ∈ {1,…,K}, P_k = Diag([P_k]_{1,1}, …, [P_k]_{q,q}). In that case, we assume that the response variables are not correlated, which is a strong assumption, but it allows us to reduce the dimension easily.
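As a small illustrative sketch (assuming the diagonal covariance matrices of this chapter; function names are hypothetical), the reparametrization Φ_k = P_k β_k with ᵗP_k P_k = Σ_k^{-1} can be computed and checked as follows:

```python
import numpy as np

def reparametrize(beta_k, Sigma_k):
    """Compute (Phi_k, P_k) from (beta_k, Sigma_k) with P_k^T P_k = Sigma_k^{-1}.
    For a diagonal Sigma_k (the case considered in this chapter),
    P_k is simply diag(1 / sqrt([Sigma_k]_{z,z}))."""
    P_k = np.diag(1.0 / np.sqrt(np.diag(Sigma_k)))
    Phi_k = P_k @ beta_k
    return Phi_k, P_k

# toy check of the identity P_k^T P_k = Sigma_k^{-1}
rng = np.random.default_rng(0)
q, p = 3, 4
beta_k = rng.normal(size=(q, p))
Sigma_k = np.diag(rng.uniform(0.5, 2.0, size=q))
Phi_k, P_k = reparametrize(beta_k, Sigma_k)
print(np.allclose(P_k.T @ P_k, np.linalg.inv(Sigma_k)))  # True
```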
1.2.2 Clustering with Gaussian mixture regression
Suppose we know the number of clusters, denoted by K, and assume that we have, from the observations, an estimator θ̂ such that h^K_θ̂ approximates well the unknown conditional density s^*. We then want to group the data into clusters of observations which seem similar. From a different point of view, we can look at this problem as a missing data problem. Indeed, the complete data are ((x_1,y_1,z_1),…,(x_n,y_n,z_n)), in which the latent random variables Z = (Z_1,…,Z_n), Z_i = ([Z_i]_1,…,[Z_i]_K) for i ∈ {1,…,n}, are defined by

[Z_i]_k = \begin{cases} 1 & \text{if } Y_i \text{ arises from the } k\text{th subpopulation;} \\ 0 & \text{otherwise.} \end{cases}

Thanks to the estimator θ̂, we can use the Maximum A Posteriori principle (later denoted by MAP principle) to cluster the data. Specifically, for all i ∈ {1,…,n} and all k ∈ {1,…,K}, consider

\tau_{i,k}(\theta) = \frac{ \pi_k \det(P_k) \exp\left( -\frac{1}{2} (P_k y_i - \Phi_k x_i)^t (P_k y_i - \Phi_k x_i) \right) }{ \sum_{r=1}^{K} \pi_r \det(P_r) \exp\left( -\frac{1}{2} (P_r y_i - \Phi_r x_i)^t (P_r y_i - \Phi_r x_i) \right) },

the posterior probability of y_i coming from component k, where θ = (π, Φ, P). The data are then partitioned by the following rule:

[Z_i]_k = \begin{cases} 1 & \text{if } \tau_{i,k}(\hat\theta) > \tau_{i,l}(\hat\theta) \text{ for all } l \neq k; \\ 0 & \text{otherwise.} \end{cases}
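As an illustration (hypothetical helper names; a log-sum-exp computation is used for numerical stability), the posterior probabilities τ_{i,k}(θ) and the MAP rule can be sketched as:

```python
import numpy as np
from scipy.special import logsumexp

def posterior_probabilities(x, y, pi, Phi, P):
    """Posterior probabilities tau_{i,k}(theta) under the reparametrized
    Gaussian mixture regression model. The common factor (2*pi)^{-q/2}
    cancels in the ratio. Shapes: x (n,p), y (n,q), pi (K,), Phi (K,q,p), P (K,q,q)."""
    n, K = x.shape[0], pi.shape[0]
    log_num = np.empty((n, K))
    for k in range(K):
        resid = y @ P[k].T - x @ Phi[k].T          # rows: P_k y_i - Phi_k x_i
        _, logdet = np.linalg.slogdet(P[k])
        log_num[:, k] = np.log(pi[k]) + logdet - 0.5 * np.sum(resid ** 2, axis=1)
    return np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))

def map_clustering(x, y, pi, Phi, P):
    """MAP rule: assign each observation to the component maximizing tau_{i,k}."""
    return np.argmax(posterior_probabilities(x, y, pi, Phi, P), axis=1)
```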
1.2.3 EM algorithm
From an algorithmic point of view, we use a generalization of the EM algorithm to compute the MLE and the ℓ1-norm penalized MLE. The EM algorithm was introduced by Dempster et al. in [DLR77] to approximate the maximum likelihood estimator of the parameters of a mixture model. It is an iterative process based, at each iteration (ite) ∈ N^*, on the minimization of the expectation of the empirical contrast for the complete data, conditionally to the observations and to the current estimate θ^{(ite)} of the parameter. Thanks to the Karush-Kuhn-Tucker conditions, we can extend the second step to compute the maximum likelihood estimators, penalized or not, under a rank constraint or not, as was done in the scalar case in [SBG10]. All these computations are detailed in Appendix 1.7.1. We thereby obtain the following updating formulae for the Lasso estimator defined by (1.2). Remark that this includes the maximum likelihood estimator (taking λ = 0), and that the rank constraint can easily be handled through a singular value decomposition.
\pi_k^{(ite+1)} = \pi_k^{(ite)} + t^{(ite)} \left( \frac{n_k^{(ite)}}{n} - \pi_k^{(ite)} \right);   (1.3)

[P_k]_{z,z}^{(ite+1)} = \frac{ n_k^{(ite)} \big\langle [\tilde{y}]_{k,z}^{(ite)}, [\Phi_k]_{z,\cdot}^{(ite)} [\tilde{x}]_{k,\cdot}^{(ite)} \big\rangle + \sqrt{\Delta} }{ 2\, n_k^{(ite)} \big\| [\tilde{y}]_{k,z}^{(ite)} \big\|_2^2 };   (1.4)

[\Phi_k]_{z,j}^{(ite+1)} =
\begin{cases}
\dfrac{ -[S_k]_{j,z}^{(ite)} + n\lambda\pi_k^{(ite)} }{ \big\| [\tilde{x}]_{k,j}^{(ite)} \big\|_2^2 } & \text{if } [S_k]_{j,z}^{(ite)} > n\lambda\pi_k^{(ite)}; \\[1ex]
-\dfrac{ [S_k]_{j,z}^{(ite)} + n\lambda\pi_k^{(ite)} }{ \big\| [\tilde{x}]_{k,j}^{(ite)} \big\|_2^2 } & \text{if } [S_k]_{j,z}^{(ite)} < -n\lambda\pi_k^{(ite)}; \\[1ex]
0 & \text{otherwise;}
\end{cases}   (1.5)

with, for j ∈ {1,…,p}, k ∈ {1,…,K}, z ∈ {1,…,q},

[S_k]_{j,z}^{(ite)} = - \sum_{i=1}^{n} [\tilde{x}_i]_{k,j}^{(ite)} [P_k]_{z,z}^{(ite)} [\tilde{y}_i]_{k,z}^{(ite)} + \sum_{i=1}^{n} [\tilde{x}_i]_{k,j}^{(ite)} \sum_{\substack{j_2=1 \\ j_2 \neq j}}^{p} [\tilde{x}_i]_{k,j_2}^{(ite)} [\Phi_k]_{z,j_2}^{(ite)};   (1.6)

n_k^{(ite)} = \sum_{i=1}^{n} \tau_{i,k}^{(ite)};

\big( [\tilde{y}_i]_{k,z}^{(ite)}, [\tilde{x}_i]_{k,j}^{(ite)} \big) = \sqrt{\tau_{i,k}^{(ite)}} \, \big( [y_i]_z, [x_i]_j \big);

\Delta = n_k^{(ite)\,2} \left( \big\langle [\tilde{y}]_{k,z}^{(ite)}, [\Phi_k]_{z,\cdot}^{(ite)} [\tilde{x}]_{k,\cdot}^{(ite)} \big\rangle^2 + 4\, n_k^{(ite)} \big\| [\tilde{y}]_{k,z}^{(ite)} \big\|_2^2 \right);

\tau_{i,k}^{(ite)} = \frac{ \pi_k^{(ite)} \det\big(P_k^{(ite)}\big) \exp\left( -\tfrac{1}{2} \big(P_k^{(ite)} y_i - \Phi_k^{(ite)} x_i\big)^t \big(P_k^{(ite)} y_i - \Phi_k^{(ite)} x_i\big) \right) }{ \sum_{r=1}^{K} \pi_r^{(ite)} \det\big(P_r^{(ite)}\big) \exp\left( -\tfrac{1}{2} \big(P_r^{(ite)} y_i - \Phi_r^{(ite)} x_i\big)^t \big(P_r^{(ite)} y_i - \Phi_r^{(ite)} x_i\big) \right) };   (1.7)

and t^{(ite)} ∈ (0,1] is the largest value in the grid {δ^l, l ∈ N}, 0 < δ < 1, such that the objective function is non-increasing.
In our case, the EM algorithm alternates between the E-step, which corresponds to the computation of (1.7), and the M-step, which corresponds to the computations of (1.3), (1.4) and (1.5).
To avoid convergence to a local maximum rather than the global one, we need to specify the initialization and the stopping rules. We initialize the clustering with the k-means algorithm on the couples (x_i, y_i)_{1≤i≤n}. From this clustering, we compute the linear regression estimators in each class. Then, we run the EM algorithm for a small number of iterations, repeat this initialization many times, and keep the one which maximizes the log-likelihood function: how the computation is started matters. Finally, to stop the algorithm, we could wait for convergence, but the EM algorithm is known to satisfy convergence criteria without having actually converged, because of local maxima. Consequently, we fix a minimum number of iterations, to avoid local maxima, and a maximum number of iterations, to ensure that the algorithm stops. Between these two bounds, we stop if both the log-likelihood and the parameters have converged (with a relative criterion), adapted from [SBG10].
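For illustration only, here is a compact sketch of one EM iteration for this model with diagonal covariance matrices, in the unpenalized case (λ = 0) and in the (β, Σ) parametrization; it shows the alternation between responsibilities and parameter updates but does not reproduce the exact update formulae (1.3)-(1.7).

```python
import numpy as np
from scipy.special import logsumexp

def em_step(x, y, pi, beta, sigma2):
    """One EM iteration for a mixture of Gaussian regressions
    y | x, Z=k ~ N(beta_k x, diag(sigma2_k)).
    Shapes: x (n,p), y (n,q), pi (K,), beta (K,q,p), sigma2 (K,q)."""
    n, q = y.shape
    K = pi.shape[0]

    # E-step: responsibilities tau_{i,k}
    log_tau = np.empty((n, K))
    for k in range(K):
        resid = y - x @ beta[k].T
        log_tau[:, k] = (np.log(pi[k])
                         - 0.5 * np.sum(np.log(2 * np.pi * sigma2[k]))
                         - 0.5 * np.sum(resid ** 2 / sigma2[k], axis=1))
    tau = np.exp(log_tau - logsumexp(log_tau, axis=1, keepdims=True))

    # M-step: proportions, weighted least squares for beta_k,
    # weighted residual variances for sigma2_k
    new_pi = tau.mean(axis=0)
    new_beta = np.empty_like(beta)
    new_sigma2 = np.empty_like(sigma2)
    for k in range(K):
        w = tau[:, k]
        xw = x * w[:, None]
        gram = x.T @ xw                                  # weighted Gram matrix
        new_beta[k] = np.linalg.solve(gram, xw.T @ y).T  # weighted least squares
        resid = y - x @ new_beta[k].T
        new_sigma2[k] = (w[:, None] * resid ** 2).sum(axis=0) / w.sum()
    return new_pi, new_beta, new_sigma2, tau
```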
1.2.4 The model collection
We want to deal with high-dimensional data, which is why we have to determine which variables are relevant for the Gaussian mixture regression clustering. Indeed, we observe a small sample and we have to estimate many coefficients, so we face an identifiability problem: the sample size n is smaller than K(pq + q + 1) − 1, the number of parameters to estimate. A way to solve this problem is to select a few variables to describe the problem. We then assume that s^* can be estimated by a sparse model.
To reduce the dimension, we want to determine which variables are useful for the clustering and which are not. This leads to the definition of an irrelevant variable.
Definition 1.2.1. A variable indexed by (z,j) ∈ {1,…,q} × {1,…,p} is irrelevant for the clustering if [Φ_1]_{z,j} = … = [Φ_K]_{z,j} = 0. A relevant variable is a variable which is not irrelevant: in at least one cluster, the corresponding coefficient is not equal to zero. We denote by J the set of relevant variables.
We denote by Φ_k^{[J]} the matrix equal to Φ_k on J and to 0 on ᶜJ, for all k ∈ {1,…,K}, and by H_{(K,J)} the model with K components and with J for relevant variables set:

H_{(K,J)} = \left\{ y \in \mathbb{R}^q \mid x \in \mathbb{R}^p \mapsto h_\theta^{(K,J)}(y|x) \right\};   (1.8)

where

h_\theta^{(K,J)}(y|x) = \sum_{k=1}^{K} \frac{\pi_k \det(P_k)}{(2\pi)^{q/2}} \exp\left( - \frac{\big(P_k y - \Phi_k^{[J]} x\big)^t \big(P_k y - \Phi_k^{[J]} x\big)}{2} \right)

and

\theta = \big(\pi_1, \ldots, \pi_K, \Phi_1^{[J]}, \ldots, \Phi_K^{[J]}, P_1, \ldots, P_K\big) \in \Theta_{(K,J)} = \Pi_K \times \left( \mathbb{R}^{q \times p} \right)^K \times \left( \mathbb{R}_+^q \right)^K;

where the notation A^{[J]} means that J is the relevant variables set for the matrix A.
We will construct a model collection, by varying the number of components K and the relevant
variables subset J.
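To make Definition 1.2.1 and the associated model dimension concrete, here is a small illustrative sketch (hypothetical helper names) extracting the relevant variables set J from the matrices Φ_1,…,Φ_K and computing D = K(|J| + q + 1) − 1, the dimension used later for model selection:

```python
import numpy as np

def relevant_set(Phi, tol=0.0):
    """Relevant variables set J: couples (z, j) such that [Phi_k]_{z,j} != 0
    for at least one component k. Phi has shape (K, q, p)."""
    nonzero = np.any(np.abs(Phi) > tol, axis=0)          # boolean (q, p) mask
    return {(z, j) for z, j in zip(*np.nonzero(nonzero))}

def model_dimension(K, J, q):
    """Dimension of the model H_(K,J): K(|J| + q + 1) - 1."""
    return K * (len(J) + q + 1) - 1

# toy example: K = 2 components, q = 3, p = 4, two relevant couples
Phi = np.zeros((2, 3, 4))
Phi[0, 0, 1] = 1.5
Phi[1, 2, 3] = -0.7
J = relevant_set(Phi)
print(J, model_dimension(2, J, q=3))   # {(0, 1), (2, 3)} and 2*(2+3+1)-1 = 11
```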
1.3 Two procedures
The goal of our procedures is, given a sample (x, y) = ((x_1,y_1),…,(x_n,y_n)) ∈ (R^p × R^q)^n, to discover the structure of the variable Y according to X. Thus, we have to estimate, according to the representation of H_{(K,J)}, the number of clusters K, the relevant variables set J, and the parameters θ. To deal with the high dimension, we take advantage of the sparsity property of the ℓ1-penalization to perform automatic variable selection when clustering high-dimensional data. Then, we can compute another estimator restricted to the relevant variables, which behaves better because the problem is no longer high-dimensional; thus, we avoid the shrinkage problems due to the Lasso estimator. The first procedure takes advantage of the maximum likelihood estimator, whereas the second one takes into account the matrix structure of Φ through a low-rank estimation.
1.3.1 Lasso-MLE procedure
This procedure is decomposed into three main steps: we construct a model collection, then in each model we compute the maximum likelihood estimator, and finally we choose the best model in the collection.
The first step consists in constructing a model collection {H_{(K,J)}}_{(K,J)∈M}, in which H_{(K,J)} is defined by equation (1.8) and the model collection is indexed by M = K × J. We denote by K ⊂ N^* the possible numbers of components; we assume that K can be bounded without loss of estimation. We also let J ⊂ P({1,…,q} × {1,…,p}).
To detect the relevant variables and construct the sets J ∈ J, we penalize the empirical contrast by an ℓ1-penalty on the mean parameters, proportional to \|\Phi_k\|_1 = \sum_{j=1}^{p} \sum_{z=1}^{q} |[\Phi_k]_{z,j}|. In ℓ1-procedures, the choice of the regularization parameter is often difficult. Fixing the number of components K ∈ K, we propose to construct a data-driven grid G_K of regularization parameters by using the updating formulae of the mixture parameters in the EM algorithm. We can give a formula for the regularization parameter λ cancelling a given coefficient: for all k ∈ {1,…,K}, j ∈ {1,…,p}, z ∈ {1,…,q},

[\Phi_k]_{z,j} = 0 \quad \Leftrightarrow \quad \lambda \geq \lambda_{k,j,z} = \frac{|[S_k]_{j,z}|}{n \pi_k};

with [S_k]_{j,z} defined by (1.6). Then, we define the data-driven grid by

G_K = \left\{ \lambda_{k,j,z},\ k \in \{1, \ldots, K\},\ j \in \{1, \ldots, p\},\ z \in \{1, \ldots, q\} \right\}.
We could compute it from maximum likelihood estimations.
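A small sketch of this data-driven grid, assuming the quantities [S_k]_{j,z} of (1.6) and the proportions π_k have already been computed from a maximum likelihood fit (the numerical values below are arbitrary):

```python
import numpy as np

def lasso_grid(S, pi, n):
    """Data-driven grid G_K = { |[S_k]_{j,z}| / (n pi_k) } of regularization
    parameters. S has shape (K, p, q) with entries [S_k]_{j,z}, pi has shape (K,)."""
    lambdas = np.abs(S) / (n * pi[:, None, None])
    return np.unique(lambdas)          # sorted grid of candidate lambdas

# toy usage with arbitrary values (for illustration only)
rng = np.random.default_rng(2)
K, p, q, n = 2, 5, 3, 100
S = rng.normal(scale=n, size=(K, p, q))
pi = np.array([0.4, 0.6])
print(lasso_grid(S, pi, n)[:5])
```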
Then, for each λ ∈ G_K, we compute the Lasso estimator defined by

\hat{\theta}^{Lasso}(\lambda) = \operatorname*{argmin}_{\theta \in \Theta_{(K,J)}} \left\{ - \frac{1}{n} \sum_{i=1}^{n} \log\big( h_\theta^{(K,J)}(y_i|x_i) \big) + \lambda \sum_{k=1}^{K} \pi_k \|\Phi_k\|_1 \right\}.

For a fixed number of mixture components K ∈ K and a regularization parameter λ ∈ G_K, we use an EM algorithm, recalled in Appendix 1.7.1, to approximate this estimator. Then, for each K ∈ K and each λ ∈ G_K, we construct the relevant variables set J_λ. We denote by J the collection of all these sets.
The second step consists in approximating the MLE

\hat{h}^{(K,J)} = \operatorname*{argmin}_{h \in H_{(K,J)}} \left\{ - \frac{1}{n} \sum_{i=1}^{n} \log\big( h(y_i|x_i) \big) \right\},

using the EM algorithm for each model (K,J) ∈ M.
The third step is devoted to model selection. Rather than selecting the regularization parameter, we select the refitted model. We use the slope heuristic described in [BM07]. Let us briefly explain how it works. Firstly, the models are grouped according to their dimension D, to obtain a model collection {H_D}_{D∈D}; the dimension of a model is the number of parameters estimated in the model. For each dimension D, let ĥ_D be the estimator maximizing the likelihood among the estimators associated with a model of dimension D. The function D/n ↦ 1/n \sum_{i=1}^{n} \log(\hat{h}_D(y_i|x_i)) has a linear behavior for large dimensions. We estimate its slope, denoted by κ̂, which is used to calibrate the penalty. The minimizer D̂ of the penalized criterion -1/n \sum_{i=1}^{n} \log(\hat{h}_D(y_i|x_i)) + 2\hat{\kappa} D/n is determined, and the model selected is (K_{D̂}, J_{D̂}). Remark that D = K(|J| + q + 1) − 1.
Note that the model is selected after refitting the parameters, which avoids the issue of selecting the regularization parameter. For an oracle inequality justifying the slope heuristic used here, see [Dev14b].
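A schematic sketch of this selection step is given below; the robust calibration of κ̂ used in practice is replaced here by a plain least-squares fit on the largest dimensions, and all inputs are assumed to be precomputed.

```python
import numpy as np

def slope_heuristic(dims, loglik, n, frac_largest=0.5):
    """Select a model dimension by the slope heuristic.
    dims: model dimensions D; loglik: corresponding maximal log-likelihoods
    sum_i log(h_D(y_i|x_i)); n: sample size."""
    dims = np.asarray(dims, dtype=float)
    contrast = -np.asarray(loglik) / n            # -1/n sum log-likelihood
    # estimate the slope kappa on the largest dimensions, where the
    # contrast is expected to be linear in D/n
    order = np.argsort(dims)
    large = order[int(len(order) * (1 - frac_largest)):]
    slope, _ = np.polyfit(dims[large] / n, contrast[large], deg=1)
    kappa = -slope                                # contrast decreases with D/n
    # penalized criterion with penalty 2 * kappa * D / n
    crit = contrast + 2.0 * kappa * dims / n
    return dims[np.argmin(crit)], kappa

# toy usage with a synthetic contrast curve (illustration only)
D = np.arange(5, 100, 5)
ll = 50.0 * np.log(D) + 0.8 * D                   # fake log-likelihood values
print(slope_heuristic(D, ll, n=100))
```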
1.3.2 Lasso-Rank procedure
Whereas the previous procedure does not take the multivariate structure into account, we propose a second procedure to address this point. For each model belonging to the collection H_{(K,J)}, a subcollection is constructed by varying the rank of Φ. Let us describe this procedure.
As in the Lasso-MLE procedure, we first construct a collection of models thanks to the ℓ1-approach. For λ ≥ 0, we obtain an estimator of θ, denoted by θ̂^{Lasso}(λ), for each model belonging to the collection. From it we deduce the set of relevant columns, denoted by J_λ, and this for all K ∈ K: we deduce the collection J of relevant variables sets.
The second step consists in constructing a subcollection of models with rank sparsity, denoted by {Ȟ_{(K,J,R)}}_{(K,J,R)∈M_R}.
The model Ȟ_{(K,J,R)} has K components, the set J of active variables, and R is the vector of the ranks of the matrices of regression coefficients in each group:
Ȟ_{(K,J,R)} = \left\{ y \in \mathbb{R}^q \mid x \in \mathbb{R}^p \mapsto h_\theta^{(K,J,R)}(y|x) \right\}   (1.9)

where

h_\theta^{(K,J,R)}(y|x) = \sum_{k=1}^{K} \frac{\pi_k \det(P_k)}{(2\pi)^{q/2}} \exp\left( - \frac{\big(P_k y - (\Phi_k^{R_k})^{[J]} x\big)^t \big(P_k y - (\Phi_k^{R_k})^{[J]} x\big)}{2} \right);

\theta = \big(\pi_1, \ldots, \pi_K, (\Phi_1^{R_1})^{[J]}, \ldots, (\Phi_K^{R_K})^{[J]}, P_1, \ldots, P_K\big) \in \Theta_{(K,J,R)} = \Pi_K \times \Psi_{(K,J,R)} \times \left( \mathbb{R}_+^q \right)^K;

\Psi_{(K,J,R)} = \left\{ \big( (\Phi_1^{R_1})^{[J]}, \ldots, (\Phi_K^{R_K})^{[J]} \big) \in \left( \mathbb{R}^{q \times p} \right)^K \mid \operatorname{Rank}(\Phi_k) = R_k \text{ for all } k \in \{1, \ldots, K\} \right\};
and M_R = K × J × R. We denote by K ⊂ N^* the possible numbers of components, by J a collection of subsets of {1,…,p}, and by R the set of vectors of size K ∈ K whose entries are the rank values of the mean matrices. We can compute the MLE under the rank constraint thanks to an EM algorithm: indeed, we can constrain the estimate of Φ_k, for the cluster k, to have rank R_k, by keeping only its R_k largest singular values. More details are given in Section 1.7.1. This leads to an estimator of the mean with row sparsity and low rank in each model. As described in the previous section, a model is then selected using the slope heuristic. This step is justified theoretically in [Dev15].
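The rank constraint mentioned above amounts to a truncated singular value decomposition of each Φ_k; a minimal sketch (hypothetical function name):

```python
import numpy as np

def truncate_rank(Phi_k, R_k):
    """Project the q x p matrix Phi_k onto the set of matrices of rank at most
    R_k by keeping its R_k largest singular values."""
    U, s, Vt = np.linalg.svd(Phi_k, full_matrices=False)
    s[R_k:] = 0.0
    return (U * s) @ Vt

# toy check: the truncated matrix has rank R_k
rng = np.random.default_rng(3)
Phi_k = rng.normal(size=(4, 6))
Phi_low = truncate_rank(Phi_k, R_k=2)
print(np.linalg.matrix_rank(Phi_low))   # 2
```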
1.4 Illustrative example
We illustrate our procedures on simulated datasets, adapted from [SBG10], belonging to the model collection. Both procedures have been implemented in Matlab, with the help of Benjamin Auder, and the Matlab code is available. Firstly, we present the models used in these simulations. Then, we numerically validate each step, and we finally compare the results of our procedures with others. Remark that we propose here some examples to illustrate our methods, not a complete analysis; we highlight some issues which seem important. Moreover, we do not illustrate the one-component case, focusing on the clustering. If, on some dataset, we are not convinced by the clustering, we could add to the model collection models with one component, more or less sparse, using the same pattern (computing the Lasso estimator to get the relevant variables for various regularization parameters, and refitting the parameters with the maximum likelihood estimator, under a rank constraint or not), and select a model among this collection of linear and mixture models.
1.4.1 The model
Let x be a sample of size n drawn from a multivariate standard Gaussian distribution. We consider a mixture of two components. Besides, we fix the number of active variables to 4 in each cluster: more precisely, the first four variables of Y are explained respectively by the first four variables of X. We fix π = (1/2, 1/2) and P_k = I_q for all k ∈ {1,2}.
The difficulty of the clustering is partially controlled by the signal-to-noise ratio. In this context, we can extend the natural idea of the SNR with the following definition, where Tr(A) denotes the trace of the matrix A:

\mathrm{SNR} = \frac{\operatorname{Tr}(\operatorname{Var}(Y))}{\operatorname{Tr}\big(\operatorname{Var}(Y \mid \beta_k = 0 \text{ for all } k \in \{1, \ldots, K\})\big)}.

Remark that it only controls the distance between the signal with and without noise, and not the distance between the two signals.
We compute different models, varying n, the SNR, and the distance between the clusters. Details are given in Table 1.1.
          n     k   p   q   β1|J  β2|J  σ   SNR
Model 1   2000  2   10  10   3    -2    1   3.6
Model 2   100   2   10  10   3    -2    1   3.6
Model 3   100   2   10  10   3    -2    3   1.88
Model 4   100   2   10  10   5     3    1   7.8
Model 5   50    2   30  5    3    -2    1   3.6

Table 1.1: Description of the different models
Take a sample of Y according to the Gaussian mixture, with mean β_k x and covariance matrix Σ_k = (ᵗP_k P_k)^{-1} = σ I_q in cluster k. We run our procedures with the number of components varying in K = {2,…,5}.
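For illustration, here is a sketch of how such a dataset (Model 1 of Table 1.1) can be simulated, together with an empirical check of the SNR defined above:

```python
import numpy as np

# simulate Model 1 of Table 1.1: n = 2000, K = 2, p = q = 10,
# beta1|J = 3, beta2|J = -2 on the first four variables, sigma = 1
rng = np.random.default_rng(4)
n, p, q, K, sigma = 2000, 10, 10, 2, 1.0
pi = np.array([0.5, 0.5])

beta = np.zeros((K, q, p))
for j in range(4):                       # 4 relevant variables per cluster
    beta[0, j, j] = 3.0
    beta[1, j, j] = -2.0

x = rng.standard_normal((n, p))
z = rng.choice(K, size=n, p=pi)          # latent cluster labels
noise = sigma * rng.standard_normal((n, q))
y = np.einsum("nqp,np->nq", beta[z], x) + noise

# empirical check of the signal-to-noise ratio defined above:
# Tr(Var(Y)) / Tr(Var(Y | beta_k = 0)) should be close to 3.6
snr = np.trace(np.cov(y, rowvar=False)) / np.trace(np.cov(noise, rowvar=False))
print(round(snr, 2))
```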
Model 1 illustrates our procedures in a low-dimensional setting; moreover, it is used in the next section to illustrate each step of the procedure (variable selection, model construction and model selection). Model 5 is considered high-dimensional, because p × K > n. Model 2 is easier than the others, because the clusters are not so close to each other relative to the noise. Model 3 is constructed as Models 1 and 2, but n is small and the noise is larger; we will see that this makes the clustering difficult. Model 4 has a larger SNR; nevertheless, the clustering problem is difficult, because the β_k are closer to each other.
Our procedures are run 20 times, and we compute statistics on our results over those 20 simulations: this is a small number, but the whole procedure is time-consuming, and the results are convincing enough.
For the initialization, we repeat the initialization 50 times and keep the one which maximizes the log-likelihood function after 10 iterations. These choices are size-dependent; a numerical study not reported here concludes that they are sufficient in this case.
1.4.2 Sparsity and model selection

To illustrate both procedures, all the analyses in this section are carried out on model 1, for which the choices made at each step are clear.
Firstly, we compute the grid of regularization parameters. More precisely, each regularization parameter is computed from maximum likelihood estimations (using the EM algorithm), and gives an associated sparsity (computed by the Lasso estimator, again using the EM algorithm). In Figures 1.1 and 1.2, the collection of relevant variables selected along this grid is plotted.
We first notice that the number of relevant variables selected by the Lasso decreases with the regularization parameter. We can analyze more precisely which variables are selected, that is to say whether we select true relevant or false relevant variables. If the regularization parameter is not too large, the true relevant variables are selected. Even more, if the regularization parameter is well chosen, we select only the true relevant variables: in our example, for λ = 0.09, exactly the true relevant variables are selected. This grid construction therefore seems appropriate according to these simulations.
From this variable selection, each procedure (Lasso-MLE or Lasso-Rank) leads to a model collection, with the sparsity varying along the grid of regularization parameters, and with the number of components varying in K.
Among this collection, we select a model with the slope heuristic, that is by minimizing a penalized criterion.

Figure 1.1: For one simulation, number of false relevant (in red) and true relevant (in blue) variables selected by the Lasso, varying the regularization parameter λ in the grid G2.
Figure 1.2: For one simulation, zoom on the number of false relevant (in red) and true relevant (in blue) variables selected by the Lasso, varying the regularization parameter λ around the interesting values.

The penalty is computed by performing a linear regression on the couples of points {(D/n, −(1/n) Σ_{i=1}^{n} log(ĥD(yi|xi)))}, for D varying. The estimated slope κ̂ gives access to the best model, the one with dimension D̂ minimizing −(1/n) Σ_{i=1}^{n} log(ĥD(yi|xi)) + 2κ̂D/n. In practice, we have to check whether the couples of points exhibit a linear behavior. For each procedure, we construct the model collection and have to justify this behavior. Figures 1.3 and 1.4 represent the log-likelihood as a function of the dimension of the models, for the model collections constructed respectively by the Lasso-MLE procedure and by the Lasso-Rank procedure. The couples are plotted as points, whereas the estimated slope is drawn as a dotted line. We observe more than one line (4 for the Lasso-MLE procedure, more for the Lasso-Rank procedure). This phenomenon can be explained by a linear behavior for each mixture, fixing the number of classes, and for each rank. Nevertheless, the slopes are almost the same, and select the same model. In practice, we estimate the slope with the Capushe package [BMM12].
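For illustration, here is a minimal Python sketch of the slope heuristic used at this step (the fraction of large dimensions kept for the regression is an assumption; in practice the thesis relies on the Capushe package).

import numpy as np

def slope_heuristic(dims, neg_loglik, n, keep_fraction=0.5):
    # dims[m]: dimension D_m of model m; neg_loglik[m]: -(1/n) sum_i log hhat_m(y_i|x_i).
    # Fit a line on the largest dimensions, then penalize with twice the estimated slope.
    dims = np.asarray(dims, dtype=float)
    neg_loglik = np.asarray(neg_loglik, dtype=float)
    order = np.argsort(dims)
    large = order[int(len(order) * (1 - keep_fraction)):]
    slope, _ = np.polyfit(dims[large] / n, neg_loglik[large], 1)
    kappa_hat = -slope                               # the fitted slope is negative
    crit = neg_loglik + 2 * kappa_hat * dims / n
    return int(np.argmin(crit)), kappa_hat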
1.4.3 Assessment

We compare our procedures to three other procedures on the simulated models 1, 2, 3 and 4.
Firstly, let us give some remarks about model 1. For each procedure, we get a good clustering and a very low Kullback-Leibler divergence: the sample size is large, and the estimations are good. That is the reason why we focus in this section on models 2, 3 and 4.
To compare our procedures with others, we compute the Kullback-Leibler divergence with the true density and the ARI (the Adjusted Rand Index, measuring the similarity between two data clusterings, knowing that the closer to 1 the ARI, the more similar the two partitions), and we record which variables are selected and how many clusters are selected. For more details on the ARI, see [Ran71].
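For instance, the ARI between an estimated partition and the true one can be computed as follows (a sketch relying on scikit-learn, which is an assumption of this illustration: the thesis experiments are run in Matlab).

from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 1, 0]
estimated   = [1, 1, 0, 0, 0, 1]     # same partition up to label switching
print(adjusted_rand_score(true_labels, estimated))   # 1.0: the ARI is label-invariant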
From the Lasso-MLE model collection, we construct two models to compare our procedures with. We compute the oracle (the model which minimizes the Kullback-Leibler divergence with the true density), and the model selected by the BIC criterion instead of the slope heuristic. Thanks to the oracle, we know how well we could do within this model collection in terms of Kullback-Leibler divergence, and how this model, as good as possible for the log-likelihood, performs for the clustering.

Figure 1.3: For one simulation, slope graph obtained by our Lasso-Rank procedure. For large dimensions, we observe a linear part.
Figure 1.4: For one simulation, slope graph obtained by our Lasso-MLE procedure. For large dimensions, we observe a linear part.

The third procedure we compare with is the maximum likelihood estimator, assuming that the number of clusters is known, fixed to 2. We use this procedure to show that variable selection is needed.
In each case, we apply the MAP principle to compare clusterings.
We do not plot the Kullback-Leibler divergence for the MLE procedure, because the values are too high and would make the boxplots unreadable.
For model 2, according to Figure 1.5 for the Kullback-Leibler divergence and Figure 1.6 for the ARI, the Kullback-Leibler divergence is small and the ARI is close to 1, except for the MLE procedure. The boxplots remain readable with those values, but it is important to highlight that variable selection is needed, even in a model of reasonable dimension. The model collections are thus well constructed. Model 3 is more difficult, because the noise is higher. That is why the results, summarized in Figures 1.7 and 1.8, are not as good as for model 2. Nevertheless, our procedures lead to the best ARI, and the Kullback-Leibler divergences are close to that of the oracle. The same remarks hold for model 4: in this setting, the means are closer relative to the noise. Results are summarized in Figures 1.9 and 1.10.
Model 5 is high-dimensional. Models selected by the BIC criterion are poor, in comparison with models selected by our procedures or with the oracle. They are poor for estimation, according to the Kullback-Leibler divergence boxplots in Figure 1.11, but also for clustering, according to Figure 1.12. Our models are not as well constructed as previously, which is explained by the high-dimensional context and is reflected in the high Kullback-Leibler divergence. Nevertheless, our clustering performances remain very good.
Note that the Kullback-Leibler divergence is smaller for the Lasso-MLE procedure, thanks to the maximum likelihood refitting. Moreover, the true model does not have any matrix structure. If we look at the MLE, which does not use the sparsity hypothesis, we can conclude that the estimations are not satisfactory, which can be explained by the high-dimensional issue. The Lasso-MLE procedure, the Lasso-Rank procedure and the BIC model work almost as well as the oracle, which means that the models are well selected.
Figure 1.5: Boxplots of the Kullback-Leibler divergence between the true model and the one selected by the procedure over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle) and the BIC estimator (BIC), for model 2.
Figure 1.6: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE), for model 2.
Figure 1.7: Boxplots of the Kullback-Leibler divergence between the true model and the one selected by the procedure over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle) and the BIC estimator (BIC), for model 3.
Figure 1.8: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE), for model 3.
Figure 1.9: Boxplots of the Kullback-Leibler divergence between the true model and the one selected by the procedure over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle) and the BIC estimator (BIC), for model 4.
Figure 1.10: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE), for model 4.
Figure 1.11: Boxplots of the Kullback-Leibler divergence between the true model and the one selected by the procedure over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle) and the BIC estimator (BIC), for model 5.
Figure 1.12: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE), for model 5.
                      Model 2               Model 3               Model 4
Procedure          TR        FR          TR        FR          TR        FR
Lasso-MLE         8(0)    2.2(6.9)      8(0)   4.3(28.8)      8(0)    2(13.5)
Lasso-Rank        8(0)     24(0)        8(0)     24(0)        8(0)     24(0)
Oracle            8(0)    1.5(3.3)   7.8(0.2)  2.2(11.7)      8(0)    0.8(2.6)
BIC estimator     8(0)   2.6(15.8)   7.8(0.2)  5.7(64.8)      8(0)   2.6(11.8)

Table 1.2: Mean number {TR, FR} of true relevant and false relevant variables over the 20 simulations for each procedure, for models 2, 3 and 4. The standard deviations are given in parentheses.
In Table 1.2, we summarize results about variable selection: for each model and each procedure, we report how many true relevant and false relevant variables are selected.
The true model has 8 relevant variables, which are always recovered. The Lasso-MLE has fewer false relevant variables than the others, which means that the true structure is found. The Lasso-Rank has 24 false relevant variables, because of the matrix structure: the true rank in each component is 4, so the estimator restricted to the relevant variables is a 4 × 4 matrix, and we get 12 false relevant variables in each component. Nevertheless, no additional variables are selected, that is to say the constructed model is the best possible one under this structure. The BIC estimator and the oracle have a large variability for the false relevant variables.
Concerning the number of components, all the procedures select the true number 2.
Thanks to the maximum likelihood refitting, the first procedure gives good estimations (better than the second one). Nevertheless, depending on the data, the second procedure could be more attractive: if there is a matrix structure, for example if most of the variation of the response Y is caught by a small number of linear combinations of the predictors, the second procedure will work better.
We can conclude that the model collection is well constructed, and that the clustering is well done.
1.5 Functional datasets

One of the main interests of our methods is that they can be applied to functional datasets. Indeed, in several fields of application, the data at hand are functions. Functional data analysis was first popularized by Ramsay and Silverman in their book [RS05], which describes the main tools to analyze functional datasets. Another reference is the book by Ferraty and Vieu [FV06]. However, most of the existing literature on functional regression concentrates on the case of a scalar response Y and a functional predictor X. For example, we can cite Zhao et al. [ZOR12] for the use of a wavelet basis in the linear model, Yao et al. [YFL11] for functional mixture regression, or Ciarleglio et al. [CO14] for the use of a wavelet basis in functional mixture regression. In this section, we concentrate on the case where Y and X are both functional. In this regression context, we are interested in clustering: it leads to identifying the individuals involved in the same relationship between Y and X. Note that, with functional datasets, we have to denoise and smooth the signals, to remove the noise and capture only the important patterns in the data. Here, we explain how our procedures can be applied in this context. Note that we could also apply our procedures with a scalar response and a functional regressor, or, on the contrary, with a functional response and a scalar regressor: we explain how our procedures are generalized in the most difficult case, the other cases following from it. Remark that we focus here on wavelet bases, to take advantage of the time-scale decomposition, but the same analysis is available with a Fourier basis or splines.
1.5.1 Functional regression model

Suppose we observe a centered sample of functions (fi, gi)_{1≤i≤n}, associated with the random variables (F, G), coming from a probability distribution with unknown conditional density s*. We want to estimate this model by a functional mixture model: if the variables (F, G) come from component k, there exists a function βk such that

G(t) = ∫_{Ix} F(u) βk(u, t) du + ε(t),     (1.10)

where ε is a residual function. This linear model is introduced in Ramsay and Silverman's book [RS05], where it is proposed to project both the response and the regressors onto a basis. We extend their model to a mixture model, to take into account several subgroups in the sample.
If we assume that, for all t, for all i ∈ {1, . . . , n}, εi(t) ∼ N(0, Σk), the model (1.10) is an integrated version of the model (1.1). Depending on the cluster k, the linear relationship between G and F is described by the function βk.
1.5.2 Two procedures to deal with functional datasets

Projection onto a wavelet basis
To deal with functional data, we project them onto a basis, to obtain data as described in the Gaussian mixture regression model (1.1). In this chapter, we choose to work with wavelet bases, given that they represent localized features of functions in a sparse way. If the coefficient matrices x and y are sparse, the regression matrix β is more likely to be sparse. Moreover, a signal can be represented by only a few coefficients, so this is also a way to reduce the dimension. For details about wavelet theory, see Mallat's book [Mal99].
Let us begin with an overview of some important aspects of wavelet bases.
Let ψ be a real wavelet function, satisfying

ψ ∈ L¹ ∩ L², tψ ∈ L¹, and ∫_R ψ(t) dt = 0.

We denote by ψl,h the function defined from ψ by dyadic dilation and translation:

ψl,h(t) = 2^{l/2} ψ(2^l t − h) for (l, h) ∈ Z².

The wavelet coefficients of a signal f are defined by

dl,h(f) = ∫_R f(t) ψl,h(t) dt for (l, h) ∈ Z².

Let ϕ be a scaling function related to ψ, and ϕl,h the dilation and translation of ϕ for (l, h) ∈ Z². We also define, for (l, h) ∈ Z², βl,h(f) = ∫_R f(t) ϕl,h(t) dt.
Note that the scaling functions serve to construct approximations of the function of interest, while the wavelet functions serve to provide the details not captured by successive approximations.
We denote by Vl the space generated by {ϕl,h}_{h∈Z}, and by Wl the space generated by {ψl,h}_{h∈Z}, for all l ∈ Z. Remark that

Vl−1 = Vl ⊕ Wl for all l ∈ Z,   and   L² = ⊕_{l∈Z} Wl.

Let L ∈ N*. For a signal f, we define the approximation at level L by

AL = Σ_{l>L} Σ_{h∈Z} dl,h ψl,h;
and f can be decomposed into the approximation at level L and the details (dl,h)_{l<L}.
The decomposition of the basis into scaling functions and wavelet functions emphasizes the local nature of the wavelets, and this is an important aspect for our procedures, because we want to know which details allow us to cluster two observations together.
Consider the sample (fi, gi)_{1≤i≤n}, and introduce the wavelet expansion of fi in the basis B: for all t ∈ [0, 1],

fi(t) = Σ_{h∈Z} βL,h(fi) ϕL,h(t) + Σ_{l≤L} Σ_{h∈Z} dl,h(fi) ψl,h(t),

where the first sum is the approximation AL. The collection {βL,h(fi), dl,h(fi)}_{l≤L, h∈Z} is the Discrete Wavelet Transform (DWT) of fi in the basis B.
Because we project onto an orthonormal basis, this leads to an n-sample (x1, . . . , xn) of wavelet coefficient decomposition vectors, with

fi = W xi,

in which fi is the vector of the discretized values of the signal, xi the vector of its coefficients in the basis B, and W a p × p matrix defined by ϕ and ψ. The DWT can be performed by a computationally fast pyramid algorithm (see Mallat, [Mal89]). In the same way, there exists W′ such that gi = W′ yi, with y = (y1, . . . , yn) an n-sample of wavelet coefficient decomposition vectors. Because the matrices W and W′ are orthogonal, we keep the mixture structure, and the noise is still Gaussian. We can therefore consider the wavelet coefficient dataset (x, y) = ((x1, y1), . . . , (xn, yn)), which defines n observations whose probability distribution can be modeled by the finite Gaussian mixture regression model (1.1).
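As an illustration, the projection of a discretized signal onto a wavelet basis can be obtained with a fast DWT; the sketch below relies on the PyWavelets library (an assumption of this illustration: the thesis experiments use the Matlab wavelet toolbox).

import numpy as np
import pywt

# a discretized signal f_i observed on a regular grid
t = np.linspace(0, 1, 64)
f_i = np.cos(2 * np.pi * t) + 0.1 * np.random.default_rng(0).standard_normal(64)

# DWT: approximation coefficients at the coarsest level, then details at each finer level
coeffs = pywt.wavedec(f_i, wavelet='db2', level=2)
x_i = np.concatenate(coeffs)          # coefficient vector, used as a row of the design matrix

# the transform is invertible, so the signal is recovered up to numerical error
f_back = pywt.waverec(coeffs, wavelet='db2')
print(np.max(np.abs(f_i - f_back[:len(f_i)])))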
Our procedures
We can apply both procedures to this dataset and obtain a clustering of the data: rather than considering (f, g), we run our procedures on the sample (x, y), varying the number of clusters in K.
The notion of relevant variable is natural in this setting: the function ϕl,h or ψl,h is irrelevant if it appears in none of the wavelet coefficient decompositions of the functions, in any cluster.
1.5.3 Numerical experiments

We illustrate our procedures on functional datasets using the Matlab wavelet toolbox (see Misiti et al. [MMOP04] for details). Firstly, we simulate functional datasets, where the true model belongs to the model collection. Then, we run our procedure on an electricity dataset, to cluster successive days: we have access to time series of load consumption, measured every half-hour over 70 days; we extract the signal of each day, construct couples made of each day and the previous day, and we aim at clustering these couples. Finally, we test our procedures on the well-known Tecator dataset, a benchmark dataset of spectrometric curves and fat contents of meat. These experiments illustrate different aspects of our procedures. The simulated example shows that our procedures work in a functional context. The second example is a toy example used to validate the classification, on real data already studied, for which we clearly understand the clusters. The last example illustrates the use of the classification to perform prediction, and the description that our procedures give of the constructed model.
Simulated functional data
Firstly, we simulate a mixture regression model. Let f be a sample of a noisy cosine function, discretized on a grid of 15 points. Let g be, depending on the cluster, either f or −f, corrupted by white noise.
We use the Daubechies-2 basis at level 2 to decompose the signals.
Our procedures are run 20 times, with the number of clusters fixed to K = 2. Our procedures, run on the projected data, are then compared with the oracle of the collection constructed by the Lasso-MLE procedure, and with the model selected by the BIC criterion among this collection. The MLE is also computed.
Figure 1.13: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (Bic) and the MLE (MLE).
This simulated example shows that our procedures can also cluster functional data, by working on the projected dataset.
Electricity dataset
We also study clustering on an electricity dataset. This example is studied in [MMOP07]. We work on a sample of 70 couples of days, plotted in Figure 1.14. For each couple, we have access to the half-hour load consumption.
Figure 1.14: Plot of the 70-sample of half-hour load consumption, over the two days.
Figure 1.15: Plot of a week of load consumption.
As said previously, we want to cluster the relationship between two successive days. In Figure 1.15, we plot a week of consumption.
The regression model takes F as the first day and G as the second day of each couple. Each day is discretized on 48 points, one every half-hour. In our opinion, a single linear model is not appropriate, as the behavior from the eve to the day depends on which day is considered: there is a difference between working days and weekend days, as shown in Figure 1.15.
To apply our procedures, we project f and g onto a wavelet basis; the symmlet-4 basis, at level 5, is used.
We run our procedures with the number of clusters varying from 2 to 6. Both procedures select a model with 4 components. The first one gathers couples of weekdays, the second corresponds to Friday-Saturday, the third to Saturday-Sunday and the fourth to Sunday-Monday. This result is consistent with the knowledge we have about these data: working days have the same behavior, depending on the eve, whereas days off do not behave like working days, and conversely. Moreover, the article [MMOP07], which also studied this example, obtains the same classification.
Tecator dataset
This example deals with spectrometric data. More precisely, a food sample has been considered, containing finely chopped pure meat with different fat contents. The data consist of a 100-channel spectrum of absorbances in the wavelength range 850-1050 nm, and of the percentage of fat. We observe a sample of size 215. Those data have been studied in many articles; see for example Ferraty and Vieu's book [FV06], which considers different approaches: prediction, and classification, either supervised (where the fat content becomes a class, larger or smaller than 20%) or unsupervised (ignoring the response variable). In this work, we focus on clustering the data according to the relationship between the fat content and the absorbance spectrum. We cannot predict the response variable directly, because we do not know the class of a new observation; estimating it is a difficult problem, in which we are not involved in this chapter.
We take advantage of our procedures to know which coefficients, in the wavelet basis decomposition of the spectrum, are useful to describe the fat content.
The sample is split into two subsamples, 165 observations for the learning set and 50 observations for the test set. The split is made so that the marginal distribution of the response is the same in each sample.
The spectrum is a function, which we decompose in the Haar basis at level 6. Nevertheless, our model does not include a constant coefficient to describe the response. Thereby, before running our procedure, we center y according to the learning sample, and we center each function xi over the whole sample. The mean of the response is then estimated by the mean µ̂ over the learning sample.
We construct models on the training set with our Lasso-MLE procedure. Thanks to the estimations, we have access to the relevant variables, and we can reconstruct the signals keeping only the relevant variables. We also have access to the a posteriori probabilities, which indicate with high probability to which cluster each observation belongs. However, for some observations, the a posteriori probabilities do not settle the clustering, being almost the same for different clusters. The procedure selects two models, which we describe here. In Figures 1.16 and 1.17, we represent the clusters obtained on the training set for the different models. The graph on the left is a candidate for representing each cluster, constructed as the mean of the spectra with an a posteriori probability greater than 0.6; we plot the curve reconstruction, keeping only the relevant variables in the wavelet decomposition. On the right side, we present the boxplot of the fat values in each class, for observations with an a posteriori probability greater than 0.6.
The first model has two classes, which can be distinguished in the absorbance spectrum by the bump at wavelengths around 940 nm. The first cluster is dominating, with π̂1 = 0.95. The fat content is smaller in the first cluster than in the second one. According to the signal reconstruction, we can see that almost all variables have been selected. This model seems consistent with the classification goal.
The second model has 3 classes, and we can remark several important wavelengths. Around 940 nm, there is some difference between classes, corresponding to the bump underlined in model 1, but also around 970 nm, with higher or smaller values. The first class is dominating, with π̂1 = 0.89. Only a few variables have been selected, which gives this model the ability to explain which coefficients are discriminating.
We can discuss those models. The first one selects only two classes, but almost all variables, whereas the second model has more classes and fewer variables: there is a trade-off between the number of clusters and variable selection for the dimension reduction.
Figure 1.16: Summarized results for the model 1. The graph on the left is a candidate for representing each cluster, constructed as the mean of the reconstructed spectra with an a posteriori probability greater than 0.6. On the right side, boxplot of the fat values in each class, for observations with an a posteriori probability greater than 0.6.
Figure 1.17: Summarized results for the model 2. The graph on the left is a candidate for representing each cluster, constructed as the mean of the reconstructed spectra with an a posteriori probability greater than 0.6. On the right side, boxplot of the fat values in each class, for observations with an a posteriori probability greater than 0.6.

According to those classifications, we can compute the response according to the linear model. We use two ways to compute ŷ: either consider the linear model in the cluster selected by the MAP principle, or mix the estimations of each cluster according to the a posteriori probabilities. We compute the Mean Absolute Percentage Error, MAPE = (1/n) Σ_{i=1}^{n} |(ŷi − yi)/yi|. Results are summarized in Table 1.3.
              Linear model in the class      Mixing estimation
              with higher probability
Model 1                0.200                       0.198
Model 2                0.055                       0.056

Table 1.3: Mean absolute percentage error of the predicted value, for each model, on the learning sample.
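A minimal sketch of these two prediction rules is given below (NumPy only; the array names and shapes are hypothetical placeholders for the fitted mixture parameters).

import numpy as np

def predict_response(x, beta_hat, tau):
    # x: (n, p) coefficients; beta_hat: (K, q, p) fitted regression matrices;
    # tau: (n, K) a posteriori probabilities.
    per_cluster = np.einsum('kqp,np->nkq', beta_hat, x)         # prediction within each cluster
    y_map = per_cluster[np.arange(len(x)), tau.argmax(axis=1)]  # rule 1: cluster chosen by MAP
    y_mix = np.einsum('nk,nkq->nq', tau, per_cluster)           # rule 2: posterior mixing
    return y_map, y_mix

def mape(y_hat, y):
    return np.mean(np.abs((y_hat - y) / y))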
Then, we work on the test sample. We use the response and the regressors to compute the a posteriori probability of each observation. Then, using our models, we compute the predicted fat values from the spectrometric curve, as before in two ways, mixing or choosing the classes.

              Linear model in the class      Mixing estimation
              with higher probability
Model 1               0.22196                     0.21926
Model 2               0.20492                     0.20662

Table 1.4: Mean absolute percentage error of the predicted value, for each model, on the test sample.

Because the models are constructed on the learning sample, the MAPE is lower on the learning sample than on the test sample. Nevertheless, the results are similar, which indicates that the models are well constructed. This is particularly the case for model 1, which is more consistent on a new sample.
To conclude this study, we can highlight the advantages of our procedure on these data. It provides a clustering of the data similar to the one obtained with supervised classification in [FV06], but we can explain how this clustering is done.
This work has been done with the Lasso-MLE procedure; nevertheless, the same kind of results has been obtained with the Lasso-Rank procedure.
1.6 Conclusion

In this chapter, two procedures are proposed to cluster regression data. By detecting the relevant clustering variables, they are especially designed for high-dimensional datasets. We use an ℓ1-regularization procedure to select variables, and then deduce a reasonable random model collection. Thus, we recast the estimation of the parameters of these models into a general model selection problem. These procedures are compared with usual criteria on simulated data: the BIC criterion used to select a model, the maximum likelihood estimator, and the oracle when we know it. In addition, we compare our procedures to others on benchmark data.
One main asset of those procedures is that they can be applied to functional datasets; we also develop this point of view.
1.7 Appendices

In these appendices, we develop the calculations of the EM algorithm updating formulae in Section 1.7.1, for the Lasso and maximum likelihood estimators, and for the low rank estimators. In Section 1.7.2, we extend our procedures with the Group-Lasso estimator to select the relevant variables, rather than using the Lasso estimator.

1.7.1 EM algorithms

EM algorithm for the Lasso estimator
Introduced by Dempster et al. in [DLR77], the EM (Expectation-Maximization) algorithm is used to compute maximum likelihood estimators, penalized or not.
The expected complete negative log-likelihood is denoted by

Q(θ|θ′) = −(1/n) E_{θ′}(lc(θ, X, Y, Z)|X, Y),

in which

lc(θ, X, Y, Z) = Σ_{i=1}^{n} Σ_{k=1}^{K} [Zi]_k log( det(Pk)/(2π)^{q/2} exp( −(1/2)(Pk Yi − Xi Φk)^t (Pk Yi − Xi Φk) ) ) + [Zi]_k log(πk),

where the [Zi]_k are independent and identically distributed unobserved multinomial variables, indicating the component membership of the i-th observation in the finite mixture regression model.
The expected complete penalized negative log-likelihood is

Qpen(θ|θ′) = Q(θ|θ′) + λ Σ_{k=1}^{K} πk ||Φk||_1.
Calculus for the updating formulae
— E-step: compute Q(θ|θ^{(ite)}), or, equivalently, compute, for k ∈ {1, . . . , K} and i ∈ {1, . . . , n},

τ_{i,k}^{(ite)} = E_{θ^{(ite)}}([Zi]_k |Y)
              = π_k^{(ite)} det(P_k^{(ite)}) exp( −(1/2)(P_k^{(ite)} Yi − Xi Φ_k^{(ite)})^t (P_k^{(ite)} Yi − Xi Φ_k^{(ite)}) )
                / Σ_{r=1}^{K} π_r^{(ite)} det(P_r^{(ite)}) exp( −(1/2)(P_r^{(ite)} Yi − Xi Φ_r^{(ite)})^t (P_r^{(ite)} Yi − Xi Φ_r^{(ite)}) ).

This formula updates the clustering, thanks to the MAP principle.
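A minimal NumPy sketch of this E-step is given below (variable names and shapes are assumptions; the responsibilities are computed on the log scale for numerical stability, a standard implementation choice rather than part of the derivation).

import numpy as np

def e_step(X, Y, pi, Phi, P):
    # X: (n, p), Y: (n, q); pi: (K,); Phi: (K, q, p); P: (K, q, q), with Sigma_k = (P_k^t P_k)^{-1}.
    # Returns tau: (n, K), the a posteriori probabilities.
    n, K = X.shape[0], pi.shape[0]
    log_num = np.empty((n, K))
    for k in range(K):
        resid = Y @ P[k].T - X @ Phi[k].T            # P_k y_i - Phi_k x_i, row-wise
        log_num[:, k] = (np.log(pi[k]) + np.log(np.linalg.det(P[k]))
                         - 0.5 * np.sum(resid ** 2, axis=1))
    log_num -= log_num.max(axis=1, keepdims=True)    # avoid overflow in the exponential
    tau = np.exp(log_num)
    return tau / tau.sum(axis=1, keepdims=True)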
— M-step: improve Qpen(θ|θ^{(ite)}).
For this, rewrite the Karush-Kuhn-Tucker conditions. We have

Qpen(θ|θ^{(ite)}) = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} E_{θ^{(ite)}}( [Zi]_k log( det(Pk)/(2π)^{q/2} exp( −(1/2)(Pk Yi − Xi Φk)^t (Pk Yi − Xi Φk) ) ) | Y )
    − (1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} E_{θ^{(ite)}}([Zi]_k log πk |Y) + λ Σ_{k=1}^{K} πk ||Φk||_1
 = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} ( −(1/2)(Pk Yi − Xi Φk)^t (Pk Yi − Xi Φk) ) E_{θ^{(ite)}}([Zi]_k |Y)
    − (1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} Σ_{z=1}^{q} log( [Pk]_{z,z}/√(2π) ) E_{θ^{(ite)}}([Zi]_k |Y)
    − (1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} E_{θ^{(ite)}}([Zi]_k |Y) log πk + λ Σ_{k=1}^{K} πk ||Φk||_1.     (1.11)

Firstly, we optimize this formula with respect to π: it is equivalent to optimizing

−(1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} τ_{i,k} log(πk) + λ Σ_{k=1}^{K} πk ||Φk||_1.

We obtain

π_k^{(ite+1)} = π_k^{(ite)} + t^{(ite)} ( (1/n) Σ_{i=1}^{n} τ_{i,k}^{(ite)} − π_k^{(ite)} ),

with t^{(ite)} ∈ (0, 1] the largest value in the grid {δ^l, l ∈ N}, with 0 < δ < 1, such that the objective function is not increasing.
To optimize (1.11) with respect to (Φ, P), we can rewrite the expression: it is similar to the optimization of

−(1/n) Σ_{i=1}^{n} τ_{i,k} ( Σ_{z=1}^{q} log([Pk]_{z,z}) − (1/2)(Pk Ỹi − X̃i Φk)^t (Pk Ỹi − X̃i Φk) ) + λ πk ||Φk||_1

for all k ∈ {1, . . . , K}, which is equivalent to the optimization of

−(nk/n) Σ_{z=1}^{q} log([Pk]_{z,z}) + (1/(2n)) Σ_{i=1}^{n} Σ_{z=1}^{q} ( [Pk]_{z,z} [Ỹi]_{k,z} − [Φk]_{z,.} [X̃i]_{k,.} )² + λ πk ||Φk||_1;
where nk = Σ_{i=1}^{n} τ_{i,k}. The minimizer in [Pk]_{z,z} cancels the partial derivative with respect to [Pk]_{z,z}:

−(nk/n)(1/[Pk]_{z,z}) + (1/(2n)) Σ_{i=1}^{n} 2 [Ỹi]_{k,z} ( [Pk]_{z,z} [Ỹi]_{k,z} − [Φk]_{z,.} [X̃i]_{k,.} ) = 0

for all k ∈ {1, . . . , K}, for all z ∈ {1, . . . , q}, which is equivalent to

−1 + (1/nk) [Pk]²_{z,z} Σ_{i=1}^{n} [Ỹi]²_{k,z} − (1/nk) [Pk]_{z,z} Σ_{i=1}^{n} [Ỹi]_{k,z} [Φk]_{z,.} [X̃i]_{k,.} = 0
⇔ −1 + (1/nk) [Pk]²_{z,z} ||[Ỹ]_{k,z}||²_2 − (1/nk) [Pk]_{z,z} ⟨[Ỹ]_{k,z}, [Φk]_{z,.} [X̃]_{k,.}⟩ = 0.

The discriminant of this second-order equation is

∆ = ( (1/nk) ⟨[Ỹ]_{k,z}, [Φk]_{z,.} [X̃]_{k,.}⟩ )² + (4/nk) ||[Ỹ]_{k,z}||²_2.

Then, for all k ∈ {1, . . . , K}, for all z ∈ {1, . . . , q},

[Pk]_{z,z} = ( (1/nk) ⟨[Ỹ]_{k,z}, [Φk]_{z,.} [X̃]_{k,.}⟩ + √∆ ) / ( (2/nk) ||[Ỹ]_{k,z}||²_2 ).

We can also look at equation (1.11) as a function of the variable Φ: setting the partial derivative with respect to [Φk]_{z,j} to zero, we obtain, for all z ∈ {1, . . . , q}, for all k ∈ {1, . . . , K}, for all j ∈ {1, . . . , p},

Σ_{i=1}^{n} [X̃i]_{k,j} ( [Pk]_{z,z} [Ỹi]_{k,z} − Σ_{j2=1}^{p} [X̃i]_{k,j2} [Φk]_{z,j2} ) − nλπk sgn([Φk]_{z,j}) = 0.

Then, for all k ∈ {1, . . . , K}, j ∈ {1, . . . , p}, z ∈ {1, . . . , q},

[Φk]_{z,j} = ( Σ_{i=1}^{n} [X̃i]_{k,j} [Pk]_{z,z} [Ỹi]_{k,z} − Σ_{i=1}^{n} Σ_{j2≠j} [X̃i]_{k,j} [X̃i]_{k,j2} [Φk]_{z,j2} − nλπk sgn([Φk]_{z,j}) ) / ||[X̃]_{k,j}||²_2.

To reduce notations, let, for all k ∈ {1, . . . , K}, j ∈ {1, . . . , p}, z ∈ {1, . . . , q},

[Sk]_{j,z} = − Σ_{i=1}^{n} [X̃i]_{k,j} [Pk]_{z,z} [Ỹi]_{k,z} + Σ_{i=1}^{n} Σ_{j2≠j} [X̃i]_{k,j} [X̃i]_{k,j2} [Φk]_{z,j2}.

Then

[Φk]_{z,j} = ( −[Sk]_{j,z} − nλπk sgn([Φk]_{z,j}) ) / ||[X̃]_{k,j}||²_2
           = ( −[Sk]_{j,z} + nλπk ) / ||[X̃]_{k,j}||²_2    if [Sk]_{j,z} > nλπk,
           = −( [Sk]_{j,z} + nλπk ) / ||[X̃]_{k,j}||²_2    if [Sk]_{j,z} < −nλπk,
           = 0                                             otherwise.

From these equalities, we can write the updating formulae. For j ∈ {1, . . . , p}, k ∈ {1, . . . , K}, z ∈ {1, . . . , q},

[Sk]^{(ite)}_{j,z} = − Σ_{i=1}^{n} [X̃i]_{k,j} [Pk]^{(ite)}_{z,z} [Ỹi]_{k,z} + Σ_{i=1}^{n} Σ_{j2≠j} [X̃i]_{k,j} [X̃i]_{k,j2} [Φk]^{(ite)}_{z,j2};
nk = Σ_{i=1}^{n} τ_{i,k};
([Ỹi]_{k,.}, [X̃i]_{k,.}) = √(τ_{i,k}) (Yi, Xi);

and t^{(ite)} ∈ (0, 1] is the largest value in the grid {δ^l, l ∈ N}, 0 < δ < 1, such that the objective function is not increasing.
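The Φ update above is a coordinate-wise soft-thresholding; a minimal sketch of one such update is given below (NumPy only; variable names and the surrounding loop over (k, z, j) are assumptions).

import numpy as np

def update_phi_entry(Xt_k, Yt_k, P_kzz, Phi_k, z, j, n, lam, pi_k):
    # One coordinate update of [Phi_k]_{z,j}; Xt_k: (n, p) and Yt_k: (n, q) are the data
    # whose rows have already been multiplied by sqrt(tau_{i,k}).
    others = np.delete(np.arange(Phi_k.shape[1]), j)
    S = (- np.sum(Xt_k[:, j] * P_kzz * Yt_k[:, z])
         + np.sum(Xt_k[:, j][:, None] * Xt_k[:, others] * Phi_k[z, others][None, :]))
    norm2 = np.sum(Xt_k[:, j] ** 2)
    thresh = n * lam * pi_k
    if S > thresh:
        return (-S + thresh) / norm2
    if S < -thresh:
        return -(S + thresh) / norm2
    return 0.0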
EM algorithm for the rank procedure
To take into account the matrix structure, we want to perform a dimension reduction on the rank of the mean matrix. If we knew to which cluster each observation belongs, we could compute the low rank estimator of the linear model in each component.
Indeed, an estimator of fixed rank r is known in the linear regression case: denoting by A+ the Moore-Penrose pseudo-inverse of A, and by [A]_r = U D_r V^t, where D_r is obtained from D by setting (D_r)_{i,i} = 0 for i ≥ r + 1 and U D V^t is the singular value decomposition of A, if Y = βX + Σ, an estimator of β with rank r is β̂_r = [(x^t x)^+ x^t y]_r.
We do not know the clustering of the sample, but the E-step of the EM algorithm computes it. We suppose in this case that Σk and πk are known, for all k ∈ {1, . . . , K}. We use this algorithm to determine Φk, for all k ∈ {1, . . . , K}, with ranks fixed to R = (R1, . . . , RK).
— E-step: compute, for k ∈ {1, . . . , K} and i ∈ {1, . . . , n},

τ_{i,k} = E_{θ^{(ite)}}([Zi]_k |Y)
        = π_k^{(ite)} det(P_k^{(ite)}) exp( −(1/2)(P_k^{(ite)} yi − xi Φ_k^{(ite)})^t (P_k^{(ite)} yi − xi Φ_k^{(ite)}) )
          / Σ_{r=1}^{K} π_r^{(ite)} det(P_r^{(ite)}) exp( −(1/2)(P_r^{(ite)} yi − xi Φ_r^{(ite)})^t (P_r^{(ite)} yi − xi Φ_r^{(ite)}) ).

— M-step: assign each observation to its estimated cluster, by the MAP principle applied thanks to the E-step: we say that Yi comes from the component argmax_{k∈{1,...,K}} τ_{i,k}^{(ite)}. Then we can define β̃_k^{(ite)} = (x_{|k}^t x_{|k})^{−1} x_{|k}^t y_{|k}, in which x_{|k} and y_{|k} are the restrictions of the sample to cluster k, and compute its singular value decomposition β̃_k^{(ite)} = U S V^t. Using the singular value decomposition as described before, we obtain the estimator.
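A minimal sketch of this rank-constrained M-step for one cluster follows (NumPy only; the pseudo-inverse is used as in the formula above, and the names are assumptions).

import numpy as np

def low_rank_beta(x_k, y_k, r):
    # Rank-r estimator in one cluster: beta_r = [(x^t x)^+ x^t y]_r,
    # obtained by truncating the SVD of the least squares solution.
    beta_ls = np.linalg.pinv(x_k.T @ x_k) @ x_k.T @ y_k   # least squares estimator, shape (p, q)
    U, s, Vt = np.linalg.svd(beta_ls, full_matrices=False)
    s[r:] = 0.0                                           # keep only the r largest singular values
    return (U * s) @ Vt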
1.7.2 Group-Lasso MLE and Group-Lasso Rank procedures

One way to modify those procedures is to use the Group-Lasso estimator rather than the Lasso estimator to select the relevant variables. Indeed, this estimator is more natural with respect to the relevant variable definition. Nevertheless, the results are very similar, because grouped variables are selected in both cases, whether by the Lasso or by the Group-Lasso estimator. In this section, we describe our procedures with the Group-Lasso estimator, which can be understood as a refinement of our procedures.

Context - definitions
Both our procedures take advantage of the Lasso estimator to select the relevant variables, in order to reduce the dimension for high-dimensional datasets. First, recall what a relevant variable is.
Definition 1.7.1. A variable indexed by (z, j) ∈ {1, . . . , q} × {1, . . . , p} is irrelevant for the clustering if [Φ1]_{z,j} = . . . = [ΦK]_{z,j} = 0. A relevant variable is a variable which is not irrelevant. We denote by J the set of relevant variables.
According to this definition, we can introduce the Group-Lasso estimator.
Definition 1.7.2. The Lasso estimator for mixture regression models with regularization parameter λ ≥ 0 is defined by

θ̂^{Lasso}(λ) := argmin_{θ∈ΘK} { −(1/n) l_λ(θ) },

where

−(1/n) l_λ(θ) = −(1/n) l(θ) + λ Σ_{k=1}^{K} πk ||Φk||_1,

where ||Φk||_1 = Σ_{j=1}^{p} Σ_{z=1}^{q} |[Φk]_{z,j}|, and with λ to be specified.
It is the estimator used in the two procedures described in the previous sections.
Definition 1.7.3. The Group-Lasso estimator for mixture regression models with regularization parameter λ ≥ 0 is defined by

θ̂^{Group-Lasso}(λ) := argmin_{θ∈ΘK} { −(1/n) l̃_λ(θ) },

where

−(1/n) l̃_λ(θ) = −(1/n) l(θ) + λ Σ_{j=1}^{p} Σ_{z=1}^{q} √K ||[Φ]_{z,j}||_2,

where ||[Φ]_{z,j}||²_2 = Σ_{k=1}^{K} |[Φk]_{z,j}|², and with λ to be specified.
This Group-Lasso estimator has the advantage of cancelling grouped variables rather than variables one by one. It is consistent with the relevant variable definition.
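To make the difference between the two penalties concrete, here is a small NumPy sketch computing both on a coefficient array Φ of shape (K, q, p) (an illustrative helper, not part of the thesis code).

import numpy as np

def lasso_penalty(Phi, pi, lam):
    # lambda * sum_k pi_k * ||Phi_k||_1
    return lam * np.sum(pi * np.abs(Phi).sum(axis=(1, 2)))

def group_lasso_penalty(Phi, lam):
    # lambda * sqrt(K) * sum_{z,j} ||[Phi]_{z,j}||_2, grouping the K components of each entry
    K = Phi.shape[0]
    return lam * np.sqrt(K) * np.sum(np.linalg.norm(Phi, axis=0))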
Nevertheless, depending on the dataset, it could be interesting to look at which variables are cancelled first. One way would be to extend this work with the Lasso-Group-Lasso estimator, described for example for the linear model in [SFHT13].
Let us describe the two additional procedures, which use the Group-Lasso estimator rather than the Lasso estimator to detect the relevant variables.

Group-Lasso-MLE procedure
This procedure is decomposed into three main steps: we construct a model collection, then in each model we compute the maximum likelihood estimator, and we select the best model among all of them.
The first step consists of constructing a collection of models {H_{(K,J̃)}}_{(K,J̃)∈M}, in which H_{(K,J̃)} is defined by

H_{(K,J̃)} = { y ∈ R^q | x ∈ R^p ↦ h_θ(y|x) },     (1.12)

where

h_θ(y|x) = Σ_{k=1}^{K} ( πk det(Pk)/(2π)^{q/2} ) exp( −(1/2)(Pk y − Φk^{[J̃]} x)^t (Pk y − Φk^{[J̃]} x) )

and

θ = (π1, . . . , πK, Φ1, . . . , ΦK, ρ1, . . . , ρK) ∈ ΠK × (R^{q×p})^K × (R_+^q)^K.

The model collection is indexed by M = K × J̃. Denote by K ⊂ N* the possible numbers of components; we can bound K without loss of estimation. Denote also by J̃ a collection of subsets of {1, . . . , q} × {1, . . . , p}, constructed by the Group-Lasso estimator.
To detect the relevant variables, and construct the set J˜ ∈ J˜, we will use the Group-Lasso
estimator defined by (1.7.3). In the ℓ1 -procedures, the choice of the regularization parameters is
often difficult: fixing the number of components K ∈ K, we propose to construct a data-driven
grid GK of regularization parameters by using the updating formulae of the mixture parameters
in the EM algorithm.
Then, for each λ ∈ GK, we can compute the Group-Lasso estimator defined by

θ̂^{Group-Lasso} = argmin_{θ∈ΘK} { −(1/n) Σ_{i=1}^{n} log(h_θ(yi|xi)) + λ Σ_{j=1}^{p} Σ_{z=1}^{q} √K ||[Φ]_{z,j}||_2 }.

For a fixed number of mixture components K ∈ K and a regularization parameter λ, we can use a generalized EM algorithm to approximate this estimator. Then, for each K ∈ K and each λ ∈ GK, we have constructed the set of relevant variables J̃_λ. We denote by J̃ the collection of all these sets.
The second step consists of approximating the MLE

ĥ^{(K,J̃)} = argmin_{t∈H_{(K,J̃)}} { −(1/n) Σ_{i=1}^{n} log(t(yi|xi)) },

using the EM algorithm for each model (K, J̃) ∈ M.
The third step is devoted to model selection. We use the slope heuristic described in [BM07].
Group-Lasso-Rank procedure
As when the relevant variables were selected by the Lasso estimator, and since the previous procedure does not take into account the multivariate structure, we propose a second procedure to deal with this point. For each model belonging to the collection H_{(K,J̃)}, a subcollection is constructed, varying the rank of Φ. Let us describe this procedure.
As in the Group-Lasso-MLE procedure, we first construct a collection of models, thanks to the ℓ1-approach. We obtain an estimator of θ, denoted by θ̂^{Group-Lasso}, for each model belonging to the collection. We can deduce the set of relevant variables, denoted by J̃, and this for all K ∈ K: we deduce J̃, the collection of sets of relevant variables.
The second step consists in constructing a subcollection of models with rank sparsity, denoted by {Ȟ_{(K,J̃,R)}}_{(K,J̃,R)∈M̃_R}.
The model Ȟ_{(K,J̃,R)} has K components, the set J̃ of active variables, and R is the vector of the ranks of the matrices of regression coefficients in each group:

Ȟ_{(K,J̃,R)} = { y ∈ R^q | x ∈ R^p ↦ h_θ^{(K,J̃,R)}(y|x) },     (1.13)

where

h_θ^{(K,J̃,R)}(y|x) = Σ_{k=1}^{K} ( πk det(Pk)/(2π)^{q/2} ) exp( −(1/2)(Pk y − (Φk^{Rk})^{[J̃]} x)^t (Pk y − (Φk^{Rk})^{[J̃]} x) );
θ = (π1, . . . , πK, Φ1^{R1}, . . . , ΦK^{RK}, P1, . . . , PK) ∈ ΠK × Ψ_K^R × (R_+^q)^K;
Ψ_K^R = { (Φ1^{R1}, . . . , ΦK^{RK}) ∈ (R^{q×p})^K | Rank(Φ1^{R1}) = R1, . . . , Rank(ΦK^{RK}) = RK };

and M̃_R = K × J̃ × R. Denote by K ⊂ N* the possible numbers of components, by J̃ a collection of subsets of {1, . . . , q} × {1, . . . , p}, and by R the set of vectors of size K ∈ K containing the rank of each mean matrix. We can compute the MLE under the rank constraint thanks to an EM algorithm: the estimation of Φk, for all k, is constrained to have rank Rk by keeping only the Rk largest singular values. More details are given in Section 1.7.1. This leads, for each model, to an estimator of the mean with row sparsity and low rank.
Chapter 2
An ℓ1-oracle inequality for the Lasso in finite mixture of multivariate Gaussian regression models

Contents
2.1 Introduction
2.2 Notations and framework
    2.2.1 Finite mixture regression model
    2.2.2 Boundedness assumption on the mixture and component parameters
    2.2.3 Maximum likelihood estimator and penalization
2.3 Oracle inequality
2.4 Proof of the oracle inequality
    2.4.1 Main propositions used in this proof
    2.4.2 Notations
    2.4.3 Proof of the Theorem 2.4.1 thanks to the Propositions 2.4.2 and 2.4.3
    2.4.4 Proof of the Theorem 2.3.1
2.5 Proof of the theorem according to T or T^c
    2.5.1 Proof of the Proposition 2.4.2
    2.5.2 Proof of the Proposition 2.4.3
2.6 Some details
    2.6.1 Proof of the Lemma 2.5.1
    2.6.2 Lemma 2.6.5 and Lemma 4.15
This chapter focuses on the Lasso estimator for its regularization properties. We consider a finite mixture of Gaussian regressions for high-dimensional heterogeneous data, where the number of covariates and the dimension of the response may be much larger than the sample size. We estimate the unknown conditional density by an ℓ1-penalized maximum likelihood estimator. We provide an ℓ1-oracle inequality satisfied by this Lasso estimator according to the Kullback-Leibler loss. This result is an extension of the ℓ1-oracle inequality established by Meynet in [Mey13] to the multivariate case. It is deduced from a model selection theorem, the Lasso being viewed as the solution of a penalized maximum likelihood model selection procedure over a collection of ℓ1-ball models.

2.1 Introduction

Finite mixture regression models are useful for modeling the relationship between a response and predictors arising from different subpopulations. Due to recent improvements, we are faced with high-dimensional data where the number of variables can be much larger than the sample size, and we have to reduce the dimension to avoid identifiability problems. Considering a mixture of linear models, a widely used assumption is that only a few covariates explain the response. Among various methods, we focus on the ℓ1-penalized least squares estimator of the parameters, which leads to sparse regression matrices: it is a convex surrogate for the nonconvex ℓ0-penalization, and produces sparse solutions. First introduced by Tibshirani in [Tib96] for the linear model Y = Xβ + ε, where X ∈ R^p, Y ∈ R, and ε ∼ N(0, Σ), the Lasso estimator is defined in the linear model by

β̂^{Lasso}(λ) = argmin_{β∈R^p} { ||Y − Xβ||²_2 + λ||β||_1 },    λ > 0.

Many results have been proved to study the performance of this estimator; see for example [BRT09] and [EHJT04], which study it as a variable selection procedure in the linear model case. Note that those results need strong assumptions on the Gram matrix X^t X, such as the restricted eigenvalue condition, which may not be fulfilled in practice. A summary of assumptions and results is given by Bühlmann and van de Geer in [vdGB09]. One can also cite van de Geer in [vdGBZ11] and the associated discussions, where a chaining argument is made precise to obtain rates, even in a nonlinear case.
If we assume that the observations (xi, yi)_{1≤i≤n} arise from different subpopulations, we can work with finite mixture regression models: the homogeneity assumption of the linear model is often inadequate and restrictive. This model was introduced by Städler et al. in [SBG10]. They assume that, for i ∈ {1, . . . , n}, the observation yi, conditionally to Xi = xi, comes from a conditional density s_{ξ0}(.|xi) which is a finite mixture of K Gaussian conditional densities with proportion vector π, where

Yi|Xi = xi ∼ s_{ξ0}(yi|xi) = Σ_{k=1}^{K} ( πk^0/(√(2π) σk^0) ) exp( −(yi − βk^0 xi)²/(2(σk^0)²) )

for some parameters ξ^0 = (πk^0, βk^0, σk^0)_{1≤k≤K}. They extend the Lasso estimator by

ŝ^{Lasso}(λ) = argmin_{s_ξ^K} { −(1/n) Σ_{i=1}^{n} log(s_ξ^K(yi|xi)) + λ Σ_{k=1}^{K} πk Σ_{j=1}^{p} |σk^{−1} [βk]_j| },    λ > 0.     (2.1)

For this estimator, they provide an ℓ0-oracle inequality satisfied by ŝ^{Lasso}(λ), under the restricted eigenvalue condition and margin conditions, which link the Kullback-Leibler loss function to the ℓ2-norm of the parameters.
Another way to study this estimator is to consider the Lasso for its ℓ1-regularization properties; see for example [MM11a], [Mey13], and [RT11]. Contrary to the ℓ0-results, some ℓ1-results are valid without any assumption, neither on the Gram matrix nor on the margin. This can be achieved because the targeted rate of convergence is of order 1/√n rather than 1/n. For finite mixture Gaussian regression models, we can cite Meynet [Mey13], who gives an ℓ1-oracle inequality for the Lasso estimator (2.1).
In this chapter, we extend this result to finite mixtures of multivariate Gaussian regression models. We work with random multivariate variables (X, Y) ∈ R^p × R^q. As in [Mey13], we restrict ourselves to the fixed design case, that is to say non-random regressors. We observe (xi)_{1≤i≤n}; without loss of generality, we can assume that the regressors satisfy xi ∈ [0, 1]^p for all i ∈ {1, . . . , n}. Under a boundedness assumption on the parameters only, we provide a lower bound on the Lasso regularization parameter λ which guarantees such an oracle inequality.
This result is non-asymptotic: the number of observations is fixed, while the number p of covariates can grow. Remark that the number K of clusters in the mixture is assumed to be known.
Our result is deduced from a model selection theorem for finite mixtures of multivariate Gaussian regressions, for ℓ1-penalized maximum likelihood conditional density estimation. We establish the general theorem following the one of Meynet in [Mey13], which combines Vapnik's structural risk minimization method (see Vapnik [Vap82]) and the theory around model selection (see Le Pennec and Cohen [CLP11] and Massart [Mas07]). As in Massart and Meynet [MM11a], our oracle inequality is deduced from this general theorem, the Lasso being viewed as the solution of a penalized maximum likelihood model selection procedure over a countable collection of ℓ1-ball models.
The chapter is organized as follows. The model and the framework are introduced in Section 2.2. In Section 2.3, we state the main result of the chapter, which is an ℓ1-oracle inequality satisfied by the Lasso in finite mixtures of multivariate Gaussian regression models. Section 2.4 is devoted to the proof of this result and of the general theorem, which derives from two simpler propositions. Those propositions are proved in Section 2.5, whereas details of the lemmas are given in Section 2.6.

2.2 Notations and framework

2.2.1 Finite mixture regression model
We observe n independent couples (x, y) = (xi, yi)_{1≤i≤n} ∈ ([0, 1]^p × R^q)^n, with yi ∈ R^q a random observation, realization of the variable Yi, and xi ∈ [0, 1]^p fixed, for all i ∈ {1, . . . , n}. We assume that, conditionally to the xi's, the Yi's are independent and identically distributed with conditional density s_{ξ0}(.|xi), which is a finite mixture of K Gaussian regressions with unknown parameters ξ^0. In this chapter, K is fixed, so we do not mention it among the unknown parameters. We estimate the unknown conditional density by a finite mixture of K Gaussian regressions; each subpopulation is then estimated by a multivariate linear model. Let us detail the conditional density. For all y ∈ R^q, for all x ∈ [0, 1]^p,

s_ξ(y|x) = Σ_{k=1}^{K} ( πk/((2π)^{q/2} det(Σk)^{1/2}) ) exp( −(y − βk x)^t Σk^{−1} (y − βk x)/2 ),     (2.2)
ξ = (π1, . . . , πK, β1, . . . , βK, Σ1, . . . , ΣK) ∈ Ξ = ΠK × (R^{q×p})^K × (S_q^{++})^K,
ΠK = { (π1, . . . , πK); πk > 0 for k ∈ {1, . . . , K} and Σ_{k=1}^{K} πk = 1 },

where S_q^{++} is the set of symmetric positive definite matrices on R^q.
We want to estimate ξ^0 from the observations. For all k ∈ {1, . . . , K}, βk is the matrix of regression coefficients, and Σk is the covariance matrix in the mixture component k, whereas the πk's are the mixture proportions. For x ∈ [0, 1]^p, we define the parameter ξ(x) of the conditional density s_ξ(.|x) by

ξ(x) = (π1, . . . , πK, β1 x, . . . , βK x, Σ1, . . . , ΣK) ∈ R^K × (R^q)^K × (S_q^{++})^K.

For all k ∈ {1, . . . , K}, for all x ∈ [0, 1]^p, for all z ∈ {1, . . . , q}, [βk x]_z = Σ_{j=1}^{p} [βk]_{z,j} [x]_j, and then βk x ∈ R^q is the mean vector of the mixture component k for the conditional density s_ξ(.|x).

2.2.2 Boundedness assumption on the mixture and component parameters

Denote, for a matrix A, by m(A) the modulus of the smallest eigenvalue of A, and by M(A) the modulus of the largest eigenvalue of A. We restrict our study to bounded parameter vectors ξ = (π, β, Σ), where π = (π1, . . . , πK), β = (β1, . . . , βK), Σ = (Σ1, . . . , ΣK). Specifically, we assume that there exist deterministic positive constants Aβ, aΣ, AΣ, aπ such that ξ belongs to Ξ̃, with

Ξ̃ = { ξ ∈ Ξ : for all k ∈ {1, . . . , K}, max_{z∈{1,...,q}} sup_{x∈[0,1]^p} |[βk x]_z| ≤ Aβ, aΣ ≤ m(Σk^{−1}) ≤ M(Σk^{−1}) ≤ AΣ, aπ ≤ πk }.     (2.3)

Let S be the set of conditional densities s_ξ,

S = { s_ξ, ξ ∈ Ξ̃ }.

2.2.3 Maximum likelihood estimator and penalization

In a maximum likelihood approach, we consider the Kullback-Leibler information as the loss function, which is defined for two densities s and t by

KL(s, t) = ∫_{R^q} log( s(y)/t(y) ) s(y) dy   if s dy << t dy,   and KL(s, t) = +∞ otherwise.     (2.4)

In a regression framework, we have to adapt this definition to take into account the structure of conditional densities. For the fixed covariates (x1, . . . , xn), we consider the average loss function

KLn(s, t) = (1/n) Σ_{i=1}^{n} KL(s(.|xi), t(.|xi)) = (1/n) Σ_{i=1}^{n} ∫_{R^q} log( s(y|xi)/t(y|xi) ) s(y|xi) dy.
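Since the Kullback-Leibler divergence between Gaussian mixtures has no closed form, KLn can be approximated by Monte Carlo; the sketch below is a minimal Python illustration (the three callables s_pdf, t_pdf and s_sampler are hypothetical interfaces, not objects defined in this chapter).

import numpy as np

def kl_n(s_pdf, t_pdf, s_sampler, xs, n_mc=10_000, seed=0):
    # Monte Carlo approximation of KL_n(s, t) = (1/n) sum_i KL(s(.|x_i), t(.|x_i)).
    # s_pdf(y, x), t_pdf(y, x): conditional densities evaluated row-wise on y of shape (m, q);
    # s_sampler(x, m, rng): draws m responses from s(.|x).
    rng = np.random.default_rng(seed)
    vals = []
    for x in xs:
        y = s_sampler(x, n_mc, rng)
        vals.append(np.mean(np.log(s_pdf(y, x)) - np.log(t_pdf(y, x))))
    return float(np.mean(vals))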
Using the maximum likelihood approach, we want to estimate s_{ξ0} by the conditional density s_ξ which maximizes the likelihood conditionally to (xi)_{1≤i≤n}. Nevertheless, because we work with high-dimensional data, we have to regularize the maximum likelihood estimator. We consider the ℓ1-regularization and the associated generalized Lasso estimator, which we define by

ŝ^{Lasso}(λ) := argmin_{s_ξ∈S} { −(1/n) Σ_{i=1}^{n} log(s_ξ(yi|xi)) + λ Σ_{k=1}^{K} Σ_{z=1}^{q} Σ_{j=1}^{p} |[βk]_{z,j}| },

where λ > 0 is a regularization parameter, for ξ = (π, β, Σ).
We also define, for s_ξ defined as in (2.2) and with parameters ξ = (π, β, Σ),

N1^{[2]}(s_ξ) = ||β||_1 = Σ_{k=1}^{K} Σ_{j=1}^{p} Σ_{z=1}^{q} |[βk]_{z,j}|.     (2.5)

2.3 Oracle inequality

In this section, we provide an ℓ1-oracle inequality satisfied by the Lasso estimator in finite mixtures of multivariate Gaussian regression models, which is the main result of this chapter.
Theorem 2.3.1. We observe n couples (x, y) = ((x1, y1), . . . , (xn, yn)) ∈ ([0, 1]^p × R^q)^n coming from the conditional density s_{ξ0}, where ξ^0 ∈ Ξ̃, with

Ξ̃ = { ξ ∈ Ξ : for all k ∈ {1, . . . , K}, max_{z∈{1,...,q}} sup_{x∈[0,1]^p} |[βk x]_z| ≤ Aβ, aΣ ≤ m(Σk^{−1}) ≤ M(Σk^{−1}) ≤ AΣ, aπ ≤ πk }.

Denote by a ∨ b = max(a, b).
We define the Lasso estimator, denoted by ŝ^{Lasso}(λ), for λ ≥ 0, by

ŝ^{Lasso}(λ) = argmin_{s_ξ∈S} { −(1/n) Σ_{i=1}^{n} log(s_ξ(yi|xi)) + λ N1^{[2]}(s_ξ) },     (2.6)

with

S = { s_ξ, ξ ∈ Ξ̃ }

and where, for ξ = (π, β, Σ),

N1^{[2]}(s_ξ) = ||β||_1 = Σ_{k=1}^{K} Σ_{j=1}^{p} Σ_{z=1}^{q} |[βk]_{z,j}|.
Then, if

λ ≥ κ (AΣ ∨ 1/aπ) (1 + 4(q + 1)AΣ Aβ² + q/aΣ) √(K/n) log(n) (1 + q log(n)) √(K log(2p + 1)),

with κ an absolute positive constant, the estimator (2.6) satisfies the following ℓ1-oracle inequality:

E[KLn(s_{ξ0}, ŝ^{Lasso}(λ))] ≤ (1 + κ^{−1}) inf_{s_ξ∈S} { KLn(s_{ξ0}, s_ξ) + λ N1^{[2]}(s_ξ) } + λ
    + κ′ √(K/n) ( e^{−1/2} π^{q/2} aπ √(2q) / AΣ^{q/2} )
    + κ′ √(K/n) log(n) (AΣ ∨ 1/aπ) (1 + 4(q + 1)AΣ Aβ² + q/aΣ) × K (1 + Aβ + q/aΣ)²,

where κ′ is a positive constant.
This theorem provides information about the performance of the Lasso as an ℓ1-regularization algorithm. If the regularization parameter λ is properly chosen, the Lasso estimator, which is the solution of the ℓ1-penalized empirical risk minimization problem, behaves as well as the deterministic Lasso, which is the solution of the ℓ1-penalized true risk minimization problem, up to an error term of order λ.
Our result is non-asymptotic: the number n of observations is fixed while the number p of covariates and the size q of the response can grow with respect to n and can be much larger than n. The number K of clusters in the mixture is fixed.
There is no assumption on the Gram matrix, nor on the margin, which are classical assumptions for oracle inequalities for the Lasso estimator. Moreover, this kind of assumptions involves unknown constants, whereas here all constants are explicit. We can compare this result with the ℓ0-oracle inequality established in [SBG10], which needs those assumptions, and is therefore difficult to interpret. Nevertheless, they get a faster rate, the error term in their oracle inequality being of order 1/n rather than 1/√n.
The main assumption we make to establish Theorem 2.3.1 is the boundedness of the parameters, which is also assumed in [SBG10]. It is needed to tackle the problem of the unboundedness of the parameter space (see [MP04] for example).
Moreover, we let the regressors belong to [0, 1]^p. Because we work with fixed covariates, they are finite; to simplify the reading, we choose to rescale x to get ||x||∞ ≤ 1. If we do not rescale the covariates, the bound on the regularization parameter λ and the error term of the oracle inequality depend linearly on ||x||∞.
The bound on the regularization parameter λ is of order (q² + q)/√n · log(n)² · √(log(2p + 1)). For q = 1, we recognize the same order, with respect to the sample size n and the number of covariates p, as in the ℓ1-oracle inequality of [Mey13].
Van de Geer, in [vdGBZ11], gives some tools to improve the bound on the regularization parameter to √(log(p)/n). Nevertheless, we would have to control the eigenvalues of the Gram matrix of some functions (ψj(xi))_{1≤j≤D, 1≤i≤n}, D being the number of parameters to estimate, where ψj(xi) satisfies

|log(s_ξ(yi|xi)) − log(s_{ξ̃}(yi|xi))| ≤ Σ_{j=1}^{D} |ξj − ξ̃j| ψj(xi).

In our case of mixture regression models, controlling the eigenvalues of the Gram matrix of the functions (ψj(xi))_{1≤j≤D, 1≤i≤n} corresponds to making some assumptions, such as the REC, to avoid a dependence of the bound on the dimensions n, K and p. Without this kind of assumptions, we cannot guarantee that our bound is of order √(log(p)/n), because we cannot guarantee that the eigenvalues do not depend on the dimensions. In
81
CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE
MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS
order to get a result with smaller assumptions, we do not use the chaining argument developed
in [vdGBZ11]. Nevertheless, one can easily compute that, under restricted
eigenvalue condition,
q
log(p)
n
we could perform the order of the regularization parameter to λ ≍
2.4
2.4.1
log(n).
Proof of the oracle inequality
Main propositions used in this proof
The first result we will prove is the next theorem, which is an ℓ1 -ball mixture multivariate
regression model selection theorem for ℓ1 -penalized maximum likelihood conditional density
estimation in the Gaussian framework.
Theorem 2.4.1. We observe (xi , yi )1≤i≤n with unknown conditional Gaussian mixture density
sξ 0 .
[2]
For all m ∈ N∗ , we consider the ℓ1 -ball Sm = {sξ ∈ S, N1 (sξ ) ≤ m} for S = {sξ , ξ ∈ Ξ̃}, and
Ξ̃ defined by
(
Ξ̃ =
ξ ∈ Ξ : for all k ∈ {1, . . . , K}, max
sup |[βk x]z | ≤ Aβ ,
z∈{1,...,q} x∈[0,1]p
aΣ ≤
m(||Σk−1 ||)
≤
M (Σ−1
k )
≤ AΣ , aπ ≤ πk .
For ξ = (π, β, Σ), let
[2]
N1 (sξ )
= ||β||1 =
p X
q
K X
X
k=1 j=1 z=1
Let ŝm an ηm -log-likelihood minimizer in Sm , for ηm ≥ 0:
n
1X
log(ŝm (yi |xi )) ≤ inf
−
sm ∈Sm
n
i=1
n
|[βk ]z,j |.
1X
log(sm (yi |xi ))
−
n
i=1
!
+ ηm .
Assume that for all m ∈ N∗ , the penalty function satisfies pen(m) = λm with
r
p
K
log(n)
1
2
1 + 4(q + 1)AΣ Aβ +
1 + q log(n) K log(2p + 1)
λ ≥ κ AΣ ∨
aπ
aΣ
n
for a constant κ. Then, if m̂ is such that
n
1X
−
log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf ∗
m∈N
n
i=1
for η ≥ 0, the estimator ŝm̂ satisfies
−1
n
1X
log(ŝm (yi |xi )) + pen(m)
−
n
i=1
!
+η
inf KLn (sξ0 , sm ) + pen(m) + ηm + η
E(KLn (sξ0 , ŝm̂ )) ≤(1 + κ ) inf ∗
sm ∈Sm
m∈N
r
1
K e− 2 π q/2 p
′
2qaπ
+κ
n Aq/2
Σ
r
4(q + 1)
log(n)
K
1
′
+κ
1+
K AΣ ∨
AΣ A2β +
n
aπ
2
aσ
2
q
× 1 + Aβ +
;
aΣ
82
2.4. PROOF OF THE ORACLE INEQUALITY
′
where κ is a positive constant.
It is an ℓ1 -ball mixture regression model selection theorem for ℓ1 -penalized maximum likelihood
conditional density estimation in the Gaussian framework. Its proof could be deduced from the
two following propositions, which split the result if the variable Y is large enough or not.
Proposition 2.4.2. We observe (xi , yi )1≤i≤n , with unknown conditional density denoted by sξ0 .
Let Mn > 0, and consider the event
T :=
max
max |[Yi ]z | ≤ Mn .
i∈{1,...,n} z∈{1,...,q}
For all m ∈ N∗ , we consider the ℓ1 -ball
[2]
Sm = {sξ ∈ S, N1 (sξ ) ≤ m}
where S = {sξ , ξ ∈ Ξ̃} and
[2]
N1 (sξ )
= ||β||1 =
p X
q
K X
X
k=1 j=1 z=1
|[βk ]z,j |
for ξ = (π, β, Σ).
Let ŝm an ηm -log-likelihood minimizer in Sm , for ηm ≥ 0:
n
1X
−
log(ŝm (yi |xi )) ≤ inf
sm ∈Sm
n
i=1
n
1X
log(sm (yi |xi ))
−
n
i=1
!
+ ηm .
q(|Mn |+Aβ )AΣ
. Assume that for all m ∈ N∗ , the
Let CMn = max a1π , AΣ + 21 (|Mn | + Aβ )2 A2Σ ,
2
penalty function satisfies pen(m) = λm with
p
4CM √
λ ≥ κ √ n K 1 + 9q log(n) K log(2p + 1)
n
for some absolute constant κ. Then, any estimate ŝm̂ with m̂ such that
n
1X
−
log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf ∗
m∈N
n
i=1
n
1X
log(ŝm (yi |xi )) + pen(m)
−
n
i=1
!
+η
for η ≥ 0, satisfies
E(KLn (sξ0 , ŝm̂ )1T ) ≤(1 + κ
−1
) inf ∗
′
+
m∈N
κ K 3/2 qCMn
√
n
inf KLn (sξ0 , sm ) + pen(m) + ηm
!
q 2
;
1 + Aβ +
aΣ
sm ∈Sm
′
where κ is an absolute positive constant.
Proposition 2.4.3. Let sξ0 , T and ŝm̂ defined as in the previous proposition. Then,
E(KLn (sξ0 , ŝm̂ )1T c ) ≤
e−1/2 π q/2 p
2
2Knqaπ e−1/4(Mn −2Mn Aβ )aΣ .
q/2
AΣ
83
CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE
MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS
2.4.2
Notations
To prove those two propositions, and the theorem, begin with some notations.
For any measurable function g : Rq 7→ R , we consider the empirical norm
v
u n
u1 X
g 2 (yi |xi );
gn := t
n
i=1
its conditional expectation
E(g(Y |X)) =
Z
g(y|x)sξ0 (y|x)dy;
Rq
its empirical process
n
1X
g(yi |xi );
Pn (g) :=
n
i=1
and its normalized process
Z
n
1X
g(yi |xi ) −
g(y|xi )sξ0 (y|xi )dy .
νn (g) := Pn (g) − EX (Pn (g)) =
n
Rq
i=1
For all m ∈ N∗ , for all model Sm , we define Fm by
sm
Fm = fm = − log
, s m ∈ Sm .
sξ 0
Let δKL > 0. For all m ∈ N∗ , let ηm ≥ 0. There exist two functions, denoted by ŝm̂ and s̄m ,
belonging to Sm , such that
Pn (− log(ŝm̂ )) ≤ inf Pn (− log(sm )) + ηm ;
sm ∈Sm
Denote by fˆm := − log
set M (m) by
(2.7)
KLn (sξ0 , s̄m ) ≤ inf KLn (sξ0 , sm ) + δKL .
ŝm
sξ 0
sm ∈Sm
and f¯m := − log
s̄m
sξ 0
. Let η ≥ 0 and fix m ∈ N∗ . We define the
M (m) = m′ ∈ N∗ |Pn (− log(ŝm′ )) + pen(m′ ) ≤ Pn (− log(ŝm )) + pen(m) + η .
2.4.3
(2.8)
Proof of the Theorem 2.4.1 thanks to the Propositions 2.4.2 and 2.4.3
Let Mn > 0 and κ ≥ 36. Let CMn = max a1π , AΣ + 21 (|Mn | + Aβ )2 A2Σ , q(|Mn | + Aβ )AΣ /2 .
Assume that, for all m ∈ N∗ , pen(m) = λm, with
r
p
K
λ ≥ κCMn
1 + q log(n) K log(2p + 1) .
n
We derive from the two propositions that there exists κ′ such that, if m̂ satisfies
n
1X
−
log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf ∗
m∈N
n
i=1
n
1X
−
log(ŝm (yi |xi )) + pen(m)
n
i=1
!
+ η;
2.4. PROOF OF THE ORACLE INEQUALITY
84
then ŝm̂ satisfies
E(KLn (sξ0 , ŝm̂ )) = E(KLn (sξ0 , ŝm̂ )1T ) + E(KLn (sξ0 , ŝm̂ )1T c )
−1
inf KLn (sξ0 , sm ) + pen(m) + ηm
≤(1 + κ ) inf ∗
sm ∈Sm
m∈N
q 2
CM
+ κ′ √ n K 3/2 q 1 + (Aβ +
) +η
aΣ
n
e−1/2 π q/2 p
2
+ κ′
2Knqaπ e−1/4(Mn −2Mn Aβ )aΣ .
q/2
AΣ
In order to optimize this equation with respect to Mn , we consider Mn the positive solution of
the polynomial
1
log(n) − (X 2 − 2XAβ )aΣ = 0;
4
q
√
2
and ne−1/4(Mn −2Mn Aβ )aΣ = √1n .
we obtain Mn = Aβ + A2β + 4 log(n)
aΣ
On the other hand,
q+1
1
2
1+
AΣ (Mn + Aβ )
C M n ≤ AΣ ∨
aπ
2
1
log(n)
≤ AΣ ∨
1 + 4(q + 1)AΣ A2β +
.
aπ
aΣ
We get
(sξ0 , ŝm̂ )1T ) + E(KLn (sξ0 , ŝm̂ )1T c )
−1
inf KLn (sξ0 , sm ) + pen(m) + ηm + η
≤ (1 + κ ) inf ∗
E(KLn (sξ0 , ŝm̂ )) =
E(KLn
sm ∈Sm
m∈N
+κ′
+κ′
2.4.4
q
q
K
n
K
n
− 12
p
e
2qaπ
q/2
(qAΣ )
log(n)
1
2
1 + 4(q + 1)AΣ Aβ +
AΣ ∨
aπ
aΣ
2 !
q
×K 1 + Aβ +
.
aΣ
π q/2
Proof of the Theorem 2.3.1
We will show that there exists ηm ≥ 0, and η ≥ 0 such that ŝLasso (λ) satisfies the hypothesis of
the Theorem 2.4.1, which will lead to Theorem 2.3.1.
First, let show that there exists m ∈ N∗ and ηm ≥ 0 such that the Lasso estimator is an
ηm -log-likelihood minimizer in Sm .
[2]
For all λ ≥ 0, if mλ = ⌈N1 (ŝ(λ))⌉,
!
n
1X
Lasso
log(s(yi |xi )) .
ŝ
(λ) = argmin −
n
s∈S
[2]
N1 (s)≤mλ
We could take ηm = 0.
Secondly, let show that there exists η ≥ 0 such that
i=1
85
CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE
MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS
n
1X
log(ŝLasso (λ)(yi |xi )) + pen(mλ ) ≤ inf ∗
−
m∈N
n
i=1
n
1X
log(ŝm (yi |xi )) + pen(m)
−
n
i=1
!
+ η.
Taking pen(mλ ) = λmλ ,
n
−
n
1X
1X
log(ŝLasso (λ)(yi |xi )) + pen(mλ ) = −
log(ŝLasso (λ)(yi |xi )) + λmλ
n
n
i=1
i=1
n
1X
[2]
≤−
log(ŝLasso (λ)(yi |xi )) + λN1 (ŝLasso (λ)) + λ
n
i=1
)
(
n
1X
[2]
log(sξ (yi |xi )) + λN1 (sξ ) + λ
≤ inf −
sξ ∈S
n
i=1
)
(
n
1X
[2]
log(sξ (yi |xi )) + λN1 (sξ ) + λ
≤ inf ∗ inf
−
m∈N sξ ∈Sm
n
i=1
(
)
!
n
1X
≤ inf ∗
inf
−
log(sξ (yi |xi )) + λm + λ
sξ ∈Sm
m∈N
n
i=1
!
n
X
1
≤ inf ∗ −
log(ŝm (yi |xi )) + λm + λ.
m∈N
n
i=1
which is exactly the goal, with η = λ. Then, according to the Theorem 2.4.1, with m̂ = mλ ,
and ŝm̂ = ŝLasso (λ), for
r
p
log(n)
1
K
1 + 4(q + 1)AΣ A2β +
1 + q log(n) K log(2p + 1) ,
λ ≥ κ AΣ ∨
aπ
aΣ
n
we get the oracle inequality.
2.5
2.5.1
Proof of the theorem according to T or T c
Proof of the Proposition 2.4.2
This proposition corresponds to the main theorem according to the event T . To prove it, we
need some preliminary results.
From our notations, reminded in Section 2.4.2, we have, for all m ∈ N∗ for all m′ ∈ M (m),
Pn (fˆm′ ) + pen(m′ ) ≤ Pn (fˆm ) + pen(m) + η ≤ Pn (f¯m ) + pen(m) + ηm + η;
E(Pn (fˆm′ )) + pen(m′ ) ≤ E(Pn (f¯m )) + pen(m) + ηm + η + νn (f¯m ) − νn (fˆm′ );
KLn (sξ0 , ŝm′ ) + pen(m′ ) ≤ inf KLn (sξ0 , sm ) + δKL + pen(m) + ηm + η + νn (f¯m ) − νn (fˆm′ );
sm ∈Sm
thanks to the inequality (2.7).
The goal is to bound −νn (fˆm′ ) = νn (−fˆm′ ).
To control this term, we use the following lemma.
(2.9)
2.5. PROOF OF THE THEOREM ACCORDING TO T OR T C
86
Lemme 2.5.1. Let Mn > 0. Let
max |[Yi ]z | ≤ Mn .
T =
max
i∈{1,...,n}
Let CMn = max
1
aπ , AΣ
z∈{1,...,q}
+ 21 (|Mn | + Aβ )2 A2Σ ,
∆m′ = m′ log(n)
q(|Mn |+Aβ )AΣ
2
and
p
q
.
K log(2p + 1) + 6 1 + K Aβ +
aΣ
Then, on the event T , for all m′ ∈ N∗ , for all t > 0, with probability greater than 1 − e−t ,
√
√ √
4CMn
q
9 Kq∆m′ + 2 t 1 + K Aβ +
sup |νn (−fm′ )| ≤ √
aΣ
n
fm′ ∈Fm′
Proof. Page 90
N
From (2.9), on the event T , for all m ∈ N∗ , for all m′ ∈ M (m), for all t > 0, with probability
greater than 1 − e−t ,
KLn (sξ0 , ŝm′ ) + pen(m′ ) ≤ inf KLn (sξ0 , sm ) + δKL + pen(m) + νn (f¯m )
sm ∈Sm
√
√ √
4CMn
q
+ √
9 Kq∆m′ + 2 t 1 + K Aβ +
+ ηm + η
aΣ
n
≤ inf KLn (sξ0 , sm ) + pen(m) + νn (f¯m )
sm ∈Sm
!
2
√
√
1
q
C Mn
+ Kt
9 Kq∆m′ + √
1 + K Aβ +
+4 √
aΣ
n
2 K
+ ηm + η + δKL ,
the last inequality being true because 2ab ≤
√1 a2 +
K
√
Kb2 . Let z > 0 such that t = z + m + m′ .
′
On the event T , for all m ∈ N, for all m′ ∈ M (m), with probability greater than 1 − e−(z+m+m ) ,
KLn (sξ0 , ŝm′ ) + pen(m′ ) ≤ inf KLn (sξ0 , sm ) + pen(m) + νn (f¯m )
sm ∈Sm
CM
+4 √ n
n
√
1
9 Kq∆m′ + √
2 K
√
′
K(z + m + m )
CM
+4 √ n
n
+ ηm + η + δKL .
q
1 + K Aβ +
aΣ
2 !
CM √
KLn (sξ0 , ŝm′ ) − νn (f¯m ) ≤ inf KLn (sξ0 , sm ) + pen(m) + 4 √ n Km
sm ∈Sm
n
√
4CMn
′
′
′
K(m + 9q∆m ) − pen(m )
+ √
n
!
2
√
1
4CMn
q
√
+ Kz + ηm + η + δKL .
+ √
1 + K Aβ +
aΣ
n
2 K
Let κ ≥ 1, and assume that pen(m) = λm with
p
4CM √
λ ≥ √ n K 1 + 9q log(n) K log(2p + 1)
n
87
CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE
MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS
Then, as
∆
with
m′
p
q
,
= m log(n) K log(2p + 1) + 6 1 + K Aβ +
aΣ
′
4CM √ 1
1
p
κ−1 = √ n K ≤
,
λ
n
1 + 9q log(n) K log(2p + 1)
we get that
KLn (sξ0 , ŝm′ ) − νn (f¯m ) ≤ inf KLn (sξ0 , sm ) + (1 + κ−1 ) pen(m)
sm ∈Sm
2
4CMn 1
q
√
+ √
1 + K Aβ +
aΣ
n 2 K
√
√
4CM
q
+ √ n 54 Kq 1 + K Aβ +
+ Kz
aΣ
n
+ η + δKL + ηm
≤ inf KLn (sξ0 , sm ) + (1 + κ−1 ) pen(m)
sm ∈Sm
27 + 1/2
27K 3/2 + √
K
4CM
+ √ n
n
q
1 + K Aβ +
aΣ
2
+
√
Kz
!
+ ηm + η + δKL .
Let m̂ such that
n
1X
log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf ∗
−
m∈N
n
i=1
n
1X
−
log(ŝm (yi |xi )) + pen(m)
n
i=1
!
+ η;
and M (m) = {m′ ∈ N∗ |Pn (− log(ŝm′ )) + pen(m′ ) ≤ Pn (− log(ŝm )) + pen(m) + η} . By definition, m̂ ∈ M (m). Because for all m ∈ N∗ , for all m′ ∈ M (m),
X
X
′
′
e−(z+m+m ) ≥ 1 − e−z
e−m−m ≥ 1 − e−z ,
1−
m∈N∗
m′ ∈M (m)
(m,m′ )∈(N∗ )2
we could sum up over all models.
On the event T , for all z > 0, with probability greater than 1 − e−z ,
inf KLn (sξ0 , sm ) + (1 + κ−1 ) pen(m) + ηm
KLn (sξ0 , ŝm̂ ) − νn (f¯m ) ≤ inf ∗
m∈N
sm ∈Sm
4CM
+ √ n
n
55q
27K 3/2 + √
2 K
q
1 + K Aβ +
aΣ
2
+
√
Kz
!
+ η + δKL .
By integrating over z > 0, and noticing that E(νn (f¯m )) = 0 and that δKL can be chosen arbitrary
2.5. PROOF OF THE THEOREM ACCORDING TO T OR T C
88
small, we get
E(KLn (sξ0 , ŝm̂ )1T ) ≤ inf ∗
m∈N
inf KLn (sξ0 , sm ) + (1 + κ
sm ∈Sm
−1
) pen(m) + ηm
!
2
√
4CMn
55
q
q
+ K +η
27K 3/2 + √
1 + K Aβ +
+ √
aΣ
n
K 2
−1
≤ inf ∗
inf KLn (sξ0 , sm ) + (1 + κ ) pen(m) + ηm
sm ∈Sm
m∈N
!
q 2
332K 3/2 qCMn
√
1 + Aβ +
+
+ η.
aΣ
n
2.5.2
Proof of the Proposition 2.4.3
We want an upper bound of E KLn (sξ0 , ŝm̂ )1T c . Thanks to the Cauchy-Schwarz inequality,
E KLn (sξ0 , ŝm̂ )1T
However, for all sξ ∈ S,
n
1X
KLn (sξ0 , sξ ) =
n
1
=
n
≤−
Z
log
q
i=1 R
Z
n
X
Rq
i=1
Z
n
X
1
n
i=1
Rq
c
≤
sξ0 (y|xi )
sξ (y|xi )
q
E(KL2n (sξ0 , ŝm̂ ))
p
P (T c ).
sξ0 (y|xi )dy
log(sξ0 (y|xi ))sξ0 (y|xi )dy −
Z
Rq
log(sξ (y|xi ))sξ0 (y|xi )dy
log(sξ (y|xi ))sξ0 (y|xi )dy.
Because parameters are assumed to be bounded, according to the set (2.3), we get, with
(β 0 , Σ0 , π 0 ) the parameters of sξ0 and (β, Σ, π) the parameters of sξ ,
!!
(y − βk xi )t Σ−1
(y
−
β
x
)
πk
k i
k
p
exp −
log(sξ (y|xi ))sξ0 (y|xi ) = log
q/2 det(Σ )
2
(2π)
k
k=1
K
X
(y − βk0 xi )t (Σ0k )−1 (y − βk0 xi )
πk0
q
×
exp −
2
q/2 det(Σ0 )
k=1 (2π)
k
q
)
aπ det(Σ−1
1
t t −1
≥ log K
exp −(y t Σ−1
1 y + x i β 1 Σ1 β 1 x i )
q/2
(2π)
p
aπ det((Σ01 )−1 )
t t −1
exp −(y t Σ−1
×K
1 y + x i β 1 Σ1 β 1 x i )
q/2
(2π)
!
q/2
aπ aΣ
exp −(y t y + A2β )AΣ
≥ log K
(2π)q/2
K
X
q/2
×K
aπ aΣ
exp −(y t y + A2β )AΣ .
q/2
(2π)
89
CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE
MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS
Indeed, for u ∈ Rq , if we use the eigenvalue decomposition of Σ = P t DP ,
|ut Σu| = |ut P t DP u| ≤ ||P u||2 ||DP U ||2 ≤ M (D)||P u||22
≤ M (D)||u||22 ≤ AΣ ||u||22 .
√
To recognize the expectation of a Gaussian standardized variables, we put u = 2AΣ y:
# −ut u
"
!
2
q/2
q/2 Z
tu
KaΣ aπ
Kaπ e−Aβ AΣ aΣ
e 2
u
log
KL(sξ0 (.|xi ), sξ (.|xi )) ≤ −
du
− A2β AΣ −
q/2
q/2
2 (2π)q/2
(2AΣ )
(2π)
Rq
#
"
!
2
q/2
q/2
Kaπ aΣ
aΣ Kaπ e−Aβ AΣ
U2
2
E log
≤−
− Aβ AΣ −
2
(2AΣ )q/2
(2π)q/2
#
"
!
2
q/2
q/2
Kaπ aΣ
KaΣ aπ e−Aβ AΣ
1
≤−
log
− A2β AΣ −
q/2
q/2
2
(2AΣ )
(2π)
!
2
2
q/2
q/2
KaΣ aπ e−Aβ AΣ −1/2 1/2 q/2
Kaπ e−Aβ AΣ −1/2 aΣ
e π log
≤−
q/2
(2π)q/2
(2π)q/2 A
Σ
≤
e−1/2 π q/2
q/2
AΣ
;
where U ∼ Nq (0, 1). We have used that for all t ∈ R, t log(t) ≥ −e−1 . Then, we get, for all
sξ ∈ S,
n
e−1/2 π q/2
1X
KL(sξ0 (.|xi ), sξ (.|xi )) ≤
.
KLn (sξ0 , sξ ) ≤
q/2
n
A
i=1
Σ
As it is true for all sξ ∈ S, it is true for ŝm̂ , then
q
E(KL2n (sξ0 , ŝm̂ )) ≤
e−1/2 π q/2
q/2
AΣ
.
For the last step, we need to bound P (T c ).
c
c
P (T ) = E(1T c ) = E(EX (1T c )) = E(PX (T )) ≤ E
Nevertheless, let Yx ∼
P (||Yx ||∞
PK
k=1 πk Nq (βk x, Σk ),
Z
n
X
i=1
!
PX (||Yi ||∞ > Mn ) .
then,
K
X
(y−β x)t Σ−1 (y−β x)
k
k
k
−
1
2
p
e
dy
> Mn ) =
1{||Yx ||∞ ≥Mn }
πk
q/2 det(Σ )
(2π)
Rq
k
k=1
Z
(y−βk x)t Σ−1 (y−βk x)
K
k
X
−
1
2
p
=
dy
1{||Yx ||∞ ≥Mn }
e
πk
q/2 det(Σ )
q
(2π)
R
k
k=1
=
K
X
k=1
πk PX (||Yxk ||∞
> Mn ) ≤
q
K X
X
k=1 z=1
with Yxk ∼ N (βk x, Σk ) and [Yxk ]z ∼ N ([βk x]z , [Σk ]z,z ).
πk PX (|[Yxk ]z | > Mn )
90
2.6. SOME DETAILS
We need to control PX (|[Yxk ]z | > Mn ), for all z ∈ {1, . . . , q}.
PX (|[Yxk ]z | > Mn ) = PX ([Yxk ]z > Mn ) + PX ([Yx,k ]z < −Mn )
!
!
−Mn − [βk x]z
Mn − [βk x]z
p
+ PX U <
= PX U > p
[Σk ]z,z
[Σk ]z,z
!
!
Mn − [βk x]z
Mn + [βk x]z
+ PX U > p
= PX U > p
[Σk ]z,z
[Σk ]z,z
≤e
− 12
≤ 2e
≤ 2e
Mn −[βk x]z
√
[Σk ]z,z
2
+e
− 12
Mn +[βk x]z
√
2
− 21
− 21
2 −2M |[β x] |+|[β x] |2
Mn
n
k z
k z
[Σk ]z,z
Mn −|[βk x]z |
√
[Σk ]z,z
[Σk ]z,z
2
.
where U ∼ N (0, 1). Then,
1
2
P (||Yx ||∞ > Mn ) ≤ 2Kqe− 2 (Mn −2Mn Aβ )aΣ ,
P
1
2
n
− 21 (Mn2 −2Mn Aβ )aΣ
≤ 2Knaπ qe− 2 (Mn −2Mn Aβ )aΣ . We have
and we get P (T c ) ≤ E
i=1 2Kqaπ e
obtained the wanted bound for E(KLn (sξ0 , ŝm̂ )1T c ).
2.6
2.6.1
Some details
Proof of the Lemma 2.5.1
First, give some tools
prove the Lemma 2.5.1.
q to
1 Pn
We define ||g||n = n i=1 g 2 (yi |xi ) for any measurable function g.
Let m ∈ N∗ . We have
n
sup |νn (−fm )| = sup
fm ∈Fm
fm ∈Fm
1X
(fm (yi |xi ) − E(fm (Yi |xi ))) .
n
i=1
To control the deviation of such a quantity, we shall combine concentration with symmetrization
arguments. We first use the following concentration inequality which can be found in [BLM13].
Lemme 2.6.1. Let (Z1 , . . . , Zn ) be independent random variables with values in some space Z
and let Γ be a class of real-valued functions on Z. Assume that there exists Rn a non-random
constant such that supγ∈Γ ||γ||n ≤ Rn . Then, for all t > 0,
P
"
#
r !
n
n
√
1X
t
1X
sup
≤ e−t .
γ(Zi ) − E(γ(Zi )) > E sup
γ(Zi ) − E(γ(Zi )) + 2 2Rn
n
n
n
γ∈Γ
γ∈Γ
i=1
Proof. See [BLM13].
i=1
N
P
Then, we propose to bound E supγ∈Γ n1 ni=1 γ(Zi ) − E(γ(Zi )) thanks to the following symmetrization argument. The proof of this result can be found in [vdVW96].
91
CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE
MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS
Lemme 2.6.2. Let (Z1 , . . . , Zn ) be independent random variables with values in some space Z
and let Γ be a class of real-valued functions on Z. Let (ǫ1 , . . . , ǫn ) be a Rademacher sequence
independent of (Z1 , . . . , Zn ). Then,
"
#
"
#
n
n
1X
1X
E sup
γ(Zi ) − E(γ(Zi )) ≤ 2 E sup
ǫi γ(Zi ) .
γ∈Γ n
γ∈Γ n
i=1
i=1
Proof. See [vdVW96].
N
Then, we have to control E(supγ∈Γ
1
n
Pn
i=1 ǫi γ(Zi )
).
Lemme 2.6.3. Let (Z1 , . . . , Zn ) be independent random variables with values in some space Z
and let Γ be a class of real-valued functions on Z. Let (ǫ1 , . . . , ǫn ) be a Rademacher sequence
independent of (Z1 , . . . , Zn ). Define Rn a non-random constant such that
sup ||γ||n ≤ Rn .
γ∈Γ
Then, for all S ∈ N∗ ,
#
"
n
1X
ǫi γ(Zi ) ≤ Rn
E sup
γ∈Γ n
i=1
6 X −s p
√
2
log(1 + N (2−s Rn , Γ, ||.||n )) + 2−S
n s=1
S
!
where N (δ, Γ, ||.||n ) stands for the δ-packing number of the set of functions Γ equipped with the
metric induced by the norm ||.||n .
Proof. See [Mas07].
N
In our case, we get the following lemma.
Lemme 2.6.4. Let m ∈ N∗ . Consider (ǫ1 , . . . , ǫn ) a Rademacher sequence independent of
(Y1 , . . . , Yn ). Then, on the event T ,
!
n
X
√ CM q
ǫi fm (Yi |xi ) ≤ 18 K √ n ∆m ;
E
sup
n
fm ∈Fm
i=1
where ∆m := m log(n)
p
K log(2p + 1) + 6(1 + K(Aβ +
q
aΣ )).
Proof. Let m ∈ N∗ . According to Lemma 2.6.5, we get that on the event T ,
sup ||fm ||n ≤ Rn := 2CMn (1 + K(Aβ +
fm ∈Fm
q
)).
aΣ
92
2.6. SOME DETAILS
Besides, on the event T , for all S ∈ N∗ ,
S
X
2−s
s=1
≤
S
X
2
−s
S
X
p
p
log[1 + N (2−s Rn , Fm , ||.||n )] ≤
2−s log(2N (2−s Rn , Fm , ||.||n ))
s=1
p
log(2) +
p
log(2p + 1)
2s+1 C
Mn qKm
Rn
S
X
2s+3 CMn qK
2s+3 CMn
−s
K log 1 +
+
2
according to Lemma 4.15
1+
Rn a Σ
Rn
s=1
S
X
p
p
2s+1 CMn qKm
−s
log(2) + log(2p + 1)
≤
2
Rn
s=1
s
2
S
X
C Mn
−s
s+3
K log 1 + 2
+
2
max(1, qK/aΣ )
Rn
s=1
S
X
p
p
2s+1 CMn qKm p
−s
log(2) + log(2p + 1)
≤
+ 2(s + 3)K log(2)q/aΣ
2
Rn
s=1
!!
√
S
X
p
√
q √
2CMn Kmq p
S log(2p + 1) + log(2) 1 +
≤
6K + 2
2−s s
Rn
aΣ
s=1
!
√
√
p
q√
2e
2CMn Kmq p
√ √
√
6K + q K
≤
S log(2p + 1) + log(2) 1 +
Rn
aΣ
2− e
s=1
s
√ s
√
because 2−s s ≤ 2e for all s ∈ N∗ . Then, thanks to the Lemma 2.6.3,
n
E
sup
fm ∈Fm
1X
ǫi fm (Yi |xi )
n
i=1
!
≤ Rn
≤ Rn
Taking S =
log(n)
log(2)
!
S
6 X −s p
−S
−s
√
2
log[1 + N (2 Rn , Fm , ||.||n )] + 2
n s=1
2CMn Kmq p
6
√
S log(2p + 1)
Rn
n
p
q √
q √
2e
−S
√
+ log(2) 1 +
6K +
K
.
+2
aΣ
aΣ
2− e
to obtain the same order in the both terms depending on S, we could deduce
93
CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE
MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS
that
!
n
E
sup
fm ∈Fm
1X
ǫi fm (Yi |xi )
n
i=1
12CMn Kmq p
log(n)
√
≤
log(2p + 1)
log(2)
n
!
#
√
" p
√
log(2)
q
2e
1
√
√
+ 2CMn 1 + K Aβ +
+
1 + 6K +
aΣ
n
n
2 − 2e
18CMn Kmq p
√
≤
log(2p + 1) log(n)
n
!
#
√
√
"
p
√
q
K
2e
√
+ 2 √ C M n 1 + K Aβ +
log(2) 1 + 6 +
+1
aΣ
n
2 − 2e
√
p
K
q
√
.
CMn mq K log(2p + 1) log(n) + 6 1 + K Aβ +
≤18
aΣ
n
It completes the proof.
We are now able to prove the Lemma 2.5.1.
n
1X
(fm (yi |xi ) − EX (fm (Yi |xi )))
fm ∈Fm n i=1
!
r
n
X
√
t
fm (Yi |xi ) − E(fm (Yi |xi )) + 2 2Rn
≤E
sup
n
fm ∈Fm
sup |νn (−fm )| = sup
fm ∈Fm
i=1
with probability greater than 1 − e−t and where Rn
is a constant computed from the Lemma 2.6.5
!
r
n
X
√
t
≤ 2E
sup
ǫi fm (Yi |xi ) + 2 2Rn
n
fm ∈Fm
i=1
with ǫi a Rademacher sequence,
independent of Zi
r
√ C Mn q
√
t
≤ 2 18 K √ ∆m + 2 2Rn
n
n
r
√
!
√
q
Kq
t
.
1 + K Aβ +
≤ 4CMn 9 √ ∆m + 2
n
aΣ
n
2.6.2
Lemma 2.6.5 and Lemma 4.15
Lemme 2.6.5. On the event
T =
max
max |[Yi ]z | ≤ Mn ,
i∈{1,...,n} z∈{1,...,q}
for all m ∈ N∗ ,
sup ||fm ||n ≤ 2CMn
fm ∈Fm
q
1 + K Aβ +
aΣ
:= Rn .
N
94
2.6. SOME DETAILS
n
o
Proof. Let m ∈ N∗ . Because fm ∈ Fm = fm = − log ssm0 , sm ∈ Sm , there exists sm ∈
ξ
sm
p
Sm such that fm = − log s 0 . For all x ∈ [0, 1] , denote ξ(x) = (π, β1 x, . . . , βK x, Σ) the
ξ
parameters of sm (.|x). For all i ∈ {1, . . . , n},
|fm (yi |xi )|1T = | log(sm (yi |xi )) − log(sξ0 (yi |xi ))|1T
∂ log(sξ (yi |x))
||ξ(xi ) − ξ 0 (xi )||1 1T ,
≤ sup sup
∂ξ
x∈[0,1]p ξ∈Ξ
thanks to the Taylor formula. Then, we need an upper bound of the partial derivate.
For all x ∈ [0, 1]p , for all y ∈ Rq , we could write
!
K
X
log(sξ (y|x)) = log
hk (x, y)
k=1
where, for all k ∈ {1, . . . , K},
hk (x, y) =
πk
q/2
(2π) det Σk
1
× exp −
2
q
X
z2 =1
q
X
z1 =1
y z1 −
p
X
j=1
y z 2 −
xj [βk ]z1 ,j [Σk ]−1
z1 ,z2
p
X
j=1
[βk ]z2 ,j xj .
Then, for all l ∈ {1, . . . , K}, for all z1 ∈ {1, . . . , q}, for all z2 ∈ {1, . . . , q}, for all y ∈ Rq , for all
x ∈ [0, 1]p ,
!
q
q(|y| + Aβ )AΣ
∂ log(sξ (y|x))
1 X
hl (x, y)
−
;
[Σl ]−1
= PK
z1 ,z2 ([βl x]z2 − yz2 ) ≤
∂([βl x]z1 )
2
2
k=1 hk (x, y)
z2 =1
∂ log(sξ (y|x))
1
= PK
∂([Σl ]z1 ,z2 )
k=1 hk (x, y)
×
≤
−hl Cofz1 ,z2 (Σl ) hl (x, y)(yz1 − [βl x]z1 )(yz2 − [βl x]z2 )[Σl ]−2
z1 ,z2
−
det(Σl )
2
−Cofz1 ,z2 (Σl ) (yz1 − [βl x]z1 )(yz2 − [βl x]z2 )[Σl ]−2
z1 ,z2
+
det(Σl )
2
1
≤AΣ + (|y| + Aβ )2 A2Σ ,
2
where Cofz1 ,z2 (Σk ) is the (z1 , z2 )-cofactor of Σk . We also have, for all l ∈ {1, . . . , K}, for all
x ∈ [0, 1]p , for all y ∈ Rq ,
∂ log(sξ (y, x))
hl (x, y)
1
=
.
≤
PK
∂πl
aπ
πl k=1 hk (x, y)
Thus, for all y ∈ Rq ,
∂ log(sξ (y|x))
≤ max
sup sup
∂ξ
x∈[0,1]p ξ∈Ξ̃
We have Cy ≤ AΣ ∧
1
aπ
h
1+
q(|y| + Aβ )AΣ
1
1
, AΣ + (|y| + Aβ )2 A2Σ ,
aπ
2
2
q+1
2 AΣ (|y|
i
+ Aβ )2 . For all m ∈ N∗ ,
= Cy .
95
CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE
MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS
|fm (yi |xi )|1T ≤ Cyi ||ξ(xi ) − ξ 0 (xi )||1 1T
≤ C Mn
K
X
k=1
(||βk xi − βk0 xi ||1 + ||Σk − Σ0k ||1 + |πk − πk0 |).
0 belong to Ξ̃, we obtain
Since fm and fm
|fm (yi |xi )|1T ≤ 2CMn (KAβ + K
q
+ 1)
aΣ
and then
sup ||fm ||n 1T ≤ 2CMn (KAβ + K
fm ∈Fm
q
+ 1).
aΣ
N
For the next results, we need the following lemma, proved in [Mey13].
Lemme 2.6.6. Let δ > 0 and (Ai,j ) i∈{1,...,n} ∈ [0, 1]n×p . There exists a family B of (2p + 1)1/δ
2
j∈{1,...,p}
vectors of Rp such that for all µ ∈ Rp in the ℓ1 -ball, there exists µ′ ∈ B such that
2
p
n
X
X
1
(µj − µ′j )Ai,j ≤ δ 2 .
n
i=1
Proof. See [Mey13].
j=1
N
With this lemma, we can prove the following one:
Lemme 2.6.7. Let δ > 0 and m ∈ N∗ . On the event T , we have the upper bound of the
δ-packing number of the set of functions Fm equipped with the metric induced by the norm ||.||n :
N (δ, Fm , ||.||n ) ≤ (2p + 1)
2 K 2 q 2 m2 /δ 2
4CM
n
8CMn qK
1+
aΣ δ
K
8CMn
1+
δ
K
.
Proof. Let m ∈ N∗ and fm ∈ Fm . There exists sm ∈ Sm such that fm = − log(sm /sξ0 ). Intro′
′ = − log(s′ /s ). Denote by (β , Σ , π )
′
′
duce s′m in S and put fm
k
k k 1≤k≤K and (βk , Σk , πk )1≤k≤K
m ξ0
′
the parameters of the densities sm and sm respectively. First, applying Taylor’s inequality, on
the event
T =
max |[Yi ]z | ≤ Mn ,
max
i∈{1,...,n} z∈{1,...,q}
we get, for all i ∈ {1, . . . , n},
′
(yi |xi )|1T = | log(sm (yi |xi )) − log(s′m (yi |xi ))|1T
|fm (yi |xi ) − fm
∂ log(sξ (yi |x))
≤ sup sup
||ξ(xi ) − ξ ′ (xi )||1 1T
∂ξ
p
x∈[0,1] ξ∈Ξ̃
≤ C Mn
q
K
X
X
k=1
z=1
!
[βk xi ]z − [βk′ xi ]z + ||Σk − Σ′k ||1 + |πk − πk′ | .
96
2.6. SOME DETAILS
Thanks to the Cauchy-Schwarz inequality, we get that
!2
q
K X
X
2
′
+ (||Σ − Σ′ ||1 + ||π − π ′ ||)2
βk xi − βk′ xi
(yi |xi ))2 1T ≤ 2CM
(fm (yi |xi ) − fm
n
k=1 z=1
2
q
p
p
K X
X
X
X
2
[βk ]z,j [xi ]j −
Kq
≤ 2CM
[βk′ ]z,j [xi ]j + (||Σ − Σ′ ||1 + ||π − π ′ ||)2 ,
n
k=1 z=1
j=1
j=1
and
2
p
p
q
K X
n
X
X
X
X
1
′ 2
2
[βk ]z,j [xi ]j −
Kq
||fm − fm
||n 1T ≤2CM
[βk′ ]z,j [xi ]j
n
n
i=1
j=1
j=1
k=1 z=1
#
+ (||Σ − Σ′ ||1 + ||π − π ′ ||)2 .
Denote by
2
q
p
p
n
K X
X
X
X
X
1
[βk ]z,j [xi ]j −
a = Kq
[βk′ ]z,j [xi ]j .
n
k=1 z=1
i=1
j=1
j=1
Then, for all δ > 0, if
2
)
a ≤ δ 2 /(4CM
n
||Σ − Σ′ ||1 ≤ δ/(4CMn )
||π − π ′ || ≤ δ/(4CMn )
′ ||2 ≤ δ 2 . To bound a, we write
then ||fm − fm
n
2
q
p
p
K X
n
′
X
X
X
X
[βk ]z,j
[βk ]z,j
1
[xi ]j −
[xi ]j
a = Kqm2
n
m
m
=1 z=1
i=1
j=1
j=1
and we apply Lemma 2.6.6 to [βk ]z,. /m for all k ∈ {1, . . . , K}, and for all z ∈ {1, . . . , q}. Since
P
P
2
2 2 2 2
[β ]
sm ∈ Sm , we have qz=1 pj=1 kmz,j ≤ 1, thus there exists a family B of (2p + 1)4CMn q K m /δ
vectors of Rp such that for all k ∈ {1, . . . , K}, for all z ∈ {1, . . . , q}, for all [βk ]z,. , there exists
′
2 ). Moreover, since ||Σ|| ≤ qK and ||π|| ≤ 1, we get that,
[βk ]z,. ∈ B such that a ≤ δ 2 /(4CM
1
1
aΣ
n
on the event T ,
δ
δ
K qK
K
,B (
, B (1), ||.||1
), ||.||1 N
N (δ, Fm , ||.||n ) ≤ card(B)N
4CMn 1 AΣ
4CMn 1
2 q 2 K 2 m2 /δ 2
8CMn qK K
8CMn K
4CM
n
1+
≤ (2p + 1)
1+
aΣ δ
δ
N
97
CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE
MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS
2.6. SOME DETAILS
98
Chapter 3
An oracle inequality for the Lasso-MLE
procedure
Contents
3.1
3.2
3.3
3.4
3.5
3.6
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
The Lasso-MLE procedure . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.2.1 Gaussian mixture regression model . . . . . . . . . . . . . . . . . . . . . 93
3.2.2 The Lasso-MLE procedure . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.2.3 Why refit the Lasso estimator? . . . . . . . . . . . . . . . . . . . . . . . 94
An oracle inequality for the Lasso-MLE model . . . . . . . . . . . . . 95
3.3.1 Notations and framework . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.3.2 Oracle inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.4.1 Simulation illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.4.2 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Tools for proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.5.1 General theory of model selection with the maximum likelihood estimator.101
3.5.2 Proof of the general theorem . . . . . . . . . . . . . . . . . . . . . . . . 103
3.5.3 Sketch of the proof of the oracle inequality 3.3.2 . . . . . . . . . . . . . 107
Assumption Hm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Assumption K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Appendix: technical results . . . . . . . . . . . . . . . . . . . . . . . . 109
3.6.1 Bernstein’s Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.6.2 Proof of Lemma 3.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.6.3 Determination of a net for the mean and the variance . . . . . . . . . . 110
3.6.4 Calculus for the function φ . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.6.5 Proof of the Proposition 3.5.5 . . . . . . . . . . . . . . . . . . . . . . . . 114
3.6.6 Proof of the Lemma 3.5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . 115
99
3.1. INTRODUCTION
100
In this chapter, we focus on a theoretical result for the LassoMLE procedure. We will get a penalty which depends on the model
complexity for which the model selected by the penalized criterion among the collection constructed satisfies an oracle inequality.
This result is non-asymptotic. We derive it from a general model
selection theorem, also detailed here, which is a generalization of
Cohen and Le Pennec Theorem, [CLP11], for a model collection
constructed randomly.
3.1
Introduction
The goal of clustering methods is to discover a structure among individuals described by several
variables. Specifically, in regression case, given n observations (x, y) = ((x1 , y1 ), . . . , (xn , yn ))
which are realizations of random variables (X, Y ) with X ∈ Rp and Y ∈ Rq , one aims at grouping
the data into clusters such that the observations Y conditionally to X in the same cluster are
more similar to each other than those from the other clusters. Different methods could be
considered, more geometric or more statistical. We are dealing with model-based clustering, in
order to have a rigorous statistical framework to assess the number of clusters and the role of each
variable. This method is known to have good empirical performance relative to its competitors,
see in [TMZT06].
Datasets are described by a lot of explicative variables, sometimes much more than the sample
size. All the information should not be relevant for the clustering. To solve this problem,
we propose a procedure which provide a data clustering from variable selection. In a density
estimation way, we could also cite Pan and Shen, in [PS07], who focus on mean variable selection,
Zhou and Pan, in [ZPS09], who use the Lasso estimator to regularize Gaussian mixture model
with general covariance matrices, Sun and Wand, in [SWF12], who propose to regularize the
k-means algorithm to deal with high-dimensional data, Guo et al. in [GLMZ10] who propose a
pairwise variable selection method. All of them deal with penalized model-based clustering.
In a regression framework, the Lasso estimator, introduced by Tibshirani in [Tib96], is a classical
tool in this context. Working well in practice, many efforts have been made recently on this
estimator to get some theoretical results. Under a variety of different assumptions on the design
matrix, we could have oracle inequalities for the Lasso estimator. For example, we can state the
restricted eigenvalue condition, introduced by Bickel, Ritov and Tsybakov in [BRT09], who get
an oracle inequality with this assumption. For an overview of existing results, cite for example
[vdGB09].
Whereas focus on the estimation, the Lasso estimator could be used to select variables, and,
for this goal, many results without strong assumptions are proved. The first result in this way
is from Meinshausen and Bühlmann, in [MB10], who show that, for neighborhood selection in
Gaussian graphical models, under a neighborhood stability condition, the Lasso estimator is
consistent. Under different assumptions, as the irrepresentable condition, described in [ZY06],
one get the same kind of result: true variables are selected consistently.
Thanks to those results, one could refit the estimation, after the variable selection, with an
estimator with better properties. In this chapter, we focus on the maximum likelihood estimator
on the estimated active set. In a linear regression framework, we could cite Massart and Meynet,
[MM11a], or Belloni and Chernozhukov, [BC13], or also Sun and Zhang, [SZ12] for using this
idea.
In our case of finite mixture regression, we propose a procedure which is based on a modeling
that recasts variable selection and clustering problems into a model selection problem. This
procedure is developed in [Dev14c], with methodology, computational issues, simulations and
101
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
data analysis. First, for some data-driven regularization parameters, we construct a relevant
variables set. Then, restricted on those sets, we compute the maximum likelihood estimator.
Considering the model collection with various number of components and various sparsities, we
select a model thanks to the slope heuristic. Then, we get a clustering of the data thanks to the
Maximum A Posteriori principle. This procedure could be used to cluster heterogeneous multivariate regression data and understand which variables explain the clustering, in high-dimension.
Considering a regression clustering could refine a clustering, and it could be more adapted for
instance for prediction. In this chapter, we focus on the theoretical point of view. We define a
penalized criterion which allows to select a model (defined by the number of clusters and the
set of relevant variables) from a non-asymptotic point of view. Penalizing the empirical contrast
is an idea emerging from the seventies. Akaike, in [Aka74], proposed the Akaike’s Information
Criterion (AIC) in 1973, and Schwarz in 1978 in [Sch78] suggested the Bayesian Information Criterion (BIC). Those criteria are based on asymptotic heuristics. To deal with non-asymptotic
observations, Birgé and Massart in [BM07] and Barron et al. in [YB99], define a penalized
data-driven criterion, which leads to oracle inequalities for model selection. The aim of our
approach is to define penalized data-driven criterion which leads to an oracle inequality for our
procedure. In our context of regression, Cohen and Le Pennec, in [CLP11], proposed a general
model selection theorem for maximum likelihood estimation, adapted from Massart’s Theorem
in [Mas07]. Nevertheless, we can not apply it directly, because it is stated for a deterministic
model collection, whereas our data-driven model collection is random, constructed by the Lasso
estimator. As Maugis and Meynet have done in [MMR12] to generalize Massart’s Theorem,
we extend the theorem to cope with the randomness of our model collection. By applying this
general theorem to the finite mixture regression random model collection constructed by our
procedure, we derive a convenient theoretical penalty as well as an associated non-asymptotic
penalized criteria and an oracle inequality fulfilled by our Lasso-MLE estimator. The advantage
of this procedure is that it does not need any restrictive assumption.
To obtain the oracle inequality, we use a general theorem proposed by Massart in [Mas07], which
gives the form of the penalty and associated oracle inequality in term of the Kullback-Leibler
and Hellinger loss. In our case of regression, Cohen and Le Pennec, in [CLP11], generalize
this theorem in term of Kullback-Leibler and Jensen-Kullback-Leibler loss. Those theorems are
based on the centred process control with the bracketing entropy, allowing to evaluate the size
of the models. Our setting is more general, because we work with a random family denoted by
M̌. We have to control the centred process thanks to Bernstein’s inequality.
The rest of this chapter is organized as follows. In the Section 3.2, we define the multivariate
Gaussian mixture regression model, and we describe the main steps of the procedure we propose.
We also illustrate the requirement of refitting by some simulations. We present our oracle
inequality in the Section 3.3.2. In Section 3.4, we illustrate the procedure on simulated dataset
and benchmark dataset. Finally, in Section 3.5, we give some tools to understand the proof
of the oracle inequality, with a global theorem of model selection with a random collection in
Section 3.5.1 and sketch of proofs after. All the details are given in Appendix.
3.2
The Lasso-MLE procedure
In order to cluster high-dimensional regression data, we will work with the multivariate Gaussian
mixture regression model. This model is developed in [SBG10] in the scalar response case.
We generalize it in Section 3.2.1. Moreover, we want to construct a model collection. We
propose, in Section 3.2.2, a procedure called Lasso-MLE which constructs a model collection,
with various sparsity and various number of components, of Gaussian mixture regression models.
The different sparsities solve the high-dimensional problem. We conclude this section with
102
3.2. THE LASSO-MLE PROCEDURE
simulations, which illustrate the advantage of refitting.
3.2.1
Gaussian mixture regression model
We observe n independent couples (xi , yi )1≤i≤n realizing random variables (X, Y ), where X ∈
Rp , and Y ∈ Rq comes from a probability distribution with unknown conditional density denoted
by s∗ . To solve a clustering problem, we use a finite mixture model in regression. In particular,
we will approximate the density of Y conditionally to X with a mixture of K multivariate
Gaussian regression models. If the observation i belongs to the cluster k, we are looking for
βk ∈ Rq×p such that yi = βk xi + ǫ, where ǫ ∼ N (0, Σk ). Remark that we also have to estimate
the number of clusters K.
Thus, the random response variable Y ∈ Rq depends on a set of random explanatory variables,
written X ∈ Rp , through a regression-type model. Give more precisions on the assumptions on
the model we use.
— The variables Yi , conditionally to Xi , are independent, for all i ∈ {1, . . . , n} ;
— Yi |Xi = xi ∼ sK
ξ (y|xi )dy, with
K
X
(y − βk x)t Σ−1
πk
k (y − βk x)
sK
exp
−
ξ (y|x) =
q/2
1/2
2
(2π) det(Σk )
k=1
!
(3.1)
K
ξ = (π1 , . . . , πK , β1 , . . . , βK , Σ1 , . . . , ΣK ) ∈ ΞK = ΠK × (Rq×p )K × (S++
q )
(
)
K
X
ΠK = (π1 , . . . , πK ); πk > 0 for k ∈ {1, . . . , K} and
πk = 1
k=1
S++
q
is the set of symmetric positive definite matrices on Rq .
We want to estimate the conditional density function sK
ξ from the observations. For all k ∈
{1, . . . , K}, βk is the matrix of regression coefficients, and Σk is the covariance matrix in the
mixture component k. The πk s are the mixture
Ppproportions. In fact, for a regressor x, for all
k ∈ {1, . . . , K}, for all z ∈ {1, . . . , q}, [βk x]z = j=1 [βk ]z,j xj is the zth component of the mean
of the mixture component k. To deal with high-dimensional data, we select variables.
Definition 3.2.1. A variable (z, j) ∈ {1, . . . , q} × {1, . . . , p} is said to be irrelevant if, for all
k ∈ {1, . . . , K}, [βk ]z,j = 0. A variable is relevant if it is not irrelevant.
A model is said to be sparse if there are a few of relevant variables.
We denote by Φ[J] the matrix with 0 on the set c J, and S(K,J) the model with K components
and with J for relevant variables set:
n
o
(K,J)
S(K,J) = y ∈ Rq |x ∈ Rp 7→ sξ
(y|x)
(3.2)
where
(K,J)
sξ
(y|x)
K
X
[J]
[J]
(y − βk x)t Σ−1
πk
k (y − βk x)
=
exp
−
q/2
1/2
2
(2π) det(Σk )
k=1
!
.
This is the main model used in this chapter. To construct the set of relevant variables J, we use
the Lasso estimator. Rather than select a regularization parameter, we consider a collection,
which leads to a model collection. Detail the procedure.
103
3.2.2
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
The Lasso-MLE procedure
The procedure we propose, which is particularly interesting in high-dimension, could be decomposed into three main steps. First, we construct a model collection, with models more or less
sparse and with more or less components. Then, we refit estimations with the maximum likelihood estimator. Finally, we select a model thanks to the slope heuristic. It leads to a clustering
according to the MAP principle on the selected model.
Model collection construction The first step consists of constructing a collection of models
{S(K,J) }(K,J)∈M in which the model S(K,J) is defined by equation (3.2), and the model collection
is indexed by M = K × J . Denote by K ⊂ N∗ the possible number of components, and denote
by J a collection of subsets of {1, . . . , q} × {1, . . . , p}.
To detect the relevant variables, and construct the set J in each model, we generalize the Lasso
estimator. Indeed, we penalize
empirical contrast by an ℓ1 -penalty on the mean parameters
P theP
proportional to ||Pk βk ||1 = pj=1 qz=1 |(Pk βk )z,j |, where Pkt Pk = Σ−1
k for all k ∈ {1, . . . , K}.
Then, we will consider
)
(
n
K
X
X
1
Lasso
log(sK
πk ||Pk βk ||1 .
ξˆK
(λ) = argmin
−
ξ (yi |xi )) + λ
n
ξ=(π,β,Σ)∈ΞK
i=1
k=1
This leads to penalize simultaneously the ℓ1 -norm of the mean coefficients and small variances.
Computing those estimators lead to construct the relevant variables set. For a fixed number
of mixture components K ∈ K, denote by GK a candidate of regularization parameters. Fix a
parameter λ ∈ GK , we could then use an EM algorithm to compute the Lasso estimator, and
construct the set of relevant variables J(λ,K)S
, sayingSthe non-zero coefficients. We denote by J
the random collection of all these sets, J = K∈K λ∈GK J(λ,K) .
Refitting The second step consists of approximating the maximum likelihood estimator
)
(
n
X
1
ŝ(K,J) = argmin −
log(t(yi |xi ))
n
t∈S(K,J)
i=1
using an EM algorithm for each model (K, J) ∈ K×J . Remark that we estimate all parameters,
to reduce bias induced by the Lasso estimator.
Model selection The third step is devoted to model selection. We get a model collection,
and we need to select the best one. Because we do not have access to s∗ , we can not take the one
which minimizes the risk. The Theorem 3.5.1 solves this problem: we get a penalty achieving to
an oracle inequality. Then, even if we do not have access to s∗ , we know that we can do almost
like the oracle.
3.2.3
Why refit the Lasso estimator?
In order to illustrate the refitting, we compute multivariate data, the restricted eigenvalue condition being not satisfied, and run our procedure. We consider an extension of the model studied
in Giraud et al. article [BGH09] in the Section 6.3. Indeed, this model is a linear regression with
a scalar response which does not satisfy the restricted eigenvalues condition. Then, we define
different classes, to get a finite mixture regression model, which does not satisfied the restricted
eigenvalues condition, and extend the dimension for multivariate response. We could compare
3.3. AN ORACLE INEQUALITY FOR THE LASSO-MLE MODEL
104
the result of our procedure with the Lasso estimator, to illustrate the oracle inequality we have
get. Let precise the model.
Let [x]1 , [x]2 , [x]3 be three vectors of Rn defined by
√
[x]1 = (1, −1, 0, . . . , 0)t / 2 √
t
2
[x]2 = (−1,
p
√1.001,√0, . . . , 0) / 1 + t0.001
[x]3 = (1/ 2, 1/ 2, 1/n, . . . , 1/n) / 1 + (n − 2)/n2
and for 4 ≤ j ≤ n, let [x]j be the j th vector of the canonical basis of Rn . We take a sample
of size n = 20, and vector of size p = q = 10. We consider two classes, each of them defined
by [β1 ]z,j = 10 and [β2 ]z,j = −10 for j ∈ {1, . . . , 2}, z ∈ {1, . . . , 10}. Moreover, we define the
covariance matrix of the noise by a diagonal matrix with 0.01 for diagonal coefficient in each
class.
We run our procedure on this model, and compare it with the Lasso estimator, without refitting.
We compute the model selected by the slope heuristic over the model collection constructed by
the Lasso estimator. In Figure 3.1 are the boxplots of each procedure, running 20 times. The
Kullback-Leibler divergence is computed over a sample of size 5000.
Kullback−Leibler divergence
8
7
6
5
4
Lasso−MLE
Lasso
Figure 3.1: Boxplot of the Kullback-Leibler divergence between the true model and the one
constructed by each procedure, the Lasso-MLE procedure and the Lasso estimator
We could see that a refitting after variable selection by the Lasso estimator leads to a better
estimation, according to the Kullback-Leibler loss.
3.3
An oracle inequality for the Lasso-MLE model
Before state the main theorem of this chapter, we need to precise some definitions and notations.
3.3.1
Notations and framework
We assume that the observations (xi , yi )1≤i≤n are realizations of random variables (X, Y ) where
X ∈ Rp and Y ∈ Rq .
For (K, J) ∈ K × J , for a model S(K,J) , we denote by ŝ(K,J) the maximum likelihood estimator
!
n
X
(K,J)
(K,J)
−
ŝ
= argmin
log sξ
(yi |xi ) .
(K,J)
sξ
∈S(K,J)
i=1
To avoid existence issue, we could work with almost minimizer of this quantity and define an
η-log-likelihood minimizer:
!
n
n
X
X
(K,J)
(K,J)
− log(ŝ
(yi |xi )) ≤
inf
− log sξ
(yi |xi ) + η.
i=1
(K,J)
sξ
∈S(K,J)
i=1
105
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
The best model in this collection is the one with the smallest risk. However, because we do not
have access to the true density s∗ , we can not select the best model, which we call the oracle.
Thereby, there is a trade-off between a bias term measuring the closeness of s∗ to the set S(K,J)
and a variance term depending on the complexity of the set S(K,J) and on the sample size. A
good set S(K,J) will be one for which this trade-off leads to a small risk bound. Because we
are working with a maximum likelihood approach, the most natural quality measure is thus the
Kullback-Leibler divergence denoted by KL.
Z
s(y)
log
s(y)dy if sdy << tdy;
t(y)
(3.3)
KL(s, t) =
R
+ ∞ otherwise;
for s and t two densities.
As we deal with conditional densities and not classical densities, the previous divergence should
be adapted. We define the tensorized Kullback-Leibler divergence by
#
" n
X
1
KL⊗n (s, t) = E
KL(s(.|xi ), t(.|xi )) .
n
i=1
This divergence, defined in [CLP11] appears as the natural one in this regression setting.
Namely, we use the Jensen-Kullback-Leibler divergence JKLρ with ρ ∈ (0, 1), which is defined
by
1
JKLρ (s, t) = KL(s, (1 − ρ)s + ρt);
ρ
and the tensorized one
"
#
n
X
1
n
JKL⊗
JKLρ (s(.|xi ), t(.|xi )) .
ρ (s, t) = E
n
i=1
This divergence is studied in [CLP11]. We use this divergence rather than the Kullback-Leibler
one because we need a boundedness assumption
on the controlled functions that is not satisfied by
(K,J) ∗
/s . When considering the Jensen-Kullback-Leibler
the log-likelihood differences − log sξ
divergence, those ratios are replaced by ratios
∗ + ρs(K,J)
(1
−
ρ)s
1
ξ
− log
ρ
s∗
that are close to the log-likelihood differences when sm are close to s∗ and always upper bounded
by − log(1 − ρ)/ρ. Indeed, this bound is needed to use deviation inequalities for sums of random
variables and their suprema, which is the key of the proof of oracle type inequality.
3.3.2
Oracle inequality
We denote by (S(K,J) )(K,J)∈K×J L the model collection constructed by the Lasso-MLE procedure,
with J L a random subcollection of P({1, . . . , q}×{1, . . . , p}) constructed by the Lasso estimator.
The grid of regularization parameter considered is data-driven, then random. Because we work
in high-dimension, we could not look at all subsets of P({1, . . . , q}×{1, . . . , p}). Considering the
Lasso estimator through its regularization path is the solution chosen here, but it needs more
control because of the random family. To get theoretical results, we need to work with restricted
106
3.3. AN ORACLE INEQUALITY FOR THE LASSO-MLE MODEL
parameters. Assume Σk diagonal, with Σk = diag([Σk ]1,1 , . . . , [Σk ]q,q ), for all k ∈ {1, . . . , K}.
We define
B
S(K,J)
=
(K,J)
∈ S(K,J) for all k ∈ {1, . . . , K}, [βk ][J] ∈ [−Aβ , Aβ ]q×p ,
aΣ ≤ [Σk ]z,z ≤ AΣ for all z ∈ {1, . . . , q}, for all k ∈ {1, . . . , K} .
sξ
(3.4)
Moreover, we assume that the covariates X belong to an hypercube. Without any restriction,
we could assume that X ∈ [0, 1]p .
Remark 3.3.1. We have to denote that in this chapter, the relevant variables set is designed
by the Lasso estimator. Nevertheless, any tool could be used to construct this set, and we obtain
analog results. We could work with any random subcollection of P({1, . . . , q} × {1, . . . , p}), the
controlled size being required in high-dimensional case.
Theorem 3.3.2. Let (xi , yi )1≤i≤n the observations, with unknown conditional density s∗ . Let
S(K,J) defined by (3.2). For (K, J) ∈ K×J L , J L being a random subcollection of P({1, . . . , q}×
B
{1, . . . , p}) constructed by the Lasso estimator, denote S(K,J)
the model defined by (3.4).
Consider the maximum likelihood estimator
(
)
n
1X
(K,J)
(K,J)
ŝ
= argmin
−
(yi |xi ) .
log sξ
n
(K,J)
s
∈S B
ξ
i=1
(K,J)
B
Denote by D(K,J) the dimension of the model S(K,J)
, D(K,J) = K(|J| + q + 1) − 1. Let s̄(K,J) ∈
B
S(K,J)
such that
δKL
;
KL⊗n (s∗ , s̄(K,J) ) ≤ inf KL⊗n (s∗ , t) +
B
n
t∈S(K,J)
and let τ > 0 such that s̄(K,J) ≥ e−τ s∗ . Let pen : K × J → R+ , and suppose that there
exists an absolute constant κ > 0 and an absolute constant B(Aβ , AΣ , aΣ ) such that, for all
(K, J) ∈ K × J ,
D(K,J) 2
D(K,J)
2
B (Aβ , AΣ , aΣ ) − log
B (Aβ , AΣ , aΣ ) ∧ 1
pen(K, J) ≥ κ
n
n
4epq
+(1 ∨ τ ) log
.
(D(K,J) − q 2 ) ∧ pq
ˆ
Then, the estimator ŝ(K̂,J) , with
ˆ =
(K̂, J)
argmin
(K,J)∈K×J L
(
n
1X
log(ŝ(K,J) (yi |xi )) + pen(K, J)
−
n
i=1
)
satisfies
E
h
ˆ
∗ (K̂,J)
n
JKL⊗
)
ρ (s , ŝ
i
≤C1 E
+ C2
inf
(K,J)∈K×J L
(1 ∨ τ )
;
n
for some absolute positive constants C1 and C2 .
inf
B
t∈S(K,J)
⊗n
KL
∗
(s , t) + pen(K, J)
!!
107
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
This oracle inequality compares performances of our estimator with the best model in the collection. Nevertheless, as we consider mixture of Gaussian, if we take enough clusters, we could
approximate well a lot of densities. This result could be compared with the oracle inequality
get in [SBG10], Theorem 4. Indeed, under restricted eigenvalues condition and fixed design,
they get an oracle inequality for the Lasso estimator in finite mixture regression model, with
scalar response and high-dimension regressors. Note that they control the divergence with the
true parameters. We get a similar result for the Lasso-MLE estimator. Moreover, our procedure
work in a more general framework, the only assumption needed is to be bounded.
3.4
Numerical experiments
To illustrate this procedure, we study some simulations and real data. The main algorithm is a
generalized version of the EM algorithm, which is used many times for the procedure. We first
use it to compute maximum likelihood estimator, to construct the regularization parameter grid.
Then, we use it to compute the Lasso estimator for each regularization parameter belonging to
the grid, and we are able to construct the relevant variables set. Finally, we could compute the
maximum likelihood estimator, restricted to those relevant variables in each model. Among this
model collection, we select one using the Capushe package. More details, as initialization rule,
stopping rule, and more numerical experiments, are available in [Dev14c].
3.4.1
Simulation illustration
We illustrate the procedure on a simulated dataset, adapted from [SBG10].
Let x be a sample of size n = 100 distributed according to multivariate standard Gaussian. We
consider a mixture of two components, and we fix the dimension of the regressor and of the
response variables to p = q = 10. Besides, we fix the number of relevant variables to 4 in each
cluster. More precisely, the first
four variables of Y are explained respectively by the four first
variables of X. Fix π = 21 , 21 , [β1 ][J] = 3, [β2 ][J] = −2 and Pk = 3Iq for all k ∈ {1, 2}.
The difficulty of the clustering is partially controlled by the signal-to-noise ratio. In this context,
we could extend the natural idea of the SNR with the following definition, where Tr(A) denotes
the trace of the matrix A.
Tr(Var(Y ))
= 1.88.
SNR =
Tr(Var(Y |βk = 0 for all k ∈ {1, . . . , K}))
We take a sample of Y knowing X = x according to a Gaussian mixture, meaning in βk x and
with covariance matrix Σk = (Pkt Pk )−1 = σIq , for the cluster k ∈ {1, 2}. We run our procedures
with the number of components varying in K = {2, . . . , 5}.
To compare our procedure with others, we compute the Kullback-Leibler divergence with the
true density, the ARI (the Adjusted Rand Index measures the similarity between two data
clusterings, knowing that the closer to 1 the ARI, the more similar the two partitions), and how
many clusters are selected.
From the Lasso-MLE model collection, we construct two models, to compare our procedures
with. We compute the oracle (the model which minimizes the Kullback-Leibler divergence with
the true density), and the model which is selected by the BIC criterion instead of the slope
heuristic. Thanks to the oracle, we know how good we could be from this model collection for
the Kullback-Leibler divergence, and how this model, as good it is possible for the contrast,
performs the clustering.
The third procedure we compare with is the maximum likelihood estimator, assuming that we
know how many clusters there are, fixed to 2. We use this procedure to show that variable
selection is necessary.
108
3.4. NUMERICAL EXPERIMENTS
0.4
1
0.35
0.8
0.3
0.25
0.6
0.2
ARI
Kullback−Leibler divergence
0.45
0.4
0.15
0.1
0.2
0.05
LMLE
Oracle
Bic
0
LMLE
Figure 3.2: Boxplots of the Kullback-Leibler
divergence between the true model and the
one selected by the procedure over the 20
simulations, for the Lasso-MLE procedure
(LMLE), the oracle (Oracle), the BIC estimator (BIC)
Oracle
Bic
MLE
Figure 3.3: Boxplots of the ARI over the 20
simulations, for the Lasso-MLE procedure
(LMLE), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE)
Results are summarized in Figure 3.2 and in Figure 3.3. The Kullback-Leibler divergence is
smaller for models selected in our model collection (else by BIC, or by slope heuristic, or the
oracle) than for the model constructed by the MLE. The ARI is closer to 1 in those case, and,
moreover, is better for the model selected by the slope heuristic. We could conclude that the
model collection is well constructed, selecting relevant variables, and also that the model is well
selected among this collection, near the oracle.
3.4.2
Real data
We also illustrate the procedure on the Tecator dataset, which deal with spectrometric data. We
summarize here results which are described in [Dev14c]. Those data have been studied in a lot
of articles, cite for example Ferraty and Vieu’s book [FV06]. The data consist of a 100-channel
spectrum of absorbances in the wavelength range 850 − 1050 nm, and of the percentage of fat.
We observe a sample of size 215. In this work, we focus on clustering data according to the
reliance between the fat content and the absorbance spectrum. The sample will be split into
two subsamples, 165 observations for the learning set, and 50 observations for the test set. We
split it to have the same marginal distribution of the response in each sample.
The spectrum is a function, which we decompose into the Haar basis, at level 6.
The procedure selects two models, which we describe here. In Figures (3.4) and (3.5), we
represent clusters done on the training set for the different models.
The graph on the left is a candidate for representing each cluster, constructed by the mean of
spectrum over an a posteriori probability greater than 0.6. We plot the curve reconstruction,
keeping only active variables in the wavelet decomposition. On the right side, we present the
boxplot of the fat values in each class, for observations with an a posteriori probability greater
than 0.6.
The first model has two clusters, which could be distinguish in the absorbance spectrum by the
bump on wavelength around 940 nm. The first class is dominating, with π̂1 = 0.95. The fat
content is smaller in the first class than in the second class. According to the signal reconstruction, we could see that almost all variables have been selected. This model seems consistent
according to the classification goal.
The second model has 3 clusters, and we could remark different important wavelength. Around
940 nm, there is some differences between clusters, corresponding to the bump underline in the
model 1, but also around 970 nm, with higher or smaller values. The first class is dominating,
109
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
with π̂1 = 0.89. Just a few of variables have been selected, which give to this model the
understanding property of which coefficient are discriminating.
Figure 3.4: Summarized results for the model 1. The graph on the left is a candidate for
representing each cluster, constructed by the mean of reconstructed spectrum over an a posteriori
probability greater than 0.6 On the right side, we present the boxplot of the fat values in each
class, for observations with an a posteriori probability greater than 0.6.
Figure 3.5: Summarized results for the model 2. The graph on the left is a candidate for
representing each cluster, constructed by the mean of reconstructed spectrum over an a posteriori
probability greater than 0.6 On the right side, we present the boxplot of the fat values in each
class, for observations with an a posteriori probability greater than 0.6.
We could discuss about those models. The first select only two clusters, but almost all variables,
whereas the second model has more clusters, and less variables: there is a trade-off between
clusters and variable selection for the dimension reduction.
110
3.5. TOOLS FOR PROOF
3.5
Tools for proof
In this section, we present the tools needed to understand the proof. First, we present a general
theorem for model selection in regression among a random collection. Then, in subsection 3.5.2,
we present the proof of this theorem, and in the next subsection we explain how we could use
the main theorem to get the oracle inequality. All details are available in Appendix.
3.5.1
General theory of model selection with the maximum likelihood estimator.
To get an oracle inequality for our clustering procedure, we have to use a general model selection
theorem. Because the model collection constructed by our procedure is random, because of the
Lasso estimator which select variables randomly, we have to generalize Cohen and Le Pennec
Theorem. Begin by some general model selection theory.
Before state the general theorem, begin by talking about the assumptions. We work here in a
more general context, (X, Y ) ∈ X × Y, and (Sm )m∈M defining a model collection indexed by M.
First, we impose a structural assumption on each model indexed by m ∈ M. It is a bracketing
entropy condition on the model Sm with respect to the Hellinger divergence, defined by
" n
#
X
1
n 2
(d⊗
d2H (s(.|xi ), t(.|xi )) .
H ) (s, t) = E
n
i=1
A bracket [l, u] is a pair of functions such that for all (x, y) ∈ X × Y, l(y, x) ≤ s(y|x) ≤ u(y, x).
n
The bracketing entropy H[.] (ǫ, S, d⊗
H ) of a set S is defined as the logarithm of the minimum
n
number of brackets [l, u] of width d⊗
H (l, u) smaller than ǫ such that every functions of S belong
to one of these brackets. It leads to the Assumption Hm .
Assumption Hm . There is a non-decreasing function φm such that ̟ 7→
increasing on (0, +∞) and for every ̟ ∈ R+ and every sm ∈ Sm ,
Z ̟q
n
H[.] (ǫ, Sm (sm , ̟), d⊗
H )dǫ ≤ φm (̟);
1
̟ φm (̟)
is non-
0
n
where Sm (sm , ̟) = {t ∈ Sm , d⊗
H (t, sm ) ≤ ̟}. The model complexity Dm is then defined as
2
n̟m with ̟m the unique root of
√
1
(3.5)
φm (̟) = n̟.
̟
Remark that the model complexity depends on the bracketing entropies not of the global models
Sm but of the ones of smaller localized sets. This is a weaker assumption.
For technical reason, a separability assumption is also required.
′
′
′
Assumption Sepm . There exists a countable subset Sm of Sm and a set Ym with λ(Y \Ym ) = 0
′
such that for every t ∈ Sm , there exists a sequence (tl )l≥1 of elements of Sm such that for every
′
x and every y ∈ Ym , log(tl (y|x)) goes to log(t(y|x)) as l goes to infinity.
This assumption leads to work with a countable family, which allows to cope with the randomness
of ŝm . We also need an information theory type assumption on our collection. We assume the
existence of a Kraft-type inequality for the collection.
Assumption K. There is a family (wm )m∈M of non-negative numbers such that
X
e−wm ≤ Ω < +∞.
m∈M
111
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
The difference with Cohen and Le Pennec’s Theorem is that we consider a random collection of
models M̌, included in the whole collection M. In our procedure, we deal with high-dimensional
models, and we cannot look after all the models: we have to restrict ourselves to a smaller
subcollection of models, which is then random. In the proof of the theorem, we have to be careful
with the recentred process of − log(s̄m /s∗ ). Because we conclude by taking the expectation, if
M is fixed, this term is non-interesting, but if we consider a random family, we have to use the
Bernstein inequality to control this quantity, and then we have to make the assumption (3.6).
Let state our main global theorem.
Theorem 3.5.1. Assume we observe (xi , yi )1≤i≤n with unknown conditional density s∗ . Let
the model collection S = (Sm )m∈M be at most countable collection of conditional density sets.
Assume Assumption K holds, while assumptions Hm and Sepm hold for every m ∈ M. Let
δKL > 0, and s̄m ∈ Sm such that
KL⊗n (s∗ , s̄m ) ≤ inf KL⊗n (s∗ , t) +
t∈Sm
δKL
;
n
and let τ > 0 such that
s̄m ≥ e−τ s∗ .
(3.6)
Introduce (Sm )m∈M̌ some random subcollection of (Sm )m∈M . Consider the collection (ŝm )m∈M̌
of η-log-likelihood minimizer in Sm , satisfying, for all m ∈ M̌,
!
n
n
X
X
− log(ŝm (yi |xi )) ≤ inf
− log(sm (yi |xi )) + η.
sm ∈Sm
i=1
i=1
Then, for any ρ ∈ (0, 1) and any C1 > 1, there are two constants κ0 and C2 depending only on
ρ and C1 such that, as soon as for every index m ∈ M,
(3.7)
pen(m) ≥ κ(Dm + (1 ∨ τ )wm )
with κ > κ0 , and where the model complexity Dm is defined in (3.5), the penalized likelihood
estimate ŝm̂ with m̂ ∈ M̌ such that
!
n
n
X
X
′
−
log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf −
log(ŝm (yi |xi )) + pen(m) + η
m∈M̌
i=1
i=1
satisfies
∗
n
E(JKL⊗
ρ (s , ŝm̂ ))
≤C1 E
inf
m∈M̌
+ C2 (1 ∨ τ )
⊗n
inf KL
t∈Sm
pen(m)
(s , t) + 2
n
Ω2 η ′ + η
+
.
n
n
∗
(3.8)
Obviously, one of the models minimizes the right hand side. Unfortunately, there is no way
to know which one without knowing s∗ . Hence, this oracle model can not be used to estimate
s∗ . We nevertheless propose a data-driven strategy to select an estimate among the collection
of estimates {ŝm }m∈M̌ according to a selection rule that performs almost as well as if we had
known this oracle, according to the absolute constant C1 . Using simply the log-likelihood of the
estimate in each model as a criterion is not sufficient. It is an underestimation of the true risk
of the estimate and this leads to select models that are too complex. By adding an adapted
penalty pen(m), one hopes to compensate for both the variance term and the bias term between
112
3.5. TOOLS FOR PROOF
P
−1/n ni=1 log (ŝm̂ (yi |xi )/s∗ (yi |xi )) and inf sm ∈Sm KL⊗n (s∗ , sm ). For a given choice of pen(m),
the best model Sm̂ is chosen as the one whose index is an almost minimizer of the penalized
η-log-likelihood.
Talk about the assumption (3.6). If s is bounded, with a compact support, this assumption is
satisfied. It is also satisfied in other cases, more particular. Then it is not a strong assumption,
but it is needed to control the random family.
This theorem is available for whatever model collection constructed, whereas assumptions Hm ,
K and Sepm are satisfied. In the following, we will use this theorem for the procedure we propose
to cluster high-dimensional data. Nevertheless, this theorem is not specific for our context, and
could be used whatever the problem.
Remark that the constant associated to the Assumption K appears squared in the bound. It is
due to the random subcollection M̌ of M, if the model collection is fixed, we get a linear bound.
Moreover, the weights wm appears linearly in the penalty bound.
3.5.2
Proof of the general theorem
For any model Sm , we have denoted by s̄m a function such that
KL⊗n (s∗ , s̄m ) ≤ inf KL⊗n (s∗ , sm ) +
sm ∈Sm
Fix m ∈ M such that KL⊗n (s∗ , s̄m ) < +∞. Introduce
(
δKL
.
n
′
pen(m )
M(m) = m ∈ M Pn (− log ŝm′ ) +
n
′
′
pen(m) η
≤ Pn (− log ŝm ) +
+
n
n
where Pn (g) = 1/n
)
;
Pn
We define the functions kl(s̄m ), kl(ŝm ) and jklρ (ŝm ) by
s̄
ŝm
m
;
;
kl(ŝm ) = − log
kl(s̄m ) = − log
∗
s
s∗
1
(1 − ρ)s∗ + ρŝm
jklρ (ŝm ) = − log
.
ρ
s∗
i=1 g(yi |xi ).
′
For every m ∈ M(m), by definition,
′
′
pen(m) + η
pen(m )
≤Pn (kl(ŝm )) +
Pn (kl(ŝm′ )) +
n
n
′
pen(m) + η + η
≤Pn (kl(s̄m )) +
.
n
Let νn⊗n (g) denote the recentred process Pn (g) − P ⊗n (g). By concavity of the logarithm,
kl(ŝm′ ) ≥ jklρ (ŝm′ ),
and then
P ⊗n (jklρ (ŝm′ )) − νn⊗n (kl(s̄m ))
≤P ⊗n (kl(s̄m )) +
′
′
η + η pen(m )
pen(m)
− νn⊗n (jklρ (ŝm′ )) +
−
,
n
n
n
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
113
which is equivalent to
pen(m)
− νn⊗n (jklρ (ŝm′ ))
n
′
′
η + η pen(m )
+
−
.
n
n
⊗n ∗
⊗n
∗
n
(s , s̄m ) +
JKL⊗
ρ (s , ŝm′ ) − νn (kl(s̄m )) ≤ KL
(3.9)
Mimic the proof as done in Cohen and Le Pennec [CLP11], we could obtain that except on a
set of probability less than e−wm′ −w , for all w, for all zm′ > σm′ , there exist absolute constants
′
′
′
κ0 , κ1 , κ2 such that
s
′
wm′ + w 18 wm′ + w
κ1 σ m ′
−νn⊗n (jklρ (ŝm′ ))
′
+ κ2
≤
+
.
(3.10)
′
⊗n 2 ∗
2
zm′
ρ nz2 ′
nz2m′
z ′ + κ0 (dH ) (s , ŝm′ )
m
m
To obtain this inequality we use the hypothesis Sepm and Hm . This control is derived from
maximal inequalities, described in [Mas07].
Our purpose is now to control νn⊗n (kl(s̄m )). This is the difference with the Theorem of Cohen
and Le Pennec: we work with a random subcollection ML of M.
By definition of kl and νn⊗n ,
" n
#
n
X
X
1
s̄
(y
|x
)
s̄
(Y
|X
)
1
m
i
i
m
i
i
+E
.
log
log
νn⊗n (kl(s̄m )) = −
n
s∗ (yi |xi )
n
s∗ (Yi |Xi )
i=1
i=1
We want to apply Bernstein’s inequality, which is recalled
in Appendix.
s̄m (Yi |Xi )
1
If we denote by Zi the random variable Zi = − n log s∗ (Yi |Xi ) , we get
νn⊗n (kl(s̄m ))
=
n
X
i=1
(Zi − E(Zi )).
We need to control the moments of Zi to apply Bernstein’s inequality.
Lemme 3.5.2. Let s∗ and s̄m two conditional
with respect to the Lebesgue measure.
densities
s∗
≤ τ . Then,
Assume that there exists τ > 0 such that log
s̄m
n
E
1X
n
i=1
Z
Rq
log
s∗ (y|xi )
s̄m (y|xi )
2
∞
∗
s (y|xi )dy
!
≤
e−τ
τ2
KL⊗n (s∗ , s̄m ).
+τ −1
We prove this lemma in Appendix 3.6.2.
2
2
Because e−τ τ+τ −1 ∼ τ , there exists A such that e−τ τ+τ −1 ≤ 2τ for all τ ≥ A. For τ ∈ (0, A],
τ →∞
because this function is continuous
to 2 in 0, there exists B > 0 such that
Pn and2 equivalent
τ2
1
≤ B. We obtain that i=1 E(Zi ) ≤ n δ(1 ∨ τ ) KL⊗n (s∗ , s̄m ), where δ = 2 ∨ B.
e−τ +τ −1
Moreover, for all integers K ≥ 3,
n
X
i=1
E((Zi )K
+)
K
Z ∗
n
X
s (y|xi )
1
log
≤
s∗ (y|xi )dy
n K Rq
s̄m (y|xi )
+
i=1
∗
∗
K−2
Z
s (y|x)
s (y|x) 2
n
log
log
1s∗ ≥s̄m (y|x) s∗ (y|x)dy
≤ K
n
s̄
(y|x)
s̄
(y|x)
q
m
m
R
n K−2
⊗n ∗
δ(1 ∨ τ ) KL (s , s̄m ).
≤ Kτ
n
114
3.5. TOOLS FOR PROOF
Assumptions of Bernstein’s inequality are then satisfied, with
δ(1 ∨ τ ) KL⊗n (s∗ , s̄m )
,
n
v=
c=
τ
,
n
thus, for all u > 0, except on a set with probability less than e−u ,
√
νn⊗n (kl(s̄m )) ≤ 2vu + cu.
Thus, for all z > 0, for all u > 0, except on a set with probability less than e−u ,
√
√
cu
2vu + cu
vu
νn⊗n (kl(s̄m ))
+ 2.
≤ 2
≤ p
⊗n ∗
⊗n ∗
2
⊗
n
∗
z + KL (s , s̄m )
z + KL (s , s̄m )
z 2 KL (s , s̄m ) z
(3.11)
We apply this bound to u = w + wm + wm′ . We get that, except on a set with probability less
than e−(w+wm +wm′ ) , using that a2 + b2 ≥ a2 , from the inequality (3.10),
κ′1 + κ′2
18
⊗n 2 ∗
⊗n
2
′
+ 2
,
−νn (jklρ (ŝm′ )) ≤ zm′ + κ0 (dH ) (s , ŝm′ )
θ
θ ρ
and, from the inequality (3.11),
where we have chosen
2
⊗n
(s, sm ) ,
νn⊗n (kl(s̄m )) ≤ (β + β 2 ) zm,m
′ + KL
zm′ = θ
with θ > 1 to fix later, and
zm,m′ = β
−1
s
r
2 +
σm
′
w m′ + w
,
n
v
+ c (w + wm + wm′ ),
2 KL⊗n (s∗ , s̄m )
with β > 0 to fix later.
Coming back to the inequality (3.9),
⊗n ∗
∗
n
(s , s̄m ) +
JKL⊗
ρ (s , ŝm′ ) ≤ KL
pen(m)
n
∗
n 2
κ′0 (d⊗
H ) (s , ŝm′ ))
κ′1 + κ′2
18
+ 2
θ
θ ρ
+
(z2m′
+
η ′ + η pen(m′ )
2
⊗n ∗
(s , s̄m )).
−
+ (β + β 2 )(zm,m
′ + KL
n
n
+
Recall that s̄m is chosen such that
KL⊗n (s∗ , s̄m ) ≤ inf KL⊗n (s∗ , sm ) +
sm ∈Sm
Put κ(β) = 1 + (β + β 2 ), and let ǫ1 > 0, we define θ1 by κ′0
defined by
∗
n 2
Cρ (d⊗
H ) (s , ŝm′ )
∗
n
≤ JKL⊗
ρ (s , ŝm′ ), and put κ2 =
⊗n ∗
∗
n
(s , sm ) +
(1 − ǫ1 ) JKL⊗
ρ (s , ŝm′ ) ≤κ(β) KL
+ κ(β)
δKL
.
n
κ′1 +κ′2
θ1
Cρ ǫ 1
κ0 .
+
18
θ12 ρ
= Cρ ǫ1 where Cρ is
We get that
pen(m) pen(m′ )
−
n
n
δKL η ′ + η
2
+
+ z2m′ κ2 + (β + β 2 )zm,m
′.
n
n
115
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
Since τ ≤ 1 ∨ τ , if we choose β such that (β + β 2 )(δ/2 + 1) = αθ1−2 β −2 , and if we put
κ1 = αγ −2 (β −2 + 1), since 1 ≤ 1 ∨ τ , using the expressions of zm′ and zm,m′ , we get that
⊗n ∗
∗
n
(s , sm ) +
(1 − ǫ1 ) JKL⊗
ρ (s , ŝm′ ) ≤κ(β) KL
pen(m) pen(m′ )
−
n
n
δKL η ′ + η
+
n
n
w
+ w m′
w + w m + w m′
2
2
+ κ2 θ 1 σ m ′ +
+ κ1 (1 ∨ τ )
n
n
wm
pen(m)
⊗n ∗
+ κ1 (1 ∨ τ )
≤κ(β) KL (s , sm ) +
n
n
′
pen(m′ )
w
w m′
m
2
2
+ −
+ κ2 θ 1 σ m ′ +
+ κ1 (1 ∨ τ )
n
n
n
′
w
δKL η + η
+
+ (κ2 θ12 + κ1 (1 ∨ τ )) .
+
n
n
n
+ κ(β)
Now, assume that κ1 ≥ κ in inequality (3.7), we get
pen(m) δKL η + η ′
⊗n ∗
∗
n
(s , sm ) + 2
(1 − ǫ1 ) JKL⊗
+
+
ρ (s , ŝm′ ) ≤κ(β) KL
n
n
n
w
2
+ (κ2 θ1 + κ1 (1 ∨ τ )) .
n
It only remains to sum up the tail bounds over all the possible values of m ∈ M and m′ ∈ M(m)
by taking the union of the different sets of probability less than e−(w+wm +wm′ ) ,
X
X
e−(w+wm +wm′ ) ≤ e−w
e−(wm +wm′ )
(m,m′ )∈M×M
m∈M
m′ ∈M(m)
= e−w
X
m∈M
e−wm
!2
= Ω2 e−w
from the Assumption K.
We then have simultaneously for all m ∈ M, for all m′ ∈ M(m), except on a set with probability
less than Ω2 e−w ,
pen(m) δKL
+
n
n
w
η + η′
2
+ κ2 θ1 + κ1 (1 ∨ τ )
.
+
n
n
⊗n ∗
∗
n
(s , sm ) + 2
(1 − ǫ1 ) JKL⊗
ρ (s , ŝm′ ) ≤κ(β) KL
It is in particular satisfied for all m ∈ M̌ and m′ ∈ M̌(m), and, since m̂ ∈ M̌(m) for all m ∈ M̌,
we deduce that except on a set with probability less than Ω2 e−w ,
1
pen(m)
⊗n ∗
⊗n ∗
JKLρ (s , ŝm̂ ) ≤
× inf κ(β) KL (s , sm ) + 2
(1 − ǫ1 )
n
m∈M̌
′
w
δKL η + η
2
+
+ κ2 θ1 + κ1 (1 ∨ τ )
+
.
n
n
n
116
3.5. TOOLS FOR PROOF
By integrating
over all w > 0, because for any non negative random variable Z and any a > 0,
R
E(Z) = a z≥0 P (Z > az)dz, we obtain that
∗
n
E JKL⊗
ρ (s , ŝm̂ ) −
1
(1 − ǫ1 )
Ω2
.
≤ κ2 θ12 + κ1 (1 ∨ τ )
n
inf
m∈M̌
κ(β) KL⊗n (s∗ , sm ) + 2
pen(m)
n
+
δKL + η + η ′
κ0 θ 2
n
As δKL can be chosen arbitrary small, this implies that
1
pen(m)
E(JKL⊗n (s∗ , ŝm̂ )) ≤
E inf κ(β) KL⊗n (s∗ , sm ) +
1 − ǫ1
n
m∈M̌
2
′
Ω
η+η
+ (κ2 θ12 + κ1 (1 ∨ τ ))
+
n
n
pen(m)
⊗n ∗
≤C1 E inf
inf KL (s , t) +
n
m∈M̌ t∈Sm
2
′
Ω
η +η
+ C2 (1 ∨ τ )
+
n
n
with C1 =
3.5.3
2
1−ǫ1
and C2 = κ2 θ12 + κ1 .
Sketch of the proof of the oracle inequality 3.3.2
To prove the Theorem 3.3.2, we have to apply the Theorem 3.5.1. Then, our model collection
has to satisfy all the assumptions. Here, m = (K, J). The Assumption Sepm is true when we
consider Gaussian densities. If s∗ is bounded, with compact support, the assumption (3.6) is
satisfied. It is also true in other particular cases. We have to look after assumption Hm and
Assumption K. Here we present only the main step to prove these assumptions. All the details
are in Appendix.
Assumption Hm
R̟q
n
H[.] (ǫ, Sm , d⊗
We could take φm (̟) = 0
H )dǫ for all ̟ > 0. It could be better to consider
more local version of the integrated square root entropy, but the global one is enough in this
case to define the penalty. As done in Cohen and Le Pennec [CLP11], we could decompose the
entropy by
⊗n
⊗n
B
n
H[.] (ǫ, S(K,J)
, d⊗
H ) ≤ H[.] (ǫ, ΠK , dH ) + KH[.] (ǫ, FJ , dH )
where
B
S(K,J)
ΠK
P
(K,J)
[J]
y ∈ Rq |x ∈ Rp 7→ sξ
(y|x) = K
πk ϕ(y|βk x, Σk )
k=1
o
n
[J]
[J]
=
ξ = π1 , . . . , πK , β1 , . . . , βK , Σ1 , . . . , ΣK ∈ Ξ̃(K,J)
Ξ̃(K,J) = ΠK × ([−Aβ , Aβ ]q×p )K × ([aΣ , AΣ ]q )K
(
)
K
X
= (π1 , . . . , πK ) ∈ (0, 1)K ;
πk = 1
n
k=1
FJ = ϕ(.|β [J] X, Σ); β ∈ [−Aβ , Aβ ]q×p , Σ = diag([Σ]1,1 , . . . , [Σ]q,q ) ∈ [aΣ , AΣ ]q
where ϕ denote the Gaussian density, and Aβ , aΣ , AΣ are absolute constants.
o
117
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
Calculus for the proportions We could apply a result proved by Wasserman and Genovese
in [GW00] to bound the entropy for the proportions. We get that
K−1 !
⊗n
K/2 3
.
H[.] (ǫ, ΠK , dH ) ≤ log K(2πe)
ǫ
Calculus for the Gaussian The family
2
l(y, x) = (1 + δ)−p q−3q/4 ϕ(y|νJ x, (1 + δ)−1/4 B [1] )
2
u(y, x) = (1 + δ)p q+3q/4 ϕ(y|νJ x, (1 + δ)B [2] )
[a]
B =diag(bi(1) , . . . , bi(q) ), with i a permutation, for a ∈ {1, 2},
Bǫ (FJ ) =
bl = (1 + δ)1−l/2 AΣ , l ∈ {2, . . . , N }
and
∀(z, j) ∈ J c , νz,j = 0
√
∀(z, j) ∈ J, νz,j = cδAΣ uz,j
(3.12)
is an ǫ-bracket covering for FJ , where uz,j is a net for the mean, N is the number of parameters
−1/4 )
1
ǫ, and c = 5(1−28
needed to recover all the variance set, δ = √2(p2 q+3/4q)
We obtain that
2Aβ |J| AΣ 1 −1−|J|
√
+
δ
;
|Bǫ (FJ )| ≤ 2
aΣ
2
cAΣ
.
and then we get
n
H[.] (ǫ, FJ , d⊗
H )
≤ log 2
2Aβ
√
cAΣ
|J|
AΣ 1
+
aΣ
2
Proposition 3.5.3. Put D(K,J) = K(1 + |J|). For all ǫ ∈ (0, 1),
B
n
H[.] (ǫ, S(K,J)
, d⊗
H ) ≤ log(C) + D(K,J) log
with
C = 2K(2πe)K/2
2Aβ
√
cAΣ
K|J|
3K−1
δ −1−|J|
(̟)
.
1
;
ǫ
AΣ 1
+
aΣ
2
Determination of a function φ We could take
s
"
q
φ(K,J) (̟) = D(K,J) ̟ B(Aβ , AΣ , aΣ ) + log
φ
!
K
.
1
̟∧1
#
.
This function is non-decreasing, and ̟ 7→ (K,J)
is non-increasing.
̟
√ 2
. With the expression of φ(K,J) ,
The root ̟(K,J) is the solution of φ(K,J) (̟(K,J) ) = n̟(K,J)
we get
s
"
r
#
D
1
(K,J)
2
̟ B(Aβ , AΣ , aΣ ) + log
̟(K,J)
=
.
n
̟(K,J) ∧ 1
q
D(K,J)
∗
Nevertheless, we know that ̟ =
n B(Aβ , AΣ , aΣ ) minimizes ̟(K,J) : we get
"
!#
D(K,J)
1
2
2
̟(K,J) ≤
.
2B (Aβ , AΣ , aΣ ) + log D
(K,J)
n
B 2 (Aβ , AΣ , aΣ ) ∧ 1
n
118
3.6. APPENDIX: TECHNICAL RESULTS
Assumption K
We want to group models by their dimension.
Lemme 3.5.4. The quantity card{(K, J) ∈ N∗ × P({1, . . . , q} × {1, . . . , p}), D(K, J) = D} is
upper bounded by
2pq if pq ≤ D − q 2
D−q2
epq
otherwise.
2
D−q
Proposition 3.5.5. Consider the weight family {w(K,J) }(K,J)∈K×J defined by
w(K,J) = D(K,J) log
Then we have
3.6
P
(K,J)∈K×J
4epq
(D(K,J) − q 2 ) ∧ pq
.
e−w(K,J) ≤ 2.
Appendix: technical results
In this appendix, we give more details for the proofs.
3.6.1
Bernstein’s Lemma
Lemme 3.6.1 (Bernstein’s inequality). Let (X1 , . . . , Xn ) be independent real
Pnvalued random
variables. Assume that therePexists some positive numbers v and P
c such that i=1 E(Xi2 ) ≤ v,
n
K! K−2
K
and, for all integers K ≥ 3, i=1 E((Xi )+ ) ≤ 2 vc
. Let S = ni=1 (Xi − E(Xi )). Then, for
every positive x,
√
P (S ≥ 2vx + cx) ≤ exp(−x).
3.6.2
Proof of Lemma 3.5.2
This proof is adapted from Maugis and Meynet, [MMR12]. First, let give some bounds.
Lemme 3.6.2. Let τ > 0. For all x > 0, consider
f (x) = x log(x)2 ,
h(x) = x log(x) − x + 1,
Then, for all 0 < x < eτ , we get
f (x) ≤
φ(x) = ex − x − 1.
τ2
h(x).
φ(−τ )
φ(y)
y2
is non-decreasing. We omit the proof here.
∗
s
We want to apply this inequality, in order to derive the Lemma 3.5.2. As log
≤ τ,
s̄m
To prove this, we have to show that y 7→
∞
s∗
s̄m
∞
≤ eτ ;
and we could apply the previous inequality to s∗ /s̄m . Indeed, for all x, for all y,
∗
∗
τ2
s (y|x)
s (y|x)
h
≤
.
f
s̄m (y|x)
φ(−τ )
s̄m (y|x)
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
119
Integrating with respect to the density s̄m , we get that
∗
s (y|.) 2
s∗ (y|.)
log
s̄m (y|.)dy
s̄m (y|.)
Rq s̄m (y|.)
∗
Z
τ2
s∗ (y|.)
s∗ (y|.)
s (y|.)
≤
log
−
+
1
s̄m (y|.)dy
−τ − τ − 1
s̄m (y|.)
s̄m (y|.) s̄m (y|.)
Rq e
∗
n Z
s (y|xi ) 2
1X
∗
s (y|xi ) log
dy
⇐⇒
n
s̄m (y|xi )
i=1
n Z
τ2
1X
s∗ (y|xi )
≤ −τ
dy.
s∗ (y|xi ) log
e −τ −1n
s̄m (y|xi )
Z
i=1
It concludes the proof.
3.6.3
Determination of a net for the mean and the variance
In this subsection, we work with a Gaussian density, then β ∈ Rq×p and Σ ∈ S++
q .
— Step 1: construction of a net for the variance
j
Let ǫ ∈ (0, 1], and δ = √2(p21q+ 3 q) ǫ. Let bj = (1 + δ)1− 2 AΣ . For 2 ≤ j ≤ N , we have
S
S 4
[aΣ , AΣ ] = [bN , bN −1 ] . . . [b3 , b2 ], where N is chosen to recover everything. We want
that
aΣ = (1 + δ)1−N/2 AΣ
aΣ
N
log
= 1−
log(1 + δ)
AΣ
2
√
2 log( AaΣΣ 1 + δ)
.
N=
log(1 + δ)
⇔
⇔
We want N to be an integer, then N =
A
2 log( a Σ
Σ
√
1+δ)
log(1+δ)
. We get a net for the variance.
We could let B = diag(bi(1) , . . . , bi(q) ), close to Σ (and deterministic, independent of
the values of Σ), where i is a permutation such that bi(z)+1 ≤ [Σ]z,z ≤ bi(z) for all
b
√1
z ∈ {1, . . . , q}. Remember that j+1
bj = 1+δ .
— Step 2: construction of a net for the mean vectors
We select only the relevant variables detected by the Lasso estimator. For λ ≥ 0,
n
o
Lasso
Jλ = (z, j) ∈ {1, . . . , q} × {1, . . . , p}|β̂z,j
(λ) 6= 0 .
Let f = ϕ(.|βx, Σ) ∈ FJ .
— Definition of the brackets
Define the bracket by the functions l and u:
2
l(y, x) = (1 + δ)−p q−3q/4 ϕ y|νJ x, (1 + δ)−1/4 B [1] ;
2
u(y, x) = (1 + δ)p q+3q/4 ϕ y|νJ x, (1 + δ)B [2] .
We have chosen i such that [B [1] ]z,z ≤ Σz,z ≤ [B [2] ]z,z for all z ∈ {1, . . . , q}.
We need to define ν such that [l, u] is an ǫ-bracket for f .
120
3.6. APPENDIX: TECHNICAL RESULTS
— Proof that [l, u] is a bracket for f
We are looking for a condition on νJ to have fu ≤ 1 and fl ≤ 1.
We will use the following lemma to compute these ratios.
Lemme 3.6.3. Let ϕ(.|µ1 , Σ1 ) and ϕ(.|µ2 , Σ2 ) be two Gaussian densities. If their
variance matrices are assumed to be diagonal, with Σa = diag([Σa ]1,1 , . . . , [Σa ]q,q ) for
a ∈ {1, 2}, such that [Σ2 ]z,z > [Σ1 ]z,z > 0 for all z ∈ {1, . . . , q}, then, for all y ∈ Rq ,
q p
1
1
(µ1 −µ2 )
,..., [Σ ] −[Σ
ϕ(y|µ1 , Σ1 ) Y [Σ2 ]z,z 21 (µ1 −µ2 )t diag [Σ2 ]1,1 −[Σ
]
]
q,q
q,q
1 1,1
2
1
p
≤
e
.
ϕ(y|µ2 , Σ2 )
[Σ1 ]z,z
z=1
For the ratio
f
u
we get:
ϕ(y|βx, Σ)
1
f (y|x)
=
2 q+3q/4
p
u(y, x) (1 + δ)
ϕ(y|νJ x, (1 + δ)B [2] )
q
Y
1
bz
1
t
[2]
−1
≤
(1 + δ)q/2 × e 2 (βx−νJ x) ((1+δ)B −Σ) (βx−νJ x)
2 q+3q/4
p
[Σ]z,z
(1 + δ)
z=1
≤(1 + δ)
p2 q−q/4
1
(1 + δ)q/4 e 2 (βx−νJ x)
1
2
≤(1 + δ)p q e 2δ (βx−νJ x)
For the ratio
l
f
t [B [2] ]−1 (βx−ν
t (δB [2] )−1 (βx−ν
J x)
J x)
(3.13)
.
we get:
ϕ(y|νJ x, (1 + δ)−1/4 B [1] )
l(y, x)
1
=
2
f (y|x) (1 + δ)p q+3q/4
ϕ(y|βx, Σ)
q
Y
1
Σz,z
1
t
[1] −1
(1 + δ)q/8 × e 2 (βx−νJ x) (Σ−B ) (βx−νJ x)
≤
2 q+3q/4
p
bz
(1 + δ)
z=1
≤(1 + δ)
−p2 q−3q/8
≤(1 + δ)−p
2 q−3q/8
1
(1 + δ)q/4 × e 2 (βx−νJ x)
1
e 2(1−(1+δ)−1/4 )
t ((1−(1+δ)−1/4 )B [1] )−1 (βx−ν
(βx−νJ x)t [B [1] ]−1 (βx−νJ x)
We want to bound the ratios (3.13) and (3.14) by 1. Put c =
these calculus. A necessary condition to obtain this bound is
.
5(1−2−1/4 )
,
8
||βx − νJ x||22 ≤ pqδ 2 (1 − 2−1/4 )A2Σ .
Indeed, we want
(1 + δ)−p
2 q−3q/8
1
e 2(1−(1+δ)−1/4 )
(1 + δ)
−p2 q
e
(βx−νJ x)t [B [2] ]−1 (βx−νJ x)
1
(βx−νJ x)t [B [1] ]−1 (βx−νJ x)
2δAΣ
≤1
≤ 1;
which is equivalent to
δ2
||βx − νJ x||22 ≤ p2 q A2Σ ;
2
3
2
2
||βx − νJ x||2 ≤ p q + q δ 2 (1 − 2−1/4 )AΣ .
4
As ||βx − νJ x||22 ≤ p||β − νJ ||22 ||x||∞ , and x ∈ [0, 1]p , we need to get
||β − νJ ||22 ≤ pqδ 2 (1 − 2−1/4 )A2Σ to have the wanted bound. Put
Aβ
−Aβ
, √
.
U := Z ∩ √
cδAΣ
cδAΣ
J x)
(3.14)
and develop
121
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
For all (z, j) ∈ J, choose
uz,j = argmin βz,j −
vz,j ∈U
√
cδAΣ vz,j .
(3.15)
Define ν by
for all (z, j) ∈ J c , νz,j = 0;
√
for all (z, j) ∈ J , νz,j = cδAΣ uz,j .
Then, we get a net for the mean vectors.
— Proof that dH (l, u) ≤ ǫ
We will work with the Hellinger distance.
d2H (l, u)
=
=
=
=
−
Z √
√
1
( l − u)2
2 Rq
Z
√
1
l + u − 2 lu
2 Rq
i Z √
1h
2
2
(1 + δ)−p q−3q/4 + (1 + δ)p q+3q/4 −
ϕ l ϕu
2
Rq
i
1h
2
2
(1 + δ)−p q−3q/4 + (1 + δ)p q+3q/4
2
2 !1/2
q p
Y
2bi(z)+1 bi(z) (1 + δ)1/2 (1 + δ)−1/8
∗ 1.
(1 + δ)bi(z)+1 + (1 + δ)−1/4 bi(z)
z=1
We have used the following lemma:
Lemme 3.6.4. The Hellinger distance of two Gaussian densities with diagonal variance matrices is given by the following expression:
d2H (ϕ(.|µ1 , Σ1 ), ϕ(.|µ2 , Σ2 ))
!1/2
p
q
Y
2 [Σ1 ]z,z [Σ2 ]z,z
=2 − 2
[Σ1 ]z,z + [Σ2 ]z,z
z=1
(
!
)
1
1
× exp − (µ1 − µ2 )t diag
(µ1 − µ2 )
4
[Σ1 ]2z,z + [Σ2 ]2z,z z∈{1,...,q}
As bi(z)+1 = (1 + δ)−1/2 bi(z) , we get that
2
bi(z)+1
(1 + δ)3/8 bi(z)
(1 + δ)5/8
(1 + δ)−1/4 + (1 + δ)3/2
(1 + δ)−1/4 + (1 + δ)1/2 (1 + δ)
2
.
=
(1 + δ)−7/8 + (1 + δ)7/8
=2
Then
d2H (l, u)
i
1h
2
2
= (1 + δ)−(p q+3q/4) + (1 + δ)p q+3q/4 −
2
2
−7/8
(1 + δ)
+ (1 + δ)7/8
= cosh((p2 q + 3q/4) log(1 + δ)) − 2 cosh(7/8 log(1 + δ))−q/2
q/2
= cosh((p2 q + 3q/4) log(1 + δ)) − 1 + 1 − 2−q/2 cosh(7/8 log(1 + δ))−q/2 .
122
3.6. APPENDIX: TECHNICAL RESULTS
We want to apply the Taylor formula to f (x) = cosh(x) − 1 to obtain an upper bound,
and to g(x) = 1 − 2−q/2 cosh(x)−q/2 . Indeed, there exists c such that, on the good
2
2
interval, f (x) ≤ cosh(c) x2 and g(x) ≤ q 2 x2 . Then, and because log(1 + δ) ≤ δ,
d2H (l, u) ≤ cosh((p2 q + 3q/4) log(1 + δ)) − 2 cosh(7/8 log(1 + δ))−q/2
49
2
2 2
≤ (p q + 3q/4) δ cosh(α) +
128
≤ 2(p2 q + 3q/4)2 δ 2 ≤ ǫ2 ;
√
where ǫ ≥ 2(p2 q + 34 q)δ.
— Step 3: Upper bound of the number of ǫ-brackets for FJ .
From Step 1 and Step 2, the family
−(p2 q+3q/4) ϕ(y|ν x, (1 + δ)−1/4 B [1] )
l(y,
x)
=
(1
+
δ)
J
p2 q+3q/4 ϕ(y|ν x, (1 + δ)B [2] )
u(y,
x)
=
(1
+
δ)
J
B [a] = diag(b[a] , . . . , b[a] ) where i is a permutation, for a ∈ {1, 2},
a
i(1)
i(q)
Bǫ (FJ ) =
[a]
bi(z) = (1 + δ)1−ia (z)/2 AΣ for all z ∈ {1, . . . , q}
c
with
∀(z,
j)
∈
J
,
ν
=
0
z,j
√
∀(z, j) ∈ J, ν = cδA u
z,j
Σ z,j
(3.16)
is an ǫ-bracket for FJ , for uz,j defined by (3.15). Therefore, an upper bound of the number
of ǫ-brackets necessary to cover FJ is deduced from an upper bound of the cardinal of
Bǫ (FJ ).
|J|
N
N
X
Y 2Aβ 2Aβ |J| X
2Aβ
√
√
√
≤
(N − 1).
|Bǫ (FJ )| ≤
1≤
cδAΣ
cδAΣ
cδAΣ
l=2
l=2 (z,j)∈J
As N ≤
2(AΣ /aΣ +1/2)
,
δ
we get
|Bǫ (FJ )| ≤ 2
3.6.4
2Aβ
√
cAΣ
|J|
AΣ 1
+
aΣ
2
δ −1−|J| .
Calculus for the function φ
From the Proposition 3.5.3, we obtain, for all ̟ > 0,
Z
̟
0
q
n
B
H[.] (ǫ, S(K,J)
, d⊗
H )dǫ
We need to control
R̟q
log
0
1
ǫ
0
̟∧1
0
s
1
dǫ
log
ǫ
(3.17)
dǫ, which is done in Maugis-Rabusseau and Meynet ([MMR12]).
Lemme 3.6.5. For all ̟ > 0,
Z s
̟
Z
q
p
≤ ̟ log(C) + D(K,J)
s #
"
√
1
1
log
dǫ ≤ ̟
π + log
.
ǫ
̟
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
123
Then, according to (3.17),
Z
̟
0
q
n
B
H[.] (ǫ, S(K,J)
, d⊗
H )dǫ
s
"
q
p
√
≤̟ log(C) + D(K,J) (̟ ∧ 1)
π + log
"s
q
≤̟ D(K,J)
log(C) √
+ π+
D(K,J)
s
log
1
̟∧1
1
̟∧1
#
Nevertheless,
K
log(C) ≤ log(2) + log(K) +
log(2πe)
2
2Aβ
AΣ 1
+
+ K|J| log √
+ (K − 1) log(3)
+ K log
aΣ
2
cAΣ
"
√
≤D(K,J) log(2) + log( 2πe) + 1 + log(3)
2Aβ
AΣ 1
+
+ log √
+ log
aΣ
2
cAΣ
r
Aβ AΣ 1
πe
5/2
≤D(K,J) 1 + log
+
+ log 2 3
.
AΣ a Σ
2
c
Then
Z
q
B
n
H[.] (ǫ, S(K,J)
, d⊗
H )dǫ
0
s
r
q
A
πe
A
1
β
Σ
5/2
+
+ log 2 3
≤̟ D(K,J) 1 + log
AΣ AΣ 2
c
s
#
√
1
+ π + log
̟∧1
s
s
"
#
q
Aβ AΣ 1
1
+ a + log
+
≤̟ D(K,J) 1 + log
AΣ a Σ
2
̟∧1
s
"
#
q
1
≤̟ D(K,J) B(Aβ , AΣ , aΣ ) + log
;
̟∧1
̟
with
B(Aβ , AΣ , aΣ ) = 1 +
s
q
p
log(25/2 3 πe
c ).
and a =
√
3.6.5
Proof of the Proposition 3.5.5
π+
log
We are interested in
P
(K,J)∈K×J
Aβ
AΣ
AΣ 1
+
aΣ
2
+ a;
e−w(K,J) . Considering
w(K,J) = D(K,J) log
4epq
(D(K,J) − q 2 ) ∧ pq
,
#
124
3.6. APPENDIX: TECHNICAL RESULTS
we could group models by their dimensions to compute this sum. Denote by CD the cardinal of
models of dimension D.
X
e
−D(K,J) log
4epq
(D(K,J) −q 2 )∧pq
K∈N∗
J∈P({1,...,q}×{1,...,p})
=
2
pq+q
X
e
−D log
=
−D
4
D=1
pq+q 2
≤
3.6.6
X
−D
2
+
=
X
CD e
−D log
4epq
(D−q 2 )∧pq
D≥1
4epq
(D−q 2 )
D=1
2
pq+q
X
epq
D − q2
epq
D − q2
−q2
+∞
X
+
D−q2
+∞
X
+
+∞
X
e
−D log
4epq
pq
2pq
D=pq+q 2 +1
e−D(log(4)+1)+pq log(2)
D=pq+q 2 +1
2−D = 2.
D=pq+q 2 +1
D=1
Proof of the Lemma 3.5.4
We know that D(K,J) = K − 1 + |J|K + Kq 2 . Then,
CD = card{(K, J) ∈ N∗ × P({1, . . . , q} × {1, . . . , p}), D(K, J) = D}
X X pq
≤
1
2
|J| K(|J|+q +1)−1=D
∗ 1≤z≤q
K∈N
1≤j≤p
X pq
≤
1
2 .
|J| |J|≤pq∧(D−q )
∗
|J|∈N
If pq < D − q 2 ,
X pq
pq
1
2 = 2 .
|J| |J|≤pq∧(D−q )
|J|>0
Otherwise, according to the Proposition 2.5 in Massart, [Mas07],
X pq
2
1
2 ≤ f (D − q )
|J| |J|≤pq∧(D−q )
|J|>0
where f (x) = (epq/x)x is an increasing function on {1, . . . , pq}. As pq is an integer, we get the
result.
125
CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE
PROCEDURE
3.6. APPENDIX: TECHNICAL RESULTS
126
Chapter 4
An oracle inequality for the
Lasso-Rank procedure
Contents
4.1
4.2
Introduction . . . . . . . . . . . . . . . . . . . .
The model and the model collection . . . . . .
4.2.1 Linear model . . . . . . . . . . . . . . . . . . .
4.2.2 Mixture model . . . . . . . . . . . . . . . . . .
4.2.3 Generalized EM algorithm . . . . . . . . . . . .
4.2.4 The Lasso-Rank procedure . . . . . . . . . . .
4.3 Oracle inequality . . . . . . . . . . . . . . . . . .
4.3.1 Framework and model collection . . . . . . . .
4.3.2 Notations . . . . . . . . . . . . . . . . . . . . .
4.3.3 Oracle inequality . . . . . . . . . . . . . . . . .
4.4 Numerical studies . . . . . . . . . . . . . . . . .
4.4.1 Simulations . . . . . . . . . . . . . . . . . . . .
4.4.2 Illustration on a real dataset . . . . . . . . . .
4.5 Appendices . . . . . . . . . . . . . . . . . . . . .
4.5.1 A general oracle inequality for model selection
4.5.2 Assumption Hm . . . . . . . . . . . . . . . . .
Decomposition . . . . . . . . . . . . . . . . . .
For the Gaussian . . . . . . . . . . . . . . . . .
For the mixture . . . . . . . . . . . . . . . . . .
4.5.3 Assumption K . . . . . . . . . . . . . . . . . .
127
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. . . . . 117
. . . . . 119
. . . . . . 119
. . . . . . 119
. . . . . . 120
. . . . . . 121
. . . . . 122
. . . . . . 122
. . . . . . 123
. . . . . . 123
. . . . . 125
. . . . . . 125
. . . . . . 126
. . . . . 126
. . . . . . 126
. . . . . . 128
. . . . . . 128
. . . . . . 129
. . . . . . 132
. . . . . . 134
128
4.1. INTRODUCTION
In this chapter, we focus on a theoretical result for the Lasso-Rank
procedure. Indeed, we get the same kind of results as in the previous chapter, with rank constraint on the estimator. We get a
theoretical penalty for which the model selected by a penalized criterion, among the collection, satisfies an oracle inequality. We also
illustrate in more details benefits of this procedure with simulated
and benchmark dataset including rank structure.
4.1
Introduction
The multivariate response regression model
Y = βX + ǫ
postulates a linear relationship between Y , the q×n matrix containing q responses for n subjects,
and X, the p×n matrix on p predictor variables. The term ǫ is an q ×n matrix with independent
columns, ǫi ∼ Nq (0, Σ) for all i ∈ {1, . . . , n}. The unknown q × p coefficient matrix β needs to
be estimate.
In a more general way, we could use finite mixture of linear model, which models the relationship between response and predictors, arising from different subpopulations: if the variable Y ,
conditionally to X, belongs to the cluster k, there exists βk ∈ Rq×p and Σk ∈ S++
such that
q
Y = βk X + ǫ, with ǫ ∼ Nq (0, Σk ).
If we use this model to deal with high-dimensional data, the number of variables could be
quickly much larger than the sample size, because and predictors and response variables could
be high-dimensional. To solve this problem, we have to reduce the parameter set dimension.
One way to cope the dimension problem is to select relevant variables, in order to reduce the
number of unknowns. Indeed, all the information should not be interesting for the clustering,
and could even be harmful. In a density estimation way, we could cite Pan and Shen, in
[PS07], who focus on mean variable selection, Meynet and Maugis in [MMR12] who extend their
procedure in high-dimension, Zhou et al., in [ZPS09], who use the Lasso estimator to regularize
Gaussian mixture model with general covariance matrices, Sun et al., in [SWF12], who propose
to regularize the k-means algorithm to deal with high-dimensional data, Guo et al, in [GLMZ10],
who propose a pairwise variable selection method, among others.
In a regression framework, we could use the Lasso estimator, introduced by Tibshirani in [Tib96],
which is a sparse estimator. It penalizes the maximum likelihood estimator by the ℓ1 -norm,
which achieves the sparsity, as the ℓ0 -penalty, but leads also to a convex optimization. Because
we work with the multivariate linear model, to deal with the matrix structure, we could prefer
the group-Lasso estimator, variables grouped by columns, which selects columns rather than
coefficients. This estimator was introduced by Zhou and Zhu in [ZZ10] in the general case. If
we select |J| columns among the p possible, we have to estimate |J|q coefficients rather than pq
for A, which could be smaller than nq if |J| is smaller enough.
Another estimator which reduces the dimension, is the low rank estimator. Introduced by
Izenman in [Ize75] in the linear model, and more used the last decades, with among others
Bunea et al. in [BSW12] and Giraud in [Gir11], the regression matrix is estimated by a matrix
of rank R, R < p ∧ q. Then, we have to estimate R(p + q − R) coefficients, which could be
smaller than nq.
In this chapter, we have chosen to mix these two estimators to provide a sparse and low rank
estimator in mixture models. This method was introduced by Bunea et al. in [BSW12], in the
case of linear model and known noise covariance matrix. They present different ways, more or
129
CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK
PROCEDURE
less computational, with more or less good results in theory. They get an oracle inequality, which
say that, among a model collection, they are able to choose an estimator with true rank and true
relevant variables. For this model, Ma and Sun in [MS14] get a minimax lower bound, which
precise that they attain nearly optimal rates of convergence adaptively for square Schatten norm
losses.
In this chapter, we consider finite mixture of K linear models in high-dimension. This model
is studied in details by Städler et al. for real response variable in [SBG10], and by Devijver
for multivariate response variable in [Dev14c]. We will estimate βk for all k ∈ {1, . . . , K}
by a column sparse and low rank estimator. The Lasso estimator is used to select variables,
whereas we refit the estimation by a low rank estimator, restricted on relevant variables. The
procedure we propose is based on a modeling that recasts variable selection, rank selection, and
clustering problems into a model selection problem. This procedure is developed in [Dev14c],
with methodology, computational issues, simulations and data analysis. In this chapter, we
focus on theoretical point of view, and developed simulations and data analysis for the low rank
issue. We construct a model collection, with models more or less sparse, and with vector of
ranks varying with values more or less small. Among this collection, we have to select a model.
We use the slope heuristic, which is a non-asymptotic criterion. In a theoretical way, in this
chapter, we get an oracle inequality for the collection constructed by our procedure, which makes
a performance comparison between our selected model and the oracle for a specified penalty.
This result is an extension of the work of Bunea et al. in [BSW12], to mixture models and
with unknown covariance matrices (Σk )1≤k≤K . They ensure that mixing sparse estimator and
low rank matrix could be interesting. Indeed, whereas we have to estimate q × p coefficients
in each cluster for the regression matrix, we get only R(|J| + q − R) unknown variables, which
could be smaller than the number of observations nq if |J| and R are small. Even if the oracle
inequality we get in this chapter is an extension of Bunea et al. result, we use a really different
way to prove it. Considering the model collection constructed, we want to select a model as
good as possible. For that, we use the slope heuristic, which leads to construct a penalty,
proportional to the dimension, and we select the model minimizing the penalized log-likelihood.
Theoretically, we construct also a penalty, proportional to the dimension (up to a logarithm
term). We provide an oracle inequality which compares, up to a constant, the Jensen-KullbackLeibler divergence between our model and the true model to the Kullback-Leibler divergence
between the oracle and the true model. Then, in estimation term, we do as well as possible.
This oracle inequality is deduced from a general model selection theorem for maximum likelihood
estimator of Massart developed in [Mas07]. Controlling the bracketing entropy of models, we
could prove the result. Remark that we work in a regression framework, then we rather use
an extension of this theorem proved in Cohen and Le Pennec article [CLP11]. As our model
collection is random, constructed by the Lasso estimator, we rather use an extension of this
theorem proved in [Dev14b]. To illustrate this procedure, in a computational way, we validate it
on simulated dataset, and benchmark dataset. If the data have a low rank structure, we could
easily find it with our methodology.
This chapter is organized as follows. In the Section 4.2, we describe the finite mixture regression
model used in this procedure, and the main step of the procedure. In the Section 4.3, we present
the main result of this chapter, which is an oracle inequality for the procedure proposed. Finally,
in Section 4.4, we illustrate the procedure on simulated and benchmark dataset. Proof details
of the oracle inequality are given in Appendix.
4.2. THE MODEL AND THE MODEL COLLECTION
4.2
130
The model and the model collection
We introduce our procedure of estimation by sparse and low rank matrix in the linear model,
in Section 4.2.1, and extend it in Section 4.2.2 for mixture models. We also present the main
algorithm used in this context, and we describe the procedure we propose in Section 4.2.4.
4.2.1
Linear model
We consider the observations (xi , yi )1≤i≤n which realized random variables (X, Y ), satisfying
the linear model
Y = βX + ǫ;
where Y ∈ Rq are the responses, X ∈ Rp are the regressors, β ∈ Rq×p is an unknown matrix,
and ǫ ∈ Rq are random errors, ǫ ∼ Nq (0, Σ), with Σ ∈ S++
a symmetric positive definite matrix.
q
We will work in high-dimension, then q × p could be larger than the number of observations nq.
We will construct an estimator which is sparse and low rank for β to cope with the high-dimension
issue. Moreover, to reduce the covariance matrix dimension, we compute a diagonal estimator of
Σ. The procedure we propose could be explained into two steps. First, we estimate the relevant
columns of β thanks to the Lasso estimator, for λ > 0, using the estimator
(4.1)
β̂ Lasso (λ) = argmin ||Y − βX||22 + λ||β||1 ;
β∈Rq×p
P
P
where ||β||1 = pj=1 qz=1 |βz,j |. We assume that the covariance matrix is unknown.
For λ > 0, computing the Lasso estimator of β̂ Lasso (λ), we could deduce the relevant columns.
Restricted to these relevant columns, in the second step of the procedure, we compute a low
rank estimator of β, saying of rank at most R. Indeed, as explained in Giraud in [Gir11], we
restrict the maximum likelihood estimator to have a rank at most R, keeping only the R biggest
singular values in the corresponding decomposition. We get an explicit formula.
This two steps procedure leads to an estimator of β which is sparse and has a low rank. We
have also reduced the dimension into two ways. We refit the covariance matrix estimator by the
maximum likelihood estimator.
This estimator is studied in Bunea et al. in [BSW12], in method 3. Let extend it in mixture
models.
4.2.2
Mixture model
We observe n independent couples (x, y) = (xi , yi )1≤i≤n of random variables (X, Y ), with Y ∈ Rq
and X ∈ Rp . We will estimate the unknown conditional density s∗ by a multivariate Gaussian
mixture regression model. In our model, if the observation i belongs to the cluster k, we assume
that there exists βk ∈ Rq×p , and Σk ∈ S++
such that yi = βk xi + ǫi where ǫi ∼ Nq (0, Σk ).
q
Thus, the random response variable Y ∈ Rq will be explained by a set of explanatory variables,
written X ∈ Rp , through a mixture of linear regression-type model. Give more precisions on the
assumptions.
— The variables Yi are independent conditionally to Xi , for all i ∈ {1, . . . , n} ;
CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK
PROCEDURE
131
— we let Yi |Xi = xi ∼ sK
ξ (y|xi )dy, with
sK
ξ (y|x)
K
X
(y − βk x)t Σ−1
πk
k (y − βk x)
exp
−
=
2
(2π)q/2 det(Σk )1/2
k=1
!
K
ξ = (π1 , . . . , πk , β1 , . . . , βk , Σ1 , . . . , ΣK ) ∈ ΞK = ΠK × (Rq×p )K × (S++
q )
(
)
K
X
ΠK = (π1 , . . . , πK ); πk > 0 for k ∈ {1, . . . , K} and
πk = 1
(4.2)
k=1
is the set of symmetric positive definite matrices on Rq .
S++
q
For all k ∈ {1, . . . , K}, βk is the matrix of regression coefficients, and Σk is the covariance matrix
in the mixture component k. P
The πk s are the mixture proportions. For all k ∈ {1, . . . , K}, for
t
all z ∈ {1, . . . , q}, [βk x]z = pj=1 [βk ]z,j xj is the zth component of the mean of the mixture
component k for the conditional density sK
ξ (.|x).
To detect the relevant variables, we generalize the Lasso estimator defined by (4.1) for mixture
models. Indeed, we penalize the empirical contrast by an ℓ1 -penalty on the mean parameters
proportional to
p X
q
X
||Pk βk ||1 =
|(Pk βk )z,j |,
j=1 z=1
where the Cholesky decomposition Pkt Pk = Σ−1
k defines Pk for all k ∈ {1, . . . , K}. Then, we will
consider
(
)
n
K
X
X
1
Lasso
(λ) = argmin
ξˆK
−
log(sK
πk ||Pk βk ||1 .
(4.3)
ξ (yi |xi )) + λ
n
ξ=(π,β,Σ)∈ΞK
i=1
k=1
Remark that the penalty take into account the mixture weight. To reduce the dimension and
simplify computations, we will estimate Σk by a diagonal matrix, thus Pk will be also estimated
by a diagonal matrix, for all k ∈ {1, . . . , K}.
As in Section 4.2.1, we refit the estimator, restricted on releant columns, with low rank estimator.
In Section 4.2.3, we will extend the EM algorithm to compute those two estimators.
4.2.3
Generalized EM algorithm
In a computational way, we will use two generalized EM algorithms, in order to deal with
high-dimensional data and get a sparse and low rank estimator. Give some details about those
algorithms.
Initially, the EM algorithm was introduced by Dempster et al. in [DLR77]. It alternates two
steps until convergence, an expectation step to cluster data, and a maximization step to update
estimation.
In our procedure, we want to know which columns are relevant, then we have to compute
(4.3), and we want to refit the estimators by a maximum likelihood under low rank constraint
estimator.
From the Lasso estimator (4.3), we could use a generalization of the EM algorithm described in
[Dev14c]. From the estimate of β, we could deduce which columns are relevant.
The second algorithm we use leads to determine βk restricted on relevant columns, for all k ∈
{1, . . . , K}, with rank Rk . We alternate two steps, E-step and M-step, until relative convergence
of the parameters and of the likelihood. We restrict the dataset to relevant columns, and
construct an estimator of size q × |J| rather than q × p, where βk has for rank Rk , for all
k ∈ {1, . . . , K}. Explain the both steps at iteration (ite) ∈ N∗ .
4.2. THE MODEL AND THE MODEL COLLECTION
132
— E-step: compute for k ∈ {1, . . . , K}, i ∈ {1, . . . , n}, the expected value of the loglikelihood function,
γk
τi,k = Eθ(ite) ([Zi ]k |Y ) = PK
l=1 γl
where
(ite)
γl =
πl
(ite)
det Σl
exp
t
(ite)
(ite) y −β (ite) x
− 12 yi −βl
xi (Σ−1
i
i
l )
l
for l ∈ {1, . . . , K}, and Zi is the component-membership variable for an observation i.
— M-step:
— To get estimation in linear model, we assign each observation in its estimated cluster,
by the MAP principle. We could compute this thanks to the E-step. Indeed,yi is
assigned to the component number argmax τi,k .
k∈{1,...,K}
—
(ite)
Then, we could define β˜k
= (xt|k x|k )−1 xt|k y|k , in which x|k and y|k are the sample
(ite)
(ite)
restriction to the cluster k. We decompose β̃k in singular values such that β̃k =
t
U SV with S = diag(s1 , . . . , sq ) and s1 ≥ s2 ≥ . . . ≥ sq the singular values. Then, the
(ite)
(ite)
estimator β̂k
is defined by β̂k = U SRk V t with SRk = diag(s1 , . . . , sRk , 0, . . . , 0).
We do it for all k ∈ {1, . . . , K}.
4.2.4
The Lasso-Rank procedure
The procedure we propose, which is particularly interesting in high-dimension, could be decomposed into three main steps. First, we construct a model collection, with models more or
less sparse and with more or less components. Then, we refit estimations with the maximum
likelihood estimator, under rank constraint. Finally, we select a model with the slope heuristic.
Model collection construction Fix K ∈ K. To get various relevant columns, we construct
a data-driven grid of regularization parameters GK , coming from EM algorithm formula. See
[Dev14c] for more details. For each λ ∈ GK , we could compute the Lasso estimator (4.3),
and deduce relevant variables set, denoted by J(K,λ) . Then, varying λ ∈ GK , and K ∈ K, we
construct J = ∪K∈K ∪λ∈GK J(K,λ) the set of relevant variables sets.
Refitting We could also define a low rank estimator ŝ(K,J,R) restricted to relevant variables
detected by the Lasso estimator, indewed by J.
From this procedure, we construct a model with K clusters, |J| relevant columns and matrix of
regression coefficients of ranks R ∈ NK , as described by the next model S(K,J,R) .
(K,J,R)
S(K,J,R) = {y ∈ Rq |x ∈ Rp 7→ sξ
(y|x)}
(4.4)
where
(K,J,R)
sξ
(y|x)
=
K
X
πk det(Pk )
k=1
(2π)q/2
1
Rk [J] t −1
Rk [J]
exp − (y − (βk ) x) Σk (y − (βk ) x) ;
2
RK
ξ = (π1 , . . . , πK , β1R1 , . . . , βK
, Σ1 , . . . , ΣK ) ∈ Ξ(K,J,R) ;
K
Ξ(K,J,R) = ΠK × Ψ(K,J,R) × (S++
q ) ;
n
o
RK [J]
Ψ(K,J,R) = ((β1R1 )[J] , . . . , (βK
) ) ∈ (Rq×p )K | for all k ∈ {1, . . . , K}, Rank(βk ) = Rk .
133
CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK
PROCEDURE
Varying K ∈ K ⊂ N∗ , J ∈ J ⊂ P({1, . . . , p}), and R ∈ R ⊂ {1, . . . , |J| ∧ q}K , we get a
model collection with various number of components, relevant columns and matrix of regression
coefficients.
Model selection Among this model collection, during the last step, a model has to be selected.
As in Maugis and Michel in [MM11b], and in Maugis and Meynet in [MMR12], among others,
a non asymptotic penalized criterion is used. The slope heuristic was introduced by Birgé
and Massart in [BM07], and developed in practice by Baudry et al. in [BMM12] with the
Capushe package. To use it in our context, we have to extend theoretical result to determine the
penalty shape in the high-dimensional context, with a random model collection, in a regression
framework. The main result is described in the next section, whereas proof details are given in
Appendix.
4.3
Oracle inequality
In a theoretical point of view, we want to ensure that the slope heuristic which penalizes the loglikelihood with the model dimension will select a good model. We follow the approach developed
by Massart in [Mas07] which consists of defining a non-asymptotic penalized criterion, leading to
an oracle inequality. In the context of regression, Cohen and Le Pennec, in [CLP11], and Devijver
in [Dev14b], propose a general model selection theorem for maximum likelihood estimation. The
result we get is a theoretical penalty, for which the model selected is as good as the best one,
according to the Kullback-Leibler loss.
4.3.1
Framework and model collection
Among the model collection constructed by the procedure developed in Section 4.2.2, with
various rank and various sparsity, we want to select an estimator which is close to the best one.
The oracle is by definition the model belonging to the collection which minimizes the contrast
with the true model. In practice, we do not have access to the true model, then we could not
know the oracle. Nevertheless, the goal of the model selection step of our procedure is to be
nearest to the oracle. In this section, we present an oracle inequality, which means that if we
have penalized the log-likelihood in a good way, we will select a model which is as good as the
oracle, according to the Kullback-Leibler loss.
We consider the model collection defined by (4.4).
Because we work in high-dimension, p could be big, and it will be time-consuming to test all
the parts of {1, . . . , p}. We construct a sub-collection denoted by J L , which is constructed by
the Lasso estimator, which is also random. This step is explained in more details in [Dev14c].
Moreover, to get the oracle inequality, we assume that the parameters are bounded:
(K,J,R)
B
(4.5)
S(K,J,R) = sξ
∈ S(K,J,R) ξ = (π, β, Σ), for all k ∈ {1, . . . , K},
Σk = diag([Σk ]1,1 , . . . , [Σk ]q,q ),
for all z ∈ {1, . . . , q}, aΣ ≤ [Σk ]z,z ≤ AΣ ,
for all k ∈ {1, . . . , K}, βkRk =
Rk
X
r=1
[σk ]r [utk ].,r [vk ]r,. ,
for all r ∈ {1, . . . , Rk }, [σk ]r < Aσ .
134
4.3. ORACLE INEQUALITY
Remark that it is the singular value decomposition of βk is the singular value decomposition,
with ([σk ]r )1≤r≤Rk the singular values, and uk and vk unit vectors, for k ∈ {1, . . . , K}.
We also assume that covariates belong to an hypercube: without restrictions, we could assume
that X ∈ [0, 1]p .
Fixing K the possible number of components, J L the relevant columns set constructed by the
Lasso estimator, and R the possible vector of ranks, we get a model collection
[ [ [
B
S(K,J,R)
.
(4.6)
K∈K J∈J L R∈R
4.3.2
Notations
Before state the main theorem which leads to the oracle inequality for the model collection
(4.6), we need to define some metrics used to compare the conditional densities. First, the
Kullback-Leibler divergence is defined by
Z
s(y)
s(y)dy if sdy << tdy;
log
t(y)
(4.7)
KL(s, t) =
R
+ ∞ otherwise;
for s and t two densities. To deal with regression data, for observed covariates (x1 , . . . , xn ), we
define
!
n
1X
⊗n
KL (s, t) = E
KL(s(.|xi ), t(.|xi ))
(4.8)
n
i=1
for s and t two densities.
We also define the Jensen-Kullback-Leibler divergence, first introduced in Cohen and Le Pennec
in [CLP11], by
1
JKLρ (s, t) = KL(s, (1 − ρ)s + ρt)
ρ
for ρ ∈ (0, 1), s and t two densities. The tensorized one is defined by
!
n
X
1
n
JKL⊗
JKLρ (s(.|xi ), t(.|xi )) .
ρ (s, t) = E
n
i=1
Note that those divergences are not metrics, they do not satisfy the triangular inequality and
they are not symmetric, but they are also wildly used in statistics to compare two densities.
4.3.3
Oracle inequality
Let state the main theorem.
Theorem 4.3.1. Assume that we observe (xi , yi )1≤i≤n ∈ ([0, 1]p ×Rq )n with unknown conditional
density s∗ . Let M = K × J × R and ML = K × J L × R, where J L is constructed by the Lasso
B
B
estimator. For (K, J, R) ∈ M, let s̄(K,J,R) ∈ S(K,J,R)
, where S(K,J,R)
is defined by (4.5), such
that, for δKL > 0,
δKL
inf KL⊗n (s∗ , t) +
KL⊗n (s∗ , s̄(K,J,R) ) ≤
B
n
t∈S(K,J,R)
and there exists τ > 0 such that
s̄(K,J,R) ≥ e−τ s∗ .
(4.9)
CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK
PROCEDURE
135
For (K, J, R) ∈ M, consider the rank constraint log-likelihood minimizer ŝ(K,J,R) in S(K,J,R) ,
satisfying
)
(
n
1X
(K,J,R)
(K,J,R)
log sξ
(yi |xi )
.
ŝ
=
argmin
−
n
(K,J,R)
s
∈S B
ξ
i=1
(K,J,R)
B
Denote by D(K,J,R) the dimension of the model S(K,J,R)
. Let pen : M → R+ defined by, for all
(K, J, R) ∈ M,
(
D(K,J,R) 2
D(K,J,R)
2
pen(K, J, R) ≥ κ
B (Aσ , AΣ , aΣ ) ∧ 1
2B (Aσ , AΣ , aΣ ) − log
n
n
!)
K
X
4epq
+
Rk
+ (1 ∨ τ ) log
D(K,J,R)−q2 ∧ pq
k=1
where κ > 0 is an absolute constant, and B(Aσ , AΣ , aΣ ) is an absolute constant, depending on
parameter bounds.
ˆ
Then, the estimator ŝ(K̂,J,R̂) , with
(
)
n
X
1
ˆ R̂) = argmin
(K̂, J,
−
log(ŝ(K,J,R) (yi |xi )) + pen(K, J, R) ,
n
(K,J,R)∈ML
i=1
satisfies
ˆ R̂)
∗ (K̂,J,
n
E JKL⊗
(s
,
ŝ
)
ρ
inf
≤CE
inf
(K,J,R)∈ML
t∈S(K,J,R)
⊗n
KL
(1 ∨ τ )
(s , t) + pen(K, J, R) +
n
∗
(4.10)
some absolute positive constant C.
The proof of the Theorem 4.3.1 is given in Section 4.5. Note that condition (4.9) leads to
control the random model collection. The mixture parameters are bounded in order to construct
brackets over S(K,J,R) , and thus to upper bound the entropy. The inequality (4.10) not exactly
an oracle inequality, since the Jensen-Kullback-Leibler risk is upper bounded by the KullbackLeibler divergence. Note that we use the Jensen-Kullback-Leibler divergence rather than the
Kullback-Leibler divergence, because it is bounded. This boundedness turns out to be crucial
to control the loss of the penalized maximum likelihood estimator under mild assumptions on
the complexity of the model and on parameters.
Because we are looking at a random sub-collection of models which is small enough, our estimator
ŝ(K,J,R) is attainable in practice. Moreover, it is a non-asymptotic result, which allows us to
study cases for which p increases with n.
We could compare our inequality with the bound of Bunea et al, in [BSW12], who computed
a procedure similar to ours, for a linear model. According to consistent group selection for
the group-Lasso estimator, they get adaptivity of the estimator to an optimal rate, and their
estimators perform the bias variance trade-off among all reduced rank estimators. Nevertheless,
their results are obtained according to some assumptions, for instance the mutual coherence on
X t X, which postulates that the off-diagonal elements have to be small. Some assumptions on
the design are required, whereas our result just needs to deal with bounded parameters and
bounded covariates.
136
4.4. NUMERICAL STUDIES
4.4
Numerical studies
We will illustrate our procedure with simulated and benchmark datasets, to highlight the advantages of our method. We adapt the simulations part of Bunea et al. article, [BSW12]. Indeed,
we work in the same way, to get a sparse and low rank estimator. Nevertheless, we add a mixture
framework to be consistent with our clustering method, and to have more flexibility.
4.4.1
Simulations
To illustrate our procedure, we use simulations adapted from the article of Bunea [BSW12],
extended to mixture models.
The design matrix X has independent and identically distributed rows Xi , distributed from a
multivariate Gaussian Nq (0, Σ) with Σ = ρI, ρ > 0. We consider a mixture of 2 components.
According to the cluster, the coefficient matrix βk has the form
bk B 0 b k B 1
βk =
0
0
for k ∈ {1, 2}, with B 0 a J × Rk matrix and B 1 a Rk × q matrix. All entries in B 0 and
B 1 are independent and identically distributed according to N (0, 1). The noise matrix ǫ has
independent N (0, 1) entries. Let ǫi denotes its ith
row.
1 1
The proportion vector π is defined by π = 2 , 2 , i.e. all clusters have the same probability.
Each row Yi in Y is then generated as, if the observation i belongs to the cluster k, Yi = βk Xi +ǫi ,
for all i ∈ {1, . . . , n}. This setup contains many noise features, but the relevant ones lie in a
low-dimensional subspace. We report two settings:
— p > n: n = 50, |J| = 6, p = 100, q = 10, R = (3, 3), ρ = 0.1, b = (3, −3).
— p < n: n = 200, |J| = 6, p = 10, q = 10, R = (3, 3), ρ = 0.01, b = (3, −3).
The current setups show that variable selection, without taking the rank information into consideration, may be suboptimal, even if the correlations between predictors are low. Each model
are simulated 20 times, and Table 4.1 summarizes our findings. We evaluate the prediction
accuracy of each estimator β̂ by the Kullback-Leibler divergence (KL) using a test sample at
each run. We also report the median rank estimate (denoted by R̂) over all runs, rates of non
included true variables (denoted by M for misses) and the rates of incorrectly included variables
(F A for false actives). Ideally, we are looking for a model with low KL, low M and low F A.
p>n
p<n
KL
19.03
3.28
R̂
[2.8,3]
[3,3]
M
0
0
FA
20
0.6
ARI
0.95
0.99
Table 4.1: Performances of our procedure. Mean number {KL, R̂, M, F A, ARI} of the KullbackLeibler divergence between the model selected and the true model, the estimated rank of the
model selected in each cluster, the missed variables, the false relevant variables, and the ARI,
over 20 simulations
We could draw the following conclusions from Table 4.1. When we work in low-dimensional
framework, we get very good results. Even if we could use any estimator, because we do not
have dimension problem, with our procedure we get the matrix structure involved by the model.
Over 20 simulations, we get almost exact clustering, and the Kullback-Leibler divergence between
the model we construct and the true model is really low. In case of high-dimensional data, when
p is larger than n, we get also good results. We find the good structure, selecting the relevant
variables (our model will have false relevant variables, but no missed variables), and selecting
137
CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK
PROCEDURE
the true ranks. We could remark that false relevant variables have low values. Comparing to
another procedure which will not reduce the rank, we will perform the dimension reduction.
4.4.2
Illustration on a real dataset
In this section, we apply our procedure to real data set. The Norwegian paper quality data were
obtained from a controlled experiment that was carried out at a paper factory in Norway to
uncover the effect of three control variables X1 , X2 , X3 on the quality of the paper which was
measured by 13 response variables. Each of the control variables Xi takes values in {−1, 0, 1}.
To account for possible interactions and nonlinear effects, second order terms were added to the
set of predictors, yielding X1 , X2 , X3 , X12 , X22 , X32 , X1 X2 , X1 X3 , X2 X3 , and the intercept term.
There were 29 observations with no missing values made on all response and predictor variables.
The Box Behnken design of the experiment and the resulting data are described in Aldrin [Ald96]
and Izenman [Ize75]. Moreover, Bunea et al. in [BSW12] also study this dataset. We always
center the responses and the predictors. The dataset clearly indicates that dimension reduction
is possible, making it a typical application for reduced rank regression methods. Moreover, our
method will exhibit different clusters among this sample.
We construct a model collection varying the number of clusters in K = {2, . . . , 5}. We select a
model with 2 clusters. We select all variables except X1 X2 and X2 X3 , which is consistent with
comments of Bunea et al. In the two clusters, we get two mean matrices, with ranks equal to 2
and 4. One cluster describes the mean comportment (with rank equals to 2), whereas the other
cluster contains values more different.
4.5
Appendices
In those appendices, we present the details of the proof of the Theorem 4.3.1. It derives from
a general model selection theorem, stated in Section 4.5.1, and proved in the Chapter 3. Then,
the proof of the Theorem 4.3.1 could be summarized by satisfying assumptions Hm , Sepm and
K described in Section 4.5.1.
4.5.1
A general oracle inequality for model selection
Model selection appears with the AIC criterion and BIC criterion. A non-asymptotic theory
was developed by Birgé and Massart in [BM07]. With some assumptions detailed here, we get
an oracle inequality for the maximum likelihood estimator among a model collection. Cohen
and Le Pennec, in [CLP11], generalize this theorem in regression framework. We have to use a
generalization of this theorem detailed in [Dev14b] because we consider a random collection of
models. Let state the main theorem. We consider a model collection (Sm )m∈M , indexed by M.
Let (X, Y ) ∈ X × Y.
Begin by describe the assumptions. First, we impose a structural assumption. It is a bracketing
entropy condition on the model with respect to the tensorized Hellinger divergence
#
" n
1X 2
⊗n 2
dH (s(.|xi ), t(.|xi )) ;
(dH ) (s, t) = E
n
i=1
for two densities s and t. A bracket [l, u] is a pair of functions such that for all (x, y) ∈ X × Y,
l(y, x) ≤ s(y|x) ≤ u(y, x).
138
4.5. APPENDICES
n
For ǫ > 0, the bracketing entropy H[.] (ǫ, S, d⊗
H ) of a set S is defined as the logarithm of the
⊗n
minimum number of brackets [l, u] of width dH (l, u) smaller than ǫ such that every densities of
S belong to one of these brackets.
Let m ∈ M.
Assumption Hm. There is a non-decreasing function φm such that ̟ ↦ φm(̟)/̟ is non-increasing on (0, +∞) and, for every ̟ ∈ R+ and every sm ∈ Sm,
\[
\int_0^{\varpi} \sqrt{H_{[.]}\bigl(\epsilon, S_m(s_m, \varpi), d_H^{\otimes_n}\bigr)}\, d\epsilon \le \phi_m(\varpi),
\]
where Sm(sm, ̟) = {t ∈ Sm, d_H^{⊗n}(t, sm) ≤ ̟}. The model complexity Dm is then defined as n̟m², with ̟m the unique solution of
\[
\frac{1}{\varpi}\phi_m(\varpi) = \sqrt{n}\,\varpi. \tag{4.11}
\]
Note that the model complexity depends on the bracketing entropies not of the global models Sm but of smaller localized sets, which is a weaker assumption.
For technical reasons, a separability assumption is also required.
Assumption Sepm. There exist a countable subset Sm' of Sm and a set Ym' with λ(Y \ Ym') = 0, where λ denotes the Lebesgue measure, such that for every t ∈ Sm there exists a sequence (tl)l≥1 of elements of Sm' such that, for every x and every y ∈ Ym', log(tl(y|x)) tends to log(t(y|x)) as l goes to infinity.
This assumption allows us to work with a countable subset.
We also need an information theory type assumption on our model collection. We assume the
existence of a Kraft-type inequality for the collection.
Assumption K. There is a family (wm)m∈M of non-negative numbers such that
\[
\sum_{m\in\mathcal{M}} e^{-w_m} \le \Omega < +\infty.
\]
We can now state our main theorem, which provides an oracle inequality in the regression framework with a random collection of models.
Theorem 4.5.1. Assume we observe (xi, yi)1≤i≤n ∈ ([0, 1]^p × R^q)^n with unknown conditional density s∗. Let S = (Sm)m∈M be an at most countable collection of conditional density sets. Assume that assumption K holds, and that assumptions Hm and Sepm hold for every model Sm ∈ S. Let δKL > 0, and s̄m ∈ Sm such that
\[
KL^{\otimes_n}(s^*, \bar{s}_m) \le \inf_{t\in S_m} KL^{\otimes_n}(s^*, t) + \frac{\delta_{KL}}{n},
\]
and let τ > 0 be such that
\[
\bar{s}_m \ge e^{-\tau} s^*. \tag{4.12}
\]
Introduce a random sub-collection (Sm)m∈M̌ of (Sm)m∈M. Consider the collection (ŝm)m∈M̌ of η-log-likelihood minimizers in Sm, satisfying
\[
\sum_{i=1}^{n} -\log(\hat{s}_m(y_i|x_i)) \le \inf_{s_m\in S_m}\left(\sum_{i=1}^{n} -\log(s_m(y_i|x_i))\right) + \eta.
\]
Then, for any ρ ∈ (0, 1) and any C1 > 1, there are two constants κ0 and C2, depending only on ρ and C1, such that, as soon as for every index m ∈ M,
\[
\mathrm{pen}(m) \ge \kappa\bigl(D_m + (1\vee\tau) w_m\bigr) \tag{4.13}
\]
with κ > κ0, where the model complexity Dm is defined in (4.11), the penalized likelihood estimate ŝm̂ with m̂ ∈ M̌ such that
\[
\sum_{i=1}^{n} -\log(\hat{s}_{\hat{m}}(y_i|x_i)) + \mathrm{pen}(\hat{m}) \le \inf_{m\in\check{M}}\left(\sum_{i=1}^{n} -\log(\hat{s}_m(y_i|x_i)) + \mathrm{pen}(m)\right) + \eta'
\]
satisfies
\[
\mathbb{E}\bigl(JKL_\rho^{\otimes_n}(s^*, \hat{s}_{\hat{m}})\bigr) \le C_1\, \mathbb{E}\left(\inf_{m\in\check{M}}\inf_{t\in S_m} KL^{\otimes_n}(s^*, t) + 2\,\frac{\mathrm{pen}(m)}{n}\right) + C_2 (1\vee\tau)\frac{\Omega^2}{n} + \frac{\eta' + \eta}{n}. \tag{4.14}
\]
Remark 4.5.2. This means that, among a random model collection, we are able to choose a model which is as good as the oracle, up to a constant C1 and additive terms of order 1/n. This result is non-asymptotic, and provides a theoretical penalty with which to select this model.
Remark 4.5.3. The proof of this theorem is detailed in [Dev14b]. Nevertheless, let us give the main ideas in order to understand the assumptions. From assumptions Hm and Sepm, we can use maximal inequalities which yield, except on a set of probability less than e^{−w_{m'}−w} for all w, a control of the ratio of the centered empirical process of log(ŝ_{m'}) to the Hellinger distance between s∗ and ŝ_{m'}, this control being of order 1/n. Thanks to a Bernstein inequality, which holds by inequality (4.12), and thanks to assumption K, we get the oracle inequality.
To prove Theorem 4.3.1, it remains to check assumptions Hm and K, assumption Sepm being automatically satisfied for our conditional densities.
4.5.2 Assumption Hm
Decomposition
As done in Cohen and Le Pennec [CLP11], we can decompose the entropy as
\[
H_{[.]}\bigl(\epsilon, S^{B}_{(K,J,R)}, d_H^{\otimes_n}\bigr) \le H_{[.]}\bigl(\epsilon, \Pi_K, d_H^{\otimes_n}\bigr) + \sum_{k=1}^{K} H_{[.]}\bigl(\epsilon, F_{(J,R_k)}, d_H^{\otimes_n}\bigr) \tag{4.15}
\]
where
\[
\begin{aligned}
S^{B}_{(K,J,R)} &= \Bigl\{ y\in\mathbb{R}^q \mid x\in\mathbb{R}^p \mapsto s^{(K,J,R)}_{\xi}(y|x) = \sum_{k=1}^{K} \pi_k\, \varphi\bigl(y \mid (\beta_k^{R_k})^{[J]}x,\, \Sigma_k\bigr);\\
&\qquad\quad \xi = \bigl(\pi_1,\ldots,\pi_K, (\beta_1^{R_1})^{[J]},\ldots,(\beta_K^{R_K})^{[J]}, \Sigma_1,\ldots,\Sigma_K\bigr) \in \Xi_{(K,J,R)} \Bigr\},\\
\Xi_{(K,J,R)} &= \Pi_K \times \tilde{\Psi}_{(K,J,R)} \times \bigl([a_\Sigma, A_\Sigma]^q\bigr)^K,\\
\Psi_{(K,J,R)} &= \Bigl\{ \bigl((\beta_1^{R_1})^{[J]},\ldots,(\beta_K^{R_K})^{[J]}\bigr) \in (\mathbb{R}^{q\times p})^K \mid \mathrm{Rank}(\beta_k) = R_k \Bigr\},\\
\tilde{\Psi}_{(K,J,R)} &= \Bigl\{ \bigl((\beta_1^{R_1})^{[J]},\ldots,(\beta_K^{R_K})^{[J]}\bigr) \in \Psi_{(K,J,R)} \ \Big|\ \text{for all } k\in\{1,\ldots,K\},\ \beta_k^{R_k} = \sum_{r=1}^{R_k}\sigma_r u_r^t v_r,\\
&\qquad\qquad\qquad \text{with } \sigma_r < A_\sigma \text{ for all } r\in\{1,\ldots,R_k\} \Bigr\},\\
\Pi_K &= \Bigl\{ (\pi_1,\ldots,\pi_K)\in(0,1)^K ;\ \sum_{k=1}^{K}\pi_k = 1 \Bigr\},\\
F_{(J,R)} &= \Bigl\{ \varphi\bigl(.\mid(\beta^{R})^{[J]}X, \Sigma\bigr);\ \beta^R = \sum_{r=1}^{R}\sigma_r u_r^t v_r, \text{ with } \sigma_r < A_\sigma,\\
&\qquad\qquad\qquad \Sigma = \mathrm{diag}(\Sigma_{1,1},\ldots,\Sigma_{q,q}) \in [a_\Sigma, A_\Sigma]^q \Bigr\},
\end{aligned}
\]
with ϕ denoting the Gaussian density.
For the proportions, it is known (see Wasserman and Genovese [GW00]) that
\[
H_{[.]}\bigl(\epsilon, \Pi_K, d_H^{\otimes_n}\bigr) \le \log\left( K(2\pi e)^{K/2}\left(\frac{3}{\epsilon}\right)^{K-1}\right).
\]
We now deal with the Gaussian entropy.
For the Gaussian
We want to bound the integrated entropy. For that, we first have to construct brackets covering Sm. Fix f ∈ Sm. We are looking for functions l and u such that l ≤ f ≤ u. Because f is Gaussian, l and u are dilatations of Gaussian densities. We then have to determine the mean, the variance and the dilatation coefficient of l and u. We need the two following lemmas to construct these brackets.
Lemma 4.5.4. Let ϕ(.|µ1, Σ1) and ϕ(.|µ2, Σ2) be two Gaussian densities. If their variance matrices are assumed to be diagonal, with Σa = diag([Σa]1,1, . . . , [Σa]q,q) for a ∈ {1, 2}, such that [Σ2]z,z > [Σ1]z,z > 0 for all z ∈ {1, . . . , q}, then, for all x ∈ R^q,
\[
\frac{\varphi(x|\mu_1,\Sigma_1)}{\varphi(x|\mu_2,\Sigma_2)} \le \prod_{z=1}^{q}\sqrt{\frac{[\Sigma_2]_{z,z}}{[\Sigma_1]_{z,z}}}\; \exp\left\{\frac{1}{2}(\mu_1-\mu_2)^t\,\mathrm{diag}\left(\frac{1}{[\Sigma_2]_{1,1}-[\Sigma_1]_{1,1}},\ldots,\frac{1}{[\Sigma_2]_{q,q}-[\Sigma_1]_{q,q}}\right)(\mu_1-\mu_2)\right\}.
\]
Lemma 4.5.5. The Hellinger distance between two Gaussian densities with diagonal variance matrices is given by the following expression:
\[
d_H^2\bigl(\varphi(.|\mu_1,\Sigma_1), \varphi(.|\mu_2,\Sigma_2)\bigr) = 2 - 2\prod_{z=1}^{q}\left(\frac{2\sqrt{[\Sigma_1]_{z,z}[\Sigma_2]_{z,z}}}{[\Sigma_1]_{z,z}+[\Sigma_2]_{z,z}}\right)^{1/2} \exp\left\{-\frac{1}{4}(\mu_1-\mu_2)^t\,\mathrm{diag}\left(\frac{1}{[\Sigma_1]_{z,z}+[\Sigma_2]_{z,z}}\right)_{z\in\{1,\ldots,q\}}(\mu_1-\mu_2)\right\}.
\]
To get an ǫ-bracket for the densities, we have to construct a δ-net for the variance and the mean, δ to be specified later.
— Step 1: construction of a net for the variance
Let ǫ ∈ (0, 1], and δ = ǫ/√(2q). Let b_j = (1 + δ)^{1−j/2} A_Σ for 2 ≤ j ≤ N; we have [a_Σ, A_Σ] = [b_N, b_{N−1}] ∪ . . . ∪ [b_3, b_2], where N is chosen to cover the whole interval. We want
\[
a_\Sigma = (1+\delta)^{1-N/2} A_\Sigma
\;\Leftrightarrow\;
\log\left(\frac{a_\Sigma}{A_\Sigma}\right) = \left(1-\frac{N}{2}\right)\log(1+\delta)
\;\Leftrightarrow\;
N = \frac{2\log\left(\frac{A_\Sigma}{a_\Sigma}\sqrt{1+\delta}\right)}{\log(1+\delta)}.
\]
Since we want N to be an integer, we take N = ⌈2 log((A_Σ/a_Σ)√(1+δ)) / log(1+δ)⌉. We get a regular net for the variance. We can let B = diag(b_{i(1)}, . . . , b_{i(q)}), close to Σ (and deterministic, independent of the values of Σ), where i is a permutation such that b_{i(z)+1} ≤ Σ_{z,z} ≤ b_{i(z)} for all z ∈ {1, . . . , q}.
— Step 2: construction of a net for the mean vectors
We use the singular value decomposition of β, β = Σ_{r=1}^R σ_r u_r^t v_r, with (σ_r)_{1≤r≤R} the singular values, and (u_r)_{1≤r≤R} and (v_r)_{1≤r≤R} unit vectors. Those vectors are also bounded.
We are looking for l and u such that d_H(l, u) ≤ ǫ and l ≤ f ≤ u. We will use a dilatation of a Gaussian to construct such an ǫ-bracket of ϕ. We let
\[
\begin{aligned}
l(x, y) &= (1+\delta)^{-(p^2 qR + 3q/4)}\, \varphi\bigl(y \mid \nu_{J,R}\,x,\ (1+\delta)^{-1/4} B^1\bigr),\\
u(x, y) &= (1+\delta)^{p^2 qR + 3q/4}\, \varphi\bigl(y \mid \nu_{J,R}\,x,\ (1+\delta) B^2\bigr),
\end{aligned}
\]
where B^1 and B^2 are constructed such that, for all z ∈ {1, . . . , q}, [B^1]_{z,z} ≤ Σ_{z,z} ≤ [B^2]_{z,z} (see Step 1).
The means ν_{J,R} ∈ R^{q×p} will be specified later. Just remark that J is the set of relevant columns, and R the rank of ν_{J,R}: we will decompose ν_{J,R} = Σ_{r=1}^R σ̃_r ũ_r^t ṽ_r, with ũ ∈ R^{|J|×R} and ṽ ∈ R^{q×R}.
We get
l(x, y) ≤ f(y|x) ≤ u(x, y)
if we have
\[
\|\beta x - \nu_{J,R}\,x\|_2^2 \le p^2 qR\,\frac{\delta^2}{2}\, a_\Sigma^2\bigl(1 - 2^{-1/4}\bigr).
\]
Remark that ‖βx − ν_{J,R}x‖²_2 ≤ p‖β − ν_{J,R}‖²_2 ‖x‖²_∞. We then need
\[
\|\beta - \nu_{J,R}\|_2^2 \le pqR\,\frac{\delta^2}{2}\, a_\Sigma^2\bigl(1 - 2^{-1/4}\bigr). \tag{4.16}
\]
According to [Dev14b], d_H²(l, u) ≤ 2(p²qR + 3q/4)²δ²; then, with
\[
\delta = \frac{\epsilon}{\sqrt{2}\,(pqR + 3q/4)},
\]
we get the wanted bound.
Let us now explain how to construct ν_{J,R} in order to get (4.16).
\[
\begin{aligned}
\|\beta - \nu_{J,R}\|_2^2 &= \sum_{j=1}^{p}\sum_{z=1}^{q}\sum_{r=1}^{R}\bigl(\sigma_r u_{r,j}v_{r,z} - \tilde{\sigma}_r \tilde{u}_{r,j}\tilde{v}_{r,z}\bigr)^2\\
&= \sum_{j=1}^{p}\sum_{z=1}^{q}\sum_{r=1}^{R}\bigl((\sigma_r - \tilde{\sigma}_r)u_{r,j}v_{r,z} - \tilde{\sigma}_r(\tilde{u}_{r,j}-u_{r,j})\tilde{v}_{r,z} - \tilde{\sigma}_r u_{r,j}(\tilde{v}_{r,z}-v_{r,z})\bigr)^2\\
&\le \sum_{j=1}^{p}\sum_{z=1}^{q}\sum_{r=1}^{R}\bigl(|\sigma_r - \tilde{\sigma}_r| + A_\sigma|\tilde{u}_{r,j}-u_{r,j}| + A_\sigma|v_{r,z}-\tilde{v}_{r,z}|\bigr)^2\\
&\le 2pqR\Bigl(\max_{r}|\sigma_r - \tilde{\sigma}_r|^2 + A_\sigma^2\max_{r,j}|\tilde{u}_{r,j}-u_{r,j}|^2 + A_\sigma^2\max_{r,z}|\tilde{v}_{r,z}-v_{r,z}|^2\Bigr).
\end{aligned}
\]
We need ‖β − ν_{J,R}‖²_2 ≤ pqR (δ²/2) a_Σ² (1 − 2^{-1/4}).
If we choose σ̃_r, ũ_{r,j} and ṽ_{r,z} such that
\[
\begin{aligned}
|\sigma_r - \tilde{\sigma}_r| &\le \frac{\delta}{\sqrt{12}}\, a_\Sigma\sqrt{1-2^{-1/4}},\\
|u_{r,j} - \tilde{u}_{r,j}| &\le \frac{\delta}{\sqrt{12}\,A_\sigma}\, a_\Sigma\sqrt{1-2^{-1/4}},\\
|v_{r,z} - \tilde{v}_{r,z}| &\le \frac{\delta}{\sqrt{12}\,A_\sigma}\, a_\Sigma\sqrt{1-2^{-1/4}},
\end{aligned}
\]
then (4.16) holds.
To get this, we let, ⌊.⌋ denoting the floor function,
\[
\begin{aligned}
S &= \mathbb{Z} \cap \left[0,\ \left\lfloor \frac{A_\sigma}{\frac{\delta}{\sqrt{12}}\, a_\Sigma\sqrt{1-2^{-1/4}}}\right\rfloor\right],
&\tilde{\sigma}_r &= \operatorname*{argmin}_{\varsigma\in S}\left|\sigma_r - \frac{\delta}{\sqrt{12}}\, a_\Sigma\sqrt{1-2^{-1/4}}\,\varsigma\right|,\\
U &= \mathbb{Z} \cap \left[0,\ \left\lfloor \frac{A_\sigma}{\frac{\delta}{\sqrt{12}\,A_\sigma}\, a_\Sigma\sqrt{1-2^{-1/4}}}\right\rfloor\right],
&\tilde{u}_{r,j} &= \operatorname*{argmin}_{\mu\in U}\left|u_{r,j} - \frac{\delta}{\sqrt{12}\,A_\sigma}\, a_\Sigma\sqrt{1-2^{-1/4}}\,\mu\right|,\\
V &= \mathbb{Z} \cap \left[0,\ \left\lfloor \frac{A_\sigma}{\frac{\delta}{\sqrt{12}\,A_\sigma}\, a_\Sigma\sqrt{1-2^{-1/4}}}\right\rfloor\right],
&\tilde{v}_{z,r} &= \operatorname*{argmin}_{\nu\in V}\left|v_{z,r} - \frac{\delta}{\sqrt{12}\,A_\sigma}\, a_\Sigma\sqrt{1-2^{-1/4}}\,\nu\right|,
\end{aligned}
\]
for all r ∈ {1, . . . , R}, j ∈ {1, . . . , p}, z ∈ {1, . . . , q}.
Remark that we only need to determine the vectors ((ũ_{r,j})_{1≤j≤|J|−r})_{1≤r≤R} and ((ṽ_{z,r})_{1≤z≤q−r})_{1≤r≤R}, because those vectors are unit vectors.
Then, we let
\[
\begin{aligned}
&\text{for all } j \in J^c, \text{ for all } z \in \{1, \ldots, q\},\quad (\nu_{J,R})_{z,j} = 0,\\
&\text{for all } j \in J, \text{ for all } z \in \{1, \ldots, q\},\quad (\nu_{J,R})_{z,j} = \sum_{r=1}^{R} \tilde{\sigma}_r \tilde{u}_{r,j}\tilde{v}_{z,r}.
\end{aligned}
\]
— Step 3: upper bound on the number of ǫ-brackets for F_{(J,R)}
We have defined our bracket. Let c = (1 − 2^{-1/4})/12. We want to control the entropy. We have
\[
\begin{aligned}
|B_\epsilon(F_{(J,R)})| &\le \sum_{l=2}^{N} \left(\frac{A_\sigma}{\delta a_\Sigma\sqrt{c}}\right)^{R}\left(\frac{A_\sigma^2}{\delta a_\Sigma\sqrt{c}}\right)^{R\left(\frac{2J-R-1}{2}+\frac{2q-R-1}{2}\right)}\\
&\le (N-1)\left(\frac{A_\sigma^2}{\delta a_\Sigma\sqrt{c}}\right)^{R(J+q-R)} A_\sigma^{-R}\\
&\le C(a_\Sigma, A_\Sigma, A_\sigma, J, R)\,\delta^{-D_{(J,R)}-1}
\end{aligned}
\]
with
\[
C(a_\Sigma, A_\Sigma, A_\sigma, J, R) = 2\left(\frac{A_\Sigma}{a_\Sigma}+\frac{1}{2}\right)\left(\frac{A_\sigma^2}{a_\Sigma\sqrt{c}}\right)^{R(J+q-R)} A_\sigma^{-R}.
\]
For the mixture
We begin by computing the bracketing entropy: according to (4.15),
\[
H_{[.]}\bigl(\epsilon, S^{B}_{(K,J,R)}, d_H^{\otimes_n}\bigr) \le \log\left( C\left(\frac{1}{\epsilon}\right)^{D_{(K,J,R)}}\right),
\]
where
\[
C = 2^K K (2\pi e)^{K/2}\, 3^{K-1}\left(\frac{A_\Sigma}{a_\Sigma}+\frac{1}{2}\right)^{K}\left(\frac{A_\sigma^2\sqrt{12}}{a_\Sigma\sqrt{1-2^{-1/4}}}\right)^{D_{(K,J,R)}} A_\sigma^{-\sum_{k=1}^{K}R_k} \tag{4.17}
\]
and D_{(K,J,R)} = Σ_{k=1}^K R_k(|J| + q − R_k).
We have to determine φ_{(K,J,R)} such that
\[
\int_0^{\varpi}\sqrt{H_{[.]}\bigl(\epsilon, S^{B}_{(K,J,R)}(s_{(K,J,R)},\varpi), d_H^{\otimes_n}\bigr)}\,d\epsilon \le \phi_{(K,J,R)}(\varpi). \tag{4.18}
\]
Let us compute the integral:
\[
\begin{aligned}
\int_0^{\varpi}\sqrt{H_{[.]}\bigl(\epsilon, S^{B}_{(K,J,R)}(s_{(K,J,R)},\varpi), d_H^{\otimes_n}\bigr)}\,d\epsilon
&\le \int_0^{\varpi}\sqrt{\log(C) + D_{(K,J,R)}\log\left(\frac{1}{\epsilon}\right)}\,d\epsilon\\
&\le \sqrt{D_{(K,J,R)}}\,\varpi\left[\sqrt{\frac{\log(C)}{D_{(K,J,R)}}} + \sqrt{\pi} + \sqrt{\log\left(\frac{1}{\varpi\wedge 1}\right)}\right]
\end{aligned}
\]
with, according to (4.17),
\[
\begin{aligned}
\log(C) &= K\log(2) + \frac{K}{2}\log(2\pi e) + (K-1)\log(3) + K\log\left(\frac{A_\Sigma}{a_\Sigma}+\frac{1}{2}\right)\\
&\quad + D_{(K,J,R)}\log\left(\frac{A_\sigma^2\sqrt{12}}{a_\Sigma\sqrt{1-2^{-1/4}}}\right) + \log(K) + \sum_{k=1}^{K}R_k\log\left(\frac{1}{A_\sigma}\right)\\
&\le D_{(K,J,R)}\left(\log(2) + \log(2\pi e) + \log 3 + 1 + \log\left(\frac{A_\sigma^2\sqrt{12}}{a_\Sigma\sqrt{1-2^{-1/4}}}\right)\right) + D_{(K,J,R)}\log\left(\frac{A_\Sigma}{a_\Sigma}+\frac{1}{2}\right)\\
&\le D_{(K,J,R)}\left(\log\left(\frac{12\pi e^2 A_\sigma^2\sqrt{12}}{a_\Sigma\sqrt{1-2^{-1/4}}}\right) + \log\left(\frac{A_\Sigma}{a_\Sigma}+\frac{1}{2}\right)\right).
\end{aligned}
\]
Then,
\[
\int_0^{\varpi}\sqrt{H_{[.]}\bigl(\epsilon, S^{B}_{(K,J,R)}(s_{(K,J,R)},\varpi), d_H^{\otimes_n}\bigr)}\,d\epsilon \le \sqrt{D_{(K,J,R)}}\left(2 + \sqrt{\log\left(\frac{A_\Sigma}{a_\Sigma}+\frac{1}{2}\right) + \frac{A_\sigma^2}{a_\Sigma}} + \sqrt{\log\left(\frac{1}{\varpi\wedge 1}\right)}\right)\varpi.
\]
Consequently, by putting
\[
B = 2 + \sqrt{\log\left(\frac{A_\Sigma}{a_\Sigma}+\frac{1}{2}\right) + \frac{A_\sigma^2}{a_\Sigma}},
\]
we get that the function φ_{(K,J,R)} defined on R*_+ by
\[
\phi_{(K,J,R)}(\varpi) = \sqrt{D_{(K,J,R)}}\,\varpi\left(B + \sqrt{\log\left(\frac{1}{\varpi\wedge 1}\right)}\right)
\]
satisfies (4.18). Besides, φ_{(K,J,R)} is non-decreasing and ̟ ↦ φ_{(K,J,R)}(̟)/̟ is non-increasing, so φ_{(K,J,R)} is suitable.
Finally, we need to find an upper bound on ̟∗, the solution of φ_{(K,J,R)}(̟∗) = √n ̟∗². This is equivalent to solving
\[
\varpi_* = \sqrt{\frac{D_{(K,J,R)}}{n}}\left(B + \sqrt{\log\left(\frac{1}{\varpi_*\wedge 1}\right)}\right),
\]
and then we can choose
\[
\varpi_*^2 \le \frac{D_{(K,J,R)}}{n}\left(2B^2 + \log\left(\frac{1}{1\wedge\frac{D_{(K,J,R)}}{n}B^2}\right)\right).
\]
4.5.3 Assumption K
Let w_{(K,J,R)} = D̃_{(K,J)} log(4epq/((D̃_{(K,J)} − q²) ∧ pq)) + Σ_{k∈{1,...,K}} R_k, where D̃_{(K,J)} = K(1 + |J|). Then, we can compute the sum
\[
\begin{aligned}
\sum_{(K,J,R)} e^{-w_{(K,J,R)}} &\le \sum_{K\ge 1}\sum_{1\le|J|\le pq} e^{-\tilde{D}_{(K,J)}\log\left(\frac{4epq}{(\tilde{D}_{(K,J)}-q^2)\wedge pq}\right)} \left(\sum_{R\ge 1} e^{-R}\right)^{K}\\
&\le \sum_{K\ge 1}\sum_{1\le|J|\le pq} e^{-\tilde{D}_{(K,J)}\log\left(\frac{4epq}{(\tilde{D}_{(K,J)}-q^2)\wedge pq}\right)}\; 1^{K}.
\end{aligned}
\]
The last double sum is bounded by 2, as computed in Proposition 4.5 in [Dev14b]. Then,
\[
\sum_{(K,J,R)} e^{-w_{(K,J,R)}} \le 2.
\]
Chapter 5
Clustering electricity consumers using
high-dimensional regression mixture
models.
Contents
5.1 Introduction
5.2 Method
5.3 Typical workflow using the example of the aggregated consumption
    5.3.1 General framework
    5.3.2 Cluster days on the residential synchronous curve
          Model selection
          Model visualization
          Model-based clustering
          Clustering
          Discussion
5.4 Clustering consumers
    5.4.1 Cluster consumers on mean days
    5.4.2 Cluster consumers on individual curves
          Selected mixture models
          Analyzing clusters using classical profiling features
          Cross analysis using survey data
          Using model on one-day shifted data
          Remarks on similar analyses
5.5 Discussion and conclusion
5.1 Introduction
New metering infrastructures such as smart meters provide new and potentially massive information about individual (household, small and medium enterprise) consumption. As an example, in France, ERDF (Electricité Réseau Distribution de France, the French manager of the public electricity distribution network) deployed 250 000 smart meters, covering a rural and an urban territory and providing the half-hourly household energy use each day. ERDF plans to install 35 million of them over the French territory by the end of 2020, and exploiting such an amount of data is an exciting but challenging task (see http://www.erdf.fr/Linky).
Many applications based on individual data analysis can be found in the literature. The first and most popular one is load profiling. Understanding consumers' time of use, seasonal patterns and the different features that drive their consumption is a fundamental task for electricity providers to design their offers, and more generally for marketing studies (see e.g. [Irw86]). Energy agencies and states can also benefit from profiling for efficiency programs and to improve recommendation policies. Customer segmentation based on load classification is a natural approach for that, and [ZYS13] proposes a nice review of the most popular methods, concluding that classification for smart grids is a hard task due to the complexity, massiveness, high dimension and heterogeneity of the data. Another problem pointed out is the dynamic structure of smart meter data, and particularly the issue of portfolio variations (losses and gains of customers), the update of a previous classification when new customers arrive, or the clustering of a new customer with very few observations of its load. In [KFR14], the authors propose a segmentation algorithm based on K-means to uncover shape dictionaries that help to summarize information and cluster a large population of 250 000 households in California. However, the proposed solution exploits a rather long data history of at least one year.
Recently, other important questions have been raised by smart meters and by the new possibility of sending potentially complex signals to consumers (incentive payments, time-varying prices...), and demand response program tailoring attracts a lot of attention (see [US 06], [H+ 13]). Local optimization of electricity production and real-time management of individual demand thus play an important role in the smart grid landscape. This induces a need for local electricity load forecasting at different levels of the grid and favors bottom-up approaches based on a two-stage process. The first stage consists in building classes in a population such that each class can be forecast sufficiently well but corresponds to different load shapes or reacts differently to exogenous variables like temperature or prices (see e.g. [LSD15] in the context of demand response). The second stage consists in aggregating forecasts to forecast the total, or any subtotal, of the population consumption. For example, identifying and forecasting the consumption of a sub-population reactive to an incentive is an important need to optimize a demand response program. Surprisingly, few papers consider the problem of clustering individual consumption for forecasting, and especially for forecasting at a disaggregated level (e.g. in each class). In [AS13], clustering procedures are compared according to the forecasting performances of their corresponding bottom-up forecasts of the total consumption of 6 000 residential customers and small-to-medium enterprises in Ireland. Even if they achieve nice performances in the end, the proposed clustering methods are quite independent of the VAR model used for forecasting. In [MMOP10], a clustering algorithm is proposed that couples hierarchical clustering and multi-linear regression models to improve the forecast of the total consumption of a French industrial subset. The authors obtain a real forecasting gain but need a sufficiently long dataset (2-3 years), and the algorithm is computationally intensive.
We propose here a new methodology based on high-dimensional regression models. Our main contribution is that we focus on uncovering classes corresponding to different regression models. As a consequence, these classes can then be exploited for profiling as well as for forecasting in each class, or for bottom-up forecasts in a unified view. More precisely, we consider regression models Yd = Xd β + εd, where Yd is typically an individual (high-dimensional) load curve for day d and Xd can be alternatively Yd−1 or any other exogenous covariate.
We consider a real dataset of 4 225 Irish individual consumers, each with 48 half-hourly meter reads per day over one year: from 1st January 2010 up to 31st December 2010. These data have already been studied in [AS13] and [Cha14], and we refer to those papers for a presentation of the data. For computational and time reasons, we draw a random sample of around 500 residential consumers among the 90% closest to the mean, to demonstrate the feasibility of our methods. We show that, considering only 2 days of consumption, we obtain physically interpretable clusters of consumers.
As shown in Fig. 5.1, dealing with individual consumption curves is a hard task, because of their high variability.
Figure 5.1: Load consumption of a sample of 5 consumers over a week in winter.
5.2 Method
We propose to use model-based clustering and adopt the model selection paradigm. Indeed, we consider the model collection of conditional mixture densities
\[
S = \left\{ s_\xi^{(K,J)},\ K\in\mathcal{K},\ J\in\mathcal{J} \right\},
\]
where K denotes the number of clusters, J the set of relevant variables for clustering, and K and J are respectively the sets of possible values of K and J.
The basic model we propose to use is a finite mixture regression of K multivariate Gaussian densities (see [SBvdGR10] for a recent and fruitful reference), the conditional density being, for x ∈ R^p, y ∈ R^q, with ϕ denoting the Gaussian density,
\[
s_\xi^{(K,J)}(y|x) = \sum_{k=1}^{K} \pi_k\, \varphi\bigl(y \mid \beta_k^{[J]}x,\ \Sigma_k\bigr).
\]
Such a model can be interpreted and used from two different viewpoints.
First, from a clustering perspective, given the estimation ξ̂ of the parameters ξ = (π, β, Σ), we can deduce a data clustering from the Maximum A Posteriori principle: for each observation i, we compute the posterior probability τ_{i,k}(ξ̂) of each cluster k from the estimation ξ̂, and we assign observation i to the cluster k̂_i = argmax_{k∈{1,...,K}} τ_{i,k}(ξ̂). The proportions of the clusters are estimated by π̂.
Second, in each cluster, the corresponding model is meaningful and its interpretation allows us to understand the relationship between the variables Y and X, since it is of the form
Y = Xβ_k + ε,
the noise intensity being measured by Σ_k.
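To make the clustering viewpoint concrete, the following sketch (a minimal illustration, not the implementation used here) computes the posterior probabilities τ_{i,k}(ξ̂) and the MAP assignments from given parameter estimates, with diagonal covariance matrices; all variable names are hypothetical.

```python
# Minimal sketch: MAP clustering from a fitted Gaussian regression mixture.
# The parameters (pi_hat, beta_hat, Sigma_hat) are assumed to be already estimated.
import numpy as np
from scipy.stats import multivariate_normal

def map_clustering(X, Y, pi_hat, beta_hat, Sigma_hat):
    """X: (n, p) regressors, Y: (n, q) responses,
    pi_hat: (K,), beta_hat: (K, q, p), Sigma_hat: (K, q) diagonal variances."""
    n, K = X.shape[0], len(pi_hat)
    log_tau = np.empty((n, K))
    for k in range(K):
        mean_k = X @ beta_hat[k].T                      # (n, q) conditional means
        log_tau[:, k] = np.log(pi_hat[k]) + multivariate_normal.logpdf(
            Y - mean_k, mean=np.zeros(Y.shape[1]), cov=np.diag(Sigma_hat[k]))
    log_tau -= log_tau.max(axis=1, keepdims=True)       # numerical stability
    tau = np.exp(log_tau)
    tau /= tau.sum(axis=1, keepdims=True)                # posterior probabilities tau_{i,k}
    return tau, tau.argmax(axis=1)                       # MAP assignments k_hat_i
```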
Parameters are estimated by the Lasso-MLE procedure, which is described in detail in [Dev14c] and theoretically validated in [Dev14b]. To overcome the high-dimension issue, we use the Lasso estimator on the regression parameters and we restrict the covariance matrix to be diagonal. To avoid shrinkage, we estimate the parameters by the maximum likelihood estimator on the relevant variables selected by the Lasso estimator. Rather than selecting a single regularization parameter, we recast this issue as a model selection problem, considering a grid of regularization parameters. The indices of the relevant variables for this grid of regularization parameters are denoted by J. Since we also have to estimate the number of components, we compute those models for different numbers of components, belonging to K. In this paper, K = {1, . . . , 8}.
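The following sketch illustrates, for a single component and with hypothetical inputs, the underlying idea of building a model collection from a grid of regularization parameters: the Lasso is only used to select the support, and the parameters are then re-estimated without shrinkage on this support (here by ordinary least squares; the actual Lasso-MLE procedure does this within an EM algorithm for the mixture).

```python
# Sketch of the Lasso-then-refit idea for one component (assumes scikit-learn is available).
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def model_collection_one_component(X, y, lambdas):
    """Return, for each regularization parameter, the selected support and refitted coefficients."""
    collection = []
    for lam in lambdas:
        support = np.flatnonzero(Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_)
        if support.size == 0:
            continue
        refit = LinearRegression().fit(X[:, support], y)   # unshrunk estimate on the support
        collection.append({"lambda": lam, "support": support, "coef": refit.coef_})
    return collection
```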
Among this collection, we can focus on a few models which seem interesting for clustering, depending on which characteristics we want to highlight. We propose to use the slope heuristics to extract potentially interesting models. The selected model minimizes the log-likelihood penalized by 2κ̂D_{(K,J)}/n, where D_{(K,J)} denotes the dimension of the model and κ̂ is obtained by a completely data-driven procedure. In practice, we use the Capushe package, see [BMM12].
In addition to this family of models, we need powerful tools to translate curves into variables. Rather than dealing with the discretization of the load consumption, we project it onto a functional basis to take into account the functional structure. Since we are interested not only in representing the curves in a functional basis, but also in benefiting from a time-scale interpretation of the coefficients, we propose to use a wavelet basis; see [Mal99] for a theoretical approach and [MMOP07] for a practical purpose. To simplify our presentation, we will focus on the Haar basis.
5.3 Typical workflow using the example of the aggregated consumption
5.3.1 General framework
The goal is to cluster electricity consumers using a regression mixture model. We will use the consumption of the eve as regressors, to explain the consumption of the day. Consider the daily consumption, for which we observe 48 points. We project the signal onto the Haar basis at level 4. The signal can be decomposed into an approximation, denoted by A4, and several details, denoted by D4, D3, D2, and D1. We illustrate this in Fig. 5.2 where, in addition to the decomposition into a sum of orthogonal signals on the left, one can find a colored representation of the corresponding wavelet coefficients in the time-scale plane. For an automatic denoising, we remove the details of levels 1 and 2, which correspond to high-frequency components. Two centerings will be considered:
— preprocessing 1: before projecting, we center each signal individually.
— preprocessing 2: we keep only the detail coefficients of levels 4 and 3; here, we also remove the low-frequency approximation.
Depending on the preprocessing, we will get different clusterings.
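As an illustration of this projection step, the following sketch (assuming the PyWavelets package and a daily curve sampled at 48 half-hourly points) performs the level-4 Haar decomposition, drops the high-frequency details D1 and D2, and returns the coefficients kept by each preprocessing; it is a simplified reading of the workflow, not the exact code used for the study.

```python
# Sketch of the wavelet preprocessing (assumes PyWavelets is available).
import numpy as np
import pywt

def haar_features(daily_curve, preprocessing=1):
    """daily_curve: array of 48 half-hourly values for one day."""
    x = np.asarray(daily_curve, dtype=float)
    if preprocessing == 1:
        x = x - x.mean()                       # preprocessing 1: center each signal
    # Level-4 Haar decomposition: [A4, D4, D3, D2, D1]
    A4, D4, D3, D2, D1 = pywt.wavedec(x, "haar", level=4)
    # Automatic denoising: drop the high-frequency details D1 and D2.
    if preprocessing == 2:
        return np.concatenate([D4, D3])        # preprocessing 2: details of levels 4 and 3 only
    return np.concatenate([A4, D4, D3])        # preprocessing 1: keep the approximation as well
```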
We observe the load consumption of n residential customers over a year, denoted by (z_{i,t})_{1≤i≤n, t∈T}. We consider
— Z_t = Σ_{i=1}^{n} z_{i,t}, the aggregated signal,
— Z_d = (Z_t)_{48(d−1)≤t≤48d}, the aggregated signal for the day d,
— z_{i,d} = (z_{i,t})_{48(d−1)≤t≤48d}, the signal of the residential customer i for the day d.

Figure 5.2: Projection of a one-day load consumption onto the Haar basis at level 4. By construction, we get s = A4 + D4 + D3 + D2 + D1. On the left side, the signal is shown together with its reconstructions, the dotted line being preprocessing 1 and the dotted-dashed line being preprocessing 2.
We consider three different ways to analyze this dataset.
The first one considers (Z_d, Z_{d+1})_{1≤d≤338} over time, and the results are easy to interpret. We take this opportunity to develop in detail the steps of the proposed method, from the model to the clusters, via model visualization and interpretation. In the second one, we want to cluster consumers on mean days; working with mean days leads to some stability. The last one is the most difficult, since we consider individual curves (z_{i,d_0}, z_{i,d_0+1})_{1≤i≤n} and classify these individuals for the days (d_0, d_0 + 1).
5.3.2 Cluster days on the residential synchronous curve
In this section, we focus on the residential synchronous curve (Z_t)_{t∈T}. We will illustrate our procedure step by step, and highlight some features of the data. The whole analysis is done with preprocessing 2.
Model selection
Our procedure leads to a model collection, with various numbers of components and various sparsities. Let us explain how to select some interesting models thanks to the slope heuristic. We define
\[
(K(\kappa), J(\kappa)) = \operatorname*{argmin}_{(K,J)}\left\{ -\gamma_n(\hat{s}^{(K,J)}) + 2\kappa D_{(K,J)}/n \right\},
\]
where γn is the log-likelihood function and ŝ^{(K,J)} is the log-likelihood minimizer among the collection S^{(K,J)}. We consider the step function κ ↦ D_{K(κ),J(κ)}, κ̂ being the abscissa which leads to the biggest dimension jump. We select the model (K̂, Ĵ) = (K(2κ̂), J(2κ̂)). To refine this, we can consider the points (D_{(K,J)}, −γn(ŝ^{(K,J)}) + 2κ̂D_{(K,J)}/n)_{(K,J)}, and select some models minimizing this criterion. According to Figs. 5.3 and 5.4, we can consider some values of κ̂ which seem to create big jumps, and several models which seem to minimize the penalized log-likelihood.
Figure 5.3: We select the model m̂ using the slope heuristic.
Figure 5.4: Minimization of the penalized log-likelihood. Interesting models are marked by red squares, the selected one by a green diamond.
Model visualization
Thanks to the model-based clustering, we have constructed a model in each cluster. We can then understand the differences between clusters from the estimations β̂ and Σ̂. We represent them as an image, each coefficient being represented by a pixel. As we consider the linear model Y = Xβ_k in each cluster, rows correspond to regressor coefficients and columns to response coefficients. Diagonal coefficients explain the main part. Figs. 5.5 and 5.6 explain the image construction, whereas Fig. 5.7 displays it for the model selected at the previous step.
Figure 5.5: Representation of the regression matrix β_k for preprocessing 1.
Figure 5.6: Representation of the regression matrix β_k for preprocessing 2.
Figure 5.7: For the selected model, we represent β̂ in each cluster. Absolute values of coefficients are represented by different colormaps, white for 0. Each color represents a cluster.
To highlight differences between clusters, we also plot β̂1 − β̂2. First, we remark that β̂1 and β̂2 are sparse, thanks to the Lasso estimator. Moreover, the main difference between β̂1 and β̂2 is in row 4, columns 1, 2 and 6. We can say that the procedure uses, depending on the cluster, more or less the first coefficient of D3 of X to describe coefficients 1 and 2 of D3 and coefficient 3 of D4 of Y. Fig. 5.11 highlights those differences between clusters.
We represent the covariance matrix in Fig. 5.8. Because we estimate it by a diagonal matrix
in each cluster, we just display the diagonal coefficients. We keep the same scale for all the
clusters, to highlight which clusters are noisier.
Figure 5.8: For the model selected, we represent Σ in each cluster
Model-based clustering
We can compute the posterior probability of each observation. In Fig. 5.9, we represent them by boxplots. The closer they are to 1, the more separated the clusters are.
Figure 5.9: Assignment boxplots per cluster (π1 = 0.69, π2 = 0.31).
In the present case, the two clusters are well defined and the clustering problem is quite easy; but see for example Fig. 5.13, for a different clustering problem, which presents a model with less well-separated assignments.
Clustering
Now, we are able to try to interpret each cluster. In Fig. 5.10, we represent the mean curves of each cluster. We can also use functional descriptive statistics (see [Sha11]). Because the clustering is based on the relationship between a day and its eve, we represent both days.
Figure 5.10: Clustering representation. Each curve is the mean curve of a cluster.
Discussion
With preprocessing 2, we cluster weekdays versus weekend days. The same procedure applied with preprocessing 1 shows the temperature influence: we construct four clusters, two of them gathering weekend days and the two others weekdays, the differences being made according to the temperature. In Fig. 5.11, we summarize the clusters by their mean curves for this second model.
Figure 5.11: Clustering representation. Each curve is the mean curve of a cluster.
In Table 5.1, we summarize both selected models according to the day type.
Interpretation        Mon.   Tue.   Wed.   Thur.  Fri.   Sat.   Sun.
week                  0.88   0.96   0.94   0.98   0.96   0      0
weekend               0.12   0.04   0.06   0.02   0.04   1      1
week, low T.          0.26   0.46   0.46   0.47   0.51   0      0
weekend, low T.       0.1    0.02   0.03   0      0      0.2    0.65
week, high T.         0.64   0.52   0.5    0.53   0.45   0      0
weekend, high T.      0      0      0      0      0.04   0.79   0.35

Table 5.1: For each selected model, we summarize the proportion of each day type in each cluster, together with its interpretation, T standing for the temperature.
According to Table 5.1, the difference between the two preprocessings lies in the temperature influence: when we center the curves before projecting, we translate the whole curves, whereas when we remove the low-frequency approximation, we discard this main feature. Depending on the goal, each preprocessing can be of interest.
5.4 Clustering consumers
Another important approach considered in this paper is to cluster consumers. Before dealing with their daily consumption in Section 5.4.2, which is a hard problem because of the irregularity of the signals, we cluster consumers on mean days in Section 5.4.1.
5.4.1 Cluster consumers on mean days
Define z̃_{i,d} as the mean signal of the residential customer i over all days of type d, for d ∈ {1, . . . , 7}. Then we consider the couples (z̃_{i,d}, z̃_{i,d+1})_{1≤i≤n}.
If we look at the model collection constructed by our procedure for K = {1, . . . , 8}, we always select models with only one component, for every day d. Nevertheless, if we force the model to have several clusters, restricting K to K′ = {2, . . . , 8}, we get some interesting information. All results given here are obtained with preprocessing 2.
For weekday couples (Monday/Tuesday, Tuesday/Wednesday, Wednesday/Thursday, Thursday/Friday), we select models with two clusters having the same means and the same covariance matrices: the one-component model would suffice. The only difference in load consumption is the mean level. This is consistent with the clusterings obtained in Section 5.3.2.
We focus here on Saturday/Sunday, for which there are different interesting clusters, see Fig. 5.12. Remark that we cannot summarize a cluster by its mean because of the high variability. The main differences between those two clusters concern the gap between lunch time and the afternoon, and the Sunday morning. Notice that the large variability over the two days is not explained by our model, whose estimated variability is small, as it only describes the noise in the relationship between a day and its eve.
Figure 5.12: Saturday and Sunday load consumption in each cluster.
On Sunday/Monday, we also get three different clusters. Even if we identify differences in shape, the main difference is still the mean consumption. On Friday/Saturday, we see differences between people who have similar consumptions and people who have a really different behavior.
However, because the selected model again has one component, we think that considering the mean over days for each consumer cancels interesting effects, such as the temperature and seasonal influence.
5.4.2 Cluster consumers on individual curves
One major objective of this work is individual curve clustering. As already pointed out in the introduction, this is a very challenging task that has various applications for smart grid management, going from demand response programs to energy reduction recommendations or household segmentation. We consider the complex situation where an electricity provider has access to very few recent measurements (2 days here) for each individual customer but needs to classify them anyway. That could happen in a competitive electricity market where customers can change their electricity provider at any time without providing their past consumption curves.
First, we focus on two successive days of electricity consumption measurements (Tuesday and Wednesday in winter: January 5th and 6th, 2010) for our 487 residential customers. Note that we choose week days in winter following our experience with electricity data, as those days often bring important information about residential electricity usage (electrical heating, intra-day cycle, tariff effect...).
Selected mixture models
We apply the model-based clustering approach presented in Section 5.2 with preprocessing 1 and obtain two models minimizing the penalized log-likelihood, corresponding to 2 and 5 clusters. Although these two classifications are based on an auto-regression mixture model, we are able to analyze and interpret the clusters in terms of consumption profiles, and we provide below a physical interpretation for the classes. In Fig. 5.13, we plot the proportions in each cluster for both models constructed by our procedure. The first remark is that this problem is harder than the one in Section 5.3.2. Nevertheless, even if there are a lot of outliers, for the model M1 many assignments are well separated. It is obviously less clear for the model M2, with 5 components.
Figure 5.13: Proportions in each cluster for the models constructed by our procedure (M1: π1 = 0.5, π2 = 0.5; M2: π1 = 0.21, π2 = 0.17, π3 = 0.2, π4 = 0.25, π5 = 0.17).
In Fig. 5.14, we plot the regression matrices to highlight differences between clusters. Indeed, those two matrices are different; for example, more variables are needed to describe cluster 2.
Analyzing clusters using classical profiling features
We first represent the main features of the cluster centres (the mean of all individual curves in a cluster). Fig. 5.15 represents the daily mean consumptions of these centres along the year. We clearly see that the two classifications separate customers that have different mean levels of consumption (small, middle and big residential consumers) and different winter/summer consumption ratios, probably due to the amount of electrical heating in house usage. Let us recall that the model-based clustering is done on centered data, so that the mean level discrimination is not straightforward. Schematically, the 2-cluster classification seems to separate big customers with electrical heating from small customers with little electrical heating, whereas the 5-cluster classification separates the small customers with little electrical heating but with peaks in consumption (a flat curve with peaks, probably due to auxiliary heating when the temperature is very low) from the middle customers with electrical heating. The two clusters in the middle customer population do not present any visible differences in this figure. The two big customer clusters have a different winter/summer ratio, probably due to a difference in electrical heating usage.

Figure 5.14: Regression matrix in each cluster for the model with 2 clusters.

Figure 5.15: Daily mean consumptions of the cluster centres along the year for 2 (top) and 5 clusters (bottom).
This analysis is confirmed in Fig. 5.16, where we represent the daily mean consumptions of the cluster centres as a function of the daily mean temperature for the two classifications. Points correspond to the mean consumption of one day, and smooth curves are obtained with P-spline regression. We observe that, for all classes, the temperature effect due to electrical heating starts at around 12°C and that the different clusters have various heating profiles. The 2-cluster classification profiles confirm the observation of Fig. 5.15 that the population is divided into small customers and big customers with electrical heating. Concerning the 5-cluster classification, we clearly see on the small customer profile (purple points/curves) an inflexion at around 0°C (this inflexion is also observed in the small customer cluster of the 2-cluster classification), corresponding for example to an auxiliary heating device effect, or at least to an increase of the consumption of the house at very low temperature. This is also what distinguishes the 2 middle customer classes (blue and green points/curves). The two big customer clusters have similar heating profiles, except that the green cluster corresponds to a higher electrical heating usage.
Figure 5.16: Daily mean consumptions of the cluster centres as a function of the daily mean temperature for 2 (on the left) and 5 clusters (on the right).
Another interesting observation concerns the weekly and daily profiles of the centres. We represent in Fig. 5.17 an average (over time) week of consumption for each centre of the two classifications. For the 2-cluster classification, we see again the difference in average consumption between the big customer cluster profile and the small customer one. They have similar shapes, but the day/night difference and the peak loads (at around 8h, 13h and 18h on weekdays) are more marked. For the 5-cluster curves, even if the weekly profiles are quite similar (no major difference in the weekday/weekend profiles within each cluster), the daily shapes exhibit more differences. The day/night ratio can be very different, as well as small variations along the day, probably related to different tariff options (see [Com11a] for a description of the tariffs).
Figure 5.17: Average (over time) week of consumption for each centre of the two classifications
(2 clusters on the top and 5 on the bottom)
Cross analysis using survey data
To enrich this analysis, we also consider extra information provided by a survey carried out by the Irish Commission for Energy Regulation. We summarize this large amount of information into 10 variables: ResidentialTariffallocation corresponds to the tariff option (see [Com11a], [Com11b]); Socialclass: AB, C1, C2, F in UK demographic classifications; Ownership, whether or not a customer owns his/her house; ResidentialStimulusallocation, the stimulus sent to the customer (see [Com11a], [Com11b]); Built.Year, the year of construction of the building; Heat.Home and Heat.Water, electrical or not; Windows.Doubleglazed, the proportion of double-glazed windows in the house (none, quarter, half, 3 quarters, all); Home.Appliance.White.goods, the number of white goods/major appliances of the household. To relate our clusters to those variables, we consider the classification problem consisting in recovering the model-based classes with a random forest classifier. Random forests, introduced in [Bre01], are a well-known and well-tested non-parametric classification method that has the advantage of being quite automatic and easy to calibrate. In addition, they provide a nice summary of the previous covariates in terms of their importance for classification. In Fig. 5.18 we represent the out-of-bag error of the random forest classifiers as a function of the number of trees (one major parameter of the random forest classifier) for the two clusterings. This is a good estimate of what the classification error would be on an independent dataset. We have observed that, choosing a sufficiently large number of trees for the forest (300), the classification error rate reaches 40% in the 2-cluster case and 75% in the 5-class case, which has to be compared with a random classifier, which has respectively a 50% and an 80% error rate. This means that the 10 previous covariates provide little, but some, information about the clusters.
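As an illustration, the following sketch (a hypothetical setup assuming scikit-learn, with `survey_X` the 10 encoded survey covariates and `cluster` the model-based labels) fits such a random forest and returns the out-of-bag error together with the covariate importances.

```python
# Sketch: recovering the model-based clusters from survey covariates with a random forest
# (assumes scikit-learn; `survey_X` and `cluster` are hypothetical, already-encoded arrays).
from sklearn.ensemble import RandomForestClassifier

def survey_cross_analysis(survey_X, cluster, n_trees=300):
    forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=0)
    forest.fit(survey_X, cluster)
    oob_error = 1.0 - forest.oob_score_          # out-of-bag estimate of the error rate
    importances = forest.feature_importances_    # importance of each survey covariate
    return oob_error, importances
```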
Figure 5.18: Out-of-bag error of the random forest classifiers as a function of the number of trees.
Quantifying the importance of each survey covariate in the classifications, we observe that in both cases Home.Appliance.White.goods and Socialclass play a major role. That can be explained by the fact that those covariates discriminate small and big customers. Another interesting point is that, in the 5-cluster classification, the variable Built.Year plays an important role, which is probably related to different heating profiles explained by different insulation standards. That could explain the two big customer clusters. Then come the tariff options which, in the 5-cluster case, could explain the different daily shapes of Fig. 5.17.
It is noteworthy that these two classifications provide clusters that exhibit such a nice physical interpretation, considering that we only use two days of winter consumption for each customer.
Using model on one-day shifted data
To highlight the advantages of our procedure, we compare predictions of the consumption of Thursday, January 7th, 2010. Indeed, even if the method is not designed for forecasting purposes, we want to show that model-based clustering is also an interesting tool for prediction. We compare linear models estimated on the couple Tuesday, January 5th / Wednesday, January 6th, and we predict Thursday from Wednesday. This is suggested by the clustering obtained in Section 5.3.2, showing that transitions between weekdays are similar. We then compare the following models. The first and most common one is the linear model, without clustering. The second model is the first one constructed by our procedure, described before, with 2 components. Moreover, we can use the clustering obtained by the models constructed by our procedure, but estimate the parameters without variable selection, using a full linear model in each component. We restrict our study here to one model to narrow the analysis, but everything is also computable with the 5-cluster models.
For each consumer i and each prediction procedure, we compute two prediction errors: the RMSE of the Thursday prediction, and the RMSE of the Wednesday prediction. Recall that the RMSE, for a consumer i, is defined by
\[
RMSE(i) = \sqrt{\frac{1}{48}\sum_{t=1}^{48}\bigl(\hat{z}_{i,t} - z_{i,t}\bigr)^2}.
\]
Figure 5.19: RMSE of the Thursday prediction for each procedure (1 cluster, LMLE 2 clusters, LMLE 2 clusters bis) over all consumers.
We remark that, even if the RMSEs are almost the same for the three different models, the one estimated by our procedure leads to a smaller median and interquartile range. For the three considered models, the median of the RMSE of the Wednesday prediction (learning sample) and of the RMSE of the Thursday prediction (test sample) are close to each other, which means that the clustering remains good even for one-day shifted data, of course as long as we remain within the class of working days, according to Section 5.3.2. To support this remark, we also run our procedure on the couple Wednesday/Thursday. We then select three different models, and the corresponding clusterings are quite similar to the clusterings given by the models in Section 5.4.2.
Remarks on similar analyses
Alternatively, we perform the same analysis on two successive weekdays of electricity consumption measurements in summer. We obtain three models, corresponding to 2, 3 and 5 clusters respectively. We compute, as in Section 5.4.2, the daily mean consumptions of the cluster centres along the year and as a function of the daily mean temperature. The main difference concerns the inflexion at around 0°C: because the clustering is done on summer days, we do not distinguish cold effects; moreover, there are no cooling effects. We remark again that the clusterings are hierarchical, but different from those obtained in the winter study, as expected.
Figure 5.20: Daily mean consumptions of the cluster centres as a function of the daily mean temperature for 5 clusters, clustering done by observing Thursday and Wednesday in summer.
We also study two successive weekend days of electricity consumption, in winter and in summer. We recognize different clusters, depending on the behavior of the consumers. We work with Friday/Saturday couples. The main thing we observe in summer is a cluster with no consumption, corresponding to consumers who leave their home. It could be useful to predict the Sunday consumption, but it does not generalize to other weekends.
more general for other weekend.
1
Consumption (kW)
0.8
0.6
0.4
0.2
0
0
50
100
150
Day
200
250
300
350
Figure 5.21: Daily mean consumptions of the cluster centres along the year for 3 clusters,
clusering done on weekend observation
5.5 Discussion and conclusion
Massive information about individual (household, small and medium enterprise) consumption is now provided by new metering technologies and smart grids. Two major exploitations of individual data are load profiling and forecasting at different scales of the grid. Customer segmentation based on load classification is a natural approach for that and is a prolific line of research. We propose here a new methodology based on high-dimensional regression models.
The novelty of our approach is that we focus on uncovering clusters corresponding to different regression models, which can then be exploited for profiling as well as forecasting. We focus on profiling and show how, exploiting only a few temporal measurements of the consumption of 500 residential customers, we can derive informative clusters. We provide some encouraging elements about how to exploit these models and clusters for bottom-up forecasting.
Bibliography
[Aka74]
H. Akaike. A new look at the statistical model identification. IEEE Transactions
on Automatic Control, 19(6):716–723, 1974.
[Ald96]
M. Aldrin. Moderate projection pursuit regression for multivariate response data.
Computational Statistics & Data Analysis, 21(5):501–531, 1996.
[And51]
T. W. Anderson. Estimating Linear Restrictions on Regression Coefficients
for Multivariate Normal Distributions. The Annals of Mathematical Statistics,
22(3):327–351, 1951.
[AS13]
C. Alzate and M. Sinn. Improved electricity load forecasting via kernel spectral
clustering of smart meters. In 2013 IEEE 13th International Conference on Data
Mining, Dallas, TX, USA, December 7-10, 2013, pages 943–948, 2013.
[ASS98]
C.W. Anderson, E.A. Stolz, and S. Shamsunder. Multivariate autoregressive models for classification of spontaneous electroencephalographic signals during mental
tasks. IEEE Transactions on Biomedical Engineering, 45(3):277–286, 1998.
[Bah58]
R. R. Bahadur. Examples of inconsistency of maximum likelihood estimates.
Sankhya: The Indian Journal of Statistics (1933-1960), 20(3/4):207–210, 1958.
[Bau09]
J-P Baudry. Model Selection for Clustering. Choosing the Number of Classes.
Ph.D. thesis, Université Paris-Sud 11, 2009.
[BC11]
A. Belloni and V. Chernozhukov. ℓ1 -penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130, 2011.
[BC13]
A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547, 2013.
[BCG00]
C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering
with the integrated completed likelihood. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 22(7):719–725, July 2000.
[BCG03]
C. Biernacki, G. Celeux, and G. Govaert. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models.
Computational Statistics & Data Analysis, 41(3-4):561–575, 2003.
[BGH09]
Y. Baraud, C. Giraud, and S. Huet. Gaussian model selection with an unknown
variance. The Annals of Statistics, 37(2):630–672, 2009.
[BKM04]
E. Brown, R. Kass, and P. Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience, 7(5):456–461, 2004.
[BLM13]
S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
[BM93]
L. Birgé and P. Massart. Rates of convergence for minimum contrast estimators.
Probability Theory and Related Fields, 97(1-2):113–150, 1993.
[BM01]
L. Birgé and P. Massart. Gaussian model selection. Journal of the European
Mathematical Society, 3(3):203–268, 2001.
[BM07]
L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab.
Theory Related Fields, 138(1-2), 2007.
[BMM12]
J-P Baudry, C. Maugis, and B. Michel. Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455–470, 2012.
[Bre01]
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[BRT09]
P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig
selector. The Annals of Statistics, 37(4):1705–1732, 2009.
[BSW11]
F. Bunea, Y. She, and M. Wegkamp. Optimal selection of reduced rank estimators
of high-dimensional matrices. The Annals of Statistics, 39(2):1282–1309, 2011.
[BSW12]
F. Bunea, Y. She, and M. Wegkamp. Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. The Annals of Statistics,
40(5):2359–2388, 2012.
[Bun08]
F. Bunea. Consistent selection via the Lasso for high dimensional approximating
regression models. In Pushing the limits of contemporary statistics: contributions
in honor of Jayanta K. Ghosh, volume 3 of Inst. Math. Stat. Collect., pages 122–
137. Inst. Math. Statist., Beachwood, OH, 2008.
[BvdG11]
P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods,
Theory and Applications. Springer Series in Statistics. Springer, 2011.
[CDS98]
S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit.
Siam Journal On Scientific Computing, 20:33–61, 1998.
[CG93]
G. Celeux and G. Govaert. Gaussian parsimonious clustering models. Technical
Report RR-2028, INRIA, 1993.
[Cha14]
M. Chaouch. Clustering-based improvement of nonparametric functional time
series forecasting: Application to intra-day household-level load curves. IEEE
Transactions on Smart Grid, 5(1):411–419, 2014.
[CLP11]
S. Cohen and E. Le Pennec. Conditional density estimation by penalized likelihood
model selection and applications. Research Report RR-7596, 2011.
[CO14]
A. Ciarleglio and T. Ogden. Wavelet-based scalar-on-function finite mixture regression models. Computational Statistics & Data Analysis, 2014.
[Com11a]
Commission for energy regulation, Dublin. Electricity smart metering customer
behaviour trials findings report. 2011.
[Com11b]
Commission for Energy Regulation, Dublin. Results of electricity cost-benefit analysis, customer behaviour trials and technology trials. 2011.
[CT07]
E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is
much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
[Dau92]
I. Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA, 1992.
[Dev14a]
E. Devijver. An ℓ1 -oracle inequality for the Lasso in finite mixture of multivariate Gaussian regression models, 2014. arXiv:1410.4682.
[Dev14b]
E. Devijver. Finite mixture regression: A sparse variable selection by model selection for clustering, 2014. arXiv:1409.1331.
[Dev14c]
E. Devijver. Model-based clustering for high-dimensional data. Application to
functional data, 2014. arXiv:1409.1333.
[Dev15]
E. Devijver. Joint rank and variable selection for parsimonious estimation in high-dimension finite mixture regression model, 2015. arXiv:1501.00442.
[DJ94]
D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage.
Biometrika, 81:425–455, 1994.
[DLR77]
A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Discussion. Journal of the Royal Statistical
Society. Series B, 39:1–38, 1977.
[EHJT04]
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The
Annals of Statistics, 32(2):407–499, 2004.
[FHP03]
K.J. Friston, L. Harrison, and W.D. Penny. Dynamic Causal Modelling. NeuroImage, 19(4):1273–1302, 2003.
[FP04]
J. Fan and H. Peng. Nonconcave penalized likelihood with a diverging number of
parameters. The Annals of Statistics, 32(3):928–961, 2004.
[FR00]
C. Fraley and A. Raftery. Model-based clustering, discriminant analysis, and
density estimation. Journal of the American Statistical Association, 97:611–631,
2000.
[FV06]
F. Ferraty and P. Vieu. Nonparametric functional data analysis: theory and
practice. Springer series in statistics. Springer, New York, 2006.
[Gir11]
C. Giraud. Low rank multivariate regression. Electronic Journal of Statistics,
5:775–799, 2011.
[GLMZ10]
J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for
high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.
[GW00]
C. Genovese and L. Wasserman. Rates of convergence for the Gaussian mixture
sieve. Annals of Statistics, 28(4):1105–1127, 2000.
[H+ 13]
L. Hancher et al. Think topic 11: ‘Shift, not drift: Towards active demand response
and beyond’. 2013.
[HK70]
A. Hoerl and R. Kennard. Ridge Regression: Biased Estimation for Nonorthogonal
Problems. Technometrics, 12(1):55–67, 1970.
[HTF01]
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
[Irw86]
G.W. Irwin. Statistical electricity demand modelling from consumer billing data.
IEE Proceedings C (Generation, Transmission and Distribution), 133:328–335(7),
September 1986.
[Ize75]
A. Izenman. Reduced-rank regression for the multivariate linear model. Journal
of Multivariate Analysis, 5(2):248–264, 1975.
[KFR14]
J. Kwac, J. Flora, and R. Rajagopal. Household energy consumption segmentation
using hourly data. IEEE Transactions on Smart Grid, 5(1):420–430, 2014.
[LSD15]
W. Labeeuw, J. Stragier, and G. Deconinck. Potential of active demand reduction
with residential wet appliances: A case study for Belgium. IEEE Transactions on Smart Grid, 6(1):315–323, January 2015.
[Mal73]
C. L. Mallows. Some comments on Cp. Technometrics, 15:661–675, 1973.
[Mal89]
S. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence,
11:674–693, 1989.
[Mal99]
S. Mallat. A wavelet tour of signal processing. Academic Press, 1999.
[Mas07]
P. Massart. Concentration inequalities and model selection. Lecture Notes in Mathematics. École d'Été de Probabilités de Saint-Flour XXXIII, 2003. Springer, 2007.
[MB88]
G.J. McLachlan and K.E. Basford. Mixture Models: Inference and Applications
to Clustering. Marcel Dekker, New York, 1988.
[MB06]
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection
with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
[MB10]
N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
[Mey12]
C. Meynet. Sélection de variables pour la classification non supervisée en grande
dimension. Ph.D. thesis, Université Paris-Sud 11, 2012.
[Mey13]
C. Meynet. An ℓ1 -oracle inequality for the Lasso in finite mixture Gaussian regression models. ESAIM: Probability and Statistics, 17:650–671, 2013.
[MK97]
G.J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley
& Sons, New York, 1997.
[MM11a]
P. Massart and C. Meynet. The Lasso as an ℓ1 -ball model selection procedure.
Electronic Journal of Statistics, 5:669–687, 2011.
[MM11b]
C. Maugis and B. Michel. A non asymptotic penalized criterion for Gaussian
mixture model selection. ESAIM Probab. Stat., 15:41–68, 2011.
[MMOP04]
M. Misiti, Y. Misiti, G. Oppenheim, and J-M. Poggi. Matlab Wavelet Toolbox
User’s Guide. Version 3. The Mathworks, Inc., Natick, MA., 2004.
[MMOP07]
M. Misiti, Y. Misiti, G. Oppenheim, and J-M Poggi. Clustering signals using
wavelets. In Francisco Sandoval, Alberto Prieto, Joan Cabestany, and Manuel
Graña, editors, Computational and Ambient Intelligence, volume 4507 of Lecture
Notes in Computer Science, pages 514–521. Springer Berlin Heidelberg, 2007.
[MMOP10]
M. Misiti, Y. Misiti, G. Oppenheim, and J-M Poggi. Optimized clusters for disaggregated electricity load forecasting. REVSTAT, 8(2):105–124, 2010.
[MMR12]
C. Meynet and C. Maugis-Rabusseau. A sparse variable selection procedure in
model-based clustering. Research report, September 2012.
[MP04]
G. McLachlan and D. Peel. Finite Mixture Models. Wiley series in probability
and statistics: Applied probability and statistics. Wiley, 2004.
[MS14]
Z. Ma and T. Sun. Adaptive sparse reduced-rank regression, 2014. arXiv:1403.1922.
[MY09]
N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for
high-dimensional data. The Annals of Statistics, 37(1):246–270, 2009.
[OPT99]
M. Osborne, B. Presnell, and B. Turlach. On the lasso and its dual. Journal of
Computational and Graphical Statistics, 9:319–337, 1999.
[PC08]
T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical
Association, 103(482):681–686, 2008.
[PS07]
W. Pan and X. Shen. Penalized model-based clustering with application to variable
selection. Journal of Machine Learning Research, 8:1145–1164, 2007.
[Ran71]
W. Rand. Objective criteria for the evaluation of clustering methods. Journal of
the American Statistical Association, 66(336):846–850, 1971.
[RD06]
A. Raftery and N. Dean. Variable selection for model-based clustering. Journal
of the American Statistical Association, 101:168–178, 2006.
[RS05]
J. O. Ramsay and B. W. Silverman. Functional data analysis. Springer series in
statistics. Springer, New York, 2005.
[RT11]
P. Rigollet and A. Tsybakov. Exponential screening and optimal rates of sparse
estimation. The Annals of Statistics, 39(2):731–771, 2011.
[SBG10]
N. Städler, P. Bühlmann, and S. Van de Geer. ℓ1 -penalization for mixture regression models. Test, 19(2):209–256, 2010.
[SBvdGR10] N. Städler, P. Bühlmann, and S. van de Geer. Rejoinder: comments on ℓ1 -penalization for mixture regression models. Test, 19(2):209–256, 2010.
[Sch78]
G. Schwarz. Estimating the Dimension of a Model. The Annals of Statistics,
6(2):461–464, 1978.
[SFHT13]
N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.
[Sha11]
H-L Shang. rainbow: An R Package for Visualizing Functional Time Series. The R Journal, 3(2):54–59, December 2011.
[SWF12]
W. Sun, J. Wang, and Y. Fang. Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electronic Journal of Statistics,
6:148–167, 2012.
[SZ12]
T. Sun and C-H Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898,
2012.
[Tib96]
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B., 58(1):267–288, 1996.
[TMZT06]
A. Thalamuthu, I. Mukhopadhyay, X. Zheng, and G. Tseng. Evaluation and
comparison of gene clustering methods in microarray analysis. Bioinformatics,
22(19):2405–2412, 2006.
[TSM85]
D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical analysis of finite
mixture distributions. Wiley series in probability and mathematical statistics:
Applied probability and statistics. Wiley, 1985.
[TSR+ 05]
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and
smoothness via the fused lasso. Journal of the Royal Statistical Society Series B,
pages 91–108, 2005.
[US 06]
US Department of Energy. Benefits of demand response in electricity markets and recommendations for achieving them: a report to the United States Congress pursuant to Section 1252 of the Energy Policy Act of 2005. 122 pages, February 2006.
[Vap82]
V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1982.
[vdG13]
S. van de Geer. Generic chaining and the ℓ1 -penalty. Journal of Statistical Planning
and Inference, 143(6):1001 – 1012, 2013.
[vdGB09]
S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results
for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
[vdGBRD14] S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically
optimal confidence regions and tests for high-dimensional models. The Annals of
Statistics, 42(3):1166–1202, 06 2014.
[vdGBZ11]
S. van de Geer, P. Bühlmann, and S. Zhou. The adaptive and the thresholded lasso
for potentially misspecified models (and a lower bound for the lasso). Electronic
Journal of Statistics, 5:688–749, 2011.
[vdVW96]
AW van der Vaart and J. Wellner. Weak Convergence and Empirical Processes:
With Applications to Statistics. Springer Series in Statistics. Springer, 1996.
[YB99]
Y. Yang and A. Barron. Information-theoretic determination of minimax rates of
convergence. The Annals of Statistics, 27(5):1564–1599, 1999.
[YFL11]
F. Yao, Y. Fu, and T. Lee. Functional mixture regression. Biostatistics, 12(2):341–353, 2011.
[YLL06]
M. Yuan and Y. Lin. Model selection and estimation in regression with
grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67,
2006.
[YLL12]
M-S Yang, C-Y Lai, and C-Y Lin. A robust EM clustering algorithm for Gaussian
mixture models. Pattern Recognition, 45(11):3950–3961, 2012.
[ZH05]
H. Zou and T. Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
[ZH08]
C-H Zhang and J. Huang. The sparsity and bias of the LASSO selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.
[Zha10]
C-H Zhang. Nearly unbiased variable selection under minimax concave penalty.
The Annals of Statistics, 38(2):894–942, 2010.
[ZHT07]
H. Zou, T. Hastie, and R. Tibshirani. On the “degrees of freedom” of the lasso.
The Annals of Statistics, 35(5):2173–2192, 2007.
[ZOR12]
Y. Zhao, T. Ogden, and P. Reiss. Wavelet-Based LASSO in Functional Linear
Regression. Journal of Computational and Graphical Statistics, 21(3):600–617,
2012.
[Zou06]
H. Zou. The adaptive lasso and its oracle properties. Journal of the American
Statistical Association, 101(476):1418–1429, 2006.
[ZPS09]
H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.
[ZY06]
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine
Learning Research, 7:2541–2563, 2006.
[ZYS13]
K-L Zhou, S-L Yang, and C. Shen. A review of electric load classification in smart
grid environment. Renewable and Sustainable Energy Reviews, 24(0):103 – 110,
2013.
[ZZ10]
N. Zhou and J. Zhu. Group variable selection via a hierarchical lasso and its oracle
property. Statistics and Its Interface, 3:557–574, 2010.