Université Paris-Sud
École Doctorale de mathématiques de la Région Paris-Sud
Laboratoire de Mathématiques d'Orsay

THÈSE présentée pour obtenir le grade de docteur en sciences de l'Université Paris-Sud
Discipline : Mathématiques
par Emilie Devijver

Modèles de mélange pour la régression en grande dimension, application aux données fonctionnelles.

Soutenue le 02 juillet 2015 devant la commission d'examen :
— Francis Bach (INRIA Paris-Rocquencourt), Examinateur
— Christophe Biernacki (Université de Lille 1), Rapporteur
— Yannig Goude (EDF R&D), Examinateur
— Jean-Michel Loubes (Université de Toulouse), Rapporteur
— Pascal Massart (Université Paris-Sud), Directeur de thèse
— Jean-Michel Poggi (Université Paris-Sud), Directeur de thèse

Thèse préparée sous la direction de Pascal Massart et de Jean-Michel Poggi au Département de Mathématiques d'Orsay, Laboratoire de Mathématiques (UMR 8625), Bât. 425, Université Paris-Sud, 91405 Orsay Cedex.

Modèles de mélange pour la régression en grande dimension, application aux données fonctionnelles.

Les modèles de mélange pour la régression sont utilisés pour modéliser la relation entre la réponse et les prédicteurs, pour des données issues de différentes sous-populations. Dans cette thèse, on étudie des prédicteurs de grande dimension et une réponse de grande dimension. Tout d'abord, on obtient une inégalité oracle ℓ1 satisfaite par l'estimateur du Lasso. On s'intéresse à cet estimateur pour ses propriétés de régularisation ℓ1. On propose aussi deux procédures pour pallier ce problème de classification en grande dimension. La première procédure utilise l'estimateur du maximum de vraisemblance pour estimer la densité conditionnelle inconnue, en se restreignant aux variables actives sélectionnées par un estimateur de type Lasso. La seconde procédure considère la sélection de variables et la réduction de rang pour diminuer la dimension. Pour chaque procédure, on obtient une inégalité oracle, qui explicite la pénalité nécessaire pour sélectionner un modèle proche de l'oracle. On étend ces procédures au cas des données fonctionnelles, où les prédicteurs et la réponse peuvent être des fonctions. Dans ce but, on utilise une approche par ondelettes. Pour chaque procédure, on fournit des algorithmes, et on applique et évalue nos méthodes sur des simulations et des données réelles. En particulier, on illustre la première méthode par des données de consommation électrique.

Mots-clés : modèles de mélange en régression, classification non supervisée, grande dimension, sélection de variables, sélection de modèles, inégalité oracle, données fonctionnelles, consommation électrique, ondelettes.

High-dimensional mixture regression models, application to functional data.

Finite mixture regression models are useful for modeling the relationship between a response and predictors, arising from different subpopulations. In this thesis, we focus on high-dimensional predictors and a high-dimensional response. First of all, we provide an ℓ1-oracle inequality satisfied by the Lasso estimator. We focus on this estimator for its ℓ1-regularization properties rather than for the variable selection procedure. We also propose two procedures to deal with this issue. The first procedure leads to estimate the unknown conditional mixture density by a maximum likelihood estimator, restricted to the relevant variables selected by an ℓ1-penalized maximum likelihood estimator. The second procedure considers jointly predictor selection and rank reduction for obtaining lower-dimensional approximations of parameter matrices.
For each procedure, we get an oracle inequality, which derives the penalty shape of the criterion, depending on the complexity of the random model collection. We extend these procedures to the functional case, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms, apply and evaluate our methods both on simulations and real datasets. In particular, we illustrate the first procedure on an electricity load consumption dataset.

Keywords: mixture regression models, clustering, high dimension, variable selection, model selection, oracle inequality, functional data, electricity consumption, wavelets.

À mon grand-père.

Contents

Résumé
Remerciements
Introduction
Notations
1 Two procedures
  1.1 Introduction
  1.2 Gaussian mixture regression models
  1.3 Two procedures
  1.4 Illustrative example
  1.5 Functional datasets
  1.6 Conclusion
  1.7 Appendices
2 An ℓ1-oracle inequality for the Lasso in finite mixture of multivariate Gaussian regression models
  2.1 Introduction
  2.2 Notations and framework
  2.3 Oracle inequality
  2.4 Proof of the oracle inequality
  2.5 Proof of the theorem according to T or T^c
  2.6 Some details
3 An oracle inequality for the Lasso-MLE procedure
  3.1 Introduction
  3.2 The Lasso-MLE procedure
  3.3 An oracle inequality for the Lasso-MLE model
  3.4 Numerical experiments
  3.5 Tools for proof
  3.6 Appendix: technical results
4 An oracle inequality for the Lasso-Rank procedure
  4.1 Introduction
  4.2 The model and the model collection
  4.3 Oracle inequality
  4.4 Numerical studies
  4.5 Appendices
5 Clustering electricity consumers using high-dimensional regression mixture models
  5.1 Introduction
  5.2 Method
  5.3 Typical workflow using the example of the aggregated consumption
  5.4 Clustering consumers
  5.5 Discussion and conclusion

List of Figures

1 Exemple de données simulées
2 Illustration de l'estimateur du Lasso
3 Saut de dimension
4 Heuristique des pentes
1.1 Number of FR and TR
1.2 Zoom in on number of FR and TR
1.3 Slope graph for our Lasso-Rank procedure
1.4 Slope graph for our Lasso-MLE procedure
1.5 Boxplot of the Kullback-Leibler divergence
1.6 Boxplot of the ARI
1.7 Boxplot of the Kullback-Leibler divergence
1.8 Boxplot of the ARI
1.9 Boxplot of the Kullback-Leibler divergence
1.10 Boxplot of the ARI
1.11 Boxplot of the Kullback-Leibler divergence
1.12 Boxplot of the ARI
1.13 Boxplot of the ARI
1.14 Plot of the 70-sample of half-hour load consumption, on the two days
1.15 Plot of a week of load consumption
1.16 Summarized results for the model 1
1.17 Summarized results for the model 1
3.1 Boxplot of the Kullback-Leibler divergence
3.2 Boxplot of the Kullback-Leibler divergence
3.3 Boxplot of the ARI
3.4 Summarized results for the model 1
3.5 Summarized results for the model 1
5.1 Load consumption of a sample of 5 consumers over a week in winter
5.2 Projection of a load consumption for one day into Haar basis, level 4. By construction, we get s = A4 + D4 + D3 + D2 + D1. On the left side, the signal is considered with reconstruction of dataset, the dotted being preprocessing 1 and the dotted-dashed being the preprocessing 2
5.3 We select the model m̂ using the slope heuristic
5.4 Minimization of the penalized log-likelihood. Interesting models are branded by red squares, the selected one by green diamond
5.5 Representation of the regression matrix βk for the preprocessing 1
5.6 Representation of the regression matrix βk for the preprocessing 2
5.7 For the selected model, we represent β̂ in each cluster
5.8 For the model selected, we represent Σ in each cluster
5.9 Assignment boxplots per cluster
5.10 Clustering representation. Each curve is the mean in each cluster
5.11 Clustering representation. Each curve is the mean in each cluster
5.12 Saturday and Sunday load consumption in each cluster
5.13 Proportions in each cluster for models constructed by our procedure
5.14 Regression matrix in each cluster for the model with 2 clusters
5.15 Daily mean consumptions of the cluster centres along the year for 2 (top) and 5 clusters (bottom)
5.16 Daily mean consumptions of the cluster centres in function of the daily mean temperature for 2 (on the left) and 5 clusters (on the right)
5.17 Average (over time) week of consumption for each centre of the two classifications (2 clusters on the top and 5 on the bottom)
5.18 Out of bag error of the random forest classifiers in function of the number of trees
5.19 RMSE on Thursday prediction for each procedure over all consumers
5.20 Daily mean consumptions of the cluster centres in function of the daily mean temperature for 5 clusters, clustering done by observing Thursday and Wednesday in summer
5.21 Daily mean consumptions of the cluster centres along the year for 3 clusters, clustering done on weekend observation

Introduction

Building classes in order to model a sample better is a classical approach in statistics. The practical example developed in this thesis is the clustering of electricity consumers in Ireland. Usually, individuals are clustered in an estimation problem, but here, to emphasize the predictive aspect, we cluster the observations that share the same type of behaviour from one day to the next. This is a finer clustering, which acts on homoscedastic regressors. In our example, consumers are clustered through a regression model of one day on their consumption of the previous day. In this thesis, this setting has been developed from a methodological point of view, by proposing two clustering procedures for regression mixture models that cope with the high-dimensional issue, but also from a theoretical point of view, by justifying the model selection step of our procedures, and from a practical point of view, by working on the real dataset of electricity consumption in Ireland. In this introduction, we present the notions developed in this thesis.

Linear regression

Regression methods aim at finding the link between two random variables X and Y. The variable X represents the regressors, the explanatory variables, while Y describes the response.
The Gaussian linear model, the simplest regression model, assumes that Y depends on X in a linear way, up to a Gaussian noise. More formally, if X and Y are two random vectors, X ∈ R^p and Y ∈ R, studying a Gaussian linear model on (X, Y) amounts to finding β ∈ R^p such that
\[
Y = \beta X + \epsilon \qquad (1)
\]
where ε ∼ N(0, σ²), the variance σ² being known or to be estimated depending on the case. The study of this model is, to this day, fairly complete. Let (x, y) = ((x_1, y_1), ..., (x_n, y_n)) be a sample. If the sample is large enough, a consistent estimator (that is, one converging to the true value) is known: the least squares estimator (which coincides in this case with the maximum likelihood estimator). We denote this estimator by (β̂, σ̂²), where
\[
\hat\beta = (x^t x)^{-1} x^t y \ ; \qquad \hat\sigma^2 = \frac{\|y - x\hat\beta\|^2}{n-p}.
\]
The law of this estimator is known: β̂ ∼ N(β, σ²(x^t x)^{-1}) and (n − p)σ̂²/σ² ∼ χ²_{n−p}, which allows us to derive confidence intervals for each parameter.
This model can be generalized to a multivariate variable Y ∈ R^q. In this case, we observe a sample (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n, and we construct an estimator (β̂, Σ̂) ∈ R^{q×p} × S_q^{++}, where
\[
\hat\beta = (x^t x)^{-1} x^t y \ ; \qquad
\hat\Sigma = \left( \frac{\langle (y - x\hat\beta)_{l_1}, (y - x\hat\beta)_{l_2} \rangle}{n-p} \right)_{\substack{1 \leq l_1 \leq q \\ 1 \leq l_2 \leq q}} \ ; \qquad (2)
\]
and S_q^{++} is the set of symmetric positive definite matrices of size q.
This regression model is used, for instance, to predict new values. If we know the model underlying our data, that is, if β̂ and Σ̂ have been determined from a sample ((x_1, y_1), ..., (x_n, y_n)), and if a new x_{n+1} is observed, we can compute ŷ_{n+1} = β̂ x_{n+1}. In that case ŷ_{n+1} is the so-called predicted value.
If we are interested in the pair of random variables (X, Y), we can estimate the density of the pair, but we can also study the conditional density. It is the latter quantity that we described by a linear model in (1). The covariates can also have different statuses: either they are fixed, deterministic, or they are random, in which case we work conditionally on the underlying distribution. In this thesis, we are interested in the conditional distribution, for fixed or random regressors.
However, this linear model assumption is very restrictive: if we observe a sample (x, y) = ((x_1, y_1), ..., (x_n, y_n)), we assume that every y_i depends on x_i in the same way, up to a noise term, for i ∈ {1, ..., n}. If the model is well suited to the data, the noise will be small, which means that the coefficients of the covariance matrix Σ will be small. However, many datasets cannot be well summarized by a linear model.

Mixture regression models

To refine this model, we can choose to build several classes, and let our estimators β̂ and Σ̂ depend on the class. More formally, we study a mixture regression model of K Gaussians: if we observe a sample (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n, and if we assume that observation i belongs to class k, then there exist β_k and Σ_k such that
\[
y_i = \beta_k x_i + \epsilon_i
\]
where ε_i ∼ N(0, Σ_k). In this thesis, we consider a mixture with a finite number K of classes.
Mixture models are traditionally studied in estimation problems (let us cite for instance McLachlan and Basford, [MB88], and McLachlan and Peel, [MP04], two founding books on mixture models, Fraley and Raftery, [FR00], for a state of the art, or Celeux and Govaert, [CG93], for the study of a mixture density with a clustering purpose). The main idea is to estimate the unknown density s* by a mixture of classical densities (s_{θ_k})_{1≤k≤K}: we can then write
\[
s^* = \sum_{k=1}^{K} \pi_k s_{\theta_k}, \qquad \sum_{k=1}^{K} \pi_k = 1.
\]
In this thesis, we are interested in a mixture of Gaussian linear models with constant weights, so the densities s_{θ_k} are conditional Gaussian densities.
Several ideas arise with mixture models, in regression or not. If the parameters of the model are known, we can compute, for each observation, the posterior probability of belonging to each class. According to the maximum a posteriori principle (denoted MAP), we can then assign a class to each observation. To do so, we compute the posterior probability of each observation belonging to each class, and we assign the observations to the most probable classes. Each observation can thus be characterized by the parameters of the conditional density associated with its class.
In the supervised classification framework, the assignment of each observation is known, and we seek to understand how the classes are formed in order to classify a new observation; in semi-supervised classification, some assignments are known, and we seek to understand the model (often, knowing the assignments of the observations is very costly); in the unsupervised classification (clustering) framework, the assignments are not known at all. In this thesis, we place ourselves in the clustering context, with a regression approach. This has already been considered; let us cite for instance Städler et al. ([SBG10]) or Meynet ([Mey13]), who work with this model for univariate responses Y.
We observe pairs (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n, and we want to build classes, in order to group the observations (x_i, y_i), for i ∈ {1, ..., n}, that share the same relation between y_i and x_i. On some datasets, this approach seems natural, and the number of classes is known.

Figure 1 – Example of simulated data for which the regression classes can be understood from the graphical representation. Here one would like to build 3 classes, represented by the different symbols: ∗ for class 1, ♦ for class 2, □ for class 3. The data are one-dimensional, (X, Y) ∈ R × R. The observations x are drawn from a standard Gaussian distribution, and y is simulated according to a mixture of Gaussians with means β_k x and variance 0.25, where β = [−1, 0.1, 3].

Nevertheless, on an arbitrary dataset, the goal is mostly to better understand the relation between the random variables Y and X by grouping the observations that share the same dependence between Y and X. The group structure is not necessarily clearly induced by the data, and the number of classes worth building is not necessarily known: in some cases, K will have to be estimated.
If we consider the multivariate Gaussian mixture regression model with K classes, we can describe this model with classical statistical tools.
If we assume that the random vectors Y are independent conditionally on X, we consider the conditional density s_ξ^K, where
\[
s_\xi^K : R^q \to R, \qquad
s_\xi^K(y|x) = \sum_{k=1}^{K} \frac{\pi_k}{(2\pi)^{q/2}\det(\Sigma_k)^{1/2}} \exp\left( -\frac{1}{2} (y - \beta_k x)^t \Sigma_k^{-1} (y - \beta_k x) \right) ;
\]
where the parameters to estimate are ξ = (π_1, ..., π_K, β_1, ..., β_K, Σ_1, ..., Σ_K) ∈ Ξ_K, with
\[
\Xi_K = \Pi_K \times (R^{q \times p})^K \times (S_q^{++})^K, \qquad
\Pi_K = \left\{ (\pi_1, \ldots, \pi_K) \in [0,1]^K \ ; \ \sum_{k=1}^{K} \pi_k = 1 \right\}.
\]
From this conditional density, we can define the maximum likelihood estimator by
\[
\hat\xi_K^{MLE} = \operatorname*{argmin}_{\xi \in \Xi_K} \left\{ -\frac{1}{n} l(\xi, x, y) \right\} ; \qquad (3)
\]
where the log-likelihood is defined by
\[
l(\xi, x, y) = \sum_{i=1}^{n} \log\left( s_\xi^K(y_i|x_i) \right).
\]
As in most classical mixture models, in the regression framework the maximum likelihood estimator is not explicit. This complicates the theoretical analysis: we have access neither to the law of the estimators, nor to that of the assignments. From a practical point of view, the EM algorithm is used to approximate this estimator. This algorithm, introduced by Dempster et al. in [DLR77], makes it possible to estimate the parameters of a mixture model. It consists in alternating two steps, until convergence of the parameters or of a function of the parameters. Let us describe this result.
We denote by Z = (Z_1, ..., Z_n) the random vector of the assignments of the observations, and by Z the set of all possible partitions: Z_{i,k} = 1 if observation i comes from class k, and 0 otherwise. For i ∈ {1, ..., n}, we also denote by f(z_i|x_i, y_i, ξ) the probability that Z_i = z_i given (x_i, y_i) and given the parameter ξ. The complete log-likelihood is defined by
\[
l_c(\xi, (x, y, z)) = \sum_{i=1}^{n} \left[ \log f(z_i|x_i, y_i, \xi) + \log s_\xi^K(y_i|x_i) \right].
\]
Then, the log-likelihood can be decomposed as follows:
\[
l(\xi, x, y) = Q(\xi|\tilde\xi) - H(\xi|\tilde\xi)
\]
where Q(ξ|ξ̃) is the expectation, over the latent variables Z, of the complete likelihood,
\[
Q(\xi|\tilde\xi) = \sum_{z \in \mathcal{Z}} \sum_{i=1}^{n} l_c(\xi, x_i, y_i, z_i) f(z_i|x_i, y_i, \tilde\xi)
\]
and
\[
H(\xi|\tilde\xi) = \sum_{z \in \mathcal{Z}} \sum_{i=1}^{n} \log f(z_i|x_i, y_i, \xi) \, f(z_i|x_i, y_i, \tilde\xi),
\]
the expectation being taken over the latent variables Z. In the EM algorithm, the computation and the maximization of Q(ξ|ξ^(ite)) are iterated until convergence, where ξ^(ite) is the estimate of the parameters at iteration (ite) ∈ N* of the EM algorithm. Indeed, increasing ξ ↦ Q(ξ|ξ^(ite)) also increases the likelihood. Dempster, in [DLR77], therefore proposed the following algorithm.

Algorithm 1: EM algorithm
Data: x, y, K
Result: ξ̂_K^{MLE}
1. Initialization of ξ^(0)
2. Iterate until a stopping criterion is reached:
— E-step: compute, for all ξ, Q(ξ|ξ^(ite))
— M-step: compute ξ^(ite+1) such that ξ^(ite+1) ∈ argmax Q(ξ|ξ^(ite))

In the first step, the observations are assigned to classes, and in the second step the parameter estimates are updated. For a general study of this algorithm, we can cite the book of McLachlan and Krishnan, [MK97]. One of the problematic points is the initialization of this algorithm: even if, in many cases (ours for instance), the algorithm can be shown to converge to the desired value, it has to be initialized properly. We can cite Biernacki, Celeux and Govaert, [BCG03], who describe various initialization strategies, or Yang, Lai and Lin in [YLL12], who propose a robust EM algorithm for the initialization. All these issues, classical in the study of mixture models, are present in our framework.
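To fix ideas, here is a minimal sketch in Python (not the code used in the thesis, whose implementation generalizes [SBG10]) of this EM algorithm for a mixture of Gaussian regressions with a univariate response, run on data simulated as described for Figure 1; the initialization and the stopping rule are deliberately simplistic assumptions.

```python
# Minimal sketch: EM for a mixture of Gaussian regressions, univariate response.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulate data as described for Figure 1: x ~ N(0,1), y = beta_k * x + noise.
n, K = 300, 3
beta_true = np.array([-1.0, 0.1, 3.0])
z = rng.integers(0, K, size=n)                  # hidden class labels
x = rng.standard_normal(n)
y = beta_true[z] * x + rng.normal(scale=np.sqrt(0.25), size=n)

# Crude initialization (a real implementation would use a dedicated strategy).
pi = np.full(K, 1.0 / K)
beta = rng.standard_normal(K)
sigma2 = np.ones(K)

for ite in range(200):                          # fixed number of iterations as stopping rule
    # E-step: posterior probability tau[i, k] that observation i comes from class k.
    dens = np.stack([pi[k] * norm.pdf(y, loc=beta[k] * x, scale=np.sqrt(sigma2[k]))
                     for k in range(K)], axis=1)
    tau = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted least squares per class, then weighted variances and weights.
    for k in range(K):
        w = tau[:, k]
        beta[k] = np.sum(w * x * y) / np.sum(w * x * x)
        sigma2[k] = np.sum(w * (y - beta[k] * x) ** 2) / np.sum(w)
    pi = tau.mean(axis=0)

print("estimated slopes:", np.sort(beta))       # typically close to [-1, 0.1, 3]
```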
From a theoretical point of view, an important result for mixture models is identifiability. Recall that a parametric model is said to be identifiable if different values of the parameters generate different probability distributions. Here, mixture models are not identifiable, since the labels of the classes can be switched without changing the density (the so-called label switching). For a detailed account of these questions, let us cite for instance Titterington, in [TSM85].
From a sample (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n, we can build a mixture regression model with K classes, estimating the parameters for instance with the estimator (3). With this model, a clustering of the data is obtained by computing the posterior probabilities. Each observation (x_i, y_i) is then assigned to a class k̂_i, and we have access to the link β̂_{k̂_i} between x_i and y_i, and to the noise Σ̂_{k̂_i} associated with this class, thanks to the estimator (3).
As long as the maximum likelihood estimator has good properties (for instance when the sample size n is large), and provided the initialization and convergence issues of the EM algorithm are settled, the parameters of our model can be well estimated. This algorithm was generalized by Städler et al. in [SBG10] for estimating the parameters of a mixture of Gaussian regression models with a univariate response. In this thesis, we have generalized it to multivariate data.

Variable selection and the Lasso estimator

We are interested in an unsupervised clustering problem for high-dimensional regression data, that is, the random vectors X ∈ R^p and Y ∈ R^q may be high-dimensional, possibly of size larger than the number of observations n. This problem is currently widely studied. Indeed, with the improvement of computing resources, more and more high-dimensional data are available, and more and more explanatory variables can be accessed to describe a single object.
In the linear model framework, if p > n, β̂ can no longer be computed with formula (2): the matrix x^t x is no longer invertible. In fact, in the linear model case, we seek to estimate pq + q² parameters, which may be larger than the number of observations n if p and q are large.
Let us first restrict ourselves to the linear model. The Lasso estimator, introduced by Tibshirani in [Tib96], and in parallel by Chen et al. in signal processing in [CDS98], is a classical estimator to cope with the high-dimensional issue. We can also cite the Dantzig selector, introduced by Candès and Tao in [CT07]; the Ridge estimator, which uses an ℓ2 penalty rather than the ℓ1 penalty exploited by the Lasso estimator, introduced by Hoerl and Kennard in [HK70]; the Elastic net, introduced by Zou and Hastie in [ZH05], defined with a double ℓ1 and ℓ2 penalization, which therefore achieves a compromise between the Ridge estimator and the Lasso estimator; the Fused Lasso, introduced by Tibshirani in [TSR+05]; the adaptive Lasso, introduced by Zou in [Zou06]; the Group-Lasso estimator, introduced by Yuan and Lin in [YLL06], to get groupwise sparsity; and so on. In this thesis, we will concentrate on the Lasso estimator (or the Group-Lasso), even though some theoretical results remain valid for any variable selection method.
The idea introduced by [Tib96] is to assume that the matrix β in the linear model is sparse, which reduces the number of parameters to estimate. Indeed, if there are few nonzero coefficients, denoting by J ⊂ P({1, ..., q} × {1, ..., p}) the set of indices of the nonzero coefficients of the regression matrix and by |J| its cardinality, the sample size may be larger than the number of parameters to estimate, which is then |J| + q².
In the linear model case (1), the Lasso estimator is defined by
\[
\hat\beta^{Lasso}(\lambda) = \operatorname*{argmin}_{\beta \in R^{q \times p}} \left\{ \|Y - \beta X\|_2^2 + \lambda \|\beta\|_1 \right\} ; \qquad (4)
\]
with λ > 0 a regularization parameter to be specified. It can also be defined, equivalently, by
\[
\tilde\beta^{Lasso}(A) = \operatorname*{argmin}_{\substack{\beta \in R^{q \times p} \\ \|\beta\|_1 \leq A}} \|Y - \beta X\|_2^2 .
\]

Figure 2 – Illustration of the Lasso estimator (left) and of the Ridge estimator (right). β̂ represents the least squares estimator, the contour lines correspond to the least squares error, the ℓ1 ball corresponds to the Lasso problem and the ℓ2 ball corresponds to the Ridge problem. This figure is taken from [Tib96].

This estimator has been studied a lot in recent years. Let us cite Friedman, Hastie and Tibshirani, [HTF01], who studied the regularization path of the Lasso, and Bickel, Ritov and Tsybakov, [BRT09], who studied this estimator in comparison with the Dantzig selector. We can also cite Osborne, [OPT99], who studies the Lasso estimator and its dual form.
Note that the ℓ1 norm appears here as a convex relaxation of the ℓ0 penalty (where ||β||_0 = Card(j | β_j ≠ 0)), which aims at estimating coefficients by 0. Thanks to the geometry of the ℓ1 ball, the Lasso tends to set coefficients to zero, as can be seen in Figure 2. The larger λ, the stronger the penalization, and the more coefficients of β̂^{Lasso}(λ) are zero. If λ = 0, we recover the maximum likelihood estimator.
Let us summarize the main results obtained in recent years for the Lasso estimator. This estimator can easily be approximated by the LARS algorithm, introduced in [EHJT04]. The convex relaxation of the ℓ0 penalty by the ℓ1 penalty makes this estimator numerically easier to approximate. It is known to be piecewise linear, and its values can be made explicit thanks to the Karush-Kuhn-Tucker conditions. Let us cite for instance [EHJT04] or [ZHT07].
From a theoretical point of view, under more or less strong assumptions, oracle inequalities exist for the prediction error or the ℓq error of the estimated coefficients, with a regularization parameter of order √(log(p)/n). We can cite for instance Bickel, Ritov and Tsybakov, [BRT09], who obtain an oracle inequality for the prediction risk in a general nonparametric regression model, and an oracle inequality for the ℓp estimation loss (1 ≤ p ≤ 2) in the linear model. The required assumptions and the results obtained in this direction are summarized by van de Geer and Bühlmann in [vdGB09].
It should also be noted that the Lasso estimator has good variable selection properties. Let us cite for instance Meinshausen and Bühlmann [MB06], Zhang and Huang in [ZH08], Zhao and Yu in [ZY06], or Meinshausen and Yu, [MY09], who show that the Lasso is consistent for variable selection under various more or less restrictive assumptions. It therefore seems coherent to use the Lasso estimator to select the important variables.
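As a purely illustrative sketch (using scikit-learn, which is not the software used in the thesis; its `alpha` parameter plays the role of λ up to the normalization of the squared error), the following shows the sparsity effect of the ℓ1 penalty in (4): as the regularization grows, more and more coefficients are set exactly to zero.

```python
# Minimal sketch: sparsity of the Lasso as the regularization parameter grows.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -3.0, 1.5, 0.8, 4.0]      # sparse truth: only 5 active variables
y = X @ beta_true + rng.normal(scale=0.5, size=n)

for lam in [0.01, 0.1, 0.5, 1.0]:
    fit = Lasso(alpha=lam).fit(X, y)            # alpha plays the role of lambda
    active = np.flatnonzero(fit.coef_)
    print(f"lambda={lam:4}: {active.size} active coefficients")
```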
In mixtures of Gaussian linear models, the definition of the Lasso estimator can be extended by
\[
\hat\xi^{Lasso}(\lambda) = \operatorname*{argmin}_{\xi \in \Xi_K} \left\{ -\frac{1}{n} l_\lambda(\xi, x, y) \right\} ; \qquad (5)
\]
where
\[
l_\lambda(\xi, x, y) = l(\xi, x, y) - \lambda \sum_{k=1}^{K} \pi_k \|P_k \beta_k\|_1 \ ; \qquad
\Xi_K = \Pi_K \times (R^{q \times p})^K \times (S_q^{++})^K \ ;
\]
and where P_k is the Cholesky factor of the inverse covariance matrix, i.e. P_k^t P_k = Σ_k^{-1}, for all k ∈ {1, ..., K}. This definition was introduced by Städler et al. in [SBG10].
The penalty differs from the one of the estimator (4). First, we do not penalize the conditional mean in each class, but a version reparametrized by the variance. Indeed, in mixture models, it is important to have a good estimator of the variance, and to estimate it at the same time as the mean, so as not to favour classes with too large a variance. Moreover, to get an estimator that is invariant under scale changes, the variance has to be taken into account in the ℓ1 penalty. Then, working with the Cholesky factor of the inverse covariance matrix rather than with the covariance matrix itself yields a convex optimization problem, which will ease the algorithmic part. Finally, the likelihood is penalized by the sum over the classes of the mean estimator reparametrized by the variance, weighted by each class proportion, in order to take into account the difference in size between the classes.
Note that variable selection in mixture models has already been considered in estimation problems. Let us cite for instance Raftery and Dean, [RD06], or Maugis and Michel in [MM11b]. The estimator (5) can be approximated algorithmically by a generalization of the EM algorithm, introduced by Städler et al. in the univariate case, and extended in this thesis.
In the framework of mixture regression models, two theoretical results are mainly known for the Lasso estimator, valid for a real-valued Y and for a fixed and known number of classes K. For Y ∈ R and X ∈ R^p, Städler et al., in [SBG10], showed that, under the restricted eigenvalue condition (denoted REC, stated below), the Lasso estimator satisfies an oracle inequality for fixed covariates. Let us recall the univariate Gaussian mixture regression setting. If Y, conditionally on X, belongs to class k, we write Y = β_k X + ε, with ε ∼ N(0, σ_k²). We further denote φ_k = σ_k^{-1} β_k, and J the set of indices of the nonzero coefficients of the regression matrix.

Assumption REC. There exists κ ≥ 1 such that, for all φ ∈ (R^p)^K satisfying ||φ_{J^c}||_1 ≤ 6 ||φ_J||_1, we have
\[
\|\phi_J\|_2^2 \leq \kappa^2 \sum_{k=1}^{K} \phi_k^t \hat\Sigma_x \phi_k ,
\]
where \(\hat\Sigma_x = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^t\).

In the same framework, but without the REC assumption, Meynet, in [Mey13], proved an ℓ1-oracle inequality for the Lasso estimator. In this thesis, we have been interested in the ℓ1-regularization properties of the Lasso estimator, in our multivariate mixture regression framework. The explanatory variables x = (x_1, ..., x_n) are fixed. Without loss of generality, we may assume that x_i ∈ [0, 1]^p for all i ∈ {1, ..., n}. We assume that there exist positive reals (A_β, a_Σ, A_Σ, a_π) which define the parameter set
\[
\tilde\Xi_K = \left\{ \xi \in \Xi_K \ \middle| \ \text{for all } k \in \{1, \ldots, K\}, \
\max_{z \in \{1, \ldots, q\}} \sup_{x \in [0,1]^p} |[\beta_k x]_z| \leq A_\beta , \
a_\Sigma \leq m(\Sigma_k^{-1}) \leq M(\Sigma_k^{-1}) \leq A_\Sigma , \
a_\pi \leq \pi_k \right\} ; \qquad (6)
\]
where m(A) and M(A) denote respectively the absolute value of the smallest and of the largest eigenvalue of the matrix A.
Let, for ξ = (π, β, Σ) ∈ Ξ_K,
\[
N_1^{[2]}(s_\xi^K) = \|\beta\|_1 = \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{z=1}^{q} |[\beta_k]_{z,j}| \qquad (7)
\]
be the penalty under consideration, and KL_n the Kullback-Leibler divergence with fixed design:
\[
KL_n(s, t) = \frac{1}{n} \sum_{i=1}^{n} KL(s(\cdot|x_i), t(\cdot|x_i))
= \frac{1}{n} \sum_{i=1}^{n} E_s\left[ \log\left( \frac{s(\cdot|x_i)}{t(\cdot|x_i)} \right) \right].
\]
Here is the theorem we obtain.

ℓ1-oracle inequality for the Lasso. Let (x, y) = (x_i, y_i)_{1≤i≤n} be the observations, coming from an unknown conditional density s* = s_{ξ_0}, where ξ_0 ∈ Ξ̃_K, this set being defined by equation (6), and the number of classes K being fixed. We write a ∨ b = max(a, b). Let N_1^{[2]}(s_ξ^K) be defined by (7). For λ ≥ 0, define the Lasso estimator, denoted ŝ^{Lasso}(λ), by
\[
\hat{s}^{Lasso}(\lambda) = \operatorname*{argmin}_{s_\xi^K \in S} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log(s_\xi^K(y_i|x_i)) + \lambda N_1^{[2]}(s_\xi^K) \right\} ; \qquad (8)
\]
with S = { s_ξ^K, ξ ∈ Ξ̃_K }. If
\[
\lambda \geq \kappa \left( A_\Sigma \vee \frac{1}{a_\pi} \right)
\left[ 1 + 4(q+1) A_\Sigma \left( A_\beta^2 + \frac{1}{a_\Sigma} \right) (1 + q \log(n)) \right]
\sqrt{K \log(2p+1)} \, \sqrt{\frac{K \log(n)}{n}}
\]
with κ a positive constant, then the estimator (8) satisfies the following inequality:
\[
E[KL_n(s^*, \hat{s}^{Lasso}(\lambda))]
\leq (1 + \kappa^{-1}) \inf_{s_\xi^K \in S} \left\{ KL_n(s^*, s_\xi^K) + \lambda N_1^{[2]}(s_\xi^K) \right\} + \lambda
+ \kappa' \frac{K e^{-1/2} \pi^{q/2}}{a_\pi} \frac{2^{q/2}}{n \, A_\Sigma^{q/2}}
+ \kappa' \frac{K^{3/2} \log(n)}{\sqrt{n}} \left( A_\Sigma \vee \frac{1}{a_\pi} \right)
\left[ 1 + 4(q+1) A_\Sigma \left( A_\beta^2 + \frac{1}{a_\Sigma} \right) \right]
\left( 1 + A_\beta + \frac{q}{a_\Sigma} \right) ;
\]
where κ' is a positive constant.

This theorem can be seen as an ℓ1-oracle inequality, but this is not the approach we wish to develop here. Indeed, the proof of this theorem goes through a model selection theorem, but this theme will be addressed in the corresponding part. Here, we rather see this theorem as a guarantee that the Lasso estimator behaves well as far as ℓ1 regularization is concerned.
The particularity of this result is that it requires only few assumptions: we work with fixed predictors, assumed (without loss of generality) to lie in [0, 1]^p, and the parameters of our conditional densities are assumed to be bounded, in the sense that they belong to Ξ̃_K. This assumption is necessary, in particular to ensure that the likelihood is finite. It will appear again in the other theorems proved in this thesis. Note also that the bound on the regularization parameter λ is not the classical optimal one obtained in other, more general settings, but this is due to the fact that we make no assumption on the regressors. The paper of van de Geer, [vdG13], makes it possible to obtain this optimal bound under stronger assumptions on the design. Note that the design is assumed to be fixed here.
Since the Lasso estimator overestimates the parameters (let us cite for instance Fan and Peng, [FP04], or Zhang, [Zha10]), we propose to use it for variable selection, and not for parameter estimation. Thus, for a regularization parameter λ ≥ 0 to be defined, we will select the variables that are important to explain Y from X. More formally, let us define the notion of a variable that is active for the clustering.

Definition. A variable is active for the clustering if it is nonzero in at least one class: the variable with index (z, j) ∈ {1, ..., q} × {1, ..., p} is active if there exists k ∈ {1, ..., K} such that [β_k]_{z,j} ≠ 0.

Thus, for a given λ ≥ 0, we can estimate ξ̂^{Lasso}(λ) and deduce from it the set J_λ of the variables that are active for the clustering.
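As an illustration (with hypothetical names, not the thesis implementation), extracting the active set J_λ from estimated regression matrices simply amounts to collecting the indices that are nonzero in at least one class:

```python
# Minimal sketch: extract the set J of "active" variables, i.e. indices (z, j)
# whose coefficient is nonzero in at least one class, from Lasso-type estimates
# beta_hat[k] of shape (q, p), k = 1, ..., K.
import numpy as np

def active_set(beta_hat, tol=1e-8):
    """beta_hat: array of shape (K, q, p); returns the indices (z, j) active in some class."""
    nonzero = np.abs(beta_hat) > tol          # (K, q, p) boolean mask
    active = nonzero.any(axis=0)              # (q, p): active in at least one class
    return np.argwhere(active)                # list of (z, j) pairs

# Toy example: K = 2 classes, q = 2 responses, p = 4 regressors.
beta_hat = np.zeros((2, 2, 4))
beta_hat[0, 0, 1] = 1.3
beta_hat[1, 1, 3] = -0.7
print(active_set(beta_hat))                   # [[0 1] [1 3]]
```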
Re-estimation

Restricting ourselves to the selected variables indexed by J_λ, obtained with the Lasso estimator with regularization parameter λ ≥ 0, we work with a model of smaller dimension. Indeed, rather than (pq + q² + 1)K − 1 parameters to estimate, there are (|J_λ| + q² + 1)K − 1 of them. If we further assume that the covariance matrix is diagonal (which implies that the variables are uncorrelated), we obtain a model of dimension (|J_λ| + q + 1)K − 1, and the dimension of the model can become smaller than the number of observations, or at least reasonable. We can then use another estimator, restricted to the selected variables, which will have good estimation properties in reasonable dimension (better than those of the Lasso estimator).
Re-estimating the parameters on the selected variables is not a new idea. We want to take advantage of variable selection by the Lasso estimator (or by another technique), but we also want to reduce the bias induced by this estimator. Let us cite for instance Belloni and Chernozhukov, [BC11], who obtain an oracle inequality showing that the maximum likelihood estimator computed on the variables selected by the Lasso performs better than the Lasso estimator, for a high-dimensional linear model. We can also cite Zhang and Sun, [SZ12], who estimate the noise and the regression matrix in a high-dimensional linear model by the least squares estimator after model selection.
In a first procedure, called the Lasso-MLE procedure, we propose to estimate the parameters, restricting ourselves to the active variables, by the maximum likelihood estimator, which has good properties for a sufficiently large sample.
In a second procedure, which we will call the Lasso-Rank procedure, we propose to use the maximum likelihood estimator with a low rank constraint. Indeed, up to now, we have not taken into account the matrix structure of β. Since the covariance matrices (Σ_k)_{1≤k≤K} are assumed diagonal, we could have treated each coordinate of Y as q distinct and independent problems. By looking for a low rank structure, we assume that few linear combinations of the predictors are enough to explain the response. It is also a second way of reducing the dimension, in case variable selection by ℓ1 penalization is not sufficient. Some datasets are particularly suited to this dimension reduction by low rank. We can cite for instance fMRI image analysis (Harrison, Penny, Friston, [FHP03]), EEG data decoding (Anderson, Stolz, Shamsunder, [ASS98]), neural response modeling (Brown, Kass, Mitra, [BKM04]), or the analysis of genomic data (Bunea, She, Wegkamp, [BSW11]). From a more theoretical point of view, we can cite Izenman ([Ize75]), who introduced this method in the linear model case, and Giraud ([Gir11]) or Bunea et al. ([Bun08]), who completed the theoretical and practical study of rank selection. In this thesis, we use these methods in a mixture regression framework. Note that the Lasso estimator will have selected columns, or rows, or both, but rows or columns in which only some coefficients are inactive will not be selected, the matrix structure being required to estimate a parameter by low rank. We do not require all the conditional means to have the same rank.
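As a minimal illustration of the low-rank idea (this is only the rank-truncation step via an SVD, under assumed shapes; it is not the constrained maximum likelihood estimation actually performed within the EM algorithm in the thesis):

```python
# Minimal sketch: enforce a low-rank structure on a conditional mean matrix beta_k
# (restricted to the active columns) by truncating its SVD; ranks may differ per class.
import numpy as np

def rank_truncate(beta_k, rank):
    """Return the best rank-`rank` approximation of beta_k (q x |J|) in Frobenius norm."""
    U, s, Vt = np.linalg.svd(beta_k, full_matrices=False)
    s[rank:] = 0.0                             # keep only the `rank` leading singular values
    return (U * s) @ Vt

rng = np.random.default_rng(2)
beta_k = rng.standard_normal((6, 10))
approx = rank_truncate(beta_k, rank=2)
print(np.linalg.matrix_rank(approx))           # 2
```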
Denote by ξ̂_J^{MLE} and ξ̂_J^{Rank} the estimators associated with each procedure, with J the set of active variables:
\[
\hat\xi_J^{MLE} = \operatorname*{argmin}_{\xi \in \Xi_{(K,J)}} \left\{ -\frac{1}{n} l(\xi^{[J]}, x, y) \right\} ; \qquad (9)
\]
\[
\hat\xi_J^{Rank} = \operatorname*{argmin}_{\xi \in \check\Xi_{(K,J,R)}} \left\{ -\frac{1}{n} l(\xi^{[J]}, x, y) \right\} ; \qquad (10)
\]
where \(\check\Xi_{(K,J,R)} = \{ \xi = (\pi, \beta, \Sigma) \in \Xi_{(K,J)} \ | \ \operatorname{rank}(\beta_k) = R_k \text{ for all } k \in \{1, \ldots, K\} \}\), where ξ^{[J]} means that the variables indexed by J have been selected, and where Ξ_{(K,J)} is defined by
\[
\Xi_{(K,J)} = \Pi_K \times (R^{q \times p})^K \times (S_q^{++})^K, \qquad
\Pi_K = \left\{ (\pi_1, \ldots, \pi_K) \in (0,1)^K \ \middle| \ \sum_{k=1}^{K} \pi_k = 1 \right\}.
\]
Note that all the parameters of our model are re-estimated: the conditional means, the variances and the weights. From a practical point of view, the generalization of the EM algorithm (Algorithm 1) makes it possible to compute the maximum likelihood estimator, with or without the low rank constraint, in the Gaussian mixture regression case.

Model selection

For a fixed regularization parameter λ ≥ 0, after re-estimating our parameters, we obtain a model associated with our observations (x, y) = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^p × R^q)^n which has reasonable dimension and is well estimated, provided λ has been well chosen. However, many choices had to be made to build this model. For the mixture model, the number of classes K may not be known beforehand, so it has to be chosen; for variable selection, building J_λ corresponds to selecting a regularization parameter λ ≥ 0; in the case of low rank re-estimation, the vector of the ranks of each component has to be selected.
In each of these three cases, different methods exist in the literature. For the regularization parameter λ of the Lasso, we can cite for instance the book of van de Geer and Bühlmann, [BvdG11], where a value of λ proportional to √(log(p)/n) is considered optimal. For the choice of the number of classes K and of the vector of ranks R, many authors reduce the question to a model selection problem. In this thesis, we will view the parameter selection problem as a model selection problem, by building a collection of models, with more or fewer classes and more or fewer active coefficients. It then remains to choose a model within this collection.
In full generality, denote by S = (S_m)_{m∈M} the collection of models under consideration, indexed by M. Note that, contrary to what one might think, having too large a model collection can be harmful, for instance by selecting inconsistent estimators (see Bahadur, [Bah58]) or suboptimal ones (see Birgé and Massart, [BM93]). This is the so-called model choice paradigm.
We consider a contrast function, denoted γ, such that s* = argmin_{t∈S} E(γ(t)). The associated loss function, denoted l, is defined by
\[
\text{for all } t \in S, \quad l(s^*, t) = E(\gamma(t)) - E(\gamma(s^*)).
\]
We also define the empirical contrast γ_n by
\[
\text{for all } t \in S, \quad \gamma_n(t) = \frac{1}{n} \sum_{i=1}^{n} \gamma(t, x_i, y_i)
\]
for a sample (x, y) = (x_i, y_i)_{1≤i≤n}. For the model m, we consider ŝ_m, the density minimizing the empirical contrast γ_n over S_m. It is this density that will be used to represent this model. For instance, one can take the log-likelihood as contrast and the Kullback-Leibler divergence as loss function. The goal of model selection is to select the best estimator among the collection (ŝ_m)_{m∈M}.
The best estimator can be defined as the one minimizing the risk with respect to the true density s*. This estimator, and the corresponding model, will be called the oracle in this thesis (see Donoho and Johnstone for instance, [DJ94]). We write
\[
m_O = \operatorname*{argmin}_{m \in M} \left[ l(s^*, \hat{s}_m) \right]. \qquad (11)
\]
Unfortunately, this quantity cannot be evaluated, since s* is not available. The oracle will be used in theory to assess our model selection: we want the risk associated with the estimator of the selected model to be as close as possible to that of the oracle. Part of the theoretical results in model selection are oracle inequalities, which guarantee the coherence of the model selection. These inequalities are of the form, for m̂ the index of the selected model,
\[
E(l(s^*, \hat{s}_{\hat{m}})) \leq C_1 E(l(s^*, \hat{s}_{m_O})) + \frac{C_2}{n} \qquad (12)
\]
where (C_1, C_2) are absolute constants, with C_1 as close as possible to 1. The oracle inequality is said to be exact if C_1 = 1.
Let us now describe how to select a model. A penalized criterion is minimized, in order to reach a bias/variance trade-off. Indeed, the risk can be decomposed in the following way:
\[
l(s^*, \hat{s}_m) = \underbrace{l(s^*, s_m)}_{\text{bias}_m} + \underbrace{E\left( \gamma(\hat{s}_m) - \gamma(s_m) \right)}_{\text{variance}_m} ;
\]
where s_m ∈ argmin_{t∈S_m} [E(γ(t))] (it is one of the best approximations of s* within S_m). Now, minimizing the bias requires a complex model, sticking very closely to the data, while minimizing the variance requires not considering models that are too complex, so as not to overfit the data. A way to take this remark into account is to consider the empirical contrast penalized by the dimension: models that are too complex, which overfit our data, are penalized. Let pen : M → R+ be a penalty to be built; we select
\[
\hat{m} = \operatorname*{argmin}_{m \in M} \left\{ -\gamma_n(\hat{s}_m) + \operatorname{pen}(m) \right\}.
\]
Akaike and Schwarz introduced this method for likelihood estimation, see respectively [Aka74] and [Sch78]. They proposed the now classical criteria AIC and BIC, where the penalty is respectively
\[
\operatorname{pen}_{AIC}(m) = D_m \ ; \qquad \operatorname{pen}_{BIC}(m) = \frac{\log(n) D_m}{2} \ ;
\]
where D_m is the dimension of model m, and n is the size of the sample under consideration. These criteria are heavily used today. Note that they rely on asymptotic approximations (AIC on Wilks' theorem, BIC on a Bayesian approach), and one may be wary of their behaviour in a non-asymptotic setting. Strictly speaking, they moreover assume a fixed collection of models. Mallows, at the same time, in [Mal73], studied this method in the linear regression framework. He obtained
\[
\operatorname{pen}_{Mallows}(m) = \frac{2 D_m \sigma^2}{n}
\]
where σ² is the variance of the errors, assumed to be known.
Birgé and Massart, in [BM01], introduced the slope heuristic, a non-asymptotic methodology to select a model within a collection of models. Let us describe the ideas of this heuristic. We look for a penalty close to m ∈ M ↦ l(s*, ŝ_m) − γ_n(ŝ_m). Since s* is unknown, we try to approximate this quantity:
\[
l(s^*, \hat{s}_m) - \gamma_n(\hat{s}_m)
= \underbrace{E(\gamma(\hat{s}_m)) - E(\gamma(s_m))}_{v_m}
+ \underbrace{E(\gamma(s_m)) - E(\gamma(s^*))}_{(1)}
+ \underbrace{\gamma_n(s_m) - \gamma_n(\hat{s}_m)}_{\hat{v}_m}
\underbrace{-\left( \gamma_n(s_m) - \gamma_n(s^*) \right)}_{(2)}
- \gamma_n(s^*)
\]
where v_m can be seen as a variance term, and v̂_m as an empirical version of it. We define Δ_n(s_m) = (1) + (2), which corresponds to the difference between the bias term and its empirical version.
If we choose pen(m) = v̂_m, we will choose a model that limits the bias but not the variance: a model that is too complex will be selected. This penalty is minimal: if we set pen(m) = κ v̂_m, then for κ < 1 a model that is too complex will be chosen, while for κ > 1 the dimension of the model will be more reasonable. In fact, the optimal penalty is twice the minimal penalty. Since v̂_m is the empirical version of v_m, we have v_m ≈ v̂_m. Since Δ_n(s_m) has zero expectation, its fluctuations can be controlled. We therefore want to choose pen(m) = 2 v_m.
Thus, on a given dataset, using the slope heuristic, one can find the penalty that will allow a model to be selected: either we look for the largest dimension jump, or we look at the asymptotic slope of γ_n(ŝ_m), which gives us the minimal penalty, and it suffices to multiply it by two to obtain the optimal penalty. Figures 3 and 4 illustrate these ideas.

Figure 3 – Illustration of the slope heuristic: κ is estimated by κ̂, the largest dimension jump. We then select the model that minimizes the log-likelihood penalized by 2κ̂.

Figure 4 – Illustration of the slope heuristic: κ is estimated by κ̂, the asymptotic slope of the log-likelihood.

From a practical point of view, we use the Capushe package, developed by Baudry et al. in [BMM12] for the Matlab software.
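The following is a minimal sketch of the dimension-jump calibration (in Python, with an illustrative model collection; the thesis relies on the Capushe package rather than on this code): for every candidate κ we record the dimension of the model minimizing the criterion penalized by κ D_m / n, estimate κ̂ at the largest jump, and then select with the penalty 2 κ̂ D_m / n.

```python
# Minimal sketch of the "dimension jump" calibration of the slope heuristic.
import numpy as np

def dimension_jump(loglik, dims, n, kappas=np.linspace(0.01, 10, 500)):
    # Dimension of the selected model for each candidate kappa.
    selected = np.array([dims[np.argmin(-loglik / n + k * dims / n)] for k in kappas])
    jumps = selected[:-1] - selected[1:]          # dimension drops as kappa increases
    kappa_min = kappas[np.argmax(jumps) + 1]      # kappa right after the largest jump
    kappa_opt = 2.0 * kappa_min                   # slope heuristic: optimal = 2 x minimal
    best = np.argmin(-loglik / n + kappa_opt * dims / n)
    return kappa_min, best

# Toy collection: log-likelihood increasing and saturating with the model dimension.
dims = np.arange(1, 101, dtype=float)
n = 500
loglik = -0.5 * n * np.log(1.0 + 50.0 / dims)     # illustrative shape only
kappa_min, best = dimension_jump(loglik, dims, n)
print(kappa_min, dims[best])
```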
From a theoretical point of view, we obtained oracle inequalities, which justify the model selection step in each of our procedures. Let us state the general theorem from [Mas07] which is at the basis of our theoretical results. We work with the log-likelihood as empirical contrast. We denote by KL the Kullback-Leibler divergence, defined by
\[
KL(s, t) =
\begin{cases}
E_s\left[ \log\left( \frac{s}{t} \right) \right] & \text{if } s \ll t \\
+\infty & \text{otherwise.}
\end{cases}
\]
First, we need a structural assumption. It is a condition on the bracketing entropy of the model S_m with respect to the Hellinger distance, defined by
\[
d_H^2(s, t) = \frac{1}{2} \int \left( \sqrt{s} - \sqrt{t} \right)^2 .
\]
A bracket [l, u] is a pair of functions such that, for all y, l(y) ≤ s(y) ≤ u(y). For ε > 0, the bracketing entropy H_{[.]}(ε, S, d_H) of a set S is defined as the logarithm of the minimal number of brackets [l, u] of width d_H(l, u) smaller than ε such that every density of S belongs to one of these brackets. Let m ∈ M.

Assumption H_m. There exists an increasing function φ_m such that ϖ ↦ (1/ϖ) φ_m(ϖ) is decreasing on (0, +∞) and such that, for all ϖ ∈ R+ and all s_m ∈ S_m,
\[
\int_0^{\varpi} \sqrt{H_{[.]}(\epsilon, S_m(s_m, \varpi), d_H)} \, d\epsilon \leq \phi_m(\varpi);
\]
where S_m(s_m, ϖ) = {t ∈ S_m, d_H(t, s_m) ≤ ϖ}. The complexity D_m of the model is then defined as n ϖ_m², with ϖ_m the unique solution of
\[
\frac{1}{\varpi} \phi_m(\varpi) = \sqrt{n} \, \varpi. \qquad (13)
\]
Note that the model complexity does not depend on the bracketing entropy of the global models S_m, but on that of smaller, localized sets. This is a weaker assumption. For technical reasons, a separability assumption is also needed.

Assumption Sep_m. There exist a countable subset S'_m of S_m and a set Y'_m with λ(R^q \ Y'_m) = 0, for λ the Lebesgue measure, such that for all t ∈ S_m, there exists a sequence (t_l)_{l≥1} of elements of S'_m such that for all y ∈ Y'_m, log(t_l(y)) tends to log(t(y)) as l tends to infinity.

We also need an information theory assumption on our model collection.

Assumption K. The family of positive numbers (w_m)_{m∈M} satisfies
\[
\sum_{m \in M} e^{-w_m} \leq \Omega < +\infty.
\]

Then, the general model selection theorem can be stated.

Oracle inequality for a family of maximum likelihood estimators. Let (X_1, ..., X_n) be random variables with unknown density s*. We observe a realization (x_1, ..., x_n). Let {S_m}_{m∈M} be an at most countable collection of models, where, for all m ∈ M, the elements of S_m are probability densities and S_m satisfies assumption Sep_m. We consider moreover the collection of ρ-approximate maximum likelihood estimators (ŝ_m)_{m∈M}: for all m ∈ M,
\[
-\frac{1}{n} \sum_{i=1}^{n} \log(\hat{s}_m(x_i)) \leq \inf_{t \in S_m} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log(t(x_i)) \right\} + \rho.
\]
Let {w_m}_{m∈M} be a family of positive numbers satisfying assumption K, and, for all m ∈ M, consider φ_m satisfying condition H_m, with ϖ_m the unique positive solution of the equation φ_m(ϖ) = √n ϖ². Assume moreover that, for all m ∈ M, assumption Sep_m holds. Let pen : M → R+ and consider the penalized log-likelihood criterion
\[
\operatorname{crit}(m) = -\frac{1}{n} \sum_{i=1}^{n} \log(\hat{s}_m(x_i)) + \operatorname{pen}(m).
\]
Then, there exist constants κ and C such that, whenever
\[
\operatorname{pen}(m) \geq \kappa \left( \varpi_m^2 + \frac{w_m}{n} \right)
\]
for all m ∈ M, there exists m̂ minimizing the criterion crit over M, and moreover
\[
E_s\left( d_H^2(s, \hat{s}_{\hat{m}}) \right) \leq C \left[ \inf_{m \in M} \left( \inf_{t \in S_m} KL(s, t) + \operatorname{pen}(m) \right) + \rho + \frac{\Omega}{n} \right]
\]
where d_H is the Hellinger distance and KL the Kullback-Leibler divergence.

This theorem tells us that, if our model collection is well built (that is, satisfies assumptions H_m, K and Sep_m), a penalty can be found such that the model minimizing the penalized criterion satisfies an oracle inequality. This approach has already been considered to select the number of classes of a mixture model. We can cite for instance Maugis and Michel, [MM11b], or Maugis and Meynet, [MMR12]. These authors view the problem of selecting the number of components and the variables as a model selection problem. For the rank, Giraud, in [Gir11], and Bunea, in [Bun08], propose a penalty to choose the rank optimally. They obtain, with known and unknown variance, oracle inequalities allowing the rank to be selected, where the penalty is proportional to the rank. Ma and Sun, in [MS14], obtain a minimax bound for these models, which amounts to saying that the constructed penalty is optimal for rank selection.
The estimation procedures described in the previous part are subject to parameter choices (the number of classes, the regularization parameter of the Lasso, and the rank in the second procedure). The selection of these parameters can be rewritten as a model selection problem: letting these parameters vary, we obtain a collection of models. Let us start by defining the model collections associated with each of our procedures. For the Lasso-MLE procedure, for (K, J) ∈ K × J,
\[
S_{(K,J)} = \left\{ y \in R^q \mapsto s_\xi^{(K,J)}(y|x) \right\} \qquad (14)
\]
\[
s_\xi^{(K,J)}(y|x) = \sum_{k=1}^{K} \frac{\pi_k}{(2\pi)^{q/2} \det(\Sigma_k)^{1/2}} \exp\left( -\frac{1}{2} (y - \beta_k^{[J]} x)^t \Sigma_k^{-1} (y - \beta_k^{[J]} x) \right)
\]
\[
\xi = (\pi_1, \ldots, \pi_K, \beta_1^{[J]}, \ldots, \beta_K^{[J]}, \Sigma_1, \ldots, \Sigma_K) \in \Xi_{(K,J)}, \qquad
\Xi_{(K,J)} = \Pi_K \times (R^{q \times p})^K \times (S_q^{++})^K .
\]
For the Lasso-Rank procedure, for (K, J, R) ∈ K × J × R,
\[
S_{(K,J,R)} = \left\{ y \in R^q \mapsto s_\xi^{(K,J,R)}(y|x) \right\} \qquad (15)
\]
\[
s_\xi^{(K,J,R)}(y|x) = \sum_{k=1}^{K} \frac{\pi_k}{(2\pi)^{q/2} \det(\Sigma_k)^{1/2}} \exp\left( -\frac{1}{2} (y - (\beta_k^{R_k})^{[J]} x)^t \Sigma_k^{-1} (y - (\beta_k^{R_k})^{[J]} x) \right)
\]
\[
\xi = (\pi_1, \ldots, \pi_K, (\beta_1^{R_1})^{[J]}, \ldots, (\beta_K^{R_K})^{[J]}, \Sigma_1, \ldots, \Sigma_K) \in \check\Xi_{(K,J,R)}
\]
\[
\check\Xi_{(K,J,R)} = \Pi_K \times \Psi_{(K,J,R)} \times (S_q^{++})^K, \qquad
\Psi_{(K,J,R)} = \left\{ ((\beta_1^{R_1})^{[J]}, \ldots, (\beta_K^{R_K})^{[J]}) \in (R^{q \times p})^K \ \middle| \ \text{for all } k \in \{1, \ldots, K\}, \ \operatorname{rank}(\beta_k) = R_k \right\},
\]
where K is the set of possible values for the number of components, J is the set of possible sets of indices of active variables, and R is the set of possible values for the rank vectors.
To obtain theoretical results, we need to bound our parameters. We consider
\[
S_{(K,J)}^B = \left\{ s_\xi^{(K,J)} \in S_{(K,J)} \ \middle| \ \xi \in \tilde\Xi_{(K,J)} \right\} \qquad (16)
\]
\[
\tilde\Xi_{(K,J)} = \Pi_K \times ([-A_\beta, A_\beta]^{|J|})^K \times ([a_\Sigma, A_\Sigma]^q)^K \qquad (17)
\]
and
\[
S_{(K,J,R)}^B = \left\{ s_\xi^{(K,J,R)} \in S_{(K,J,R)} \ \middle| \ \xi \in \tilde\Xi_{(K,J,R)} \right\} \qquad (18)
\]
\[
\tilde\Xi_{(K,J,R)} = \Pi_K \times \tilde\Psi_{(K,J,R)} \times ([a_\Sigma, A_\Sigma]^q)^K \qquad (19)
\]
\[
\tilde\Psi_{(K,J,R)} = \left\{ (\beta_1^{R_1}, \ldots, \beta_K^{R_K}) \in \Psi_{(K,J,R)} \ \middle| \ \text{for all } k \in \{1, \ldots, K\}, \ \beta_k^{R_k} = \sum_{l=1}^{R_k} \sigma_l u_l v_l^t, \ \sigma_l < A_\sigma \right\}.
\]
Since we work in regression, we define a tensorized version of the Kullback-Leibler divergence,
\[
KL^{\otimes n}(s, t) = E\left[ \frac{1}{n} \sum_{i=1}^{n} KL(s(\cdot|x_i), t(\cdot|x_i)) \right] ;
\]
a tensorized version of the Hellinger distance,
\[
(d_H^{\otimes n})^2(s, t) = E\left[ \frac{1}{n} \sum_{i=1}^{n} d_H^2(s(\cdot|x_i), t(\cdot|x_i)) \right].
\]
We also need the Jensen-Kullback-Leibler divergence, defined, for ρ ∈ (0, 1), by
\[
JKL_\rho(s, t) = \frac{1}{\rho} KL(s, (1-\rho)s + \rho t);
\]
and its tensorized version
\[
JKL_\rho^{\otimes n}(s, t) = E\left[ \frac{1}{n} \sum_{i=1}^{n} JKL_\rho(s(\cdot|x_i), t(\cdot|x_i)) \right].
\]
The number of classes is assumed to be unknown and will be estimated here, contrary to the ℓ1-oracle inequality for the Lasso. The covariates, although random, are assumed to be bounded. To simplify the reading, we assume that X ∈ [0, 1]^p. In this thesis, we then obtain the two following theorems.

Oracle inequality for the Lasso-MLE procedure. Let (x_i, y_i)_{i∈{1,...,n}} be the observations, coming from an unknown conditional density s*. Let S_{(K,J)} be defined by (14). We consider J^L ⊂ J, the sub-collection of index sets built by following the regularization path of the Lasso estimator. For (K, J) ∈ K × J^L, denote by S^B_{(K,J)} the model defined by (16). We consider the maximum likelihood estimator
\[
\hat{s}^{(K,J)} = \operatorname*{argmin}_{s_\xi^{(K,J)} \in S_{(K,J)}^B} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log(s_\xi^{(K,J)}(y_i|x_i)) \right\}.
\]
Denote by D_{(K,J)} the dimension of the model S^B_{(K,J)}, D_{(K,J)} = K(|J| + q + 1) − 1. Let s̄_ξ^{(K,J)} ∈ S^B_{(K,J)} be such that
\[
KL^{\otimes n}(s^*, \bar{s}_\xi^{(K,J)}) \leq \inf_{t \in S_{(K,J)}^B} KL^{\otimes n}(s^*, t) + \frac{\delta_{KL}}{n} ;
\]
and let τ > 0 be such that s̄_ξ^{(K,J)} ≥ e^{−τ} s*. Let pen : K × J → R+, and assume that there exists an absolute constant κ > 0 such that, for all (K, J) ∈ K × J,
\[
\operatorname{pen}(K, J) \geq \kappa \frac{D_{(K,J)}}{n} \left[ 2 B^2(A_\beta, A_\Sigma, a_\Sigma) - \log\left( \frac{D_{(K,J)}}{n} B^2(A_\beta, A_\Sigma, a_\Sigma) \wedge 1 \right) + (1 \vee \tau) \log\left( \frac{4epq}{(D_{(K,J)} - q^2) \wedge pq} \right) \right] ;
\]
where the constants A_β, A_Σ, a_Σ are defined by (17). If we select the model indexed by (K̂, Ĵ), where
\[
(\hat{K}, \hat{J}) = \operatorname*{argmin}_{(K,J) \in K \times J^L} \left\{ -\sum_{i=1}^{n} \log(\hat{s}^{(K,J)}(y_i|x_i)) + \operatorname{pen}(K, J) \right\},
\]
then the estimator ŝ^{(K̂,Ĵ)} satisfies, for all ρ ∈ (0, 1),
\[
E\left[ JKL_\rho^{\otimes n}(s^*, \hat{s}^{(\hat{K},\hat{J})}) \right]
\leq C E\left[ \inf_{(K,J) \in K \times J^L} \left( \inf_{t \in S_{(K,J)}} KL^{\otimes n}(s^*, t) + \operatorname{pen}(K, J) \right) \right] + \frac{4}{n} ;
\]
for an absolute constant C.

This theorem gives us a theoretical penalty for which the model minimizing the penalized criterion has good estimation properties. The constants, although not optimal, are explicit, as functions of the bounds of the parameter space. The penalty is almost proportional (up to a logarithmic term) to the dimension of the model. The logarithmic term was studied in Meynet's thesis, [Mey12]. Here, in practice, we take the penalty proportional to the dimension.

Oracle inequality for the Lasso-Rank procedure.
Soit (xi , yi )i∈{1,...,n} les observations, issues d’une densité condiB tionnelle inconnue s∗ . Pour (K, J, R) ∈ K × J × R, soit S(K,J,R) définie par (15), et S(K,J,R) définie par (18). Soit J L ⊂ J une sous-collection construite en suivant le chemin de régularisation de l’estimateur du Lasso. B Soit s̄(K,J,R) ∈ S(K,J,R) telle que, pour δKL > 0, KL⊗n (s∗ , s̄(K,J,R) ) ≤ inf B t∈S(K,J,R) KL⊗n (s∗ , t) + δKL n 33 Introduction et telle qu’il existe τ > 0 tel que s̄(K,J,R) ≥ e−τ s∗ . (20) B On considère la collection d’estimateurs {ŝ(K,J,R) }(K,J,R)∈K×J ×R de S(K,J,R) , vérifiant ŝ (K,J,R) = argmin (K,J,R) sξ B ∈S(K,J,R) ( ) n   1X (K,J,R) log sξ (yi |xi ) . − n i=1 B Notons D(K,J,R) la dimension du modèle S(K,J,R) . Soit pen : K × J × R → R+ définie par, pour tout (K, J, R) ∈ K × J × R,    D(K,J,R) 2 2 B (Aβ , AΣ , aΣ , Aσ ) ∧ 1 2B (Aβ , AΣ , aΣ , Aσ ) − log n !) K X 4epq + Rk , (D(K,J,R) − q 2 ) ∧ pq D(K,J,R) pen(K, J, R) ≥κ n + log k=1 avec κ > 0 une constante absolue. Alors, l’estimateur ŝ(K̂,J, ˆ R̂) , avec ˆ R̂) = (K̂, J, argmin (K,J,R)∈K×J L ×R ( vérifie, pour tout ρ ∈ (0, 1),    ˆ R̂) ∗ (K̂,J, ⊗n E JKLρ s , ŝ   ≤CE inf (K,J,R)∈K×J L ×R ) n 1X (K,J,R) log(ŝ (yi |xi )) + pen(K, J, R) , − n i=1 (21) inf t∈S(K,J,R) KL⊗n (s∗ , t) + pen(K, J, R)  + 4 , n pour une constante C > 0. Ce théorème nous donne une pénalité théorique pour laquelle le modèle minimisant le critère pénalisé a de bonnes propriétés d’estimation. Ces deux théorèmes ne sont pas des inégalités oracle exactes, à cause de la constante C, mais on contrôle cette constante. Ces résultats sont non-asymptotiques, ce qui nous permet de les utiliser dans notre cadre de grande dimension. Ils justifient l’utilisation de l’heuristique des pentes. Données fonctionnelles Ce travail de classification de données en régression a été développé en toute généralité, mais aussi plus particulièrement dans le cas des données fonctionnelles. L’analyse des données fonctionnelles s’est beaucoup développée, grâce notamment aux progrès techniques récents qui permettent d’enregistrer des données sur des grilles de plus en plus fines. Pour une analyse générale de ce type de données, on peut citer par exemple le livre de Ramsay et Silverman, [RS05] ou celui de Ferraty et Vieu, [FV06]. Dans cette thèse, on prend le parti de projeter les fonctions observées sur une base orthonormale. Cette approche a l’avantage de considérer l’aspect fonctionnel, par rapport à l’analyse multivariée de la discrétisation du signal, et de résumer le signal en peu de coefficients, si la base est bien choisie. 34 Introduction Plus précisément, on choisit de projeter les fonctions étudiées sur des bases d’ondelettes. On décompose de façon hiérarchique les signaux dans le domaine temps-échelle. On peut alors décrire une fonction à valeurs réelles par une approximation de cette fonction, et un ensemble de détails. Pour une étude générale des ondelettes, on peut citer Daubechies, [Dau92], ou Mallat, [Mal99]. Il est à noter que, par données fonctionnelles, dans notre cadre de modèles de mélange en régression, on entend soit des régresseurs fonctionnels et une réponse vectorielle (typiquement, l’analyse spectrophotométrique de données, comme le jeu de données classiques Tecator où la proportion de gras d’un morceau de viande est exprimée en fonction de la courbe de spectrophotométrie), soit des régresseurs vectoriels et une réponse fonctionnelle, soit des régresseurs et une réponse fonctionnels (comme la régression de la consommation électrique d’un jour sur la veille). 
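À titre d'illustration de l'étape de projection détaillée ci-dessous, voici une esquisse minimale en Python avec la bibliothèque PyWavelets ; le choix de l'ondelette ('db4'), le niveau de décomposition et le seuillage des coefficients sont des hypothèses d'illustration, et ne correspondent pas nécessairement aux choix faits dans la thèse.

```python
import numpy as np
import pywt  # PyWavelets

# Signal discrétisé sur une grille de T points (exemple simulé)
T = 256
t = np.linspace(0, 1, T)
f = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.default_rng(0).normal(size=T)

# Décomposition en ondelettes : coefficients d'approximation et de détail.
# L'ondelette 'db4' et le niveau 4 sont des choix d'illustration.
coeffs = pywt.wavedec(f, wavelet='db4', level=4)

# Représentation parcimonieuse : on ne conserve que les coefficients les plus grands.
flat = np.concatenate(coeffs)
seuil = np.quantile(np.abs(flat), 0.90)  # on garde environ 10 % des coefficients
coeffs_seuil = [np.where(np.abs(c) >= seuil, c, 0.0) for c in coeffs]

# Reconstruction approchée du signal à partir des coefficients conservés.
f_approx = pywt.waverec(coeffs_seuil, wavelet='db4')[:T]
print("erreur relative :", np.linalg.norm(f - f_approx) / np.linalg.norm(f))
```

Cette esquisse illustre seulement l'idée générale : un signal régulier est bien résumé par un petit nombre de coefficients d'ondelettes, ce qui réduit la dimension du problème.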
On considère un échantillon de signaux (fi )1≤i≤n observés sur une grille de temps {t1 , . . . , tT }. On peut considérer l’échantillon (fi (tj )) 1≤i≤n pour faire notre analyse (ces observations sont soit 1≤j≤p les régresseurs, soit la réponse, soit les deux suivant la nature des données), mais on peut aussi considérer les projections de l’échantillon sur une base orthonormée B = {αj }j∈N∗ . Dans ce cas, il existe (bj )j∈N∗ tels que, pour tout f , pour tout t, f (t) = ∞ X bj αj (t). j=1 On peut choisir une base d’ondelettes B = {φ, ψl,h } l≥0 0≤h≤2l −1 , où R — ψ est une ondelette réelle, vérifiant ψ ∈ L1 ∩ L2 , tψ ∈ L1 , et R ψ(t)dt = 0. — pour tout t, ψl,h (t) = 2l/2 ψ(2l t − h) pour (l, h) ∈ Z2 . — φ une fonction d’échelle associée à ψ. — pour tout t, φl,h (t) = 2l/2 φ(2l t − h) pour (l, h) ∈ Z2 . On peut alors écrire, pour tout f , pour tout t, X XX f (t) = βL,h (f )φL,h (t) + dl,h (f )ψl,h (t) h∈Z h∈Z l≤L où  βl,h (f ) =< f, φl,h > pour tout (l, h) ∈ Z2 , dl,h (f ) =< f, ψl,h > pour tout (l, h) ∈ Z2 . Alors, plutôt que de considérer l’échantillon (fi )1≤i≤n , on peut travailler avec l’échantillon (xi )1≤i≤n = (βL,h (fi ), (dl,h (fi ))l≥L,h∈Z )1≤i≤n . Comme la base est orthonormale, supposer que les (fi )1≤i≤n suivent une loi Gaussienne revient à supposer que les (xi )1≤i≤n suivent une loi Gaussienne. D’un point de vue pratique, la décomposition d’un signal sur une base d’ondelettes est très efficace. Citons Mallat, [Mal99], pour une description détaillée des ondelettes et de leur utilisation. L’intérêt majeur est la décomposition temps-échelle du signal, qui permet d’analyser les coefficients. De plus, si on choisit bien la base, on peut résumer un signal en peu de coefficients, ce qui permet de réduire la dimension du problème. D’un point de vue plus pratique, on peut citer Misiti et coauteurs, [MMOP07] pour la mise en pratique de la décomposition d’un signal sur une base d’ondelettes. L’application principale de cette thèse est la classification des consommateurs électriques dans un but de prédiction de la consommation agrégée. Si on prévoit la consommation de chaque consommateur, et qu’on somme ces prédictions, on va sommer les erreurs de prédiction, donc 35 Introduction on peut faire beaucoup d’erreurs. Si on prévoit la consommation totale, on risque de faire des erreurs en n’étudiant pas assez les variations de chaque consommation individuelle. Cela explique le besoin de faire de la classification, et la classification en régression est faite dans un but de prédiction. Cependant, on sait que la prédiction de la consommation électrique peut être améliorée avec des modèles beaucoup plus adaptés. L’objectif de cette procédure est de classer ensemble, dans une étape préliminaire, les consommateurs qui ont le même comportement d’un jour à l’autre, ces groupes seront alors construits dans un but de prédiction. Plan de la thèse Cette thèse est principalement centrée sur les modèles de mélange en régression, et les problèmes de classification avec des données de régression en grande dimension. Elle est découpée en 5 chapitres, qui peuvent tous être lus de manière indépendante. Le premier chapitre est consacré à l’étude du modèle principal. On décrit le modèle de mélange de Gaussiennes en régression, où la réponse et les régresseurs sont multivariés. On propose plusieurs approches pour estimer les paramètres de ce modèle, entre autres en grande dimension (pour la réponse et pour les régresseurs). 
On définit, dans ce cadre, plusieurs estimateurs pour les paramètres inconnus : une extension de l’estimateur du Lasso, l’estimateur du maximum de vraisemblance et l’estimateur du maximum de vraisemblance sous contrainte de faible rang. Dans chaque cas, on a cherché à optimiser les définitions des estimateurs dans un but algorithmique. On décrit, de manière précise, les deux procédures proposées dans cette thèse pour classifier des variables dans un cadre de régression en grande dimension, et estimer le modèle sous-jacent. C’est une partie méthodologique, qui décrit précisément le fonctionnement de nos procédures, et qui explique comment les mettre en œuvre numériquement. Des illustrations numériques sont proposées, pour confirmer en pratique l’utilisation de nos procédures. On utilise pour cela des données simulées, où l’aspect de grande dimension, l’aspect fonctionnel, et l’aspect de classification sont surlignés, et on utilise aussi des données de référence, où la vraie densité (inconnue donc) n’appartient plus à la collection de modèles considérée. Dans un deuxième chapitre, on obtient un résultat théorique pour l’estimateur du Lasso dans les modèles de mélange de Gaussiennes en régression, en tant que régularisateur ℓ1 . Remarquons qu’ici, la pénalité est différente de celle du chapitre 1, cet estimateur étant une extension directe de l’estimateur du Lasso introduit par Tibshirani pour le modèle linéaire. Nous établissons une inégalité oracle ℓ1 qui compare le risque de prédiction de l’estimateur Lasso à l’oracle ℓ1 . Le point important de cette inégalité oracle est que, contrairement aux résultats habituels sur l’estimateur du Lasso, nous n’avons pas besoin d’hypothèses sur la non colinéarité entre les variables. En contrepartie, la borne sur le paramètre de régularisation n’est pas optimale, dans le sens où des résultats d’optimalité ont été démontrés pour une borne inférieure à celle que l’on obtient, mais sous des hypothèses plus contraignantes. Notons que les constantes sont toutes explicites, même si l’optimalité de ces quantités n’est pas garantie. Dans les chapitres 3 et 4, on propose une étude théorique de nos procédures de classification. On justifie théoriquement l’étape de sélection de modèles en établissant une inégalité oracle dans chaque cas (correspondant respectivement à l’inégalité oracle pour la procédure Lasso-EMV et l’inégalité oracle pour la procédure Lasso-Rang). Dans un premier temps, on a obtenu un théorème général de sélection de modèles, qui permet de choisir un modèle parmi une sous-collection aléatoire, dans un cadre de régression. Ce résultat, démontré à l’aide d’inégalités de concentration et de contrôles par calcul d’entropie métrique, est une généralisation à une sous-collection aléatoire de modèles d’un résultat déjà existant. Cette amélioration nous permet d’obtenir une inégalité oracle pour chacune de nos procédures : en effet, nous considérons une sous-collection aléatoire de modèles, décrite par le chemin de régularisation de l’estimateur du Lasso, et cet effet Introduction 36 aléatoire nécessite de prendre des précautions dans les inégalités de concentration. Ce résultat fournit une forme de pénalité minimale garantissant que l’estimateur du maximum de vraisemblance pénalisé est proche de l’oracle ℓ0 . En appliquant une telle pénalité lors de nos procédures, nous sommes sûrs d’obtenir un estimateur avec un faible risque de prédiction. L’hypothèse majeure que l’on fait pour obtenir ce résultat est de borner les paramètres du modèle de mélange. 
Remarquons que la pénalité n’est pas proportionnelle à la dimension, il y a un terme logarithmique en plus. On peut alors s’interroger quant à la nécessité de ce terme. On illustre aussi cette étape dans chacun des chapitres sur des jeux de données simulées et des jeux de données de référence. Il est important de souligner que ces résultats, théoriques et pratiques, sont envisageables car nous avons réestimé les paramètres par l’estimateur du maximum de vraisemblance, en se restreignant aux variables actives pour ne plus avoir de problème de grande dimension. Dans le chapitre 5, on s’intéresse à un jeu de données réelles. On met en pratique la procédure Lasso-EMV de classification des données en régression en grande dimension pour comprendre comment classer les consommateurs électriques, dans le but d’améliorer la prédiction. Ce travail a été effectué en collaboration avec Yannig Goude et Jean-Michel Poggi. Le jeu de données utilisé est un jeu de données irlandaises, publiques. Il s’agit de consommations électriques individuelles, relevées sur une année. Nous avons aussi accès à des données explicatives, telles que la température, et des données personnelles pour chaque consommateur. Nous avons utilisé la procédure Lasso-EMV de trois manières différentes. Un problème simple, qui nous a permis de calibrer la méthode, est de considérer la consommation agrégée sur les individus, et de classifier les transitions de jour. Les données sont alors assez stables, et les résultats interprétables (on veut classer les transitions de jours de semaine ensemble par exemple). Le deuxième schéma envisagé est de classifier les consommateurs, sur leur consommation moyenne. Pour ne pas perdre l’aspect temporel, on a considéré les jours moyens. Finalement, pour compléter l’analyse, on a classifié les consommateurs sur leur comportement sur deux jours fixés. Le problème majeur de ce schéma est l’instabilité des données. Cependant, l’analyse des résultats, par des critères classiques en consommation électrique, ou grâce aux variables explicatives disponibles avec ce jeu de données, permet de justifier l’intérêt de notre méthode pour ce jeu de données. A travers ce manuscrit, nous illustrons donc l’utilisation des modèles de mélange de Gaussiennes en régression, d’un point de vue méthodologique, mis en oeuvre d’un point de vue pratique, et justifié d’un point de vue théorique. Perspectives Pour poursuivre l’exploration des résultats de nos méthodes sur des données réelles, on pourrait utiliser un modèle de prédiction dans chaque classe. L’idée serait alors de comparer la prédiction obtenue plus classiquement avec la prédiction agrégée obtenue à l’aide de notre classification. D’un point de vue méthodologique, on pourrait développer des variantes de nos procédures. Par exemple, on pourrait envisager de relaxer l’hypothèse d’indépendance des variables induite par la matrice de covariance diagonale. Il faudrait la supposer parcimonieuse, pour réduire la dimension sous-jacente, et ainsi considérer des variables possiblement corrélées. On pourrait aussi améliorer le critère de sélection de modèles, en l’orientant plus pour la classification. Par exemple, le critère ICL, introduit dans [BCG00] et développé dans la thèse de Baudry, [Bau09], tient compte de cet objectif de classification en considérant l’entropie. D’un point de vue théorique, d’autres résultats pourraient être envisagés. 
Les inégalités oracles obtenues donnent une pénalité minimale conduisant à de bons résultats, mais on pourrait vouloir démontrer que l’ordre de grandeur est le bon, à l’aide d’une borne 37 Introduction minimax. On pourrait aussi s’intéresser à des intervalles de confiance. Le théorème de van de Geer et al., dans [vdGBRD14], valable pour des pertes convexes, peut être généralisé à notre cas, et on obtient ainsi assez facilement un intervalle de confiance pour la matrice de régression. Cependant, dans un but de prédiction, il pourrait être plus intéressant d’obtenir un intervalle de confiance pour la réponse, mais c’est un problème bien plus difficile. Dans un but de classification, on pourrait aussi vouloir obtenir des résultats similaires aux inégalités oracles pour un autre critère que la divergence de Kullback-Leibler, plus orienté classification. Notations 38 Notations In this thesis, we denote (unless otherwise stated) by capital letter random variables, by lower case observations, and in bold letters the observation vector. For a matrix A, we denote by [A]i,. its ith row, [A].,j its jth column, and [A]i,j its coefficient indexed by (i, j). For a vector B, we denote by [B]j its jth component. Usual notations cA complement of A t A transpose of A T r(A) trace of a square matrix A E(X) esperance of the random variable X Var(X) variance of the random variable X N Gaussian distribution Nq Gaussian multivariate distribution of size q χ2 chi-squared distribution B orthonormal basis 1A indicator function on a set A ⌊a⌋ floor function of a a∨b notation for the maximum between a and b a∧b notation for the minimum between a and b f ≍g f is asymptotically equivalent to g ∆ discriminant for a quadratic polynomial function Iq identity matrix of size q < a, b > scalar product between a and b P({1, . . . , p}) set of parts of {1, . . . 
, p} S++ set of positive-definite matrix of size q q Acronym AIC Akaike Information Criterion ARI Adjusted Rand Index BIC Bayesian Information Criterion EM Expectation-Maximization (algorithm) EMV Estimateur du Maximum de Vraisemblance FR False Relevant (variables) LMLE Lasso-Maximum Likelihood Estimator procedure LR Lasso-Rank procedure MAPE Mean Absolute Percentage Error MLE Maximum Likelihood Estimator REC Restricted Eigenvalue Condition SNR Signal-to-Noise Ratio TR True Relevant (variables) 39 Notations 40 Variables and observations X regressors: random variable of size p xi ith observation of the variable X x vector of the observations Y response: random variable of size q yi ith observation of the variable Y y vector of the observations Z random variable for the affectation: vector of size K, Zk = 1 if the variable Y , conditionally to X, belongs to the cluster k, 0 otherwise zi observation of the variable Z for the observation yi conditionally to Xi = xi F functional regressor: random functional variable fi ith observation of the variable F G functional response: random functional response gi ith observation of the variable G ỹ reparametrization of observation y, matrix of size n × K × q x̃ reparametrization of observation x, matrix of size n × K × p ǫ Gaussian variable ǫi ith observation of the variable ǫ ŷi prediction of the value of yi from observation xi Parameters β conditional mean, of size q × p × K σ variance, in univariate models, of size K Σ covariance matrix, in multivariate models, of size q × q × K Φ reparametrized conditional mean, of size q × p × K P reparametrized covariance matrix, of size q × q × K π proportions coefficients, of size K τ̂ A Posteriori Probability, matrix of size n × K ξ vector of all parameters before reparametrization: (π, β, Σ) θ vector of all parameters after reparametrization: (π, φ, P) λ regularization parameter for the Lasso estimator λk,j,z Lasso regularization parameter to cancel coefficient [φk ]z,j in mixture models Ω parameter for the assumption K wm weights for the assumption K τi,k (θ) probability for the observation i to belong to the cluster k, according to the parameter θ κ parameter for the slope heuristic R vector of ranks for conditional mean, of size K Rk rank value of the conditional mean φk in the component k ξ0 true parameter (Chapter 2) 41 Estimators β̂ σ̂ 2 Σ̂ θ̂Lasso (λ) β̂ Lasso (λ) β̃ Lasso (A) Lasso (λ) ξˆK EM V ξˆK ξˆJEM V ξˆJRank Σ̂x θ̂Group−Lasso (λ) β̂ LR (λ) P̂ LR (λ) Notations estimator of the conditional mean in linear model estimator of the variance in linear model estimator of the covariance matrix in multivariate linear model estimator of θ by the Lasso estimator, with regularization parameter λ estimator of β by the Lasso estimator, with regularization parameter λ estimator of β by the Lasso estimator, with regularization parameter A according to the dual formulae estimator of ξK by the Lasso estimator, with regularization parameter λ estimator of ξK by the maximum likelihood estimator estimator of ξK by the maximum likelihood estimator, restricted to J for the relevant variables estimator of ξK by the low rank estimator, restricted to J for the relevant variables Gram matrix, according to the sample x estimator of θ by the Group-Lasso estimator with regularization parameter λ estimator of β by the low-rank estimator, restricted to variables detected by β̂ Lasso (λ) estimator of P by the low-rank estimator, restricted to variables detected by β̂ Lasso (λ) Sets of densities H(K,J) set of conditional densities, 
with parameters θ, with K clusters, and J for relevant variable set Ȟ(K,J) set of conditional densities, with parameters θ, with K clusters, and J for relevant variable set, and with vector of ranks R S(K,J) set of conditional densities, with parameters ξ, with K clusters, and J for relevant variable set S(K,J,R) set of conditional densities, with parameters ξ, with K clusters, and J for relevant variable set, and with vector of ranks R B S(K,J) subset of S(K,J) with bounded parameters B S(K,J,R) subset of S(K,J,R) with bounded parameters Dimensions p number of regressors q response size K number of components n sample size Dm dimension of the model Sm D(K,J) dimension of the model S(K,J) D(K,J,R) dimension of the model S(K,J,R) Notations 42 Sets of parameters ΘK set of θ with K components Θ(K,J) set of θ with K components and J for relevant variables set Θ(K,J,R) set of θ with K components and J for relevant variables set, and R for vector of ranks for the conditional mean ΞK set of ξ with K components Ξ(K,J) set of ξ with K components and J for relevant variables set Ξ(K,J,R) set of ξ with K components and J for relevant variables set, and R for vector of ranks for the conditional mean Ξ̃(K,J,R) subset of Ξ(K,J,R) with bounded parameters K set of possible number of components J set of set of relevant variables L J set of set of relevant variables, determined by the Lasso estimator Je set of set of relevant variables, determined by the Group-Lasso estimator R set of possible rank vectors S set of densities M model collection indices for the model collection constructed by our procedure L M random model collection indices for the model collection constructed by our procedure, according to the Lasso estimator M̌ random model collection indices M̃ model collection indices for the Group-Lasso-MLE model collection L M̃ random model collection indices for the Group-Lasso-MLE model collection ΠK simplex of proportion coefficients Tq upper triangular matrices, of size q J set of relevant variables Jλ set of relevant variables dected by the Lasso estimator with regularization parameter λ ˜ J set of relevant variables detected by the Group-Lasso estimator Ψ(K,J,R) set of conditional means, with J for relevant columns Ψ̃(K,J,R) subset of Ψ(K,J,R) with bounded parameters and R for vector of ranks, in a mixture with K components FJ set of conditional Gaussian density, with bounded conditional means and bounded covariance coefficients, and relevant variables set defined by J F(J,R) set of conditional Gaussian density, with relevant variables set defined by J and vector of ranks defined by R, and bounded covariance coefficients and bounded singular values GK grid of regularization parameters, for model with K clusters 43 Notations Functions KL KLn KL⊗n dH n d⊗ H JKLρ n JKL⊗ ρ sξ sK ξ (K,J) sξ (K,J,R) sξ s∗ sO sm l lλ ˜lλ lc γ l γn pen l u ϕ ψ φ ξ(x) m(A) M (A) H[.] 
(ǫ, S, ||.||) Kullback-Leibler divergence Kullback-Leibler divergence for fixed covariates tensorized Kullback-Leibler divergence Hellinger distance tensorized Hellinger distance Jensen-Kullback-Leibler divergence, with parameter ρ ∈ (0, 1) tensorized Jensen-Kullback-Leibler divergence, with parameter ρ ∈ (0, 1) conditional density, with parameter ξ conditional density, with parameter ξ and with K components conditional density, with parameter ξ and with K components and J for relevant variables set conditional density, with parameter ξ and with K components and J for relevant variables set and R for vector of ranks true density oracle conditional density density for the model m log-likelihood function penalized log-likelihood function for the Lasso estimator penalized log-likelihood function for the Group-Lasso estimator complete log-likelihood function constrast function loss function empirical contrast function penalty lower function in a bracket upper function in a bracket Gaussian density wavelet function scaling function in wavelet decomposition parameters in mixture regressions, defining from regressors x smallest eigenvalue of the matrix A biggest eigenvalue of the matrix A bracketing entropy of a set S, with brackets of width ||.|| smaller than ǫ Indices j varying from 1 to p z varying from 1 to q k varying from 1 to K i varying from 1 to n m varying in M mO index of the oracle m̂ selected index Notations 44 Chapter 1 Two procedures Contents 1.1 1.2 1.3 1.4 1.5 1.6 1.7 Introduction . . . . . . . . . . . . . . . . . . . . . . . . Gaussian mixture regression models . . . . . . . . . . 1.2.1 Gaussian mixture regression . . . . . . . . . . . . . . . 1.2.2 Clustering with Gaussian mixture regression . . . . . 1.2.3 EM algorithm . . . . . . . . . . . . . . . . . . . . . . . 1.2.4 The model collection . . . . . . . . . . . . . . . . . . . Two procedures . . . . . . . . . . . . . . . . . . . . . . 1.3.1 Lasso-MLE procedure . . . . . . . . . . . . . . . . . . 1.3.2 Lasso-Rank procedure . . . . . . . . . . . . . . . . . . Illustrative example . . . . . . . . . . . . . . . . . . . . 1.4.1 The model . . . . . . . . . . . . . . . . . . . . . . . . 1.4.2 Sparsity and model selection . . . . . . . . . . . . . . 1.4.3 Assessment . . . . . . . . . . . . . . . . . . . . . . . . Functional datasets . . . . . . . . . . . . . . . . . . . . 1.5.1 Functional regression model . . . . . . . . . . . . . . . 1.5.2 Two procedures to deal with functional datasets . . . Projection onto a wavelet basis . . . . . . . . . . . . . Our procedures . . . . . . . . . . . . . . . . . . . . . . 1.5.3 Numerical experiments . . . . . . . . . . . . . . . . . . Simulated functional data . . . . . . . . . . . . . . . . Electricity dataset . . . . . . . . . . . . . . . . . . . . Tecator dataset . . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . Appendices . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.1 EM algorithms . . . . . . . . . . . . . . . . . . . . . . EM algorithm for the Lasso estimator . . . . . . . . . EM algorithm for the rank procedure . . . . . . . . . 1.7.2 Group-Lasso MLE and Group-Lasso Rank procedures Context - definitions . . . . . . . . . . . . . . . . . . . Group-Lasso-MLE procedure . . . . . . . . . . . . . . Group-Lasso-Rank procedure . . . . . . . . . . . . . . 45 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 42 42 43 44 45 46 46 47 48 48 49 50 54 55 55 55 56 56 57 57 58 61 61 61 61 64 64 65 65 66 1.1. INTRODUCTION 46 In this chapter, we describe two procedures to cluster data in a regression context. Following Maugis and Meynet [MMR12], we propose two global model selection procedures to simultaneously select the number of clusters and the set of relevant variables for the clustering. It is especially suited to deal with high-dimension and low sample size settings. We take advantage of regression datasets to underline reliance between the regressors and the responses, and cluster data from this approach. This idea could be interesting for prediction, because observations sharing the same reliance will be considered in the same cluster. In addition, we focus on functional dataset, for which the projection onto a wavelet basis leads to sparse representation. We also illustrate those procedures on simulated and benchmark dataset. 1.1 Introduction Owing to the increasing of high-dimensional datasets, regression models for multivariate response and high-dimensional predictors have become important tools. The goal of this chapter is to describe two procedures which cluster regression datasets. We focus on the model-based clustering. Each cluster is represented by a parametric conditional distribution, the entire dataset being modeled by a mixture of these distributions. It provides a rigorous statistical framework, and allows to understand the role of each variable in the clustering process. The model considered is then, for i ∈ {1, . . . , n}, if (yi , xi ) ∈ Rq × Rp belongs to the component k, there exists an unknown q ×p matrix of coefficients βk and an unknown covariance matrix Σk such that yi = β k x i + ǫ i (1.1) where ǫi ∼ Nq (0, Σk ). We will work with high-dimensional datasets, that is to say q × p could be larger than the sample size n, then we have to reduce the dimension. Two ways will be considered here, coefficients sparsity and ranks sparsity. We could work with a sparse model if the matrix β could be estimated by a matrix with few nonzero coefficients. The well-known Lasso estimator, introduced by Tibshirani in 1996 in [Tib96] for linear models, is the solution chosen here. Indeed, the Lasso estimator is used for variable selection, cite for example Meinshausen and Bühlmann in [MB10] for stability selection results. We could also cite the book of Bühlmann and van de Geer [BvdG11] for an overview of the Lasso estimator. If we look for rank sparsity for β, we have to assume that a lot of regressors are linearly dependent. This approach date’s back to the 1950’s, and was initiated by Anderson in [And51] for the linear model. Izenman, in [Ize75], introduced the term of reduced-rank regression for this class of models. A number of important works followed, cite for example Giraud in [Gir11] and Bunea et al. in [BSW12] for recent results. 
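To make the reduced-rank idea concrete, here is a hedged sketch in Python; the dimensions, the noise level and the use of a plain truncated-SVD estimator are illustrative assumptions of ours, not the penalized estimators studied in [Gir11] or [BSW12].

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Rank-r least-squares fit of Y ~ X B.

    Minimal sketch: compute the OLS solution, then project the fitted
    values onto their r leading right singular directions. This is one
    classical way to obtain a reduced-rank estimator, ignoring the error
    covariance; it is not the penalized estimator discussed in the text.
    """
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)      # p x q
    fitted = X @ B_ols                                  # n x q
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    P_r = Vt[:r].T @ Vt[:r]                             # q x q projector
    return B_ols @ P_r                                  # rank <= r

# Illustrative data (all dimensions are assumptions)
rng = np.random.default_rng(0)
n, p, q, r = 100, 10, 8, 2
B_true = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))   # true rank r
X = rng.normal(size=(n, p))
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))
B_hat = reduced_rank_regression(X, Y, r)
print(np.linalg.matrix_rank(B_hat))   # at most r
```

The rank constraint is enforced by projecting the fitted values onto their leading singular directions, which gives the closed-form reduced-rank solution when the error covariance is taken to be the identity.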
Nevertheless, the linear regression model used in those methods is appropriate for modeling the relationship between response and predictors when the reliance is the same for all observations, and it is inadequate for settings in which the regression coefficients differ across subgroups of the observations, then we will consider here mixture models. An important example of high-dimensional datasets is functional datasets (functional predictors and/or functional response). They have been studied for example in the book of Ramsay and Silverman, [RS05]. A lot of recent works have been done on regression models for functional datasets: for example, cite the article of Ciarleglio ([CO14]) which deals with scalar response 47 CHAPTER 1. TWO PROCEDURES and functional regressors. One way to consider functional datasets is to project it onto a basis well-suited. We can cite for example Fourier basis, splines, or, the one we consider here, wavelet basis. Indeed, wavelets are particularly well suited to handle many types of functional data, because they represent global and local attributes of functions, and can deal with discontinuities. Moreover, to deal with the sparsity previously mentioned, a large class of functions can be well represented with few non-zero coefficients, for any suitable wavelet. We propose here two procedures which cluster high-dimensional data or data described by a functional variable, explained by high-dimensional predictors or by predictor variables arising from sampling continuous curves. Note that we estimate the number of components, parameters of each model, and the proportions. We assume we do not have any knowledge about the model, except that it could be well approximated by sparse mixture Gaussian regression model. The high-dimensional problem is solved by using variable selection to detect relevant variables. Since the structure of interest may often be contained into a subset of available variables and many attributes may be useless or even harmful to detect a reasonable clustering structure, it is important to select the relevant clustering variables. Moreover, removing irrelevant variables enables to get simpler modeling and can largely enhance comprehension. Our two procedures are mainly based on three recent works. Firstly, we could cite the article of Städler et al. [SBG10], which studies finite mixture regression model. Even if we work on a multivariate version of it, the model considered in the article [SBG10] is adopted here. The second, Meynet and Maugis article [MMR12], deals with model-based clustering in density estimation. They propose a procedure, called Lasso-MLE procedure, which determines the number of clusters, the set of relevant variables for the clustering, and a clustering of the observations, with high-dimensional data. We extend this procedure with conditional densities. Finally, we could cite the article [Gir11] of Giraud. It suggests a low-rank estimator for the linear model. To take into account the matrix structure, we will consider this approach in our mixture models. We consider finite mixture of Gaussian regression model. We propose two different procedures, considering more or less the matrix structure. Both of them have the same frame. Firstly, an ℓ1 -penalized likelihood approach is considered to determine potential sets of relevant variables. Introduced by Tibshirani in [Tib96], the Lasso is used to select variables. 
This allows one to efficiently construct a data-driven model subcollection with reasonable complexity, even for highdimensional situations, with different sparsities, varying the regularization parameter in the ℓ1 penalized likelihood function. The second step of the procedures consists to estimate parameters in a better way than by the Lasso estimator. Then, we select a model among the collection using the slope heuristic, which is developed by Birgé and Massart in [BM07]. Differences between the both procedures are the estimation of parameters in each model. The first one, later called LassoMLE procedure, uses the maximum likelihood estimator rather than the ℓ1 -penalized maximum likelihood estimator. It avoids estimation problems due to the ℓ1 -penalization shrinkage. The second one, called Lasso-Rank procedure, deals with low rank estimation. For each model in the collection, we construct a subcollection of models with conditional means estimated by various low ranks matrices. It leads to sparsity and for the coefficients, and for the rank, and consider the conditional mean with its matrix structure. The chapter is organized as follows. Section 1.2 deals with Gaussian mixture regression models. It describes the model collection that we will consider. In Section 1.3, we describe both procedures that we propose to solve the problem of clustering high-dimensional regression data. Section 1.4 presents an illustrative example, to highlight each choice involved by both procedures. Section 1.5 states the functional data case, with a description of the projection proposed to convert these functions into coefficients data. We end this section by study of simulated and benchmark data. Finally, a conclusion section ends this chapter. 48 1.2. GAUSSIAN MIXTURE REGRESSION MODELS 1.2 Gaussian mixture regression models We have to construct a statistical framework on the observations. Because we estimate the conditional densities by multivariate Gaussian in each cluster, the model used is a finite Gaussian mixture regression model. Städler et al in [SBG10] describe this model, when X is multidimensional, and Y is scalar. We generalize it in the multivariate response case in this section. Moreover, we will describe a model collection of Gaussian mixture regression models, with several sparsities. 1.2.1 Gaussian mixture regression We observe n independent couples (xi , yi )1≤i≤n , realizations of random variables (Xi , Yi )1≤i≤n , with Yi ∈ Rq and Xi ∈ Rp for all i ∈ {1, . . . , n}, coming from a probability distribution with unknown conditional density denoted by s∗ . We want to perform model-based clustering, then we PKassume that data could be well approximated by a mixture conditional density s(y|x) = k=1 πk sk (y|x), with K unknown. To get a Gaussian mixture regression model, we suppose that, if Y conditionally to X belongs to the cluster k, Y = βk X + ǫ; where ǫ ∼ Nq (0, Σk ). We then assume that sk is a multivariate Gaussian conditional density. Thus, the random response variable Y ∈ Rq depends on a set of explanatory variables, written X ∈ Rp , through a regression-type model. By considering multivariate response, we could work with more general datasets. Indeed, we could for example explain a functional response by a functional regressor, as done with the electricity dataset in Section 1.5.3. Some assumptions are in order, for a mixture of K Gaussian regression models. — the variables Yi conditionally to Xi are independent, for all i ∈ {1, . . . 
, n} ; — we let Yi |Xi = xi ∼ sK ξ (y|xi )dy, with sK ξ (y|x) K X (y − βk x)t Σ−1 πk k (y − βk x) = exp − 2 (2π)q/2 det(Σk )1/2 k=1 ! K ξ = (π1 , . . . , πK , β1 , . . . , βK , Σ1 , . . . , ΣK ) ∈ ΞK = ΠK × (Rq×p )K × (S++ q ) ) ( K X ΠK = (π1 , . . . , πk ); πk > 0 for k ∈ {1, . . . , K} and πk = 1  k=1 S++ q is the set of symmetric positive definite matrices on Rq . Then, we want to estimate the conditional density function sK ξ from the observations. For all k ∈ {1, . . . , K}, βk is the matrix of regression coefficients, and Σk is the covariance matrix in the mixture component k. The πk sP are the mixture proportions. Actually, for all k ∈ {1, . . . , K}, for all z ∈ {1, . . . , q}, [βk x]z = pj=1 [βk ]z,j xj is the zth component of the conditional mean of the mixture component k for the conditional density sK ξ (.|x). In order to have a scale-invariant maximum likelihood estimator, and to have a convex optimization problem, we reparametrize the model described above by generalizing the reparametrization described in [SBG10]. For all k ∈ {1, . . . , K}, we then define Φk = Pk βk , in which t Pk Pk = Σ−1 k (it is the Cholesky decomposition of the positive definite matrix Σ−1 ). Our hypotheses could now be rewritten: k — the variables Yi conditionally to Xi are independent, for all i ∈ {1, . . . , n} ; 49 CHAPTER 1. TWO PROCEDURES — we let Yi |Xi = xi ∼ hK θ (y|xi )dy, for i ∈ {1, . . . , n} , with hK θ (y|x) = K X πk det(Pk ) k=1 (2π)q/2  (Pk y − Φk x)t (Pk y − Φk x) exp − 2  θ = (π1 , . . . , πK , Φ1 , . . . , ΦK , P1 , . . . , PK ) ∈ ΘK = ΠK × (Rp×q )K × (Tq )K ( ) K X ΠK = (π1 , . . . , πK ); πk > 0 for k ∈ {1, . . . , K} and πk = 1  k=1 Tq is the set of lower triangular matrix with non-negative diagonal entries. The log-likelihood of this model is equal to, according to the sample (xi , yi )1≤i≤n , l(θ, x, y) = n X log i=1 K X πk det(Pk ) k=1 (2π)q/2  (Pk yi − Φk xi )t (Pk yi − Φk xi ) exp − 2 ! ; and the maximum log-likelihood estimator (later denoted by MLE) is   1 M LE θ̂ := argmin − l(θ, x, y) . n θ∈ΘK This estimator is scale-invariant, and the optimization is convex in each cluster. Since we deal with the p × q >> n case, this estimator has to be regularized to obtain accurate estimates. As a result, we propose the ℓ1 -norm penalized MLE   1 Lasso θ̂ (λ) := argmin − lλ (θ, x, y) ; (1.2) n θ∈ΘK where K Pp X 1 1 πk ||Φk ||1 ; − lλ (θ, x, y) = − l(θ, x, y) + λ n n Pq k=1 where ||Φk ||1 = j=1 z=1 |[Φk ]z,j |, and with λ > 0 to specify. This estimator is not the usual ℓ1 -estimator, called the Lasso estimator, introduced by Tibshirani in [Tib96]. It penalizes the ℓ1 -norm of the coefficients and small variances simultaneously, which has some close relations to the Bayesian Lasso (see Park and Casella [PC08]). Moreover, the reparametrization allows us to consider non-standardized data. Notice that we restrict ourselves in this chapter to diagonal covariance matrices which are dependent of the clusters, that is to say for all k ∈ {1, . . . , K}, Σk = Diag ([Σk ]1,1 , . . . , [Σk ]q,q ). Then, with the renormalization described above, the restriction becomes, for all k ∈ {1, . . . , K}, Pk = Diag ([Pk ]1,1 , . . . , [Pk ]q,q ). In that case, we assume that variables are not correlated, which is a strong assumption, but it allows to reduce easily the dimension. 
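As a small illustration of the mixture conditional density s^K_ξ(y|x) written above, here is a sketch that evaluates it and simulates one observation from the model; the parameter values, the use of SciPy and the identity covariances are assumptions made only for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_regression_density(y, x, pi, beta, Sigma):
    """Evaluate the Gaussian mixture regression density s_xi(y | x).

    pi    : (K,) mixture proportions
    beta  : (K, q, p) regression matrices
    Sigma : (K, q, q) covariance matrices
    """
    dens = 0.0
    for k in range(len(pi)):
        dens += pi[k] * multivariate_normal.pdf(y, mean=beta[k] @ x, cov=Sigma[k])
    return dens

# Illustrative parameters: K = 2 clusters, p = 4 regressors, q = 3 responses
rng = np.random.default_rng(1)
K, p, q = 2, 4, 3
pi = np.array([0.5, 0.5])
beta = rng.normal(size=(K, q, p))
Sigma = np.stack([np.eye(q) for _ in range(K)])

x = rng.normal(size=p)
k = rng.choice(K, p=pi)                                   # latent cluster label
y = beta[k] @ x + rng.multivariate_normal(np.zeros(q), Sigma[k])
print(mixture_regression_density(y, x, pi, beta, Sigma))
```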
1.2.2 Clustering with Gaussian mixture regression Suppose we know how many clusters there are, denoted by K, and assume that we get, from the observations, an estimator θ̂ such that hK well approximate the unknown conditional density θ̂ ∗ s . Then, we want to group the data into clusters between observations which seem similar. From a different point of view, we can look at this problem as a missing data problem. Indeed, the complete data are ((x1 , y1 , z1 ), . . . , (xn , yn , zn )) in which the latent random variables are Z = (Z1 , . . . , Zn ), Zi = ([Zi ]1 , . . . , [Zi ]K ) for i ∈ {1, . . . , n} being defined by 1.2. GAUSSIAN MIXTURE REGRESSION MODELS [Zi ]k =  1 0 50 if Yi arises from the k th subpopulation ; otherwise. Thanks to the estimation θ̂, we could use the Maximum A Posteriori principle (later denoted MAP principle) to cluster data. Specifically, for all i ∈ {1, . . . , n}, for all k ∈ {1, . . . , K}, consider  πk det(Pk ) exp − 21 (Pk yi − Φk xi )t (Pk yi − Φk xi ) τi,k (θ) = PK  1 t r=1 πk det(Pk ) exp − 2 (Pk yi − Φk xi ) (Pk yi − Φk xi ) the posterior probability of yi coming from the component number k, where θ = (π, φ, P). Then, the data are partitioned by the following rule:  1 if τi,k (θ̂) > τi,l (θ̂) for all l 6= k ; [Zi ]k = 0 otherwise. 1.2.3 EM algorithm From an algorithmic point of view, we will use a generalization of the EM algorithm to compute the MLE and the ℓ1 -norm penalized MLE. The EM algorithm was introduced by Dempster et al. in [DLR77] to approximate the maximum likelihood estimator of parameters of mixture model. It is an iterative process based on the minimization of the expectation of the empirical contrast for the complete data conditionally to the observations and the current estimation of the parameter θ(ite) at each iteration (ite) ∈ N∗ . Thanks to the Karush-Kuhn-Tucker conditions, we could extend the second step to compute the maximum likelihood estimators, penalized or not, under rank constraint or not, as it was done in the scalar case in [SBG10]. All those calculus are available in Appendix 1.7.1. We therefore obtain the next updating formulae for the Lasso estimator defined by (1.2). Remark that it includes maximum likelihood estimator, and the rank constraint could be easily computed according to a singular value decomposition.  n (ite) (ite) (ite+1) k − πk ; (1.3) = πk + t(ite) πk n √ (ite) (ite) (ite) nk h[ỹ]k,z , [Φk ]z,. [x̃]k,. i + ∆ (ite+1) = [Pk ]z,z ; (1.4) (ite) 2nk ||[ỹ]k,z ||22  (ite) (ite) −[Sk ]j,z +nλπk (ite) (ite)   if [Sk ]j,z > nλπk ;  (ite)  ||[x̃]k,j ||22  (ite+1) (ite) (ite) [Sk ]j,z +nλπk (1.5) [Φk ]z,j = (ite) (ite) if [Sk ]j,z < −nλπk ; −  (ite) 2  ||[x̃] ||  2 k,j   0 else ; 51 CHAPTER 1. TWO PROCEDURES with, for j ∈ {1, . . . , p}, k ∈ {1, . . . , K}, z ∈ {1, . . . , q}, (ite) [Sk ]j,z = − nk = n X n X (ite) (ite) [x̃i ]k,j [Pk ](ite) z,z [ỹi ]k,z + i=1 p X (ite) (ite) (ite) [x̃i ]k,j [x̃i ]k,j2 [Φk ]z,j2 ; (1.6) j2 =1 j2 6=j (ite) τi,k ; i=1 q (ite) (ite) (ite) ([ỹi ]k,z , [x̃i ]k,j ) = τi,k ([yi ]z , [xi ]j );   (ite) (ite) 2 (ite) − 4||[ỹ]k,z ||22 ; [x̃] i ∆ = −nk h[ỹ]k,z , [Φk ](ite) z,. k,.     t   (ite) (ite) (ite) (ite) (ite) (ite) πk det Pk exp −1/2 Pk yi − Φk xi Pk y i − Φ k x i (ite)  τi,k =    t   ; (1.7) PK (ite) (ite) (ite) (ite) (ite) (ite) det Pk exp −1/2 Pk yi − Φk xi Pk y i − Φ k x i r=1 πk and t(ite) ∈ (0, 1], the largest value in the grid {δ l , l ∈ N}, 0 < δ < 1, such that the function is not increasing. 
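For a concrete picture of the alternation used here, the following is a minimal EM sketch for an unpenalized mixture of multivariate-response linear regressions with spherical covariances. It only mirrors the E-step / M-step structure; it is not the generalized EM with the coordinate-wise updates (1.3)-(1.7) used for the Lasso estimator, and the absence of penalty, the spherical covariances and the small ridge term are simplifying assumptions of ours.

```python
import numpy as np

def em_mixture_regression(X, Y, K, n_iter=100, seed=0):
    """Plain EM for a mixture of multivariate linear regressions (sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = Y.shape[1]
    pi = np.full(K, 1.0 / K)
    beta = rng.normal(scale=0.1, size=(K, q, p))
    sigma2 = np.ones(K)

    for _ in range(n_iter):
        # E-step: posterior probabilities tau_{ik} of each component
        log_tau = np.empty((n, K))
        for k in range(K):
            resid = Y - X @ beta[k].T
            log_tau[:, k] = (np.log(pi[k])
                             - 0.5 * np.sum(resid ** 2, axis=1) / sigma2[k]
                             - 0.5 * q * np.log(2 * np.pi * sigma2[k]))
        log_tau -= log_tau.max(axis=1, keepdims=True)
        tau = np.exp(log_tau)
        tau /= tau.sum(axis=1, keepdims=True)

        # M-step: weighted least squares in each component
        for k in range(K):
            w = tau[:, k]
            Xw = X * w[:, None]
            beta[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(p), Xw.T @ Y).T
            resid = Y - X @ beta[k].T
            sigma2[k] = (w @ np.sum(resid ** 2, axis=1)) / (q * w.sum())
            pi[k] = w.mean()
    return pi, beta, sigma2, tau
```

On a simulated design one would call, for instance, `pi, beta, sigma2, tau = em_mixture_regression(x, y, K=2)` and recover the MAP clustering of Section 1.2.2 with `tau.argmax(axis=1)`.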
In our case, the EM algorithm corresponds to switch between the E-step which corresponds to the calculus of (1.3), (1.4) and (1.5), and the M-step, which corresponds to the calculus of (1.7). To avoid convergence to local maximum rather than global maximum, we need to precise the initialization and the stopping rules. We initialize the clustering with the k-means algorithm on the couples (xi , yi )1≤i≤n . According to this clustering, we compute the linear regression estimators in each class. Then, we run a small number of times the EM-algorithm, repeat this initialization many times, and keep the one which maximizes the log-likelihood function: how the computation will start is important. Finally, to stop the algorithm, we could wait for any convergence, but the EM algorithm is known to check the convergence hypothesis, without converging, because of local maximum. Consequently, we choose to fix a minimum number of iterations to ensure non-local maximum, and to specify a maximum number of iterations to ensure stopping. Between these two bounds, we stop if there is convergence of the log-likelihood and of the parameters (with a relative criteria), adapted from [SBG10]. 1.2.4 The model collection We want to deal with high-dimensional data, that is why we have to determine which variables are relevant for the Gaussian regression mixture clustering. Indeed, we observe a small sample and we have to estimate many coefficients: we have a problem of identifiability. The sample size n is smaller than K(pq + q + 1) − 1, the size of parameters to estimate. A way to solve this problem is to select few variables to describe the problem. We then assume that we could estimate s∗ by a sparse model. To reduce the dimension, we want to determine which variables are useful for the clustering, and which are not. It leads to the definition of an irrelevant variable. Definition 1.2.1. A variable indexed by (z, j) ∈ {1, . . . , q} × {1, . . . , p} is irrelevant for the clustering if [Φ1 ]z,j = . . . = [ΦK ]z,j = 0. A relevant variable is a variable which is not irrelevant: at least in one cluster, this variable is not equal to zero. We denote by J the relevant variables set. [J] We denote by Φk the matrix with 0 on the set c J, for all k ∈ {1, . . . , K}, and H(K,J) the model with K components and with J for relevant variables set: 52 1.3. TWO PROCEDURES n o (K,J) (y|x) ; H(K,J) = y ∈ Rq |x ∈ Rp 7→ hθ where (K,J) hθ (y|x) = K X πk det(Pk ) k=1 and [J] (2π)q/2 [J] (1.8) [J] (Pk y − Φk x)t (Pk y − Φk x) exp − 2 [J] θ = (π1 , . . . , πK , Φ1 , . . . , ΦK , P1 , . . . , PK ) ∈ Θ(K,J) = ΠK × Rq×p K ! , × Rq+ K ; where the notation A[J] means that J is the relevant set variable for the matrix A. We will construct a model collection, by varying the number of components K and the relevant variables subset J. 1.3 Two procedures The goal of our procedures is, given a sample (x, y) = ((x1 , y1 ), . . . , (xn , yn )) ∈ (Rp × Rq )n , to discover the structure of the variable Y according to X. Thus, we have to estimate, according to the representation of H(K,J) , the number of clusters K, the relevant variables set J, and the parameters θ. To overcome this difficulty, we want to take advantage of the sparsity property of the ℓ1 -penalization to perform automatic variable selection in clustering high-dimensional data. Then, we could compute another estimator restricted on relevant variables, which will work better because it is no longer an high-dimensional issue. Thus, we avoid shrinkage problems due to the Lasso estimator. 
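Before detailing the two procedures, here is a minimal sketch of how the relevant variable set J of Definition 1.2.1 can be read off estimated mean matrices; the numerical threshold is an illustrative assumption of ours, and in the procedures below J actually comes from the Lasso regularization path.

```python
import numpy as np

def relevant_variables(Phi, tol=1e-8):
    """Return the relevant variable set J (Definition 1.2.1).

    Phi : array of shape (K, q, p); the variable (z, j) is irrelevant when
    [Phi_k]_{z,j} = 0 for every component k (here: below `tol` in absolute
    value). Sketch only; `tol` is an illustrative numerical threshold.
    """
    nonzero_somewhere = (np.abs(Phi) > tol).any(axis=0)          # q x p boolean
    return {(int(z), int(j)) for z, j in zip(*np.nonzero(nonzero_somewhere))}

# Illustrative example: K = 2, q = 2, p = 3, only the first column is relevant
Phi = np.zeros((2, 2, 3))
Phi[0, :, 0] = [1.0, -2.0]
Phi[1, 0, 0] = 0.5
print(sorted(relevant_variables(Phi)))   # [(0, 0), (1, 0)]
```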
The first procedure takes advantage of the maximum likelihood estimator, whereas the second one takes into account the matrix structure of Φ with a low rank estimation. 1.3.1 Lasso-MLE procedure This procedure is decomposed into three main steps: we construct a model collection, then in each we compute the maximum likelihood estimator, and we choose the best one among the model collection. The first step consists of constructing a model collection {H(K,J) }(K,J)∈M in which H(K,J) is defined by equation (1.8), and the model collection is indexed by M = K × J . We denote by K ⊂ N∗ the possible number of components. We assume we could bound K without loss of estimation. We also note J ⊂ P({1, . . . , q} × {1, . . . , p}). To detect the relevant variables, and construct the set J ∈ J , we penalize P empirical contrast P the by an ℓ1 -penalty on the mean parameters proportional to ||Φk ||1 = pj=1 qz=1 |[Φk ]z,j |. In the ℓ1 -procedures, the choice of the regularization parameters is often difficult: fixing the number of components K ∈ K, we propose to construct a data-driven grid GK of regularization parameters by using the updating formulae of the mixture parameters in the EM algorithm. We can give a formula for λ, the regularization parameter, depending on which coefficient we want to cancel, for all k ∈ {1, . . . , K}, j ∈ {1, . . . , p}, z ∈ {1, . . . , q}: [Φk ]z,j = 0 ⇔ λk,j,z = |[Sk ]j,z | ; nπk with [Sk ]j,z defined by (1.6). Then, we define the data-driven grid by GK = {λk,j,z , k ∈ {1, . . . , K}, j ∈ {1, . . . , p}, z ∈ {1, . . . , q}} . 53 CHAPTER 1. TWO PROCEDURES We could compute it from maximum likelihood estimations. Then, for each λ ∈ GK , we could compute the Lasso estimator defined by ( n ) K X 1X (K,J) Lasso θ̂ (λ) = argmin log(hθ (yi |xi )) + λ πk ||Φk ||1 . n θ∈Θ(K,J) i=1 k=1 For a fixed number of mixture components K ∈ K and a regularization parameter λ ∈ GK , we could use an EM algorithm, recalled in Appendix 1.7.1, to approximate this estimator. Then, for each K ∈ K, and for each λ ∈ GK , we could construct the relevant variables set Jλ . We denote by J the collection of all these sets. The second step consists of approximating the MLE ( ) n X 1 ĥ(K,J) = argmin − log(h(yi |xi )) ; n h∈H(K,J) i=1 using the EM algorithm for each model (K, J) ∈ M. The third step is devoted to model selection. Rather than select the regularization parameter, we select the refitted model. We use the slope heuristic described in [BM07]. Explain briefly how it works. Firstly, models are grouping according to their dimension D, to obtain a model collection {HD }D∈D . The dimension of a model is the number of parameters estimated in the model. For each dimension D, let ĥD be the estimator maximizing the likelihood P among the estimators associated to a model of dimension D. Also, the function D/n 7→ 1/n ni=1 log(ĥD (yi |xi )) has a linear behavior for large dimensions. We estimate the slope, denoted by κ̂,Pwhich will be used to calibrate the penalty. The minimizer D̂ of the penalized criterion −1/n ni=1 log(ĥD (yi |xi )) + 2κ̂D/n is determined, and the model selected is (KD̂ , JD̂ ). Remark that D = K(|J| + q + 1) − 1. Note that the model is selected after parameters refitting, which avoids issue of regularization parameter selection. For an oracle inequality to justify the slope heuristic used here, see [Dev14b]. 1.3.2 Lasso-Rank procedure Whereas the previous procedure does not take into account the multivariate structure, we propose a second procedure to perform this point. 
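As a hedged sketch of this grid construction (assuming the quantities [Sk]j,z of (1.6) have already been computed from a maximum likelihood fit; the random placeholders below only illustrate the shapes, and the function name is ours):

```python
import numpy as np

def lasso_grid(S, pi, n):
    """Data-driven grid G_K = { |[S_k]_{j,z}| / (n * pi_k) }.

    S  : array (K, p, q) of the quantities [S_k]_{j,z} of equation (1.6),
         assumed already computed from a maximum likelihood fit;
    pi : (K,) estimated proportions;  n : sample size.
    Returns the sorted grid of candidate regularization parameters.
    """
    lam = np.abs(S) / (n * pi[:, None, None])
    return np.unique(lam)                    # sorted, duplicates removed

# Illustrative call with random placeholders for S and pi
rng = np.random.default_rng(2)
K, p, q, n = 2, 10, 10, 100
S = rng.normal(size=(K, p, q))
pi = np.array([0.4, 0.6])
grid = lasso_grid(S, pi, n)
print(grid[:5], "...", grid.size, "values")
```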
For each model belonging to the collection H(K,J) , a subcollection is constructed, varying the rank of Φ. Let us describe this procedure. As in the Lasso-MLE procedure, we first construct a collection of models, thanks to the ℓ1 approach. For λ ≥ 0, we obtain an estimator for θ, denoted by θ̂Lasso (λ), for each model belonging to the collection. We could deduce the set of relevant columns, denoted by Jλ , and this for all K ∈ K: we deduce J the collection of relevant variables set. The second step consists to construct a subcollection of models with rank sparsity, denoted by {Ȟ(K,J,R) }(K,J,R)∈M̃ . The model Ȟ(K,J,R) has K components, the set J for active variables, and R is the vector of the ranks of the matrix of regression coefficients in each group: n o (K,J,R) Ȟ(K,J,R) = y ∈ Rq |x ∈ Rp 7→ hθ (y|x) (1.9) where ! t (P y − (ΦRk )[J] x) k [J] (Pk y − (ΦR ) x) k k k = exp − ; q/2 2 (2π) k=1 q K RK [J] 1 [J] ; θ = (π1 , . . . , πK , (ΦR 1 ) , . . . , (ΦK )) , P1 , . . . , PK ) ∈ Θ(K,J,R) = ΠK × Ψ(K,J,R) × R+ n o  RK [J] q×p K 1 [J] Ψ(K,J,R) = ((ΦR Rank(Φk ) = Rk for all k ∈ {1, . . . , K} ; 1 ) , . . . , (ΦK ) ) ∈ R (K,J,R) hθ (y|x) K X πk det(Pk ) 1.4. ILLUSTRATIVE EXAMPLE 54 and MR = K ×J ×R. We denote by K ⊂ N∗ the possible number of components, J a collection of subsets of {1, . . . , p}, and R the set of vectors of size K ∈ K with ranks values for each mean matrix. We could compute the MLE under the rank constraint thanks to an EM algorithm. Indeed, we could constrain the estimation of Φk , for the cluster k, to have a rank equal to Rk , in keeping only the Rk largest singular values. More details are given in Section 1.7.1. It leads to an estimator of the mean with row sparsity and low rank for each model. As described in the above section, a model is selected using the slope heuristic. This step is justified theoretically in [Dev15]. 1.4 Illustrative example We illustrate our procedures on four different simulated datasets, adapted from [SBG10], belonging to the model collection. They have been both implemented in Matlab, with the help of Benjamin Auder, and the Matlab code is available. Firstly, we will present models used in these simulations. Then, we validate numerically each step, and we finally compare results of our procedures with others. Remark that we propose here some examples to illustrate our methods, but not a complete analysis. We highlight some issues which seems important. Moreover, we do not illustrate the one-component case, focusing on the clustering. If, on some dataset, we are not convinced by the clustering, we could add to the model collection models with one component, more or less sparse, using the same pattern (computing the Lasso estimator to get the relevant variables for various regularization parameters, and refit parameters with the maximum likelihood estimator, under rank constraint or not), and select a model among this collection of linear and mixture models. 1.4.1 The model Let x be a sample of size n distributed according to multivariate standard Gaussian. We consider a mixture of two components. Besides, we fix the number of active variables to 4 in each cluster. More precisely, the first  four variables of Y are explained respectively by the four first variables of X. Fix π = 12 , 21 and Pk = Iq for all k ∈ {1, 2}. The difficulty of the clustering is partially controlled by the signal-to-noise ratio. In this context, we could extend the natural idea of the SNR with the following definition, where Tr(A) denotes the trace of the matrix A. 
SNR = Tr(Var(Y )) . Tr(Var(Y |βk = 0 for all k ∈ {1, . . . , K})) Remark that it only controls the distance between the signal with or without the noise, and not the distance between the both signals. We compute four different models, varying n, the SNR, and the distance between the clusters. Details are available in the Table 1.1. 55 CHAPTER 1. TWO PROCEDURES n k p q β1|J β2|J σ SNR Model 1 2000 2 10 10 3 -2 1 3.6 Model 2 100 2 10 10 3 -2 1 3.6 Model 3 100 2 10 10 3 -2 3 1.88 Model 4 100 2 10 10 5 3 1 7.8 Model 5 50 2 30 5 3 -2 1 3.6 Table 1.1: Description of the different models Take a sample of Y according to a Gaussian mixture, meaning in βk X and with covariance matrix Σk = (Pkt Pk )−1 = σIq , for the cluster k. We run our procedures with the number of components varying in K = {2, . . . , 5}. The model 1 illustrates our procedures in low dimension models. Moreover, it is chosen in the next section to illustrate well each step of the procedure (variable selection, models construction and model selection). Model 5 is considered high-dimensional, because p × K > n. The model 2 is easier than the others, because clusters are not so close to each other according to the noise. Model 3 is constructed as the models 1 and 2, but n is small and the noise is more important. We will see that it gives difficulty for the clustering. Model 4 has a larger SNR, nevertheless, the problem of clustering is difficult, because each βk is closer to the others. Our procedures are run 20 times, and we compute statistics on our results over those 20 simulations: it is a small number, but the whole procedure is time-consuming, and results are convincing enough. For the initialization, we repeat 50 times the initialization, and keep the one which maximizes the log-likelihood function after 10 iterations. Those choices are size-dependent, a numerical study not reported here concludes that it is enough in that case. 1.4.2 Sparsity and model selection To illustrate the both procedures, all the analyses made in this section are done from the model 1, since the choice of each step is clear. Firstly, we compute the grid of regularization parameters. More precisely, each regularization parameter is computed from maximum likelihood estimations (using EM algorithm), and give an associated sparsity (computed by the Lasso estimator, using again the EM algorithm). In Figure (1.1) and Figure (1.2), the collection of relevant variables selected by this grid are plotted. Firstly, we could notice that the number of relevant variables selected by the Lasso decreases with the regularization parameter. We could analyze more precisely which variables are selected, that is to say if we select true relevant or false relevant variables. If the regularization parameter is not too large, the true relevant variables are selected. Even more, if the regularization parameter is well-chosen, we select only the true relevant variables. In our example, we remark that if λ = 0.09, we have selected exactly the true relevant variables. This grid construction seems to be well-chosen according to these simulations. From this variable selection, each procedure (Lasso-MLE or Lasso-Rank) leads to a model collection, varying the sparsity thanks to the regularization parameters grid, and the number of components. Among this collection, we select a model with the slope heuristic. We want to select the best model by improving a penalized criterion. This penalty is computed 1.4. 
1.4.2 Sparsity and model selection

To illustrate both procedures, all the analyses in this section are done on model 1, for which the behaviour of each step is clear. Firstly, we compute the grid of regularization parameters: each regularization parameter is computed from maximum likelihood estimations (using the EM algorithm) and gives an associated sparsity (computed by the Lasso estimator, again using the EM algorithm). In Figures 1.1 and 1.2, the sets of relevant variables selected along this grid are plotted. We first notice that the number of relevant variables selected by the Lasso decreases with the regularization parameter. We can analyze more precisely which variables are selected, that is to say whether we select true relevant or false relevant variables. If the regularization parameter is not too large, the true relevant variables are selected; even better, if the regularization parameter is well chosen, we select only the true relevant variables. In our example, for λ = 0.09 we select exactly the true relevant variables. According to these simulations, the grid construction seems well chosen.

Figure 1.1: For one simulation, number of false relevant (red) and true relevant (blue) variables selected by the Lasso, varying the regularization parameter λ in the grid G_2.
Figure 1.2: For one simulation, zoom on the number of false relevant (red) and true relevant (blue) variables selected by the Lasso, varying the regularization parameter λ around the interesting values.

From this variable selection, each procedure (Lasso-MLE or Lasso-Rank) leads to a model collection, varying the sparsity through the regularization parameter grid and the number of components. Among this collection, we select a model with the slope heuristic, that is, by minimizing a penalized criterion. The penalty is calibrated by performing a linear regression on the points {(D/n, −(1/n) Σ_{i=1}^n log(ĥ_D(y_i|x_i)))}, D varying. The estimated slope κ̂ gives access to the best model, the one with dimension D̂ minimizing −(1/n) Σ_{i=1}^n log(ĥ_D(y_i|x_i)) + 2κ̂ D/n. In practice, we have to check that the points display a linear behaviour. For each procedure, we construct the model collection and have to justify this behaviour. Figures 1.3 and 1.4 represent the log-likelihood as a function of the dimension of the models, for the collections constructed by the Lasso-Rank and the Lasso-MLE procedures respectively. The couples are plotted as points, whereas the estimated slope is indicated by a dotted line. We observe more than one line (4 for the Lasso-MLE procedure, more for the Lasso-Rank procedure). This phenomenon can be explained by a linear behaviour for each mixture, fixing the number of classes, and for each rank. Nevertheless, the slopes are almost the same and lead to the same selected model. In practice, we estimate the slope with the Capushe package [BMM12].

Figure 1.3: For one simulation, slope graph obtained by the Lasso-Rank procedure. For large dimensions, we observe a linear part.
Figure 1.4: For one simulation, slope graph obtained by the Lasso-MLE procedure. For large dimensions, we observe a linear part.
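In practice the calibration just described is performed with the Capushe package in Matlab; the following Python sketch only illustrates the principle (fit the slope of the empirical risk against D/n on the largest models, then penalize with twice the estimated slope). The function name, the fraction of models declared "large" and the toy values are illustrative assumptions, not the thesis code.

import numpy as np

def slope_heuristic(dims, neg_loglik, n, frac_large=0.5):
    """Select a model by the slope heuristic.

    dims[i] is the dimension D of model i, and neg_loglik[i] the empirical
    risk -(1/n) sum log h_D(y_i|x_i).  The slope kappa is estimated by a
    least-squares fit on the largest models, and the selected model
    minimizes neg_loglik + 2 * kappa * D / n."""
    dims = np.asarray(dims, dtype=float)
    risk = np.asarray(neg_loglik, dtype=float)
    large = dims >= np.quantile(dims, 1.0 - frac_large)   # keep the largest dimensions
    slope, _ = np.polyfit(dims[large] / n, risk[large], deg=1)
    kappa = -slope                                         # the risk decreases linearly, so slope < 0
    crit = risk + 2.0 * kappa * dims / n
    return int(np.argmin(crit)), kappa

# toy usage on a fake model collection (illustration only)
best, kappa = slope_heuristic(dims=[5, 10, 20, 40, 80],
                              neg_loglik=[2.1, 1.8, 1.6, 1.55, 1.52], n=200)
print(best, kappa)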
1.4.3 Assessment

We compare our procedures to three other procedures on the simulated models 1, 2, 3 and 4. Firstly, let us make some remarks about model 1: for each procedure we get a good clustering and a very low Kullback-Leibler divergence, because the sample size is large and the estimations are good. That is why we focus in this section on models 2, 3 and 4. To compare our procedures with others, we compute the Kullback-Leibler divergence with the true density and the ARI (the Adjusted Rand Index, which measures the similarity between two data clusterings; the closer the ARI is to 1, the more similar the two partitions), and we record which variables and how many clusters are selected. For more details on the ARI, see [Ran71]. From the Lasso-MLE model collection, we construct two models to compare our procedures with: the oracle (the model which minimizes the Kullback-Leibler divergence with the true density) and the model selected by the BIC criterion instead of the slope heuristic. Thanks to the oracle, we know how well one can do within this model collection in terms of Kullback-Leibler divergence, and how this model, as good as possible for the log-likelihood, performs for the clustering. The third procedure we compare with is the maximum likelihood estimator, assuming that the number of clusters is known and fixed to 2; we use it to show that variable selection is needed. In each case, we apply the MAP principle to compare clusterings. We do not plot the Kullback-Leibler divergence for the MLE procedure, because the values are too high and make the boxplots unreadable.

For model 2, according to Figure 1.5 for the Kullback-Leibler divergence and Figure 1.6 for the ARI, the Kullback-Leibler divergence is small and the ARI is close to 1, except for the MLE procedure. The boxplots are still readable with those values, but it is important to highlight that variable selection is needed, even in a model of reasonable dimension. The model collections are thus well constructed. Model 3 is more difficult, because the noise is higher; that is why the results, summarized in Figures 1.7 and 1.8, are not as good as for model 2. Nevertheless, our procedures lead to the best ARI, and their Kullback-Leibler divergences are close to that of the oracle. The same remarks hold for model 4: in this setting the means are closer, relative to the noise. Results are summarized in Figures 1.9 and 1.10. Model 5 is high-dimensional. The models selected by the BIC criterion are poor in comparison with the models selected by our procedures or with the oracle: poor for estimation, according to the Kullback-Leibler divergence boxplots in Figure 1.11, but also for clustering, according to Figure 1.12. Our models are not as well constructed as previously, which is explained by the high-dimensional context and reflected in the high Kullback-Leibler divergence; nevertheless, the clustering performances are really good. Note that the Kullback-Leibler divergence is smaller for the Lasso-MLE procedure, thanks to the maximum likelihood refitting. Moreover, the true model does not have any matrix structure. Looking at the MLE, where the sparsity hypothesis is not used, we conclude that the estimations are not satisfactory, which can be explained by the high-dimensional issue. The Lasso-MLE procedure, the Lasso-Rank procedure and the BIC model work almost as well as the oracle, which means that the models are well selected.

Figure 1.5: Boxplots of the Kullback-Leibler divergence between the true model and the one selected by the procedure over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle) and the BIC estimator (BIC), for model 2.
Figure 1.6: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE), for model 2.
Figure 1.7: Boxplots of the Kullback-Leibler divergence between the true model and the one selected by the procedure over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle) and the BIC estimator (BIC), for model 3.
Figure 1.8: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE), for model 3.
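For completeness, here is a minimal sketch of how the two performance measures reported in Figures 1.5–1.12 can be computed: the ARI via scikit-learn, and the Kullback-Leibler divergence KL_n between the true and estimated conditional densities approximated by Monte Carlo over the fixed design. The function names and the representation of a mixture as (weights, means, covariances) are illustrative assumptions, not the thesis code.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.metrics import adjusted_rand_score

def mixture_logpdf(y, x, pi, betas, covs):
    """log of sum_k pi_k N(y; beta_k x, Sigma_k) for one observation (y, x)."""
    logs = [np.log(pi[k]) + multivariate_normal.logpdf(y, mean=betas[k] @ x, cov=covs[k])
            for k in range(len(pi))]
    return np.logaddexp.reduce(logs)

def kl_n_monte_carlo(x, true_model, est_model, n_mc=200, seed=0):
    """Monte Carlo estimate of KL_n(s*, s_hat) on the fixed design x."""
    rng = np.random.default_rng(seed)
    pi, betas, covs = true_model
    total = 0.0
    for xi in x:
        k = rng.choice(len(pi), size=n_mc, p=pi)
        y = np.array([rng.multivariate_normal(betas[kk] @ xi, covs[kk]) for kk in k])
        log_true = np.array([mixture_logpdf(yi, xi, *true_model) for yi in y])
        log_est = np.array([mixture_logpdf(yi, xi, *est_model) for yi in y])
        total += np.mean(log_true - log_est)
    return total / len(x)

# clustering agreement between true labels and MAP labels, up to label permutation
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))   # equals 1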
Figure 1.9: Boxplots of the Kullback-Leibler divergence between the true model and the one selected by the procedure over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle) and the BIC estimator (BIC), for model 4.
Figure 1.10: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE), for model 4.
Figure 1.11: Boxplots of the Kullback-Leibler divergence between the true model and the one selected by the procedure over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle) and the BIC estimator (BIC), for model 5.
Figure 1.12: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE), for model 5.

                   Model 2               Model 3                Model 4
  Procedure        TR      FR            TR        FR           TR      FR
  Lasso-MLE        8(0)    2.2(6.9)      8(0)      4.3(28.8)    8(0)    2(13.5)
  Lasso-Rank       8(0)    24(0)         8(0)      24(0)        8(0)    24(0)
  Oracle           8(0)    1.5(3.3)      7.8(0.2)  2.2(11.7)    8(0)    0.8(2.6)
  BIC estimator    8(0)    2.6(15.8)     7.8(0.2)  5.7(64.8)    8(0)    2.6(11.8)

Table 1.2: Mean number {TR, FR} of true relevant and false relevant variables over the 20 simulations for each procedure, for models 2, 3 and 4 (standard deviations in parentheses)

In Table 1.2, we summarize the results about variable selection. For each model and each procedure, we compute how many true relevant and false relevant variables are selected. The true model has 8 relevant variables, which are always recovered. The Lasso-MLE selects fewer false relevant variables than the others, which means that the true structure is found. The Lasso-Rank selects 24 false relevant variables, because of the matrix structure: the true rank in each component is 4, so the estimator restricted to the relevant variables is a 4 × 4 matrix, and we get 12 false relevant variables in each component. Nevertheless, it does not select more variables than that, so the constructed model is the best possible given this structure. The BIC estimator and the oracle show a large variability in the number of false relevant variables. Regarding the number of components, all the procedures select the true number, 2. Thanks to the maximum likelihood refitting, the first procedure gives good estimations (better than the second one). Nevertheless, depending on the data, the second procedure can be more attractive: if there is a matrix structure, for example if most of the variation of the response Y is captured by a small number of linear combinations of the predictors, the second procedure will work better. We conclude that the model collection is well constructed and that the clustering is well done.

1.5 Functional datasets

One of the main interests of our methods is that they can be applied to functional datasets. Indeed, in many fields of application, the considered data are functions. Functional data analysis was first popularized by Ramsay and Silverman in their book [RS05], which describes the main tools to analyze functional datasets. Another reference is the book by Ferraty and Vieu [FV06].
However, the main part of the existing literature about functional regression is concentrated on the case Y scalar and X functional. For example, we can cite Zhao et al., in [ZOR12] for using wavelet basis in linear model, Yao et al. ([YFL11]) for functional mixture regression, or Ciarleglio et al. ([CO14]) for using wavelet basis in functional mixture regression. In this section, we concentrate on Y and X both functional. In this regression context, we could be interested in clustering: it leads to identify the individuals involved in the same reliance between Y and X. Denote that, with functional datasets, we have to denoise and smooth signals to remove the noise and capture only the important patterns in the data. Here, we explain how our procedures can be applied in this context. Note that we could apply our procedures with scalar response and functional regressor, or, on the contrary, with functional response for scalar regressor. We explain how our procedures are generalized in the more difficult case, the other cases resulting of that. Remark that we focus here on the wavelet basis, to take advantage of the time-scale decomposition, but the same analysis is available on Fourier basis or splines. 61 1.5.1 CHAPTER 1. TWO PROCEDURES Functional regression model Suppose we observe a centered sample of functions (fi , gi )1≤i≤n , associated with the random variables (F, G), coming from a probability distribution with unknown conditional density s∗ . We want to estimate this model by a functional mixture model: if the variables (F, G) come from the component k, there exists a function βk such that Z F (u)βk (u, t)du + ǫ(t), (1.10) G(t) = Ix where ǫ is a residual function. This linear model is introduced in Ramsay and Silverman’s book [RS05]. They propose to project onto basis and the response, and the regressors. We extend their model in mixture model, to consider several subgroups for a sample. If we assume that, for all t, for all i ∈ {1, . . . , n}, ǫi (t) ∼ N (0, Σk ), the model (1.10) is an integrated version of the model (1.1). Depending on the cluster k, the linear reliance of G with respect to F is described by the function βk . 1.5.2 Two procedures to deal with functional datasets Projection onto a wavelet basis To deal with functional data, we project them onto some basis, to obtain data as described in the Gaussian mixture regression models (1.1). In this chapter, we choose to deal with wavelet basis, given that they represent localized features of functions in a sparse way. If the coefficients matrix x and y are sparse, the regression matrix β has more chance to be sparse. Moreover, we could represent a signal with a few coefficients dataset, then it is a way to reduce the dimension. For details about the wavelet theory, see the Mallat’s book [Mal99]. Begin by an overview of some important aspects of wavelet basis. Let ψ a real wavelet function, satisfying Z 1 2 1 ψ ∈ L ∩ L , tψ ∈ L , and ψ(t)dt = 0. R We denote by ψl,h the function defined from ψ by dyadic dilation and translation: ψl,h (t) = 2l/2 ψ(2l t − h) for (l, h) ∈ Z2 . We could define wavelet coefficients of a signal f by Z dl,h (f ) = f (t)ψl,h (t)dt for (l, h) ∈ Z2 . R Let ϕ be a scaling function related to ψ, and Rϕl,h the dilatation and translation of ϕ for (l, h) ∈ Z2 . We also define, for (l, h) ∈ Z2 , βl,h (f ) = R f (t)ϕl,h (t)dt. 
Note that scaling functions serve to construct approximations of the function of interest, while the wavelet functions serve to provide the details not captured by successive approximations. We denote by Vl the space generated by {ϕl,h }h∈Z , and by Wl the space egenrated by {ψl,h }h∈Z for all l ∈ Z. Remark that Vl−1 = Vl ⊕ Wl for all l ∈ Z L2 = ⊕l∈Z Wl Let L ∈ N∗ . For a signal f , we could define the approximation at the level L by XX AL = dl,h ψl,h ; l>L h∈Z 62 1.5. FUNCTIONAL DATASETS and f could be decomposed by the approximation at the level L and the details (dl,h )l<L . The decomposition of the basis between scaling function and wavelet function emphasizes on the local nature of the wavelets, and that is an important aspect in our procedures, because we want to know which details allow us to cluster two observations together. Consider the sample (fi , gi )1≤i≤n , and introduce the wavelet expansion of fi in the basis B: for all t ∈ [0, 1], X XX fi (t) = βL,h (fi )ϕL,h (t) + dl,h (fi )ψl,h (t). h∈Z | {z AL } l≤L h∈Z The collection {βL,h (fi ), dl,h (fi )}l≤L,h∈Z is the Discrete Wavelet Transform (DWT) of f in the basis B. Because we project onto an orthonormal basis, this leads to a n-sample (x1 , . . . , xn ) of wavelet coefficient decomposition vectors, with fi = W xi ; in which xi is the vector of the discretized values of the signal, xi the matrix of coefficients in the basis B, and W a p × p matrix defined by ϕ and ψ. The DWT can be performed by a compu′ tationally fast pyramid algorithm (see Mallat, [Mal89]). In the same way, there exists W such ′ that gi = W yi , with y = (y1 , . . . , yn ) a n sample of wavelet coefficient decomposition vectors. ′ Because the matrices W and W are orthogonal, we keep the mixture structure, and the noise is also Gaussian. We could consider the wavelet coefficient dataset (x, y) = ((x1 , y1 ), . . . , (xn , yn )), which defines of n observations whose probability distribution could be modeled by the finite Gaussian mixture regression model (1.1). Our procedures We could apply our both procedures to this dataset, and obtain a clustering of the data. Indeed, rather than considering (f , g), we run our procedures on the sample (x, y), varying the number of clusters in K. The notion of relevant variable is natural: the function ϕl,h or ψl,h is irrelevant if it appears in none of the wavelet coefficient decomposition of the functions in each cluster. 1.5.3 Numerical experiments We will illustrate our procedures on functional datasets by using the Matlab wavelet toolbox (see Misiti et al. in [MMOP04] for details). Firstly, we simulate functional datasets, where the true model belongs to the model collection. Then, we run our procedure on an electricity dataset, to cluster successive days. We have access to time series, measured every half-hour, of a load consumption, on 70 days. We extract the signal of each day, and construct couples by each day and its eve, and we aim at clustering these couples. To finish, we test our procedures on the well-known Tecator dataset. This benchmark dataset corresponds to the spectrometric curves and fat contents of meat. These experiments illustrate different aspects of our procedures. Indeed, the simulated example proves that our procedures work in a functional context. The second example is a toy example used to validate the classification, on real data already studied, and in which we clearly understand the clusters. 
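Before turning to these experiments, the projection step described above can be made concrete with a short Python sketch using the PyWavelets package (the experiments of this chapter use the Matlab wavelet toolbox): each discretized curve is replaced by the vector of its wavelet coefficients, and the same orthonormal transform is applied to the regressors and to the responses. The wavelet family, the level and the grid length below are illustrative choices, not those of a specific dataset.

import numpy as np
import pywt

def dwt_matrix(curves, wavelet="db2", level=2):
    """Row-wise discrete wavelet transform: each discretized curve is replaced
    by the concatenation of its approximation and detail coefficients."""
    rows = []
    for f in curves:
        coeffs = pywt.wavedec(f, wavelet=wavelet, level=level, mode="periodization")
        rows.append(np.concatenate(coeffs))
    return np.vstack(rows)

# toy curves discretized on a dyadic grid of 64 points
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 64)
f = np.cos(2 * np.pi * t) + 0.1 * rng.standard_normal((70, 64))   # 70 noisy cosine curves
x = dwt_matrix(f)
print(x.shape)                                                     # one coefficient vector per curve

# perfect reconstruction from the coefficients (the transform is orthonormal)
coeffs = pywt.wavedec(f[0], "db2", level=2, mode="periodization")
print(np.allclose(pywt.waverec(coeffs, "db2", mode="periodization"), f[0]))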
The last example illustrates the use of the classification to perform prediction, and the description given by our procedures to the model constructed. 63 CHAPTER 1. TWO PROCEDURES Simulated functional data Firstly, we simulate a mixture regression model. Let f be a sample of the noised cosine function, discretized on a 15 points grid. Let g be, depending on the cluster, either f , or the function −f , computed by a white-noise. We use the Daubechies-2 basis at level 2 to decompose the signal. Our procedures are run 20 times, and the number of clusters are fixed to K = 2. Then our procedures run on the projection are compared with the oracle among the collection constructed by the Lasso-MLE procedure, and with the model selected by the BIC criterion among this collection. The MLE is also computed. Figure 1.13: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the Lasso-Rank procedure (LR), the oracle (Oracle), the BIC estimator (Bic) and the MLE (MLE) This simulated dataset proves that our procedures also perform clustering functional data, considering the projection dataset. Electricity dataset We also study the clustering on electricity dataset. This example is studied in [MMOP07]. We work on a sample of size 70 of couples of days, which is plotted in Figure 5.1. For each couple, we have access to the half-hour load consumption. Figure 1.14: Plot of the 70-sample of halfhour load consumption, on the two days Figure 1.15: Plot of a week of load consumption As we said previously, we want to cluster the relationship between two successive days. In Figure 1.15, we plot a week of consumption. 1.5. FUNCTIONAL DATASETS 64 The regression model taken is F for the first day, and G for the second day of each couple. Besides, discretization of each day on 48 points, every half-hour, is made available. In our opinion, a linear model is not appropriate, as the behavior from the eve to the day depends on which day we consider: there is a difference between working days and weekend days, as involved in Figure 1.15. To apply our procedures, we project f and g onto wavelet basis. The symmlet-4 basis, at level 5, is used. We run our procedures with the number of clusters varying from 2 to 6. Our both procedures select a model with 4 components. The first one considers couples of weekdays, the second Friday-Saturday, the third component is Saturday-Sunday and the fourth considers SundayMonday. This result is faithful with the knowledge we have about these data. Indeed, working days have the same behavior, depending on the eve, whereas days off have not the same behavior, depending on working days, and conversely. Moreover, in the article [MMOP07], which also studied this example, they get the same classification. Tecator dataset This example deals with spectrometric data. More precisely, a food sample has been considered, which contained finely chopped pure meat with different fat contents. The data consist of a 100channel spectrum of absorbances in the wavelength range 850 − 1050 nm, and of the percentage of fat. We observe a sample of size 215. Those data have been studied in a lot of articles, cite for example Ferraty and Vieu’s book [FV06]. They work on different approaches. They test prediction, and classification, supervised (where the fat content become a class, larger or smaller than 20%), or not (ignoring the response variable). In this work, we focus on clustering data according to the reliance between the fat content and the absorbance spectrum. 
We could not predict the response variable, because we do not know the class of a new observation. Estimate it is a difficult problem, in which we are not involved in this chapter. We will take advantage of our procedures to know which coefficients, in the wavelet basis decomposition of the spectrum, are useful to describe the fat content. The sample will be split into two subsamples, 165 observations for the learning set, and 50 observations for the test set. We split it to have the same marginal distribution for the response in each sample. The spectrum is a function, which we decompose into the Haar basis, at level 6. Nevertheless, our model did not take into account a constant coefficient to describe the response. Thereby, before run our procedure, we center and the y according to the learning sample, and each function xi for all observations in the whole sample. Then, we could estimate the mean of the response by the mean µ̂ over the learning sample. We construct models on the training set by our procedure Lasso-MLE. Thanks to the estimations, we have access to relevant variables, and we could reconstruct signals keeping only relevant variables. We have also access to the a posteriori probability, which leads to know which observation is with high probability in which cluster. However, for some observations, the a posteriori probability do not ensure the clustering, being almost the same for different clusters. The procedure selects two models, which we describe here. In Figures 1.16 and 1.17, we represent clusters done on the training set for the different models. The graph on the left is a candidate for representing each cluster, constructed by the mean of spectrum over an a posteriori probability greater than 0.6. We plot the curve reconstruction, keeping only relevant variables in the wavelet decomposition. On the right side, we present the boxplot of the fat values in each class, for observations with an a posteriori probability greater than 0.6. The first model has two classes, which could be distinguish in the absorbance spectrum by the 65 CHAPTER 1. TWO PROCEDURES bump on wavelength around 940 nm. The first cluster is dominating, with π̂1 = 0.95. The fat content is smaller in the first cluster than in the second cluster. According to the signal reconstruction, we could see that almost all variables have been selected. This model seems consistent according to the classification goal. The second model has 3 classes, and we could remark different important wavelength. Around 940 nm, there is some difference between classes, corresponding to the bump underline in the model 1, but also around 970 nm, with higher or smaller values. The first class is dominating, with π̂1 = 0.89. Just a few of variables have been selected, which give to this model the understanding property of which coefficients are discriminating. We could discuss about those models. The first one selects only two classes, but almost all variables, whereas the second model has more classes, and less variables: there is a trade-off between clusters and variable selection for the dimension reduction. Figure 1.16: Summarized results for the model 1. The graph on the left is a candidate for representing each cluster, constructed by the mean of reconstructed spectrum over an a posteriori probability greater than 0.6 On the right side, we present the boxplot of the fat values in each class, for observations with an a posteriori probability greater than 0.6. 66 1.5. 
FUNCTIONAL DATASETS Figure 1.17: Summarized results for the model 2. The graph on the left is a candidate for representing each cluster, constructed by the mean of reconstructed spectrum over an a posteriori probability greater than 0.6 On the right side, we present the boxplot of the fat values in each class, for observations with an a posteriori probability greater than 0.6. According to those classifications, we could compute the response according to the linear model. We use two ways to compute ŷ: either consider the linear model in the cluster selected by the MAP principle, or mix estimations in each cluster thanks to these P a posteriori probabilities. We compute the Mean Absolute Percentage Error, MAPE = n1 ni=1 |(ŷi − yi )/yi |. Results are summarized in Table 1.3. Model 1 Model 2 Linear model in the class with higher probability 0.200 0.055 Mixing estimation 0.198 0.056 Table 1.3: Mean absolute percentage of error of the predicted value, for each model, for the learning sample Thus, we work on the test sample. We use the response and the regressors to know the a posteriori of each observation. Then, using our models, we could compute the predicted fat values from the spectrometric curve, as before according to two ways, mixing or choosing the classes. Model 1 Model 2 Linear model in the class with higher probability 0.22196 0.20492 Mixing estimation 0.21926 0.20662 Table 1.4: Mean absolute percentage of error of the predicted value, for each model, for the test sample Because the models are constructed on the learning sample, MAPE are lower than for the test sample. Nevertheless, results are similar, saying that models are well constructed. This is particularly the case for the model 1, which is more consistent over a new sample. 67 CHAPTER 1. TWO PROCEDURES To conclude this study, we could highlight the advantages of our procedure on these data. It provides a clustering of data, similar to the one done with supervised clustering in [FV06], but we could explain how this clustering is done. This work has been done with the Lasso-MLE procedure. Nevertheless, the same kind of results have been get with the Lasso-Rank procedure. 1.6 Conclusion In this chapter, two procedures are proposed to cluster regression data. Detecting the relevant clustering variables, they are especially designed for high-dimensional datasets. We use an ℓ1 -regularization procedure to select variables, and then deduce a reasonable random model collection. Thus, we recast estimations of parameters of these models into a general model selection problem. These procedures are compared with usual criteria on simulated data: the BIC criterion used to select a model, the maximum-likelihood estimator, and to the oracle when we know it. In addition, we compare our procedures to others on benchmark data. One main asset of those procedures is that it can be applied to functional datasets. We also develop this point of view. 1.7 Appendices In those appendices, we develop calculus for EM algorithm updating formulae in Section 1.7.1, for Lasso and maximum likelihood estimators, and for low ranks estimators. In Section 1.7.2, we extend our procedures with the Group-Lasso estimator to select relevant variables, rather than use the Lasso estimator. 1.7.1 EM algorithms EM algorithm for the Lasso estimator Introduced by Dempster et al. in [DLR77], the EM (Expectation-Maximization) algorithm is used to compute maximum likelihood estimators, penalized or not. 
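Before detailing the updating formulae, a minimal numerical sketch of the E-step, the computation of the responsibilities τ_{i,k} for a Gaussian mixture of regressions, may help fix ideas. It is not the Matlab implementation of this chapter, and the parameter representation (lists of proportions, regression matrices and covariance matrices) is an illustrative assumption.

import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, y, pi, betas, covs):
    """E-step for a finite Gaussian mixture of regressions: returns the (n, K)
    matrix of responsibilities tau[i, k] = P(Z_i = k | x_i, y_i) under the
    current parameters (pi, betas, covs)."""
    n, K = x.shape[0], len(pi)
    log_tau = np.empty((n, K))
    for k in range(K):
        mean_k = x @ betas[k].T                                   # beta_k x_i for every observation i
        log_tau[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(
            y - mean_k, mean=np.zeros(y.shape[1]), cov=covs[k])
    log_tau -= log_tau.max(axis=1, keepdims=True)                 # numerical stabilization
    tau = np.exp(log_tau)
    return tau / tau.sum(axis=1, keepdims=True)

# MAP clustering from the responsibilities: labels = np.argmax(tau, axis=1)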
The expected complete negative log-likelihood is denoted by 1 Q(θ|θ′ ) = − Eθ′ (lc (θ, X, Y, Z)|X, Y) n in which lc (θ, X, Y, Z) = n X K X [Zi ]k log i=1 k=1    1 det(Pk ) t exp − (Pk Yi − Xi Φk ) (Pk Yi − Xi Φk ) 2 (2π)q/2 + [Zi ]k log(πk ); with [Zi ]k are independent and identically distributed unobserved multinomial variables, showing the component-membership of the ith observation in the finite mixture regression model. The expected complete penalized negative log-likelihood is ′ ′ Qpen (θ|θ ) = Q(θ|θ ) + λ K X k=1 πk ||Φk ||1 . 68 1.7. APPENDICES Calculus for updating formula — E-step: compute Q(θ|θ(ite) ), or, equivalently, compute for k ∈ {1, . . . , K}, i ∈ {1, . . . , n}, (ite) τi,k = Eθ(ite) ([Zi ]k |Y) (ite) (ite) πk det Pk exp = PK  − 12 (ite) (ite) det Pk exp r=1 πk   (ite) Pk Y i  − 12 −   (ite) t (ite) X i Φk Pk Y i (ite) Pk Y i − − (ite) X i Φk   (ite) t (ite) X i Φk Pk Yi −  (ite) X i Φk  This formula updates the clustering, thanks to the MAP principle. — M-step: improve Qpen (θ|θ(ite) ). For this, rewrite the Karush-Kuhn-Tucker conditions. We have Qpen (θ|θ(ite) )      n K 1 XX 1 det(Pk ) t =− exp − (Pk Yi − Xi Φk ) (Pk Yi − Xi Φk ) |Y Eθ(ite) [Zi ]k log n 2 (2π)q/2 − 1 n i=1 k=1 n X K X i=1 k=1 Eθ(ite) ([Zi ]k log πk |Y) + λ K X k=1 πk ||Φk ||1 n K 1 XX 1 =− − (Pk Yi − Xi Φk )t (Pk Yi − Xi Φk )Eθ(ite) ([Zi ]k |Y) n 2 i=1 k=1   q n K [Pk ]z,z 1 XXX Eθ(ite) [[Zi ]k |Y] − log √ n 2π i=1 k=1 z=1 n − K K X 1 XX Eθ(ite) ([Zi ]k |Y) log πk + λ πk ||Φk ||1 . n i=1 k=1 (1.11) k=1 Firstly, we optimize this formula with respect to π: it is equivalent to optimize n − K K X 1 XX τi,k log(πk ) + λ πk ||Φk ||1 . n i=1 k=1 k=1 We obtain (ite+1) πk = (ite) πk +t (ite)  Pn i=1 τi,k n − (ite) πk  ; with t(ite) ∈ (0, 1], the largest value in the grid {δ l , l ∈ N}, with 0 < δ < 1, such that the function is not increasing. To optimize (1.11) with respect to (Φ, P), we could rewrite the expression: it is similar to the optimization of ! q n X 1X 1 t − τi,k log([Pk ]z,z ) − (Pk Ỹi − X̃i Φk ) (Pk Ỹi − X̃i Φk ) + λπk ||Φk ||1 n 2 i=1 z=1 for all k ∈ {1, . . . , K}, which is equivalent to the optimization of q q 2 1 X 1 XX − nk log([Pk ]z,z ) + [Pk ]z,z [Ỹi ]k,z − [Φk ]z,. [X̃i ]k,. + λπk ||Φk ||1 ; n 2n z=1 n i=1 z=1 69 CHAPTER 1. TWO PROCEDURES P where nk = ni=1 τi,k . The minimum in [Pk ]z,z is the function which cancel its partial derivative with respect to [Pk ]z,z :   nk 1 1 X − + 2[Ỹi ]k,z [Pk ]z,z [Ỹi ]k,z − [Φk ]z,. [X̃i ]k,. = 0 n [Pk ]z,z 2n n i=1 for all k ∈ {1, . . . , K}, for all z ∈ {1, . . . , q}, which is equivalent to −1 + n n i=1 i=1 X X 1 1 [Pk ]2z,z [Pk ]z,z [Ỹi ]k,z [Φk ]z,. [X̃i ]k,. = 0 [Ỹi ]2k,z − nk nk 1 1 ⇔ −1 + [Pk ]2z,z ||[Ỹ]k,z ||22 − [Pk ]z,z h[Ỹ]k,z , [Φk ]z,. [X̃]k,. i = 0. nk nk The discriminant is ∆=  1 − h[Ỹ]k,z , [Φk ]z,. [X̃]k,. i nk 2 − 4 ||[Ỹ]k,z ||22 . nk Then, for all k ∈ {1, . . . , K}, for all z ∈ {1, . . . , q}, [Pk ]z,z = nk h[Ỹ]k,z , [Φk ]z,. [X̃]k,. i + √ 2nk ||[Ỹ]k,z ||22 ∆ . We could also look at the equation (1.11) as a function of the variable Φ: according to the partial derivative with respect to [Φk ]z,j , we obtain for all z ∈ {1, . . . , q}, for all k ∈ {1, . . . , K}, for all j ∈ {1, . . . , p},   p n X X [X̃i ]k,j [Pk ]z,z [Ỹi ]k,z − [X̃i ]k,j2 [Φk ]z,j2  − nλπk sgn([Φk ]z,j ) = 0. i=1 j2 =1 Then, for all k ∈ {1, . . . , K}, j ∈ {1, . . . , p}, z ∈ {1, . . . 
, q}, Pn i=1 [X̃i ]k,j [Pk ]z,z [Ỹi ]k,z [Φk ]z,j = − Pp j2 =1 j2 6=j [X̃i ]k,j [X̃i ]k,j2 [Φk ]z,j2 − nλπk sgn([Φk ]j,z ) ||[X̃]k,j ||22 To reduce notations, let, for all k ∈ {1, . . . , K}, j ∈ {1, . . . , p}, z ∈ {1, . . . , q}, [Sk ]j,z = − n X [X̃i ]k,j [Pk ]z,z [Ỹi ]k,z + i=1 p X [X̃i ]k,j [X̃i ]k,j2 [Φk ]z,j2 . j2 =1 j2 6=j Then [Φk ]z,j = = −[Sk ]j,z − nλπk sgn([Φk ]z,j )        ||[X̃]k,j ||22 −[Sk ]j,z +nλπk ||[X̃]k,j ||22 [Sk ]j,z +nλπk − ||[X̃] ||2 k,j 2 0 if [Sk ]j,z > nλπk if [Sk ]j,z < −nλπk elsewhere. . 70 1.7. APPENDICES From these equalities, we could write the updating formulae. For j ∈ {1, . . . , p}, k ∈ {1, . . . , K}, z ∈ {1, . . . , q}, (ite) [Sk ]j,z nk = =− n X n X [X̃i ]k,j [Pk ](ite) z,z [Ỹi ]k,z p X + i=1 (ite) [X̃i ]k,j [X̃i ]k,j2 [Φk ]z,j2 ; j2 =1 j2 6=j τi,k ; i=1 ([Ỹi ]k,. , [X̃i ]k,. ) = √ τi,k (Yi , Xi ). and t(ite) ∈ (0, 1], the largest value in the grid {δ l , l ∈ N}, 0 < δ < 1, such that the function is not increasing. EM algorithm for the rank procedure To take into account the matrix structure, we want to make a dimension reduction on the rank of the mean matrix. If we know to which cluster each sample belongs, we could compute the low rank estimator for linear model in each component. Indeed, an estimator of fixed rank r is known in the linear regression case: denoting A+ the Moore-Penrose pseudo-inverse of A, and [A]r = U Dr V t in which Dr is obtained from D by setting (Dr )i,i = 0 for i ≥ r + 1, with U DV t the singular decomposition of A, if Y = βX + Σ, an estimator of β with rank r is β̂r = [(xt x)+ xt y]r . We do not know the clustering of the sample, but the E-step in the EM algorithm computes it. We suppose in this case that Σk and πk are known, for all k ∈ {1, . . . , K}. We use this algorithm to determine Φk , for all k ∈ {1, . . . , K}, with ranks fixed to R = (R1 , . . . , RK ). — E-step: compute for k ∈ {1, . . . , K}, i ∈ {1, . . . , n}, τi,k = Eθ(ite) ([Zi ]k |Y ) (ite) (ite) πk det Pk exp = PK  − 21 (ite) (ite) det Pk exp r=1 πk   (ite) Pk y i − 21  −   (ite) t (ite) x i Φk Pk yi (ite) Pk y i − − (ite) x i Φk   (ite) t (ite) x i Φk Pk y i −  (ite) x i Φk  — M-step: assign each observation in its estimated cluster, by the MAP principle applied (ite) thanks to the E-step. We say that Yi comes from component number argmax τi,k . k∈{1,...,K} Then, we can define β˜k (ite) = (xt|k x|k )−1 xt|k y|k , in which x|k and y|k are a restriction of (ite) the sample to the cluster k, which we decompose in singular value with β̃k = U SV t . Using the singular value decomposition described before, we obtain the estimator. 1.7.2 Group-Lasso MLE and Group-Lasso Rank procedures One way to perform those procedures is to consider the Group-Lasso estimator rather than the Lasso estimator to select relevant variables. Indeed, this estimator is more natural, according to the relevant variable definition. Nevertheless, results are very similar, because we select grouped variables in both case, selected by the Lasso or by the Group-Lasso estimator. In this section, we describe our procedures with the Group-Lasso estimator, which could be understood as an improvement of our procedures. 71 CHAPTER 1. TWO PROCEDURES Context - definitions Our both procedures take advantage of the Lasso estimator to select relevant variables, to reduce the dimension in case of high-dimensional datasets. First, recall what is a relevant variable. Definition 1.7.1. A variable indexed by (z, j) ∈ {1, . . . , q} × {1, . . . , p} is irrelevant for the clustering if [Φ1 ]z,j = . . 
. = [ΦK ]z,j = 0. A relevant variable is a variable which is not irrelevant. We denote by J the relevant variables set. According to this definition, we could introduce the Group-Lasso estimator. Definition 1.7.2. The Lasso estimator for mixture regression models with regularization parameter λ ≥ 0 is defined by   1 Lasso θ̂ (λ) := argmin − lλ (θ) ; n θ∈ΘK where where ||Φk ||1 = K Pp j=1 Pq X 1 1 − lλ (θ) = − l(θ) + λ πk ||Φk ||1 ; n n k=1 z=1 |[Φk ]z,j |, and with λ to specify. It is the estimator used in the both procedures described in previous parts. Definition 1.7.3. The Group-Lasso estimator for mixture regression models with regularization parameter λ ≥ 0 is defined by   1˜ Group-Lasso θ̂ (λ) := argmin − lλ (θ) ; n θ∈ΘK where p q XX√ 1 1 k||[Φ]z,j ||2 ; − ˜lλ (θ) = − l(θ) + λ n n j=1 z=1 where ||[Φ]z,j ||22 = PK k=1 |[Φk ]z,j | 2, and with λ to specify. This Group-Lasso estimator has the advantage to cancel grouped variables rather than variables one by one. It is consistent with the relevant variable definition. Nevertheless, depending on datasets, it could be interesting to look after which variables are canceled first. One way could be to extend this work with Lasso-Group-Lasso estimator, described for the example for the linear model in [SFHT13]. Let describe two additional procedures, which will use the Group-Lasso estimator rather than the Lasso estimator to detect relevant variables. Group-Lasso-MLE procedure This procedure is decomposed into three main steps: we construct a model collection, then in each model we compute the maximum likelihood estimator, and we select the best one among all the models. The first step consists of constructing a collection of models {H(K,J) in which H(K,J) ˜ }(K,J)∈M ˜ ˜ is defined by q p H(K,J) (1.12) ˜ = {y ∈ R |x ∈ R 7→ hθ (y|x)} ; 72 1.7. APPENDICES where hθ (y|x) = K X πk det(Pk ) k=1 and (2π)q/2 ˜ [J] ˜ [J] (Pk y − Φk x)t (Pk y − Φk x) exp − 2 θ = (π1 , . . . , πK , Φ1 , . . . , ΦK , ρ1 , . . . , ρK ) ∈ ΠK × Rq×p K ! × Rq+ , K . The model collection is indexed by M = K × J˜. Denote K ⊂ N∗ the possible number of components. We could bound K without loss of estimation. Denote also J˜ a collection of subsets of {1, . . . , q} × {1, . . . , p}, constructed by the Group-Lasso estimator. To detect the relevant variables, and construct the set J˜ ∈ J˜, we will use the Group-Lasso estimator defined by (1.7.3). In the ℓ1 -procedures, the choice of the regularization parameters is often difficult: fixing the number of components K ∈ K, we propose to construct a data-driven grid GK of regularization parameters by using the updating formulae of the mixture parameters in the EM algorithm. Then, for each λ ∈ GK , we could compute the Group-Lasso estimator defined by   p X q n  1X  X √ Group-Lasso θ̂ = argmin − log(hθ (yi |xi )) + λ K||[Φ]z,j ||2 .  θ∈ΘK  n j=1 z=1 i=1 For a fixed number of mixture components K ∈ K and a regularization parameter λ, we could use a generalized EM algorithm to approximate this estimator. Then, for each K ∈ K, and for each λ ∈ GK , we have constructed the relevant variables set J˜λ . We denote by J˜ the collection of all these sets. The second step consists of approximating the MLE ( ) n 1X ˜ (K,J) ĥ = argmin − log(t(yi |xi )) ; n t∈H(K,J) ˜ i=1 ˜ ∈ M. using the EM algorithm for each model (K, J) The third step is devoted to model selection. We use the slope heuristic described in [BM07]. 
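To illustrate the difference with the coordinatewise soft-thresholding underlying the Lasso updates of Section 1.7.1, here is a minimal sketch of the group soft-thresholding operator associated with the Group-Lasso penalty: the whole vector ([Φ_1]_{z,j}, …, [Φ_K]_{z,j}) is either shrunk or set to zero at once. This is a generic illustration of the principle, under the stated assumptions, not the exact M-step update of the thesis code.

import numpy as np

def group_soft_threshold(v, threshold):
    """Group soft-thresholding: returns (1 - threshold / ||v||_2)_+ * v,
    so the whole vector v is cancelled as soon as its Euclidean norm is
    below the threshold -- this is how the Group-Lasso penalty cancels the
    K coefficients [Phi_1]_{z,j}, ..., [Phi_K]_{z,j} jointly."""
    norm = np.linalg.norm(v)
    if norm <= threshold:
        return np.zeros_like(v)
    return (1.0 - threshold / norm) * v

def soft_threshold(v, threshold):
    """The Lasso counterpart acts coefficient by coefficient instead."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

print(group_soft_threshold(np.array([0.3, -0.2]), 0.5))   # whole group cancelled
print(soft_threshold(np.array([0.3, -0.2]), 0.25))        # only one coefficient cancelled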
Group-Lasso-Rank procedure As when the relevant variables were selected by the Lasso estimator, whereas the previous procedure does not take into account the multivariate structure, we propose a second procedure to perform this point. For each model belonging to the collection H(K,J) ˜ , a subcollection is constructed, varying the rank of Φ. Let us describe this procedure. As in the Group-Lasso-MLE procedure, we first construct a collection of models, thanks to the ℓ1 -approach. We obtain an estimator for θ, denoted by θ̂Group-Lasso , for each model belonging ˜ and this for all to the collection. We could deduce the set of relevant variables, denoted by J, ˜ K ∈ K: we deduce J the collection of set of relevant variables. The second step consists to construct a subcollection of models with rank sparsity, denoted by {Ȟ(K,J,R) ˜ }(K,J,R)∈ ˜ M̃ . ˜ The model {Ȟ(K,J,R) ˜ } has K components, the set J for active variables, and R is the vector of the ranks of the matrix of regression coefficients in each group: o n ˜ (K,J,R) (y|x) (1.13) Ȟ(K,J,R) = y ∈ Rq |x ∈ Rp 7→ hθ ˜ 73 CHAPTER 1. TWO PROCEDURES where ! ˜ ˜ t (P y − (ΦRk )[J] k [J] (Pk y − (ΦR ) x) x) k k k ; exp − = q/2 2 (2π) k=1 q K RK R 1 ; θ =(π1 , . . . , πK , ΦR 1 , . . . , ΦK , P1 , . . . , PK ) ∈ ΠK × ΨK × R+ n o  RK R1 q×p K ΨR = (Φ , . . . , Φ ) ∈ R |Rank(Φ ) = R , . . . , Rank(Φ ) = R 1 1 K K ; K 1 K ˜ (K,J,R) hθ (y|x) K X πk det(Pk ) and M̃R = K × J˜ × R. Denote K ⊂ N∗ the possible number of components, J˜ a collection of subsets of {1, . . . , q} × {1, . . . , p}, and R the set of vectors of size K ∈ K with ranks values for each mean matrix. We could compute the MLE under the rank constraint thanks to an EM algorithm. Indeed, we could constrain the estimation of Φk , for all k, to have a rank equal to Rk , in keeping only the Rk largest singular values. More details are given in Section 1.7.1. It leads to an estimator of the mean with row sparsity and low rank for each model. 1.7. APPENDICES 74 Chapter 2 An ℓ1-oracle inequality for the Lasso in finite mixture of multivariate Gaussian regression models Contents 2.1 2.2 2.3 2.4 2.5 2.6 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Notations and framework . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Finite mixture regression model . . . . . . . . . . . . . . . . . . . . . . . 2.2.2 Boundedness assumption on the mixture and component parameters . . 2.2.3 Maximum likelihood estimator and penalization . . . . . . . . . . . . . Oracle inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Proof of the oracle inequality . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Main propositions used in this proof . . . . . . . . . . . . . . . . . . . . 2.4.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.3 Proof of the Theorem 2.4.1 thanks to the Propositions 2.4.2 and 2.4.3 . 2.4.4 Proof of the Theorem 2.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . Proof of the theorem according to T or T c . . . . . . . . . . . . . . . 2.5.1 Proof of the Proposition 2.4.2 . . . . . . . . . . . . . . . . . . . . . . . . 2.5.2 Proof of the Proposition 2.4.3 . . . . . . . . . . . . . . . . . . . . . . . . Some details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.1 Proof of the Lemma 2.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.2 Lemma 2.6.5 and Lemma 4.15 . . . . . . . . . . . . . . . . . . . . . . . 75 69 70 70 71 71 72 74 74 76 76 77 78 78 81 83 83 86 76 2.1. 
INTRODUCTION This chapter focuses on the Lasso estimator for its regularization properties. We consider a finite mixture of Gaussian regressions for high-dimensional heterogeneous data, where the number of covariates and the dimension of the response may be much larger than the sample size. We estimate the unknown conditional density by an ℓ1 -penalized maximum likelihood estimator. We provide an ℓ1 -oracle inequality satisfied by this Lasso estimator according to the Kullback-Leibler loss. This result is an extension of the ℓ1 -oracle inequality established by Meynet in [Mey13] in the multivariate case. It is deduced from a model selection theorem, the Lasso being viewed as the solution of a penalized maximum likelihood model selection procedure over a collection of ℓ1 -ball models. 2.1 Introduction Finite mixture regression models are useful for modeling the relationship between response and predictors, arising from different subpopulations. Due to recent improvements, we are faced with high-dimensional data where the number of variables can be much larger than the sample size. We have to reduce the dimension to avoid identifiability problems. Considering a mixture of linear models, an assumption widely used is to say that only a few covariates explain the response. Among various methods, we focus on the ℓ1 -penalized least squares estimator of parameters to lead to sparse regression matrix. Indeed, it is a convex surrogate for the nonconvex ℓ0 -penalization, and produces sparse solutions. First introduced by Tibshirani in [Tib96] in a linear model Y = Xβ + ǫ, where X ∈ Rp , Y ∈ R, and ǫ ∼ N (0, Σ), the Lasso estimator is defined in the linear model by  β̂ Lasso (λ) = argmin ||Y − Xβ||22 + λ||β||1 , λ > 0. β∈Rp Many results have been proved to study the performance of this estimator. For example, cite [BRT09] and [EHJT04], for studying this estimator as a variable selection procedure in this linear model case. Note that those results need strong assumptions on the Gram matrix X t X, as the restrictive eigenvalue condition, that can be not fulfilled in practice. A summary of assumptions and results is given by Bühlmann and van de Geer in [vdGB09]. One can also cite van de Geer in [vdGBZ11] and discussions, who precises a chaining argument to perform rate, even in a non linear case. If we assume that (xi , yi )1≤i≤n arise from different subpopulations, we could work with finite mixture regression models. Indeed, the homogeneity assumption of the linear model is often inadequate and restrictive. This model was introduced by Städler et al., in [SBG10]. They assume that, for i ∈ {1, . . . , n}, the observation yi , conditionally to Xi = xi , comes from a conditional density sξ0 (.|xi ) which is a finite mixture of K Gaussian conditional densities with proportion vector π, where   K X πk0 (yi − βk0 xi )2 √ Yi |Xi = xi ∼ sξ0 (yi |xi ) = exp − 2(σk0 )2 2πσk0 k=1 for some parameters ξ 0 = (πk0 , βk0 , σk0 )1≤k≤K . They extend the Lasso estimator by   p n K   1X X X −1 log(sK (y |x )) + λ , λ>0 ŝLasso (λ) = argmin − π |σ [β ] | i i j k k ξ k  n  sK ξ i=1 k=1 j=1 (2.1) 77 CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS For this estimator, they provide an ℓ0 -oracle inequality satisfied by ŝLasso (λ), according to the restricted eigenvalue condition also, and margin conditions, which lead to link the KullbackLeibler loss function to the ℓ2 -norm of the parameters. 
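As an aside, the plain Lasso recalled at the beginning of this introduction can be illustrated in a few lines; the following sketch uses scikit-learn, whose objective is scaled as (1/(2n))||Y − Xβ||² + α||β||₁, so that α plays the role of λ up to a constant. The data and the value of α are purely illustrative.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta = np.zeros(20)
beta[:3] = [2.0, -1.5, 1.0]                      # only 3 relevant covariates
y = X @ beta + 0.5 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)               # alpha ~ regularization parameter
print(np.nonzero(lasso.coef_)[0])                # indices of the selected variables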
Another way to study this estimator is to look after the Lasso for its ℓ1 -regularization properties. For example, cite [MM11a], [Mey13], and [RT11]. Contrary to the ℓ0 -results, some ℓ1 -results are valid with no assumptions, neither on the Gram matrix, nor on the margin. This can be √ achieved due to the fact that they are looking for rate of convergence of order 1/ n rather than 1/n. For finite mixture Gaussian regression models, we could cite Meynet in [Mey13] who give an ℓ1 -oracle inequality for the Lasso estimator (2.1). In this chapter, we extend this result to finite mixture of multivariate Gaussian regression models. We will work with random multivariate variables (X, Y ) ∈ Rp × Rq . As in [Mey13], we shall restrict to the fixed design case, that is to say non-random regressors. We observe (xi )1≤i≤n . Without any restriction, we could assume that the regressors xi ∈ [0, 1]p for all i ∈ {1, . . . , n}. Under only bounded parameters assumption, we provide a lower bound on the Lasso regularization parameter λ which guarantees such an oracle inequality. This result is non-asymptotic: the number of observations is fixed, and the number p of covariates can grow. Remark that the number K of clusters in the mixture is supposed to be known. Our result is deduced from a finite mixture multivariate Gaussian regression model selection theorem for ℓ1 -penalized maximum likelihood conditional density estimation. We establish the general theorem following the one of Meynet in [Mey13], which combines Vapnik’s structural risk minimization method (see Vapnik in [Vap82]) and theory around model selection (see Le Pennec and Cohen in [CLP11] and Massart in [Mas07]). As in Massart and Meynet in [MM11a], our oracle inequality is deduced from this general theorem, the Lasso being viewed as the solution of a penalized maximum likelihood model selection procedure over a countable collection of ℓ1 -ball models. The chapter is organized as follows. The model and the framework are introduced in Section 2.2. In Section 2.3, we state the main result of the chapter, which is an ℓ1 -oracle inequality satisfied by the Lasso in finite mixture of multivariate Gaussian regression models. Section 2.4 is devoted to the proof of this result and of the general theorem, deriving from two easier propositions. Those propositions are proved in Section 2.5, whereas details of lemma states in Section 2.6. 2.2 2.2.1 Notations and framework Finite mixture regression model We observe n independent couples (x, y) = (xi , yi )1≤i≤n ∈ ([0, 1]p ×Rq )n , with yi ∈ Rq a random observation, realization of variable Yi , and xi ∈ [0, 1]p fixed for all i ∈ {1, . . . , n}. We assume that, conditionally to the xi s, the Yi s are independent identically distributed with conditional density sξ0 (.|xi ) which is a finite mixture of K Gaussian regressions with unknown parameters ξ 0 . In this chapter, K is fixed, then we do not precise it with unknown parameters. We will estimate the unknown conditional density by a finite mixture of K Gaussian regressions. Each subpopulation is then estimated by a multivariate linear model. Detail the conditional density. 78 2.2. NOTATIONS AND FRAMEWORK For all y ∈ Rq , for all x ∈ [0, 1]p , K X (y − βk x)t Σ−1 πk k (y − βk x) exp − sξ (y|x) = q/2 1/2 2 (2π) det(Σk ) k=1 ! K ξ = (π1 , . . . , πK , β1 , . . . , βK , Σ1 , . . . , ΣK ) ∈ Ξ = ΠK × (Rq×p )K × (S++ q ) ( ) K X ΠK = (π1 , . . . , πK ); πk > 0 for k ∈ {1, . . . 
, K} and πk = 1 (2.2)  k=1 S++ q is the set of symmetric positive definite matrices on Rq . We want to estimate ξ 0 from the observations. For all k ∈ {1, . . . , K}, βk is the matrix of regression coefficients, and Σk is the covariance matrix in the mixture component k, whereas the πk s are the mixture proportions. For x ∈ [0, 1]p , we define the parameter ξ(x) of the conditional density sξ (.|x) by K ξ(x) = (π1 , . . . , πK , β1 x, . . . , βK x, Σ1 , . . . , ΣK ) ∈ RK × (Rq )K × (S++ q ) . P For all k ∈ {1, . . . , K}, for all x ∈ [0, 1]p , for all z ∈ {1, . . . , q}, [βk x]z = pj=1 [βk ]z,j [x]j , and then βk x ∈ Rq is the mean vector of the mixture component k for the conditional density sξ (.|x). 2.2.2 Boundedness assumption on the mixture and component parameters Denote, for a matrix A, m(A) the modulus of the smallest eigenvalue of A, and M (A) the modulus of the largest eigenvalue of A. We shall restrict our study to bounded parameters vector ξ = (π, β, Σ), where π = (π1 , . . . , πK ), β = (β1 , . . . , βK ), Σ = (Σ1 , . . . , ΣK ). Specifically, we assume that there exists deterministic positive constants Aβ , aΣ , AΣ , aπ such that ξ belongs to Ξ̃, with Ξ̃ = ( ξ ∈ Ξ : for all k ∈ {1, . . . , K}, max sup |[βk x]z | ≤ Aβ , z∈{1,...,q} x∈[0,1]p aΣ ≤ m(Σ−1 k ) ≤ M (Σ−1 k )  ≤ AΣ , aπ ≤ πk . (2.3) Let S the set of conditional densities sξ , n o S = sξ , ξ ∈ Ξ̃ . 2.2.3 Maximum likelihood estimator and penalization In a maximum likelihood approach, we consider the Kullback-Leibler information as the loss function, which is defined for two densities s and t by Z   s(y)   log s(y)dy if sdy << tdy; t(y) KL(s, t) = (2.4) Rq   + ∞ otherwise. In a regression framework, we have to adapt this definition to take into account the structure of conditional densities. For the fixed covariates (x1 , . . . , xn ), we consider the average loss function   n n Z 1X 1X s(y|xi ) s(y|xi )dy. KL(s(.|xi ), t(.|xi )) = log KLn (s, t) = n n t(y|xi ) Rq i=1 i=1 79 CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS Using the maximum likelihood approach, we want to estimate sξ0 by the conditional density sξ which maximizes the likelihood conditionally to (xi )1≤i≤n . Nevertheless, because we work with high-dimensional data, we have to regularize the maximum likelihood estimator. We consider the ℓ1 -regularization, and a generalization of the estimator associated, the Lasso estimator, which we define by   q X p n K X   1X X log(sξ (yi |xi )) + λ ŝLasso (λ) := argmin − |[βk ]z,j | ;  sξ ∈S  n k=1 z=1 j=1 i=1 where λ > 0 is a regularization parameter, for ξ = (π, β, Σ). We define also, for sξ defined as in (2.2), and with parameters ξ = (π, β, Σ), [2] N1 (sξ ) 2.3 = ||β||1 = p X q K X X k=1 j=1 z=1 (2.5) |[βk ]z,j |. Oracle inequality In this section, we provide an ℓ1 -oracle inequality satisfied by the Lasso estimator in finite mixture multivariate Gaussian regression models, which is the main result of this chapter. Theorem 2.3.1. We observe n couples (x, y) = ((x1 , y1 ), . . . , (xn , yn )) ∈ ([0, 1]p × Rq )n coming from the conditional density sξ0 , where ξ 0 ∈ Ξ̃, where Ξ̃ = ( ξ ∈ Ξ : for all k ∈ {1, . . . , K}, max sup |[βk x]z | ≤ Aβ , z∈{1,...,q} x∈[0,1]p aΣ ≤ m(Σ−1 k ) ≤ M (Σ−1 k )  ≤ AΣ , aπ ≤ πk . Denote by a ∨ b = max(a, b). We define the Lasso estimator, denoted by ŝLasso (λ), for λ ≥ 0, by ŝ Lasso ! 
n 1X [2] (λ) = argmin − log(sξ (yi |xi )) + λN1 (sξ ) ; n sξ ∈S (2.6) i=1 with n o S = sξ , ξ ∈ Ξ̃ and where, for ξ = (π, β, Σ), [2] N1 (sξ ) = ||β||1 = p X q K X X k=1 j=1 z=1 |[βk ]z,j |. Then, if  1 λ ≥ κ AΣ ∨ aπ    r   p log(n) K 2 1 + 4(q + 1)AΣ Aβ + 1 + q log(n) K log(2p + 1) aΣ n with κ an absolute positive constant, the estimator (2.6) satisfies the following ℓ1 -oracle inequality. 80 2.3. ORACLE INEQUALITY E[KLn (sξ0 , ŝLasso (λ))] ≤ (1 + κ−1 ) + + κ κ ′ ′ q q inf sξ ∈S 1 K n K n   [2] KLn (sξ0 , sξ ) + λN1 (sξ ) + λ e− 2 π q/2 aπ p q/2 AΣ 2q    log(n) 1 2 1 + 4(q + 1)AΣ Aβ + AΣ ∨ aπ aΣ  2 q ×K 1 + Aβ + aΣ  where κ′ is a positive constant. This theorem provides information about the performance of the Lasso as an ℓ1 -regularization algorithm. If the regularization parameter λ is properly chosen, the Lasso estimator, which is the solution of the ℓ1 -penalized empirical risk minimization problem, behaves as well as the deterministic Lasso, which is the solution of the ℓ1 -penalized true risk minimization problem, up to an error term of order λ. Our result is non-asymptotic: the number n of observations is fixed while the number p of covariates and the size q of he response can grow with respect to n and can be much larger than n. The number K of clusters in the mixture is fixed. There is no assumption neither on the Gram matrix, nor on the margin, which are classical assumptions for oracle inequality for the Lasso estimator. Moreover, this kind of assumptions involve unknown constants, whereas here, every constants are explicit. We could compare this result with the ℓ0 -oracle inequality established in [SBG10], which need those assumptions, and is therefore difficult to interpret. Nevertheless, they get faster rate, the error term in the oracle √ inequality being of order 1/n rather than 1/ n. The main assumption we make to establish the Theorem 2.3.1 is the boundedness of the parameters, which is also assumed in [SBG10]. It is needed, to tackle the problem of the unboundedness of the parameter space (see [MP04] for example). Moreover, we let regressors to belong to [0, 1]p . Because we work with fixed covariates, they are finite. To simplify the reading, we choose to rescale x to get ||x||∞ ≤ 1. Nevertheless, if we not rescale the covariates, and the regularization parameter λ bound and the error term of the oracle inequality depend linearly of ||x||∞ . p √ The regularization parameter λ bound is of order (q 2 + q)/ n log(n)2 log(2p + 1). For q = 1, we recognize the same order, as regards to the sample size n and the number of covariates p, to the ℓ1 -oracle inequality in [Mey13]. Van de Geer, q in [vdGBZ11], gives some tools to improve the bound of the regularization parameter to log(p) n . Nevertheless, we have to control eigenvalues of the Gram matrix of some functions (ψj (xi )) 1≤j≤D , D being the number of parameters to estimate, where ψj (xi ) satisfies 1≤i≤n | log(sξ (yi |xi )) − log(sξ̃ (yi |xi ))| ≤ D X j=1 |ξj − ξ˜j |ψj (xi ). In our case of mixture of regression models, control eigenvalues of the Gram matrix of functions (ψj (xi )) 1≤j≤D corresponds to make some assumptions, as REC, to avoid dimension reliance on 1≤i≤n n, K and p. Without this kind of assumptions, we could not guarantee that our bound is of order q log(p) n , because we could not guarantee that eigenvalues does not depend on dimensions. In 81 CHAPTER 2. 
AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS order to get a result with smaller assumptions, we do not use the chaining argument developed in [vdGBZ11]. Nevertheless, one can easily compute that, under restricted eigenvalue condition, q log(p) n we could perform the order of the regularization parameter to λ ≍ 2.4 2.4.1 log(n). Proof of the oracle inequality Main propositions used in this proof The first result we will prove is the next theorem, which is an ℓ1 -ball mixture multivariate regression model selection theorem for ℓ1 -penalized maximum likelihood conditional density estimation in the Gaussian framework. Theorem 2.4.1. We observe (xi , yi )1≤i≤n with unknown conditional Gaussian mixture density sξ 0 . [2] For all m ∈ N∗ , we consider the ℓ1 -ball Sm = {sξ ∈ S, N1 (sξ ) ≤ m} for S = {sξ , ξ ∈ Ξ̃}, and Ξ̃ defined by ( Ξ̃ = ξ ∈ Ξ : for all k ∈ {1, . . . , K}, max sup |[βk x]z | ≤ Aβ , z∈{1,...,q} x∈[0,1]p aΣ ≤ m(||Σk−1 ||) ≤ M (Σ−1 k )  ≤ AΣ , aπ ≤ πk . For ξ = (π, β, Σ), let [2] N1 (sξ ) = ||β||1 = p X q K X X k=1 j=1 z=1 Let ŝm an ηm -log-likelihood minimizer in Sm , for ηm ≥ 0: n 1X log(ŝm (yi |xi )) ≤ inf − sm ∈Sm n i=1 n |[βk ]z,j |. 1X log(sm (yi |xi )) − n i=1 ! + ηm . Assume that for all m ∈ N∗ , the penalty function satisfies pen(m) = λm with    r    p K log(n) 1 2 1 + 4(q + 1)AΣ Aβ + 1 + q log(n) K log(2p + 1) λ ≥ κ AΣ ∨ aπ aΣ n for a constant κ. Then, if m̂ is such that n 1X − log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf ∗ m∈N n i=1 for η ≥ 0, the estimator ŝm̂ satisfies −1  n 1X log(ŝm (yi |xi )) + pen(m) − n i=1  ! +η inf KLn (sξ0 , sm ) + pen(m) + ηm + η E(KLn (sξ0 , ŝm̂ )) ≤(1 + κ ) inf ∗ sm ∈Sm m∈N r 1 K e− 2 π q/2 p ′ 2qaπ +κ n Aq/2 Σ r     4(q + 1) log(n) K 1 ′ +κ 1+ K AΣ ∨ AΣ A2β + n aπ 2 aσ  2 q × 1 + Aβ + ; aΣ 82 2.4. PROOF OF THE ORACLE INEQUALITY ′ where κ is a positive constant. It is an ℓ1 -ball mixture regression model selection theorem for ℓ1 -penalized maximum likelihood conditional density estimation in the Gaussian framework. Its proof could be deduced from the two following propositions, which split the result if the variable Y is large enough or not. Proposition 2.4.2. We observe (xi , yi )1≤i≤n , with unknown conditional density denoted by sξ0 . Let Mn > 0, and consider the event   T := max max |[Yi ]z | ≤ Mn . i∈{1,...,n} z∈{1,...,q} For all m ∈ N∗ , we consider the ℓ1 -ball [2] Sm = {sξ ∈ S, N1 (sξ ) ≤ m} where S = {sξ , ξ ∈ Ξ̃} and [2] N1 (sξ ) = ||β||1 = p X q K X X k=1 j=1 z=1 |[βk ]z,j | for ξ = (π, β, Σ). Let ŝm an ηm -log-likelihood minimizer in Sm , for ηm ≥ 0: n 1X − log(ŝm (yi |xi )) ≤ inf sm ∈Sm n i=1 n 1X log(sm (yi |xi )) − n i=1 ! + ηm .   q(|Mn |+Aβ )AΣ . Assume that for all m ∈ N∗ , the Let CMn = max a1π , AΣ + 21 (|Mn | + Aβ )2 A2Σ , 2 penalty function satisfies pen(m) = λm with  p 4CM √  λ ≥ κ √ n K 1 + 9q log(n) K log(2p + 1) n for some absolute constant κ. Then, any estimate ŝm̂ with m̂ such that n 1X − log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf ∗ m∈N n i=1 n 1X log(ŝm (yi |xi )) + pen(m) − n i=1 ! +η for η ≥ 0, satisfies E(KLn (sξ0 , ŝm̂ )1T ) ≤(1 + κ −1 ) inf ∗ ′ + m∈N  κ K 3/2 qCMn √ n inf KLn (sξ0 , sm ) + pen(m) + ηm  !  q 2 ; 1 + Aβ + aΣ sm ∈Sm  ′ where κ is an absolute positive constant. Proposition 2.4.3. Let sξ0 , T and ŝm̂ defined as in the previous proposition. Then, E(KLn (sξ0 , ŝm̂ )1T c ) ≤ e−1/2 π q/2 p 2 2Knqaπ e−1/4(Mn −2Mn Aβ )aΣ . q/2 AΣ 83 CHAPTER 2. 
AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS 2.4.2 Notations To prove those two propositions, and the theorem, begin with some notations. For any measurable function g : Rq 7→ R , we consider the empirical norm v u n u1 X g 2 (yi |xi ); gn := t n i=1 its conditional expectation E(g(Y |X)) = Z g(y|x)sξ0 (y|x)dy; Rq its empirical process n 1X g(yi |xi ); Pn (g) := n i=1 and its normalized process  Z n  1X g(yi |xi ) − g(y|xi )sξ0 (y|xi )dy . νn (g) := Pn (g) − EX (Pn (g)) = n Rq i=1 For all m ∈ N∗ , for all model Sm , we define Fm by     sm Fm = fm = − log , s m ∈ Sm . sξ 0 Let δKL > 0. For all m ∈ N∗ , let ηm ≥ 0. There exist two functions, denoted by ŝm̂ and s̄m , belonging to Sm , such that Pn (− log(ŝm̂ )) ≤ inf Pn (− log(sm )) + ηm ; sm ∈Sm Denote by fˆm := − log set M (m) by  (2.7) KLn (sξ0 , s̄m ) ≤ inf KLn (sξ0 , sm ) + δKL . ŝm sξ 0  sm ∈Sm and f¯m := − log  s̄m sξ 0  . Let η ≥ 0 and fix m ∈ N∗ . We define the  M (m) = m′ ∈ N∗ |Pn (− log(ŝm′ )) + pen(m′ ) ≤ Pn (− log(ŝm )) + pen(m) + η . 2.4.3 (2.8) Proof of the Theorem 2.4.1 thanks to the Propositions 2.4.2 and 2.4.3   Let Mn > 0 and κ ≥ 36. Let CMn = max a1π , AΣ + 21 (|Mn | + Aβ )2 A2Σ , q(|Mn | + Aβ )AΣ /2 . Assume that, for all m ∈ N∗ , pen(m) = λm, with r   p K λ ≥ κCMn 1 + q log(n) K log(2p + 1) . n We derive from the two propositions that there exists κ′ such that, if m̂ satisfies n 1X − log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf ∗ m∈N n i=1 n 1X − log(ŝm (yi |xi )) + pen(m) n i=1 ! + η; 2.4. PROOF OF THE ORACLE INEQUALITY 84 then ŝm̂ satisfies E(KLn (sξ0 , ŝm̂ )) = E(KLn (sξ0 , ŝm̂ )1T ) + E(KLn (sξ0 , ŝm̂ )1T c )   −1 inf KLn (sξ0 , sm ) + pen(m) + ηm ≤(1 + κ ) inf ∗ sm ∈Sm m∈N   q 2 CM + κ′ √ n K 3/2 q 1 + (Aβ + ) +η aΣ n e−1/2 π q/2 p 2 + κ′ 2Knqaπ e−1/4(Mn −2Mn Aβ )aΣ . q/2 AΣ In order to optimize this equation with respect to Mn , we consider Mn the positive solution of the polynomial 1 log(n) − (X 2 − 2XAβ )aΣ = 0; 4 q √ 2 and ne−1/4(Mn −2Mn Aβ )aΣ = √1n . we obtain Mn = Aβ + A2β + 4 log(n) aΣ On the other hand,    q+1 1 2 1+ AΣ (Mn + Aβ ) C M n ≤ AΣ ∨ aπ 2     1 log(n) ≤ AΣ ∨ 1 + 4(q + 1)AΣ A2β + . aπ aΣ We get (sξ0 , ŝm̂ )1T ) + E(KLn (sξ0 , ŝm̂ )1T c )   −1 inf KLn (sξ0 , sm ) + pen(m) + ηm + η ≤ (1 + κ ) inf ∗ E(KLn (sξ0 , ŝm̂ )) = E(KLn sm ∈Sm m∈N +κ′ +κ′ 2.4.4 q q K n K n − 12 p e 2qaπ q/2 (qAΣ )     log(n) 1 2 1 + 4(q + 1)AΣ Aβ + AΣ ∨ aπ aΣ  2 ! q ×K 1 + Aβ + . aΣ π q/2 Proof of the Theorem 2.3.1 We will show that there exists ηm ≥ 0, and η ≥ 0 such that ŝLasso (λ) satisfies the hypothesis of the Theorem 2.4.1, which will lead to Theorem 2.3.1. First, let show that there exists m ∈ N∗ and ηm ≥ 0 such that the Lasso estimator is an ηm -log-likelihood minimizer in Sm . [2] For all λ ≥ 0, if mλ = ⌈N1 (ŝ(λ))⌉, ! n 1X Lasso log(s(yi |xi )) . ŝ (λ) = argmin − n s∈S [2] N1 (s)≤mλ We could take ηm = 0. Secondly, let show that there exists η ≥ 0 such that i=1 85 CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS n 1X log(ŝLasso (λ)(yi |xi )) + pen(mλ ) ≤ inf ∗ − m∈N n i=1 n 1X log(ŝm (yi |xi )) + pen(m) − n i=1 ! + η. Taking pen(mλ ) = λmλ , n − n 1X 1X log(ŝLasso (λ)(yi |xi )) + pen(mλ ) = − log(ŝLasso (λ)(yi |xi )) + λmλ n n i=1 i=1 n 1X [2] ≤− log(ŝLasso (λ)(yi |xi )) + λN1 (ŝLasso (λ)) + λ n i=1 ) ( n 1X [2] log(sξ (yi |xi )) + λN1 (sξ ) + λ ≤ inf − sξ ∈S n i=1 ) ( n 1X [2] log(sξ (yi |xi )) + λN1 (sξ ) + λ ≤ inf ∗ inf − m∈N sξ ∈Sm n i=1 ( ) ! n 1X ≤ inf ∗ inf − log(sξ (yi |xi )) + λm + λ sξ ∈Sm m∈N n i=1 ! 
n X 1 ≤ inf ∗ − log(ŝm (yi |xi )) + λm + λ. m∈N n i=1 which is exactly the goal, with η = λ. Then, according to the Theorem 2.4.1, with m̂ = mλ , and ŝm̂ = ŝLasso (λ), for    r    p log(n) 1 K 1 + 4(q + 1)AΣ A2β + 1 + q log(n) K log(2p + 1) , λ ≥ κ AΣ ∨ aπ aΣ n we get the oracle inequality. 2.5 2.5.1 Proof of the theorem according to T or T c Proof of the Proposition 2.4.2 This proposition corresponds to the main theorem according to the event T . To prove it, we need some preliminary results. From our notations, reminded in Section 2.4.2, we have, for all m ∈ N∗ for all m′ ∈ M (m), Pn (fˆm′ ) + pen(m′ ) ≤ Pn (fˆm ) + pen(m) + η ≤ Pn (f¯m ) + pen(m) + ηm + η; E(Pn (fˆm′ )) + pen(m′ ) ≤ E(Pn (f¯m )) + pen(m) + ηm + η + νn (f¯m ) − νn (fˆm′ ); KLn (sξ0 , ŝm′ ) + pen(m′ ) ≤ inf KLn (sξ0 , sm ) + δKL + pen(m) + ηm + η + νn (f¯m ) − νn (fˆm′ ); sm ∈Sm thanks to the inequality (2.7). The goal is to bound −νn (fˆm′ ) = νn (−fˆm′ ). To control this term, we use the following lemma. (2.9) 2.5. PROOF OF THE THEOREM ACCORDING TO T OR T C 86 Lemme 2.5.1. Let Mn > 0. Let     max |[Yi ]z | ≤ Mn . T = max i∈{1,...,n} Let CMn = max  1 aπ , AΣ z∈{1,...,q} + 21 (|Mn | + Aβ )2 A2Σ , ∆m′ = m′ log(n) q(|Mn |+Aβ )AΣ 2  and    p q . K log(2p + 1) + 6 1 + K Aβ + aΣ Then, on the event T , for all m′ ∈ N∗ , for all t > 0, with probability greater than 1 − e−t ,     √ √ √ 4CMn q 9 Kq∆m′ + 2 t 1 + K Aβ + sup |νn (−fm′ )| ≤ √ aΣ n fm′ ∈Fm′ Proof. Page 90 N From (2.9), on the event T , for all m ∈ N∗ , for all m′ ∈ M (m), for all t > 0, with probability greater than 1 − e−t , KLn (sξ0 , ŝm′ ) + pen(m′ ) ≤ inf KLn (sξ0 , sm ) + δKL + pen(m) + νn (f¯m ) sm ∈Sm     √ √ √ 4CMn q + √ 9 Kq∆m′ + 2 t 1 + K Aβ + + ηm + η aΣ n ≤ inf KLn (sξ0 , sm ) + pen(m) + νn (f¯m ) sm ∈Sm ! 2   √ √ 1 q C Mn + Kt 9 Kq∆m′ + √ 1 + K Aβ + +4 √ aΣ n 2 K + ηm + η + δKL , the last inequality being true because 2ab ≤ √1 a2 + K √ Kb2 . Let z > 0 such that t = z + m + m′ . ′ On the event T , for all m ∈ N, for all m′ ∈ M (m), with probability greater than 1 − e−(z+m+m ) , KLn (sξ0 , ŝm′ ) + pen(m′ ) ≤ inf KLn (sξ0 , sm ) + pen(m) + νn (f¯m ) sm ∈Sm CM +4 √ n n √ 1 9 Kq∆m′ + √ 2 K √  ′ K(z + m + m ) CM +4 √ n n + ηm + η + δKL .   q 1 + K Aβ + aΣ 2 ! CM √ KLn (sξ0 , ŝm′ ) − νn (f¯m ) ≤ inf KLn (sξ0 , sm ) + pen(m) + 4 √ n Km sm ∈Sm n   √ 4CMn ′ ′ ′ K(m + 9q∆m ) − pen(m ) + √ n ! 2   √ 1 4CMn q √ + Kz + ηm + η + δKL . + √ 1 + K Aβ + aΣ n 2 K Let κ ≥ 1, and assume that pen(m) = λm with  p 4CM √  λ ≥ √ n K 1 + 9q log(n) K log(2p + 1) n 87 CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS Then, as ∆ with m′    p q , = m log(n) K log(2p + 1) + 6 1 + K Aβ + aΣ ′ 4CM √ 1 1 p κ−1 = √ n K ≤ , λ n 1 + 9q log(n) K log(2p + 1) we get that KLn (sξ0 , ŝm′ ) − νn (f¯m ) ≤ inf KLn (sξ0 , sm ) + (1 + κ−1 ) pen(m) sm ∈Sm 2   4CMn 1 q √ + √ 1 + K Aβ + aΣ n 2 K      √ √ 4CM q + √ n 54 Kq 1 + K Aβ + + Kz aΣ n + η + δKL + ηm ≤ inf KLn (sξ0 , sm ) + (1 + κ−1 ) pen(m) sm ∈Sm 27 + 1/2 27K 3/2 + √ K 4CM + √ n n   q 1 + K Aβ + aΣ 2 + √ Kz ! + ηm + η + δKL . Let m̂ such that n 1X log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf ∗ − m∈N n i=1 n 1X − log(ŝm (yi |xi )) + pen(m) n i=1 ! + η; and M (m) = {m′ ∈ N∗ |Pn (− log(ŝm′ )) + pen(m′ ) ≤ Pn (− log(ŝm )) + pen(m) + η} . By definition, m̂ ∈ M (m). Because for all m ∈ N∗ , for all m′ ∈ M (m), X X ′ ′ e−(z+m+m ) ≥ 1 − e−z e−m−m ≥ 1 − e−z , 1− m∈N∗ m′ ∈M (m) (m,m′ )∈(N∗ )2 we could sum up over all models. 
On the event T , for all z > 0, with probability greater than 1 − e−z ,   inf KLn (sξ0 , sm ) + (1 + κ−1 ) pen(m) + ηm KLn (sξ0 , ŝm̂ ) − νn (f¯m ) ≤ inf ∗ m∈N sm ∈Sm 4CM + √ n n 55q 27K 3/2 + √ 2 K   q 1 + K Aβ + aΣ 2 + √ Kz ! + η + δKL . By integrating over z > 0, and noticing that E(νn (f¯m )) = 0 and that δKL can be chosen arbitrary 2.5. PROOF OF THE THEOREM ACCORDING TO T OR T C 88 small, we get E(KLn (sξ0 , ŝm̂ )1T ) ≤ inf ∗ m∈N  inf KLn (sξ0 , sm ) + (1 + κ sm ∈Sm −1 ) pen(m) + ηm  ! 2   √ 4CMn 55 q q + K +η 27K 3/2 + √ 1 + K Aβ + + √ aΣ n K 2   −1 ≤ inf ∗ inf KLn (sξ0 , sm ) + (1 + κ ) pen(m) + ηm sm ∈Sm m∈N   ! q 2 332K 3/2 qCMn √ 1 + Aβ + + + η. aΣ n 2.5.2 Proof of the Proposition 2.4.3  We want an upper bound of E KLn (sξ0 , ŝm̂ )1T c . Thanks to the Cauchy-Schwarz inequality, E KLn (sξ0 , ŝm̂ )1T However, for all sξ ∈ S, n 1X KLn (sξ0 , sξ ) = n 1 = n ≤− Z log q i=1 R  Z n X Rq i=1 Z n X 1 n i=1 Rq  c  ≤ sξ0 (y|xi ) sξ (y|xi ) q  E(KL2n (sξ0 , ŝm̂ )) p P (T c ). sξ0 (y|xi )dy log(sξ0 (y|xi ))sξ0 (y|xi )dy − Z Rq log(sξ (y|xi ))sξ0 (y|xi )dy  log(sξ (y|xi ))sξ0 (y|xi )dy. Because parameters are assumed to be bounded, according to the set (2.3), we get, with (β 0 , Σ0 , π 0 ) the parameters of sξ0 and (β, Σ, π) the parameters of sξ , !! (y − βk xi )t Σ−1 (y − β x ) πk k i k p exp − log(sξ (y|xi ))sξ0 (y|xi ) = log q/2 det(Σ ) 2 (2π) k k=1   K X (y − βk0 xi )t (Σ0k )−1 (y − βk0 xi ) πk0 q × exp − 2 q/2 det(Σ0 ) k=1 (2π) k q   ) aπ det(Σ−1  1 t t −1  ≥ log K exp −(y t Σ−1 1 y + x i β 1 Σ1 β 1 x i ) q/2 (2π) p  aπ det((Σ01 )−1 ) t t −1 exp −(y t Σ−1 ×K 1 y + x i β 1 Σ1 β 1 x i ) q/2 (2π) ! q/2  aπ aΣ exp −(y t y + A2β )AΣ ≥ log K (2π)q/2 K X q/2 ×K  aπ aΣ exp −(y t y + A2β )AΣ . q/2 (2π) 89 CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS Indeed, for u ∈ Rq , if we use the eigenvalue decomposition of Σ = P t DP , |ut Σu| = |ut P t DP u| ≤ ||P u||2 ||DP U ||2 ≤ M (D)||P u||22 ≤ M (D)||u||22 ≤ AΣ ||u||22 . √ To recognize the expectation of a Gaussian standardized variables, we put u = 2AΣ y: # −ut u " ! 2 q/2 q/2 Z tu KaΣ aπ Kaπ e−Aβ AΣ aΣ e 2 u log KL(sξ0 (.|xi ), sξ (.|xi )) ≤ − du − A2β AΣ − q/2 q/2 2 (2π)q/2 (2AΣ ) (2π) Rq # " ! 2 q/2 q/2 Kaπ aΣ aΣ Kaπ e−Aβ AΣ U2 2 E log ≤− − Aβ AΣ − 2 (2AΣ )q/2 (2π)q/2 # " ! 2 q/2 q/2 Kaπ aΣ KaΣ aπ e−Aβ AΣ 1 ≤− log − A2β AΣ − q/2 q/2 2 (2AΣ ) (2π) ! 2 2 q/2 q/2 KaΣ aπ e−Aβ AΣ −1/2 1/2 q/2 Kaπ e−Aβ AΣ −1/2 aΣ e π log ≤− q/2 (2π)q/2 (2π)q/2 A Σ ≤ e−1/2 π q/2 q/2 AΣ ; where U ∼ Nq (0, 1). We have used that for all t ∈ R, t log(t) ≥ −e−1 . Then, we get, for all sξ ∈ S, n e−1/2 π q/2 1X KL(sξ0 (.|xi ), sξ (.|xi )) ≤ . KLn (sξ0 , sξ ) ≤ q/2 n A i=1 Σ As it is true for all sξ ∈ S, it is true for ŝm̂ , then q E(KL2n (sξ0 , ŝm̂ )) ≤ e−1/2 π q/2 q/2 AΣ . For the last step, we need to bound P (T c ). c c P (T ) = E(1T c ) = E(EX (1T c )) = E(PX (T )) ≤ E Nevertheless, let Yx ∼ P (||Yx ||∞ PK k=1 πk Nq (βk x, Σk ), Z n X i=1 ! PX (||Yi ||∞ > Mn ) . then,  K X (y−β x)t Σ−1 (y−β x)  k k k − 1 2 p e dy > Mn ) = 1{||Yx ||∞ ≥Mn } πk q/2 det(Σ ) (2π) Rq k k=1   Z (y−βk x)t Σ−1 (y−βk x) K k X − 1 2 p = dy 1{||Yx ||∞ ≥Mn } e πk q/2 det(Σ ) q (2π) R k k=1 = K X k=1 πk PX (||Yxk ||∞ > Mn ) ≤ q K X X k=1 z=1 with Yxk ∼ N (βk x, Σk ) and [Yxk ]z ∼ N ([βk x]z , [Σk ]z,z ). πk PX (|[Yxk ]z | > Mn ) 90 2.6. SOME DETAILS We need to control PX (|[Yxk ]z | > Mn ), for all z ∈ {1, . . . , q}. PX (|[Yxk ]z | > Mn ) = PX ([Yxk ]z > Mn ) + PX ([Yx,k ]z < −Mn ) ! ! 
−Mn − [βk x]z Mn − [βk x]z p + PX U < = PX U > p [Σk ]z,z [Σk ]z,z ! ! Mn − [βk x]z Mn + [βk x]z + PX U > p = PX U > p [Σk ]z,z [Σk ]z,z ≤e − 12 ≤ 2e ≤ 2e  Mn −[βk x]z √ [Σk ]z,z 2 +e − 12  Mn +[βk x]z √ 2 − 21  − 21 2 −2M |[β x] |+|[β x] |2 Mn n k z k z [Σk ]z,z Mn −|[βk x]z | √ [Σk ]z,z [Σk ]z,z 2 . where U ∼ N (0, 1). Then, 1 2 P (||Yx ||∞ > Mn ) ≤ 2Kqe− 2 (Mn −2Mn Aβ )aΣ ,  P 1 2 n − 21 (Mn2 −2Mn Aβ )aΣ ≤ 2Knaπ qe− 2 (Mn −2Mn Aβ )aΣ . We have and we get P (T c ) ≤ E i=1 2Kqaπ e obtained the wanted bound for E(KLn (sξ0 , ŝm̂ )1T c ). 2.6 2.6.1 Some details Proof of the Lemma 2.5.1 First, give some tools prove the Lemma 2.5.1. q to 1 Pn We define ||g||n = n i=1 g 2 (yi |xi ) for any measurable function g. Let m ∈ N∗ . We have n sup |νn (−fm )| = sup fm ∈Fm fm ∈Fm 1X (fm (yi |xi ) − E(fm (Yi |xi ))) . n i=1 To control the deviation of such a quantity, we shall combine concentration with symmetrization arguments. We first use the following concentration inequality which can be found in [BLM13]. Lemme 2.6.1. Let (Z1 , . . . , Zn ) be independent random variables with values in some space Z and let Γ be a class of real-valued functions on Z. Assume that there exists Rn a non-random constant such that supγ∈Γ ||γ||n ≤ Rn . Then, for all t > 0, P " # r ! n n √ 1X t 1X sup ≤ e−t . γ(Zi ) − E(γ(Zi )) > E sup γ(Zi ) − E(γ(Zi )) + 2 2Rn n n n γ∈Γ γ∈Γ i=1 Proof. See [BLM13]. i=1 N   P Then, we propose to bound E supγ∈Γ n1 ni=1 γ(Zi ) − E(γ(Zi )) thanks to the following symmetrization argument. The proof of this result can be found in [vdVW96]. 91 CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS Lemme 2.6.2. Let (Z1 , . . . , Zn ) be independent random variables with values in some space Z and let Γ be a class of real-valued functions on Z. Let (ǫ1 , . . . , ǫn ) be a Rademacher sequence independent of (Z1 , . . . , Zn ). Then, " # " # n n 1X 1X E sup γ(Zi ) − E(γ(Zi )) ≤ 2 E sup ǫi γ(Zi ) . γ∈Γ n γ∈Γ n i=1 i=1 Proof. See [vdVW96]. N Then, we have to control E(supγ∈Γ 1 n Pn i=1 ǫi γ(Zi ) ). Lemme 2.6.3. Let (Z1 , . . . , Zn ) be independent random variables with values in some space Z and let Γ be a class of real-valued functions on Z. Let (ǫ1 , . . . , ǫn ) be a Rademacher sequence independent of (Z1 , . . . , Zn ). Define Rn a non-random constant such that sup ||γ||n ≤ Rn . γ∈Γ Then, for all S ∈ N∗ , # " n 1X ǫi γ(Zi ) ≤ Rn E sup γ∈Γ n i=1  6 X −s p √ 2 log(1 + N (2−s Rn , Γ, ||.||n )) + 2−S n s=1 S ! where N (δ, Γ, ||.||n ) stands for the δ-packing number of the set of functions Γ equipped with the metric induced by the norm ||.||n . Proof. See [Mas07]. N In our case, we get the following lemma. Lemme 2.6.4. Let m ∈ N∗ . Consider (ǫ1 , . . . , ǫn ) a Rademacher sequence independent of (Y1 , . . . , Yn ). Then, on the event T , ! n X √ CM q ǫi fm (Yi |xi ) ≤ 18 K √ n ∆m ; E sup n fm ∈Fm i=1 where ∆m := m log(n) p K log(2p + 1) + 6(1 + K(Aβ + q aΣ )). Proof. Let m ∈ N∗ . According to Lemma 2.6.5, we get that on the event T , sup ||fm ||n ≤ Rn := 2CMn (1 + K(Aβ + fm ∈Fm q )). aΣ 92 2.6. 
SOME DETAILS Besides, on the event T , for all S ∈ N∗ , S X 2−s s=1 ≤ S X 2 −s S X p p log[1 + N (2−s Rn , Fm , ||.||n )] ≤ 2−s log(2N (2−s Rn , Fm , ||.||n ))  s=1 p log(2) + p log(2p + 1) 2s+1 C Mn qKm  Rn    S X 2s+3 CMn qK 2s+3 CMn −s K log 1 + + 2 according to Lemma 4.15 1+ Rn a Σ Rn s=1   S X p p 2s+1 CMn qKm −s log(2) + log(2p + 1) ≤ 2 Rn s=1 s  2 S X C Mn −s s+3 K log 1 + 2 + 2 max(1, qK/aΣ ) Rn s=1   S X p p 2s+1 CMn qKm p −s log(2) + log(2p + 1) ≤ + 2(s + 3)K log(2)q/aΣ 2 Rn s=1 !! √ S X p √ q √ 2CMn Kmq p S log(2p + 1) + log(2) 1 + ≤ 6K + 2 2−s s Rn aΣ s=1 ! √ √ p q√ 2e 2CMn Kmq p √ √ √ 6K + q K ≤ S log(2p + 1) + log(2) 1 + Rn aΣ 2− e s=1 s  √ s √ because 2−s s ≤ 2e for all s ∈ N∗ . Then, thanks to the Lemma 2.6.3, n E sup fm ∈Fm 1X ǫi fm (Yi |xi ) n i=1 ! ≤ Rn ≤ Rn Taking S = log(n) log(2) ! S 6 X −s p −S −s √ 2 log[1 + N (2 Rn , Fm , ||.||n )] + 2 n s=1   2CMn Kmq p 6 √ S log(2p + 1) Rn n    p q √ q √ 2e −S √ + log(2) 1 + 6K + K . +2 aΣ aΣ 2− e to obtain the same order in the both terms depending on S, we could deduce 93 CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS that ! n E sup fm ∈Fm 1X ǫi fm (Yi |xi ) n i=1 12CMn Kmq p log(n) √ ≤ log(2p + 1) log(2) n ! # √    " p √ log(2) q 2e 1 √ √ + 2CMn 1 + K Aβ + + 1 + 6K + aΣ n n 2 − 2e 18CMn Kmq p √ ≤ log(2p + 1) log(n) n ! # √ √  "   p √ q K 2e √ + 2 √ C M n 1 + K Aβ + log(2) 1 + 6 + +1 aΣ n 2 − 2e √     p K q √ . CMn mq K log(2p + 1) log(n) + 6 1 + K Aβ + ≤18 aΣ n It completes the proof. We are now able to prove the Lemma 2.5.1. n 1X (fm (yi |xi ) − EX (fm (Yi |xi ))) fm ∈Fm n i=1 ! r n X √ t fm (Yi |xi ) − E(fm (Yi |xi )) + 2 2Rn ≤E sup n fm ∈Fm sup |νn (−fm )| = sup fm ∈Fm i=1 with probability greater than 1 − e−t and where Rn is a constant computed from the Lemma 2.6.5 ! r n X √ t ≤ 2E sup ǫi fm (Yi |xi ) + 2 2Rn n fm ∈Fm i=1 with ǫi a Rademacher sequence, independent of Zi r   √ C Mn q √ t ≤ 2 18 K √ ∆m + 2 2Rn n n r  √ !  √ q Kq t . 1 + K Aβ + ≤ 4CMn 9 √ ∆m + 2 n aΣ n 2.6.2 Lemma 2.6.5 and Lemma 4.15 Lemme 2.6.5. On the event T =  max max |[Yi ]z | ≤ Mn , i∈{1,...,n} z∈{1,...,q} for all m ∈ N∗ , sup ||fm ||n ≤ 2CMn fm ∈Fm    q 1 + K Aβ + aΣ  := Rn . N 94 2.6. SOME DETAILS n   o Proof. Let m ∈ N∗ . Because fm ∈ Fm = fm = − log ssm0 , sm ∈ Sm , there exists sm ∈ ξ   sm p Sm such that fm = − log s 0 . For all x ∈ [0, 1] , denote ξ(x) = (π, β1 x, . . . , βK x, Σ) the ξ parameters of sm (.|x). For all i ∈ {1, . . . , n}, |fm (yi |xi )|1T = | log(sm (yi |xi )) − log(sξ0 (yi |xi ))|1T ∂ log(sξ (yi |x)) ||ξ(xi ) − ξ 0 (xi )||1 1T , ≤ sup sup ∂ξ x∈[0,1]p ξ∈Ξ thanks to the Taylor formula. Then, we need an upper bound of the partial derivate. For all x ∈ [0, 1]p , for all y ∈ Rq , we could write ! K X log(sξ (y|x)) = log hk (x, y) k=1 where, for all k ∈ {1, . . . , K}, hk (x, y) = πk q/2 (2π) det Σk   1 × exp −  2 q X z2 =1   q X z1 =1 y z1 − p X j=1    y z 2 − xj [βk ]z1 ,j  [Σk ]−1 z1 ,z2 p X j=1  [βk ]z2 ,j xj  . Then, for all l ∈ {1, . . . , K}, for all z1 ∈ {1, . . . , q}, for all z2 ∈ {1, . . . , q}, for all y ∈ Rq , for all x ∈ [0, 1]p , ! 
q q(|y| + Aβ )AΣ ∂ log(sξ (y|x)) 1 X hl (x, y) − ; [Σl ]−1 = PK z1 ,z2 ([βl x]z2 − yz2 ) ≤ ∂([βl x]z1 ) 2 2 k=1 hk (x, y) z2 =1 ∂ log(sξ (y|x)) 1 = PK ∂([Σl ]z1 ,z2 ) k=1 hk (x, y) × ≤ −hl Cofz1 ,z2 (Σl ) hl (x, y)(yz1 − [βl x]z1 )(yz2 − [βl x]z2 )[Σl ]−2 z1 ,z2 − det(Σl ) 2 −Cofz1 ,z2 (Σl ) (yz1 − [βl x]z1 )(yz2 − [βl x]z2 )[Σl ]−2 z1 ,z2 + det(Σl ) 2 1 ≤AΣ + (|y| + Aβ )2 A2Σ , 2 where Cofz1 ,z2 (Σk ) is the (z1 , z2 )-cofactor of Σk . We also have, for all l ∈ {1, . . . , K}, for all x ∈ [0, 1]p , for all y ∈ Rq , ∂ log(sξ (y, x)) hl (x, y) 1 = . ≤ PK ∂πl aπ πl k=1 hk (x, y) Thus, for all y ∈ Rq , ∂ log(sξ (y|x)) ≤ max sup sup ∂ξ x∈[0,1]p ξ∈Ξ̃  We have Cy ≤ AΣ ∧ 1 aπ h 1+  q(|y| + Aβ )AΣ 1 1 , AΣ + (|y| + Aβ )2 A2Σ , aπ 2 2 q+1 2 AΣ (|y| i + Aβ )2 . For all m ∈ N∗ ,  = Cy . 95 CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS |fm (yi |xi )|1T ≤ Cyi ||ξ(xi ) − ξ 0 (xi )||1 1T ≤ C Mn K X k=1 (||βk xi − βk0 xi ||1 + ||Σk − Σ0k ||1 + |πk − πk0 |). 0 belong to Ξ̃, we obtain Since fm and fm |fm (yi |xi )|1T ≤ 2CMn (KAβ + K q + 1) aΣ and then sup ||fm ||n 1T ≤ 2CMn (KAβ + K fm ∈Fm q + 1). aΣ N For the next results, we need the following lemma, proved in [Mey13]. Lemme 2.6.6. Let δ > 0 and (Ai,j ) i∈{1,...,n} ∈ [0, 1]n×p . There exists a family B of (2p + 1)1/δ 2 j∈{1,...,p} vectors of Rp such that for all µ ∈ Rp in the ℓ1 -ball, there exists µ′ ∈ B such that 2  p n X X 1  (µj − µ′j )Ai,j  ≤ δ 2 . n i=1 Proof. See [Mey13]. j=1 N With this lemma, we can prove the following one: Lemme 2.6.7. Let δ > 0 and m ∈ N∗ . On the event T , we have the upper bound of the δ-packing number of the set of functions Fm equipped with the metric induced by the norm ||.||n : N (δ, Fm , ||.||n ) ≤ (2p + 1) 2 K 2 q 2 m2 /δ 2 4CM n  8CMn qK 1+ aΣ δ K  8CMn 1+ δ K . Proof. Let m ∈ N∗ and fm ∈ Fm . There exists sm ∈ Sm such that fm = − log(sm /sξ0 ). Intro′ ′ = − log(s′ /s ). Denote by (β , Σ , π ) ′ ′ duce s′m in S and put fm k k k 1≤k≤K and (βk , Σk , πk )1≤k≤K m ξ0 ′ the parameters of the densities sm and sm respectively. First, applying Taylor’s inequality, on the event   T = max |[Yi ]z | ≤ Mn , max i∈{1,...,n} z∈{1,...,q} we get, for all i ∈ {1, . . . , n}, ′ (yi |xi )|1T = | log(sm (yi |xi )) − log(s′m (yi |xi ))|1T |fm (yi |xi ) − fm ∂ log(sξ (yi |x)) ≤ sup sup ||ξ(xi ) − ξ ′ (xi )||1 1T ∂ξ p x∈[0,1] ξ∈Ξ̃ ≤ C Mn q K X X k=1 z=1 ! [βk xi ]z − [βk′ xi ]z + ||Σk − Σ′k ||1 + |πk − πk′ | . 96 2.6. SOME DETAILS Thanks to the Cauchy-Schwarz inequality, we get that   !2 q K X X 2  ′ + (||Σ − Σ′ ||1 + ||π − π ′ ||)2  βk xi − βk′ xi (yi |xi ))2 1T ≤ 2CM (fm (yi |xi ) − fm n k=1 z=1   2  q p p K X X X X 2   [βk ]z,j [xi ]j − Kq ≤ 2CM [βk′ ]z,j [xi ]j  + (||Σ − Σ′ ||1 + ||π − π ′ ||)2  , n k=1 z=1 j=1 j=1 and  2 p p q K X n X X X X 1 ′ 2 2   [βk ]z,j [xi ]j − Kq ||fm − fm ||n 1T ≤2CM [βk′ ]z,j [xi ]j  n n i=1 j=1 j=1 k=1 z=1 #  + (||Σ − Σ′ ||1 + ||π − π ′ ||)2 . Denote by 2  q p p n K X X X X X 1  [βk ]z,j [xi ]j − a = Kq [βk′ ]z,j [xi ]j  . n k=1 z=1 i=1 j=1 j=1 Then, for all δ > 0, if 2 ) a ≤ δ 2 /(4CM n ||Σ − Σ′ ||1 ≤ δ/(4CMn ) ||π − π ′ || ≤ δ/(4CMn ) ′ ||2 ≤ δ 2 . To bound a, we write then ||fm − fm n 2  q p p K X n ′ X X X X [βk ]z,j [βk ]z,j 1  [xi ]j − [xi ]j  a = Kqm2 n m m =1 z=1 i=1 j=1 j=1 and we apply Lemma 2.6.6 to [βk ]z,. /m for all k ∈ {1, . . . , K}, and for all z ∈ {1, . . . , q}. 
Since P P 2 2 2 2 2 [β ] sm ∈ Sm , we have qz=1 pj=1 kmz,j ≤ 1, thus there exists a family B of (2p + 1)4CMn q K m /δ vectors of Rp such that for all k ∈ {1, . . . , K}, for all z ∈ {1, . . . , q}, for all [βk ]z,. , there exists ′ 2 ). Moreover, since ||Σ|| ≤ qK and ||π|| ≤ 1, we get that, [βk ]z,. ∈ B such that a ≤ δ 2 /(4CM 1 1 aΣ n on the event T ,     δ δ K qK K ,B ( , B (1), ||.||1 ), ||.||1 N N (δ, Fm , ||.||n ) ≤ card(B)N 4CMn 1 AΣ 4CMn 1     2 q 2 K 2 m2 /δ 2 8CMn qK K 8CMn K 4CM n 1+ ≤ (2p + 1) 1+ aΣ δ δ N 97 CHAPTER 2. AN ℓ1 -ORACLE INEQUALITY FOR THE LASSO IN FINITE MIXTURE OF MULTIVARIATE GAUSSIAN REGRESSION MODELS 2.6. SOME DETAILS 98 Chapter 3 An oracle inequality for the Lasso-MLE procedure Contents 3.1 3.2 3.3 3.4 3.5 3.6 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 The Lasso-MLE procedure . . . . . . . . . . . . . . . . . . . . . . . . . 92 3.2.1 Gaussian mixture regression model . . . . . . . . . . . . . . . . . . . . . 93 3.2.2 The Lasso-MLE procedure . . . . . . . . . . . . . . . . . . . . . . . . . 94 3.2.3 Why refit the Lasso estimator? . . . . . . . . . . . . . . . . . . . . . . . 94 An oracle inequality for the Lasso-MLE model . . . . . . . . . . . . . 95 3.3.1 Notations and framework . . . . . . . . . . . . . . . . . . . . . . . . . . 95 3.3.2 Oracle inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.4.1 Simulation illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.4.2 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Tools for proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 3.5.1 General theory of model selection with the maximum likelihood estimator.101 3.5.2 Proof of the general theorem . . . . . . . . . . . . . . . . . . . . . . . . 103 3.5.3 Sketch of the proof of the oracle inequality 3.3.2 . . . . . . . . . . . . . 107 Assumption Hm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Assumption K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Appendix: technical results . . . . . . . . . . . . . . . . . . . . . . . . 109 3.6.1 Bernstein’s Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 3.6.2 Proof of Lemma 3.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 3.6.3 Determination of a net for the mean and the variance . . . . . . . . . . 110 3.6.4 Calculus for the function φ . . . . . . . . . . . . . . . . . . . . . . . . . 113 3.6.5 Proof of the Proposition 3.5.5 . . . . . . . . . . . . . . . . . . . . . . . . 114 3.6.6 Proof of the Lemma 3.5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . 115 99 3.1. INTRODUCTION 100 In this chapter, we focus on a theoretical result for the LassoMLE procedure. We will get a penalty which depends on the model complexity for which the model selected by the penalized criterion among the collection constructed satisfies an oracle inequality. This result is non-asymptotic. We derive it from a general model selection theorem, also detailed here, which is a generalization of Cohen and Le Pennec Theorem, [CLP11], for a model collection constructed randomly. 3.1 Introduction The goal of clustering methods is to discover a structure among individuals described by several variables. Specifically, in regression case, given n observations (x, y) = ((x1 , y1 ), . . . 
, (xn , yn )), which are realizations of random variables (X, Y ) with X ∈ Rp and Y ∈ Rq , one aims at grouping the data into clusters such that, conditionally on X, the observations Y in the same cluster are more similar to each other than to those from the other clusters. Different methods can be considered, more geometric or more statistical. We deal with model-based clustering, in order to have a rigorous statistical framework to assess the number of clusters and the role of each variable. This method is known to have good empirical performance relative to its competitors, see [TMZT06].
Datasets are described by many explanatory variables, sometimes many more than the sample size, and not all of this information is relevant for the clustering. To solve this problem, we propose a procedure which provides a clustering of the data based on variable selection. In the density estimation setting, we can cite Pan and Shen [PS07], who focus on variable selection in the means, Zhou and Pan [ZPS09], who use the Lasso estimator to regularize Gaussian mixture models with general covariance matrices, Sun et al. [SWF12], who propose to regularize the k-means algorithm to deal with high-dimensional data, and Guo et al. [GLMZ10], who propose a pairwise variable selection method. All of them deal with penalized model-based clustering.
In a regression framework, the Lasso estimator, introduced by Tibshirani in [Tib96], is a classical tool in this context. Since it works well in practice, many efforts have been made recently to obtain theoretical results for this estimator. Under a variety of assumptions on the design matrix, oracle inequalities are available for the Lasso estimator; for example, the restricted eigenvalue condition, introduced by Bickel, Ritov and Tsybakov in [BRT09], yields such an oracle inequality. For an overview of existing results, see for example [vdGB09]. Beyond estimation, the Lasso estimator can be used to select variables, and for this goal many results are proved without strong assumptions. The first result in this direction is due to Meinshausen and Bühlmann [MB10], who show that, for neighborhood selection in Gaussian graphical models, the Lasso estimator is consistent under a neighborhood stability condition. Under different assumptions, such as the irrepresentable condition described in [ZY06], one gets the same kind of result: the true variables are selected consistently. Thanks to these results, one can refit, after the variable selection step, with an estimator having better properties. In this chapter, we focus on the maximum likelihood estimator restricted to the estimated active set. In the linear regression framework, this idea is used by Massart and Meynet [MM11a], Belloni and Chernozhukov [BC13], and Sun and Zhang [SZ12].
In our case of finite mixture regression, we propose a procedure based on a modeling that recasts the variable selection and clustering problems into a model selection problem. This procedure is developed in [Dev14c], with methodology, computational issues, simulations and data analysis. First, for some data-driven regularization parameters, we construct a set of relevant variables. Then, restricted to those sets, we compute the maximum likelihood estimator. Considering the model collection with various numbers of components and various sparsities, we select a model thanks to the slope heuristic.
Then, we get a clustering of the data thanks to the Maximum A Posteriori principle. This procedure could be used to cluster heterogeneous multivariate regression data and understand which variables explain the clustering, in high-dimension. Considering a regression clustering could refine a clustering, and it could be more adapted for instance for prediction. In this chapter, we focus on the theoretical point of view. We define a penalized criterion which allows to select a model (defined by the number of clusters and the set of relevant variables) from a non-asymptotic point of view. Penalizing the empirical contrast is an idea emerging from the seventies. Akaike, in [Aka74], proposed the Akaike’s Information Criterion (AIC) in 1973, and Schwarz in 1978 in [Sch78] suggested the Bayesian Information Criterion (BIC). Those criteria are based on asymptotic heuristics. To deal with non-asymptotic observations, Birgé and Massart in [BM07] and Barron et al. in [YB99], define a penalized data-driven criterion, which leads to oracle inequalities for model selection. The aim of our approach is to define penalized data-driven criterion which leads to an oracle inequality for our procedure. In our context of regression, Cohen and Le Pennec, in [CLP11], proposed a general model selection theorem for maximum likelihood estimation, adapted from Massart’s Theorem in [Mas07]. Nevertheless, we can not apply it directly, because it is stated for a deterministic model collection, whereas our data-driven model collection is random, constructed by the Lasso estimator. As Maugis and Meynet have done in [MMR12] to generalize Massart’s Theorem, we extend the theorem to cope with the randomness of our model collection. By applying this general theorem to the finite mixture regression random model collection constructed by our procedure, we derive a convenient theoretical penalty as well as an associated non-asymptotic penalized criteria and an oracle inequality fulfilled by our Lasso-MLE estimator. The advantage of this procedure is that it does not need any restrictive assumption. To obtain the oracle inequality, we use a general theorem proposed by Massart in [Mas07], which gives the form of the penalty and associated oracle inequality in term of the Kullback-Leibler and Hellinger loss. In our case of regression, Cohen and Le Pennec, in [CLP11], generalize this theorem in term of Kullback-Leibler and Jensen-Kullback-Leibler loss. Those theorems are based on the centred process control with the bracketing entropy, allowing to evaluate the size of the models. Our setting is more general, because we work with a random family denoted by M̌. We have to control the centred process thanks to Bernstein’s inequality. The rest of this chapter is organized as follows. In the Section 3.2, we define the multivariate Gaussian mixture regression model, and we describe the main steps of the procedure we propose. We also illustrate the requirement of refitting by some simulations. We present our oracle inequality in the Section 3.3.2. In Section 3.4, we illustrate the procedure on simulated dataset and benchmark dataset. Finally, in Section 3.5, we give some tools to understand the proof of the oracle inequality, with a global theorem of model selection with a random collection in Section 3.5.1 and sketch of proofs after. All the details are given in Appendix. 3.2 The Lasso-MLE procedure In order to cluster high-dimensional regression data, we will work with the multivariate Gaussian mixture regression model. 
This model is developed in [SBG10] in the scalar response case. We generalize it in Section 3.2.1. Moreover, we want to construct a model collection. We propose, in Section 3.2.2, a procedure called Lasso-MLE which constructs a model collection, with various sparsity and various number of components, of Gaussian mixture regression models. The different sparsities solve the high-dimensional problem. We conclude this section with 102 3.2. THE LASSO-MLE PROCEDURE simulations, which illustrate the advantage of refitting. 3.2.1 Gaussian mixture regression model We observe n independent couples (xi , yi )1≤i≤n realizing random variables (X, Y ), where X ∈ Rp , and Y ∈ Rq comes from a probability distribution with unknown conditional density denoted by s∗ . To solve a clustering problem, we use a finite mixture model in regression. In particular, we will approximate the density of Y conditionally to X with a mixture of K multivariate Gaussian regression models. If the observation i belongs to the cluster k, we are looking for βk ∈ Rq×p such that yi = βk xi + ǫ, where ǫ ∼ N (0, Σk ). Remark that we also have to estimate the number of clusters K. Thus, the random response variable Y ∈ Rq depends on a set of random explanatory variables, written X ∈ Rp , through a regression-type model. Give more precisions on the assumptions on the model we use. — The variables Yi , conditionally to Xi , are independent, for all i ∈ {1, . . . , n} ; — Yi |Xi = xi ∼ sK ξ (y|xi )dy, with K X (y − βk x)t Σ−1 πk k (y − βk x) sK exp − ξ (y|x) = q/2 1/2 2 (2π) det(Σk ) k=1 ! (3.1) K ξ = (π1 , . . . , πK , β1 , . . . , βK , Σ1 , . . . , ΣK ) ∈ ΞK = ΠK × (Rq×p )K × (S++ q ) ( ) K X ΠK = (π1 , . . . , πK ); πk > 0 for k ∈ {1, . . . , K} and πk = 1  k=1 S++ q is the set of symmetric positive definite matrices on Rq . We want to estimate the conditional density function sK ξ from the observations. For all k ∈ {1, . . . , K}, βk is the matrix of regression coefficients, and Σk is the covariance matrix in the mixture component k. The πk s are the mixture Ppproportions. In fact, for a regressor x, for all k ∈ {1, . . . , K}, for all z ∈ {1, . . . , q}, [βk x]z = j=1 [βk ]z,j xj is the zth component of the mean of the mixture component k. To deal with high-dimensional data, we select variables. Definition 3.2.1. A variable (z, j) ∈ {1, . . . , q} × {1, . . . , p} is said to be irrelevant if, for all k ∈ {1, . . . , K}, [βk ]z,j = 0. A variable is relevant if it is not irrelevant. A model is said to be sparse if there are a few of relevant variables. We denote by Φ[J] the matrix with 0 on the set c J, and S(K,J) the model with K components and with J for relevant variables set: n o (K,J) S(K,J) = y ∈ Rq |x ∈ Rp 7→ sξ (y|x) (3.2) where (K,J) sξ (y|x) K X [J] [J] (y − βk x)t Σ−1 πk k (y − βk x) = exp − q/2 1/2 2 (2π) det(Σk ) k=1 ! . This is the main model used in this chapter. To construct the set of relevant variables J, we use the Lasso estimator. Rather than select a regularization parameter, we consider a collection, which leads to a model collection. Detail the procedure. 103 3.2.2 CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE The Lasso-MLE procedure The procedure we propose, which is particularly interesting in high-dimension, could be decomposed into three main steps. First, we construct a model collection, with models more or less sparse and with more or less components. Then, we refit estimations with the maximum likelihood estimator. Finally, we select a model thanks to the slope heuristic. 
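As a side illustration of the object being estimated, here is a minimal numerical sketch of the conditional density (3.1); the function and the toy parameter values (K = 2, p = q = 2) are purely illustrative and not part of the thesis code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_regression_density(y, x, pi, beta, Sigma):
    """Conditional density s_xi^K(y|x) of equation (3.1).

    pi    : (K,) mixture proportions
    beta  : (K, q, p) regression matrices
    Sigma : (K, q, q) covariance matrices
    """
    return sum(pi[k] * multivariate_normal.pdf(y, mean=beta[k] @ x, cov=Sigma[k])
               for k in range(len(pi)))

# toy example with K = 2 and p = q = 2
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
beta = np.stack([3.0 * np.eye(2), -2.0 * np.eye(2)])
Sigma = np.stack([np.eye(2) / 9.0, np.eye(2) / 9.0])
x = rng.uniform(size=2)
y = beta[0] @ x + rng.multivariate_normal(np.zeros(2), Sigma[0])
print(mixture_regression_density(y, x, pi, beta, Sigma))
```

The three steps of the procedure, and the MAP step that follows, all operate on conditional densities of this form.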
It leads to a clustering according to the MAP principle on the selected model. Model collection construction The first step consists of constructing a collection of models {S(K,J) }(K,J)∈M in which the model S(K,J) is defined by equation (3.2), and the model collection is indexed by M = K × J . Denote by K ⊂ N∗ the possible number of components, and denote by J a collection of subsets of {1, . . . , q} × {1, . . . , p}. To detect the relevant variables, and construct the set J in each model, we generalize the Lasso estimator. Indeed, we penalize empirical contrast by an ℓ1 -penalty on the mean parameters P theP proportional to ||Pk βk ||1 = pj=1 qz=1 |(Pk βk )z,j |, where Pkt Pk = Σ−1 k for all k ∈ {1, . . . , K}. Then, we will consider ) ( n K X X 1 Lasso log(sK πk ||Pk βk ||1 . ξˆK (λ) = argmin − ξ (yi |xi )) + λ n ξ=(π,β,Σ)∈ΞK i=1 k=1 This leads to penalize simultaneously the ℓ1 -norm of the mean coefficients and small variances. Computing those estimators lead to construct the relevant variables set. For a fixed number of mixture components K ∈ K, denote by GK a candidate of regularization parameters. Fix a parameter λ ∈ GK , we could then use an EM algorithm to compute the Lasso estimator, and construct the set of relevant variables J(λ,K)S , sayingSthe non-zero coefficients. We denote by J the random collection of all these sets, J = K∈K λ∈GK J(λ,K) . Refitting The second step consists of approximating the maximum likelihood estimator ) ( n X 1 ŝ(K,J) = argmin − log(t(yi |xi )) n t∈S(K,J) i=1 using an EM algorithm for each model (K, J) ∈ K×J . Remark that we estimate all parameters, to reduce bias induced by the Lasso estimator. Model selection The third step is devoted to model selection. We get a model collection, and we need to select the best one. Because we do not have access to s∗ , we can not take the one which minimizes the risk. The Theorem 3.5.1 solves this problem: we get a penalty achieving to an oracle inequality. Then, even if we do not have access to s∗ , we know that we can do almost like the oracle. 3.2.3 Why refit the Lasso estimator? In order to illustrate the refitting, we compute multivariate data, the restricted eigenvalue condition being not satisfied, and run our procedure. We consider an extension of the model studied in Giraud et al. article [BGH09] in the Section 6.3. Indeed, this model is a linear regression with a scalar response which does not satisfy the restricted eigenvalues condition. Then, we define different classes, to get a finite mixture regression model, which does not satisfied the restricted eigenvalues condition, and extend the dimension for multivariate response. We could compare 3.3. AN ORACLE INEQUALITY FOR THE LASSO-MLE MODEL 104 the result of our procedure with the Lasso estimator, to illustrate the oracle inequality we have get. Let precise the model. Let [x]1 , [x]2 , [x]3 be three vectors of Rn defined by √ [x]1 = (1, −1, 0, . . . , 0)t / 2 √ t 2 [x]2 = (−1, p √1.001,√0, . . . , 0) / 1 + t0.001 [x]3 = (1/ 2, 1/ 2, 1/n, . . . , 1/n) / 1 + (n − 2)/n2 and for 4 ≤ j ≤ n, let [x]j be the j th vector of the canonical basis of Rn . We take a sample of size n = 20, and vector of size p = q = 10. We consider two classes, each of them defined by [β1 ]z,j = 10 and [β2 ]z,j = −10 for j ∈ {1, . . . , 2}, z ∈ {1, . . . , 10}. Moreover, we define the covariance matrix of the noise by a diagonal matrix with 0.01 for diagonal coefficient in each class. We run our procedure on this model, and compare it with the Lasso estimator, without refitting. 
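The simulated design just described can be made explicit. The sketch below reconstructs it under two assumptions that the text leaves implicit: each column is scaled to unit Euclidean norm (this matches the stated factors 1/√2 for [x]_1 and 1/√(1 + (n − 2)/n²) for [x]_3, and is used to fill in the garbled normalizing constant of [x]_2), and the two mixture components are taken equally likely, which the text does not specify. The seed and helper names are ours.

```python
import numpy as np

def why_refit_design(n=20, p=10):
    """Fixed design of Section 3.2.3, adapted from [BGH09]; the restricted
    eigenvalue condition fails because of the first two, almost collinear, columns."""
    X = np.zeros((n, p))
    X[0, 0], X[1, 0] = 1.0, -1.0                   # [x]_1
    X[0, 1], X[1, 1] = -1.0, 1.001                 # [x]_2
    X[0, 2] = X[1, 2] = 1.0 / np.sqrt(2.0)         # [x]_3
    X[2:, 2] = 1.0 / n
    for j in range(3, p):                          # [x]_j = e_j for j >= 4
        X[j, j] = 1.0
    return X / np.linalg.norm(X, axis=0)           # unit-norm columns

rng = np.random.default_rng(0)
n, p, q, K = 20, 10, 10, 2
X = why_refit_design(n, p)
beta = np.zeros((K, q, p))
beta[0, :, :2], beta[1, :, :2] = 10.0, -10.0       # [beta_1]_{z,j} = 10, [beta_2]_{z,j} = -10, j in {1, 2}
labels = rng.integers(K, size=n)                   # equally likely components (assumption)
Y = np.stack([beta[labels[i]] @ X[i] for i in range(n)]) \
    + rng.normal(scale=np.sqrt(0.01), size=(n, q)) # diagonal noise covariance with 0.01 on the diagonal
```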
We compute the model selected by the slope heuristic over the model collection constructed by the Lasso estimator. In Figure 3.1 are the boxplots of each procedure, running 20 times. The Kullback-Leibler divergence is computed over a sample of size 5000. Kullback−Leibler divergence 8 7 6 5 4 Lasso−MLE Lasso Figure 3.1: Boxplot of the Kullback-Leibler divergence between the true model and the one constructed by each procedure, the Lasso-MLE procedure and the Lasso estimator We could see that a refitting after variable selection by the Lasso estimator leads to a better estimation, according to the Kullback-Leibler loss. 3.3 An oracle inequality for the Lasso-MLE model Before state the main theorem of this chapter, we need to precise some definitions and notations. 3.3.1 Notations and framework We assume that the observations (xi , yi )1≤i≤n are realizations of random variables (X, Y ) where X ∈ Rp and Y ∈ Rq . For (K, J) ∈ K × J , for a model S(K,J) , we denote by ŝ(K,J) the maximum likelihood estimator ! n X (K,J) (K,J) − ŝ = argmin log sξ (yi |xi ) . (K,J) sξ ∈S(K,J) i=1 To avoid existence issue, we could work with almost minimizer of this quantity and define an η-log-likelihood minimizer: ! n n X X (K,J) (K,J) − log(ŝ (yi |xi )) ≤ inf − log sξ (yi |xi ) + η. i=1 (K,J) sξ ∈S(K,J) i=1 105 CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE The best model in this collection is the one with the smallest risk. However, because we do not have access to the true density s∗ , we can not select the best model, which we call the oracle. Thereby, there is a trade-off between a bias term measuring the closeness of s∗ to the set S(K,J) and a variance term depending on the complexity of the set S(K,J) and on the sample size. A good set S(K,J) will be one for which this trade-off leads to a small risk bound. Because we are working with a maximum likelihood approach, the most natural quality measure is thus the Kullback-Leibler divergence denoted by KL. Z   s(y)   log s(y)dy if sdy << tdy; t(y) (3.3) KL(s, t) = R   + ∞ otherwise; for s and t two densities. As we deal with conditional densities and not classical densities, the previous divergence should be adapted. We define the tensorized Kullback-Leibler divergence by # " n X 1 KL⊗n (s, t) = E KL(s(.|xi ), t(.|xi )) . n i=1 This divergence, defined in [CLP11] appears as the natural one in this regression setting. Namely, we use the Jensen-Kullback-Leibler divergence JKLρ with ρ ∈ (0, 1), which is defined by 1 JKLρ (s, t) = KL(s, (1 − ρ)s + ρt); ρ and the tensorized one " # n X 1 n JKL⊗ JKLρ (s(.|xi ), t(.|xi )) . ρ (s, t) = E n i=1 This divergence is studied in [CLP11]. We use this divergence rather than the Kullback-Leibler one because we need a boundedness assumption  on the controlled functions that is not satisfied by (K,J) ∗ /s . When considering the Jensen-Kullback-Leibler the log-likelihood differences − log sξ divergence, those ratios are replaced by ratios   ∗ + ρs(K,J) (1 − ρ)s 1 ξ  − log  ρ s∗ that are close to the log-likelihood differences when sm are close to s∗ and always upper bounded by − log(1 − ρ)/ρ. Indeed, this bound is needed to use deviation inequalities for sums of random variables and their suprema, which is the key of the proof of oracle type inequality. 3.3.2 Oracle inequality We denote by (S(K,J) )(K,J)∈K×J L the model collection constructed by the Lasso-MLE procedure, with J L a random subcollection of P({1, . . . , q}×{1, . . . , p}) constructed by the Lasso estimator. 
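As a quick numerical illustration of the two divergences defined above (a Monte-Carlo sketch for two univariate Gaussians, not part of the thesis), one can check both that JKL_ρ(s, t) ≤ KL(s, t) and that the integrand of JKL_ρ is bounded by −log(1 − ρ)/ρ, which is precisely the boundedness used in the proofs.

```python
import numpy as np
from scipy.stats import norm

def mc_divergences(mu_s, mu_t, sigma=1.0, rho=0.5, n_mc=200_000, seed=0):
    """Monte-Carlo estimates of KL(s, t) and JKL_rho(s, t) for s = N(mu_s, sigma^2), t = N(mu_t, sigma^2)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(mu_s, sigma, n_mc)              # sample from s
    log_s = norm.logpdf(y, mu_s, sigma)
    log_t = norm.logpdf(y, mu_t, sigma)
    kl = np.mean(log_s - log_t)
    log_mix = np.log((1 - rho) * np.exp(log_s) + rho * np.exp(log_t))
    jkl = np.mean(log_s - log_mix) / rho
    return kl, jkl

kl, jkl = mc_divergences(0.0, 2.0)
print(kl, jkl, -np.log(1 - 0.5) / 0.5)             # here KL = 2 and JKL_0.5 <= min(KL, 2 log 2)
```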
The grid of regularization parameter considered is data-driven, then random. Because we work in high-dimension, we could not look at all subsets of P({1, . . . , q}×{1, . . . , p}). Considering the Lasso estimator through its regularization path is the solution chosen here, but it needs more control because of the random family. To get theoretical results, we need to work with restricted 106 3.3. AN ORACLE INEQUALITY FOR THE LASSO-MLE MODEL parameters. Assume Σk diagonal, with Σk = diag([Σk ]1,1 , . . . , [Σk ]q,q ), for all k ∈ {1, . . . , K}. We define B S(K,J) =  (K,J) ∈ S(K,J) for all k ∈ {1, . . . , K}, [βk ][J] ∈ [−Aβ , Aβ ]q×p ,  aΣ ≤ [Σk ]z,z ≤ AΣ for all z ∈ {1, . . . , q}, for all k ∈ {1, . . . , K} . sξ (3.4) Moreover, we assume that the covariates X belong to an hypercube. Without any restriction, we could assume that X ∈ [0, 1]p . Remark 3.3.1. We have to denote that in this chapter, the relevant variables set is designed by the Lasso estimator. Nevertheless, any tool could be used to construct this set, and we obtain analog results. We could work with any random subcollection of P({1, . . . , q} × {1, . . . , p}), the controlled size being required in high-dimensional case. Theorem 3.3.2. Let (xi , yi )1≤i≤n the observations, with unknown conditional density s∗ . Let S(K,J) defined by (3.2). For (K, J) ∈ K×J L , J L being a random subcollection of P({1, . . . , q}× B {1, . . . , p}) constructed by the Lasso estimator, denote S(K,J) the model defined by (3.4). Consider the maximum likelihood estimator ( ) n 1X (K,J) (K,J) ŝ = argmin − (yi |xi ) . log sξ n (K,J) s ∈S B ξ i=1 (K,J) B Denote by D(K,J) the dimension of the model S(K,J) , D(K,J) = K(|J| + q + 1) − 1. Let s̄(K,J) ∈ B S(K,J) such that δKL ; KL⊗n (s∗ , s̄(K,J) ) ≤ inf KL⊗n (s∗ , t) + B n t∈S(K,J) and let τ > 0 such that s̄(K,J) ≥ e−τ s∗ . Let pen : K × J → R+ , and suppose that there exists an absolute constant κ > 0 and an absolute constant B(Aβ , AΣ , aΣ ) such that, for all (K, J) ∈ K × J ,    D(K,J) 2 D(K,J) 2 B (Aβ , AΣ , aΣ ) − log B (Aβ , AΣ , aΣ ) ∧ 1 pen(K, J) ≥ κ n n   4epq +(1 ∨ τ ) log . (D(K,J) − q 2 ) ∧ pq ˆ Then, the estimator ŝ(K̂,J) , with ˆ = (K̂, J) argmin (K,J)∈K×J L ( n 1X log(ŝ(K,J) (yi |xi )) + pen(K, J) − n i=1 ) satisfies E h ˆ ∗ (K̂,J) n JKL⊗ ) ρ (s , ŝ i ≤C1 E + C2 inf (K,J)∈K×J L (1 ∨ τ ) ; n for some absolute positive constants C1 and C2 . inf B t∈S(K,J) ⊗n KL ∗ (s , t) + pen(K, J) !! 107 CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE This oracle inequality compares performances of our estimator with the best model in the collection. Nevertheless, as we consider mixture of Gaussian, if we take enough clusters, we could approximate well a lot of densities. This result could be compared with the oracle inequality get in [SBG10], Theorem 4. Indeed, under restricted eigenvalues condition and fixed design, they get an oracle inequality for the Lasso estimator in finite mixture regression model, with scalar response and high-dimension regressors. Note that they control the divergence with the true parameters. We get a similar result for the Lasso-MLE estimator. Moreover, our procedure work in a more general framework, the only assumption needed is to be bounded. 3.4 Numerical experiments To illustrate this procedure, we study some simulations and real data. The main algorithm is a generalized version of the EM algorithm, which is used many times for the procedure. We first use it to compute maximum likelihood estimator, to construct the regularization parameter grid. 
Then, we use it to compute the Lasso estimator for each regularization parameter belonging to the grid, and we are able to construct the relevant variables set. Finally, we could compute the maximum likelihood estimator, restricted to those relevant variables in each model. Among this model collection, we select one using the Capushe package. More details, as initialization rule, stopping rule, and more numerical experiments, are available in [Dev14c]. 3.4.1 Simulation illustration We illustrate the procedure on a simulated dataset, adapted from [SBG10]. Let x be a sample of size n = 100 distributed according to multivariate standard Gaussian. We consider a mixture of two components, and we fix the dimension of the regressor and of the response variables to p = q = 10. Besides, we fix the number of relevant variables to 4 in each cluster. More precisely, the first  four variables of Y are explained respectively by the four first variables of X. Fix π = 21 , 21 , [β1 ][J] = 3, [β2 ][J] = −2 and Pk = 3Iq for all k ∈ {1, 2}. The difficulty of the clustering is partially controlled by the signal-to-noise ratio. In this context, we could extend the natural idea of the SNR with the following definition, where Tr(A) denotes the trace of the matrix A. Tr(Var(Y )) = 1.88. SNR = Tr(Var(Y |βk = 0 for all k ∈ {1, . . . , K})) We take a sample of Y knowing X = x according to a Gaussian mixture, meaning in βk x and with covariance matrix Σk = (Pkt Pk )−1 = σIq , for the cluster k ∈ {1, 2}. We run our procedures with the number of components varying in K = {2, . . . , 5}. To compare our procedure with others, we compute the Kullback-Leibler divergence with the true density, the ARI (the Adjusted Rand Index measures the similarity between two data clusterings, knowing that the closer to 1 the ARI, the more similar the two partitions), and how many clusters are selected. From the Lasso-MLE model collection, we construct two models, to compare our procedures with. We compute the oracle (the model which minimizes the Kullback-Leibler divergence with the true density), and the model which is selected by the BIC criterion instead of the slope heuristic. Thanks to the oracle, we know how good we could be from this model collection for the Kullback-Leibler divergence, and how this model, as good it is possible for the contrast, performs the clustering. The third procedure we compare with is the maximum likelihood estimator, assuming that we know how many clusters there are, fixed to 2. We use this procedure to show that variable selection is necessary. 108 3.4. NUMERICAL EXPERIMENTS 0.4 1 0.35 0.8 0.3 0.25 0.6 0.2 ARI Kullback−Leibler divergence 0.45 0.4 0.15 0.1 0.2 0.05 LMLE Oracle Bic 0 LMLE Figure 3.2: Boxplots of the Kullback-Leibler divergence between the true model and the one selected by the procedure over the 20 simulations, for the Lasso-MLE procedure (LMLE), the oracle (Oracle), the BIC estimator (BIC) Oracle Bic MLE Figure 3.3: Boxplots of the ARI over the 20 simulations, for the Lasso-MLE procedure (LMLE), the oracle (Oracle), the BIC estimator (BIC) and the MLE (MLE) Results are summarized in Figure 3.2 and in Figure 3.3. The Kullback-Leibler divergence is smaller for models selected in our model collection (else by BIC, or by slope heuristic, or the oracle) than for the model constructed by the MLE. The ARI is closer to 1 in those case, and, moreover, is better for the model selected by the slope heuristic. 
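For completeness, the simulation design of this section can be reproduced as follows. This is a hedged sketch: Σ_k = (P_k^t P_k)^{-1} = I_q/9 is deduced from P_k = 3 I_q, and the seed and random draws are obviously not those used in the thesis.

```python
import numpy as np

def simulate_section_3_4_1(n=100, p=10, q=10, seed=0):
    """n observations from the two-component Gaussian mixture regression of Section 3.4.1."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                 # multivariate standard Gaussian covariates
    beta = np.zeros((2, q, p))
    for j in range(4):                              # four relevant variables: [Y]_j explained by [X]_j
        beta[0, j, j], beta[1, j, j] = 3.0, -2.0
    labels = rng.integers(2, size=n)                # pi = (1/2, 1/2)
    Y = np.stack([beta[labels[i]] @ X[i] for i in range(n)]) \
        + rng.normal(scale=1.0 / 3.0, size=(n, q))  # Sigma_k = I_q / 9, i.e. standard deviation 1/3
    return X, Y, labels
```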
We could conclude that the model collection is well constructed, selecting relevant variables, and also that the model is well selected among this collection, near the oracle. 3.4.2 Real data We also illustrate the procedure on the Tecator dataset, which deal with spectrometric data. We summarize here results which are described in [Dev14c]. Those data have been studied in a lot of articles, cite for example Ferraty and Vieu’s book [FV06]. The data consist of a 100-channel spectrum of absorbances in the wavelength range 850 − 1050 nm, and of the percentage of fat. We observe a sample of size 215. In this work, we focus on clustering data according to the reliance between the fat content and the absorbance spectrum. The sample will be split into two subsamples, 165 observations for the learning set, and 50 observations for the test set. We split it to have the same marginal distribution of the response in each sample. The spectrum is a function, which we decompose into the Haar basis, at level 6. The procedure selects two models, which we describe here. In Figures (3.4) and (3.5), we represent clusters done on the training set for the different models. The graph on the left is a candidate for representing each cluster, constructed by the mean of spectrum over an a posteriori probability greater than 0.6. We plot the curve reconstruction, keeping only active variables in the wavelet decomposition. On the right side, we present the boxplot of the fat values in each class, for observations with an a posteriori probability greater than 0.6. The first model has two clusters, which could be distinguish in the absorbance spectrum by the bump on wavelength around 940 nm. The first class is dominating, with π̂1 = 0.95. The fat content is smaller in the first class than in the second class. According to the signal reconstruction, we could see that almost all variables have been selected. This model seems consistent according to the classification goal. The second model has 3 clusters, and we could remark different important wavelength. Around 940 nm, there is some differences between clusters, corresponding to the bump underline in the model 1, but also around 970 nm, with higher or smaller values. The first class is dominating, 109 CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE with π̂1 = 0.89. Just a few of variables have been selected, which give to this model the understanding property of which coefficient are discriminating. Figure 3.4: Summarized results for the model 1. The graph on the left is a candidate for representing each cluster, constructed by the mean of reconstructed spectrum over an a posteriori probability greater than 0.6 On the right side, we present the boxplot of the fat values in each class, for observations with an a posteriori probability greater than 0.6. Figure 3.5: Summarized results for the model 2. The graph on the left is a candidate for representing each cluster, constructed by the mean of reconstructed spectrum over an a posteriori probability greater than 0.6 On the right side, we present the boxplot of the fat values in each class, for observations with an a posteriori probability greater than 0.6. We could discuss about those models. The first select only two clusters, but almost all variables, whereas the second model has more clusters, and less variables: there is a trade-off between clusters and variable selection for the dimension reduction. 110 3.5. 
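The wavelet preprocessing used for the Tecator spectra is easy to reproduce. The sketch below shows a Haar decomposition at level 6 with the PyWavelets package, applied to an artificial 100-channel curve that merely stands in for an absorbance spectrum; the real data are not reproduced here.

```python
import numpy as np
import pywt  # PyWavelets

wavelengths = np.linspace(850, 1050, 100)                 # 100-channel grid, 850-1050 nm
spectrum = np.exp(-((wavelengths - 940.0) / 30.0) ** 2)   # artificial bump around 940 nm

coeffs = pywt.wavedec(spectrum, 'haar', level=6)          # [cA6, cD6, cD5, ..., cD1]
design_row = np.concatenate(coeffs)                       # wavelet coefficients used as regressors
print([c.size for c in coeffs], design_row.size)
```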
TOOLS FOR PROOF 3.5 Tools for proof In this section, we present the tools needed to understand the proof. First, we present a general theorem for model selection in regression among a random collection. Then, in subsection 3.5.2, we present the proof of this theorem, and in the next subsection we explain how we could use the main theorem to get the oracle inequality. All details are available in Appendix. 3.5.1 General theory of model selection with the maximum likelihood estimator. To get an oracle inequality for our clustering procedure, we have to use a general model selection theorem. Because the model collection constructed by our procedure is random, because of the Lasso estimator which select variables randomly, we have to generalize Cohen and Le Pennec Theorem. Begin by some general model selection theory. Before state the general theorem, begin by talking about the assumptions. We work here in a more general context, (X, Y ) ∈ X × Y, and (Sm )m∈M defining a model collection indexed by M. First, we impose a structural assumption on each model indexed by m ∈ M. It is a bracketing entropy condition on the model Sm with respect to the Hellinger divergence, defined by " n # X 1 n 2 (d⊗ d2H (s(.|xi ), t(.|xi )) . H ) (s, t) = E n i=1 A bracket [l, u] is a pair of functions such that for all (x, y) ∈ X × Y, l(y, x) ≤ s(y|x) ≤ u(y, x). n The bracketing entropy H[.] (ǫ, S, d⊗ H ) of a set S is defined as the logarithm of the minimum n number of brackets [l, u] of width d⊗ H (l, u) smaller than ǫ such that every functions of S belong to one of these brackets. It leads to the Assumption Hm . Assumption Hm . There is a non-decreasing function φm such that ̟ 7→ increasing on (0, +∞) and for every ̟ ∈ R+ and every sm ∈ Sm , Z ̟q n H[.] (ǫ, Sm (sm , ̟), d⊗ H )dǫ ≤ φm (̟); 1 ̟ φm (̟) is non- 0 n where Sm (sm , ̟) = {t ∈ Sm , d⊗ H (t, sm ) ≤ ̟}. The model complexity Dm is then defined as 2 n̟m with ̟m the unique root of √ 1 (3.5) φm (̟) = n̟. ̟ Remark that the model complexity depends on the bracketing entropies not of the global models Sm but of the ones of smaller localized sets. This is a weaker assumption. For technical reason, a separability assumption is also required. ′ ′ ′ Assumption Sepm . There exists a countable subset Sm of Sm and a set Ym with λ(Y \Ym ) = 0 ′ such that for every t ∈ Sm , there exists a sequence (tl )l≥1 of elements of Sm such that for every ′ x and every y ∈ Ym , log(tl (y|x)) goes to log(t(y|x)) as l goes to infinity. This assumption leads to work with a countable family, which allows to cope with the randomness of ŝm . We also need an information theory type assumption on our collection. We assume the existence of a Kraft-type inequality for the collection. Assumption K. There is a family (wm )m∈M of non-negative numbers such that X e−wm ≤ Ω < +∞. m∈M 111 CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE The difference with Cohen and Le Pennec’s Theorem is that we consider a random collection of models M̌, included in the whole collection M. In our procedure, we deal with high-dimensional models, and we cannot look after all the models: we have to restrict ourselves to a smaller subcollection of models, which is then random. In the proof of the theorem, we have to be careful with the recentred process of − log(s̄m /s∗ ). 
Because we conclude by taking the expectation, if M is fixed, this term is non-interesting, but if we consider a random family, we have to use the Bernstein inequality to control this quantity, and then we have to make the assumption (3.6). Let state our main global theorem. Theorem 3.5.1. Assume we observe (xi , yi )1≤i≤n with unknown conditional density s∗ . Let the model collection S = (Sm )m∈M be at most countable collection of conditional density sets. Assume Assumption K holds, while assumptions Hm and Sepm hold for every m ∈ M. Let δKL > 0, and s̄m ∈ Sm such that KL⊗n (s∗ , s̄m ) ≤ inf KL⊗n (s∗ , t) + t∈Sm δKL ; n and let τ > 0 such that s̄m ≥ e−τ s∗ . (3.6) Introduce (Sm )m∈M̌ some random subcollection of (Sm )m∈M . Consider the collection (ŝm )m∈M̌ of η-log-likelihood minimizer in Sm , satisfying, for all m ∈ M̌, ! n n X X − log(ŝm (yi |xi )) ≤ inf − log(sm (yi |xi )) + η. sm ∈Sm i=1 i=1 Then, for any ρ ∈ (0, 1) and any C1 > 1, there are two constants κ0 and C2 depending only on ρ and C1 such that, as soon as for every index m ∈ M, (3.7) pen(m) ≥ κ(Dm + (1 ∨ τ )wm ) with κ > κ0 , and where the model complexity Dm is defined in (3.5), the penalized likelihood estimate ŝm̂ with m̂ ∈ M̌ such that ! n n X X ′ − log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf − log(ŝm (yi |xi )) + pen(m) + η m∈M̌ i=1 i=1 satisfies ∗ n E(JKL⊗ ρ (s , ŝm̂ )) ≤C1 E  inf m∈M̌  + C2 (1 ∨ τ ) ⊗n inf KL t∈Sm  pen(m) (s , t) + 2 n Ω2 η ′ + η + . n n ∗  (3.8) Obviously, one of the models minimizes the right hand side. Unfortunately, there is no way to know which one without knowing s∗ . Hence, this oracle model can not be used to estimate s∗ . We nevertheless propose a data-driven strategy to select an estimate among the collection of estimates {ŝm }m∈M̌ according to a selection rule that performs almost as well as if we had known this oracle, according to the absolute constant C1 . Using simply the log-likelihood of the estimate in each model as a criterion is not sufficient. It is an underestimation of the true risk of the estimate and this leads to select models that are too complex. By adding an adapted penalty pen(m), one hopes to compensate for both the variance term and the bias term between 112 3.5. TOOLS FOR PROOF P −1/n ni=1 log (ŝm̂ (yi |xi )/s∗ (yi |xi )) and inf sm ∈Sm KL⊗n (s∗ , sm ). For a given choice of pen(m), the best model Sm̂ is chosen as the one whose index is an almost minimizer of the penalized η-log-likelihood. Talk about the assumption (3.6). If s is bounded, with a compact support, this assumption is satisfied. It is also satisfied in other cases, more particular. Then it is not a strong assumption, but it is needed to control the random family. This theorem is available for whatever model collection constructed, whereas assumptions Hm , K and Sepm are satisfied. In the following, we will use this theorem for the procedure we propose to cluster high-dimensional data. Nevertheless, this theorem is not specific for our context, and could be used whatever the problem. Remark that the constant associated to the Assumption K appears squared in the bound. It is due to the random subcollection M̌ of M, if the model collection is fixed, we get a linear bound. Moreover, the weights wm appears linearly in the penalty bound. 3.5.2 Proof of the general theorem For any model Sm , we have denoted by s̄m a function such that KL⊗n (s∗ , s̄m ) ≤ inf KL⊗n (s∗ , sm ) + sm ∈Sm Fix m ∈ M such that KL⊗n (s∗ , s̄m ) < +∞. Introduce ( δKL . 
n ′ pen(m ) M(m) = m ∈ M Pn (− log ŝm′ ) + n ′ ′ pen(m) η ≤ Pn (− log ŝm ) + + n n where Pn (g) = 1/n ) ; Pn We define the functions kl(s̄m ), kl(ŝm ) and jklρ (ŝm ) by    s̄  ŝm m ; ; kl(ŝm ) = − log kl(s̄m ) = − log ∗ s s∗   1 (1 − ρ)s∗ + ρŝm jklρ (ŝm ) = − log . ρ s∗ i=1 g(yi |xi ). ′ For every m ∈ M(m), by definition, ′ ′ pen(m) + η pen(m ) ≤Pn (kl(ŝm )) + Pn (kl(ŝm′ )) + n n ′ pen(m) + η + η ≤Pn (kl(s̄m )) + . n Let νn⊗n (g) denote the recentred process Pn (g) − P ⊗n (g). By concavity of the logarithm, kl(ŝm′ ) ≥ jklρ (ŝm′ ), and then P ⊗n (jklρ (ŝm′ )) − νn⊗n (kl(s̄m )) ≤P ⊗n (kl(s̄m )) + ′ ′ η + η pen(m ) pen(m) − νn⊗n (jklρ (ŝm′ )) + − , n n n CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE 113 which is equivalent to pen(m) − νn⊗n (jklρ (ŝm′ )) n ′ ′ η + η pen(m ) + − . n n ⊗n ∗ ⊗n ∗ n (s , s̄m ) + JKL⊗ ρ (s , ŝm′ ) − νn (kl(s̄m )) ≤ KL (3.9) Mimic the proof as done in Cohen and Le Pennec [CLP11], we could obtain that except on a set of probability less than e−wm′ −w , for all w, for all zm′ > σm′ , there exist absolute constants ′ ′ ′ κ0 , κ1 , κ2 such that s ′ wm′ + w 18 wm′ + w κ1 σ m ′ −νn⊗n (jklρ (ŝm′ )) ′ + κ2 ≤ + . (3.10) ′ ⊗n 2 ∗ 2 zm′ ρ nz2 ′ nz2m′ z ′ + κ0 (dH ) (s , ŝm′ ) m m To obtain this inequality we use the hypothesis Sepm and Hm . This control is derived from maximal inequalities, described in [Mas07]. Our purpose is now to control νn⊗n (kl(s̄m )). This is the difference with the Theorem of Cohen and Le Pennec: we work with a random subcollection ML of M. By definition of kl and νn⊗n , " n  #   n X X 1 s̄ (y |x ) s̄ (Y |X ) 1 m i i m i i +E . log log νn⊗n (kl(s̄m )) = − n s∗ (yi |xi ) n s∗ (Yi |Xi ) i=1 i=1 We want to apply Bernstein’s inequality, which is recalled in Appendix.   s̄m (Yi |Xi ) 1 If we denote by Zi the random variable Zi = − n log s∗ (Yi |Xi ) , we get νn⊗n (kl(s̄m )) = n X i=1 (Zi − E(Zi )). We need to control the moments of Zi to apply Bernstein’s inequality. Lemme 3.5.2. Let s∗ and s̄m two conditional  with respect to the Lebesgue measure.  densities s∗ ≤ τ . Then, Assume that there exists τ > 0 such that log s̄m n E 1X n i=1 Z Rq  log  s∗ (y|xi ) s̄m (y|xi ) 2 ∞ ∗ s (y|xi )dy ! ≤ e−τ τ2 KL⊗n (s∗ , s̄m ). +τ −1 We prove this lemma in Appendix 3.6.2. 2 2 Because e−τ τ+τ −1 ∼ τ , there exists A such that e−τ τ+τ −1 ≤ 2τ for all τ ≥ A. For τ ∈ (0, A], τ →∞ because this function is continuous to 2 in 0, there exists B > 0 such that Pn and2 equivalent τ2 1 ≤ B. We obtain that i=1 E(Zi ) ≤ n δ(1 ∨ τ ) KL⊗n (s∗ , s̄m ), where δ = 2 ∨ B. e−τ +τ −1 Moreover, for all integers K ≥ 3, n X i=1 E((Zi )K +) K Z   ∗ n X s (y|xi ) 1 log ≤ s∗ (y|xi )dy n K Rq s̄m (y|xi ) + i=1  ∗  ∗ K−2  Z s (y|x) s (y|x) 2 n log log 1s∗ ≥s̄m (y|x) s∗ (y|x)dy ≤ K n s̄ (y|x) s̄ (y|x) q m m R n K−2 ⊗n ∗ δ(1 ∨ τ ) KL (s , s̄m ). ≤ Kτ n 114 3.5. TOOLS FOR PROOF Assumptions of Bernstein’s inequality are then satisfied, with δ(1 ∨ τ ) KL⊗n (s∗ , s̄m ) , n v= c= τ , n thus, for all u > 0, except on a set with probability less than e−u , √ νn⊗n (kl(s̄m )) ≤ 2vu + cu. Thus, for all z > 0, for all u > 0, except on a set with probability less than e−u , √ √ cu 2vu + cu vu νn⊗n (kl(s̄m )) + 2. ≤ 2 ≤ p ⊗n ∗ ⊗n ∗ 2 ⊗ n ∗ z + KL (s , s̄m ) z + KL (s , s̄m ) z 2 KL (s , s̄m ) z (3.11) We apply this bound to u = w + wm + wm′ . 
We get that, except on a set with probability less than e−(w+wm +wm′ ) , using that a2 + b2 ≥ a2 , from the inequality (3.10),    κ′1 + κ′2 18 ⊗n 2 ∗ ⊗n 2 ′ + 2 , −νn (jklρ (ŝm′ )) ≤ zm′ + κ0 (dH ) (s , ŝm′ ) θ θ ρ and, from the inequality (3.11), where we have chosen  2 ⊗n (s, sm ) , νn⊗n (kl(s̄m )) ≤ (β + β 2 ) zm,m ′ + KL zm′ = θ with θ > 1 to fix later, and zm,m′ = β −1 s r 2 + σm ′ w m′ + w , n  v + c (w + wm + wm′ ), 2 KL⊗n (s∗ , s̄m ) with β > 0 to fix later. Coming back to the inequality (3.9), ⊗n ∗ ∗ n (s , s̄m ) + JKL⊗ ρ (s , ŝm′ ) ≤ KL pen(m) n ∗ n 2 κ′0 (d⊗ H ) (s , ŝm′ ))  κ′1 + κ′2 18 + 2 θ θ ρ  + (z2m′ + η ′ + η pen(m′ ) 2 ⊗n ∗ (s , s̄m )). − + (β + β 2 )(zm,m ′ + KL n n + Recall that s̄m is chosen such that KL⊗n (s∗ , s̄m ) ≤ inf KL⊗n (s∗ , sm ) + sm ∈Sm Put κ(β) = 1 + (β + β 2 ), and let ǫ1 > 0, we define θ1 by κ′0 defined by ∗ n 2 Cρ (d⊗ H ) (s , ŝm′ )  ∗ n ≤ JKL⊗ ρ (s , ŝm′ ), and put κ2 = ⊗n ∗ ∗ n (s , sm ) + (1 − ǫ1 ) JKL⊗ ρ (s , ŝm′ ) ≤κ(β) KL + κ(β) δKL . n κ′1 +κ′2 θ1 Cρ ǫ 1 κ0 . + 18 θ12 ρ  = Cρ ǫ1 where Cρ is We get that pen(m) pen(m′ ) − n n δKL η ′ + η 2 + + z2m′ κ2 + (β + β 2 )zm,m ′. n n 115 CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE Since τ ≤ 1 ∨ τ , if we choose β such that (β + β 2 )(δ/2 + 1) = αθ1−2 β −2 , and if we put κ1 = αγ −2 (β −2 + 1), since 1 ≤ 1 ∨ τ , using the expressions of zm′ and zm,m′ , we get that ⊗n ∗ ∗ n (s , sm ) + (1 − ǫ1 ) JKL⊗ ρ (s , ŝm′ ) ≤κ(β) KL pen(m) pen(m′ ) − n n δKL η ′ + η + n n  w + w m′ w + w m + w m′ 2 2 + κ2 θ 1 σ m ′ + + κ1 (1 ∨ τ ) n n   wm pen(m) ⊗n ∗ + κ1 (1 ∨ τ ) ≤κ(β) KL (s , sm ) + n n     ′ pen(m′ ) w w m′ m 2 2 + − + κ2 θ 1 σ m ′ + + κ1 (1 ∨ τ ) n n n ′ w δKL η + η + + (κ2 θ12 + κ1 (1 ∨ τ )) . + n n n + κ(β) Now, assume that κ1 ≥ κ in inequality (3.7), we get pen(m) δKL η + η ′ ⊗n ∗ ∗ n (s , sm ) + 2 (1 − ǫ1 ) JKL⊗ + + ρ (s , ŝm′ ) ≤κ(β) KL n n n w 2 + (κ2 θ1 + κ1 (1 ∨ τ )) . n It only remains to sum up the tail bounds over all the possible values of m ∈ M and m′ ∈ M(m) by taking the union of the different sets of probability less than e−(w+wm +wm′ ) , X X e−(w+wm +wm′ ) ≤ e−w e−(wm +wm′ ) (m,m′ )∈M×M m∈M m′ ∈M(m) = e−w X m∈M e−wm !2 = Ω2 e−w from the Assumption K. We then have simultaneously for all m ∈ M, for all m′ ∈ M(m), except on a set with probability less than Ω2 e−w , pen(m) δKL + n n w η + η′ 2 + κ2 θ1 + κ1 (1 ∨ τ ) . + n n ⊗n ∗ ∗ n (s , sm ) + 2 (1 − ǫ1 ) JKL⊗ ρ (s , ŝm′ ) ≤κ(β) KL It is in particular satisfied for all m ∈ M̌ and m′ ∈ M̌(m), and, since m̂ ∈ M̌(m) for all m ∈ M̌, we deduce that except on a set with probability less than Ω2 e−w ,    1 pen(m) ⊗n ∗ ⊗n ∗ JKLρ (s , ŝm̂ ) ≤ × inf κ(β) KL (s , sm ) + 2 (1 − ǫ1 ) n m∈M̌  ′ w δKL η + η 2 + + κ2 θ1 + κ1 (1 ∨ τ ) + . n n n 116 3.5. TOOLS FOR PROOF By integrating over all w > 0, because for any non negative random variable Z and any a > 0, R E(Z) = a z≥0 P (Z > az)dz, we obtain that  ∗ n E JKL⊗ ρ (s , ŝm̂ ) − 1 (1 − ǫ1 )  Ω2 . ≤ κ2 θ12 + κ1 (1 ∨ τ ) n  inf m∈M̌  κ(β) KL⊗n (s∗ , sm ) + 2 pen(m) n  + δKL + η + η ′ κ0 θ 2 n  As δKL can be chosen arbitrary small, this implies that   1 pen(m) E(JKL⊗n (s∗ , ŝm̂ )) ≤ E inf κ(β) KL⊗n (s∗ , sm ) + 1 − ǫ1 n m∈M̌ 2 ′ Ω η+η + (κ2 θ12 + κ1 (1 ∨ τ )) + n n    pen(m) ⊗n ∗ ≤C1 E inf inf KL (s , t) + n m∈M̌ t∈Sm 2 ′ Ω η +η + C2 (1 ∨ τ ) + n n with C1 = 3.5.3 2 1−ǫ1 and C2 = κ2 θ12 + κ1 . Sketch of the proof of the oracle inequality 3.3.2 To prove the Theorem 3.3.2, we have to apply the Theorem 3.5.1. Then, our model collection has to satisfy all the assumptions. Here, m = (K, J). 
The Assumption Sepm is true when we consider Gaussian densities. If s∗ is bounded, with compact support, the assumption (3.6) is satisfied. It is also true in other particular cases. We have to look after assumption Hm and Assumption K. Here we present only the main step to prove these assumptions. All the details are in Appendix. Assumption Hm R̟q n H[.] (ǫ, Sm , d⊗ We could take φm (̟) = 0 H )dǫ for all ̟ > 0. It could be better to consider more local version of the integrated square root entropy, but the global one is enough in this case to define the penalty. As done in Cohen and Le Pennec [CLP11], we could decompose the entropy by ⊗n ⊗n B n H[.] (ǫ, S(K,J) , d⊗ H ) ≤ H[.] (ǫ, ΠK , dH ) + KH[.] (ǫ, FJ , dH ) where B S(K,J) ΠK  P (K,J) [J] y ∈ Rq |x ∈ Rp 7→ sξ (y|x) = K πk ϕ(y|βk x, Σk )  k=1  o n [J] [J] = ξ = π1 , . . . , πK , β1 , . . . , βK , Σ1 , . . . , ΣK ∈ Ξ̃(K,J)     Ξ̃(K,J) = ΠK × ([−Aβ , Aβ ]q×p )K × ([aΣ , AΣ ]q )K ( ) K X = (π1 , . . . , πK ) ∈ (0, 1)K ; πk = 1    n k=1 FJ = ϕ(.|β [J] X, Σ); β ∈ [−Aβ , Aβ ]q×p , Σ = diag([Σ]1,1 , . . . , [Σ]q,q ) ∈ [aΣ , AΣ ]q where ϕ denote the Gaussian density, and Aβ , aΣ , AΣ are absolute constants. o 117 CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE Calculus for the proportions We could apply a result proved by Wasserman and Genovese in [GW00] to bound the entropy for the proportions. We get that  K−1 ! ⊗n K/2 3 . H[.] (ǫ, ΠK , dH ) ≤ log K(2πe) ǫ Calculus for the Gaussian The family  2  l(y, x) = (1 + δ)−p q−3q/4 ϕ(y|νJ x, (1 + δ)−1/4 B [1] )   2   u(y, x) = (1 + δ)p q+3q/4 ϕ(y|νJ x, (1 + δ)B [2] )    [a] B =diag(bi(1) , . . . , bi(q) ), with i a permutation, for a ∈ {1, 2}, Bǫ (FJ ) =   bl = (1 + δ)1−l/2 AΣ , l ∈ {2, . . . , N }     and ∀(z, j) ∈ J c , νz,j = 0   √   ∀(z, j) ∈ J, νz,j = cδAΣ uz,j                  (3.12) is an ǫ-bracket covering for FJ , where uz,j is a net for the mean, N is the number of parameters −1/4 ) 1 ǫ, and c = 5(1−28 needed to recover all the variance set, δ = √2(p2 q+3/4q) We obtain that     2Aβ |J| AΣ 1 −1−|J| √ + δ ; |Bǫ (FJ )| ≤ 2 aΣ 2 cAΣ . and then we get n H[.] (ǫ, FJ , d⊗ H ) ≤ log 2  2Aβ √ cAΣ |J|  AΣ 1 + aΣ 2 Proposition 3.5.3. Put D(K,J) = K(1 + |J|). For all ǫ ∈ (0, 1), B n H[.] (ǫ, S(K,J) , d⊗ H ) ≤ log(C) + D(K,J) log with C = 2K(2πe)K/2  2Aβ √ cAΣ K|J| 3K−1   δ −1−|J| (̟) .   1 ; ǫ AΣ 1 + aΣ 2 Determination of a function φ We could take s  " q φ(K,J) (̟) = D(K,J) ̟ B(Aβ , AΣ , aΣ ) + log φ ! K . 1 ̟∧1 # . This function is non-decreasing, and ̟ 7→ (K,J) is non-increasing. ̟ √ 2 . With the expression of φ(K,J) , The root ̟(K,J) is the solution of φ(K,J) (̟(K,J) ) = n̟(K,J) we get s  " r # D 1 (K,J) 2 ̟ B(Aβ , AΣ , aΣ ) + log ̟(K,J) = . n ̟(K,J) ∧ 1 q D(K,J) ∗ Nevertheless, we know that ̟ = n B(Aβ , AΣ , aΣ ) minimizes ̟(K,J) : we get " !# D(K,J) 1 2 2 ̟(K,J) ≤ . 2B (Aβ , AΣ , aΣ ) + log D (K,J) n B 2 (Aβ , AΣ , aΣ ) ∧ 1 n 118 3.6. APPENDIX: TECHNICAL RESULTS Assumption K We want to group models by their dimension. Lemme 3.5.4. The quantity card{(K, J) ∈ N∗ × P({1, . . . , q} × {1, . . . , p}), D(K, J) = D} is upper bounded by   2pq if pq ≤ D − q 2  D−q2 epq  otherwise. 2 D−q Proposition 3.5.5. Consider the weight family {w(K,J) }(K,J)∈K×J defined by w(K,J) = D(K,J) log Then we have 3.6 P (K,J)∈K×J  4epq (D(K,J) − q 2 ) ∧ pq  . e−w(K,J) ≤ 2. Appendix: technical results In this appendix, we give more details for the proofs. 3.6.1 Bernstein’s Lemma Lemme 3.6.1 (Bernstein’s inequality). Let (X1 , . . . 
, Xn ) be independent real Pnvalued random variables. Assume that therePexists some positive numbers v and P c such that i=1 E(Xi2 ) ≤ v, n K! K−2 K and, for all integers K ≥ 3, i=1 E((Xi )+ ) ≤ 2 vc . Let S = ni=1 (Xi − E(Xi )). Then, for every positive x, √ P (S ≥ 2vx + cx) ≤ exp(−x). 3.6.2 Proof of Lemma 3.5.2 This proof is adapted from Maugis and Meynet, [MMR12]. First, let give some bounds. Lemme 3.6.2. Let τ > 0. For all x > 0, consider f (x) = x log(x)2 , h(x) = x log(x) − x + 1, Then, for all 0 < x < eτ , we get f (x) ≤ φ(x) = ex − x − 1. τ2 h(x). φ(−τ ) φ(y) y2 is non-decreasing. We omit the proof here.  ∗  s We want to apply this inequality, in order to derive the Lemma 3.5.2. As log ≤ τ, s̄m To prove this, we have to show that y 7→ ∞ s∗ s̄m ∞ ≤ eτ ; and we could apply the previous inequality to s∗ /s̄m . Indeed, for all x, for all y,  ∗    ∗ τ2 s (y|x) s (y|x) h ≤ . f s̄m (y|x) φ(−τ ) s̄m (y|x) CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE 119 Integrating with respect to the density s̄m , we get that  ∗  s (y|.) 2 s∗ (y|.) log s̄m (y|.)dy s̄m (y|.) Rq s̄m (y|.)   ∗ Z τ2 s∗ (y|.) s∗ (y|.) s (y|.) ≤ log − + 1 s̄m (y|.)dy −τ − τ − 1 s̄m (y|.) s̄m (y|.) s̄m (y|.) Rq e   ∗ n Z s (y|xi ) 2 1X ∗ s (y|xi ) log dy ⇐⇒ n s̄m (y|xi ) i=1 n Z τ2 1X s∗ (y|xi ) ≤ −τ dy. s∗ (y|xi ) log e −τ −1n s̄m (y|xi ) Z i=1 It concludes the proof. 3.6.3 Determination of a net for the mean and the variance In this subsection, we work with a Gaussian density, then β ∈ Rq×p and Σ ∈ S++ q . — Step 1: construction of a net for the variance j Let ǫ ∈ (0, 1], and δ = √2(p21q+ 3 q) ǫ. Let bj = (1 + δ)1− 2 AΣ . For 2 ≤ j ≤ N , we have S S 4 [aΣ , AΣ ] = [bN , bN −1 ] . . . [b3 , b2 ], where N is chosen to recover everything. We want that aΣ = (1 + δ)1−N/2 AΣ   aΣ N log = 1− log(1 + δ) AΣ 2 √ 2 log( AaΣΣ 1 + δ) . N= log(1 + δ) ⇔ ⇔ We want N to be an integer, then N =  A 2 log( a Σ Σ √ 1+δ) log(1+δ)  . We get a net for the variance. We could let B = diag(bi(1) , . . . , bi(q) ), close to Σ (and deterministic, independent of the values of Σ), where i is a permutation such that bi(z)+1 ≤ [Σ]z,z ≤ bi(z) for all b √1 z ∈ {1, . . . , q}. Remember that j+1 bj = 1+δ . — Step 2: construction of a net for the mean vectors We select only the relevant variables detected by the Lasso estimator. For λ ≥ 0, n o Lasso Jλ = (z, j) ∈ {1, . . . , q} × {1, . . . , p}|β̂z,j (λ) 6= 0 . Let f = ϕ(.|βx, Σ) ∈ FJ . — Definition of the brackets Define the bracket by the functions l and u:   2 l(y, x) = (1 + δ)−p q−3q/4 ϕ y|νJ x, (1 + δ)−1/4 B [1] ;   2 u(y, x) = (1 + δ)p q+3q/4 ϕ y|νJ x, (1 + δ)B [2] . We have chosen i such that [B [1] ]z,z ≤ Σz,z ≤ [B [2] ]z,z for all z ∈ {1, . . . , q}. We need to define ν such that [l, u] is an ǫ-bracket for f . 120 3.6. APPENDIX: TECHNICAL RESULTS — Proof that [l, u] is a bracket for f We are looking for a condition on νJ to have fu ≤ 1 and fl ≤ 1. We will use the following lemma to compute these ratios. Lemme 3.6.3. Let ϕ(.|µ1 , Σ1 ) and ϕ(.|µ2 , Σ2 ) be two Gaussian densities. If their variance matrices are assumed to be diagonal, with Σa = diag([Σa ]1,1 , . . . , [Σa ]q,q ) for a ∈ {1, 2}, such that [Σ2 ]z,z > [Σ1 ]z,z > 0 for all z ∈ {1, . . . , q}, then, for all y ∈ Rq ,   q p 1 1 (µ1 −µ2 ) ,..., [Σ ] −[Σ ϕ(y|µ1 , Σ1 ) Y [Σ2 ]z,z 21 (µ1 −µ2 )t diag [Σ2 ]1,1 −[Σ ] ] q,q q,q 1 1,1 2 1 p ≤ e . 
ϕ(y|µ2 , Σ2 ) [Σ1 ]z,z z=1 For the ratio f u we get: ϕ(y|βx, Σ) 1 f (y|x) = 2 q+3q/4 p u(y, x) (1 + δ) ϕ(y|νJ x, (1 + δ)B [2] ) q Y 1 bz 1 t [2] −1 ≤ (1 + δ)q/2 × e 2 (βx−νJ x) ((1+δ)B −Σ) (βx−νJ x) 2 q+3q/4 p [Σ]z,z (1 + δ) z=1 ≤(1 + δ) p2 q−q/4 1 (1 + δ)q/4 e 2 (βx−νJ x) 1 2 ≤(1 + δ)p q e 2δ (βx−νJ x) For the ratio l f t [B [2] ]−1 (βx−ν t (δB [2] )−1 (βx−ν J x) J x) (3.13) . we get: ϕ(y|νJ x, (1 + δ)−1/4 B [1] ) l(y, x) 1 = 2 f (y|x) (1 + δ)p q+3q/4 ϕ(y|βx, Σ) q Y 1 Σz,z 1 t [1] −1 (1 + δ)q/8 × e 2 (βx−νJ x) (Σ−B ) (βx−νJ x) ≤ 2 q+3q/4 p bz (1 + δ) z=1 ≤(1 + δ) −p2 q−3q/8 ≤(1 + δ)−p 2 q−3q/8 1 (1 + δ)q/4 × e 2 (βx−νJ x) 1 e 2(1−(1+δ)−1/4 ) t ((1−(1+δ)−1/4 )B [1] )−1 (βx−ν (βx−νJ x)t [B [1] ]−1 (βx−νJ x) We want to bound the ratios (3.13) and (3.14) by 1. Put c = these calculus. A necessary condition to obtain this bound is . 5(1−2−1/4 ) , 8 ||βx − νJ x||22 ≤ pqδ 2 (1 − 2−1/4 )A2Σ . Indeed, we want (1 + δ)−p 2 q−3q/8 1 e 2(1−(1+δ)−1/4 ) (1 + δ) −p2 q e (βx−νJ x)t [B [2] ]−1 (βx−νJ x) 1 (βx−νJ x)t [B [1] ]−1 (βx−νJ x) 2δAΣ ≤1 ≤ 1; which is equivalent to δ2 ||βx − νJ x||22 ≤ p2 q A2Σ ;  2  3 2 2 ||βx − νJ x||2 ≤ p q + q δ 2 (1 − 2−1/4 )AΣ . 4 As ||βx − νJ x||22 ≤ p||β − νJ ||22 ||x||∞ , and x ∈ [0, 1]p , we need to get ||β − νJ ||22 ≤ pqδ 2 (1 − 2−1/4 )A2Σ to have the wanted bound. Put     Aβ −Aβ , √ . U := Z ∩ √ cδAΣ cδAΣ J x) (3.14) and develop 121 CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE For all (z, j) ∈ J, choose uz,j = argmin βz,j − vz,j ∈U √ cδAΣ vz,j . (3.15) Define ν by for all (z, j) ∈ J c , νz,j = 0; √ for all (z, j) ∈ J , νz,j = cδAΣ uz,j . Then, we get a net for the mean vectors. — Proof that dH (l, u) ≤ ǫ We will work with the Hellinger distance. d2H (l, u) = = = = − Z √ √ 1 ( l − u)2 2 Rq Z √ 1 l + u − 2 lu 2 Rq i Z √ 1h 2 2 (1 + δ)−p q−3q/4 + (1 + δ)p q+3q/4 − ϕ l ϕu 2 Rq i 1h 2 2 (1 + δ)−p q−3q/4 + (1 + δ)p q+3q/4 2 2 !1/2 q p Y 2bi(z)+1 bi(z) (1 + δ)1/2 (1 + δ)−1/8 ∗ 1. (1 + δ)bi(z)+1 + (1 + δ)−1/4 bi(z) z=1 We have used the following lemma: Lemme 3.6.4. The Hellinger distance of two Gaussian densities with diagonal variance matrices is given by the following expression: d2H (ϕ(.|µ1 , Σ1 ), ϕ(.|µ2 , Σ2 )) !1/2 p q Y 2 [Σ1 ]z,z [Σ2 ]z,z =2 − 2 [Σ1 ]z,z + [Σ2 ]z,z z=1 ( ! )   1 1 × exp − (µ1 − µ2 )t diag (µ1 − µ2 ) 4 [Σ1 ]2z,z + [Σ2 ]2z,z z∈{1,...,q} As bi(z)+1 = (1 + δ)−1/2 bi(z) , we get that 2 bi(z)+1  (1 + δ)3/8 bi(z) (1 + δ)5/8 (1 + δ)−1/4 + (1 + δ)3/2 (1 + δ)−1/4 + (1 + δ)1/2 (1 + δ) 2 . = (1 + δ)−7/8 + (1 + δ)7/8  =2 Then d2H (l, u) i 1h 2 2 = (1 + δ)−(p q+3q/4) + (1 + δ)p q+3q/4 − 2  2 −7/8 (1 + δ) + (1 + δ)7/8 = cosh((p2 q + 3q/4) log(1 + δ)) − 2 cosh(7/8 log(1 + δ))−q/2 q/2 = cosh((p2 q + 3q/4) log(1 + δ)) − 1 + 1 − 2−q/2 cosh(7/8 log(1 + δ))−q/2 . 122 3.6. APPENDIX: TECHNICAL RESULTS We want to apply the Taylor formula to f (x) = cosh(x) − 1 to obtain an upper bound, and to g(x) = 1 − 2−q/2 cosh(x)−q/2 . Indeed, there exists c such that, on the good 2 2 interval, f (x) ≤ cosh(c) x2 and g(x) ≤ q 2 x2 . Then, and because log(1 + δ) ≤ δ, d2H (l, u) ≤ cosh((p2 q + 3q/4) log(1 + δ)) − 2 cosh(7/8 log(1 + δ))−q/2   49 2 2 2 ≤ (p q + 3q/4) δ cosh(α) + 128 ≤ 2(p2 q + 3q/4)2 δ 2 ≤ ǫ2 ; √ where ǫ ≥ 2(p2 q + 34 q)δ. — Step 3: Upper bound of the number of ǫ-brackets for FJ . From Step 1 and Step 2, the family   −(p2 q+3q/4) ϕ(y|ν x, (1 + δ)−1/4 B [1] )   l(y, x) = (1 + δ) J       p2 q+3q/4 ϕ(y|ν x, (1 + δ)B [2] )   u(y, x) = (1 + δ)   J      B [a] = diag(b[a] , . . . 
, b[a] ) where i is a permutation, for a ∈ {1, 2},  a i(1) i(q)  Bǫ (FJ ) = [a]       bi(z) = (1 + δ)1−ia (z)/2 AΣ for all z ∈ {1, . . . , q}       c   with ∀(z, j) ∈ J , ν = 0   z,j    √    ∀(z, j) ∈ J, ν = cδA u z,j Σ z,j (3.16) is an ǫ-bracket for FJ , for uz,j defined by (3.15). Therefore, an upper bound of the number of ǫ-brackets necessary to cover FJ is deduced from an upper bound of the cardinal of Bǫ (FJ ). |J|  N N X Y  2Aβ   2Aβ |J| X 2Aβ √ √ √ ≤ (N − 1). |Bǫ (FJ )| ≤ 1≤ cδAΣ cδAΣ cδAΣ l=2 l=2 (z,j)∈J As N ≤ 2(AΣ /aΣ +1/2) , δ we get |Bǫ (FJ )| ≤ 2 3.6.4  2Aβ √ cAΣ |J|  AΣ 1 + aΣ 2  δ −1−|J| . Calculus for the function φ From the Proposition 3.5.3, we obtain, for all ̟ > 0, Z ̟ 0 q n B H[.] (ǫ, S(K,J) , d⊗ H )dǫ We need to control R̟q log 0 1 ǫ  0 ̟∧1 0 s   1 dǫ log ǫ (3.17) dǫ, which is done in Maugis-Rabusseau and Meynet ([MMR12]). Lemme 3.6.5. For all ̟ > 0, Z s ̟ Z q p ≤ ̟ log(C) + D(K,J) s  # "   √ 1 1 log dǫ ≤ ̟ π + log . ǫ ̟ CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE 123 Then, according to (3.17), Z ̟ 0 q n B H[.] (ǫ, S(K,J) , d⊗ H )dǫ s  " q p √ ≤̟ log(C) + D(K,J) (̟ ∧ 1) π + log "s q ≤̟ D(K,J) log(C) √ + π+ D(K,J) s log  1 ̟∧1 1 ̟∧1 # Nevertheless, K log(C) ≤ log(2) + log(K) + log(2πe) 2    2Aβ AΣ 1 + + K|J| log √ + (K − 1) log(3) + K log aΣ 2 cAΣ " √ ≤D(K,J) log(2) + log( 2πe) + 1 + log(3)    2Aβ AΣ 1 + + log √ + log aΣ 2 cAΣ r       Aβ AΣ 1 πe 5/2 ≤D(K,J) 1 + log + + log 2 3 . AΣ a Σ 2 c  Then Z q B n H[.] (ǫ, S(K,J) , d⊗ H )dǫ 0 s r      q A πe A 1 β Σ 5/2 + + log 2 3 ≤̟ D(K,J)  1 + log AΣ AΣ 2 c s  # √ 1 + π + log ̟∧1 s  s  "   # q Aβ AΣ 1 1 + a + log + ≤̟ D(K,J) 1 + log AΣ a Σ 2 ̟∧1 s  " #  q 1 ≤̟ D(K,J) B(Aβ , AΣ , aΣ ) + log ; ̟∧1 ̟ with B(Aβ , AΣ , aΣ ) = 1 + s q p log(25/2 3 πe c ). and a = √ 3.6.5 Proof of the Proposition 3.5.5 π+ log We are interested in P (K,J)∈K×J  Aβ AΣ  AΣ 1 + aΣ 2  + a; e−w(K,J) . Considering w(K,J) = D(K,J) log  4epq (D(K,J) − q 2 ) ∧ pq  , # 124 3.6. APPENDIX: TECHNICAL RESULTS we could group models by their dimensions to compute this sum. Denote by CD the cardinal of models of dimension D. X e −D(K,J) log  4epq (D(K,J) −q 2 )∧pq K∈N∗ J∈P({1,...,q}×{1,...,p}) = 2 pq+q X e −D log  = −D 4 D=1  pq+q 2 ≤ 3.6.6 X −D 2 + = X CD e −D log  4epq (D−q 2 )∧pq  D≥1 4epq (D−q 2 ) D=1 2 pq+q X  epq D − q2  epq D − q2 −q2 +∞ X + D−q2 +∞ X + +∞ X e −D log  4epq pq  2pq D=pq+q 2 +1 e−D(log(4)+1)+pq log(2) D=pq+q 2 +1 2−D = 2. D=pq+q 2 +1 D=1 Proof of the Lemma 3.5.4 We know that D(K,J) = K − 1 + |J|K + Kq 2 . Then, CD = card{(K, J) ∈ N∗ × P({1, . . . , q} × {1, . . . , p}), D(K, J) = D} X X  pq  ≤ 1 2 |J| K(|J|+q +1)−1=D ∗ 1≤z≤q K∈N 1≤j≤p X  pq  ≤ 1 2 . |J| |J|≤pq∧(D−q ) ∗ |J|∈N If pq < D − q 2 , X  pq  pq 1 2 = 2 . |J| |J|≤pq∧(D−q ) |J|>0 Otherwise, according to the Proposition 2.5 in Massart, [Mas07], X  pq  2 1 2 ≤ f (D − q ) |J| |J|≤pq∧(D−q ) |J|>0 where f (x) = (epq/x)x is an increasing function on {1, . . . , pq}. As pq is an integer, we get the result. 125 CHAPTER 3. AN ORACLE INEQUALITY FOR THE LASSO-MLE PROCEDURE 3.6. APPENDIX: TECHNICAL RESULTS 126 Chapter 4 An oracle inequality for the Lasso-Rank procedure Contents 4.1 4.2 Introduction . . . . . . . . . . . . . . . . . . . . The model and the model collection . . . . . . 4.2.1 Linear model . . . . . . . . . . . . . . . . . . . 4.2.2 Mixture model . . . . . . . . . . . . . . . . . . 4.2.3 Generalized EM algorithm . . . . . . . . . . . . 4.2.4 The Lasso-Rank procedure . . . . . . . . . . . 4.3 Oracle inequality . . . . . . . . . . . . . . . . . . 
4.3.1 Framework and model collection . . . . . . . . 4.3.2 Notations . . . . . . . . . . . . . . . . . . . . . 4.3.3 Oracle inequality . . . . . . . . . . . . . . . . . 4.4 Numerical studies . . . . . . . . . . . . . . . . . 4.4.1 Simulations . . . . . . . . . . . . . . . . . . . . 4.4.2 Illustration on a real dataset . . . . . . . . . . 4.5 Appendices . . . . . . . . . . . . . . . . . . . . . 4.5.1 A general oracle inequality for model selection 4.5.2 Assumption Hm . . . . . . . . . . . . . . . . . Decomposition . . . . . . . . . . . . . . . . . . For the Gaussian . . . . . . . . . . . . . . . . . For the mixture . . . . . . . . . . . . . . . . . . 4.5.3 Assumption K . . . . . . . . . . . . . . . . . . 127 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 . . . . . 119 . . . . . . 119 . . . . . . 119 . . . . . . 120 . . . . . . 121 . . . . . 122 . . . . . . 122 . . . . . . 123 . . . . . . 123 . . . . . 125 . . . . . . 125 . . . . . . 126 . . . . . 126 . . . . . . 126 . . . . . . 128 . . . . . . 128 . . . . . . 129 . . . . . . 132 . . . . . . 134 128 4.1. INTRODUCTION In this chapter, we focus on a theoretical result for the Lasso-Rank procedure. Indeed, we get the same kind of results as in the previous chapter, with rank constraint on the estimator. We get a theoretical penalty for which the model selected by a penalized criterion, among the collection, satisfies an oracle inequality. We also illustrate in more details benefits of this procedure with simulated and benchmark dataset including rank structure. 4.1 Introduction The multivariate response regression model Y = βX + ǫ postulates a linear relationship between Y , the q×n matrix containing q responses for n subjects, and X, the p×n matrix on p predictor variables. The term ǫ is an q ×n matrix with independent columns, ǫi ∼ Nq (0, Σ) for all i ∈ {1, . . . , n}. The unknown q × p coefficient matrix β needs to be estimate. In a more general way, we could use finite mixture of linear model, which models the relationship between response and predictors, arising from different subpopulations: if the variable Y , conditionally to X, belongs to the cluster k, there exists βk ∈ Rq×p and Σk ∈ S++ such that q Y = βk X + ǫ, with ǫ ∼ Nq (0, Σk ). If we use this model to deal with high-dimensional data, the number of variables could be quickly much larger than the sample size, because and predictors and response variables could be high-dimensional. To solve this problem, we have to reduce the parameter set dimension. One way to cope the dimension problem is to select relevant variables, in order to reduce the number of unknowns. Indeed, all the information should not be interesting for the clustering, and could even be harmful. In a density estimation way, we could cite Pan and Shen, in [PS07], who focus on mean variable selection, Meynet and Maugis in [MMR12] who extend their procedure in high-dimension, Zhou et al., in [ZPS09], who use the Lasso estimator to regularize Gaussian mixture model with general covariance matrices, Sun et al., in [SWF12], who propose to regularize the k-means algorithm to deal with high-dimensional data, Guo et al, in [GLMZ10], who propose a pairwise variable selection method, among others. 
In a regression framework, we can use the Lasso estimator, introduced by Tibshirani in [Tib96], which is a sparse estimator: it penalizes the maximum likelihood estimator by the ℓ1-norm, which achieves sparsity, like the ℓ0-penalty, while leading to a convex optimization problem. Because we work with the multivariate linear model, to take the matrix structure into account we may prefer the group-Lasso estimator with variables grouped by columns, which selects columns rather than individual coefficients. This estimator was introduced by Zhou and Zhu in [ZZ10] in the general case. If we select |J| columns among the p possible ones, we have to estimate |J|q coefficients rather than pq for the matrix of regression coefficients, which can be smaller than nq if |J| is small enough. Another estimator which reduces the dimension is the low-rank estimator. Introduced by Izenman in [Ize75] for the linear model, and increasingly used over the last decades, among others by Bunea et al. in [BSW12] and Giraud in [Gir11], it estimates the regression matrix by a matrix of rank R, with R < p ∧ q. We then have to estimate R(p + q − R) coefficients, which can be smaller than nq.

In this chapter, we have chosen to combine these two estimators to provide a sparse and low-rank estimator in mixture models. This approach was introduced by Bunea et al. in [BSW12] in the case of a linear model with known noise covariance matrix. They present several variants, more or less computational, with more or less good theoretical guarantees, and they obtain an oracle inequality which says that, among a model collection, they are able to choose an estimator with the true rank and the true relevant variables. For this model, Ma and Sun in [MS14] derive a minimax lower bound, showing that such estimators attain nearly optimal rates of convergence adaptively for squared Schatten norm losses.

In this chapter, we consider finite mixtures of K linear models in high dimension. This model is studied in detail by Städler et al. for a real-valued response in [SBG10], and by Devijver for a multivariate response in [Dev14c]. We estimate β_k, for all k ∈ {1, . . . , K}, by a column-sparse and low-rank estimator: the Lasso estimator is used to select variables, and we then refit the estimate by a low-rank estimator restricted to the relevant variables. The procedure we propose is based on a modeling that recasts the variable selection, rank selection and clustering problems into a model selection problem. This procedure is developed in [Dev14c], with methodology, computational issues, simulations and data analysis. Here, we focus on the theoretical point of view, and develop simulations and data analyses dedicated to the low-rank issue. We construct a model collection, with models that are more or less sparse and with vectors of ranks taking smaller or larger values. Among this collection, we have to select a model; we use the slope heuristic, which is a non-asymptotic criterion. On the theoretical side, we obtain in this chapter an oracle inequality for the collection constructed by our procedure, which compares the performance of the selected model with that of the oracle for a specified penalty. This result is an extension of the work of Bunea et al. in [BSW12] to mixture models and to unknown covariance matrices (Σ_k)_{1≤k≤K}. Their results already suggest that combining a sparse estimator with a low-rank matrix can be worthwhile.
Indeed, whereas we would have to estimate q × p coefficients in each cluster for the regression matrix, we only get R(|J| + q − R) unknowns, which can be smaller than the number of observations nq if |J| and R are small.

Even though the oracle inequality we obtain in this chapter is an extension of the result of Bunea et al., we prove it in a very different way. Given the constructed model collection, we want to select a model that is as good as possible. For that, we use the slope heuristic, which leads to a penalty proportional to the dimension, and we select the model minimizing the penalized log-likelihood. Theoretically, we also construct a penalty proportional to the dimension (up to a logarithmic term). We provide an oracle inequality which compares, up to a constant, the Jensen-Kullback-Leibler divergence between our model and the true model to the Kullback-Leibler divergence between the oracle and the true model. In estimation terms, we thus do as well as possible. This oracle inequality is deduced from a general model selection theorem for maximum likelihood estimators developed by Massart in [Mas07]; the result is proved by controlling the bracketing entropy of the models. Since we work in a regression framework, we rather use the extension of this theorem proved in the article of Cohen and Le Pennec [CLP11]; and since our model collection is random, constructed by the Lasso estimator, we use the further extension proved in [Dev14b]. To illustrate this procedure computationally, we validate it on simulated and benchmark datasets: if the data have a low-rank structure, our methodology recovers it easily.

This chapter is organized as follows. In Section 4.2, we describe the finite mixture regression model used in this procedure and the main steps of the procedure. In Section 4.3, we present the main result of this chapter, which is an oracle inequality for the proposed procedure. Finally, in Section 4.4, we illustrate the procedure on simulated and benchmark datasets. Proof details of the oracle inequality are given in Appendix.

4.2 The model and the model collection

We introduce our procedure of estimation by a sparse and low-rank matrix in the linear model in Section 4.2.1, and extend it to mixture models in Section 4.2.2. We also present the main algorithm used in this context, and we describe the procedure we propose in Section 4.2.4.

4.2.1 Linear model

We consider observations (x_i, y_i)_{1≤i≤n}, realizations of random variables (X, Y) satisfying the linear model

Y = βX + ǫ,

where Y ∈ R^q are the responses, X ∈ R^p the regressors, β ∈ R^{q×p} an unknown matrix, and ǫ ∈ R^q the random errors, ǫ ∼ N_q(0, Σ), with Σ ∈ S_q^{++} a symmetric positive definite matrix. We work in high dimension, so q × p may be larger than the number of observations nq. To cope with this high-dimension issue, we construct an estimator of β which is sparse and has low rank. Moreover, to reduce the dimension of the covariance matrix, we compute a diagonal estimator of Σ. The procedure we propose can be described in two steps. First, we estimate the relevant columns of β thanks to the Lasso estimator: for λ > 0, we use

β̂^{Lasso}(λ) = \operatorname{argmin}_{β ∈ R^{q×p}} \left\{ ||Y − βX||_2^2 + λ ||β||_1 \right\},     (4.1)

where ||β||_1 = \sum_{j=1}^{p} \sum_{z=1}^{q} |β_{z,j}|. The covariance matrix is assumed to be unknown. For λ > 0, computing the Lasso estimator β̂^{Lasso}(λ), we deduce the relevant columns.
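As an informal illustration of this column-selection step, the following sketch uses scikit-learn's MultiTaskLasso as a stand-in for a column-wise group penalty; it is not the estimator (4.1) itself, whose penalty is entrywise, and the dimensions, regularization value and variable names are all illustrative.

import numpy as np
from sklearn.linear_model import MultiTaskLasso

# Illustrative dimensions: n observations, p predictors, q responses.
n, p, q = 100, 30, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))                   # rows are the observations x_i
beta = np.zeros((q, p))
beta[:, :4] = rng.normal(size=(q, 4))         # only the first 4 columns of beta are relevant
Y = X @ beta.T + 0.1 * rng.normal(size=(n, q))

# The (2,1)-type penalty of MultiTaskLasso zeroes whole columns of beta (whole
# predictors) jointly across the q responses, mimicking the column-selection role
# played here by the Lasso step.
fit = MultiTaskLasso(alpha=0.1).fit(X, Y)     # alpha plays the role of lambda
beta_hat = fit.coef_                          # shape (q, p)
J = np.flatnonzero(np.linalg.norm(beta_hat, axis=0) > 0)
print("estimated set of relevant columns J:", J)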
Restricted to these relevant columns, in the second step of the procedure we compute a low-rank estimator of β, say of rank at most R. Indeed, as explained by Giraud in [Gir11], we restrict the maximum likelihood estimator to have rank at most R by keeping only the R largest singular values in the corresponding decomposition, which gives an explicit formula. This two-step procedure leads to an estimator of β which is sparse and has low rank, so that the dimension is reduced in two ways. We then refit the covariance matrix by the maximum likelihood estimator. This estimator is studied by Bunea et al. in [BSW12] (their method 3). Let us extend it to mixture models.

4.2.2 Mixture model

We observe n independent couples (x, y) = (x_i, y_i)_{1≤i≤n} of random variables (X, Y), with Y ∈ R^q and X ∈ R^p. We estimate the unknown conditional density s^* by a multivariate Gaussian mixture regression model. In our model, if the observation i belongs to the cluster k, we assume that there exist β_k ∈ R^{q×p} and Σ_k ∈ S_q^{++} such that y_i = β_k x_i + ǫ_i, where ǫ_i ∼ N_q(0, Σ_k). Thus, the random response variable Y ∈ R^q is explained by a set of explanatory variables X ∈ R^p through a mixture of linear regression-type models. Let us make the assumptions more precise.
— The variables Y_i are independent conditionally on the X_i, for all i ∈ {1, . . . , n};
— we let Y_i | X_i = x_i ∼ s_ξ^K(y|x_i) dy, with

s_ξ^K(y|x) = \sum_{k=1}^{K} \frac{π_k}{(2π)^{q/2} \det(Σ_k)^{1/2}} \exp\left( -\frac{(y − β_k x)^t Σ_k^{-1} (y − β_k x)}{2} \right),     (4.2)

ξ = (π_1, . . . , π_K, β_1, . . . , β_K, Σ_1, . . . , Σ_K) ∈ Ξ_K = Π_K × (R^{q×p})^K × (S_q^{++})^K,

Π_K = \left\{ (π_1, . . . , π_K) ; π_k > 0 \text{ for } k ∈ \{1, . . . , K\} \text{ and } \sum_{k=1}^{K} π_k = 1 \right\},

where S_q^{++} is the set of symmetric positive definite matrices on R^q.

For all k ∈ {1, . . . , K}, β_k is the matrix of regression coefficients and Σ_k is the covariance matrix in the mixture component k; the π_k are the mixture proportions. For all k ∈ {1, . . . , K} and all z ∈ {1, . . . , q}, [β_k x]_z = \sum_{j=1}^{p} [β_k]_{z,j} x_j is the z-th component of the mean of the mixture component k for the conditional density s_ξ^K(.|x).

To detect the relevant variables, we generalize the Lasso estimator defined by (4.1) to mixture models. Indeed, we penalize the empirical contrast by an ℓ1-penalty on the mean parameters, proportional to

||P_k β_k||_1 = \sum_{j=1}^{p} \sum_{z=1}^{q} |(P_k β_k)_{z,j}|,

where P_k is defined for all k ∈ {1, . . . , K} by the Cholesky decomposition P_k^t P_k = Σ_k^{-1}. Then, we consider

ξ̂_K^{Lasso}(λ) = \operatorname{argmin}_{ξ=(π,β,Σ) ∈ Ξ_K} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log(s_ξ^K(y_i|x_i)) + λ \sum_{k=1}^{K} π_k ||P_k β_k||_1 \right\}.     (4.3)

Remark that the penalty takes the mixture weights into account. To reduce the dimension and simplify computations, we estimate each Σ_k by a diagonal matrix, so that each P_k is also estimated by a diagonal matrix, for all k ∈ {1, . . . , K}. As in Section 4.2.1, we then refit the estimator, restricted to the relevant columns, with a low-rank estimator. In Section 4.2.3, we extend the EM algorithm to compute these two estimators.

4.2.3 Generalized EM algorithm

Computationally, we use two generalized EM algorithms in order to deal with high-dimensional data and to get a sparse and low-rank estimator; let us give some details about them. The EM algorithm was introduced by Dempster et al. in [DLR77]. It alternates two steps until convergence: an expectation step, which clusters the data, and a maximization step, which updates the estimates.
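The low-rank refitting used below (keeping only the R largest singular values) can be sketched as follows; this is a simplified single-cluster illustration with an ordinary least-squares fit on the relevant columns, and all dimensions and names are illustrative.

import numpy as np

def rank_truncate(B, R):
    """Keep only the R largest singular values of B (low-rank refitting step)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s[R:] = 0.0
    return (U * s) @ Vt

# Hypothetical example: least-squares fit on the relevant columns, then truncation.
n, p_rel, q, R = 100, 6, 5, 2
rng = np.random.default_rng(1)
X_J = rng.normal(size=(n, p_rel))                 # design restricted to the columns in J
B_true = rng.normal(size=(q, R)) @ rng.normal(size=(R, p_rel))
Y = X_J @ B_true.T + 0.1 * rng.normal(size=(n, q))

B_ols = np.linalg.lstsq(X_J, Y, rcond=None)[0].T  # ordinary least squares, shape (q, p_rel)
B_R = rank_truncate(B_ols, R)                     # rank-R estimator
print(np.linalg.matrix_rank(B_R))                 # 2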
In our procedure, we want to know which columns are relevant, then we have to compute (4.3), and we want to refit the estimators by a maximum likelihood under low rank constraint estimator. From the Lasso estimator (4.3), we could use a generalization of the EM algorithm described in [Dev14c]. From the estimate of β, we could deduce which columns are relevant. The second algorithm we use leads to determine βk restricted on relevant columns, for all k ∈ {1, . . . , K}, with rank Rk . We alternate two steps, E-step and M-step, until relative convergence of the parameters and of the likelihood. We restrict the dataset to relevant columns, and construct an estimator of size q × |J| rather than q × p, where βk has for rank Rk , for all k ∈ {1, . . . , K}. Explain the both steps at iteration (ite) ∈ N∗ . 4.2. THE MODEL AND THE MODEL COLLECTION 132 — E-step: compute for k ∈ {1, . . . , K}, i ∈ {1, . . . , n}, the expected value of the loglikelihood function, γk τi,k = Eθ(ite) ([Zi ]k |Y ) = PK l=1 γl where (ite) γl = πl (ite) det Σl exp t    (ite) (ite) y −β (ite) x − 12 yi −βl xi (Σ−1 i i l ) l for l ∈ {1, . . . , K}, and Zi is the component-membership variable for an observation i. — M-step: — To get estimation in linear model, we assign each observation in its estimated cluster, by the MAP principle. We could compute this thanks to the E-step. Indeed,yi is assigned to the component number argmax τi,k . k∈{1,...,K} — (ite) Then, we could define β˜k = (xt|k x|k )−1 xt|k y|k , in which x|k and y|k are the sample (ite) (ite) restriction to the cluster k. We decompose β̃k in singular values such that β̃k = t U SV with S = diag(s1 , . . . , sq ) and s1 ≥ s2 ≥ . . . ≥ sq the singular values. Then, the (ite) (ite) estimator β̂k is defined by β̂k = U SRk V t with SRk = diag(s1 , . . . , sRk , 0, . . . , 0). We do it for all k ∈ {1, . . . , K}. 4.2.4 The Lasso-Rank procedure The procedure we propose, which is particularly interesting in high-dimension, could be decomposed into three main steps. First, we construct a model collection, with models more or less sparse and with more or less components. Then, we refit estimations with the maximum likelihood estimator, under rank constraint. Finally, we select a model with the slope heuristic. Model collection construction Fix K ∈ K. To get various relevant columns, we construct a data-driven grid of regularization parameters GK , coming from EM algorithm formula. See [Dev14c] for more details. For each λ ∈ GK , we could compute the Lasso estimator (4.3), and deduce relevant variables set, denoted by J(K,λ) . Then, varying λ ∈ GK , and K ∈ K, we construct J = ∪K∈K ∪λ∈GK J(K,λ) the set of relevant variables sets. Refitting We could also define a low rank estimator ŝ(K,J,R) restricted to relevant variables detected by the Lasso estimator, indewed by J. From this procedure, we construct a model with K clusters, |J| relevant columns and matrix of regression coefficients of ranks R ∈ NK , as described by the next model S(K,J,R) . (K,J,R) S(K,J,R) = {y ∈ Rq |x ∈ Rp 7→ sξ (y|x)} (4.4) where (K,J,R) sξ (y|x) = K X πk det(Pk ) k=1 (2π)q/2   1 Rk [J] t −1 Rk [J] exp − (y − (βk ) x) Σk (y − (βk ) x) ; 2 RK ξ = (π1 , . . . , πK , β1R1 , . . . , βK , Σ1 , . . . , ΣK ) ∈ Ξ(K,J,R) ; K Ξ(K,J,R) = ΠK × Ψ(K,J,R) × (S++ q ) ; n o RK [J] Ψ(K,J,R) = ((β1R1 )[J] , . . . , (βK ) ) ∈ (Rq×p )K | for all k ∈ {1, . . . , K}, Rank(βk ) = Rk . 133 CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK PROCEDURE Varying K ∈ K ⊂ N∗ , J ∈ J ⊂ P({1, . . . , p}), and R ∈ R ⊂ {1, . . . 
, |J| ∧ q}K , we get a model collection with various number of components, relevant columns and matrix of regression coefficients. Model selection Among this model collection, during the last step, a model has to be selected. As in Maugis and Michel in [MM11b], and in Maugis and Meynet in [MMR12], among others, a non asymptotic penalized criterion is used. The slope heuristic was introduced by Birgé and Massart in [BM07], and developed in practice by Baudry et al. in [BMM12] with the Capushe package. To use it in our context, we have to extend theoretical result to determine the penalty shape in the high-dimensional context, with a random model collection, in a regression framework. The main result is described in the next section, whereas proof details are given in Appendix. 4.3 Oracle inequality In a theoretical point of view, we want to ensure that the slope heuristic which penalizes the loglikelihood with the model dimension will select a good model. We follow the approach developed by Massart in [Mas07] which consists of defining a non-asymptotic penalized criterion, leading to an oracle inequality. In the context of regression, Cohen and Le Pennec, in [CLP11], and Devijver in [Dev14b], propose a general model selection theorem for maximum likelihood estimation. The result we get is a theoretical penalty, for which the model selected is as good as the best one, according to the Kullback-Leibler loss. 4.3.1 Framework and model collection Among the model collection constructed by the procedure developed in Section 4.2.2, with various rank and various sparsity, we want to select an estimator which is close to the best one. The oracle is by definition the model belonging to the collection which minimizes the contrast with the true model. In practice, we do not have access to the true model, then we could not know the oracle. Nevertheless, the goal of the model selection step of our procedure is to be nearest to the oracle. In this section, we present an oracle inequality, which means that if we have penalized the log-likelihood in a good way, we will select a model which is as good as the oracle, according to the Kullback-Leibler loss. We consider the model collection defined by (4.4). Because we work in high-dimension, p could be big, and it will be time-consuming to test all the parts of {1, . . . , p}. We construct a sub-collection denoted by J L , which is constructed by the Lasso estimator, which is also random. This step is explained in more details in [Dev14c]. Moreover, to get the oracle inequality, we assume that the parameters are bounded:  (K,J,R) B (4.5) S(K,J,R) = sξ ∈ S(K,J,R) ξ = (π, β, Σ), for all k ∈ {1, . . . , K}, Σk = diag([Σk ]1,1 , . . . , [Σk ]q,q ), for all z ∈ {1, . . . , q}, aΣ ≤ [Σk ]z,z ≤ AΣ , for all k ∈ {1, . . . , K}, βkRk = Rk X r=1 [σk ]r [utk ].,r [vk ]r,. ,  for all r ∈ {1, . . . , Rk }, [σk ]r < Aσ . 134 4.3. ORACLE INEQUALITY Remark that it is the singular value decomposition of βk is the singular value decomposition, with ([σk ]r )1≤r≤Rk the singular values, and uk and vk unit vectors, for k ∈ {1, . . . , K}. We also assume that covariates belong to an hypercube: without restrictions, we could assume that X ∈ [0, 1]p . Fixing K the possible number of components, J L the relevant columns set constructed by the Lasso estimator, and R the possible vector of ranks, we get a model collection [ [ [ B S(K,J,R) . 
(4.6) K∈K J∈J L R∈R 4.3.2 Notations Before state the main theorem which leads to the oracle inequality for the model collection (4.6), we need to define some metrics used to compare the conditional densities. First, the Kullback-Leibler divergence is defined by Z   s(y)   s(y)dy if sdy << tdy; log t(y) (4.7) KL(s, t) = R   + ∞ otherwise; for s and t two densities. To deal with regression data, for observed covariates (x1 , . . . , xn ), we define ! n 1X ⊗n KL (s, t) = E KL(s(.|xi ), t(.|xi )) (4.8) n i=1 for s and t two densities. We also define the Jensen-Kullback-Leibler divergence, first introduced in Cohen and Le Pennec in [CLP11], by 1 JKLρ (s, t) = KL(s, (1 − ρ)s + ρt) ρ for ρ ∈ (0, 1), s and t two densities. The tensorized one is defined by ! n X 1 n JKL⊗ JKLρ (s(.|xi ), t(.|xi )) . ρ (s, t) = E n i=1 Note that those divergences are not metrics, they do not satisfy the triangular inequality and they are not symmetric, but they are also wildly used in statistics to compare two densities. 4.3.3 Oracle inequality Let state the main theorem. Theorem 4.3.1. Assume that we observe (xi , yi )1≤i≤n ∈ ([0, 1]p ×Rq )n with unknown conditional density s∗ . Let M = K × J × R and ML = K × J L × R, where J L is constructed by the Lasso B B estimator. For (K, J, R) ∈ M, let s̄(K,J,R) ∈ S(K,J,R) , where S(K,J,R) is defined by (4.5), such that, for δKL > 0, δKL inf KL⊗n (s∗ , t) + KL⊗n (s∗ , s̄(K,J,R) ) ≤ B n t∈S(K,J,R) and there exists τ > 0 such that s̄(K,J,R) ≥ e−τ s∗ . (4.9) CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK PROCEDURE 135 For (K, J, R) ∈ M, consider the rank constraint log-likelihood minimizer ŝ(K,J,R) in S(K,J,R) , satisfying ) ( n   1X (K,J,R) (K,J,R) log sξ (yi |xi ) . ŝ = argmin − n (K,J,R) s ∈S B ξ i=1 (K,J,R) B Denote by D(K,J,R) the dimension of the model S(K,J,R) . Let pen : M → R+ defined by, for all (K, J, R) ∈ M, (   D(K,J,R) 2 D(K,J,R) 2 pen(K, J, R) ≥ κ B (Aσ , AΣ , aΣ ) ∧ 1 2B (Aσ , AΣ , aΣ ) − log n n !) K X 4epq + Rk + (1 ∨ τ ) log D(K,J,R)−q2 ∧ pq k=1 where κ > 0 is an absolute constant, and B(Aσ , AΣ , aΣ ) is an absolute constant, depending on parameter bounds. ˆ Then, the estimator ŝ(K̂,J,R̂) , with ( ) n X 1 ˆ R̂) = argmin (K̂, J, − log(ŝ(K,J,R) (yi |xi )) + pen(K, J, R) , n (K,J,R)∈ML i=1 satisfies   ˆ R̂) ∗ (K̂,J, n E JKL⊗ (s , ŝ ) ρ   inf ≤CE inf (K,J,R)∈ML t∈S(K,J,R) ⊗n KL  (1 ∨ τ ) (s , t) + pen(K, J, R) + n ∗  (4.10) some absolute positive constant C. The proof of the Theorem 4.3.1 is given in Section 4.5. Note that condition (4.9) leads to control the random model collection. The mixture parameters are bounded in order to construct brackets over S(K,J,R) , and thus to upper bound the entropy. The inequality (4.10) not exactly an oracle inequality, since the Jensen-Kullback-Leibler risk is upper bounded by the KullbackLeibler divergence. Note that we use the Jensen-Kullback-Leibler divergence rather than the Kullback-Leibler divergence, because it is bounded. This boundedness turns out to be crucial to control the loss of the penalized maximum likelihood estimator under mild assumptions on the complexity of the model and on parameters. Because we are looking at a random sub-collection of models which is small enough, our estimator ŝ(K,J,R) is attainable in practice. Moreover, it is a non-asymptotic result, which allows us to study cases for which p increases with n. We could compare our inequality with the bound of Bunea et al, in [BSW12], who computed a procedure similar to ours, for a linear model. 
Thanks to consistent group selection for the group-Lasso estimator, they obtain adaptivity of the estimator to an optimal rate, and their estimators perform the bias-variance trade-off among all reduced-rank estimators. Nevertheless, their results require some assumptions, for instance mutual coherence of X^t X, which postulates that the off-diagonal elements have to be small. Some assumptions on the design are thus required, whereas our result only needs bounded parameters and bounded covariates.

4.4 Numerical studies

We illustrate our procedure on simulated and benchmark datasets, to highlight the advantages of our method. We adapt the simulation part of the article of Bunea et al. [BSW12]: we proceed in the same way to get a sparse and low-rank estimator, but we add a mixture framework, to be consistent with our clustering method and to gain flexibility.

4.4.1 Simulations

To illustrate our procedure, we use simulations adapted from the article of Bunea et al. [BSW12], extended to mixture models. The design matrix X has independent and identically distributed rows X_i, drawn from a multivariate Gaussian N_p(0, Σ) with Σ = ρI, ρ > 0. We consider a mixture of 2 components. Depending on the cluster, the coefficient matrix β_k has the form

β_k = \begin{pmatrix} b_k B^0 B^1 \\ 0 \end{pmatrix}

for k ∈ {1, 2}: the block corresponding to the |J| relevant variables equals b_k B^0 B^1, with B^0 a |J| × R_k matrix and B^1 an R_k × q matrix, and the remaining entries of β_k are zero. All entries of B^0 and B^1 are independent and identically distributed according to N(0, 1). The noise matrix ǫ has independent N(0, 1) entries; let ǫ_i denote its i-th row. The proportion vector is π = (1/2, 1/2), i.e. both clusters have the same probability. Each row Y_i of Y is then generated, if the observation i belongs to the cluster k, as Y_i = β_k X_i + ǫ_i, for all i ∈ {1, . . . , n}. This setup contains many noise features, but the relevant ones lie in a low-dimensional subspace. We report two settings:
— p > n: n = 50, |J| = 6, p = 100, q = 10, R = (3, 3), ρ = 0.1, b = (3, −3);
— p < n: n = 200, |J| = 6, p = 10, q = 10, R = (3, 3), ρ = 0.01, b = (3, −3).
These setups show that variable selection, without taking the rank information into consideration, may be suboptimal, even if the correlations between predictors are low. Each model is simulated 20 times, and Table 4.1 summarizes our findings. We evaluate the prediction accuracy of each estimator β̂ by the Kullback-Leibler divergence (KL) using a test sample at each run. We also report the median rank estimate (denoted by R̂) over all runs, the rate of true variables not included (denoted by M, for misses) and the rate of incorrectly included variables (FA, for false actives). Ideally, we are looking for a model with low KL, low M and low FA.

         KL      R̂          M     FA     ARI
p > n    19.03   [2.8, 3]    0     20     0.95
p < n    3.28    [3, 3]      0     0.6    0.99

Table 4.1: Performance of our procedure: mean values, over 20 simulations, of the Kullback-Leibler divergence between the selected model and the true model (KL), the estimated rank of the selected model in each cluster (R̂), the missed variables (M), the false relevant variables (FA), and the ARI.
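A minimal sketch of how such a two-cluster design can be generated is given below; it assumes the block structure described above (β_k is stored here as a p × q matrix so that the |J| relevant rows carry b_k B^0 B^1), uses the p > n setting, and all function and variable names are illustrative.

import numpy as np

def simulate(n=50, p=100, q=10, J=6, R=(3, 3), rho=0.1, b=(3, -3), seed=0):
    """Two-cluster design with sparse, low-rank coefficient matrices (illustrative)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=np.sqrt(rho), size=(n, p))   # rows X_i ~ N(0, rho * I_p)
    labels = rng.integers(0, 2, size=n)               # clusters drawn with pi = (1/2, 1/2)
    betas = []
    for k in range(2):
        B0 = rng.normal(size=(J, R[k]))               # |J| x R_k
        B1 = rng.normal(size=(R[k], q))               # R_k x q
        beta_k = np.zeros((p, q))
        beta_k[:J, :] = b[k] * B0 @ B1                # only the |J| relevant rows are non-zero
        betas.append(beta_k)
    eps = rng.normal(size=(n, q))                     # independent N(0, 1) noise entries
    Y = np.vstack([X[i] @ betas[labels[i]] + eps[i] for i in range(n)])
    return X, Y, labels, betas

X, Y, labels, betas = simulate()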
We can draw the following conclusions from Table 4.1. In the low-dimensional framework we get very good results: even though any estimator could be used there, since there is no dimensionality problem, our procedure additionally recovers the matrix structure induced by the model. Over the 20 simulations, we get almost exact clustering, and the Kullback-Leibler divergence between the constructed model and the true model is very low. In the high-dimensional case, when p is larger than n, we also get good results. We find the correct structure, selecting the relevant variables (our model includes some false relevant variables, but no missed variables) and selecting the true ranks. We remark that the false relevant variables have low values. Compared with a procedure that does not reduce the rank, we thus achieve an additional dimension reduction.

4.4.2 Illustration on a real dataset

In this section, we apply our procedure to a real dataset. The Norwegian paper quality data were obtained from a controlled experiment carried out at a paper factory in Norway to uncover the effect of three control variables X1, X2, X3 on the quality of the paper, which was measured by 13 response variables. Each of the control variables Xi takes values in {−1, 0, 1}. To account for possible interactions and nonlinear effects, second-order terms were added to the set of predictors, yielding X1, X2, X3, X1², X2², X3², X1X2, X1X3, X2X3, and the intercept term. There were 29 observations with no missing values on all response and predictor variables. The Box-Behnken design of the experiment and the resulting data are described in Aldrin [Ald96] and Izenman [Ize75]; Bunea et al. [BSW12] also study this dataset. We center the responses and the predictors. The dataset clearly indicates that dimension reduction is possible, making it a typical application for reduced-rank regression methods; moreover, our method exhibits different clusters among this sample. We construct a model collection varying the number of clusters in K = {2, . . . , 5}, and select a model with 2 clusters. We select all variables except X1X2 and X2X3, which is consistent with the comments of Bunea et al. In the two clusters, we get two mean matrices, with ranks equal to 2 and 4: one cluster describes the average behaviour (with rank equal to 2), whereas the other cluster contains more atypical values.

4.5 Appendices

In these appendices, we give the details of the proof of Theorem 4.3.1. It derives from a general model selection theorem, stated in Section 4.5.1 and proved in Chapter 3. The proof of Theorem 4.3.1 then amounts to checking the assumptions Hm, Sepm and K described in Section 4.5.1.

4.5.1 A general oracle inequality for model selection

Model selection goes back to the AIC and BIC criteria. A non-asymptotic theory was developed by Birgé and Massart in [BM07]: under the assumptions detailed here, one gets an oracle inequality for the maximum likelihood estimator among a model collection. Cohen and Le Pennec, in [CLP11], generalize this theorem to the regression framework. We use a further generalization of this theorem, detailed in [Dev14b], because we consider a random collection of models. Let us state the main theorem. We consider a model collection (S_m)_{m∈M}, indexed by M. Let (X, Y) ∈ X × Y. We begin by describing the assumptions. First, we impose a structural assumption: a bracketing entropy condition on the model with respect to the tensorized Hellinger divergence

(d_H^{\otimes n})^2(s,t) = E\left[ \frac{1}{n} \sum_{i=1}^{n} d_H^2\big(s(\cdot|x_i), t(\cdot|x_i)\big) \right],

for two densities s and t. A bracket [l, u] is a pair of functions such that for all (x, y) ∈ X × Y, l(y, x) ≤ s(y|x) ≤ u(y, x).
APPENDICES n For ǫ > 0, the bracketing entropy H[.] (ǫ, S, d⊗ H ) of a set S is defined as the logarithm of the ⊗n minimum number of brackets [l, u] of width dH (l, u) smaller than ǫ such that every densities of S belong to one of these brackets. Let m ∈ M. Assumption Hm . There is a non-decreasing function φm such that ̟ 7→ 1/̟φm (̟) is nonincreasing on (0, +∞) and for every ̟ ∈ R+ and every sm ∈ Sm , Z ̟q n H[.] (ǫ, Sm (sm , ̟), d⊗ H )dǫ ≤ φm (̟); 0 n where Sm (sm , ̟) = {t ∈ Sm , d⊗ H (t, sm ) ≤ ̟}. The model complexity Dm is then defined as 2 n̟m with ̟m the unique solution of √ 1 φm (̟) = n̟. ̟ (4.11) Remark that the model complexity depends on the bracketing entropies not of the global models Sm but of the ones of smaller localized sets. It is a weaker assumption. For technical reasons, a separability assumption is also required. ′ ′ ′ Assumption Sepm . There exists a countable subset Sm of Sm and a set Ym with λ(Y \Ym ) = 0, for λ the Lebesgue measure, such that for every t ∈ Sm , there exists a sequence (tl )l≥1 of elements ′ ′ of Sm such that for every x and every y ∈ Ym , log(tl (y|x)) goes to log(t(y|x)) as l goes to infinity. According to this assumption, we could work with a countable subset. We also need an information theory type assumption on our model collection. We assume the existence of a Kraft-type inequality for the collection. Assumption K. There is a family (wm )m∈M of non-negative numbers such that X e−wm ≤ Ω < +∞. m∈M Then, we could write our main global theorem to get an oracle inequality in regression framework, with a random collection of models. Theorem 4.5.1. Assume we observe (xi , yi )1≤i≤n ∈ ([0, 1]p × Rq )n with unknown conditional density s∗ . Let S = (Sm )m∈M be at most countable collection of conditional density sets. Let assumption K holds while assumptions Hm and Sepm hold for every models Sm ∈ S. Let δKL > 0, and s̄m ∈ Sm such that KL⊗n (s∗ , s̄m ) ≤ inf KL⊗n (s∗ , t) + t∈Sm δKL ; n and let τ > 0 such that s̄m ≥ e−τ s∗ . (4.12) Introduce (Sm )m∈M̌ some random sub-collection of (Sm )m∈M . Consider the collection (ŝm )m∈M̌ of η-log-likelihood minimizer in Sm , satisfying ! n n X X − log(ŝm (yi |xi )) ≤ inf − log(sm (yi |xi )) + η. i=1 sm ∈Sm i=1 139 CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK PROCEDURE Then, for any ρ ∈ (0, 1) and any C1 > 1, there are two constants κ0 and C2 depending only on ρ and C1 such that, as soon as for every index m ∈ M, (4.13) pen(m) ≥ κ(Dm + (1 ∨ τ )wm ) with κ > κ0 , and where the model complexity Dm is defined in (4.11), the penalized likelihood estimate ŝm̂ with m̂ ∈ M̌ such that ! n n X X ′ − log(ŝm̂ (yi |xi )) + pen(m̂) ≤ inf − log(ŝm (yi |xi )) + pen(m) + η m∈M̌ i=1 i=1 satisfies ∗ n E(JKL⊗ ρ (s , ŝm̂ )) ≤ C1 E  inf inf KL m∈M̌ t∈Sm Ω2 η ′ + C2 (1 ∨ τ ) n ⊗n + pen(m) (s , t) + 2 n +η . n ∗  (4.14) Remark 4.5.2. We get that, among a random model collection, we are able to choose a model which is as good as the oracle, up to a constant C1 , and some additive terms being around 1/n. This result is non-asymptotic, and gives a theoretical penalty to select this model. Remark 4.5.3. The proof of this theorem is detailed in [Dev14b]. Nevertheless, we could give the main ideas to understand the assumptions. From assumptions Hm and Sepm , we could use maximal inequalities which lead to, except on a set of probability less than e−wm′ −w , for all w, a control of the ratio of the centered empirical process of log(ŝm′ ) over the Hellinger distance between s∗ and ŝm′ , this control being around 1/n. 
Thanks to Bernstein inequality, satisfied according to the inequality (4.12), and thanks to the assumption K, we get the oracle inequality. Now, to prove Theorem 4.3.1, we have to satisfy assumptions Hm and K, assumption Sepm being true for our conditional densities. 4.5.2 Assumption Hm Decomposition As done in Cohen and Le Pennec [CLP11], we could decompose the entropy by ⊗n B n H[.] (ǫ, S(K,J,R) , d⊗ H ) ≤ H[.] (ǫ, ΠK , dH ) + K X k=1 n H[.] (ǫ, F(J,Rk ) , d⊗ H ) (4.15) 140 4.5. APPENDICES where B S(K,J,R) = Ψ(K,J,R)       P (K,J,R) Rk [J] y ∈ Rq |x ∈ Rp 7→ sξ (y|x) = K π ϕ y|(β ) x, Σ k k k=1 k   RK [J] R1 [J] ξ = π1 , . . . , πK , (β1 ) , . . . , (βK ) , Σ1 , . . . , ΣK ∈ Ξ(K,J,R)    Ξ(K,J,R) = ΠK × Ψ̃(K,J,R) × ((aΣ , AΣ ]q )K n o RK [J] = ((β1R1 )[J] , . . . , (βK ) ) ∈ (Rq×p )K |Rank(βk ) = Rk ; ( βkRk F(J,R) =    RK [J] ((β1R1 )[J] , . . . , (βK ) ) ∈ Ψ(K,J,R) for all k ∈ {1, . . . , K}, Ψ̃(K,J,R) = ΠK =     ( ( = Rk X σr utr vr , r=1 (π1 , . . . , πK ) ∈ (0, 1)K ; ϕ(.|(β R )[J] X, Σ); β R = with σr < Aσ for all r ∈ {1, . . . , Rk } K X k=1 R X ) πk = 1 ; σr utr vr , with σr < Aσ , r=1 Σ = diag(Σ1,1 , . . . , Σq,q ) ∈ [aΣ , AΣ ]q ϕ the Gaussian density. )      For the proportions, it is known that (see Wasserman and Genovese in [GW00])  K−1 ! 3 ⊗n . H[.] (ǫ, ΠK , dH ) ≤ log K(2πe)K/2 ǫ Look after the Gaussian entropy. For the Gaussian We want to bound the integrated entropy. For that, first we have to construct some brackets to recover Sm . Fix f ∈ Sm . We are looking for functions l and u such that l ≤ f ≤ u. Because f is a Gaussian, l and u are dilatations of Gaussian. We then have to determine the mean, the variance and the dilatation coefficient of l and u. We need the both following lemmas to construct these brackets. Lemme 4.5.4. Let ϕ(.|µ1 , Σ1 ) and ϕ(.|µ2 , Σ2 ) be two Gaussian densities. If their variance matrices are assumed to be diagonal, with Σa = diag([Σa ]1,1 , . . . , [Σa ]q,q ) for a ∈ {1, 2}, such that [Σ2 ]z,z > [Σ1 ]z,z > 0 for all z ∈ {1, . . . , q}, then, for all x ∈ Rq ,   q p 1 1 (µ1 −µ2 ) ,..., ϕ(x|µ1 , Σ1 ) Y [Σ2 ]z,z 21 (µ1 −µ2 )t diag [Σ2 ]1,1 −[Σ [Σ2 ]q,q −[Σ1 ]q,q 1 ]1,1 p ≤ . e ϕ(x|µ2 , Σ2 ) [Σ1 ]z,z z=1 Lemme 4.5.5. The Hellinger distance of two Gaussian densities with diagonal variance matrices 141 CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK PROCEDURE is given by the following expression: d2H (ϕ(.|µ1 , Σ1 ), ϕ(.|µ2 , Σ2 )) !1/2 p q Y 2 [Σ1 ]z,z [Σ2 ]z,z =2 − 2 [Σ1 ]z,z + [Σ2 ]z,z z=1 ( ! )   1 1 × exp − (µ1 − µ2 )t diag (µ1 − µ2 ) 4 [Σ1 ]z,z + [Σ2 ]z,z z∈{1,...,q} To get an ǫ-bracket for the densities, we have to construct a δ-net for the variance and the mean, δ to be specify later. — Step 1: construction of a net for the variance j Let ǫ ∈ (0, 1], and δ = √ǫ2q . Let bj = (1 + δ)1− 2 AΣ . For 2 ≤ j ≤ N , we have S S [aΣ , AΣ ] = [bN , bN −1 ] . . . [b3 , b2 ], where N is chosen to recover everything. We want that ⇔ ⇔ aΣ = (1 + δ)1−N/2 AΣ   aΣ N log = 1− log(1 + δ) AΣ 2 √ 2 log( AaΣΣ 1 + δ) N= . log(1 + δ) We want N to be an integer, then N =  A 2 log( a Σ Σ √ 1+δ) log(1+δ)  . We get a regular net for the variance. We could let B = diag(bi(1) , . . . , bi(q) ), close to Σ (and deterministic, independent of the values of Σ), where i is a permutation such that bi(z)+1 ≤ Σz,z ≤ bi(z) for all z ∈ {1, . . . , q}. — Step 2: construction of a net for the mean PRvectors We use the singular decomposition of β, β = r=1 σr utr vr , with (σr )1≤r≤R the singular values, and (ur )1≤r≤R and (vr )1≤r≤R unit vectors. 
Those vectors are also bounded. We are looking for l and u such that dH (l, u) ≤ ǫ, and l ≤ f ≤ u. We will use a dilatation of a Gaussian to construct such an ǫ-bracket of ϕ. We let l(x, y) = (1 + δ)−(p u(x, y) = (1 + δ)p 2 qR+3q/4) 2 qR+3q/4 ϕ(y|νJ,R x, (1 + δ)−1/4 B 1 ) ϕ(y|νJ,R x, (1 + δ)B 2 ) where B 1 and B 2 are constructed such that, for all z ∈ {1, . . . , q}, [B 1 ]z,z ≤ Σz,z ≤ [B 2 ]z,z (see step 1). The means νJ,R ∈ Rq×p will be specified later. Just remark that J is the set of the relevant P t |J|×R , columns, and R the rank of νJ,R : we will decompose νJ,R = R r=1 σ̃r ũr ṽr , ũ ∈ R q×R and ṽ ∈ R . We get l(x, y) ≤ f (y|x) ≤ u(x, y) if we have ||βx − νJ,R x||22 ≤ p2 qR δ2 2 a (1 − 2−1/4 ). 2 Σ 142 4.5. APPENDICES Remark that ||βx − νJ,R x||22 ≤ p||β − νJ,R ||22 ||x||∞ We need then ||β − νJ,R ||22 ≤ pqR δ2 2 a (1 − 2−1/4 ) 2 Σ (4.16) According to [Dev14b], dH (l, u) ≤ 2(p2 qR + 3q/4)2 δ 2 , then, with δ=√ ǫ 2(pqR + 3/4q) we get the wanted bound. Now, explain how to construct νJ,R to get (4.16). ||β − νJ,R ||22 = p X q X R X j=1 z=1 r=1 = p X q X R X j=1 z=1 r=1 ≤ p X q X R X j=1 z=1 r=1 2 σr ur,j vr,z − σ̃r ũr,j ṽr,z 2 |σr − σ̃r ||uj,r vz,r | − σ̃r |ũr,j − ur,j ||ṽz,r | − σ̃r ur,j |vr,z − ṽr,z | 2 |σr − σ̃r | + Aσ |ũr,j − ur,j | + Aσ |vr,z − ṽr,z |   ≤ 2pqR max |σr − σ̃r |2 + Aσ max |ũr,j − ur,j |2 + Aσ max |ṽr,z − vr,z |2 r r,j 2 We need ||β − νJ,R ||22 ≤ pqR δ2 a2Σ (1 − 2−1/4 ). If we choose σ̃r , ũr,j and ṽr,z such that then it works. p δ |σr − σ̃r | ≤ √ aΣ 1 − 2−1/4 , 12 p δ aΣ 1 − 2−1/4 , |ur,j − ũr,j | ≤ √ 12Aσ p δ |vr,z − ṽr,z | ≤ √ aΣ 1 − 2−1/4 , 12Aσ r,z 143 CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK PROCEDURE To get this, we let, for ⌊.⌋ the floor function,        Aσ    p S = Z ∩ 0, √δ aΣ 1 − 2−1/4 12 p δ σ̃r = argmin σr − √ aΣ 1 − 2−1/4 ς , 12 ς∈S        Aσ    p U = Z ∩ 0, −1/4 √ δ 1 − 2 a 12Aσ Σ p δ aΣ 1 − 2−1/4 µ , ũr,j = argmin ur,j − √ 12Aσ µ∈U        A σ  p V = Z ∩ 0,  −1/4 √ δ a 1 − 2 Σ 12Aσ p δ aΣ 1 − 2−1/4 ν ṽz,r = argmin vz,r − √ 12Aσ ν∈V for all r ∈ {1, . . . , R}, j ∈ {1, . . . , p}, z ∈ {1, . . . , q}. Remark that we just need to determine the vectors ((ũr,j )1≤j≤J−r )1≤r≤R and ((ṽz,r )1≤z≤q−r )1≤r≤R because those vectors are unit. Then, we let for all j ∈ J c , for all z ∈ {1, . . . , p}, (νJ,R )z,j = 0 for all j ∈ J, for all z ∈ {1, . . . , p}, (νJ,R )z,j = R X σ̃r ũr,j ṽz,r r=1 — Step 3: Upper bound of the number of ǫ-brackets for F(J,R) 1−2−1/4 . 12 We have defined our bracket. Let c = We want to control the entropy. R  R( 2J−R−1 N  + 2q−R−1 ) X 2 2 Aσ A2σ √ √ |Bǫ (F(J,R) )| ≤ δaΣ c δaΣ c l=2  R(J+q−R) A2σ √ ≤(N − 1) A−R σ δaΣ c ≤C(aΣ , AΣ , Aσ , J, R)δ −D(J,R) −1 with C(aΣ , AΣ , Aσ , J, R) = 2  AΣ aΣ + 1 2  A2σ √ aΣ c R(J+q−R) A−R σ . For the mixture Begin by computing the bracketing entropy: according to (4.15), B H[.] (ǫ, S(K,J,R) , dH )  D(K,J,R) ! 1 ≤ log C ǫ 144 4.5. APPENDICES where K C = 2 K(2πe) K/2 K−1 3  AΣ 1 + aΣ 2 !D(K,J,R) √ P Aσ 12 − K R p Aσ k=1 k a2Σ 1 − 2−1/4 K P and D(K,J,R) = K k=1 Rk (|J| + q − Rk ). We have to determine φ(K,J,R) such that Z ̟q B n H[.] (ǫ, S(K,J,R) (s(K,J,R) , ̟), d⊗ H )dǫ ≤ φ(K,J,R) (̟). 0 Let compute the integral. Z ̟q n B H[.] (ǫ, S(K,J,R) (s(K,J,R) , ̟), d⊗ H )dǫ Z ̟s   q p 1 ≤ ̟ log(C) + D(K,J,R) log dǫ ǫ 0 s  s " # q √ 1 log(C) + log π+ ≤ D(K,J,R) ̟ D(K,J,R) ̟∧1 0 with, according to (4.17), Then,   K AΣ 1 + log(2πe) + (K − 1) log(3) + K log log(C) =K log(2) + 2 aΣ 2 ! √ K X 1 A2 12 pσ + D(K,J,R) log + log(K) + Rk log( ) −1/4 A σ aΣ 1 − 2 k=1 !! 
√ A2σ 12 p ≤D(K,J,R) log(2) + log(2πe) + log 3 + 1 + log aΣ 1 − 2−1/4    AΣ 1 + D(K,J,R) log + aΣ 2 ! √   2 !  Aσ 12 12πe AΣ 1 ≤D(K,J,R) log p + + log −1/4 a 2 aΣ Σ 1−2 Z q B n H[.] (ǫ, S(K,J,R) (s(K,J,R) , ̟), d⊗ H )dǫ 0 s   2  s  ! q Aσ AΣ 1 1 ≤ D(K,J,R) 2 + log + log + aΣ 2 aΣ ̟∧1 ̟ Consequently, by putting B =2+ s log  AΣ 1 + aΣ 2 we get that the function φ(K,J,R) defined on R∗+ by  q φ(K,J,R) (̟) = D(K,J,R) ̟ B + s  A2σ ; aΣ log  1 ̟∧1 ! (4.17) (4.18) 145 CHAPTER 4. AN ORACLE INEQUALITY FOR THE LASSO-RANK PROCEDURE satisfies (4.18). Besides, φ(K,J,R) is nondecreasing and ̟ 7→ φ(K,J,R) (̟)/̟ is non-increasing, then φ(K,J,R) is convenient. Finally, we need to find an upper bound of ̟∗ satisfying √ φ(K,J,R) (̟∗ ) = n̟∗2 . Consider ̟∗ such that φ(K,J,R) (̟∗ ) = This is equivalent to solve ̟∗ = r p ̟∗2 . D(K,J,R) n B+ s log  1 ̟∗ ∧ 1 ! and then we could choose ̟∗2 4.5.3 D(K,J,R) ≤ n 1 2 2B + log 1∧ D(K,J,R) 2 B n !! . Assumption K Let w(K,J,R) = D̃(K,J) log  we could compute the sum X (K,J,R) e−w(K,J,R) ≤ ≤ 4epq (D̃(K,J) −q 2 )∧pq  +  X X  K≥1 X K≥1 ≤2 e P k∈{1,...,K} Rk , −D̃(K,J) log  −D̃(K,J) log  where D̃(K,J) = K(1 + |J|). Then,   4epq (D̃(K,J) −q 2 )∧pq  1≤|J|≤pq   X e 1≤|J|≤pq K  X   e−R   4epq (D̃(K,J) −q 2 )∧pq R≥1   1K  The last inequality is computed in Proposition 4.5 in [Dev14b] by 2. Then, X e−w(K,J,R) ≤ 2. (K,J,R) 4.5. APPENDICES 146 Chapter 5 Clustering electricity consumers using high-dimensional regression mixture models. Contents 5.1 5.2 5.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 Typical workflow using the example of the aggregated consumption 138 5.3.1 General framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 5.3.2 Cluster days on the residential synchronous curve . . . . . . . . . . . . . 139 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 Model visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 Model-based clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.4 Clustering consumers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.4.1 Cluster consumers on mean days . . . . . . . . . . . . . . . . . . . . . . 143 5.4.2 Cluster consumers on individual curves . . . . . . . . . . . . . . . . . . 144 Selected mixture models . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 Analyzing clusters using classical profiling features . . . . . . . . . . . . 144 Cross analysis using survey data . . . . . . . . . . . . . . . . . . . . . . 147 Using model on one-day shifted data . . . . . . . . . . . . . . . . . . . . 148 Remarks on similar analyses . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.5 Discussion and conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 149 147 5.1. INTRODUCTION 5.1 148 Introduction New metering infrastructures as smart meters provide new and potentially massive informations about individual (household, small and medium enterprise) consumption. 
As an example, in France, ERDF (Electricite Reseau Distribution de France, the French manager of the public electricity distribution network) has deployed 250 000 smart meters, covering a rural and an urban territory and providing half-hourly household energy readings every day. ERDF plans to install 35 million of them over the French territory by the end of 2020, and exploiting such an amount of data is an exciting but challenging task (see http://www.erdf.fr/Linky).
Many applications of individual data analysis can be found in the literature. The first and most popular one is load profiling. Understanding consumers' times of use, seasonal patterns and the different features that drive their consumption is a fundamental task for electricity providers to design their offers, and more generally for marketing studies (see e.g. [Irw86]). Energy agencies and states can also benefit from profiling to design efficiency programs and improve recommendation policies. Customer segmentation based on load classification is a natural approach for that, and [ZYS13] proposes a nice review of the most popular methods, concluding that classification for smart grids is a hard task due to the complexity, massiveness, high dimension and heterogeneity of the data. Another problem pointed out is the dynamic structure of smart meter data, and particularly the issue of portfolio variations (losses and gains of customers), the update of a previous classification when new customers arrive, or the clustering of a new customer with very few observations of its load. In [KFR14], the authors propose a segmentation algorithm based on K-means to uncover shape dictionaries that help to summarize information and cluster a large population of 250 000 households in California. However, the proposed solution exploits a quite long history of data, of at least one year.
Recently, other important questions have been raised by smart meters and the new possibility to send potentially complex signals to consumers (incentive payments, time-varying prices...), and demand response program tailoring attracts a lot of attention (see [US 06], [H+ 13]). Local optimization of electricity production and real-time management of individual demand thus play an important role in the smart grid landscape. This induces a need for local electricity load forecasting at different levels of the grid and favors bottom-up approaches based on a two-stage process. The first stage consists in building classes in a population such that each class can be sufficiently well forecast, but corresponds to a different load shape or reacts differently to exogenous variables like temperature or prices (see e.g. [LSD15] in the context of demand response). The second stage consists in aggregating these forecasts to predict the total, or any subtotal, of the population consumption. For example, identifying and forecasting the consumption of a sub-population reactive to an incentive is an important need when optimizing a demand response program. Surprisingly, few papers consider the problem of clustering individual consumption for forecasting, and especially for forecasting at a disaggregated level (e.g. in each class). In [AS13], clustering procedures are compared according to the forecasting performance of their corresponding bottom-up forecasts of the total consumption of 6 000 residential customers and small-to-medium enterprises in Ireland. Even if they achieve nice performance in the end, the proposed clustering methods are quite independent of the VAR model used for forecasting.
In [MMOP10], a clustering algorithm is proposed that couples hierarchical clustering and multi-linear regression models to improve the forecast of the total consumption of a French industrial subset. They obtain a real forecasting gain but need a sufficiently long dataset (2-3 years), and the algorithm is computationally intensive.
We propose here a new methodology based on high-dimensional regression models. Our main contribution is that we focus on uncovering classes corresponding to different regression models. As a consequence, these classes can then be exploited for profiling as well as for forecasting in each class, or for bottom-up forecasts, in a unified view. More precisely, we consider regression models where Y_d = X_d β + ε_d is typically an individual (high-dimensional) load curve for day d, and X_d can be alternatively Y_{d-1} or any other exogenous covariates.
We consider a real dataset of Irish individual consumers of 4 225 meters, each with 48 half-hourly meter reads per day over one year, from 1st January 2010 up to 31st December 2010. These data have already been studied in [AS13] and [Cha14], and we refer to those papers for a presentation of the data. For computational and time reasons, we draw a random sample of around 500 residential consumers among the 90% closest to the mean, to demonstrate the feasibility of our methods. We show that, considering only 2 days of consumption, we obtain physically interpretable clusters of consumers. As Fig. 5.1 suggests, dealing with individual consumption curves is a hard task, because of their high variability.

Figure 5.1: Load consumption of a sample of 5 consumers over a week in winter

5.2 Method

We propose to use model-based clustering and adopt the model selection paradigm. Indeed, we consider the model collection of conditional mixture densities
$$\mathcal{S} = \left\{ s_\xi^{(K,J)}, \; K \in \mathcal{K}, \; J \in \mathcal{J} \right\},$$
where K denotes the number of clusters, J the set of relevant variables for clustering, and \mathcal{K} and \mathcal{J} are respectively the sets of possible values of K and J. The basic model we propose to use is a finite mixture regression of K multivariate Gaussian densities (see [SBvdGR10] for a recent and fruitful reference), the conditional density being, for x ∈ R^p, y ∈ R^q, ϕ denoting the Gaussian density,
$$s_\xi^{(K,J)}(y|x) = \sum_{k=1}^{K} \pi_k \, \varphi\!\left(y \,\middle|\, \beta_k^{[J]} x, \Sigma_k\right).$$
Such a model can be interpreted and used from two different viewpoints.
First, from a clustering perspective, given the estimation ξ̂ of the parameters ξ = (π, β, Σ), we can deduce a data clustering from the Maximum A Posteriori principle: for each observation i, we compute the posterior probability τ_{i,k}(ξ̂) of each cluster k from the estimation ξ̂, and we assign the observation i to the cluster k̂_i = argmax_{k ∈ {1,...,K}} τ_{i,k}(ξ̂). The proportions of the clusters are estimated by π̂.
Second, in each cluster, the corresponding model is meaningful and its interpretation allows to understand the relationship between the variables Y and X, since it is of the form Y = Xβ_k + ε, the noise intensity being measured by Σ_k.
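As an illustration, here is a minimal NumPy sketch of the MAP assignment described above for a Gaussian mixture regression with diagonal covariance matrices; the function and array names are ours, and the parameter estimates are assumed to be already available.

```python
import numpy as np

def map_assignment(X, Y, pi_hat, beta_hat, sigma2_hat):
    """MAP clustering for a Gaussian mixture regression with diagonal covariances.

    X: (n, p) regressors, Y: (n, q) responses,
    pi_hat: (K,) mixture proportions,
    beta_hat: (K, q, p) regression matrices (one per cluster),
    sigma2_hat: (K, q) diagonal covariance terms.
    Returns the posterior probabilities tau (n, K) and the MAP labels (n,).
    """
    n, K = X.shape[0], pi_hat.shape[0]
    log_tau = np.empty((n, K))
    for k in range(K):
        resid = Y - X @ beta_hat[k].T                      # (n, q) residuals y - beta_k x
        log_density = -0.5 * (np.sum(resid**2 / sigma2_hat[k], axis=1)
                              + np.sum(np.log(2 * np.pi * sigma2_hat[k])))
        log_tau[:, k] = np.log(pi_hat[k]) + log_density
    log_tau -= log_tau.max(axis=1, keepdims=True)          # stabilize before normalizing
    tau = np.exp(log_tau)
    tau /= tau.sum(axis=1, keepdims=True)                  # posterior probabilities tau_{i,k}
    return tau, tau.argmax(axis=1)                         # MAP labels (0-based)
```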
Parameters are estimated by the Lasso-MLE procedure, which is described in detail in [Dev14c] and theoretically justified in [Dev14b]. To overcome the high-dimension issue, we use the Lasso estimator on the regression parameters and we restrict the covariance matrices to be diagonal. To avoid shrinkage, we then estimate the parameters by maximum likelihood restricted to the relevant variables selected by the Lasso estimator. Rather than selecting a single regularization parameter, we recast this issue as a model selection problem, considering a grid of regularization parameters. The sets of indices of relevant variables obtained along this grid are collected in 𝒥. Since we also have to estimate the number of components, we compute those models for different numbers of components, belonging to 𝒦. In this paper, 𝒦 = {1, . . . , 8}.
Among this collection, we can focus on a few models which seem interesting for clustering, depending on which characteristics we want to highlight. We propose to use the slope heuristics to extract potentially interesting models. The selected model minimizes the log-likelihood penalized by 2κ̂D_{(K,J)}/n, where D_{(K,J)} denotes the dimension of the model indexed by (K, J), and where κ̂ is constructed from a completely data-driven procedure. In practice, we use the Capushe package, see [BMM12].
In addition to this family of models, we need powerful tools to translate curves into variables. Rather than dealing with the discretization of the load consumption, we project it onto a functional basis to take the functional structure into account. Since we are interested not only in representing the curves in a functional basis, but also in benefiting from a time-scale interpretation of the coefficients, we propose to use a wavelet basis, see [Mal99] for a theoretical approach and [MMOP07] for a practical purpose. To simplify our presentation, we will focus on the Haar basis.

5.3 Typical workflow using the example of the aggregated consumption

5.3.1 General framework

The goal is to cluster electricity consumers using a regression mixture model. We will use the consumption of the eve as regressors, to explain the consumption of the day. Consider the daily consumption, where we observe 48 points. We project the signal onto the Haar basis at level 4. The signal is thus decomposed into an approximation, denoted by A4, and several details, denoted by D4, D3, D2, and D1. We illustrate it in Fig. 5.2 where, in addition to the decomposition into a sum of orthogonal signals on the left, one can find a colored representation of the corresponding wavelet coefficients in the time-scale plane. For an automatic denoising, we remove the details of levels 1 and 2, which correspond to high-frequency components. Two centerings will be considered:
— preprocessing 1: before projecting, we center each signal individually;
— preprocessing 2: we keep the detail coefficients of levels 4 and 3 only. Here, we also remove the low-frequency approximation.
Depending on the preprocessing, we will get different clusterings.

Figure 5.2: Projection of a load consumption for one day onto the Haar basis, at level 4. By construction, we get s = A4 + D4 + D3 + D2 + D1. On the left side, the signal is displayed together with its reconstructions, the dotted line corresponding to preprocessing 1 and the dotted-dashed line to preprocessing 2.
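A minimal sketch of this projection and of the two preprocessings, using the PyWavelets package (our choice of tool, not necessarily the one used in the thesis); the function name is ours, and the level-4 Haar decomposition of a 48-point day follows the description above.

```python
import numpy as np
import pywt

def project_haar(signal, preprocessing=2):
    """Project one daily load curve (48 points) onto the Haar basis at level 4
    and keep only the low-frequency content described in the chapter.

    preprocessing=1: center the signal, keep A4, D4, D3 (details D1, D2 dropped).
    preprocessing=2: keep D4 and D3 only (the approximation A4 is also dropped).
    Returns the retained coefficients as a single feature vector.
    """
    x = np.asarray(signal, dtype=float)
    if preprocessing == 1:
        x = x - x.mean()                         # individual centering before projection
    cA4, cD4, cD3, cD2, cD1 = pywt.wavedec(x, "haar", level=4)
    if preprocessing == 1:
        coeffs = [cA4, cD4, cD3]                 # drop high-frequency details D1, D2
    else:
        coeffs = [cD4, cD3]                      # drop the approximation A4 as well
    return np.concatenate(coeffs)

# Example: one day of half-hourly consumption (48 values)
day = np.random.rand(48)
features = project_haar(day, preprocessing=2)    # here 3 + 6 = 9 coefficients
```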
We observe the load consumption of n residentials over a year, denoted by (z_{i,t})_{1≤i≤n, t∈T}. We consider
— Z_t = Σ_{i=1}^{n} z_{i,t}, the aggregated signal;
— Z_d = (Z_t)_{48(d−1)≤t≤48d}, the aggregated signal for the day d;
— z_{i,d} = (z_{i,t})_{48(d−1)≤t≤48d}, the signal of the residential i for the day d.
We consider three different ways to analyze this dataset. The first one considers (Z_d, Z_{d+1})_{1≤d≤338} over time, and the results are easy to interpret. We take this opportunity to develop in detail the steps of the method we propose, from the model to the clusters, via model visualization and interpretation. In the second one, we cluster consumers on mean days. Working with mean days leads to some stability. The last one is the most difficult, since we consider individual curves (z_{i,d_0}, z_{i,d_0+1})_{1≤i≤n} and classify these individuals for the days (d_0, d_0 + 1).

5.3.2 Cluster days on the residential synchronous curve

In this section, we focus on the residential synchronous curve (Z_t)_{t∈T}. We will illustrate our procedure step by step and highlight some features of the data. The whole analysis is done for preprocessing 2.

Model selection

Our procedure leads to a model collection, with various numbers of components and various sparsities. Let us explain how to select some interesting models, thanks to the slope heuristics. We define
$$(K(\kappa), J(\kappa)) = \operatorname*{argmin}_{(K,J)} \left\{ -\gamma_n\big(\hat{s}^{(K,J)}\big) + 2\kappa D_{(K,J)}/n \right\},$$
where γ_n is the log-likelihood function and ŝ^{(K,J)} is the maximum likelihood estimator within the collection S^{(K,J)}. We consider the step function κ ↦ D_{(K(κ),J(κ))}, κ̂ being the abscissa which leads to the biggest dimension jump. We select the model (K̂, Ĵ) = (K(2κ̂), J(2κ̂)). To refine this choice, we can consider the points (D_{(K,J)}, −γ_n(ŝ^{(K,J)}) + 2κ̂D_{(K,J)}/n)_{(K,J)} and select a few models minimizing this criterion. According to Figs. 5.3 and 5.4, we can consider some values of κ̂ which seem to create big jumps, and several models which seem to minimize the penalized log-likelihood.

Figure 5.3: We select the model m̂ using the slope heuristics
Figure 5.4: Minimization of the penalized log-likelihood. Interesting models are marked by red squares, the selected one by a green diamond

Model visualization

Thanks to the model-based clustering, we have constructed a model in each cluster. We can then understand the differences between clusters from the estimations β̂ and Σ̂. We represent them as images, each coefficient being represented by a pixel. As we consider the linear model Y = Xβ_k in each cluster, rows correspond to regressor coefficients and columns to response coefficients. The diagonal coefficients explain the main part. Figs. 5.5 and 5.6 explain the image construction, and Fig. 5.7 displays it for the model selected at the previous step.

Figure 5.5: Representation of the regression matrix β_k for preprocessing 1.
Figure 5.6: Representation of the regression matrix β_k for preprocessing 2.
Figure 5.7: For the selected model, we represent β̂ in each cluster. Absolute values of the coefficients are represented by different colormaps, white for 0. Each color represents a cluster
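A possible way to draw such images, assuming the estimated regression matrices are available as NumPy arrays; the function name is ours and matplotlib is our choice of plotting tool, so this is only a sketch of the representation used in Figs. 5.5-5.7.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_regression_matrices(beta_hat):
    """Display |beta_k| for each cluster as an image (rows: regressor coefficients,
    columns: response coefficients, as in Figs. 5.5-5.7), plus |beta_1 - beta_2|.

    beta_hat: list of (q, p) arrays, one regression matrix per cluster."""
    K = len(beta_hat)
    fig, axes = plt.subplots(1, K + 1, figsize=(3 * (K + 1), 3))
    for k, (ax, beta) in enumerate(zip(axes, beta_hat), start=1):
        ax.imshow(np.abs(beta).T, cmap="Blues")   # transpose: regressors in rows
        ax.set_title(f"cluster {k}")
    diff = np.abs(beta_hat[0] - beta_hat[1]).T    # where the first two models differ
    axes[-1].imshow(diff, cmap="Reds")
    axes[-1].set_title("|beta_1 - beta_2|")
    plt.tight_layout()
    plt.show()
```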
To highlight the differences between clusters, we also plot β̂1 − β̂2. First, we remark that β̂1 and β̂2 are sparse, thanks to the Lasso estimator. Moreover, the main difference between β̂1 and β̂2 lies in row 4, columns 1, 2 and 6. We can say that the procedure uses, depending on the cluster, more or less heavily the first coefficient of D3 of X to describe coefficients 1 and 2 of D3 and coefficient 3 of D4 of Y. Fig. 5.11 highlights those differences between clusters.
We represent the covariance matrices in Fig. 5.8. Because we estimate them by a diagonal matrix in each cluster, we just display the diagonal coefficients. We keep the same scale for all the clusters, to highlight which clusters are noisier.

Figure 5.8: For the selected model, we represent Σ̂ in each cluster

Model-based clustering

We can compute the posterior probability of each observation. In Fig. 5.9, we represent them by boxplots. The closer they are to 1, the better separated the clusters. In the present case, the two clusters are well defined and the clustering problem is quite easy; but see for example Fig. 5.13, in a different clustering issue, which presents a model with less well separated assignments.

Figure 5.9: Assignment boxplots per cluster (π̂1 = 0.69, π̂2 = 0.31)

Clustering

Now, we are able to interpret each cluster. In Fig. 5.10, we represent the mean curve of each cluster. We can also use functional descriptive statistics (see [Sha11]). Because the clusters are built on the relationship between a day and its eve, we represent both days.

Figure 5.10: Clustering representation. Each curve is the mean of the load consumption in its cluster

Discussion

With preprocessing 2, we cluster weekdays and weekend days. The same procedure run with preprocessing 1 shows the temperature influence. We construct four clusters, two of them gathering weekend days and the two others weekdays, the differences being made according to the temperature. In Fig. 5.11, we summarize the clusters by their mean curves for this second model.

Figure 5.11: Clustering representation. Each curve is the mean of the load consumption in its cluster

In Table 5.1, we summarize both models according to the day type.

Interpretation       Mon.   Tue.   Wed.   Thur.  Fri.   Sat.   Sun.
week                 0.88   0.96   0.94   0.98   0.96   0      0
weekend              0.12   0.04   0.06   0.02   0.04   1      1
week, low T.         0.26   0.46   0.46   0.47   0.51   0      0
weekend, low T.      0.1    0.02   0.03   0      0      0.2    0.65
week, high T.        0.64   0.52   0.5    0.53   0.45   0      0
weekend, high T.     0      0      0      0      0.04   0.79   0.35

Table 5.1: For each selected model, we summarize the proportion of each day type in each cluster, and interpret it, T corresponding to the temperature.

According to Table 5.1, the difference between the two preprocessings is the temperature influence: when we center the curves before projecting, we translate the whole curves, but when we remove the low-frequency approximation, we drop this main feature. Depending on the goal, each preprocessing can be interesting.

5.4 Clustering consumers

Another important approach considered in this paper is to cluster consumers. Before dealing with their daily consumption in Section 5.4.2, which is a hard problem because of the irregularity of the signals, we cluster consumers on mean days in Section 5.4.1.
5.4.1 Cluster consumers on mean days

Define z̃_{i,d} as the mean signal of the residential i over all days of type d, for d ∈ {1, . . . , 7}. Then we consider the couples (z̃_{i,d}, z̃_{i,d+1})_{1≤i≤n}. If we look at the model collection constructed by our procedure for 𝒦 = {1, . . . , 8}, we always select models with only one component, for every day d. Nevertheless, if we force the model to have several clusters, restricting 𝒦 to 𝒦′ = {2, . . . , 8}, we get some interesting information. All results presented here are obtained with preprocessing 2.
For the weekday couples Monday/Tuesday, Tuesday/Wednesday, Wednesday/Thursday and Thursday/Friday, we select models with two clusters having the same means and the same covariance matrices: a single component is in fact enough. The only difference in load consumption is in the mean behavior. This is consistent with the clusterings obtained in Section 5.3.2.
We focus here on Saturday/Sunday, for which there are different interesting clusters, see Fig. 5.12. Remark that we cannot summarize a cluster by its mean because of the high variability. The main differences between those two clusters lie in the contrast between lunch time and the afternoon, and in the Sunday morning. Notice that the big variability over the two days is not explained by our model, for which the variability is small: the noise only accounts for the relationship between a day and its eve.

Figure 5.12: Saturday and Sunday load consumption in each cluster.

On Sunday/Monday, we also get three different clusters. Even if we identify differences in shape, the main difference is still the mean consumption. On Friday/Saturday, we distinguish people who have the same consumption on both days from people whose behavior is really different. However, because the selected model has again one component, we think that considering the mean over days for each consumer cancels interesting effects, such as the temperature and seasonal influences.

5.4.2 Cluster consumers on individual curves

One major objective of this work is individual curve clustering. As already pointed out in the introduction, this is a very challenging task that has various applications for smart grid management, going from demand response programs to energy reduction recommendations or household segmentation. We consider the complex situation where an electricity provider has access to very few recent measurements (2 days here) for each individual customer but needs to classify them anyway. That could happen in a competitive electricity market where customers can change their electricity provider at any time without providing their past consumption curves.
First, we focus on two successive days of electricity consumption measurements, a Tuesday and a Wednesday in winter (January 5th and 6th, 2010), for our 487 residential customers. Note that we choose weekdays in winter following our experience on electricity data, as those days often bring important information about residential electricity usage (electrical heating, intra-day cycle, tariff effect...).

Selected mixture models

We apply the model-based clustering approach presented in Section 5.2 for preprocessing 1 and obtain two models minimizing the penalized log-likelihood, corresponding to 2 and 5 clusters.
Although these two classifications are based on an auto-regression mixture model, we are able to analyze and interpret the clusters in terms of consumption profiles and we provide below a physical interpretation of the classes.
In Fig. 5.13, we plot the proportions in each cluster for both models constructed by our procedure. The first remark is that this issue is harder than the one in Section 5.3.2. Nevertheless, even if there are a lot of outliers, for the model M1 a lot of assignments are well separated. It is obviously less clear with the model M2, with 5 components.

Figure 5.13: Proportions in each cluster for the models constructed by our procedure (M1: π̂1 = 0.5, π̂2 = 0.5; M2: π̂1 = 0.21, π̂2 = 0.17, π̂3 = 0.2, π̂4 = 0.25, π̂5 = 0.17)

In Fig. 5.14, we plot the regression matrices to highlight differences between clusters. Indeed, those two matrices are different; for example, more variables are needed to describe cluster 2.

Figure 5.14: Regression matrix in each cluster for the model with 2 clusters

Analyzing clusters using classical profiling features

We first represent the main features of the cluster centres (the mean of all individual curves in a cluster). Fig. 5.15 represents the daily mean consumptions of these centres along the year. We clearly see that the two classifications separate customers that have different mean levels of consumption (small, middle and big residential consumers) and different winter/summer consumption ratios, probably due to the amount of electrical heating among house usages. Let us recall that the model-based clustering is done on centered data, so that the mean level discrimination is not straightforward. Schematically, the 2 clusters classification seems to separate big customers with electrical heating from small customers with little electrical heating. The 5 clusters classification additionally separates the small customers with little electrical heating but peaks in consumption (a flat curve with peaks, probably due to auxiliary heating when the temperature is very low) from the middle customers with electrical heating. The two clusters in the middle customers population do not present any visible difference on this figure. The two big customers clusters have a different winter/summer ratio, probably due to a difference in electrical heating usage.

Figure 5.15: Daily mean consumptions of the cluster centres along the year for 2 (top) and 5 clusters (bottom)

This analysis is confirmed in Fig. 5.16, where we represent the daily mean consumptions of the cluster centres as a function of the daily mean temperature for the two classifications. Points correspond to the mean consumption of one day and smooth curves are obtained with P-spline regression. We observe that, for all classes, the temperature effect due to electrical heating starts at around 12˚C and that the different clusters have various heating profiles. The 2 clusters classification profiles confirm the observation of Fig. 5.15 that the population is divided into small customers and big customers with electrical heating.
Concerning the 5 clusters classification, we clearly see on the small customer profile (purple points/curves) an inflexion at around 0˚C (this inflexion is also observed in the small customer cluster of the 2 clusters classification), corresponding e.g. to an auxiliary heating device effect, or at least to an increase of the consumption of the house for very low temperatures. This is also what distinguishes the two middle customers classes (blue and green points/curves). The two big customers' clusters have similar heating profiles, except that the green cluster corresponds to a higher electrical heating usage.
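The smooth curves of Fig. 5.16 are obtained with P-spline regression in the thesis; as a rough, purely illustrative stand-in, the following sketch simply averages the daily mean consumption of a cluster centre within temperature bins (function and variable names are ours).

```python
import numpy as np

def heating_profile(daily_temp, daily_cons, n_bins=20):
    """Rough profile of daily mean consumption versus daily mean temperature for one
    cluster centre: binned averages as a simple stand-in for the P-spline smoothing.

    daily_temp, daily_cons: arrays with one value per day (e.g. 365 values).
    Returns bin centres and the average consumption in each temperature bin.
    """
    edges = np.linspace(daily_temp.min(), daily_temp.max(), n_bins + 1)
    idx = np.clip(np.digitize(daily_temp, edges) - 1, 0, n_bins - 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    means = np.array([daily_cons[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)])
    return centres, means
```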
Figure 5.16: Daily mean consumptions of the cluster centres as a function of the daily mean temperature for 2 (on the left) and 5 clusters (on the right)

Another interesting observation concerns the weekly and daily profiles of the centres. We represent in Fig. 5.17 an average (over time) week of consumption for each centre of the two classifications. For the 2 clusters classification, we see again the difference in average consumption between the big customer cluster profile and the small customer one. They have similar shapes, but the day/night difference and the peak loads (at around 8h, 13h and 18h on weekdays) are more marked. For the 5 clusters curves, even if the weekly profiles are quite similar (no major difference between the weekday and weekend profiles within each cluster), the daily shapes exhibit more differences. The day/night ratio can be very different, as well as small variations along the day, probably related to different tariff options (see [Com11a] for a description of the tariffs).

Figure 5.17: Average (over time) week of consumption for each centre of the two classifications (2 clusters on the top and 5 on the bottom)
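As an illustration of how an average week like the one in Fig. 5.17 can be computed from the raw half-hourly readings, here is a small NumPy sketch; the array layout and the function name are our own assumptions (in particular, the series is assumed to start at the beginning of a week).

```python
import numpy as np

def average_week(consumption, labels):
    """Average (over time) week of consumption for each cluster centre, as in Fig. 5.17.

    consumption: (n_consumers, n_halfhours) half-hourly readings over the year,
                 assumed to start on the first half-hour of a week;
    labels: (n_consumers,) cluster labels from the MAP assignment.
    Returns a dict {cluster: (336,) array}, with 336 = 48 half-hours x 7 days.
    """
    T = consumption.shape[1]
    n_weeks = T // 336                                   # number of complete weeks
    profiles = {}
    for k in np.unique(labels):
        centre = consumption[labels == k].mean(axis=0)   # cluster centre curve over the year
        weeks = centre[:n_weeks * 336].reshape(n_weeks, 336)
        profiles[k] = weeks.mean(axis=0)                 # average week for this cluster
    return profiles
```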
Cross analysis using survey data

To enrich this analysis, we also consider extra information provided in a survey realized by the Irish Commission for Energy Regulation. We summarize this large amount of information into 10 variables:
— ResidentialTariffallocation: the tariff option (see [Com11a], [Com11b]);
— Socialclass: AB, C1, C2, F in UK demographic classifications;
— Ownership: whether or not a customer owns his/her house;
— ResidentialStimulusallocation: the stimulus sent to the customer (see [Com11a], [Com11b]);
— Built.Year: the year of construction of the building;
— Heat.Home and Heat.Water: electrical or not;
— Windows.Doubleglazed: the proportion of double glazed windows in the house (none, quarter, half, 3 quarters, all);
— Home.Appliance.White.goods: the number of white goods/major appliances of the household.
To relate our clusters to those variables, we consider the classification problem consisting in recovering the model-based classes with a random forest classifier. Random forests, introduced in [Bre01], are a well-known and tested non-parametric classification method with the advantage of being quite automatic and easy to calibrate. In addition, the classifier provides a nice summary of the previous covariates in terms of their importance for classification. In Fig. 5.18 we represent the out-of-bag error of the random forest classifiers as a function of the number of trees (one major parameter of the random forest classifier) for the two clusterings. It corresponds to a good estimate of what the classification error would be on an independent dataset. We have observed that, choosing a sufficiently large number of trees for the forest (300), the classification error rate reaches 40% in the 2 clusters case and 75% in the 5 classes case, which has to be compared to a random classifier with respectively a 50% and an 80% error rate. That means that the 10 previous covariates provide little, but some, information about the clusters.

Figure 5.18: Out-of-bag error of the random forest classifiers as a function of the number of trees

Quantifying the importance of each survey covariate in the classifications, we observe that, in both cases, Home.Appliance.White.goods and Socialclass play a major role. That could be explained by the fact that those covariates can discriminate small and big customers. Another interesting point is that, in the 5 clusters classification, the variable Built.Year plays an important role, which is probably related to different heating profiles explained by different insulation standards. That could explain the two big customers clusters. Then come the tariff options which, in the 5 clusters case, could explain the different daily shapes of Fig. 5.17. It is noteworthy that these two classifications provide clusters that exhibit such a nice physical interpretation, considering that we only use two days of consumption in winter for each customer.

Using the model on one-day shifted data

To highlight the advantages of our procedure, we compare predictions of the consumption of Thursday, 7th January, 2010. Indeed, even if the method is not designed for forecasting purposes, we want to show that model-based clustering is an interesting tool also for prediction. We compare linear models, estimated on the couple Tuesday, 5th January / Wednesday, 6th January, and we predict Thursday from Wednesday. This is suggested by the clustering obtained in Section 5.3.2, showing that transitions between weekdays are similar.
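To make the comparison concrete, here is a minimal sketch, under our own assumptions, of how such a next-day prediction and the per-consumer RMSE can be computed from a fitted mixture regression. Whether the cluster-wise predictions are mixed with the posterior weights or only the MAP cluster is used is our choice here, not a statement about the thesis's exact implementation, and all names (predict_next_day, beta_hat, tau) are ours; the posteriors tau can come from the map_assignment sketch given earlier.

```python
import numpy as np

def predict_next_day(x_new, tau, beta_hat, hard=True):
    """Predict next-day coefficients from the fitted mixture regression.

    x_new: (n, p) coefficients of the current day (new regressors);
    tau: (n, K) posterior probabilities of the n consumers;
    beta_hat: (K, q, p) estimated regression matrices.
    hard=True uses only the MAP cluster of each consumer; hard=False mixes the
    K cluster-wise predictions with the posterior weights.
    """
    K = beta_hat.shape[0]
    preds = np.stack([x_new @ beta_hat[k].T for k in range(K)], axis=1)  # (n, K, q)
    if hard:
        labels = tau.argmax(axis=1)                      # MAP cluster per consumer
        return preds[np.arange(x_new.shape[0]), labels]  # (n, q)
    return np.einsum("nk,nkq->nq", tau, preds)           # posterior-weighted mixture

def rmse(y_pred, y_true):
    """Per-consumer RMSE, applied to the reconstructed 48-point daily curves."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2, axis=1))
```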
We then compare the following models. The first and most common one is the linear model, without clustering. The second model is the first one constructed by our procedure, described before, with 2 components. Moreover, we can use the clustering given by the models constructed by our procedure, but estimate the parameters without variable selection, using a full linear model in each component. We restrict our study here to one model to narrow the analysis, but everything is also computable with the models with 5 clusters.
For each consumer i and each prediction procedure, we compute two prediction errors: the RMSE of the Thursday prediction and the RMSE of the Wednesday prediction. Recall that the RMSE, for a consumer i, is defined by
$$\mathrm{RMSE}(i) = \sqrt{\frac{1}{48} \sum_{t=1}^{48} \left(\hat{z}_{i,t} - z_{i,t}\right)^2}.$$

Figure 5.19: RMSE of the Thursday prediction for each procedure over all consumers (procedures labeled 1 cluster LMLE, 2 clusters LMLE and 2 clusters bis)

We remark that, even if the RMSE are almost the same for the three different models, the one estimated by our procedure leads to a smaller median and a smaller interquartile range. For the three considered models, the median of the RMSE of the Wednesday prediction (learning sample) and of the RMSE of the Thursday prediction (test sample) are close to each other, which means that the clustering remains good even for one-day shifted data, of course as long as we remain in the class of working days, according to Section 5.3.2. To support this remark, we also run our procedure on the couple Wednesday/Thursday. We then select three different models, and the involved clusterings are quite similar to the clusterings given by the models in Section 5.4.2.

Remarks on similar analyses

Alternatively, we run the same analysis on two successive weekdays of electricity consumption measurements in summer. We obtain three models, corresponding to 2, 3 and 5 clusters respectively. We compute, as in Subsection 5.4.2, the daily mean consumptions of the cluster centres along the year, and as a function of the daily mean temperature. The main difference concerns the inflexion at around 0˚C: because the clustering is done on summer days, we do not distinguish cold effects. Moreover, there are no cooling effects. We remark again that the clusterings are hierarchical, but different from those obtained in the winter study, as we expected.
Figure 5.20: Daily mean consumptions of the cluster centres as a function of the daily mean temperature for 5 clusters, clustering done by observing Thursday and Wednesday in summer
We also study two successive weekend days of electricity consumption, in winter and in summer. We recognize different clusters, depending on the behavior of the consumers. We work with Friday/Saturday couples. The main thing we observe in summer is a cluster with no consumption, corresponding to consumers who leave their home. It could be useful to predict the Sunday consumption, but it does not generalize to other weekends.

Figure 5.21: Daily mean consumptions of the cluster centres along the year for 3 clusters, clustering done on weekend observations

5.5 Discussion and conclusion

Massive information about individual (household, small and medium enterprise) consumption is now provided by new metering technologies and smart grids. Two major exploitations of individual data are load profiling and forecasting at different scales of the grid. Customer segmentation based on load classification is a natural approach for that and is a prolific area of research. We propose here a new methodology based on high-dimensional regression models. The novelty of our approach is that we focus on uncovering clusters corresponding to different regression models, which can then be exploited for profiling as well as forecasting. We focus on profiling and show how, exploiting only a few temporal measurements of the consumption of 500 residential customers, we can derive informative clusters. We provide some encouraging elements about how to exploit these models and clusters for bottom-up forecasting.

Bibliography

[Aka74] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[Ald96] M. Aldrin. Moderate projection pursuit regression for multivariate response data. Computational Statistics & Data Analysis, 21(5):501–531, 1996.
[And51] T. W. Anderson. Estimating Linear Restrictions on Regression Coefficients for Multivariate Normal Distributions. The Annals of Mathematical Statistics, 22(3):327–351, 1951.
[AS13] C. Alzate and M. Sinn. Improved electricity load forecasting via kernel spectral clustering of smart meters. In 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013, pages 943–948, 2013.
[ASS98] C.W. Anderson, E.A. Stolz, and S. Shamsunder. Multivariate autoregressive models for classification of spontaneous electroencephalographic signals during mental tasks. IEEE Transactions on Biomedical Engineering, 45(3):277–286, 1998.
[Bah58] R. R. Bahadur. Examples of inconsistency of maximum likelihood estimates. Sankhya: The Indian Journal of Statistics (1933-1960), 20(3/4):207–210, 1958.
[Bau09] J-P Baudry. Model Selection for Clustering. Choosing the Number of Classes. Ph.D. thesis, Université Paris-Sud 11, 2009.
[BC11] A. Belloni and V. Chernozhukov. ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130, 2011.
[BC13] A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547, 2013.
[BCG00] C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):719–725, July 2000.
[BCG03] C. Biernacki, G.
Celeux, and G. Govaert. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate gaussian mixture models. Computational Statistics & Data Analysis, 41(3-4):561–575, 2003. [BGH09] Y. Baraud, C. Giraud, and S. Huet. Gaussian model selection with an unknown variance. The Annals of Statistics, 37(2):630–672, 2009. [BKM04] E. Brown, R. Kass, and P. Mitra. Multiple neural spike train data analysis: stateof-the-art and future challenges. Nature neuroscience, 7(5):456–461, 2004. [BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: Nonasymptotic Theory of Independence. OUP Oxford, 2013. [BM93] L. Birgé and P. Massart. Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields, 97(1-2):113–150, 1993. 165 A BIBLIOGRAPHY 166 [BM01] L. Birgé and P. Massart. Gaussian model selection. Journal of the European Mathematical Society, 3(3):203–268, 2001. [BM07] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2), 2007. [BMM12] J-P Baudry, C. Maugis, and B. Michel. Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455–470, 2012. [Bre01] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. [BRT09] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009. [BSW11] F. Bunea, Y. She, and M. Wegkamp. Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics, 39(2):1282–1309, 2011. [BSW12] F. Bunea, Y. She, and M. Wegkamp. Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. The Annals of Statistics, 40(5):2359–2388, 2012. [Bun08] F. Bunea. Consistent selection via the Lasso for high dimensional approximating regression models. In Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, volume 3 of Inst. Math. Stat. Collect., pages 122– 137. Inst. Math. Statist., Beachwood, OH, 2008. [BvdG11] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, 2011. [CDS98] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Siam Journal On Scientific Computing, 20:33–61, 1998. [CG93] G. Celeux and G. Govaert. Gaussian parsimonious clustering models. Technical Report RR-2028, INRIA, 1993. [Cha14] M. Chaouch. Clustering-based improvement of nonparametric functional time series forecasting: Application to intra-day household-level load curves. IEEE Transactions on Smart Grid, 5(1):411–419, 2014. [CLP11] S. Cohen and E. Le Pennec. Conditional density estimation by penalized likelihood model selection and applications. Research Report RR-7596, 2011. [CO14] A. Ciarleglio and T. Ogden. Wavelet-based scalar-on-function finite mixture regression models. Computational Statistics & Data Analysis, 2014. [Com11a] Commission for energy regulation, Dublin. Electricity smart metering customer behaviour trials findings report. 2011. [Com11b] Commission for energy regulation, Dublin. Results of electricity coast-benefit analysis, customer behaviour trials and technology trials commission for energy regulation. 2011. [CT07] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007. [Dau92] I. Daubechies. Ten Lectures on Wavelets. 
Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992. [Dev14a] E. Devijver. An ℓ1 -oracle inequality for the lasso in finite mixture of multivariate gaussian regression models, 2014. arXiv:1410.4682. [Dev14b] E. Devijver. Finite mixture regression: A sparse variable selection by model selection for clustering, 2014. arXiv:1409.1331. 167 BIBLIOGRAPHY [Dev14c] E. Devijver. Model-based clustering for high-dimensional data. application to functional data, 2014. arXiv:1409.1333. [Dev15] E. Devijver. Joint rank and variable selection for parsimonious estimation in highdimension finite mixture regression model, 2015. arXiv:1501.00442. [DJ94] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455, 1994. [DLR77] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Discussion. Journal of the Royal Statistical Society. Series B, 39:1–38, 1977. [EHJT04] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004. [FHP03] K.J. Friston, L. Harrison, and W.D. Penny. Dynamic Causal Modelling. NeuroImage, 19(4):1273–1302, 2003. [FP04] J. Fan and H. Peng. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3):928–961, 2004. [FR00] C. Fraley and A. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611–631, 2000. [FV06] F. Ferraty and P. Vieu. Nonparametric functional data analysis : theory and practice. Springer series in statistics. Springer, New York, 2006. [Gir11] C. Giraud. Low rank multivariate regression. Electronic Journal of Statistics, 5:775–799, 2011. [GLMZ10] J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010. [GW00] C. Genovese and L. Wasserman. Rates of convergence for the Gaussian mixture sieve. Annals of Statistics, 28(4):1105–1127, 2000. [H+ 13] L. Hancher et al. Think topic 11: ‘Shift, not drift: Towards active demand response and beyond’. 2013. [HK70] A. Hoerl and R. Kennard. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1):55–67, 1970. [HTF01] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001. [Irw86] G.W. Irwin. Statistical electricity demand modelling from consumer billing data. IEE Proceedings C (Generation, Transmission and Distribution), 133:328–335(7), September 1986. [Ize75] A. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264, 1975. [KFR14] J. Kwac, J. Flora, and R. Rajagopal. Household energy consumption segmentation using hourly data. IEEE Transactions on Smart Grid, 5(1):420–430, 2014. [LSD15] W. Labeeuw, J. Stragier, and G. Deconinck. Potential of active demand reduction with residential wet appliances: A case study for belgium. Smart Grid, IEEE Transactions on, 6(1):315–323, Jan 2015. [Mal73] C. L. Mallows. Some comments on Cp. Technometrics, 15:661–675, 1973. 168 BIBLIOGRAPHY [Mal89] S. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:674–693, 1989. [Mal99] S. Mallat. A wavelet tour of signal processing. Academic Press, 1999. [Mas07] P. Massart. 
[MB88] G.J. McLachlan and K.E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.
[MB06] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
[MB10] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 72(4):417–473, 2010.
[Mey12] C. Meynet. Sélection de variables pour la classification non supervisée en grande dimension. Ph.D. thesis, Université Paris-Sud 11, 2012.
[Mey13] C. Meynet. An ℓ1-oracle inequality for the Lasso in finite mixture Gaussian regression models. ESAIM: Probability and Statistics, 17:650–671, 2013.
[MK97] G.J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley & Sons, New York, 1997.
[MM11a] P. Massart and C. Meynet. The Lasso as an ℓ1-ball model selection procedure. Electronic Journal of Statistics, 5:669–687, 2011.
[MM11b] C. Maugis and B. Michel. A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics, 15:41–68, 2011.
[MMOP04] M. Misiti, Y. Misiti, G. Oppenheim, and J-M. Poggi. Matlab Wavelet Toolbox User's Guide, Version 3. The MathWorks, Inc., Natick, MA, 2004.
[MMOP07] M. Misiti, Y. Misiti, G. Oppenheim, and J-M. Poggi. Clustering signals using wavelets. In F. Sandoval, A. Prieto, J. Cabestany, and M. Graña, editors, Computational and Ambient Intelligence, volume 4507 of Lecture Notes in Computer Science, pages 514–521. Springer Berlin Heidelberg, 2007.
[MMOP10] M. Misiti, Y. Misiti, G. Oppenheim, and J-M. Poggi. Optimized clusters for disaggregated electricity load forecasting. REVSTAT, 8(2):105–124, 2010.
[MMR12] C. Meynet and C. Maugis-Rabusseau. A sparse variable selection procedure in model-based clustering. Research report, September 2012.
[MP04] G. McLachlan and D. Peel. Finite Mixture Models. Wiley Series in Probability and Statistics: Applied Probability and Statistics. Wiley, 2004.
[MS14] Z. Ma and T. Sun. Adaptive sparse reduced-rank regression, 2014. arXiv:1403.1922.
[MY09] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270, 2009.
[OPT99] M. Osborne, B. Presnell, and B. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9:319–337, 1999.
[PC08] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
[PS07] W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.
[Ran71] W. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[RD06] A. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101:168–178, 2006.
[RS05] J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer Series in Statistics. Springer, New York, 2005.
[RT11] P. Rigollet and A. Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2):731–771, 2011.
[SBG10] N. Städler, P. Bühlmann, and S. van de Geer. ℓ1-penalization for mixture regression models. Test, 19(2):209–256, 2010.
[SBvdGR10] N. Städler, P. Bühlmann, and S. van de Geer. Rejoinder: comments on ℓ1-penalization for mixture regression models. Test, 19(2):209–256, 2010.
[Sch78] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[SFHT13] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.
[Sha11] H-L. Shang. rainbow: An R Package for Visualizing Functional Time Series. The R Journal, 3(2):54–59, December 2011.
[SWF12] W. Sun, J. Wang, and Y. Fang. Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electronic Journal of Statistics, 6:148–167, 2012.
[SZ12] T. Sun and C-H. Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
[Tib96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[TMZT06] A. Thalamuthu, I. Mukhopadhyay, X. Zheng, and G. Tseng. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics, 22(19):2405–2412, 2006.
[TSM85] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, 1985.
[TSR+05] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, pages 91–108, 2005.
[US 06] US Department of Energy. Benefits of demand response in electricity markets and recommendations for achieving them: a report to the United States Congress pursuant to Section 1252 of the Energy Policy Act of 2005, page 122, February 2006.
[Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, New York, 1982.
[vdG13] S. van de Geer. Generic chaining and the ℓ1-penalty. Journal of Statistical Planning and Inference, 143(6):1001–1012, 2013.
[vdGB09] S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
[vdGBRD14] S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
[vdGBZ11] S. van de Geer, P. Bühlmann, and S. Zhou. The adaptive and the thresholded lasso for potentially misspecified models (and a lower bound for the lasso). Electronic Journal of Statistics, 5:688–749, 2011.
[vdVW96] A.W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, 1996.
[YB99] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.
[YFL11] F. Yao, Y. Fu, and T. Lee. Functional mixture regression. Biostatistics, 12(2):341–353, 2011.
[YLL06] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
[YLL12] M-S. Yang, C-Y. Lai, and C-Y. Lin. A robust EM clustering algorithm for Gaussian mixture models. Pattern Recognition, 45(11):3950–3961, 2012.
[ZH05] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
[ZH08] C-H. Zhang and J. Huang. The sparsity and bias of the LASSO selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.
[Zha10] C-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
[ZHT07] H. Zou, T. Hastie, and R. Tibshirani. On the "degrees of freedom" of the lasso. The Annals of Statistics, 35(5):2173–2192, 2007.
[ZOR12] Y. Zhao, T. Ogden, and P. Reiss. Wavelet-based LASSO in functional linear regression. Journal of Computational and Graphical Statistics, 21(3):600–617, 2012.
[Zou06] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
[ZPS09] H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.
[ZY06] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
[ZYS13] K-L. Zhou, S-L. Yang, and C. Shen. A review of electric load classification in smart grid environment. Renewable and Sustainable Energy Reviews, 24:103–110, 2013.
[ZZ10] N. Zhou and J. Zhu. Group variable selection via a hierarchical lasso and its oracle property. Statistics and Its Interface, 3:557–574, 2010.