THE STRUCTURE OF MODEL SELECTION

CHRISTINE CHOIRAT AND RAFFAELLO SERI

Abstract. Most treatments of the model selection problem are either restricted to special situations (lag selection in AR, MA or ARMA models, regression selection, selection of a model out of a nested sequence) or to special selection methods (selection through testing or penalization). Our aim is to provide some basic tools for the analysis of model selection as a statistical decision problem, independently of the situation and of the method used. In order to achieve this objective, we embed model selection in the theoretical decision framework offered by modern Decision Theory. This allows us to obtain simple conditions under which pairwise comparison of models and penalization of objective functions arise naturally from preferences defined on the collection of statistical models under scrutiny. As a major application of our framework, we derive necessary and sufficient conditions that an information criterion has to satisfy, in the case of independent and identically distributed realizations, in order to deliver almost surely the “true” model out of a class of J models. “In probability” versions of the previous results are also discussed. Finally, some results concerning the optimality of model selection procedures are presented.

Contents

1. Introduction
2. Preliminary Definitions
3. A Decision Theoretic Framework for Model Selection
3.1. Consistency and Conservativeness
3.2. Model Selection through Penalization
3.3. Model Selection through Pairwise Penalization
3.4. Model Selection through Testing
4. Necessary and Sufficient Conditions for Model Selection Through Penalization
4.1. Strong Limit Theorems for m-estimation
4.2. Necessary and Sufficient Conditions for Conservativeness and Consistency
5. Weak Forms of the Previous Concepts
6. Optimality of Model Selection Procedures
6.1. Risk Functions
6.2. Information Inequalities
7. Proofs
References

1. Introduction

Most treatments of the model selection problem are either restricted to special situations (lag selection in AR, MA or ARMA models, regression selection, selection of a model out of a nested sequence) or to special selection methods (selection through testing or penalization). It is however unclear whether the results obtained in these cases carry over to the general situation, in which the models under scrutiny are not necessarily nested and selection is performed using general criteria. Our aim is to provide some basic tools for the analysis of model selection as a statistical decision problem. We try to answer the following questions:

• How should we define the “best” model out of a class? Is the concept of “best model” necessarily linked to nesting?
• What properties should we expect from a good model selection procedure?
• Is model selection by pairwise comparisons able to choose the overall best model? If this is not true in general, under what conditions does this take place?
• Is model selection through information criteria a general enough procedure?
• Is it possible to derive necessary and sufficient conditions for model selection through information criteria? What is the link between model selection through penalization and through testing?

In order to achieve this objective, we embed the model selection procedure in the theoretical decision framework offered by modern Decision Theory. First of all, in Section 2, we analyze what a “true” or “best” model should be.
In order to do so, we introduce two preference relations on the collection of models under scrutiny; these relations are defined in terms of their asymptotic goodness-of-fit (as measured by a function $Q_{\infty,j}(\theta_j^*)$, which can be a likelihood, a generic objective function or a measure of forecasting performance) and of their parsimony (as measured by the number of parameters $p_j$). The first preference relation, written as ◮, is a lexicographic order in which model j is preferred to model i if it has a higher value of the objective function, or if the two have the same value of the objective function but j is more parsimonious than i. The second preference relation, written as ⊲, consists in choosing models with higher objective functions. Remark that these relations are defined only for the limit situation with n = +∞.

As described in Section 3, when passing from the asymptotic framework to the finite sample situation, model selection becomes a problem of choice in a random setting. However, with respect to the literature of Decision Theory under uncertainty, the distinctive feature of our approach is that we are interested in imposing both finite sample and asymptotic requirements on the selection method. In particular, in Section 3.1, we define two properties of a model selection procedure, called consistency and conservativeness and corresponding respectively to the preference relations ◮ and ⊲; they have already appeared in the literature in special cases (see e.g. [22], for the case of a nested sequence of models). Moreover, we derive a condition that allows for reducing the asymptotic analysis of a model selection procedure to the case in which only two models are in competition. As concerns the finite sample situation, in Section 3.2 we show that model selection through penalization of the objective function arises in a natural way from some properties of a preference relation. Model selection through pairwise penalization and through testing are briefly reviewed in Sections 3.3 and 3.4.

Section 4 contains a major application of the previous Theorems: we derive necessary and sufficient conditions that an information criterion has to satisfy, in the case of independent and identically distributed realizations, in order to deliver almost surely the “true” model out of a class of J models. It turns out that the bounds on the model selection procedures arise as strong limit theorems (Laws of Large Numbers and Laws of the Iterated Logarithm, for sums and V-statistics) associated with the weak limit theorems (Central Limit Theorems) that constitute the basis of Vuong’s ([28]) model selection procedures based on likelihood ratio tests. In Section 5 we introduce the “in probability” versions of consistency and conservativeness and we extend some of the previous results to this case. Finally, in Section 6, we outline an optimality result for model selection procedures.

2. Preliminary Definitions

A statistical model is a triplet of the form $M = (\Omega, \mathcal A, \mathcal P)$, where Ω is an abstract space, $\mathcal A$ a σ-algebra defined on Ω and $\mathcal P$ a family of probability measures defined on $(\Omega, \mathcal A)$. In our case, we will just consider parametric models, that is $\mathcal P = \{P_\theta,\ \theta \in \Theta\}$, where Θ is a subset of a Euclidean space. Remark that nothing imposes that the family $\mathcal P$ contain the probability measure that has generated the data: indeed, this situation is supposed to be very rare, and the main objective is to obtain a faithful but parsimonious representation of the data.
Since we are interested in model selection, we suppose we have a collection of statistical models identified by an index j ∈ {1, . . . , J}; each model is given by
\[
M_j = \left(\Omega_j, \mathcal A_j, \mathcal P_j = \{P_{\theta_j},\ \theta_j \in \Theta_j\}\right),
\]
where $\Theta_j \subset \mathbb R^{p_j}$ and $p_j$ is the number of parameters of the j-th model. A model selection procedure is a statistical decision (in the Wald sense) choosing a model out of the set {1, . . . , J}. The main difference with testing and with estimation in discrete parameter sets is that the objective of model selection is to choose a model that is the best both in terms of explicative power and in terms of parameter parsimony. Therefore, we introduce a lexicographic order on the values of the objective functions and the number of parameters: first of all, we prefer the statistical models $(M_j)$ with the highest value of the limit objective function ($Q_{\infty,j}(\theta_j^*)$) and, among these, the models with the lowest number of parameters ($p_j$). We formalize this ordering through the relation ◮: we say that i ◮ j if
\[
Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*), \quad \text{or} \quad Q_{\infty,i}(\theta_i^*) = Q_{\infty,j}(\theta_j^*) \text{ and } p_i \le p_j.
\]
The relation ◮ is complete, reflexive and transitive; it is therefore a complete preorder and a weak tournament (see e.g. [2]). This implies, in particular, that the restriction of this relation to any subset of {1, . . . , J} is still a complete preorder. Remark that in this definition (and in the following ones) we make no explicit reference to nesting, that is, to the fact that one of two models can be obtained from the other simply by constraining the parameter space. Indeed, when comparing two models on the basis of an information criterion, nesting is not at stake, even if it is often assumed; we will show how it is possible to develop a theory of model selection with almost no reference to nesting. Remark also that we do not suppose that the models $M_j$, j = 1, . . . , J, form a nested sequence (i.e. $M_j \subset M_{j+1}$): nesting would simplify somewhat the concepts and allow for relaxed conditions (i.e. J could be equal to +∞), but it is quite restrictive (it arises in the estimation of AR or MA models but fails to cover the case of ARMA models).

Probably the best way to represent the relation ◮ is through a graph. In this case each model is identified with a vertex of the graph, and the relation i ◮ j is represented by a directed arrow from i to j. In order to simplify the graph, it is customary to define two new relations, called I◮ and P◮:

i I◮ j ⇔ i ◮ j and j ◮ i,
i P◮ j ⇔ i ◮ j and not j ◮ i,

where the meaning of I◮ and of P◮ is respectively indifference and strict preference (they are sometimes called respectively the symmetric and the asymmetric part of ◮). Moreover, in order to avoid complex diagrams, reflexivity is often not represented. Since I◮ is an equivalence relation, {1, . . . , J} can be partitioned in a unique way into subsets such that the elements of a subset are indifferent with respect to the relation I◮: this partition is denoted {1, . . . , J}/I◮, that is, the quotient set of {1, . . . , J} with respect to I◮.¹ The relation ◮ defined on {1, . . . , J}/I◮ is also anti-symmetric and is therefore a linear order. The true model J* is defined as the set of majorants of the relation ◮ in {1, . . . , J} and is therefore the element of {1, . . . , J}/I◮ that dominates every other element. As long as J is finite, this set is always nonempty (the extension to infinite choice sets can raise some problems). Therefore, more formally:
\[
J^{**} = \arg\max_{j \in \{1,\dots,J\}} Q_{\infty,j}(\theta_j^*), \qquad J^* = \arg\min_{j \in J^{**}} p_j.
\]

¹Given a set A and an equivalence relation R, the quotient space A/R is the set of equivalence classes induced by R on A.
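As a minimal illustration of these definitions (a sketch under assumed inputs, not part of the original development), the following Python fragment computes J** and J* from known limit values $Q_{\infty,j}(\theta_j^*)$ and parameter counts $p_j$; the function name best_models, the tolerance and the numeric values are hypothetical:

    # Compute J** (maximizers of the limit objective function, relation ⊲)
    # and J* (the most parsimonious among them, relation ◮).
    def best_models(Q_inf, p, tol=1e-12):
        Q_max = max(Q_inf.values())
        J_star_star = {j for j, q in Q_inf.items() if abs(q - Q_max) <= tol}
        p_min = min(p[j] for j in J_star_star)
        J_star = {j for j in J_star_star if p[j] == p_min}
        return J_star_star, J_star

    # Hypothetical limit values for three competing models:
    Q_inf = {1: -1.7655, 2: -1.7655, 3: -2.1}
    p = {1: 1, 2: 2, 3: 1}
    print(best_models(Q_inf, p))   # ({1, 2}, {1})

Note that J** and J* are returned as sets, in agreement with the observation below that neither needs to be a singleton.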
Nothing prevents J* and J** from being sets composed of more than one element, since the majorant of a set need not be unique. Another important concept often arising in model selection is linked to the idea that the chosen model can be as good as another one in terms of explicative power, even if it is not as parsimonious in terms of parameters. It seems therefore interesting to introduce a new relation, written as i ⊲ j, if
\[
Q_{\infty,i}(\theta_i^*) \ge Q_{\infty,j}(\theta_j^*).
\]
As before, we can define I⊲, P⊲ and {1, . . . , J}/I⊲: the latter is a partition of {1, . . . , J} that can be ordered by the value of $Q_{\infty,i}(\theta_i^*)$.

Example 2.1. Consider an iid sample $\{Y_1, \dots, Y_n\}$ from a standard normal distribution. Consider the two statistical models:
\[
q_1(y; \theta_1) = \ln \phi\!\left(\frac{y+1}{\sigma}\right) - \ln \sigma, \qquad \sigma \in (0, +\infty),
\]
\[
q_2(y; \theta_2) = \ln \phi\!\left(\frac{y-\mu}{\sigma}\right) - \ln \sigma, \qquad \sigma \in (0, +\infty),\ \mu \in [1, \infty).
\]
The limit objective functions are:
\[
Q_{\infty,1}(\theta_1) = -\frac{1}{2}\ln 2\pi - \ln \sigma - \frac{1}{\sigma^2},
\qquad
Q_{\infty,2}(\theta_2) = -\frac{1}{2}\ln 2\pi - \ln \sigma - \frac{1+\mu^2}{2\sigma^2};
\]
the pseudo-true values and the objective functions calculated at these points are:
\[
\theta_1^* = \sigma^* = \sqrt{2}, \qquad Q_{\infty,1}(\theta_1^*) = -\ln\!\left(\sqrt{2e}\,\sqrt{2\pi}\right),
\]
\[
\theta_2^* = (\mu^*, \sigma^*) = \left(1, \sqrt{2}\right), \qquad Q_{\infty,2}(\theta_2^*) = -\ln\!\left(\sqrt{2e}\,\sqrt{2\pi}\right).
\]
Therefore, $Q_{\infty,1}(\theta_1^*) = Q_{\infty,2}(\theta_2^*)$ and $p_1 < p_2$. We say that 1 P◮ 2 (that is, 1 ◮ 2 and not 2 ◮ 1) and 1 I⊲ 2 (that is, 1 ⊲ 2 and 2 ⊲ 1).
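The pseudo-true values in Example 2.1 can be checked numerically; the following sketch is ours (it assumes SciPy is available) and maximizes the two limit objective functions directly:

    import numpy as np
    from scipy.optimize import minimize

    ln2pi = np.log(2 * np.pi)

    def Q1(params):                 # theta_1 = (sigma,)
        (s,) = params
        return -0.5 * ln2pi - np.log(s) - 1.0 / s**2

    def Q2(params):                 # theta_2 = (mu, sigma), mu >= 1
        mu, s = params
        return -0.5 * ln2pi - np.log(s) - (1 + mu**2) / (2 * s**2)

    r1 = minimize(lambda x: -Q1(x), x0=[1.0], bounds=[(1e-6, None)])
    r2 = minimize(lambda x: -Q2(x), x0=[1.5, 1.0], bounds=[(1.0, None), (1e-6, None)])
    print(r1.x, -r1.fun)   # sigma* close to sqrt(2), maximum close to -1.7655
    print(r2.x, -r2.fun)   # (mu*, sigma*) close to (1, sqrt(2)), same maximum

Both maxima agree while model 1 uses one parameter fewer: exactly the situation 1 P◮ 2 and 1 I⊲ 2.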
3. A Decision Theoretic Framework for Model Selection

The problem is that, when passing from theory to applications, the values $Q_{\infty,j}(\theta_j^*)$ are not known and have to be approximated using the observed values $Q_{n,j}(\hat\theta_j)$. Since these values are random, we have to introduce an estimator of J*, say $\hat J$; we suppose that $\hat J$ and $\lim_{n\to\infty}\hat J$ are single-valued, since ties can always be broken.² In Wald’s statistical decision framework, $\hat J$ is called a strategy or a decision rule.

²This means that if several models have exactly the same value of the objective function, we can choose one of them according to some deterministic (ordering according to the index) or random (sampling one of the models) rule.

In the following, we are going to explore two properties of $\hat J$, namely consistency and conservativeness. In principle, it would be possible to define two different kinds of consistency and conservativeness, namely standard (with reference to a particular choice set {1, . . . , J}) and uniform (over any choice set {1, . . . , J}). However, this would create some problems, since $\hat J$ could be conservative, consistent or even not conservative according to the choice set.³ Therefore, the following properties have to be intended over any possible choice of {1, . . . , J}, and not with reference to a particular choice set. We write $\hat J_A$ to indicate that the model selection procedure works on the choice set A, and we use the shortcut $\hat J_J = \hat J_{\{1,\dots,J\}}$.

³Indeed, if the models in {1, . . . , J} have different values of the limit objective function $Q_{\infty,j}(\theta_j^*)$, $\hat J = \arg\max_{j\in\{1,\dots,J\}} Q_{n,j}(\hat\theta_j)$ is consistent, while in general it is just conservative.

3.1. Consistency and Conservativeness.

Now, we introduce the definitions of consistent and of conservative model selection procedures.

Definition 3.1. We say that $\hat J$ is:
• strongly conservative if, for any choice set A, $P\{\lim_{n\to\infty}\hat J_A \in J_A^{**}\} = 1$;
• strongly consistent if, for any choice set A, $P\{\lim_{n\to\infty}\hat J_A \in J_A^{*}\} = 1$.

The following Proposition gives some conditions that allow for simplifying the analysis of the model selection problem.

Proposition 3.2. Consider a group of competing models {1, . . . , J} and take a couple {i, j} out of the set {1, . . . , J}. Consider the following Assumptions:
A: For any couple {i, j} such that i P⊲ j, $P\{\lim_{n\to\infty}\hat J_{\{i,j\}} = i\} = 1$.
B: For any couple {i, j} such that i P◮ j, $P\{\lim_{n\to\infty}\hat J_{\{i,j\}} = i\} = 1$.
C: For any two choice sets A ⊂ B and any i ∈ A, we have
\[
\left\{\omega \,\middle|\, \lim_{n\to\infty}\hat J_B(\omega) = i\right\} \subset \left\{\omega \,\middle|\, \lim_{n\to\infty}\hat J_A(\omega) = i\right\}.
\]
D: $\hat J$ is strongly conservative.
E: $\hat J$ is strongly consistent.
Then:
\[
D \implies A, \qquad A\ \&\ C \implies D, \qquad E \implies B, \qquad B\ \&\ C \implies E,
\]
that is, for a model selection procedure respecting C, strong consistency (resp. conservativeness) over any choice set is equivalent to strong consistency (resp. conservativeness) over couples.

Remark 3.3. (i) C is a requirement concerning the absence of rank reversal ([13], Definition 1, p. 45, speaks of rationality): the introduction of new alternatives does not enlarge the set of ω’s leading to the choice of i. Even if it does not seem to be a necessary and sufficient condition for the equivalence of B and E (resp. A and D), it seems a minimal requirement of robustness for a model selection procedure. Moreover, the role of this hypothesis will be made clear in Section 3.2. (ii) The Proposition allows for decomposing the process of comparison of several alternatives into a series of pairwise comparisons. It is clear that A and B are much simpler than the original problem and therefore allow for analyzing more complex decision procedures. Remark that conservativeness puts some requirements only on models with different values of the limit objective function, while consistency also requires parsimony in parameters.

According to common sense, it would seem possible to choose a good model by comparing the values $Q_{n,j}(\hat\theta_j)$ for any j ∈ {1, . . . , J} and picking the model with the greatest value of the objective function. However, the model selection procedure defined by
\[
\hat J_J(\omega) = \arg\max_{j\in\{1,\dots,J\}} Q_{n,j}(\hat\theta_j(\omega))
\]
is just conservative (strongly or weakly according to the type of consistency of $\hat\theta_j$ and $Q_{n,j}(\hat\theta_j)$).⁴ This points at a fundamental deficiency of this method: indeed, if two models are equivalent in terms of explicative power (that is, $M_2 \subset M_1$ and $Q_{\infty,1}(\theta_1^*) = Q_{\infty,2}(\theta_2^*)$), the larger model will often be chosen for finite n, because it is the result of maximization over a larger parameter space.

⁴Indeed, it is consistent in a choice set {1, . . . , J} if there is no couple of indexes 1 and 2 such that $M_2 \subset M_1$ and $Q_{\infty,1}(\theta_1^*) = Q_{\infty,2}(\theta_2^*)$, since in any other case it will pick the right answer.

3.2. Model Selection through Penalization.

The following Proposition (see [13], Proposition 3, p. 46) shows that hypothesis C is a necessary and sufficient condition for the existence of a utility function.

Proposition 3.4. Consider a group of competing models A = {1, . . . , J}. Then, Assumption C in Proposition 3.2 holds if and only if there exists a function $U: A \times \Omega \to \mathbb R$, $(j, \omega) \mapsto U_j(\omega)$, such that $\hat J_J(\omega) = \arg\max_{j\in\{1,\dots,J\}} U_j(\omega)$.
This Proposition means that model selection performed through a utility function is equivalent to rational model selection, and that if model selection through a utility function is consistent (conservative) over couples, then it is overall consistent (conservative). Remark that this situation is paradoxical: when n = +∞, ◮ is a lexicographic ordering and cannot be represented by a utility function, while for finite n it can be; we would like to find conditions such that the choice arising from the optimization of a random utility function converges almost surely to the choice arising from the lexicographic ordering.

In the following we show that, under quite general hypotheses, our understanding of the form of U can be considerably increased. In order to do so, we suppose that the choice is based only on the value of the objective function $Q_{n,j}(\theta_j)$ and on the number of parameters $p_j$, and we define a preorder $\succeq$ on the couples of the form $(Q_{n,j}(\theta_j), p_j)$. Remark that this mimics ◮, which is defined on the couples of the form $(Q_{\infty,j}(\theta_j^*), p_j)$. We will need the following property of a preorder: for a preorder $\succeq$ defined on the space $X = X_1 \times X_2$ with generic element $(x_1, x_2)$, we say that the two components of X are independent if, for any fixed value $x_1 \in X_1$, the preorder induced by $\succeq$ on $X_2$ does not depend on the value $x_1$, and vice versa with 1 replaced by 2.

Proposition 3.5. Let n be fixed. Suppose that the following properties hold:
(1) The preference relation $\succeq$ is a complete preorder defined on the couples of the form $x_j = (Q_{n,j}(\theta_j), p_j) \in \mathbb R^2$ and can be extended (still being a complete preorder) to a connected set of the form $\mathcal Q \times \mathcal P$.
(2) The sets $\{x \in \mathcal Q \times \mathcal P : x \succeq x'\}$ and $\{x \in \mathcal Q \times \mathcal P : x \preceq x'\}$ are closed for every $x' \in \mathcal Q \times \mathcal P$.
(3) The factors $Q_{n,j}(\theta_j)$ and $p_j$ are independent, and the preference relation is strictly increasing in $Q_{n,j}(\theta_j)$ (for fixed $p_j$) and strictly decreasing in $p_j$ (for fixed $Q_{n,j}(\theta_j)$).
Then, there exists (up to an increasing linear transformation) a representation of U as:
\[
U_j(\omega) = U^1(Q_{n,j}(\theta_j), n) + U^2(p_j, n),
\]
with $U^1$ continuous and strictly increasing and $U^2$ continuous and strictly decreasing.

Remark 3.6. (i) Hypothesis 1 requires that the relation $\succeq$ can be extended to the product of two connected sets; this is usually quite natural for the part involving $Q_{n,j}$, while it can sound strange for $p_j$, since these values are integers. Hypothesis 2 is customary in the theory of utility and involves a sort of continuity of preferences (see e.g. [20], pp. 17-18); it is not satisfied by lexicographic orders (see e.g. [20], p. 18). On the other hand, hypothesis 3 is completely natural in this context. (ii) Remark that the resulting utility function depends on ω only through $Q_{n,j}(\theta_j)$. This shows that separability of goodness-of-fit and parsimony is a rather natural property.

The strategy that has been most studied in the literature, and that we examine in the following, is to build a utility function by penalizing the objective function with a random variable depending on n, on the $p_j$'s and, in some cases, even on the data: in the literature this is usually called a model selection criterion. For any model j ∈ {1, . . . , J}, we define the penalized objective function $\overline Q_{n,j}$ as:
(3.1)
\[
\overline Q_{n,j}(\theta_j) = Q_{n,j}(\theta_j) - c(n, p_j, \mathbf Y).
\]
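A model selection criterion of the form (3.1) is straightforward to implement once the m-estimators are available. The following Python sketch is a minimal illustration (names and the optimizer choice are ours; it assumes each model is given as a vectorized per-observation objective $q_j$, a starting value and a parameter count, and that the penalty does not depend on the data):

    import numpy as np
    from scipy.optimize import minimize

    def select_model(y, models, penalty):
        # models: list of (q, theta0, p); penalty: function c(n, p).
        n = len(y)
        scores = []
        for q, theta0, p in models:
            # m-estimation: maximize Q_{n,j}(theta) = (1/n) sum_i q(Y_i; theta)
            res = minimize(lambda th: -np.mean(q(y, th)), x0=theta0)
            # penalized objective (3.1) evaluated at the m-estimator
            scores.append(-res.fun - penalty(n, p))
        return int(np.argmax(scores))

With penalty(n, p) = p * np.log(n) / (2 * n) this is BIC-type selection; with p / n it is AIC-type selection (see Remark 4.4 below).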
We suppose that the penalization $c(n, p_j, \mathbf Y)$ in (3.1) depends on the model just through $p_j$: some methods used in model selection allow the penalization to depend on some features of the other competing models, but this case seems to be quite marginal. Remark that Proposition 3.2 authorizes us to base the choice of a model on pairwise comparisons and that, therefore, we could more generally consider an information criterion having the same form as the ones considered in [25] (eq. (2.6)). We will discuss in the following how the limitations we impose can be circumvented. Proposition 3.4 implies that no information criterion can give rise to nonrational behavior. Therefore, when model selection is performed through an information criterion, the following equivalences in Proposition 3.2 hold true: D ⇔ A, E ⇔ B. This means that the failure to identify the correct model has to be imputed to failure over couples of models, and that it is possible to obtain necessary and sufficient conditions for conservativeness and consistency of model selection procedures in any choice set simply by considering choice sets composed of two models. As we will see below, the Akaike Information Criterion fails to deliver the right model over arbitrary choice sets because it fails to identify the best one over an arbitrary pair.

3.3. Model Selection through Pairwise Penalization.

Looking at the previous Section, it is clear that the penalizations enter pairwise comparisons only through their differences, such as $[c(n, p_j, \mathbf Y) - c(n, p_i, \mathbf Y)]$. This means that we can also study the case in which no information criterion of the form $\overline Q_{n,j}(\theta_j) = Q_{n,j}(\theta_j) - c(n, p_j, \mathbf Y)$ exists, but comparisons are carried through using pairwise information criteria of the form:
(3.2)
\[
\overline Q_{n,(i,j)}(\theta_i, \theta_j) = Q_{n,i}(\theta_i) - Q_{n,j}(\theta_j) + c(n, p_j, p_i, \mathbf Y).
\]
In this case, we would choose i over j if $\overline Q_{n,(i,j)}(\theta_i, \theta_j) > 0$. This is the situation considered e.g. in [25]. It is reasonable to suppose that
(3.3)
\[
\overline Q_{n,(i,j)}(\theta_i, \theta_j) = -\overline Q_{n,(j,i)}(\theta_j, \theta_i),
\]
since in this case the comparison of i and j is carried through just once. (3.3) is equivalent to skew-symmetry of the penalization c, i.e. $c(n, p_j, p_i, \mathbf Y) = -c(n, p_i, p_j, \mathbf Y)$. The main problem with this approach is to show that hypothesis C holds. Indeed, as concerns the fulfillment of hypotheses A and B, it is simple to derive conditions for the Theorems of Section 4 to hold in the same form with $[c(n, p_j, \mathbf Y) - c(n, p_i, \mathbf Y)]$ replaced by both $c(n, p_j, p_i, \mathbf Y)$ and $-c(n, p_i, p_j, \mathbf Y)$. To verify to what extent this is indeed the case, we introduce some concepts from Decision Theory. For any fixed value ω ∈ Ω, $\overline Q_{n,(i,j)}$ corresponds to what is called a comparison function ([2]) on {1, . . . , J}, i.e. a skew-symmetric real-valued function on {1, . . . , J}². It is well recognized that the most difficult problem with comparison functions is the aggregation of the pairwise comparisons into a single overall choice, that is, precisely the fulfillment of hypothesis C.
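When only the pairwise criteria (3.2)-(3.3) are available, selection amounts to looking for a model that beats every other one; the following minimal sketch (ours) makes the difficulty with hypothesis C visible:

    def tournament_winner(J, pairwise):
        # pairwise(i, j) is assumed to return Qbar_{n,(i,j)}, skew-symmetric
        # in (i, j) as in (3.3); i is preferred to j when it is positive.
        for i in range(J):
            if all(pairwise(i, j) > 0 for j in range(J) if j != i):
                return i
        return None   # no overall winner: a cycle among pairwise choices

A skew-symmetric comparison function can produce cycles (i beats j, j beats k, k beats i), in which case no overall winner exists and the pairwise procedure cannot be rationalized by a utility function.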
3.4. Model Selection through Testing.

An alternative strategy that has received large attention in the literature is model selection through testing. Cox ([6, 7]) proposes to use pairwise comparisons through likelihood ratio (LR) tests for model selection: the idea is to perform an LR test between two alternative models under the null hypothesis that one of them is correctly specified. This yields a test that depends on which model is retained to act as the null one. If, as is often done, two different tests are performed assuming each of the two alternative models in turn as the null one, the result is highly informative about the adequacy of the models to the data (both models can be accepted, both rejected, or one accepted and one rejected), but often cumbersome to interpret.

In a very important paper, Vuong ([28]) (see also [23] for the dynamic case) modifies Cox’s idea: the strategy proposed by Vuong is based on a shift in the point of view. When LR tests are performed, the null hypothesis is that both models are equally close to the “best” one. The procedure is based on two test statistics, the likelihood ratio statistic and the variance statistic. Suppose we want to test that two estimated densities f and g, with respective estimated parameters $\hat\theta_f$ and $\hat\theta_g$ and pseudo-true values $\theta_f^*$ and $\theta_g^*$, are indeed the same. If $f(\cdot;\theta_f^*) = g(\cdot;\theta_g^*)$, then the likelihood ratio statistic has the following asymptotic behavior:
(3.4)
\[
2\cdot\sum_{i=1}^n \ln\frac{f(y_i;\hat\theta_f)}{g(y_i;\hat\theta_g)} \xrightarrow{\ D\ } W_C(\lambda^*),
\]
where $W_C(\lambda)$ is a weighted sum of chi-squares with weights given by λ, and λ* are the eigenvalues of a matrix (this matrix will be described below). On the other hand, if $f(\cdot;\theta_f^*) \ne g(\cdot;\theta_g^*)$, then:
(3.5)
\[
\sqrt n\cdot\left[\frac 1n\sum_{i=1}^n \ln\frac{f(y_i;\hat\theta_f)}{g(y_i;\hat\theta_g)} - E_0\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)}\right] \xrightarrow{\ D\ } N\left(0, \omega_*^2\right).
\]
Moreover, $\omega_*^2 = V_0\!\left[\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)}\right] = 0$ if and only if $f(\cdot;\theta_f^*) = g(\cdot;\theta_g^*)$. This situation is reminiscent of the so-called von Mises expansion of a statistic and of the behavior of nondegenerate and degenerate U- and V-statistics: indeed, the asymptotic distribution is determined either by the linear part of the statistic (thus yielding a normal distribution) or by the quadratic part (thus yielding a weighted sum of chi-squares), according to the centering. As concerns the variance statistic, Vuong proposes two estimators of $\omega_*^2$ and shows that, under the null, $n\cdot\hat\omega_*^2 \xrightarrow{\ D\ } W_C\!\left((\lambda^*)^2\right)$, where $(\lambda^*)^2$ is the vector of the squares of λ*. Then, three situations can arise:

• f and g are strictly non-nested (i.e. there exists no value of the parameters such that f = g): under the null hypothesis that $E_0\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)} = 0$, we build a test based on (3.5); remark that $\sqrt n\cdot\frac 1n\sum_{i=1}^n\ln\frac{f(y_i;\hat\theta_f)}{g(y_i;\hat\theta_g)} \xrightarrow{\ as\ } +\infty$ if $E_0\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)} > 0$, and vice versa.
• f and g are nested (i.e. one of the models is contained in the other): under the null hypothesis that $E_0\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)} = 0$, we build a test based on (3.4).⁵
• f and g are overlapping (i.e. there exists some value of the parameters such that f = g, but the models are not nested): in this case, Vuong proposes to use first the variance statistic to test whether $f(\cdot;\theta_f^*) = g(\cdot;\theta_g^*)$, and then to use the appropriate asymptotic distribution to perform a likelihood ratio test.

⁵It is also possible to use the variance statistic to test this hypothesis, but this requires more stringent hypotheses.
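For the strictly non-nested case, the test based on (3.5) admits a compact implementation. The following sketch is ours (a simplified reading of Vuong's procedure that uses the sample variance of the pointwise log-likelihood ratios as estimator of $\omega_*^2$); it assumes the fitted per-observation log-densities are available:

    import numpy as np
    from scipy.stats import norm

    def vuong_nonnested(logf, logg, alpha=0.05):
        # logf, logg: arrays of ln f(y_i; theta_f_hat) and ln g(y_i; theta_g_hat)
        d = np.asarray(logf) - np.asarray(logg)
        n = d.size
        stat = np.sqrt(n) * d.mean() / d.std(ddof=1)  # ~ N(0,1) under the null
        if abs(stat) <= norm.ppf(1 - alpha / 2):
            return "no discrimination between f and g"
        return "choose f" if stat > 0 else "choose g"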
Unfortunately, these methods do not allow in general for asymptotic selection of a “best” model: when testing is performed at a fixed critical size, the probability of choosing the “best” model is bounded away from one. This can be overcome using a procedure such as the one in [29], in which the critical size of the tests goes to zero as the sample size goes to infinity. Apart from the fact that the properties stated by [29] are weak ones (that is, they do not hold almost surely, but only in probability), in this case an “in probability” analogue of property B is enforced to hold; if C is guaranteed, then we can expect that a weak concept of consistency holds. The appropriate tools to deal with this case will be briefly introduced in Section 5.

4. Necessary and Sufficient Conditions for Model Selection Through Penalization

The objective of this Section is to obtain conditions for conservativeness and consistency of model selection procedures in this simplified situation. First of all, we introduce some hypotheses and derive some strong limit results for m-estimators. Then we show how these results can be used in order to obtain necessary and sufficient conditions for consistency and conservativeness of model selection procedures. The conditions are very similar to the ones in [21] and [25], since both theirs and ours establish a link between model selection and LLNs and LILs. However, while [25] deals with heterogeneous and dependent data, thus covering a larger scope of cases, our objective is radically different, since we aim at proving that strong laws for m-estimators act as bounds for model selection criteria.

4.1. Strong Limit Theorems for m-estimation.

We have n independent and identically distributed (iid) observations from a probability measure $P_0$: the hypothesis of iid observations is just made to simplify things, but could be relaxed. Indeed, this hypothesis is needed just to derive the exact form of the bounds appearing in the Laws of the Iterated Logarithm and to derive necessary and sufficient conditions; if only sufficient conditions are needed, weaker hypotheses can be used as well. We suppose that estimation of the parameters $\theta_j$ is performed by maximization of an objective function taking the following form:
\[
Q_{n,j}(\theta_j) = \frac 1n\sum_{i=1}^n q_j(Y_i(\omega); \theta_j).
\]
This kind of inference is called m-estimation, that is, maximum-likelihood-type estimation, since $Q_{n,j}(\theta_j)$ mimics the usual log-likelihood function. The m-estimator of $\theta_j$ is therefore:
\[
\hat\theta_j = \arg\max_{\theta_j \in \Theta_j} Q_{n,j}(\theta_j).
\]
In the following, we will give some conditions ensuring that the objective function $Q_{n,j}(\theta_j)$ converges almost surely to a fixed $Q_{\infty,j}(\theta_j)$ given by $Q_{\infty,j}(\theta_j) = E_0[q_j(Y; \theta_j)]$, where $E_0$ is the expectation taken with respect to $P_0$, and that $\hat\theta_j$ converges almost surely to a value called the pseudo-true value and defined by:
\[
\theta_j^* = \arg\max_{\theta_j \in \Theta_j} Q_{\infty,j}(\theta_j).
\]
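As an illustration (ours), the m-estimator of σ in model 1 of Example 2.1 can be computed on simulated N(0, 1) data; as n grows it approaches the pseudo-true value √2, and Theorem 4.1 below quantifies the almost sure fluctuations around it:

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    y = rng.standard_normal(100_000)

    def Qn(s):   # (1/n) sum_i q_1(Y_i; sigma), with q_1 as in Example 2.1
        return np.mean(-0.5 * np.log(2 * np.pi) - np.log(s)
                       - (y + 1) ** 2 / (2 * s ** 2))

    res = minimize_scalar(lambda s: -Qn(s), bounds=(1e-3, 10.0), method="bounded")
    print(res.x)   # close to sqrt(2), i.e. about 1.4142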
We will need the following assumptions.

A1: $\theta_j^*$ exists for any j.
A2: The derivatives $\frac{\partial q_j(y;\theta_j)}{\partial\theta_j}$ and $\frac{\partial^2 q_j(y;\theta_j)}{\partial\theta_j\,\partial\theta_j^T}$ are measurable with respect to y for any $\theta_j$ and continuous with respect to $\theta_j$. Moreover, if we denote by $\theta_{j,k}$ the k-th element of $\theta_j$, the functions $|q_j(y;\theta_j)|$, $|q_j(y;\theta_j)|^2$, $\bigl|\frac{\partial q_j(y;\theta_j)}{\partial\theta_{j,k}}\bigr|$, $\bigl|\frac{\partial^2 q_j(y;\theta_j)}{\partial\theta_{j,k}\,\partial\theta_{j,h}}\bigr|$ and $\bigl|\frac{\partial q_j(y;\theta_j)}{\partial\theta_{j,k}}\cdot\frac{\partial q_j(y;\theta_j)}{\partial\theta_{j,h}}\bigr|$ are dominated by functions not depending on $\theta_j$ and integrable with respect to $P_0$.
A3: Define the matrices
\[
A_j(\theta_j) = E_0\!\left[\frac{\partial^2 q_j(Y;\theta_j)}{\partial\theta_j\,\partial\theta_j^T}\right], \qquad
B_j(\theta_j) = E_0\!\left[\frac{\partial q_j(Y;\theta_j)}{\partial\theta_j}\,\frac{\partial q_j(Y;\theta_j)}{\partial\theta_j^T}\right], \qquad
B_{ij}(\theta_i,\theta_j) = E_0\!\left[\frac{\partial q_i(Y;\theta_i)}{\partial\theta_i}\,\frac{\partial q_j(Y;\theta_j)}{\partial\theta_j^T}\right].
\]
Then, $-A_j(\theta_j^*)$ and $B_j(\theta_j^*)$ are finite and positive definite.
A4: The estimator $\hat\theta_j$ converges almost surely to $\theta_j^*$.
A5: The third derivative $\frac{\partial^3 q_j(y;\theta_j)}{\partial\theta_{j,k}\,\partial\theta_{j,h}\,\partial\theta_{j,\ell}}$ is measurable with respect to y for each θ. Moreover, $\bigl|\frac{\partial q_j(y;\theta_j)}{\partial\theta_{j,k}}\bigr|^2$, $\bigl|\frac{\partial^2 q_j(y;\theta_j)}{\partial\theta_{j,k}\,\partial\theta_{j,h}}\bigr|^2$ and $\bigl|\frac{\partial^3 q_j(y;\theta_j)}{\partial\theta_{j,k}\,\partial\theta_{j,h}\,\partial\theta_{j,\ell}}\bigr|$ are dominated by $P_0$-integrable functions which do not depend on θ.

Then we can state the following Theorem.

Theorem 4.1. Suppose that Assumptions A1-A5 hold. Then
\[
\lim_{n\to\infty} Q_{n,j}(\hat\theta_j) = Q_{\infty,j}(\theta_j^*)
\]
and the following facts hold.
(1) We have:
\[
\limsup_{n\to\infty} \frac{\hat\theta_j - \theta_j^*}{\sqrt{\tfrac{2\ln\ln n}{n}}} = +\left\{\operatorname{diag}\!\left[A_j(\theta_j^*)^{-1} B_j(\theta_j^*) A_j(\theta_j^*)^{-1}\right]\right\}^{1/2},
\qquad
\liminf_{n\to\infty} \frac{\hat\theta_j - \theta_j^*}{\sqrt{\tfrac{2\ln\ln n}{n}}} = -\left\{\operatorname{diag}\!\left[A_j(\theta_j^*)^{-1} B_j(\theta_j^*) A_j(\theta_j^*)^{-1}\right]\right\}^{1/2}.
\]
(2) We have:
\[
\limsup_{n\to\infty} \frac{n\left\{\left[Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)\right] - \left[Q_{n,i}(\hat\theta_i) - Q_{n,i}(\theta_i^*)\right]\right\}}{\ln\ln n} = \lambda_{ji} \quad P\text{-a.s.},
\]
where $\lambda_{ji}$ is the maximal eigenvalue of the matrix
\[
\begin{bmatrix}
-A_j(\theta_j^*)^{-1} B_j(\theta_j^*) & -A_j(\theta_j^*)^{-1} B_{ji}(\theta_j^*,\theta_i^*) \\
A_i(\theta_i^*)^{-1} B_{ij}(\theta_i^*,\theta_j^*) & A_i(\theta_i^*)^{-1} B_i(\theta_i^*)
\end{bmatrix}.
\]
(3) We have:
\[
\limsup_{n\to\infty} \frac{Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)}{\sqrt{\tfrac{2\ln\ln n}{n}}} = +\left\{E_0\!\left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right]^2\right\}^{1/2},
\qquad
\liminf_{n\to\infty} \frac{Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)}{\sqrt{\tfrac{2\ln\ln n}{n}}} = -\left\{E_0\!\left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right]^2\right\}^{1/2}.
\]

Remark 4.2. The eigenvalues of the matrix appearing in the second part of the Theorem are the same as those of the matrix of Theorem 3.3 (i) in [28]. Indeed, let us set (as in [28], respectively on pages 313, 327 and 328):
\[
Q = \begin{bmatrix} -A_j(\theta_j^*) & 0 \\ 0 & A_i(\theta_i^*) \end{bmatrix}, \qquad
A = \begin{bmatrix} A_j(\theta_j^*) & 0 \\ 0 & A_i(\theta_i^*) \end{bmatrix}, \qquad
B = B(\theta_j^*, \theta_i^*).
\]
Then, our matrix is $Q^{-1}B$, while Vuong’s matrix is $W = QA^{-1}BA^{-1}$. However:
\[
\operatorname{Sp}(QA^{-1}BA^{-1}) = \operatorname{Sp}(B^{1/2}A^{-1}QA^{-1}B^{1/2}) = \operatorname{Sp}(B^{1/2}Q^{-1}B^{1/2}) = \operatorname{Sp}(Q^{-1}B),
\]
where Sp(·) is the spectrum of a matrix and we have used the fact that $A^{-1}QA^{-1} = Q^{-1}$. This suggests that model selection through penalization is consistent if the penalization term is bounded by the Law of the Iterated Logarithm associated with the asymptotic distributions of the test statistics proposed in [28].

4.2. Necessary and Sufficient Conditions for Conservativeness and Consistency.

Now, we state some conditions for conservativeness and consistency of a model selection criterion.

Theorem 4.3. Let Assumptions A1-A5 hold. The model selection procedure defined by the maximization of the penalized objective function (3.1),
\[
\hat J_J(\omega) = \arg\max_{j\in\{1,\dots,J\}} \overline Q_{n,j}(\hat\theta_j(\omega)),
\]
respects Assumption C of Proposition 3.2. Moreover, the following necessary and sufficient conditions hold:
• strong conservativeness: for any couple {i, j} such that i P⊲ j, $c(n,p,\mathbf Y)$ is increasing in p and
\[
Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > \limsup_{n\to\infty}\left[c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y)\right];
\]
• strong consistency: for any couple {i, j} such that i P◮ j, $c(n,p,\mathbf Y)$ is increasing in p; moreover, if i P⊲ j,
\[
Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > \limsup_{n\to\infty}\left[c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y)\right],
\]
and if i P◮ j but i I⊲ j,
\[
\liminf_{n\to\infty}\frac{n\cdot\left[c(n,p_j,\mathbf Y) - c(n,p_i,\mathbf Y)\right]}{\ln\ln n} > \lambda_{ji}.
\]
Remark 4.4. (i) Consider the situation in which models i and j are both correctly specified. Then $A_j(\theta_j^*) = -B_j(\theta_j^*)$ and $A_i(\theta_i^*) = -B_i(\theta_i^*)$. If
\[
B_i(\theta_i^*) = B_{ij}(\theta_i^*,\theta_j^*)\, B_j(\theta_j^*)^{-1}\, B_{ji}(\theta_j^*,\theta_i^*),
\]
then we have $\lambda_{ji} = 1$. The condition is the same as in Corollary 3.4 of [28].
(ii) For BIC, we have $c(n,p_j,\mathbf Y) = \frac{p_j\ln n}{2n}$. Then, strong conservativeness holds since, for i P⊲ j, $c(n,p,\mathbf Y)$ is increasing in p and:
\[
Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > (p_i - p_j)\cdot\limsup_{n\to\infty}\frac{\ln n}{2n} = 0.
\]
As concerns strong consistency, the only new condition concerns the case in which i P◮ j but i I⊲ j:
\[
(p_j - p_i)\cdot\liminf_{n\to\infty}\frac{\ln n}{2\ln\ln n} = \infty > \lambda_{ji}.
\]
Therefore both conditions are verified automatically.
(iii) HQIC offers a different example. For HQIC, we have $c(n,p_j,\mathbf Y) = \frac{p_j\, c\,\ln\ln n}{n}$, where c will be determined in the following. Then, strong conservativeness holds since, for i P⊲ j, $c(n,p,\mathbf Y)$ is increasing in p and:
\[
Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > (p_i - p_j)\cdot c\cdot\limsup_{n\to\infty}\frac{\ln\ln n}{n} = 0.
\]
As concerns strong consistency, we just have to verify that, if i P◮ j but i I⊲ j:
\[
\liminf_{n\to\infty}\frac{n\cdot\left[c(n,p_j,\mathbf Y) - c(n,p_i,\mathbf Y)\right]}{\ln\ln n} = (p_j - p_i)\cdot c > \lambda_{ji}.
\]
In order for this to hold for $p_j - p_i = 1$, we need $c > \lambda_{ji}$. Under correct specification, usually c > 1 is enough (Hannan and Quinn require c > 2 since they use the likelihood ratio). Moreover, HQIC is on the border of consistency. It is interesting to make a comparison with Proposition 5.2 in [25]: the authors need a different condition involving all the eigenvalues of the matrix $Q^{-1}B$, while we need just the greatest one. This seems to be due to the fact that, in the simple case of independent and identically distributed observations, it is possible to use a LIL for quadratic forms, while in the more general case of [25] this result is unavailable. Indeed, what is really needed is a LIL for the second order component of $[Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)] - [Q_{n,i}(\hat\theta_i) - Q_{n,i}(\theta_i^*)]$, and its availability would allow for applying the result also in heterogeneous and dependent cases.
(iv) A last example concerns AIC, for which $c(n,p_j,\mathbf Y) = \frac{p_j}{n}$. Strong conservativeness holds since, if i P⊲ j, $c(n,p,\mathbf Y)$ is increasing in p and:
\[
Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > \limsup_{n\to\infty}\frac{p_i - p_j}{n} = 0.
\]
Strong consistency fails to hold, since if i P◮ j but i I⊲ j:
\[
\liminf_{n\to\infty}\frac{n\cdot\left[c(n,p_j,\mathbf Y) - c(n,p_i,\mathbf Y)\right]}{\ln\ln n} = \liminf_{n\to\infty}\frac{p_j - p_i}{\ln\ln n} = 0 \not> \lambda_{ji}.
\]
Therefore, AIC is conservative but not consistent, because it fails to respect the LIL.
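A quick numerical reading of Remark 4.4 is possible; our sketch below (the HQIC constant c = 1.01 is an assumed value) tabulates the quantity $\frac{n}{\ln\ln n}[c(n,p_j,\mathbf Y) - c(n,p_i,\mathbf Y)]$ that Theorem 4.3 compares with $\lambda_{ji}$, for $p_j - p_i = 1$:

    import numpy as np

    penalties = {
        "AIC":  lambda n, p: p / n,
        "BIC":  lambda n, p: p * np.log(n) / (2 * n),
        "HQIC": lambda n, p: p * 1.01 * np.log(np.log(n)) / n,
    }
    for n in [10**3, 10**6, 10**9]:
        row = {name: n / np.log(np.log(n)) * (c(n, 2) - c(n, 1))
               for name, c in penalties.items()}
        print(n, {k: round(v, 3) for k, v in row.items()})
    # AIC's column tends to 0 (no consistency), BIC's diverges,
    # HQIC's stays at the constant c: the border of consistency.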
5. Weak Forms of the Previous Concepts

The previous concepts are strong, or almost sure, ones, since they require convergence over almost all trajectories. However, the corresponding weak, or in probability, concepts are also of some interest. Therefore, we show how the previous framework has to be modified to deal with weak conservativeness and consistency.

Definition 5.1. We say that $\hat J$ is:
• weakly conservative if, for any choice set A, $\lim_{n\to\infty} P\{\hat J_A \in J_A^{**}\} = 1$;
• weakly consistent if, for any choice set A, $\lim_{n\to\infty} P\{\hat J_A \in J_A^{*}\} = 1$.

It is clear that a strongly consistent $\hat J$ is also weakly consistent and strongly conservative, a strongly conservative $\hat J$ is also weakly conservative, and a weakly consistent $\hat J$ is also weakly conservative.

Proposition 5.2. Consider a group of competing models {1, . . . , J} and take a couple {i, j} out of the set {1, . . . , J}. Consider the following Assumptions:
A': For any couple {i, j} such that i P⊲ j, $\lim_{n\to\infty} P\{\hat J_{\{i,j\}} = i\} = 1$.
B': For any couple {i, j} such that i P◮ j, $\lim_{n\to\infty} P\{\hat J_{\{i,j\}} = i\} = 1$.
C': For any two choice sets A ⊂ B and any i ∈ A, we have
\[
\left\{\omega \,\middle|\, \hat J_B(\omega) = i\right\} \subset \left\{\omega \,\middle|\, \hat J_A(\omega) = i\right\}.
\]
D': $\hat J$ is weakly conservative.
E': $\hat J$ is weakly consistent.
Then: $D' \implies A'$, $E' \implies B'$, $A'\ \&\ C' \implies D'$, $B'\ \&\ C' \implies E'$.

In order to obtain the results, we maintain Assumptions A1, A2, A3 and A5 of Section 4.1 and replace A4 with its weak counterpart:
A4': The estimator $\hat\theta_j$ converges in probability to $\theta_j^*$.

Theorem 5.3. Let Assumptions A1-A3, A4' and A5 hold. The model selection procedure defined by the maximization of the penalized objective function (3.1),
\[
\hat J_J(\omega) = \arg\max_{j\in\{1,\dots,J\}} \overline Q_{n,j}(\hat\theta_j(\omega)),
\]
respects Assumption C' of Proposition 5.2. Moreover, the following necessary and sufficient conditions hold:
• weak conservativeness: for any couple {i, j} such that i P⊲ j, $c(n,p,\mathbf Y)$ is increasing in p and:
\[
\limsup_{n\to\infty}\sqrt n\left[c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y) - Q_{\infty,i}(\theta_i^*) + Q_{\infty,j}(\theta_j^*)\right] = -\infty;
\]
• weak consistency: for any couple {i, j} such that i P◮ j, $c(n,p,\mathbf Y)$ is increasing in p; moreover, if i P⊲ j:
\[
\limsup_{n\to\infty}\sqrt n\left[c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y) - Q_{\infty,i}(\theta_i^*) + Q_{\infty,j}(\theta_j^*)\right] = -\infty,
\]
and if i P◮ j but i I⊲ j:
\[
\liminf_{n\to\infty}\sqrt n\cdot\left[c(n,p_j,\mathbf Y) - c(n,p_i,\mathbf Y)\right] = \infty.
\]

6. Optimality of Model Selection Procedures

In this Section we study the optimality properties of model selection procedures.

6.1. Risk Functions.

An interesting problem concerns the choice of a measure of efficiency in model selection: in a seminal paper about estimation in discrete parameter models, Hammersley ([16]) derives a generalization of the Cramér-Rao inequality for the variance that is also valid when the parameter space is countable. The same inequality has been derived, in slightly more generality, by [4] and by [1]. Therefore, we will consider the following cost and risk functions:
\[
C_1(\tilde J, J_0) = (\tilde J - J_0)^2, \qquad R_1(\tilde J, J_0) \triangleq E_0\, C_1(\tilde J, J_0) = \operatorname{MSE}(\tilde J).
\]
However, this choice is well suited only in cases in which the variance or the MSE are good measures of risk. This is indeed the case if the limiting distribution of the normalized estimator is normal. Following the discussion by Lindley in [16], we also consider a different cost function $C_2(\tilde J, J_0)$:
\[
C_2(\tilde J, J_0) = 1_{\{\tilde J \ne J_0\}};
\]
the risk function is therefore given by the probability of misclassification:
\[
R_2(\tilde J, J_0) \triangleq E_0\, C_2(\tilde J, J_0) = P_0(\tilde J \ne J_0).
\]
The previous cost function has the drawback of weighting in the same way points of the parameter space that lie at different distances from the true value $J_0$. In many cases, a more general loss function can be considered, as suggested in [14] (Vol. 1, p. 51) for multiple tests:
\[
C_3(\tilde J, J_0) = \begin{cases} 0 & \text{if } \tilde J = J_0, \\ a_j(J_0) & \text{if } \tilde J = j \ne J_0, \end{cases}
\]
where $a_j(J_0) > 0$ for j = 1, . . . , J. The risk function is therefore given by the weighted probability of misclassification:
\[
R_3(\tilde J, J_0) \triangleq E_0\, C_3(\tilde J, J_0) = E_0\!\left[\sum_{j=1}^J a_j(J_0)\cdot 1_{\{\tilde J = j\}}\right] = \sum_{j=1}^J a_j(J_0)\cdot P_0\{\tilde J = j\}.
\]
The $a_j(J_0)$'s can be tuned in order to give more or less weight to different points of the parameter space.
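The misclassification risk $R_2$ can be estimated by simulation; the sketch below (ours, using closed-form Gaussian MLEs and the BIC penalty of Remark 4.4) does so for the two models of Example 2.1, where the “true” model in the sense of Section 2 is J* = {1}:

    import numpy as np

    rng = np.random.default_rng(1)

    def bic_choice(y):
        n = len(y)
        s2_1 = np.mean((y + 1) ** 2)        # model 1: MLE of sigma^2
        mu_2 = max(1.0, y.mean())           # model 2: MLE of mu on [1, inf)
        s2_2 = np.mean((y - mu_2) ** 2)     # model 2: MLE of sigma^2
        def ll(s2):                         # mean Gaussian log-likelihood at the MLE
            return -0.5 * np.log(2 * np.pi * s2) - 0.5
        Qbar1 = ll(s2_1) - 1 * np.log(n) / (2 * n)
        Qbar2 = ll(s2_2) - 2 * np.log(n) / (2 * n)
        return 1 if Qbar1 >= Qbar2 else 2

    n, reps = 500, 2000
    errors = sum(bic_choice(rng.standard_normal(n)) != 1 for _ in range(reps))
    print("estimated R2:", errors / reps)   # small, and decreasing in n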
Finally, we define the Bayes risk (under the zero-one loss function) associated with a prior distribution π on the parameter space Θ. In particular, we consider the Bayes risk under the risk function $R_2(\tilde J, J_0)$:
\[
r_2(\tilde J, \pi) \triangleq \sum_{j=0}^J \pi(j)\cdot R_2(\tilde J, j) = \sum_{j=0}^J \pi(j)\cdot P_{\theta_j}(\tilde J \ne j).
\]
If $\pi(j) = (J+1)^{-1}$, we define $P_e \triangleq r_2(\tilde J, \pi)$ as the average probability of error. Remark that this is indeed the measure of error used by Vajda ([26, 27]).

6.2. Information Inequalities.

This Section contains lower bounds for the previously introduced risk functions, and in particular for the risk function $R_2(\tilde J, J_0)$ corresponding to the zero-one loss. In our specific case, these generalize and unify the lower bounds proposed by [16, 4, 18, 15]. Lower bounds for more general cost functions can be obtained using, for example, the Markov inequality. Indeed, if $w_n$ is a strictly positive Borel function increasing on $\mathbb R_+^*$, then:
\[
P_0\left\{\bigl\|\tilde J - J_0\bigr\| \ge k_n\right\} \le \frac{E_0\left[w_n\bigl(\|\tilde J - J_0\|\bigr)\right]}{w_n(k_n)},
\qquad
P_0\left\{\bigl\|\tilde J - J_0\bigr\| \ge k_n\right\} \le \frac{E_0\left[w_n\!\left(\frac{\|\tilde J - J_0\|}{k_n}\right)\right]}{w_n(1)}.
\]
First of all, a lower bound is proved; then we obtain a minimax version of the same result. We will sometimes refer to the first one as the Chapman-Robbins lower bound (and to the related efficiency concept as Chapman-Robbins efficiency), since it recalls the lower bound proposed by these two authors in their 1951 paper. Then, from these results, we derive lower bounds for the MSE, for the weighted probability of misclassification and for the Bayes risk.

6.2.1. A Lower Bound for the Probability of Misclassification.

The first Theorem is a classical efficiency result, in the spirit of [11, 5, 3, 12]. It corresponds essentially to Stein's Lemma in hypothesis testing; the reduction of an estimation problem to a test between two hypotheses is a standard technique in the derivation of efficiency lower bounds and is attributed to [10] (see also [19] for a related technique). Here, Stein's Lemma is applied as in Theorem 9.2 of [17] (p. 96), taking into account the fact that the parameter space is made up of more than two points. Remark, however, that our version does not assume almost sure convergence of the Kullback-Leibler information measure to a fixed constant.

Theorem 6.1. Let $P_0$ be the true probability measure that has generated the data. Suppose that
(6.1)
\[
\limsup_{n\to\infty} P_j\{\hat J \ne j\} < 1
\]
and that
(6.2)
\[
\lim_{n\to\infty}\frac 1n\ln\frac{dP_j}{dP_0} = H(P_j|P_0), \qquad P_0\text{-a.s.},
\]
where the limit $H(P_j|P_0)$ (which can be a random variable) corresponds in most cases to the so-called Kullback-Leibler divergence. Then, for any $j \notin J^*$:
\[
\liminf_{n\to\infty}\frac 1n\ln P_0\{\hat J \notin J^*\} \ge -\inf_{j\notin J^*} H(P_j|P_0).
\]

6.2.2. Lower Bounds for the Other Risk Functions.

The results of the previous Sections can easily be converted into corresponding results for the risk function $R_1(\tilde J, J_0)$. The MSE of a generic estimator $\tilde J$ can be shown to satisfy
\[
\operatorname{MSE}(\tilde J) \asymp P_0(\tilde J \ne J_0),
\]
so that
\[
\liminf_{n\to\infty}\frac 1n\ln\operatorname{MSE}(\tilde J) = \liminf_{n\to\infty}\frac 1n\ln P_0(\tilde J \ne J_0).
\]
The same reasoning can also be applied to the risk function $R_3$:
\[
\liminf_{n\to\infty}\frac 1n\ln R_3(\tilde J, J_0) = \liminf_{n\to\infty}\frac 1n\ln P_0(\tilde J \ne J_0).
\]
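As a numerical illustration of the exponent appearing in Theorem 6.1 (our sketch, for assumed Gaussian alternatives against the true $P_0 = N(0,1)$, where $H(P_j|P_0)$ reduces to the closed-form Kullback-Leibler divergence between normals):

    import numpy as np

    def kl_normal(mu, s2):   # KL( N(mu, s2) || N(0, 1) )
        return 0.5 * (s2 + mu**2 - 1 - np.log(s2))

    # Hypothetical models outside J*, indexed by (mu, sigma^2):
    alternatives = [(1.0, 1.0), (0.5, 2.0)]
    H = [kl_normal(mu, s2) for mu, s2 in alternatives]
    print(min(H))   # P0{J_hat not in J*} decays no faster than exp(-n * min(H))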
7. Proofs

Proof of Proposition 3.2. We embed the proof in the framework provided by [13] (Section 3.3).
• strong conservativeness: D is sufficient for A: this can be shown simply by considering a generic choice set composed of a couple of indexes, say {i, j}, such that i P⊲ j. Now, we show that A and C are sufficient for D. For any choice set {1, . . . , J} and any i ∈ {1, . . . , J}, define
\[
\Omega_{i|\{1,\dots,J\}} = \left\{\omega \,\middle|\, \lim_{n\to\infty}\hat J_J(\omega) = i\right\}.
\]
Under C, from [13] (Lemma 2 on p. 66), we have
(7.1)
\[
\Omega_{i|\{1,\dots,J\}} = \bigcap_{j\in\{1,\dots,J\},\, j\ne i} \Omega_{i|\{i,j\}}.
\]
Consider fictitiously $J_J^{**}$ as a new element of the choice set (that is, the choice set is now made of $J + 1 - |J_J^{**}|$ elements). We have:
\[
P\left\{\lim_{n\to\infty}\hat J_J \in J_J^{**}\right\} = P\left(\Omega_{J_J^{**}|\{1,\dots,J\}}\right) = P\left(\bigcap_{k\in\{1,\dots,J\},\,k\notin J_J^{**}} \Omega_{J_J^{**}|\{k,J_J^{**}\}}\right) \ge 1 - \sum_{k\in\{1,\dots,J\},\,k\notin J_J^{**}}\left[1 - P\left(\Omega_{J_J^{**}|\{k,J_J^{**}\}}\right)\right],
\]
where we have used (7.1) and the Bonferroni inequality. But:
\[
P\left(\Omega_{J_J^{**}|\{k,J_J^{**}\}}\right) = 1 - P\left(\Omega_{k|\{k,J_J^{**}\}}\right) = 1 - P\left(\bigcap_{j\in J_J^{**}}\Omega_{k|\{k,j\}}\right) \ge 1 - \max_{j\in J_J^{**}} P\left(\Omega_{k|\{k,j\}}\right) = \min_{j\in J_J^{**}} P\left(\Omega_{j|\{k,j\}}\right).
\]
Summing up, we have:
\[
P\left\{\lim_{n\to\infty}\hat J_J \in J_J^{**}\right\} \ge 1 - \sum_{k\in\{1,\dots,J\},\,k\notin J_J^{**}}\left[1 - P\left(\Omega_{J_J^{**}|\{k,J_J^{**}\}}\right)\right] \ge \left(1 - J + |J_J^{**}|\right) + \left(J - |J_J^{**}|\right)\cdot\min_{k\in\{1,\dots,J\},\,k\notin J_J^{**}}\ \min_{j\in J_J^{**}} P\left(\Omega_{j|\{k,j\}}\right).
\]
Therefore, A implies D.
• strong consistency: substitute E for D, B for A and $J_J^{*}$ for $J_J^{**}$. ∎

Proof of Proposition 3.5. According to hypothesis 3, both $Q_{n,j}(\theta_j)$ and $p_j$ are essential.⁶ This implies that the discussion on pages 21-22 following the statement of Theorem 3 of [8] applies (the Theorem itself does not apply directly, since the essential factors are not more than two). ∎

⁶Consider a preorder $\preceq$ defined on the space $X = X_1 \times X_2$ with generic element $(x_1, x_2)$; the component 1 is said to be essential if, for every $x_2 \in X_2$, not all the elements of $X_1$ are indifferent with respect to $\preceq$.

Proof of Theorem 4.1. We state here some results for future reference:
(7.2)
\[
\ddot Q_{n,j}(\theta_j^*)^{-1} = \ddot Q_{\infty,j}(\theta_j^*)^{-1} - \left[\ddot Q_{\infty,j}(\theta_j^*)^{-1} - \ddot Q_{n,j}(\theta_j^*)^{-1}\right],
\]
(7.3)
\[
\ddot Q_{n,j}(\theta_j^*) = \ddot Q_{\infty,j}(\theta_j^*) + o(1),
\]
(7.4)
\[
\ddot Q_{n,j}(\theta_j^*)^{-1} = \ddot Q_{\infty,j}(\theta_j^*)^{-1} + o(1),
\]
where the last two results hold by A2 and A3. Moreover,
(7.5)
\[
\ddot Q_{\infty,j}(\theta_j) = \frac{\partial^2 Q_{\infty,j}(\theta_j)}{\partial\theta_j\,\partial\theta_j^T} = \frac{\partial^2 E_0[q_j(Y;\theta_j)]}{\partial\theta_j\,\partial\theta_j^T} = E_0\!\left[\frac{\partial^2 q_j(Y;\theta_j)}{\partial\theta_j\,\partial\theta_j^T}\right] = A_j(\theta_j),
\]
where the third equality derives from A2.

(1) Consider the condition defining the estimator $\hat\theta_j$ (the so-called first-order condition, since it involves the first-order derivative), $0 = \dot Q_{n,j}(\hat\theta_j)$. We expand it around the value $\theta_j^*$:
(7.6)
\[
0 = \dot Q_{n,j}(\theta_j^*) + \ddot Q_{n,j}(\theta_j^*)\cdot(\hat\theta_j - \theta_j^*) + r_n,
\]
where $r_n$ is defined by:
(7.7)
\[
r_{n,k} = (\hat\theta_j - \theta_j^*)^T\left[\frac{\partial}{\partial\theta_{j,k}}\,\frac{\partial^2 Q_{n,j}(\tilde\theta_j)}{\partial\theta_j\,\partial\theta_j^T}\right](\hat\theta_j - \theta_j^*), \qquad k = 1,\dots,p_j,
\]
and $\tilde\theta_j = \delta\cdot\hat\theta_j + (1-\delta)\cdot\theta_j^*$ for a certain δ ∈ (0, 1). Therefore:
\[
\hat\theta_j - \theta_j^* = -\ddot Q_{n,j}(\theta_j^*)^{-1}\left[\dot Q_{n,j}(\theta_j^*) + r_n\right]
= -\ddot Q_{\infty,j}(\theta_j^*)^{-1}\left[\dot Q_{n,j}(\theta_j^*) + r_n\right] + \left[\ddot Q_{\infty,j}(\theta_j^*)^{-1} - \ddot Q_{n,j}(\theta_j^*)^{-1}\right]\left[\dot Q_{n,j}(\theta_j^*) + r_n\right],
\]
where we have used (7.2). Now, we study the behavior of $\dot Q_{n,j}(\theta_j^*) = \frac 1n\sum_{i=1}^n \frac{\partial q_j(Y_i;\theta_j^*)}{\partial\theta_j}$. This is the average of n iid random variables with moments:
\[
E_0\!\left[\frac{\partial q_j(Y_i;\theta_j^*)}{\partial\theta_j}\right] = \frac{\partial E_0[q_j(Y_i;\theta_j^*)]}{\partial\theta_j} = \dot Q_{\infty,j}(\theta_j^*) = 0,
\qquad
E_0\!\left[\frac{\partial q_j(Y_i;\theta_j^*)}{\partial\theta_j}\,\frac{\partial q_j(Y_i;\theta_j^*)}{\partial\theta_j^T}\right] = B_j(\theta_j^*),
\]
where we have used A2 to interchange derivative and expectation. Thus, by the classical LIL, the following result holds:
(7.8)
\[
\limsup_{n\to\infty}\frac{\dot Q_{n,j}(\theta_j^*)}{\sqrt{\tfrac{2\ln\ln n}{n}}} = \left\{\operatorname{diag}\!\left[B_j(\theta_j^*)\right]\right\}^{1/2},
\qquad
\liminf_{n\to\infty}\frac{\dot Q_{n,j}(\theta_j^*)}{\sqrt{\tfrac{2\ln\ln n}{n}}} = -\left\{\operatorname{diag}\!\left[B_j(\theta_j^*)\right]\right\}^{1/2}.
\]
Now, $\ddot Q_{\infty,j}(\theta_j^*)^{-1} - \ddot Q_{n,j}(\theta_j^*)^{-1} = o(1)$ (by (7.4)), $\dot Q_{n,j}(\theta_j^*) = O\left(\sqrt{\tfrac{\ln\ln n}{n}}\right)$ (by (7.8)) and $\left[\frac{\partial}{\partial\theta_{j,k}}\,\frac{\partial^2 Q_{n,j}(\tilde\theta_j)}{\partial\theta_j\,\partial\theta_j^T}\right] = O(1)$ (by A5) imply:
(7.9)
\[
r_n = O\left(\bigl\|\hat\theta_j - \theta_j^*\bigr\|^2\right),
\]
(7.10)
\[
\hat\theta_j - \theta_j^* = O\left(\sqrt{\tfrac{\ln\ln n}{n}}\right) + O\left(\bigl\|\hat\theta_j - \theta_j^*\bigr\|^2\right).
\]
Therefore, since $\hat\theta_j - \theta_j^* = o(1)$ (by A4), the asymptotic almost sure behavior of $\hat\theta_j - \theta_j^*$ is determined by $-\ddot Q_{\infty,j}(\theta_j^*)^{-1}\cdot\dot Q_{n,j}(\theta_j^*)$:
\[
\limsup_{n\to\infty}\frac{\hat\theta_j - \theta_j^*}{\sqrt{\tfrac{2\ln\ln n}{n}}} = +\left\{\operatorname{diag}\!\left[\ddot Q_{\infty,j}(\theta_j^*)^{-1} B_j(\theta_j^*)\,\ddot Q_{\infty,j}(\theta_j^*)^{-1}\right]\right\}^{1/2},
\qquad
\liminf_{n\to\infty}\frac{\hat\theta_j - \theta_j^*}{\sqrt{\tfrac{2\ln\ln n}{n}}} = -\left\{\operatorname{diag}\!\left[\ddot Q_{\infty,j}(\theta_j^*)^{-1} B_j(\theta_j^*)\,\ddot Q_{\infty,j}(\theta_j^*)^{-1}\right]\right\}^{1/2}.
\]
Using (7.5), the result follows.

(2) We have:
\[
Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*) = (\hat\theta_j - \theta_j^*)^T\,\dot Q_{n,j}(\theta_j^*) + \frac 12\,(\hat\theta_j - \theta_j^*)^T\,\ddot Q_{n,j}(\theta_j^*)\,(\hat\theta_j - \theta_j^*) + \frac 16\sum_{h,k,\ell=1}^{p_j}\frac{\partial^3 Q_{n,j}(\bar\theta_j)}{\partial\theta_{j,h}\,\partial\theta_{j,k}\,\partial\theta_{j,\ell}}\,(\hat\theta_{j,h} - \theta_{j,h}^*)(\hat\theta_{j,k} - \theta_{j,k}^*)(\hat\theta_{j,\ell} - \theta_{j,\ell}^*),
\]
where $\bar\theta_j = \varepsilon\cdot\hat\theta_j + (1-\varepsilon)\cdot\theta_j^*$ for a certain ε ∈ (0, 1). Using (7.6), we get:
\[
Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*) = -\frac 12\,\dot Q_{n,j}(\theta_j^*)^T\,\ddot Q_{n,j}(\theta_j^*)^{-1}\,\dot Q_{n,j}(\theta_j^*) + \frac 12\, r_n^T\,\ddot Q_{n,j}(\theta_j^*)^{-1}\, r_n + \frac 16\sum_{h,k,\ell=1}^{p_j}\frac{\partial^3 Q_{n,j}(\bar\theta_j)}{\partial\theta_{j,h}\,\partial\theta_{j,k}\,\partial\theta_{j,\ell}}\,(\hat\theta_{j,h} - \theta_{j,h}^*)(\hat\theta_{j,k} - \theta_{j,k}^*)(\hat\theta_{j,\ell} - \theta_{j,\ell}^*).
\]
Now, $\ddot Q_{n,j}(\theta_j^*)^{-1} = O(1)$ (by (7.4)) and $r_n = O(\|\hat\theta_j - \theta_j^*\|^2)$ (by (7.9)) imply that the second term is $O(\|\hat\theta_j - \theta_j^*\|^4)$, while $\frac{\partial^3 Q_{n,j}(\bar\theta_j)}{\partial\theta_{j,h}\,\partial\theta_{j,k}\,\partial\theta_{j,\ell}} = O(1)$ (by A5) implies that the third term is $O(\|\hat\theta_j - \theta_j^*\|^3)$. Therefore:
\[
Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)
= -\frac 12\,\dot Q_{n,j}(\theta_j^*)^T\,\ddot Q_{n,j}(\theta_j^*)^{-1}\,\dot Q_{n,j}(\theta_j^*) + O\left(\|\hat\theta_j - \theta_j^*\|^3\right)
= -\frac 12\,\dot Q_{n,j}(\theta_j^*)^T\,\ddot Q_{\infty,j}(\theta_j^*)^{-1}\,\dot Q_{n,j}(\theta_j^*) - \frac 12\,\dot Q_{n,j}(\theta_j^*)^T\left[\ddot Q_{n,j}(\theta_j^*)^{-1} - \ddot Q_{\infty,j}(\theta_j^*)^{-1}\right]\dot Q_{n,j}(\theta_j^*) + O\left(\|\hat\theta_j - \theta_j^*\|^3\right)
= \frac 12\,\dot Q_{n,j}(\theta_j^*)^T\, W(\theta_j^*)\,\dot Q_{n,j}(\theta_j^*) + o\left(\|\hat\theta_j - \theta_j^*\|^2\right) + O\left(\|\hat\theta_j - \theta_j^*\|^3\right),
\]
where $W(\theta_j^*) = -\ddot Q_{\infty,j}(\theta_j^*)^{-1} = -A_j(\theta_j^*)^{-1}$ and we have used (7.2), (7.4) and (7.5). Using (7.10), it can be seen that $o(\|\hat\theta_j - \theta_j^*\|^2) + O(\|\hat\theta_j - \theta_j^*\|^3) = o\left(\tfrac{\ln\ln n}{n}\right)$, while the leading term is $O\left(\tfrac{\ln\ln n}{n}\right)$. Therefore, it is this last term that dictates the behavior of $Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)$. Now, consider the difference:
\[
\Delta_{ji} = 2\cdot\left[Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)\right] - 2\cdot\left[Q_{n,i}(\hat\theta_i) - Q_{n,i}(\theta_i^*)\right]
= \frac{1}{n^2}\sum_{h=1}^n\sum_{k=1}^n
\begin{bmatrix}\dot q_j(Y_h;\theta_j^*)\\ \dot q_i(Y_h;\theta_i^*)\end{bmatrix}^T
\begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}
\begin{bmatrix}\dot q_j(Y_k;\theta_j^*)\\ \dot q_i(Y_k;\theta_i^*)\end{bmatrix}
+ o\left(\frac{\ln\ln n}{n}\right).
\]
Let:
\[
h(y_h, y_k) \triangleq
\begin{bmatrix}\dot q_j(y_h;\theta_j^*)\\ \dot q_i(y_h;\theta_i^*)\end{bmatrix}^T
\begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}
\begin{bmatrix}\dot q_j(y_k;\theta_j^*)\\ \dot q_i(y_k;\theta_i^*)\end{bmatrix}.
\]
Using the notation of [24] (Section 5.1.5), if $E[h(Y_h,Y_k)]^2 < \infty$, we have:
\[
\theta = E\,h(Y_h, Y_k) = 0, \qquad h_1(y_h) = E\,h(y_h, Y_k) = 0, \qquad \tilde h_1(y_h) = h_1(y_h) - \theta = 0, \qquad \zeta_1 = E\bigl[\tilde h_1(Y_h)\bigr]^2 = 0,
\]
so that the leading term of $\Delta_{ji}$ is a degenerate V-statistic. We can use Theorem 2 of [9] to derive a Law of the Iterated Logarithm. Decompose the quadratic form as follows:
\[
\sum_{h,k=1}^n h(Y_h, Y_k) = 2\sum_{k=1}^n\sum_{h=1}^{k-1} h(Y_h, Y_k) + \sum_{k=1}^n h(Y_k, Y_k).
\]
Using Fischer’s variational property of eigenvalues:
\[
E|h(Y_h, Y_h)| \le E\left\|\begin{matrix}\dot q_j(Y_h;\theta_j^*)\\ \dot q_i(Y_h;\theta_i^*)\end{matrix}\right\|^2\cdot\left|\lambda_{\max}\begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}\right|
= \left[\sum_{k=1}^{p_j} E\left|\frac{\partial q_j(Y_h;\theta_j^*)}{\partial\theta_{j,k}}\right|^2 + \sum_{k=1}^{p_i} E\left|\frac{\partial q_i(Y_h;\theta_i^*)}{\partial\theta_{i,k}}\right|^2\right]\cdot\left|\lambda_{\max}\begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}\right| < \infty
\]
by A3 and A5, so that Kolmogorov’s law of large numbers implies:
\[
\limsup_{n\to\infty}\frac{\sum_{k=1}^n h(Y_k, Y_k)}{n\ln\ln n} = 0, \qquad P\text{-a.s.}
\]
Since $E\,h(Y_h, Y_k) = 0$, we use Fischer’s property of eigenvalues and the Cauchy-Schwarz inequality to derive:
\[
E\,h^2(Y_h, Y_k) \le E\left[\left\|\begin{matrix}\dot q_j(Y_h;\theta_j^*)\\ \dot q_i(Y_h;\theta_i^*)\end{matrix}\right\|^2\left\|\begin{matrix}\dot q_j(Y_k;\theta_j^*)\\ \dot q_i(Y_k;\theta_i^*)\end{matrix}\right\|^2\right]\cdot\left\{\lambda_{\max}\begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}\right\}^2
\le 2\cdot\left[\sum_{k=1}^{p_j} E\left|\frac{\partial q_j(Y;\theta_j^*)}{\partial\theta_{j,k}}\right|^2 + \sum_{k=1}^{p_i} E\left|\frac{\partial q_i(Y;\theta_i^*)}{\partial\theta_{i,k}}\right|^2\right]^2\cdot\left\{\lambda_{\max}\begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}\right\}^2.
\]
This is finite by A3 and A5; then, from Theorem 2 of [9]:
\[
\limsup_{n\to\infty}\frac{2\sum_{k=1}^n\sum_{h=1}^{k-1} h(Y_h, Y_k)}{n\ln\ln n} = 2\,C(h).
\]
It is simple to see that $C(h) = \max_k\{\lambda_k\}$, where the $\lambda_k$ are the eigenvalues of the integral operator A:
\[
Ag(x) = \int h(x, y)\, g(y)\, P(dy), \qquad g \in L^2.
\]
Indeed, this operator has a finite spectrum (it is a quadratic form), so that it is possible to obtain a simpler expression for C(h). Let:
\[
B_{ij}(\theta_i^*, \theta_j^*) = E_0\left[\dot q_i(Y_\ell;\theta_i^*)\cdot\dot q_j(Y_\ell;\theta_j^*)^T\right],
\qquad
B(\theta_j^*, \theta_i^*) = V_0\begin{bmatrix}\dot q_j(Y_\ell;\theta_j^*)\\ \dot q_i(Y_\ell;\theta_i^*)\end{bmatrix} = \begin{bmatrix} B_j(\theta_j^*) & B_{ji}(\theta_j^*, \theta_i^*)\\ B_{ij}(\theta_i^*, \theta_j^*) & B_i(\theta_i^*)\end{bmatrix}.
\]
Then we can write:
\[
h(y_h, y_k) = \left\{B(\theta_j^*,\theta_i^*)^{-1/2}\begin{bmatrix}\dot q_j(y_h;\theta_j^*)\\ \dot q_i(y_h;\theta_i^*)\end{bmatrix}\right\}^T
B(\theta_j^*,\theta_i^*)^{1/2}\begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}B(\theta_j^*,\theta_i^*)^{1/2}
\left\{B(\theta_j^*,\theta_i^*)^{-1/2}\begin{bmatrix}\dot q_j(y_k;\theta_j^*)\\ \dot q_i(y_k;\theta_i^*)\end{bmatrix}\right\}.
\]
Now, $B(\theta_j^*,\theta_i^*)^{-1/2}\begin{bmatrix}\dot q_j(y_\ell;\theta_j^*)\\ \dot q_i(y_\ell;\theta_i^*)\end{bmatrix}$ is a standardized vector (with identity covariance matrix), and so the nonnull eigenvalues of the spectrum of A are equal to the spectrum of
\[
B(\theta_j^*,\theta_i^*)^{1/2}\begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}B(\theta_j^*,\theta_i^*)^{1/2}
\]
and to the one of
\[
\begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}B(\theta_j^*,\theta_i^*)
= \begin{bmatrix}-A_j(\theta_j^*)^{-1} & 0\\ 0 & A_i(\theta_i^*)^{-1}\end{bmatrix}\begin{bmatrix} B_j(\theta_j^*) & B_{ji}(\theta_j^*,\theta_i^*)\\ B_{ij}(\theta_i^*,\theta_j^*) & B_i(\theta_i^*)\end{bmatrix}
= \begin{bmatrix}-A_j(\theta_j^*)^{-1}B_j(\theta_j^*) & -A_j(\theta_j^*)^{-1}B_{ji}(\theta_j^*,\theta_i^*)\\ A_i(\theta_i^*)^{-1}B_{ij}(\theta_i^*,\theta_j^*) & A_i(\theta_i^*)^{-1}B_i(\theta_i^*)\end{bmatrix}.
\]
(3) We start from the decomposition:
(7.11)
\[
Q_{n,j}(\hat\theta_j) = Q_{\infty,j}(\theta_j^*) + \left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right]
= Q_{\infty,j}(\theta_j^*) + \left[Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)\right] + \left[Q_{n,j}(\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right].
\]
As concerns $Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)$, its behavior can be analyzed as before and it is $O\left(\tfrac{\ln\ln n}{n}\right)$. Then, we consider the last addendum in (7.11):
\[
Q_{n,j}(\theta_j^*) - Q_{\infty,j}(\theta_j^*) = \frac 1n\sum_{i=1}^n\left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right],
\]
which is an average of n iid terms with zero mean (by definition of $Q_{\infty,j}$) and finite variance (by A2). The LIL holds and we have:
\[
\limsup_{n\to\infty}\frac{Q_{n,j}(\theta_j^*) - Q_{\infty,j}(\theta_j^*)}{\sqrt{\tfrac{2\ln\ln n}{n}}} = +\left\{E_0\left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right]^2\right\}^{1/2},
\qquad
\liminf_{n\to\infty}\frac{Q_{n,j}(\theta_j^*) - Q_{\infty,j}(\theta_j^*)}{\sqrt{\tfrac{2\ln\ln n}{n}}} = -\left\{E_0\left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right]^2\right\}^{1/2}. \qquad ∎
\]

Proof of Theorem 4.3. The fact that $\hat J_J(\omega)$ satisfies Assumption C of Proposition 3.2 is evident from Proposition 3.4.
(1) In this case, we need that, for any couple {i, j} such that i P⊲ j, $P\{\lim_{n\to\infty}\hat J_{\{i,j\}} = i\} = 1$. Therefore, consider i and j such that $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$: then, for almost every ω ∈ Ω there exists an $n_\omega$ such that $Q_{n,i}(\hat\theta_i(\omega)) > Q_{n,j}(\hat\theta_j(\omega))$ for $n \ge n_\omega$, and we want $\overline Q_{n,i} > \overline Q_{n,j}$ to hold true as well. Therefore, we need:
\[
0 < \overline Q_{n,i}(\hat\theta_i) - \overline Q_{n,j}(\hat\theta_j) = Q_{n,i}(\hat\theta_i) - Q_{n,j}(\hat\theta_j) - c(n,p_i,\mathbf Y) + c(n,p_j,\mathbf Y)
= \left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right] - \left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] + Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) - c(n,p_i,\mathbf Y) + c(n,p_j,\mathbf Y).
\]
Taking the lim inf, we get:
\[
0 < \liminf_{n\to\infty}\left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right] - \limsup_{n\to\infty}\left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] + \liminf_{n\to\infty}\left[Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*)\right] - \limsup_{n\to\infty}\left[c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y)\right]
= Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) - \limsup_{n\to\infty}\left[c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y)\right],
\]
and at last:
\[
Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > \limsup_{n\to\infty}\left[c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y)\right].
\]
(2) For any couple {i, j} such that i P◮ j, we should have $P\{\lim_{n\to\infty}\hat J_{\{i,j\}} = i\} = 1$. If $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$ (i.e. i P⊲ j), then the condition is the same as for strong conservativeness. On the other hand, if $Q_{\infty,i}(\theta_i^*) = Q_{\infty,j}(\theta_j^*)$ and $p_i < p_j$, that is, i is a more parsimonious representation than j (i.e. i P◮ j but i I⊲ j), we want that, for almost every ω ∈ Ω, there exists an $n_\omega$ such that $\overline Q_{n,j}(\hat\theta_j(\omega)) - \overline Q_{n,i}(\hat\theta_i(\omega)) < 0$ for $n \ge n_\omega$. In order to be able to discriminate between the two models, we need to divide the terms in (7.12) by the order of the largest term, that is $\tfrac{\ln\ln n}{n}$. Then, taking a lim sup, we have:
\[
0 > \limsup_{n\to\infty}\frac{\overline Q_{n,j}(\hat\theta_j) - \overline Q_{n,i}(\hat\theta_i)}{\tfrac{\ln\ln n}{n}}
= \limsup_{n\to\infty}\frac{\left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] - \left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right]}{\tfrac{\ln\ln n}{n}} - \liminf_{n\to\infty}\frac{c(n,p_j,\mathbf Y) - c(n,p_i,\mathbf Y)}{\tfrac{\ln\ln n}{n}}.
\]
According to Theorem 4.1, this becomes:
\[
0 > \limsup_{n\to\infty}\frac{\overline Q_{n,j}(\hat\theta_j) - \overline Q_{n,i}(\hat\theta_i)}{\tfrac{\ln\ln n}{n}} = \lambda_{ji} - \liminf_{n\to\infty}\frac{c(n,p_j,\mathbf Y) - c(n,p_i,\mathbf Y)}{\tfrac{\ln\ln n}{n}}. \qquad ∎
\]

Proof of Proposition 5.2. The proof follows the same lines as the corresponding proof for the strong properties.
• weak conservativeness: the reasoning is the same, but we use the sets $\Omega^{(n)}_{i|\{1,\dots,J\}} = \{\omega \mid \hat J_J(\omega) = i\}$. The minorization is now:
\[
P\{\hat J_J \in J_J^{**}\} \ge \left(1 - J + |J_J^{**}|\right) + \left(J - |J_J^{**}|\right)\cdot\min_{k\in\{1,\dots,J\},\,k\notin J_J^{**}}\ \min_{j\in J_J^{**}} P\left(\Omega^{(n)}_{j|\{k,j\}}\right).
\]
Taking $\lim_{n\to\infty}$, we get the result.
• weak consistency: substitute E' for D', B' for A' and $J_J^{*}$ for $J_J^{**}$. ∎

Proof of Theorem 5.3. The fact that $\hat J_J(\omega)$ satisfies Assumption C' of Proposition 5.2 is evident from Proposition 3.4.
(1) In this case, we need that, for any couple {i, j} such that i P⊲ j (i.e. $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$), $\lim_{n\to\infty} P\{\hat J_{\{i,j\}} = i\} = 1$. Now, $\{\hat J_{\{i,j\}} = i\} = \{\omega \in \Omega \mid \hat J_{\{i,j\}}(\omega) = i\}$ can arise either if $\overline Q_{n,i}(\hat\theta_i) > \overline Q_{n,j}(\hat\theta_j)$, or if $\overline Q_{n,i}(\hat\theta_i) = \overline Q_{n,j}(\hat\theta_j)$ but the tie is broken in favor of i:
\[
\{\hat J_{\{i,j\}} = i\} = \left\{\overline Q_{n,i}(\hat\theta_i) > \overline Q_{n,j}(\hat\theta_j)\right\} \uplus \left\{\overline Q_{n,i}(\hat\theta_i) = \overline Q_{n,j}(\hat\theta_j),\text{ but } i \text{ is chosen}\right\},
\]
\[
P\{\hat J_{\{i,j\}} = i\} = P\left\{\overline Q_{n,i}(\hat\theta_i) > \overline Q_{n,j}(\hat\theta_j)\right\} + P\left\{\overline Q_{n,i}(\hat\theta_i) = \overline Q_{n,j}(\hat\theta_j),\text{ but } i \text{ is chosen}\right\}.
\]
The second probability is asymptotically null, since i P⊲ j (i.e. $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$). As concerns the first probability, we have:
\[
P\left\{\overline Q_{n,i}(\hat\theta_i) > \overline Q_{n,j}(\hat\theta_j)\right\} = P\left\{Q_{n,i}(\hat\theta_i) - Q_{n,j}(\hat\theta_j) > c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y)\right\}
= P\left\{\left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right] - \left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] > \left[c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y)\right] - \left[Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*)\right]\right\}.
\]
Since $\sqrt n\left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right]$ and $\sqrt n\left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right]$ are $O_P(1)$, we need:
\[
\limsup_{n\to\infty}\sqrt n\left[c(n,p_i,\mathbf Y) - c(n,p_j,\mathbf Y) - Q_{\infty,i}(\theta_i^*) + Q_{\infty,j}(\theta_j^*)\right] = -\infty.
\]
(2) For any couple {i, j} such that i P◮ j, $\lim_{n\to\infty} P\{\hat J_{\{i,j\}} = i\} = 1$. If $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$ (i.e. i P⊲ j), then the condition is the same as for weak conservativeness. On the other hand, if $Q_{\infty,i}(\theta_i^*) = Q_{\infty,j}(\theta_j^*)$ and $p_i < p_j$, that is, i is a more parsimonious representation than j (i.e. i P◮ j but i I⊲ j), we want that
\[
\lim_{n\to\infty} P\left\{\overline Q_{n,i}(\hat\theta_i) > \overline Q_{n,j}(\hat\theta_j)\right\} = 1.
\]
Proof of Theorem 5.3. The fact that $\hat J_J(\omega)$ satisfies Assumption C′ of Proposition 3.2 is evident from Proposition 3.4.

(1) In this case, we need that for any couple $\{i,j\}$ such that $i$ P⊲ $j$ (i.e. $Q_{\infty,i}(\theta_i^*)>Q_{\infty,j}(\theta_j^*)$), $\lim_{n\to\infty}\mathbb{P}\{\hat J_{\{i,j\}}=i\}=1$. Now, $\{\hat J_{\{i,j\}}=i\}=\{\omega\in\Omega:\hat J_{\{i,j\}}(\omega)=i\}$ can arise either if the penalized criterion of $i$ exceeds that of $j$, or if the two are equal but the tie is broken in favor of $i$:
\[
\left\{\hat J_{\{i,j\}}=i\right\}=\left\{Q_{n,i}(\hat\theta_i)-c(n,p_i,\mathbf Y)>Q_{n,j}(\hat\theta_j)-c(n,p_j,\mathbf Y)\right\}\uplus\left\{Q_{n,i}(\hat\theta_i)-c(n,p_i,\mathbf Y)=Q_{n,j}(\hat\theta_j)-c(n,p_j,\mathbf Y)\text{, but }i\text{ is chosen}\right\},
\]
so that $\mathbb{P}\{\hat J_{\{i,j\}}=i\}$ is the sum of the probabilities of these two events. The second probability is asymptotically null, since $i$ P⊲ $j$ (i.e. $Q_{\infty,i}(\theta_i^*)>Q_{\infty,j}(\theta_j^*)$). As concerns the first probability, we have:
\[
\mathbb{P}\left\{Q_{n,i}(\hat\theta_i)-c(n,p_i,\mathbf Y)>Q_{n,j}(\hat\theta_j)-c(n,p_j,\mathbf Y)\right\}
=\mathbb{P}\left\{Q_{n,i}(\hat\theta_i)-Q_{n,j}(\hat\theta_j)>c(n,p_i,\mathbf Y)-c(n,p_j,\mathbf Y)\right\}
\]
\[
=\mathbb{P}\Big\{\left[Q_{n,i}(\hat\theta_i)-Q_{\infty,i}(\theta_i^*)\right]-\left[Q_{n,j}(\hat\theta_j)-Q_{\infty,j}(\theta_j^*)\right]
>\left[c(n,p_i,\mathbf Y)-c(n,p_j,\mathbf Y)\right]-\left[Q_{\infty,i}(\theta_i^*)-Q_{\infty,j}(\theta_j^*)\right]\Big\}.
\]
Since $\sqrt n\left[Q_{n,i}(\hat\theta_i)-Q_{\infty,i}(\theta_i^*)\right]$ and $\sqrt n\left[Q_{n,j}(\hat\theta_j)-Q_{\infty,j}(\theta_j^*)\right]$ are $O_P(1)$, we need:
\[
\limsup_{n\to\infty}\sqrt n\left[c(n,p_i,\mathbf Y)-c(n,p_j,\mathbf Y)-Q_{\infty,i}(\theta_i^*)+Q_{\infty,j}(\theta_j^*)\right]=-\infty.
\]
(2) For any couple $\{i,j\}$ such that $i$ P◮ $j$, $\lim_{n\to\infty}\mathbb{P}\{\hat J_{\{i,j\}}=i\}=1$. If $Q_{\infty,i}(\theta_i^*)>Q_{\infty,j}(\theta_j^*)$ (i.e. $i$ P⊲ $j$), then the condition is the same as for weak conservativeness. On the other hand, if $Q_{\infty,i}(\theta_i^*)=Q_{\infty,j}(\theta_j^*)$ and $p_i<p_j$, that is, $i$ is a more parsimonious representation than $j$ (i.e. $i$ P◮ $j$ but $i$ I⊲ $j$), we want that:
\[
\lim_{n\to\infty}\mathbb{P}\left\{Q_{n,i}(\hat\theta_i)-c(n,p_i,\mathbf Y)>Q_{n,j}(\hat\theta_j)-c(n,p_j,\mathbf Y)\right\}=1.
\]
This event becomes:
\[
0>Q_{n,j}(\hat\theta_j)-Q_{n,i}(\hat\theta_i)-c(n,p_j,\mathbf Y)+c(n,p_i,\mathbf Y)
=\left[Q_{n,j}(\hat\theta_j)-Q_{\infty,j}(\theta_j^*)\right]-\left[Q_{n,i}(\hat\theta_i)-Q_{\infty,i}(\theta_i^*)\right]-\left[c(n,p_j,\mathbf Y)-c(n,p_i,\mathbf Y)\right].\tag{7.12}
\]
Now, we have:
\[
\left[Q_{n,j}(\hat\theta_j)-Q_{\infty,j}(\theta_j^*)\right]-\left[Q_{n,i}(\hat\theta_i)-Q_{\infty,i}(\theta_i^*)\right]=O_P\!\left(\frac{1}{\sqrt n}\right),
\]
and therefore:
\[
\lim_{n\to\infty}\mathbb{P}\left\{\left[Q_{n,j}(\hat\theta_j)-Q_{\infty,j}(\theta_j^*)\right]-\left[Q_{n,i}(\hat\theta_i)-Q_{\infty,i}(\theta_i^*)\right]<c(n,p_j,\mathbf Y)-c(n,p_i,\mathbf Y)\right\}=1,
\]
as long as:
\[
\liminf_{n\to\infty}\sqrt n\cdot\left[c(n,p_j,\mathbf Y)-c(n,p_i,\mathbf Y)\right]=\infty.
\]

Proof of Theorem 6.1. Define the sets:
\[
A_n(j)=\left\{\omega:\hat J=j\right\},\qquad
B_n(j)=\left\{\omega:\frac{1}{n}\ln\frac{\mathrm{d}P_j}{\mathrm{d}P_0}\le H(P_j|P_0)+\varepsilon\right\}.
\]
Therefore, for any $j\notin J^*$ (so that $\{\hat J=j\}\subseteq\{\hat J\notin J^*\}$), we have:
\begin{align*}
P_0\left\{\hat J\notin J^*\right\}&=\mathbb{E}_0\,\mathbf 1\left\{\hat J\notin J^*\right\}
\ge\mathbb{E}_j\,\frac{\mathrm{d}P_0}{\mathrm{d}P_j}\,\mathbf 1\left\{\hat J\notin J^*\right\}
\ge\mathbb{E}_j\,\frac{\mathrm{d}P_0}{\mathrm{d}P_j}\,\mathbf 1\{A_n(j)\}\\
&\ge\mathbb{E}_j\,\mathbf 1\{A_n(j)\}\,\mathbf 1\{B_n(j)\}\cdot\exp\left\{-n\left[H(P_j|P_0)+\varepsilon\right]\right\}\\
&\ge\left[1-P_j\{A_n^c(j)\}-P_j\{B_n^c(j)\}\right]\cdot\exp\left\{-n\left[H(P_j|P_0)+\varepsilon\right]\right\}\\
&\ge\left[1-P_j\{\hat J\ne j\}-P_j\{B_n^c(j)\}\right]\cdot\exp\left\{-n\left[H(P_j|P_0)+\varepsilon\right]\right\}.
\end{align*}
This implies:
\[
\liminf_{n\to\infty}\frac{1}{n}\ln P_0\left\{\hat J\notin J^*\right\}\ge-H(P_j|P_0)-\varepsilon+\liminf_{n\to\infty}\frac{1}{n}\ln\left[1-P_j\{\hat J\ne j\}-P_j\{B_n^c(j)\}\right].
\]
Under (6.1) and (6.2), since $\varepsilon$ is arbitrary, we get:
\[
\liminf_{n\to\infty}\frac{1}{n}\ln P_0\left\{\hat J\notin J^*\right\}\ge-\min_{j\notin J^*}\inf_{\theta_j\in\Theta_j}H(P_j|P_0).
\]
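The engine of the proof just given is the classical change-of-measure bound $\mathbb{E}_0\,\mathbf 1_A\ge\mathbb{E}_j\left[\frac{\mathrm dP_0}{\mathrm dP_j}\,\mathbf 1_A\right]$. The following sketch (our illustration; the Gaussian pair, the event and all numbers are toy assumptions, with $H(P_j|P_0)=\mu^2/2$ for $P_j=N(\mu,1)$ against $P_0=N(0,1)$) verifies the identity behind the first inequality by Monte Carlo on a single observation:

```python
import math
import numpy as np

# Toy change-of-measure check: P0 = N(0,1), Pj = N(mu,1), so that
# dP0/dPj (y) = exp(-mu*y + mu^2/2). The event A stands in for {J_hat = j}.
rng = np.random.default_rng(1)
mu, sims = 1.0, 400_000
y = rng.normal(mu, 1.0, sims)                 # draws from Pj
lr = np.exp(-mu * y + mu**2 / 2)              # likelihood ratio dP0/dPj
event = y > 2.0                               # an arbitrary illustrative event
print("change-of-measure estimate of P0(A):", (lr * event).mean())
print("exact P0(A) = 1 - Phi(2):           ",
      0.5 * math.erfc(2 / math.sqrt(2)))
```

On $B_n(j)$ the likelihood ratio $\frac{\mathrm dP_0}{\mathrm dP_j}$ is at least $\exp\{-n[H(P_j|P_0)+\varepsilon]\}$, which is exactly how the exponential factor enters the chain of inequalities above.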
References

[1] C. R. Blyth and D. M. Roberts. On inequalities of Cramér–Rao type and admissibility proofs. In Sixth Berkeley Symposium, pages 17–30. University of California Press, 1972.
[2] Denis Bouyssou. Monotonicity of 'ranking by choosing': a progress report. Soc. Choice Welf., 23(2):249–273, 2004.
[3] Antoine Chambaz. Estimating and testing the order of a model. Technical report, Université Paris-Sud, 2002.
[4] Douglas G. Chapman and Herbert Robbins. Minimum variance estimation without regularity assumptions. Ann. Math. Statistics, 22:581–586, 1951.
[5] Christine Choirat and Raffaello Seri. Estimation in discrete parameter models. Document de Travail 2001-38, CREST, 2001.
[6] D. R. Cox. Tests of separate families of hypotheses. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I, pages 105–123. Univ. California Press, Berkeley, Calif., 1961.
[7] D. R. Cox. Further results on tests of separate families of hypotheses. J. Roy. Statist. Soc. Ser. B, 24:406–424, 1962.
[8] Gerard Debreu. Topological methods in cardinal utility theory. In Mathematical Methods in the Social Sciences 1959, pages 16–26. Stanford Univ. Press, Stanford, Calif., 1960.
[9] Herold Dehling. Complete convergence of triangular arrays and the law of the iterated logarithm for U-statistics. Statist. Probab. Lett., 7(4):319–321, 1989.
[10] R. H. Farrell. On the best obtainable asymptotic rates of convergence in estimation of a density at a point. Ann. Math. Statist., 43:170–180, 1972.
[11] Lorenzo Finesso, Chuang-Chun Liu, and Prakash Narayan. The optimal error exponent for Markov order estimation. IEEE Trans. Inform. Theory, 42(5):1488–1497, 1996.
[12] Elisabeth Gassiat and Stéphane Boucheron. Optimal error exponents in hidden Markov models order estimation. IEEE Trans. Inform. Theory, 49(4):964–980, 2003.
[13] Christian Gouriéroux. Econometrics of Qualitative Dependent Variables. Themes in Modern Econometrics. Cambridge University Press, Cambridge, 2000. Translated from the second French (1991) edition by Paul B. Klassen.
[14] Christian Gouriéroux and Alain Monfort. Statistics and Econometric Models. Cambridge University Press, 1995.
[15] P. Hall. On convergence rates in nonparametric problems. International Statistical Review, 57:45–58, 1989.
[16] J. M. Hammersley. On estimating restricted parameters. J. Roy. Statist. Soc. Ser. B, 12:192–229; discussion, 230–240, 1950.
[17] I. A. Ibragimov and R. Z. Has'minskii. Statistical Estimation: Asymptotic Theory, volume 16 of Applications of Mathematics. Springer-Verlag, New York, 1981. Translated from the Russian by Samuel Kotz.
[18] A. D. M. Kester and W. C. M. Kallenberg. Large deviations of estimators. Ann. Statist., 14(2):648–664, 1986.
[19] L. LeCam. Convergence of estimates under dimensionality restrictions. Ann. Statist., 1:38–53, 1973.
[20] E. Malinvaud. Leçons de théorie microéconomique. Dunod, 1968.
[21] R. Nishii. Maximum likelihood principle and model selection when the true model is unspecified. J. Multivariate Anal., 27(2):392–403, 1988.
[22] B. M. Pötscher. Effects of model selection on inference. Econometric Theory, 7(2):163–185, 1991.
[23] Douglas Rivers and Quang Vuong. Model selection tests for nonlinear dynamic models. Econom. J., 5(1):1–39, 2002.
[24] Robert J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, 1980.
[25] Chor-Yiu Sin and Halbert White. Information criteria for selecting possibly misspecified parametric models. J. Econometrics, 71(1-2):207–225, 1996.
[26] Igor Vajda. A discrete theory of search. I. Apl. Mat., 16:241–255, 1971.
[27] Igor Vajda. A discrete theory of search. II. Apl. Mat., 16:319–335, 1971.
[28] Quang H. Vuong. Likelihood ratio tests for model selection and nonnested hypotheses. Econometrica, 57(2):307–333, 1989.
[29] H. White. A consistent model selection procedure based on m-testing. In C. W. J. Granger, editor, Modelling Economic Series: Readings in Econometric Methodology, pages 369–403. Oxford University Press, 1989.

Dipartimento di Economia, Università degli Studi dell'Insubria, Via Ravasi 2, 21100 Varese, Italy
E-mail address: [email protected]

Dipartimento di Economia, Università degli Studi dell'Insubria, Via Ravasi 2, 21100 Varese, Italy
E-mail address: [email protected]