THE STRUCTURE OF MODEL SELECTION
CHRISTINE CHOIRAT AND RAFFAELLO SERI
Abstract. Most treatments of the model selection problem are either restricted to special situations (lag selection in AR, MA or ARMA models, regression selection, selection of a model out of a nested sequence) or to special selection methods (selection through testing or penalization). Our aim is to provide some basic tools for the analysis of model selection as a statistical decision problem, independently of the situation and of the method used. In order to achieve this objective, we embed model selection in the theoretical decision framework offered by modern Decision Theory. This allows us to obtain simple conditions under which pairwise comparison of models and penalization of objective functions arise naturally from preferences defined on the collection of statistical models under scrutiny. As a major application of our framework, we derive necessary and sufficient conditions that an information criterion has to satisfy, in the case of independent and identically distributed realizations, in order to deliver almost surely the “true” model out of a class of J models. “In probability” versions of these results are also discussed. Finally, we discuss some results concerning the optimality of model selection procedures.
Contents
1. Introduction
2. Preliminary Definitions
3. A Decision Theoretic Framework for Model Selection
3.1. Consistency and Conservativeness
3.2. Model Selection through Penalization
3.3. Model Selection through Pairwise Penalization
3.4. Model Selection through Testing
4. Necessary and Sufficient Conditions for Model Selection Through
Penalization
4.1. Strong Limit Theorems for m-estimation
4.2. Necessary and Sufficient Conditions for Conservativeness and
Consistency
5. Weak Forms of the Previous Concepts
6. Optimality of Model Selection Procedures
6.1. Risk Functions
6.2. Information Inequalities
7. Proofs
References
1. Introduction
Most treatments of the model selection problem are either restricted to special
situations (lag selection in AR, MA or ARMA models, regression selection, selection
of a model out of a nested sequence) or to special selection methods (selection
through testing or penalization). It is however unclear whether the results obtained in
these cases carry over to the general situation in which the models under scrutiny
are not necessarily nested and selection is performed using general criteria.
Our aim is to provide some basic tools for the analysis of model selection as a
statistical decision problem. We try to answer the following questions:
• How should we define the “best” model out of a class? Is the concept of
“best model” necessarily linked to nesting?
• What properties should we expect from a good model selection procedure?
• Is model selection by pairwise comparisons able to choose the overall best
model? If this is not true in general, under what conditions does this take
place?
• Is model selection through information criteria a general enough procedure?
• Is it possible to derive necessary and sufficient conditions for model selection
through information criteria? What is the link between model selection through
penalization and model selection through testing?
In order to achieve this objective, we embed the model selection procedure in the
theoretical decision framework offered by modern Decision Theory. First of all, in
Section 2, we analyze what a “true” or “best” model should be. In order to do so,
we introduce two preference relations on the collection of models under scrutiny;
these relations are defined in terms of their asymptotic goodness-of-fit (as measured
by a function $Q_{\infty,j}(\theta_j^*)$, which can be a likelihood, a generic objective function or
a measure of forecasting performance) and of their parsimony (as measured by
the number of parameters $p_j$). The first preference relation, written as ◮, is a
lexicographic order in which model j is preferred to model i if it has a higher value
of the objective function, or if the two have the same value of the objective function
but j is more parsimonious than i. The second preference relation, written as ⊲,
consists in choosing models with higher objective functions. Remark that these
relations are defined only for the limit situation with $n = +\infty$.
As described in Section 3, when passing from the asymptotic framework to the
finite sample situation, model selection becomes a problem of choice in a random
setting. However, with respect to the literature on Decision Theory under uncertainty, the distinctive feature of our approach is that we are interested in imposing
both finite sample and asymptotic requirements on the selection method. In particular, in Section 3.1, we define two properties of a model selection procedure, called
consistency and conservativeness and corresponding respectively to the preference
relations ◮ and ⊲; they have already appeared in the literature in special cases
(see e.g. [22], for the case of a nested sequence of models). Moreover, we derive
a condition that allows for reducing the asymptotic analysis of a model selection
procedure to the case in which only two models are in competition. As concerns the
situation in a finite sample, in Section 3.2 we show that model selection through
penalization of the objective function arises in a natural way from some properties
of a preference relation. Model selection through pairwise penalization and through
testing are briefly reviewed in Sections 3.3 and 3.4.
Section 4 contains a major application of the previous Theorems: we derive
necessary and sufficient conditions that an information criterion has to satisfy in
the case of independent and identically distributed realizations in order to deliver
almost surely the “true” model out of a class of J models. It turns out that the
bounds on the model selection procedures arise as strong limit theorems (Laws of
Large Numbers and Laws of the Iterated Logarithm, for sums and V-statistics)
associated with the weak limit theorems (Central Limit Theorems) that constitute
the basis of Vuong’s ([28]) model selection procedures based on likelihood ratio
tests. In Section 5 we introduce the “in probability” versions of consistency and
conservativeness and we extend some of the previous results to this case. Finally,
in Section 6, we outline an optimality result for model selection procedures.
2. Preliminary Definitions
A statistical model is a triplet of the form M = (Ω, A, P), where Ω is an abstract space, A a σ-algebra defined on Ω and P a family of probability measures
defined on (Ω, A). In our case, we will just consider parametric models, that is
P = {P_θ, θ ∈ Θ}, where Θ is a subset of a Euclidean space. Remark that nothing
imposes that the family P contains the probability measure that has generated the
data: indeed, this fact is supposed to be very rare, and the main objective is to
obtain a faithful but parsimonious representation of the data.
Since we are interested in model selection, we suppose that we have a collection of
statistical models identified by an index j ∈ {1, . . . , J}; any model is given by
$$M_j = \left(\Omega_j, \mathcal{A}_j, \mathcal{P}_j = \{P_{\theta_j}, \theta_j \in \Theta_j\}\right),$$
where $\Theta_j \subset R^{p_j}$ and $p_j$ is the number of parameters of the j-th model.
A model selection procedure is a statistical decision (in the Wald sense) choosing
a model out of a set {1, . . . , J}. The main difference with testing and estimation
in discrete parameter sets is that the objective of model selection is to choose a
model that is both the best in terms of explicative power and in terms of parameter
parsimony. Therefore, we introduce a lexicographic order on the values of the
objective functions and the numbers of parameters: first of all, the statistical models
$(M_j)$ with the highest value of the limit objective function $Q_{\infty,j}(\theta_j^*)$, and among
these, the models with the lowest number of parameters $p_j$. We formalize this
ordering through the relation ◮: we say that i ◮ j if
$$Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*) \quad\text{or}\quad Q_{\infty,i}(\theta_i^*) = Q_{\infty,j}(\theta_j^*) \text{ and } p_i \le p_j.$$
The relation ◮ is complete, reflexive and transitive; it is therefore a
complete preorder and a weak tournament (see e.g. [2]). This implies, in particular,
that the restriction of this relation to any subset of {1, . . . , J} is still a complete
preorder.
Remark that in this definition (and in the following ones), we make no explicit
reference to nesting, that is, to the fact that one of two models can be obtained
from the other simply by constraining the parameter space. Indeed, when comparing
two models on the basis of an information criterion, nesting is not at stake, even if
it is often assumed. We will show how it is possible to develop a theory of
model selection with almost no reference to nesting.
Remark also that we do not suppose that the models $M_j$, j = 1, . . . , J form a nested
sequence (i.e. $M_j \subset M_{j+1}$). Such an assumption would simplify somewhat the concepts and allow for
relaxed conditions (e.g. J could be equal to $+\infty$), but it is quite restrictive (it arises in
the estimation of AR or MA models but fails to cover the case of ARMA models).
Probably the best way to embody the relation ◮ is a graph. In this case any
model is identified with a vertex of the graph, and the relation i ◮ j is represented
by a directed arrow from i to j. In order to simplify the graph, it is customary to
define two new relations, called I_◮ and P_◮:
i I_◮ j ⇔ i ◮ j and j ◮ i,
i P_◮ j ⇔ i ◮ j and not j ◮ i,
where the meaning of I_◮ and of P_◮ is respectively indifference and strict preference
(they are sometimes called respectively the symmetric and the asymmetric
part of ◮). Moreover, in order to avoid complex diagrams, reflexivity is often not
represented. Since I_◮ is an equivalence relation, {1, . . . , J} can be partitioned
in a unique way into subsets such that the elements of a subset are indifferent
with respect to the relation I_◮: this partition is denoted {1, . . . , J}/I_◮, that is,
the quotient set of {1, . . . , J} with respect to I_◮.¹ The relation ◮ defined on
{1, . . . , J}/I_◮ is also anti-symmetric and is therefore a linear order.
The true model J* is defined as the set of majorants of the relation ◮ in {1, . . . , J}
and is therefore the element of {1, . . . , J}/I_◮ that dominates every other element.
As long as J is finite, this set is always nonempty (the extension to infinite choice
sets can raise some problems). More formally:
$$J^{**} = \arg\max_{j \in \{1,\dots,J\}} Q_{\infty,j}(\theta_j^*), \qquad J^{*} = \arg\min_{j \in J^{**}} p_j.$$
Nothing prevents J* and J** from being sets composed of more than one element,
since the majorant of a set need not be unique.
Another important concept often arising in model selection is linked to the idea
that the chosen model can be as good as another one in terms of explicative power,
even if it is not parsimonious in terms of parameters. It seems therefore interesting
to introduce a new relation, written as i ⊲ j, defined by
$$Q_{\infty,i}(\theta_i^*) \ge Q_{\infty,j}(\theta_j^*).$$
As before, we can define I_⊲, P_⊲ and {1, . . . , J}/I_⊲: the latter is a partition of
{1, . . . , J} that can be ordered by the value of $Q_{\infty,i}(\theta_i^*)$.
Example 2.1. Consider an iid sample {Y_1, . . . , Y_n} from a standard normal distribution. Consider the two statistical models:
$$q_1(y; \theta_1) = \ln\phi\!\left(\frac{y+1}{\sigma}\right) - \ln\sigma, \qquad \sigma \in (0, +\infty),$$
$$q_2(y; \theta_2) = \ln\phi\!\left(\frac{y-\mu}{\sigma}\right) - \ln\sigma, \qquad \sigma \in (0, +\infty),\ \mu \in [1, \infty).$$
¹ Given a set A and an equivalence relation R, the quotient space A/R is the set of equivalence classes induced by R on A.
The limit objective functions are:
$$Q_{\infty,1}(\theta_1) = -\frac{1}{2}\ln 2\pi - \ln\sigma - \frac{1}{\sigma^2}, \qquad Q_{\infty,2}(\theta_2) = -\frac{1}{2}\ln 2\pi - \ln\sigma - \frac{1+\mu^2}{2\sigma^2};$$
the pseudo-true values and the objective functions calculated at these points are:
$$\theta_1^* = \sigma^* = \sqrt{2}, \qquad Q_{\infty,1}(\theta_1^*) = -\tfrac{1}{2}\ln(4\pi e),$$
$$\theta_2^* = (\mu^*, \sigma^*) = (1, \sqrt{2}), \qquad Q_{\infty,2}(\theta_2^*) = -\tfrac{1}{2}\ln(4\pi e).$$
Therefore, $Q_{\infty,1}(\theta_1^*) = Q_{\infty,2}(\theta_2^*)$ and $p_1 < p_2$. We say that 1 P_◮ 2 (that is, 1 ◮ 2 and not 2 ◮ 1) and 1 I_⊲ 2 (that is, 1 ⊲ 2 and 2 ⊲ 1).
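The example can be checked numerically; the following sketch is ours (not part of the paper) and uses the closed-form profile maximum likelihood estimator of σ, so that both fitted objectives can be seen to approach $-\frac{1}{2}\ln(4\pi e) \approx -1.7655$:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(200_000)

def fitted_gaussian_loglik(shift):
    # profile MLE of sigma for the model ln phi((y - shift)/sigma) - ln sigma
    s2 = np.mean((y - shift) ** 2)
    return -0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2) - 0.5

Q1 = fitted_gaussian_loglik(-1.0)  # model 1: location fixed at -1, theta_1 = sigma
Q2 = fitted_gaussian_loglik(1.0)   # model 2: the constraint mu >= 1 binds at mu* = 1
print(Q1, Q2, -0.5 * np.log(4 * np.pi * np.e))  # all three ~ -1.7655
```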
3. A Decision Theoretic Framework for Model Selection
The problem is that, when passing from theory to applications, the values
$Q_{\infty,j}(\theta_j^*)$ are not known and have to be approximated using the observed values $Q_{n,j}(\hat\theta_j)$. Since these values are random, we have to introduce an estimator
of $J^*$, say $\hat J$: we suppose that $\hat J$ and $\lim_{n\to\infty}\hat J$ are single-valued, since ties can
always be broken.² In Wald’s statistical decision framework, $\hat J$ is called a strategy
or a decision rule.
In the following, we are going to explore some properties of $\hat J$, namely consistency
and conservativeness. In principle, it would be possible to define two different kinds
of consistency and conservativeness, standard (with reference to a particular
choice set {1, . . . , J}) and uniform (over any choice set {1, . . . , J}). However, this
would create some problems, since $\hat J$ could be conservative, consistent or even not
conservative according to the choice set.³ Therefore, the following properties are
to be understood over any possible choice of {1, . . . , J} and not with reference to a
particular choice set. We write $\hat J_A$ to indicate that the model selection
procedure works on the choice set A, and we use the shortcut $\hat J_J = \hat J_{\{1,\dots,J\}}$.
3.1. Consistency and Conservativeness. We now introduce the definitions of consistent and of conservative model selection procedures.
Definition 3.1. We say that $\hat J$ is:
• strongly conservative if, for any choice set A, $P\{\lim_{n\to\infty} \hat J_A \in J_A^{**}\} = 1$;
• strongly consistent if, for any choice set A, $P\{\lim_{n\to\infty} \hat J_A \in J_A^{*}\} = 1$.
The following Proposition gives some conditions that allow for simplifying the
analysis of the model selection problem.
Proposition 3.2. Consider a group of competing models {1, . . . , J} and take a
couple {i, j} out of the set {1, . . . , J}. Consider the following Assumptions:
A: For any couple {i, j} such that i P_⊲ j, $P\{\lim_{n\to\infty} \hat J_{\{i,j\}} = i\} = 1$.
² This means that if several models have exactly the same value of the objective function, we can choose one of them according to some deterministic (ordering according to the index) or random (sampling one of the models) rule.
³ Indeed, if the models in {1, . . . , J} have different values of the limit objective function $Q_{\infty,j}(\theta_j^*)$, then $\hat J = \arg\max_{j\in\{1,\dots,J\}} Q_{n,j}(\hat\theta_j)$ is consistent, while in general it is just conservative.
B: For any couple {i, j} such that i P_◮ j, $P\{\lim_{n\to\infty} \hat J_{\{i,j\}} = i\} = 1$.
C: For any two choice sets A ⊂ B and any i ∈ A, we have
$$\{\omega \mid \lim_{n\to\infty} \hat J_B(\omega) = i\} \subset \{\omega \mid \lim_{n\to\infty} \hat J_A(\omega) = i\}.$$
D: $\hat J$ is strongly conservative.
E: $\hat J$ is strongly consistent.
Then:
$$D \Rightarrow A, \qquad A \,\&\, C \Rightarrow D, \qquad E \Rightarrow B, \qquad B \,\&\, C \Rightarrow E,$$
that is, for a model selection procedure respecting C, strong consistency (resp. conservativeness) over any choice set is equivalent to strong consistency (resp. conservativeness) over couples.
Remark 3.3. (i) C is a requirement concerning the absence of rank reversal ([13],
Definition 1, p. 45, speaks of rationality): the introduction of new alternatives does
not enlarge the set of ω’s leading to the choice of i. Even if it does not seem to
be a necessary and sufficient condition for the equivalence of B and E (resp. A and
D), it seems a minimal requirement of robustness for a model selection procedure.
Moreover, the role of this hypothesis will be made clear in Section 3.2.
(ii) The Proposition allows for decomposing the process of comparison of several
alternatives into a series of pairwise comparisons. It is clear that A and B are much
simpler than the original problem and therefore allow for analyzing more complex
decision procedures.
Remark that conservativeness puts some requirements only on models with different values of the limit objective function, while consistency also requires parsimony
in parameters. According to common sense, it would seem possible to
choose a good model by comparing the values $Q_{n,j}(\hat\theta_j)$ for any j ∈ {1, . . . , J} and
picking the model with the greatest value of the objective function. However,
the model selection procedure defined by:
$$\hat J_J(\omega) = \arg\max_{j\in\{1,\dots,J\}} Q_{n,j}(\hat\theta_j(\omega)),$$
is just conservative (strongly or weakly according to the type of consistency of $\hat\theta_j$
and $Q_{n,j}(\hat\theta_j)$).⁴ This points at a fundamental deficiency of this method: indeed, if
two models are equivalent in terms of explicative power (that is, $M_2 \subset M_1$ and
$Q_{\infty,1}(\theta_1^*) = Q_{\infty,2}(\theta_2^*)$), the larger model will often be chosen for finite n because
it is the result of maximization over a larger parameter space.
⁴ Indeed, it is consistent in a choice set {1, . . . , J} if there is no couple of indexes 1 and 2 such that $M_2 \subset M_1$ and $Q_{\infty,1}(\theta_1^*) = Q_{\infty,2}(\theta_2^*)$, since in any other case it will pick the right answer.
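A small simulation (ours; the two Gaussian models are a hypothetical stand-in for the nested pair discussed above) makes the deficiency visible: since the larger model maximizes over a larger parameter space, its fitted objective is never smaller, and the raw arg max essentially always selects it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, wins_for_big = 500, 2_000, 0
for _ in range(reps):
    y = rng.standard_normal(n)
    # model 1: N(0, s^2) with p=1; model 2: N(mu, s^2) with p=2 (model 2 nests model 1)
    q1 = -0.5 * np.log(2 * np.pi * np.mean(y ** 2)) - 0.5
    q2 = -0.5 * np.log(2 * np.pi * y.var()) - 0.5
    wins_for_big += q2 > q1
print(wins_for_big / reps)  # ~1: y.var() <= np.mean(y**2), so the raw arg max overfits
```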
3.2. Model Selection through Penalization. The following Proposition (see
[13], Proposition 3, p. 46) shows that hypothesis C is a necessary and sufficient
condition for the existence of a utility function.
Proposition 3.4. Consider a group of competing models A = {1, . . . , J}. Then,
Assumption C in Proposition 3.2 holds if and only if there exists a function U : A × Ω →
R, (j, ω) ↦ $U_j(\omega)$, such that $\hat J_J(\omega) = \arg\max_{j\in\{1,\dots,J\}} U_j(\omega)$.
This Proposition means that model selection performed through a utility function
is equivalent to rational model selection, and that if model selection through a
utility function is consistent (conservative) over couples, then it is overall consistent
(conservative). Remark that this situation is paradoxical: when $n = +\infty$, ◮ is
a lexicographic ordering and cannot be represented by a utility function, while for
finite n it can be; we would like to find conditions such that the choice arising
from the optimization of a random utility function converges almost surely to the
choice arising from the lexicographic ordering.
In the following we show that, under quite general hypotheses, our comprehension of the form of U can be considerably increased. In order to do so, we suppose
that the choice is based only on the value of the objective function $Q_{n,j}(\theta_j)$ and on
the number of parameters $p_j$, and we define a preorder ⪰ on the couples of
the form $(Q_{n,j}(\theta_j), p_j)$. Remark that this mimics ◮, which is defined on the couples
of the form $(Q_{\infty,j}(\theta_j^*), p_j)$. We will need the following property of a preorder: for
a preorder ⪰ defined on the space X = X₁ × X₂ with generic element (x₁, x₂), we
say that the two components of X are independent if, for any fixed value x₁ ∈ X₁,
the preorder induced by ⪰ on X₂ is independent of the value x₁, and vice versa with
1 replaced by 2.
Proposition 3.5. Let n be fixed. Suppose that the following properties hold:
(1) The preference relation ⪰ is a complete preorder defined on the couples
of the form $x_j = (Q_{n,j}(\theta_j), p_j) \in R^2$ and can be extended (still being a
complete preorder) to a connected set of the form Q × P.
(2) The sets {x ∈ Q × P : x ⪰ x′} and {x ∈ Q × P : x ⪯ x′} are closed for every x′ ∈ Q × P.
(3) The factors $Q_{n,j}(\theta_j)$ and $p_j$ are independent and the preference relation is
strictly increasing in $Q_{n,j}(\theta_j)$ (for fixed $p_j$) and strictly decreasing in $p_j$
(for fixed $Q_{n,j}(\theta_j)$).
Then, there exists (up to an increasing linear transformation) a representation of
U as:
$$U_j(\omega) = U^1(Q_{n,j}(\theta_j), n) + U^2(p_j, n),$$
with $U^1$ continuous and strictly increasing and $U^2$ continuous and strictly decreasing.
Remark 3.6. (i) Hypothesis 1 requires that the relation ⪰ can be extended to the
product of two connected sets; this is usually quite natural for the part involving
$Q_{n,j}$, while it may sound strange for $p_j$ since these values are integers. Hypothesis 2
is customary in the theory of utility and involves a sort of continuity of preferences
(see e.g. [20], pp. 17-18); it is not satisfied by lexicographic orders (see e.g. [20], p.
18). On the other hand, hypothesis 3 is completely natural in this context.
(ii) Remark that the resulting utility function depends on ω only through $Q_{n,j}(\theta_j)$.
This shows that separability of goodness-of-fit and of parsimony is a rather natural property.
The strategy that has been most studied in the literature, and that
we examine in the following, is to build a utility function by penalizing the objective
function with a random variable depending on n, on the $p_j$’s and, in some cases,
even on the data: this is usually called in the literature a model selection criterion.
For any model j ∈ {1, . . . , J}, we define the penalized objective function $\bar Q_{n,j}$
defined as:
(3.1) $\bar Q_{n,j}(\theta_j) = Q_{n,j}(\theta_j) - c(n, p_j, \mathbf{Y}).$
We suppose that, among the characteristics of the model, the penalization depends just on $p_j$: some methods used in model selection
allow the penalization to depend on some features of the other competing models,
but this case seems to be quite marginal. Remark that Proposition 3.2 authorizes us to
base the choice of a model on pairwise comparisons and that, therefore, we could
more generally consider an information criterion having the same form as the ones
considered in [25] (eq. (2.6)). We will discuss in the following how the limitations we impose can
be circumvented.
The Proposition implies that no information criterion can give rise to nonrational
behavior. Therefore, when model selection is performed through an information
criterion, the following equivalences in Proposition 3.2 hold true:
$$D \Leftrightarrow A, \qquad E \Leftrightarrow B.$$
This means that any failure to identify the correct model has to be imputed to failure
over couples of models, and that it is possible to obtain necessary and sufficient
conditions for conservativeness and consistency of model selection procedures in
any choice set simply by considering choice sets composed of two models. As we will
see below, the Akaike Information Criterion fails to deliver the right model over
arbitrary choice sets because it fails to identify the best one over an arbitrary pair.
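As an illustration, a minimal sketch (ours) of selection through the penalized objective (3.1); the AIC- and BIC-type penalties anticipate the forms analyzed in Remark 4.4, and the helper name `select` is hypothetical:

```python
import numpy as np

def select(Qn, p, n, penalty="bic"):
    Qn, p = np.asarray(Qn, float), np.asarray(p, float)
    c = {"aic": p / n, "bic": p * np.log(n) / (2 * n)}[penalty]  # c(n, p_j)
    return int(np.argmax(Qn - c))                                # maximize Qbar_{n,j}

# two fits nearly tied on fit but not on parsimony:
print(select([-1.7650, -1.7648], p=[1, 2], n=500))                 # -> 0 under BIC
print(select([-1.7650, -1.7648], p=[1, 2], n=500, penalty="aic"))  # -> 0 here as well
```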
3.3. Model Selection through Pairwise Penalization. Looking at the previous Section, it is clear that the penalizations enter pairwise comparisons only
through their differences, such as $[c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})]$. This means
that we can also study the case in which no information criterion of the form
$\bar Q_{n,j}(\theta_j) = Q_{n,j}(\theta_j) - c(n,p_j,\mathbf{Y})$ exists, but in which comparisons are carried
through using pairwise information criteria of the form:
(3.2) $\bar Q_{n,(i,j)}(\theta_i,\theta_j) = Q_{n,i}(\theta_i) - Q_{n,j}(\theta_j) + c(n,p_j,p_i,\mathbf{Y}).$
In this case, we would choose i over j if $\bar Q_{n,(i,j)}(\theta_i,\theta_j) > 0$. This is the situation
considered e.g. in [25]. It is reasonable to suppose that
(3.3) $\bar Q_{n,(i,j)}(\theta_i,\theta_j) = -\bar Q_{n,(j,i)}(\theta_j,\theta_i),$
since in this case the comparison of i and j is carried through just once. (3.3) is equivalent to skew-symmetry of the penalization c, i.e. $c(n,p_j,p_i,\mathbf{Y}) = -c(n,p_i,p_j,\mathbf{Y})$.
The main problem with this approach is to show that hypothesis C holds. Indeed,
as concerns the fulfillment of hypotheses A and B, it is simple to derive conditions for
the Theorems of Section 4 to hold in the same form with $[c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})]$
replaced by both $c(n,p_j,p_i,\mathbf{Y})$ and $-c(n,p_i,p_j,\mathbf{Y})$. To verify to what extent
this is indeed the case, we introduce some concepts from Decision Theory.
For any fixed value ω ∈ Ω, $\bar Q_{n,(i,j)}$ corresponds to what is called a comparison function ([2]) on {1, . . . , J}, i.e. a skew-symmetric real-valued function on
{1, . . . , J}². It is well recognized that the most difficult problem with such comparison functions is to aggregate the pairwise choices they induce into a rational choice over larger sets, which is exactly what hypothesis C requires.
3.4. Model Selection through Testing. An alternative strategy that has received considerable attention in the literature is model selection through testing.
Cox ([6, 7]) proposes to use pairwise comparisons through likelihood ratio (LR)
tests for model selection: the idea is to perform an LR test between two alternative
models under the null hypothesis that one of them is correctly specified. This yields
a test that depends on which model is retained as the null one. If, as is often
done, two different tests are performed, each taking one of the two alternative
models as the null, the result is highly informative about the adequacy of the
models to the data (both models can be accepted, both rejected, or one accepted
and one rejected) but often cumbersome to interpret.
In a very important paper, Vuong ([28]) (see also [23], for the dynamic case)
modifies Cox’s idea: the strategy proposed by Vuong is based on a shift in the
point of view. When LR tests are performed, the null hypothesis is that both
models are equally close to the “best” one. The procedure is based on two test
statistics, the likelihood ratio statistic and the variance statistic. Suppose we want
to test whether two densities f and g, with respective estimated parameters $\hat\theta_f$
and $\hat\theta_g$ and pseudo-true values $\theta_f^*$ and $\theta_g^*$, are indeed the same. If
$f(\cdot;\theta_f^*) = g(\cdot;\theta_g^*)$, then the likelihood ratio statistic has the following asymptotic behavior:
(3.4) $\displaystyle 2\cdot\sum_{i=1}^n \ln\frac{f(y_i;\hat\theta_f)}{g(y_i;\hat\theta_g)} \xrightarrow{D} W_C(\lambda^*),$
where $W_C(\lambda)$ is a weighted sum of chi-squared random variables with weights given by λ, and
$\lambda^*$ are the eigenvalues of a matrix (this matrix will be described below). On the other
hand, if $f(\cdot;\theta_f^*) \ne g(\cdot;\theta_g^*)$, then:
(3.5) $\displaystyle \sqrt{n}\cdot\left[\frac{1}{n}\sum_{i=1}^n \ln\frac{f(y_i;\hat\theta_f)}{g(y_i;\hat\theta_g)} - E_0\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)}\right] \xrightarrow{D} N(0, \omega_*^2).$
Moreover, $\omega_*^2 = V_0\left[\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)}\right] = 0$ if and only if $f(\cdot;\theta_f^*) = g(\cdot;\theta_g^*)$. This
situation is reminiscent of the so-called von Mises expansion of a statistic and of
the behavior of nondegenerate and degenerate U- and V-statistics: indeed, the
asymptotic distribution is determined either by the linear part of the statistic (thus
yielding a normal distribution) or by the quadratic part (thus yielding a weighted
sum of chi-squares), according to the centering. As concerns the variance statistic,
Vuong proposes two estimators of $\omega_*^2$ and shows that $n\cdot\hat\omega_*^2 \xrightarrow{D} W_C\left((\lambda^*)^2\right)$, where
$(\lambda^*)^2$ is the square of $\lambda^*$.
Then, three situations can arise:
• f and g are strictly non-nested (i.e. there exists no value of the parameters
such that f = g): under the null hypothesis that $E_0\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)} = 0$, we
build a test based on (3.5); remark that $\sqrt{n}\cdot\frac{1}{n}\sum_{i=1}^n \ln\frac{f(y_i;\hat\theta_f)}{g(y_i;\hat\theta_g)} \to +\infty$ almost surely if $E_0\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)} > 0$, and vice versa.
• f and g are nested (i.e. one of the models is contained in the other): under
the null hypothesis that $E_0\ln\frac{f(Y;\theta_f^*)}{g(Y;\theta_g^*)} = 0$, we build a test based on (3.4).⁵
• f and g are overlapping (i.e. there exists some value of the parameters such
that f = g, but the models are not nested): in this case, Vuong proposes
to use first the variance statistic to test whether $f(\cdot;\theta_f^*) = g(\cdot;\theta_g^*)$, and
then to use the appropriate asymptotic distribution to perform a likelihood
ratio test.
⁵ It is also possible to use the variance statistic to test this hypothesis, but this requires more stringent hypotheses.
Unfortunately these methods do not allow in general for asymptotic selection of a
“best” model: when testing is performed at a fixed critical size, the probability
of choosing the “best” model is bounded away from one. This can be overcome using
a procedure such as the one in [29], in which the critical size of the tests goes to zero
as the sample size goes to infinity. Apart from the fact that the properties
stated by [29] are weak ones (that is, they do not hold almost surely, but only in
probability), in this case an “in probability” analogue of property B is enforced
to hold; if C is guaranteed, then we can expect that a weak concept of consistency
holds. The appropriate tools to deal with this case will be briefly introduced in
Section 5.
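For concreteness, here is a rough sketch (ours) of the strictly non-nested branch of the procedure: the statistic of (3.5), studentized by the sample standard deviation of the pointwise log-likelihood ratios, is compared with normal critical values. The helper name is hypothetical, and the degenerate branches based on (3.4) and on the variance statistic are omitted:

```python
import numpy as np
from scipy import stats

def vuong_nonnested(logf, logg, alpha=0.05):
    d = np.asarray(logf) - np.asarray(logg)    # pointwise log-likelihood ratios
    n = d.size
    t = np.sqrt(n) * d.mean() / d.std(ddof=1)  # studentized (3.5) under H0: E0[d] = 0
    z = stats.norm.ppf(1 - alpha / 2)
    return "f" if t > z else ("g" if t < -z else "undecided")
```

In the overlapping case, one would first test $\omega_*^2 = 0$ with the variance statistic, as described above, before choosing between the normal and the weighted chi-square calibration.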
4. Necessary and Sufficient Conditions for Model Selection
Through Penalization
The objective of this Section is to obtain conditions for conservativeness and consistency of model selection procedures in this simplified situation. First of all, we
introduce some hypotheses and derive some strong limit results for m-estimators.
Then we show how these results can be used in order to obtain necessary and sufficient conditions for consistency and conservativeness of model selection procedures.
The conditions are very similar to the ones in [21] and [25], since both theirs and
ours establish a link between model selection and LLNs and LILs. However, while
[25] deals with heterogeneous and dependent data, thus covering a larger scope of
cases, our objective is radically different, since we aim at proving that strong laws
for m-estimators act as bounds for model selection criteria.
4.1. Strong Limit Theorems for m-estimation. We have n independent and
identically distributed (iid) observations from a probability measure $P_0$: the hypothesis of iid observations is just made to simplify things, but could be relaxed.
Indeed, this hypothesis is needed just to derive the exact form of the bounds appearing in the Laws of the Iterated Logarithm and to derive necessary and sufficient
conditions; if only sufficient conditions are needed, weaker hypotheses can be
used as well.
We suppose that estimation of the parameters $\theta_j$ is performed by maximization
of an objective function taking the following form:
$$Q_{n,j}(\theta_j) = \frac{1}{n}\sum_{i=1}^n q_j(Y_i(\omega); \theta_j).$$
This kind of inference is called m-estimation, that is, maximum-likelihood-type estimation, since $Q_{n,j}(\theta_j)$ mimics the usual log-likelihood function. The m-estimator
of $\theta_j$ is therefore:
$$\hat\theta_j = \arg\max_{\theta_j\in\Theta_j} Q_{n,j}(\theta_j).$$
In the following, we will give some conditions ensuring that the objective function
$Q_{n,j}(\theta_j)$ converges almost surely to a fixed $Q_{\infty,j}(\theta_j)$ given by
$$Q_{\infty,j}(\theta_j) = E_0[q_j(Y;\theta_j)],$$
where $E_0$ is the expectation taken with respect to $P_0$, and that $\hat\theta_j$ converges almost
surely to a value called the pseudo-true value and defined by:
$$\theta_j^* = \arg\max_{\theta_j\in\Theta_j} Q_{\infty,j}(\theta_j).$$
A1: $\theta_j^*$ exists for any j.
A2: The derivatives $\frac{\partial q_j(y;\theta_j)}{\partial\theta_j}$ and $\frac{\partial^2 q_j(y;\theta_j)}{\partial\theta_j\partial\theta_j^T}$ are measurable with respect to y for any $\theta_j$ and continuous with respect to $\theta_j$. Moreover, if we denote by $\theta_{j,k}$ the k-th element of $\theta_j$, then $|q_j(y;\theta_j)|$, $|q_j(y;\theta_j)|^2$, $\left|\frac{\partial q_j(y;\theta_j)}{\partial\theta_{j,k}}\right|$, $\left|\frac{\partial^2 q_j(y;\theta_j)}{\partial\theta_{j,k}\partial\theta_{j,h}}\right|$ and $\left|\frac{\partial q_j(y;\theta_j)}{\partial\theta_{j,k}}\frac{\partial q_j(y;\theta_j)}{\partial\theta_{j,h}}\right|$ are dominated by functions not depending on $\theta_j$ and integrable with respect to $P_0$.
A3: Define the matrices
$$A_j(\theta_j) = E_0\left[\frac{\partial^2 q_j(Y;\theta_j)}{\partial\theta_j\partial\theta_j^T}\right], \qquad B_j(\theta_j) = E_0\left[\frac{\partial q_j(Y;\theta_j)}{\partial\theta_j}\frac{\partial q_j(Y;\theta_j)}{\partial\theta_j^T}\right], \qquad B_{ij}(\theta_i,\theta_j) = E_0\left[\frac{\partial q_i(Y;\theta_i)}{\partial\theta_i}\frac{\partial q_j(Y;\theta_j)}{\partial\theta_j^T}\right].$$
Then, $-A_j(\theta_j^*)$ and $B_j(\theta_j^*)$ are finite and positive definite.
A4: The estimator $\hat\theta_j$ converges almost surely to $\theta_j^*$.
A5: $\frac{\partial^3 q_j(y;\theta_j)}{\partial\theta_{j,k}\partial\theta_{j,h}\partial\theta_{j,\ell}}$ is measurable with respect to y for each θ. Moreover, $\left|\frac{\partial q_j(y;\theta_j)}{\partial\theta_{j,k}}\right|^2$, $\left|\frac{\partial^2 q_j(y;\theta_j)}{\partial\theta_{j,k}\partial\theta_{j,h}}\right|^2$ and $\left|\frac{\partial^3 q_j(y;\theta_j)}{\partial\theta_{j,k}\partial\theta_{j,h}\partial\theta_{j,\ell}}\right|$ are dominated by $P_0$-integrable functions which do not depend on θ.
Then we can state the following Theorem.
Theorem 4.1. Suppose that Assumptions A1-A5 hold. Then
$$\lim_{n\to\infty} Q_{n,j}(\hat\theta_j) = Q_{\infty,j}(\theta_j^*)$$
and the following facts hold.
(1) We have:
$$\limsup_{n\to\infty} \frac{\hat\theta_j - \theta_j^*}{\sqrt{\frac{2\ln\ln n}{n}}} = +\left\{\operatorname{diag}\left[A_j(\theta_j^*)^{-1} B_j(\theta_j^*) A_j(\theta_j^*)^{-1}\right]\right\}^{1/2}, \qquad \liminf_{n\to\infty} \frac{\hat\theta_j - \theta_j^*}{\sqrt{\frac{2\ln\ln n}{n}}} = -\left\{\operatorname{diag}\left[A_j(\theta_j^*)^{-1} B_j(\theta_j^*) A_j(\theta_j^*)^{-1}\right]\right\}^{1/2}.$$
(2) We have:
$$\limsup_{n\to\infty} \frac{n\cdot\left\{\left[Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)\right] - \left[Q_{n,i}(\hat\theta_i) - Q_{n,i}(\theta_i^*)\right]\right\}}{\ln\ln n} = \lambda_{ji} \qquad P\text{-a.s.},$$
where $\lambda_{ji}$ is the maximal eigenvalue of the matrix
$$\begin{bmatrix} -A_j(\theta_j^*)^{-1} B_j(\theta_j^*) & -A_j(\theta_j^*)^{-1} B_{ji}(\theta_j^*,\theta_i^*) \\ A_i(\theta_i^*)^{-1} B_{ij}(\theta_i^*,\theta_j^*) & A_i(\theta_i^*)^{-1} B_i(\theta_i^*) \end{bmatrix}.$$
(3) We have:
$$\limsup_{n\to\infty} \frac{Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)}{\sqrt{\frac{2\ln\ln n}{n}}} = +\left\{E_0\left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right]^2\right\}^{1/2}, \qquad \liminf_{n\to\infty} \frac{Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)}{\sqrt{\frac{2\ln\ln n}{n}}} = -\left\{E_0\left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right]^2\right\}^{1/2}.$$
Remark 4.2. The eigenvalues of the matrix appearing in the second part of the
Theorem are the same as those of the matrix of Theorem 3.3 (i) in [28]. Indeed, let us set
(as in [28], respectively on pages 313, 327 and 328):
$$Q = \begin{bmatrix} -A_j(\theta_j^*) & 0 \\ 0 & A_i(\theta_i^*) \end{bmatrix}, \qquad A = \begin{bmatrix} A_j(\theta_j^*) & 0 \\ 0 & A_i(\theta_i^*) \end{bmatrix}, \qquad B = B(\theta_j^*,\theta_i^*).$$
Then, our matrix is $Q^{-1}B$, while Vuong’s matrix is $W = QA^{-1}BA^{-1}$. However:
$$\operatorname{Sp}(QA^{-1}BA^{-1}) = \operatorname{Sp}(B^{1/2}A^{-1}QA^{-1}B^{1/2}) = \operatorname{Sp}(B^{1/2}Q^{-1}B^{1/2}) = \operatorname{Sp}(Q^{-1}B),$$
where Sp(·) is the spectrum of a matrix and we have used the fact that $A^{-1}QA^{-1} = Q^{-1}$. This suggests that model selection through penalization is consistent if the
penalization term is bounded by the Law of the Iterated Logarithm associated with
the asymptotic distributions of the test statistics proposed in [28].
4.2. Necessary and Sufficient Conditions for Conservativeness and Consistency. Now, we state some conditions for conservativeness and consistency of a
model selection criterion.
Theorem 4.3. Let Assumptions A1-A5 hold. The model selection procedure defined
by the maximization of the penalized objective function (3.1):
$$\hat J_J(\omega) = \arg\max_{j\in\{1,\dots,J\}} \bar Q_{n,j}(\hat\theta_j(\omega)),$$
respects Assumption C of Proposition 3.2. Then, the following necessary and sufficient conditions hold:
• strong conservativeness: For any couple {i, j} such that i P_⊲ j, $c(n, p, \mathbf{Y})$ is increasing in p and:
$$Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > \limsup_{n\to\infty}\,[c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y})].$$
• strong consistency: For any couple {i, j} such that i P_◮ j, $c(n, p, \mathbf{Y})$ is increasing in p; moreover, if i P_⊲ j:
$$Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > \limsup_{n\to\infty}\,[c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y})],$$
and if i P_◮ j but i I_⊲ j:
$$\liminf_{n\to\infty} \frac{n\cdot[c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})]}{\ln\ln n} > \lambda_{ji}.$$
Remark 4.4. (i) Consider the situation in which models i and j are both correctly
specified. Then $A_j(\theta_j^*) = -B_j(\theta_j^*)$ and $A_i(\theta_i^*) = -B_i(\theta_i^*)$. If
$$B_i(\theta_i^*) = B_{ij}(\theta_i^*,\theta_j^*)\, B_j(\theta_j^*)^{-1}\, B_{ji}(\theta_j^*,\theta_i^*),$$
then we have $\lambda_{ji} = 1$. The condition is the same as in Corollary 3.4 of [28].
(ii) For BIC, we have $c(n, p_j, \mathbf{Y}) = \frac{p_j \ln n}{2n}$. Then, strong conservativeness holds
since, for i P_⊲ j, $c(n, p, \mathbf{Y})$ is increasing in p and:
$$Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > (p_i - p_j)\cdot\limsup_{n\to\infty}\frac{\ln n}{2n} = 0.$$
As concerns strong consistency, the only new condition concerns the case in which
i P_◮ j but i I_⊲ j:
$$\liminf_{n\to\infty} \frac{n\cdot[c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})]}{\ln\ln n} = (p_j - p_i)\cdot\liminf_{n\to\infty}\frac{\ln n}{2\ln\ln n} = \infty > \lambda_{ji}.$$
Therefore both conditions are verified automatically.
(iii) HQIC offers a different example. For HQIC, we have $c(n, p_j, \mathbf{Y}) = \frac{p_j\, c\,\ln\ln n}{n}$,
where c will be determined in the following. Then, strong conservativeness holds
since, for i P_⊲ j, $c(n, p, \mathbf{Y})$ is increasing in p and:
$$Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > (p_i - p_j)\cdot c\cdot\limsup_{n\to\infty}\frac{\ln\ln n}{n} = 0.$$
As concerns strong consistency, we have just to verify that, if i P_◮ j but i I_⊲ j:
$$\liminf_{n\to\infty} \frac{n\cdot[c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})]}{\ln\ln n} = (p_j - p_i)\cdot c > \lambda_{ji}.$$
In order for this to hold for $p_j - p_i = 1$, we need $c > \lambda_{ji}$. Under correct specification,
usually c > 1 is enough (Hannan and Quinn require c > 2 since they use the likelihood ratio). Moreover, HQIC is on the border of consistency. It is interesting
to make a comparison with Proposition 5.2 in [25]: the authors need a different
condition involving all the eigenvalues of the matrix $Q^{-1}B$, while we need just
the greatest. This seems to be due to the fact that in the simple case of independent and identically distributed observations it is possible to use a LIL for
quadratic forms, while in the more general case of [25] this result is unavailable.
This shows that what is really needed is a LIL for the second order component of
$\left[Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)\right] - \left[Q_{n,i}(\hat\theta_i) - Q_{n,i}(\theta_i^*)\right]$, and its availability would allow for applying the result also in heterogeneous and dependent cases.
(iv) A last example concerns AIC, for which $c(n, p_j, \mathbf{Y}) = \frac{p_j}{n}$. Strong conservativeness holds since, for i P_⊲ j, $c(n, p, \mathbf{Y})$ is increasing in p and:
$$Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > \limsup_{n\to\infty}\frac{p_i - p_j}{n} = 0.$$
Strong consistency fails to hold, since for i P_◮ j but i I_⊲ j:
$$\liminf_{n\to\infty} \frac{n\cdot[c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})]}{\ln\ln n} = \liminf_{n\to\infty}\frac{p_j - p_i}{\ln\ln n} = 0,$$
which is not greater than $\lambda_{ji}$. Therefore, AIC is conservative but not consistent because it fails to respect the LIL.
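The contrast between (ii) and (iv) can be reproduced by simulation. The following sketch (ours; it uses the nested Gaussian pair with a true mean of zero, so that i P_◮ j but i I_⊲ j) estimates the overselection probabilities of AIC and BIC:

```python
import numpy as np

rng = np.random.default_rng(3)
for n in (100, 1_000, 10_000):
    over_aic = over_bic = 0
    for _ in range(1_000):
        y = rng.standard_normal(n)
        # model 1: N(0, s^2), p=1; model 2: N(mu, s^2), p=2; truth: mu = 0
        q1 = -0.5 * np.log(2 * np.pi * np.mean(y ** 2)) - 0.5
        q2 = -0.5 * np.log(2 * np.pi * y.var()) - 0.5
        over_aic += (q2 - 2 / n) > (q1 - 1 / n)                      # AIC: c = p/n
        over_bic += (q2 - np.log(n) / n) > (q1 - np.log(n) / (2 * n))  # BIC: c = p ln n/(2n)
    print(n, over_aic / 1_000, over_bic / 1_000)
```

As n grows, the AIC column should stabilize around $P\{\chi_1^2 > 2\} \approx 0.157$, while the BIC column vanishes, in line with the Remark.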
5. Weak Forms of the Previous Concepts
The previous concepts are strong or almost sure ones, since they require convergence
along almost all trajectories. However, the corresponding weak or in probability
concepts are also of some interest. Therefore, we show how the previous framework has
to be modified to deal with weak conservativeness and consistency.
Definition 5.1. We say that $\hat J$ is:
• weakly conservative if, for any choice set A, $\lim_{n\to\infty} P\{\hat J_A \in J_A^{**}\} = 1$;
• weakly consistent if, for any choice set A, $\lim_{n\to\infty} P\{\hat J_A \in J_A^{*}\} = 1$.
It is clear that a strongly consistent $\hat J$ is also weakly consistent and strongly
conservative, a strongly conservative $\hat J$ is also weakly conservative, and a weakly
consistent $\hat J$ is also weakly conservative.
Proposition 5.2. Consider a group of competing models {1, . . . , J} and take a
couple out of the set {1, . . . , J}. Consider the following Assumptions:
A′: For any couple {i, j} such that i P_⊲ j, $\lim_{n\to\infty} P\{\hat J_{\{i,j\}} = i\} = 1$.
B′: For any couple {i, j} such that i P_◮ j, $\lim_{n\to\infty} P\{\hat J_{\{i,j\}} = i\} = 1$.
C′: For any two choice sets A ⊂ B and any i ∈ A, we have
$$\{\omega \mid \hat J_B(\omega) = i\} \subset \{\omega \mid \hat J_A(\omega) = i\}.$$
D′: $\hat J$ is weakly conservative.
E′: $\hat J$ is weakly consistent.
Then:
$$D' \Rightarrow A', \qquad A' \,\&\, C' \Rightarrow D', \qquad E' \Rightarrow B', \qquad B' \,\&\, C' \Rightarrow E'.$$
We make some hypotheses in order to obtain the results: we require Assumptions A1-A3 and A5 of Section 4.1, with A4 weakened as follows.
A4′: The estimator $\hat\theta_j$ converges in probability to $\theta_j^*$.
Theorem 5.3. Let Assumptions A1-A3, A4′ and A5 hold. The model selection procedure defined
by the maximization of the penalized objective function (3.1):
$$\hat J_J(\omega) = \arg\max_{j\in\{1,\dots,J\}} \bar Q_{n,j}(\hat\theta_j(\omega)),$$
respects Assumption C′ of Proposition 5.2. Then, the following necessary and sufficient conditions hold:
• weak conservativeness: For any couple {i, j} such that i P_⊲ j, $c(n, p, \mathbf{Y})$ is increasing in p and:
$$\limsup_{n\to\infty} \sqrt{n}\left[c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y}) - Q_{\infty,i}(\theta_i^*) + Q_{\infty,j}(\theta_j^*)\right] = -\infty.$$
• weak consistency: For any couple {i, j} such that i P_◮ j, $c(n, p, \mathbf{Y})$ is increasing in p; moreover, if i P_⊲ j:
$$\limsup_{n\to\infty} \sqrt{n}\left[c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y}) - Q_{\infty,i}(\theta_i^*) + Q_{\infty,j}(\theta_j^*)\right] = -\infty,$$
and if i P_◮ j but i I_⊲ j:
$$\liminf_{n\to\infty} \sqrt{n}\cdot\left[c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})\right] = \infty.$$
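The rate conditions of Theorems 4.3 and 5.3 can be inspected numerically. This sketch (ours) evaluates, for a unit parameter gap $p_j - p_i = 1$, the two normalizations $n\,\Delta c/\ln\ln n$ (to be compared with $\lambda_{ji}$) and $\sqrt{n}\,\Delta c$:

```python
import numpy as np

penalties = {
    "AIC":  lambda n: 1.0 / n,                      # c(n, p) = p/n
    "BIC":  lambda n: np.log(n) / (2 * n),          # c(n, p) = p ln n / (2n)
    "HQIC": lambda n: 2.0 * np.log(np.log(n)) / n,  # c = 2, i.e. 2 ln ln n / n per parameter
}
for name, dc in penalties.items():
    for n in (1e3, 1e6, 1e12):
        print(name, int(n), n * dc(n) / np.log(np.log(n)), np.sqrt(n) * dc(n))
# AIC's first column tends to 0 (hence its inconsistency), BIC's diverges, and
# HQIC's is the constant c = 2, sitting on the border described in Remark 4.4(iii).
```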
6. Optimality of Model Selection Procedures
In this Section we study the optimality properties of model selection procedures.
6.1. Risk Functions. An interesting problem concerns the choice of a measure
of efficiency in model selection: in a seminal paper about estimation in discrete
parameter models, Hammersley ([16]) derives a generalization of the Cramér-Rao inequality for the variance that is also valid when the parameter space is countable.
The same inequality has been derived, in slightly more generality, by [4] and by [1].
Therefore, we will consider the following cost and risk functions:
$$C_1(\tilde J, J_0) = (\tilde J - J_0)^2, \qquad R_1(\tilde J, J_0) \triangleq E_0\, C_1(\tilde J, J_0) = \mathrm{MSE}(\tilde J).$$
However, this choice is well-suited only in cases in which the variance or the MSE
are good measures of risk. This is indeed the case if the limiting distribution of
the normalized estimator is normal. Following the discussion by Lindley in [16], we
also consider a different cost function $C_2(\tilde J, J_0)$:
$$C_2(\tilde J, J_0) = 1_{\{\tilde J \ne J_0\}};$$
the risk function is therefore given by the probability of misclassification:
$$R_2(\tilde J, J_0) \triangleq E_0\, C_2(\tilde J, J_0) = P_0\{\tilde J \ne J_0\}.$$
The previous cost function has the drawback of weighting in the same way points
of the parameter space that lie at different distances from the true value
$J_0$. In many cases, a more general loss function can be considered, as suggested in
[14] (Vol. 1, p. 51) for multiple tests:
$$C_3(\tilde J, J_0) = \begin{cases} 0 & \text{if } \tilde J = J_0, \\ a_j(J_0) & \text{if } \tilde J = j \ne J_0, \end{cases}$$
where $a_j(J_0) > 0$ for $j = 1, \dots, J$. The risk function is therefore given by the
weighted probability of misclassification:
$$R_3(\tilde J, J_0) \triangleq E_0\, C_3(\tilde J, J_0) = E_0\sum_{j=1}^J a_j(J_0)\cdot 1_{\{\tilde J = j\}} = \sum_{j=1}^J a_j(J_0)\cdot P_0\{\tilde J = j\}.$$
The $a_j(J_0)$’s can be tuned in order to give more or less weight to different points
of the parameter space.
Finally, we define the Bayes risk (under the zero-one loss function) associated
with a prior distribution π on the parameter space Θ. In particular, we consider
the Bayes risk under the risk function $R_2(\tilde J, J_0)$:
$$r_2(\tilde J, \pi) \triangleq \sum_{j=0}^J \pi(j)\cdot R_2(\tilde J, j) = \sum_{j=0}^J \pi(j)\cdot P_{\theta_j}\{\tilde J \ne j\}.$$
If $\pi(j) = (J+1)^{-1}$, we define $P_e \triangleq r_2(\tilde J, \pi)$ as the average probability of error.
Remark that this is indeed the measure of error used by Vajda ([26, 27]).
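The risk functions of this Section are easily approximated by Monte Carlo. A sketch (ours; the selection rule `toy` is a hypothetical placeholder for an actual $\tilde J$):

```python
import numpy as np

def risks(draw_J_hat, J0, reps=10_000, rng=np.random.default_rng(4)):
    Jh = np.array([draw_J_hat(J0, rng) for _ in range(reps)])
    return {"R1 (MSE)": np.mean((Jh - J0) ** 2),   # quadratic risk
            "R2": np.mean(Jh != J0)}               # misclassification probability

# toy rule: returns the true index with probability 0.9, else a neighboring index
toy = lambda J0, rng: J0 if rng.random() < 0.9 else J0 + 1
print(risks(toy, J0=1))  # R1 ~ 0.1, R2 ~ 0.1
```

Averaging R2 over the models with a uniform prior gives the quantity $P_e$ defined above.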
6.2. Information Inequalities. This Section contains lower bounds for the previously introduced risk functions, in particular for the risk function $R_2(\tilde J, J_0)$
corresponding to the zero-one loss. In our specific case, these generalize and unify
the lower bounds proposed by [16, 4, 18, 15]. Lower bounds for more general cost
functions can be obtained using, for example, the Markov inequality. Indeed, if $w_n$ is
a strictly positive Borel function increasing on $R_+^*$, then:
$$P_0\left\{\|\tilde J - J_0\| \ge k_n\right\} \le \frac{E_0\left[w_n\left(\|\tilde J - J_0\|\right)\right]}{w_n(k_n)}, \qquad P_0\left\{\|\tilde J - J_0\| \ge k_n\right\} \le \frac{E_0\left[w_n\left(\frac{\|\tilde J - J_0\|}{k_n}\right)\right]}{w_n(1)}.$$
wn (1)
First of all, a lower bound is proved and then, we obtain a minimax version
of the same result. We will sometimes refer to the first one as Chapman-Robbins
lower bound (and to the related efficiency concept as Chapman-Robbins efficiency)
since it recalls the lower bound proposed by these two authors in their 1951 paper.
Then, from these results, we derive lower bounds for the MSE, for the weighted
probability of missclassification and for the Bayes risk.
6.2.1. A Lower Bound for the Probability of Missclassification. The first Theorem
is a classical efficiency result, in the spirit of [11, 5, 3, 12]. It corresponds essentially
to Stein’s Lemma in hypothesis testing; the reduction of an estimation problem to
a test between two hypotheses is a standard technique in the derivation of efficiency
lower bounds and is attributed to [10] (see also [19], for a related technique). Here,
Stein’s Lemma is applied as in Theorem 9.2 in [17] (p. 96), taking into account
the fact that the parameter space is made up of more than two points. Remark,
however, that our version does not assume almost sure convergence of the Kullback-Leibler information measure to a fixed constant.
Theorem 6.1. Let $P_0$ be the true probability measure that has generated the data.
Suppose that
(6.1) $\displaystyle \limsup_{n\to\infty} P_j\{\hat J \ne j\} < 1$
and that
(6.2) $\displaystyle \lim_{n\to\infty} \frac{1}{n}\ln\frac{dP_j}{dP_0} = H(P_j|P_0), \qquad P_0\text{-a.s.},$
where the limit $H(P_j|P_0)$ (which can be a random variable) corresponds in most
cases to the so-called Kullback-Leibler divergence. Then, for any $j \notin J^*$:
$$\liminf_{n\to\infty} \frac{1}{n}\ln P_0\{\hat J \notin J^*\} \ge -\inf_j H(P_j|P_0).$$
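The flavour of the Theorem can be seen in a toy simulation (ours; two fixed Gaussian densities with no estimated parameters, chosen only to make the rates computable): for data from N(0,1) and the competing model N(1,1), the misclassification probability of the maximum-likelihood choice decays exponentially at roughly the Chernoff rate 1/8, which indeed does not exceed the Kullback-Leibler divergence H = 1/2 appearing in the lower bound:

```python
import numpy as np

rng = np.random.default_rng(5)
for n in (10, 20, 40):
    wrong = 0
    for _ in range(20_000):
        y = rng.standard_normal(n)
        # log-likelihood ratio ln[phi(y-1)/phi(y)]; positive sum means the wrong model wins
        wrong += np.sum(-0.5 * (y - 1) ** 2 + 0.5 * y ** 2) > 0
    p = wrong / 20_000
    print(n, p, -np.log(max(p, 1e-12)) / n)  # third column ~ 1/8, well below 1/2
```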
6.2.2. Lower Bounds for the Other Risk Functions. The results of the previous
Sections can easily be converted into corresponding results for the risk function
$R_1(\tilde J, J_0)$. Since $\tilde J - J_0$ takes values in a bounded set, the MSE of a generic estimator $\tilde J$ can be shown to satisfy:
$$\mathrm{MSE}(\tilde J) \asymp P_0\{\tilde J \ne J_0\},$$
so that
$$\liminf_{n\to\infty} \frac{1}{n}\ln \mathrm{MSE}(\tilde J) = \liminf_{n\to\infty} \frac{1}{n}\ln P_0\{\tilde J \ne J_0\}.$$
The same reasoning can be applied also to the risk function $R_3$:
$$\liminf_{n\to\infty} \frac{1}{n}\ln R_3(\tilde J, J_0) = \liminf_{n\to\infty} \frac{1}{n}\ln P_0\{\tilde J \ne J_0\}.$$
7. Proofs
Proof of Proposition 3.2. We embed the proof in the framework provided by
[13] (Section 3.3).
• strong conservativeness: D is sufficient for A: this can be shown simply by
considering a generic choice set composed of a couple of indexes, say {i, j}, such
that i P_⊲ j. Now, we show that A and C are sufficient for D. For any choice
set {1, . . . , J} and any i ∈ {1, . . . , J}, define
$$\Omega_{i|\{1,\dots,J\}} = \left\{\omega \,\middle|\, \lim_{n\to\infty} \hat J_J(\omega) = i\right\}.$$
Under C, from [13] (Lemma 2 on p. 66), we have
(7.1) $\displaystyle \Omega_{i|\{1,\dots,J\}} = \bigcap_{j\in\{1,\dots,J\},\, j\ne i} \Omega_{i|\{i,j\}}.$
Consider fictitiously $J_J^{**}$ as a new element of the choice set (that is, the
choice set is now made of $(J + 1 - |J_J^{**}|)$ elements). We have:
$$P\left\{\lim_{n\to\infty} \hat J_J \in J_J^{**}\right\} = P\left(\Omega_{J_J^{**}|\{1,\dots,J\}}\right) = P\Big(\bigcap_{k\in\{1,\dots,J\},\, k\notin J_J^{**}} \Omega_{J_J^{**}|\{k,J_J^{**}\}}\Big) \ge 1 - \sum_{k\in\{1,\dots,J\},\, k\notin J_J^{**}} \left[1 - P\left(\Omega_{J_J^{**}|\{k,J_J^{**}\}}\right)\right],$$
where we have used (7.1) and the Bonferroni inequality. But:
$$P\left(\Omega_{J_J^{**}|\{k,J_J^{**}\}}\right) = 1 - P\left(\Omega_{k|\{k,J_J^{**}\}}\right) = 1 - P\Big(\bigcap_{j\in J_J^{**}} \Omega_{k|\{k,j\}}\Big) \ge 1 - \max_{j\in J_J^{**}} P\left(\Omega_{k|\{k,j\}}\right) = \min_{j\in J_J^{**}} P\left(\Omega_{j|\{k,j\}}\right).$$
Summing up, we have:
$$P\left\{\lim_{n\to\infty} \hat J_J \in J_J^{**}\right\} \ge 1 - \sum_{k\in\{1,\dots,J\},\, k\notin J_J^{**}} \left[1 - P\left(\Omega_{J_J^{**}|\{k,J_J^{**}\}}\right)\right] \ge (1 - J + |J_J^{**}|) + (J - |J_J^{**}|)\cdot\min_{k\in\{1,\dots,J\},\, k\notin J_J^{**}}\, \min_{j\in J_J^{**}} P\left(\Omega_{j|\{k,j\}}\right).$$
Under A, each probability on the right-hand side equals one; therefore, A implies D.
• strong consistency: Substitute E for D, B for A and $J_J^{*}$ for $J_J^{**}$.
Proof of Proposition 3.5. According to hypothesis 3, both $Q_{n,j}(\theta_j)$ and $p_j$ are
essential.⁶ This implies that the discussion on pages 21-22 following the statement
of Theorem 3 of [8] applies (the Theorem itself does not apply directly, since the
essential factors are not more than two).
⁶ Consider a preorder ⪯ defined on the space X = X₁ × X₂ with generic element (x₁, x₂); component 1 is said to be essential if, for every x₂ ∈ X₂, not all the elements of X₁ are indifferent with respect to ⪯.
Proof of Theorem 4.1. We state here some results for future reference:
(7.2) $\ddot Q_{n,j}(\theta_j^*)^{-1} = \ddot Q_{\infty,j}(\theta_j^*)^{-1} - \left[\ddot Q_{\infty,j}(\theta_j^*)^{-1} - \ddot Q_{n,j}(\theta_j^*)^{-1}\right],$
(7.3) $\ddot Q_{n,j}(\theta_j^*) = \ddot Q_{\infty,j}(\theta_j^*) + o(1),$
(7.4) $\ddot Q_{n,j}(\theta_j^*)^{-1} = \ddot Q_{\infty,j}(\theta_j^*)^{-1} + o(1),$
where the last two results hold by A2 and A3. Moreover,
(7.5) $\displaystyle \ddot Q_{\infty,j}(\theta_j) = \frac{\partial^2 Q_{\infty,j}(\theta_j)}{\partial\theta_j\partial\theta_j^T} = \frac{\partial^2 E_0[q_j(Y;\theta_j)]}{\partial\theta_j\partial\theta_j^T} = E_0\left[\frac{\partial^2 q_j(Y;\theta_j)}{\partial\theta_j\partial\theta_j^T}\right] = A_j(\theta_j),$
where the third equality derives from A2.
(1) Consider the condition defining the estimator $\hat\theta_j$ (the so-called first-order
condition, since it involves the first-order derivative): $0 = \dot Q_{n,j}(\hat\theta_j)$. We
expand it around the value $\theta_j^*$:
(7.6) $0 = \dot Q_{n,j}(\theta_j^*) + \ddot Q_{n,j}(\theta_j^*)\cdot(\hat\theta_j - \theta_j^*) + r_n,$
where $r_n$ is defined by:
(7.7) $\displaystyle r_{n,k} = (\hat\theta_j - \theta_j^*)^T\, \frac{\partial^2}{\partial\theta_j\partial\theta_j^T}\frac{\partial Q_{n,j}(\tilde\theta_j)}{\partial\theta_{j,k}}\, (\hat\theta_j - \theta_j^*), \qquad k = 1,\dots,p_j,$
and $\tilde\theta_j = \delta\cdot\hat\theta_j + (1-\delta)\cdot\theta_j^*$ for a certain δ ∈ (0, 1). Therefore:
$$\hat\theta_j - \theta_j^* = -\ddot Q_{n,j}(\theta_j^*)^{-1}\left[\dot Q_{n,j}(\theta_j^*) + r_n\right] = -\ddot Q_{\infty,j}(\theta_j^*)^{-1}\left[\dot Q_{n,j}(\theta_j^*) + r_n\right] + \left[\ddot Q_{\infty,j}(\theta_j^*)^{-1} - \ddot Q_{n,j}(\theta_j^*)^{-1}\right]\left[\dot Q_{n,j}(\theta_j^*) + r_n\right],$$
where we have used (7.2). Now, we study the behavior of $\dot Q_{n,j}(\theta_j^*) = \frac{1}{n}\sum_{i=1}^n \frac{\partial q_j(Y_i;\theta_j^*)}{\partial\theta_j}$. This is the average of n iid random variables with moments:
$$E_0\left[\frac{\partial q_j(Y_i;\theta_j^*)}{\partial\theta_j}\right] = \frac{\partial E_0[q_j(Y_i;\theta_j^*)]}{\partial\theta_j} = \dot Q_{\infty,j}(\theta_j^*) = 0, \qquad E_0\left[\frac{\partial q_j(Y_i;\theta_j^*)}{\partial\theta_j}\frac{\partial q_j(Y_i;\theta_j^*)}{\partial\theta_j^T}\right] = B_j(\theta_j^*),$$
where we have used A2 to interchange derivative and expectation. Thus, by the classical LIL, the following result holds:
(7.8) $\displaystyle \limsup_{n\to\infty} \frac{\dot Q_{n,j}(\theta_j^*)}{\sqrt{\frac{2\ln\ln n}{n}}} = \left\{\operatorname{diag}\left[B_j(\theta_j^*)\right]\right\}^{1/2}, \qquad \liminf_{n\to\infty} \frac{\dot Q_{n,j}(\theta_j^*)}{\sqrt{\frac{2\ln\ln n}{n}}} = -\left\{\operatorname{diag}\left[B_j(\theta_j^*)\right]\right\}^{1/2}.$
Now, $\ddot Q_{\infty,j}(\theta_j^*)^{-1} - \ddot Q_{n,j}(\theta_j^*)^{-1} = o(1)$ (by (7.4)), $\dot Q_{n,j}(\theta_j^*) = O\left(\sqrt{\frac{\ln\ln n}{n}}\right)$ (by (7.8)) and $\frac{\partial^2}{\partial\theta_j\partial\theta_j^T}\frac{\partial Q_{n,j}(\tilde\theta_j)}{\partial\theta_{j,k}} = O(1)$ (by A5) imply:
(7.9) $r_n = O\left(\|\hat\theta_j - \theta_j^*\|^2\right),$
(7.10) $\hat\theta_j - \theta_j^* = O\left(\sqrt{\frac{\ln\ln n}{n}}\right) + O\left(\|\hat\theta_j - \theta_j^*\|^2\right).$
Therefore, since $\hat\theta_j - \theta_j^* = o(1)$ (by A4), the asymptotic almost sure behavior of $\hat\theta_j - \theta_j^*$ is determined by $-\ddot Q_{\infty,j}(\theta_j^*)^{-1}\cdot\dot Q_{n,j}(\theta_j^*)$:
$$\limsup_{n\to\infty} \frac{\hat\theta_j - \theta_j^*}{\sqrt{\frac{2\ln\ln n}{n}}} = +\left\{\operatorname{diag}\left[\ddot Q_{\infty,j}(\theta_j^*)^{-1} B_j(\theta_j^*)\, \ddot Q_{\infty,j}(\theta_j^*)^{-1}\right]\right\}^{1/2}, \qquad \liminf_{n\to\infty} \frac{\hat\theta_j - \theta_j^*}{\sqrt{\frac{2\ln\ln n}{n}}} = -\left\{\operatorname{diag}\left[\ddot Q_{\infty,j}(\theta_j^*)^{-1} B_j(\theta_j^*)\, \ddot Q_{\infty,j}(\theta_j^*)^{-1}\right]\right\}^{1/2}.$$
Using (7.5), the result follows.
(2) We have:
$$Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*) = (\hat\theta_j - \theta_j^*)^T\dot Q_{n,j}(\theta_j^*) + \frac{1}{2}(\hat\theta_j - \theta_j^*)^T\ddot Q_{n,j}(\theta_j^*)(\hat\theta_j - \theta_j^*) + \frac{1}{6}\sum_{h,k,\ell=1}^{p_j} \frac{\partial^3 Q_{n,j}(\bar\theta_j)}{\partial\theta_{j,h}\partial\theta_{j,k}\partial\theta_{j,\ell}}(\hat\theta_{j,h} - \theta_{j,h}^*)(\hat\theta_{j,k} - \theta_{j,k}^*)(\hat\theta_{j,\ell} - \theta_{j,\ell}^*),$$
where $\bar\theta_j = \varepsilon\cdot\hat\theta_j + (1-\varepsilon)\cdot\theta_j^*$ for a certain ε ∈ (0, 1). Using (7.6), we get:
$$Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*) = -\frac{1}{2}\dot Q_{n,j}(\theta_j^*)^T\ddot Q_{n,j}(\theta_j^*)^{-1}\dot Q_{n,j}(\theta_j^*) + \frac{1}{2} r_n^T\ddot Q_{n,j}(\theta_j^*)^{-1} r_n + \frac{1}{6}\sum_{h,k,\ell=1}^{p_j} \frac{\partial^3 Q_{n,j}(\bar\theta_j)}{\partial\theta_{j,h}\partial\theta_{j,k}\partial\theta_{j,\ell}}(\hat\theta_{j,h} - \theta_{j,h}^*)(\hat\theta_{j,k} - \theta_{j,k}^*)(\hat\theta_{j,\ell} - \theta_{j,\ell}^*).$$
Now, $\ddot Q_{n,j}(\theta_j^*)^{-1} = O(1)$ (by (7.4)) and $r_n = O(\|\hat\theta_j - \theta_j^*\|^2)$ (by (7.9)) imply that the second term is $O(\|\hat\theta_j - \theta_j^*\|^4)$, while $\frac{\partial^3 Q_{n,j}(\bar\theta_j)}{\partial\theta_{j,h}\partial\theta_{j,k}\partial\theta_{j,\ell}} = O(1)$ (by A5) implies that the third term is $O(\|\hat\theta_j - \theta_j^*\|^3)$. Therefore:
$$Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*) = -\frac{1}{2}\dot Q_{n,j}(\theta_j^*)^T\ddot Q_{n,j}(\theta_j^*)^{-1}\dot Q_{n,j}(\theta_j^*) + O\left(\|\hat\theta_j - \theta_j^*\|^3\right) = -\frac{1}{2}\dot Q_{n,j}(\theta_j^*)^T A_j(\theta_j^*)^{-1}\dot Q_{n,j}(\theta_j^*) + o\left(\|\hat\theta_j - \theta_j^*\|^2\right) + O\left(\|\hat\theta_j - \theta_j^*\|^3\right),$$
where we have used (7.2), (7.4) and (7.5). Using (7.10), it can be seen that $o(\|\hat\theta_j - \theta_j^*\|^2) + O(\|\hat\theta_j - \theta_j^*\|^3) = o\left(\frac{\ln\ln n}{n}\right)$, while the leading term is $O\left(\frac{\ln\ln n}{n}\right)$. Therefore, it is this last term that dictates the behavior of $Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)$. Now, consider the difference:
$$\Delta_{ji} = 2\left[Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)\right] - 2\left[Q_{n,i}(\hat\theta_i) - Q_{n,i}(\theta_i^*)\right] = \frac{1}{n^2}\sum_{h=1}^n\sum_{k=1}^n \begin{bmatrix} \dot q_j(Y_h;\theta_j^*) \\ \dot q_i(Y_h;\theta_i^*) \end{bmatrix}^T \begin{bmatrix} -A_j(\theta_j^*)^{-1} & 0 \\ 0 & A_i(\theta_i^*)^{-1} \end{bmatrix} \begin{bmatrix} \dot q_j(Y_k;\theta_j^*) \\ \dot q_i(Y_k;\theta_i^*) \end{bmatrix} + o\left(\frac{\ln\ln n}{n}\right).$$
Let:
$$h(y_h, y_k) \triangleq \begin{bmatrix} \dot q_j(y_h;\theta_j^*) \\ \dot q_i(y_h;\theta_i^*) \end{bmatrix}^T \begin{bmatrix} -A_j(\theta_j^*)^{-1} & 0 \\ 0 & A_i(\theta_i^*)^{-1} \end{bmatrix} \begin{bmatrix} \dot q_j(y_k;\theta_j^*) \\ \dot q_i(y_k;\theta_i^*) \end{bmatrix}.$$
Using the notation of [24] (Section 5.1.5), if $E h^2(Y_h, Y_k) < \infty$, we have:
$$\theta = E h(Y_h, Y_k) = 0, \qquad h_1(y_h) = E h(y_h, Y_k) = 0, \qquad \tilde h_1(y_h) = h_1(y_h) - \theta = 0, \qquad \zeta_1 = E\left[\tilde h_1(Y_h)\right]^2 = 0,$$
so that the leading term of $\Delta_{ji}$ is a degenerate V-statistic. We can use Theorem 2 of [9] to derive a Law of the Iterated Logarithm. Decompose the quadratic discrepancy as follows:
$$\sum_{h,k=1}^n h(Y_h, Y_k) = 2\sum_{k=1}^n\sum_{h=1}^{k-1} h(Y_h, Y_k) + \sum_{k=1}^n h(Y_k, Y_k).$$
Using Fischer’s variational property of eigenvalues:
$$E|h(Y_h, Y_h)| \le E\left\|\begin{bmatrix} \frac{\partial q_j(Y_h;\theta_j^*)}{\partial\theta_j} \\ \frac{\partial q_i(Y_h;\theta_i^*)}{\partial\theta_i} \end{bmatrix}\right\|^2 \cdot \left|\lambda_{\max}\begin{bmatrix} -A_j(\theta_j^*)^{-1} & 0 \\ 0 & A_i(\theta_i^*)^{-1} \end{bmatrix}\right| = \left\{\sum_{k=1}^{p_j} E\left|\frac{\partial q_j(Y_h;\theta_j^*)}{\partial\theta_{j,k}}\right|^2 + \sum_{k=1}^{p_i} E\left|\frac{\partial q_i(Y_h;\theta_i^*)}{\partial\theta_{i,k}}\right|^2\right\} \cdot \left|\lambda_{\max}\begin{bmatrix} -A_j(\theta_j^*)^{-1} & 0 \\ 0 & A_i(\theta_i^*)^{-1} \end{bmatrix}\right| < \infty$$
by A3 and A5, so that Kolmogorov’s law of large numbers implies:
$$\lim_{n\to\infty} \frac{1}{n\ln\ln n}\sum_{k=1}^n h(Y_k, Y_k) = 0, \qquad P\text{-a.s.}$$
Since $E h(Y_h, Y_k) = 0$, we use Fischer’s property of eigenvalues and the Cauchy-Schwarz inequality to derive:
$$E h^2(Y_h, Y_k) \le 2\cdot\left\{\sum_{k=1}^{p_j} E\left|\frac{\partial q_j(Y;\theta_j^*)}{\partial\theta_{j,k}}\right|^2 + \sum_{k=1}^{p_i} E\left|\frac{\partial q_i(Y;\theta_i^*)}{\partial\theta_{i,k}}\right|^2\right\}^2 \cdot \left\{\lambda_{\max}\begin{bmatrix} -A_j(\theta_j^*)^{-1} & 0 \\ 0 & A_i(\theta_i^*)^{-1} \end{bmatrix}\right\}^2.$$
This is finite by A3 and A5; then, from Theorem 2 of [9]:
$$\limsup_{n\to\infty} \frac{2\sum_{k=1}^n\sum_{h=1}^{k-1} h(Y_h, Y_k)}{n\ln\ln n} = 2C(h).$$
It is simple to see that $C(h) = \max_k\{\lambda_k\}$, where the $\lambda_k$ are the eigenvalues of the integral operator $\mathcal{A}$:
$$\mathcal{A}g(x) = \int h(x,y)\, g(y)\, P(dy), \qquad g \in L^2.$$
Indeed, this operator has a finite spectrum (it is induced by a quadratic form), so that it is possible to obtain a simpler expression for C(h). Let:
$$B_{ij}(\theta_i^*,\theta_j^*) = E_0\left[\dot q_i(Y_\ell;\theta_i^*)\cdot\dot q_j(Y_\ell;\theta_j^*)^T\right], \qquad B(\theta_j^*,\theta_i^*) = V_0\begin{bmatrix} \dot q_j(Y_\ell;\theta_j^*) \\ \dot q_i(Y_\ell;\theta_i^*) \end{bmatrix} = \begin{bmatrix} B_j(\theta_j^*) & B_{ji}(\theta_j^*,\theta_i^*) \\ B_{ij}(\theta_i^*,\theta_j^*) & B_i(\theta_i^*) \end{bmatrix}.$$
Then we can write:
$$h(y_h, y_k) = \begin{bmatrix} \dot q_j(y_h;\theta_j^*) \\ \dot q_i(y_h;\theta_i^*) \end{bmatrix}^T B(\theta_j^*,\theta_i^*)^{-1/2}\cdot B(\theta_j^*,\theta_i^*)^{1/2}\begin{bmatrix} -A_j(\theta_j^*)^{-1} & 0 \\ 0 & A_i(\theta_i^*)^{-1} \end{bmatrix} B(\theta_j^*,\theta_i^*)^{1/2}\cdot B(\theta_j^*,\theta_i^*)^{-1/2}\begin{bmatrix} \dot q_j(y_k;\theta_j^*) \\ \dot q_i(y_k;\theta_i^*) \end{bmatrix}.$$
Now, $B(\theta_j^*,\theta_i^*)^{-1/2}\left[\dot q_j(Y_\ell;\theta_j^*);\, \dot q_i(Y_\ell;\theta_i^*)\right]$ is a vector with orthonormal components in $L^2(P_0)$, and so the nonnull eigenvalues of $\mathcal{A}$ are equal to the spectrum of:
$$B(\theta_j^*,\theta_i^*)^{1/2}\begin{bmatrix} -A_j(\theta_j^*)^{-1} & 0 \\ 0 & A_i(\theta_i^*)^{-1} \end{bmatrix} B(\theta_j^*,\theta_i^*)^{1/2},$$
and hence to the spectrum of:
$$\begin{bmatrix} -A_j(\theta_j^*)^{-1} & 0 \\ 0 & A_i(\theta_i^*)^{-1} \end{bmatrix} B(\theta_j^*,\theta_i^*) = \begin{bmatrix} -A_j(\theta_j^*)^{-1} B_j(\theta_j^*) & -A_j(\theta_j^*)^{-1} B_{ji}(\theta_j^*,\theta_i^*) \\ A_i(\theta_i^*)^{-1} B_{ij}(\theta_i^*,\theta_j^*) & A_i(\theta_i^*)^{-1} B_i(\theta_i^*) \end{bmatrix}.$$
(3) We start from the decomposition:
(7.11) $Q_{n,j}(\hat\theta_j) = Q_{\infty,j}(\theta_j^*) + \left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] = Q_{\infty,j}(\theta_j^*) + \left[Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)\right] + \left[Q_{n,j}(\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right].$
As concerns $Q_{n,j}(\hat\theta_j) - Q_{n,j}(\theta_j^*)$, its behavior can be analyzed as before and it is $O\left(\frac{\ln\ln n}{n}\right)$. Then, we turn to the last addendum in (7.11):
$$Q_{n,j}(\theta_j^*) - Q_{\infty,j}(\theta_j^*) = \frac{1}{n}\sum_{i=1}^n \left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right]$$
is an average of n iid terms with zero mean (by definition of $Q_{\infty,j}$) and finite variance (by A2). The LIL holds and we have:
$$\limsup_{n\to\infty} \frac{Q_{n,j}(\theta_j^*) - Q_{\infty,j}(\theta_j^*)}{\sqrt{\frac{2\ln\ln n}{n}}} = +\left\{E_0\left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right]^2\right\}^{1/2}, \qquad \liminf_{n\to\infty} \frac{Q_{n,j}(\theta_j^*) - Q_{\infty,j}(\theta_j^*)}{\sqrt{\frac{2\ln\ln n}{n}}} = -\left\{E_0\left[q_j(Y_i;\theta_j^*) - Q_{\infty,j}(\theta_j^*)\right]^2\right\}^{1/2}.$$
Proof of Theorem 4.3. The fact that $\hat J_J(\omega)$ satisfies Assumption C of Proposition 3.2 is evident from Proposition 3.4.
(1) In this case, we need that, for any couple {i, j} such that i P_⊲ j, $P\{\lim_{n\to\infty} \hat J_{\{i,j\}} = i\} = 1$. Therefore consider i and j such that $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$: then, for any ω ∈ Ω (up to negligibility) there exists an $n_\omega$ such that $Q_{n,i}(\hat\theta_i(\omega)) > Q_{n,j}(\hat\theta_j(\omega))$ for $n \ge n_\omega$, and we want that also $\bar Q_{n,i} > \bar Q_{n,j}$ holds true. Therefore, we need:
$$0 < \bar Q_{n,i}(\hat\theta_i) - \bar Q_{n,j}(\hat\theta_j) = Q_{n,i}(\hat\theta_i) - Q_{n,j}(\hat\theta_j) - c(n,p_i,\mathbf{Y}) + c(n,p_j,\mathbf{Y}) = \left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right] - \left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] + Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) - c(n,p_i,\mathbf{Y}) + c(n,p_j,\mathbf{Y}).$$
Taking the lim inf, we get:
$$0 < \liminf_{n\to\infty}\left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right] - \limsup_{n\to\infty}\left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] + \liminf_{n\to\infty}\left[Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*)\right] - \limsup_{n\to\infty}\left[c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y})\right] = Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) - \limsup_{n\to\infty}\left[c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y})\right],$$
and at last:
$$Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*) > \limsup_{n\to\infty}\left[c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y})\right].$$
(2) For any couple {i, j} such that i P_◮ j, we should have $P\{\lim_{n\to\infty} \hat J_{\{i,j\}} = i\} = 1$. If $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$ (i.e. i P_⊲ j), then the condition is the same as for strong conservativeness. On the other hand, if $Q_{\infty,i}(\theta_i^*) = Q_{\infty,j}(\theta_j^*)$ and $p_i < p_j$, that is, i is a more parsimonious representation than j (i.e. i P_◮ j but i I_⊲ j), we want that for any ω ∈ Ω (up to negligibility) there exists an $n_\omega$ such that $\bar Q_{n,j}(\hat\theta_j(\omega)) - \bar Q_{n,i}(\hat\theta_i(\omega)) < 0$ for $n \ge n_\omega$. In order to be able to discriminate between the two models, we need to divide the terms of this difference by the order of the largest term, that is $\frac{\ln\ln n}{n}$. Then, taking a lim sup, we have:
$$0 > \limsup_{n\to\infty} \frac{\bar Q_{n,j}(\hat\theta_j) - \bar Q_{n,i}(\hat\theta_i)}{\frac{\ln\ln n}{n}} = \limsup_{n\to\infty} \frac{\left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] - \left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right]}{\frac{\ln\ln n}{n}} - \liminf_{n\to\infty} \frac{c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})}{\frac{\ln\ln n}{n}}.$$
According to Theorem 4.1, this becomes:
$$0 > \limsup_{n\to\infty} \frac{\bar Q_{n,j}(\hat\theta_j) - \bar Q_{n,i}(\hat\theta_i)}{\frac{\ln\ln n}{n}} = \lambda_{ji} - \liminf_{n\to\infty} \frac{c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})}{\frac{\ln\ln n}{n}}.$$
Proof of Proposition 5.2. The proof follows the same lines as the corresponding proof for the strong properties.
• weak conservativeness: The reasoning is the same, but we use the sets $\Omega^{(n)}_{i|\{1,\dots,J\}} = \{\omega \mid \hat J_J(\omega) = i\}$. The minorization is now:
$$P\left\{\hat J_J \in J_J^{**}\right\} \ge (1 - J + |J_J^{**}|) + (J - |J_J^{**}|)\cdot\min_{k\in\{1,\dots,J\},\, k\notin J_J^{**}}\, \min_{j\in J_J^{**}} P\left(\Omega^{(n)}_{j|\{k,j\}}\right).$$
Taking $\lim_{n\to\infty}$, we get the result.
• weak consistency: Substitute E′ for D′, B′ for A′ and $J_J^{*}$ for $J_J^{**}$.
Proof of Theorem 5.3. The fact that $\hat J_J(\omega)$ satisfies Assumption C′ of Proposition 5.2 is evident from Proposition 3.4.
(1) In this case, we need that, for any couple {i, j} such that i P_⊲ j (i.e. $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$), $\lim_{n\to\infty} P\{\hat J_{\{i,j\}} = i\} = 1$. Now, $\{\hat J_{\{i,j\}} = i\} = \{\omega \in \Omega \mid \hat J_{\{i,j\}}(\omega) = i\}$ can arise either if $\bar Q_{n,i}(\hat\theta_i) > \bar Q_{n,j}(\hat\theta_j)$ or if $\bar Q_{n,i}(\hat\theta_i) = \bar Q_{n,j}(\hat\theta_j)$ but the tie is broken in favor of i:
$$\{\hat J_{\{i,j\}} = i\} = \{\bar Q_{n,i}(\hat\theta_i) > \bar Q_{n,j}(\hat\theta_j)\} \uplus \{\bar Q_{n,i}(\hat\theta_i) = \bar Q_{n,j}(\hat\theta_j),\ \text{but } i \text{ is chosen}\},$$
$$P\{\hat J_{\{i,j\}} = i\} = P\{\bar Q_{n,i}(\hat\theta_i) > \bar Q_{n,j}(\hat\theta_j)\} + P\{\bar Q_{n,i}(\hat\theta_i) = \bar Q_{n,j}(\hat\theta_j),\ \text{but } i \text{ is chosen}\}.$$
The second probability is asymptotically null, since i P_⊲ j (i.e. $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$). As concerns the first probability, we have:
$$P\{\bar Q_{n,i}(\hat\theta_i) > \bar Q_{n,j}(\hat\theta_j)\} = P\{Q_{n,i}(\hat\theta_i) - Q_{n,j}(\hat\theta_j) > c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y})\} = P\left\{\left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right] - \left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] > \left[c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y})\right] - \left[Q_{\infty,i}(\theta_i^*) - Q_{\infty,j}(\theta_j^*)\right]\right\}.$$
Since $\sqrt{n}\left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right]$ and $\sqrt{n}\left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right]$ are $O_P(1)$, we need:
$$\limsup_{n\to\infty} \sqrt{n}\left[c(n,p_i,\mathbf{Y}) - c(n,p_j,\mathbf{Y}) - Q_{\infty,i}(\theta_i^*) + Q_{\infty,j}(\theta_j^*)\right] = -\infty.$$
(2) For any couple {i, j} such that i P_◮ j, $\lim_{n\to\infty} P\{\hat J_{\{i,j\}} = i\} = 1$. If $Q_{\infty,i}(\theta_i^*) > Q_{\infty,j}(\theta_j^*)$ (i.e. i P_⊲ j), then the condition is the same as for weak conservativeness. On the other hand, if $Q_{\infty,i}(\theta_i^*) = Q_{\infty,j}(\theta_j^*)$ and $p_i < p_j$, that is, i is a more parsimonious representation than j (i.e. i P_◮ j but i I_⊲ j), we want that
$$\lim_{n\to\infty} P\{\bar Q_{n,i}(\hat\theta_i) > \bar Q_{n,j}(\hat\theta_j)\} = 1.$$
This event becomes:
(7.12) $0 > \bar Q_{n,j}(\hat\theta_j) - \bar Q_{n,i}(\hat\theta_i) = Q_{n,j}(\hat\theta_j) - Q_{n,i}(\hat\theta_i) - c(n,p_j,\mathbf{Y}) + c(n,p_i,\mathbf{Y}) = \left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] - \left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right] - \left[c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})\right].$
Now, we have:
$$\left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] - \left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right] = O_P\left(\frac{1}{\sqrt{n}}\right),$$
and therefore:
$$\lim_{n\to\infty} P\left\{\left[Q_{n,j}(\hat\theta_j) - Q_{\infty,j}(\theta_j^*)\right] - \left[Q_{n,i}(\hat\theta_i) - Q_{\infty,i}(\theta_i^*)\right] < c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})\right\} = 1,$$
as long as:
$$\liminf_{n\to\infty} \sqrt{n}\cdot\left[c(n,p_j,\mathbf{Y}) - c(n,p_i,\mathbf{Y})\right] = \infty.$$
Proof of Theorem 6.1. Define the sets:
$$A_n(j) = \{\omega : \hat J = j\}, \qquad B_n(j) = \left\{\omega : \frac{1}{n}\ln\frac{dP_j}{dP_0} \le H(P_j|P_0) + \varepsilon\right\}.$$
Therefore, for $j \notin J^*$, we have:
$$P_0\{\hat J \notin J^*\} = E_0\, 1\{\hat J \notin J^*\} \ge E_j\, \frac{dP_0}{dP_j}\, 1\{\hat J \notin J^*\} \ge E_j\, \frac{dP_0}{dP_j}\, 1\{A_n(j)\} \ge E_j\, 1\{A_n(j)\}\, 1\{B_n(j)\}\cdot\exp\{-n\cdot[H(P_j|P_0) + \varepsilon]\} \ge \left[1 - P_j\{A_n^c(j)\} - P_j\{B_n^c(j)\}\right]\cdot\exp\{-n\cdot[H(P_j|P_0) + \varepsilon]\} \ge \left[1 - P_j\{\hat J \ne j\} - P_j\{B_n^c(j)\}\right]\cdot\exp\{-n\cdot[H(P_j|P_0) + \varepsilon]\}.$$
This implies:
$$\liminf_{n\to\infty} \frac{1}{n}\ln P_0\{\hat J \notin J^*\} \ge -H(P_j|P_0) - \varepsilon + \liminf_{n\to\infty} \frac{1}{n}\ln\left[1 - P_j\{\hat J \ne j\} - P_j\{B_n^c(j)\}\right].$$
Under (6.1) and (6.2), since ε is arbitrary, we get:
$$\liminf_{n\to\infty} \frac{1}{n}\ln P_0\{\hat J \notin J^*\} \ge -\min_{j\notin J^*}\, \inf_{\theta_j\in\Theta_j} H(P_j|P_0).$$
References
[1] C.R. Blyth and D.M. Roberts. On inequalities of Cramér-Rao type and admissibility proofs.
In Sixth Berkeley Symposium, pages 17–30. University of California Press, 1972.
[2] Denis Bouyssou. Monotonicity of ‘ranking by choosing’: a progress report. Soc. Choice Welf.,
23(2):249–273, 2004.
[3] Antoine Chambaz. Estimating and testing the order of a model. Technical report, Université
Paris-Sud, 2002.
[4] Douglas G. Chapman and Herbert Robbins. Minimum variance estimation without regularity
assumptions. Ann. Math. Statistics, 22:581–586, 1951.
[5] Christine Choirat and Raffaello Seri. Estimation in discrete parameter models. Document de
Travail 2001-38, CREST, 2001.
[6] D. R. Cox. Tests of separate families of hypotheses. In Proc. 4th Berkeley Sympos. Math.
Statist. and Prob., Vol. I, pages 105–123. Univ. California Press, Berkeley, Calif., 1961.
[7] D. R. Cox. Further results on tests of separate families of hypotheses. J. Roy. Statist. Soc.
Ser. B, 24:406–424, 1962.
[8] Gerard Debreu. Topological methods in cardinal utility theory. In Mathematical methods in
the social sciences 1959, pages 16–26. Stanford Univ. Press, Stanford, Calif., 1960.
[9] Herold Dehling. Complete convergence of triangular arrays and the law of the iterated logarithm for U-statistics. Statist. Probab. Lett., 7(4):319–321, 1989.
[10] R. H. Farrell. On the best obtainable asymptotic rates of convergence in estimation of a
density at a point. Ann. Math. Statist., 43:170–180, 1972.
[11] Lorenzo Finesso, Chuang-Chun Liu, and Prakash Narayan. The optimal error exponent for
Markov order estimation. IEEE Trans. Inform. Theory, 42(5):1488–1497, 1996.
[12] Elisabeth Gassiat and Stéphane Boucheron. Optimal error exponents in hidden Markov models order estimation. IEEE Trans. Inform. Theory, 49(4):964–980, 2003.
[13] Christian Gouriéroux. Econometrics of qualitative dependent variables. Themes in Modern
Econometrics. Cambridge University Press, Cambridge, 2000. Translated from the second
French (1991) edition by Paul B. Klassen.
[14] Christian Gouriéroux and Alain Monfort. Statistics and Econometric Models. Cambridge
University Press, 1995.
[15] P. Hall. On convergence rates in nonparametric problems. International Statistical Review,
57:45–58, 1989.
[16] J. M. Hammersley. On estimating restricted parameters. J. Roy. Statist. Soc. Ser. B., 12:192–
229; discussion, 230–240, 1950.
[17] I. A. Ibragimov and R. Z. Has'minskii. Statistical estimation, volume 16 of Applications
of Mathematics. Springer-Verlag, New York, 1981. Asymptotic theory, translated from the
Russian by Samuel Kotz.
[18] A. D. M. Kester and W. C. M. Kallenberg. Large deviations of estimators. Ann. Statist.,
14(2):648–664, 1986.
[19] L. LeCam. Convergence of estimates under dimensionality restrictions. Ann. Statist., 1:38–53,
1973.
[20] E. Malinvaud. Leçons de théorie microéconomique. Dunod, 1968.
[21] R. Nishii. Maximum likelihood principle and model selection when the true model is unspecified. J. Multivariate Anal., 27(2):392–403, 1988.
[22] B. M. Pötscher. Effects of model selection on inference. Econometric Theory, 7(2):163–185,
1991.
[23] Douglas Rivers and Quang Vuong. Model selection tests for nonlinear dynamic models.
Econom. J., 5(1):1–39, 2002.
[24] Robert J. Serfling. Approximation theorems of mathematical statistics. John Wiley & Sons
Inc., New York, 1980. Wiley Series in Probability and Mathematical Statistics.
[25] Chor-Yiu Sin and Halbert White. Information criteria for selecting possibly misspecified
parametric models. J. Econometrics, 71(1-2):207–225, 1996.
[26] Igor Vajda. A discrete theory of search. I. Apl. Mat., 16:241–255, 1971.
[27] Igor Vajda. A discrete theory of search. II. Apl. Mat., 16:319–335, 1971.
[28] Quang H. Vuong. Likelihood ratio tests for model selection and nonnested hypotheses. Econometrica, 57(2):307–333, 1989.
[29] H. White. A consistent model selection procedure based on m-testing. In C.W.J. Granger,
editor, Modelling Economic Series: Readings in Econometric Methodology, pages 369–403.
Oxford University Press, 1989.
Dipartimento di Economia, Università degli Studi dell’Insubria, Via Ravasi 2, 21100
Varese, Italy
E-mail address:
[email protected]
Dipartimento di Economia, Università degli Studi dell’Insubria, Via Ravasi 2, 21100
Varese, Italy
E-mail address:
[email protected]