Average (E)BIC-like Criteria for Bayesian Model Selection
Jingnan Xue, Ye Luo and Faming Liang∗
June 1, 2017
Abstract
Markov chain Monte Carlo (MCMC) has been an indispensable tool for Bayesian analysis of complex statistical models, even for high-dimensional problems. However, a consistent criterion for selecting models based on the output of MCMC is still lacking. The existing deviance information criterion (DIC) is known to be inconsistent and not invariant to reparameterization. This paper proposes an average BIC-like (ABIC) model selection criterion and an average EBIC-like (AEBIC) model selection criterion for low- and high-dimensional problems, respectively; establishes their consistency under mild conditions; and illustrates their applications using generalized linear models. The proposed criteria overcome the shortcomings of DIC. The numerical results indicate that the proposed criteria can significantly outperform DIC, as well as the MLE-based criteria such as AIC, BIC and EBIC, in terms of model selection accuracy.
Keywords: Asymptotic Consistency; Deviance Information Criterion; High-Dimensional Data; Generalized Linear Model; Markov Chain Monte Carlo.
∗ To whom correspondence should be addressed: F. Liang. Liang is Professor, Department of Biostatistics, University of Florida, Gainesville, FL 32611, email: [email protected]; J. Xue is Graduate Student, Department of Statistics, Texas A&M University, College Station, TX 77843, email: [email protected]; Y. Luo is Assistant Professor, Department of Economics, University of Florida, Gainesville, FL 32611, email: [email protected].
1 Introduction
Model selection is a fundamental part of the statistical modeling process, and it has been an active research area since the 1970s. The most famous model selection criteria are perhaps AIC (Akaike, 1974) and BIC (Schwarz, 1978). The former is commonly called the Akaike Information Criterion after Hirotugu Akaike, and the latter is called the Bayesian Information Criterion or Schwarz Information Criterion after Gideon Schwarz. Both criteria boil down to a trade-off between goodness-of-fit and model complexity: a good model fits the data well with a minimum number of variables/parameters. To be more precise, AIC is defined by

AIC(S) = −2L_n(X_n|β̂_S) + 2|S|/n,
where X_n = {X^(1), X^(2), ..., X^(n)} denotes a set of n iid observations, S denotes a specific model, |S| denotes the number of parameters of the model, β_S denotes the vector of parameters of the model, β̂_S denotes the maximum likelihood estimate (MLE) of β_S, and L_n(·) denotes the averaged log-likelihood function, i.e.,

L_n(X_n|β_S) = (1/n) Σ_{i=1}^{n} log f(x^(i)|β_S),     (1)

where f(·) denotes the probability mass/density function of the model. BIC is defined by
BIC(S) = −2L_n(X_n|β̂_S) + |S| log(n)/n,

which penalizes model complexity more heavily than AIC and so favors simpler models.
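To make the trade-off concrete, both criteria can be computed directly from the averaged log-likelihood at the MLE. The following Python sketch mirrors the definitions above; the function names and the example numbers are illustrative assumptions, not part of the original text.

```python
import numpy as np

def aic(avg_loglik_at_mle, model_size, n):
    """AIC(S) = -2*Ln(Xn|beta_hat_S) + 2|S|/n, with Ln the averaged log-likelihood."""
    return -2.0 * avg_loglik_at_mle + 2.0 * model_size / n

def bic(avg_loglik_at_mle, model_size, n):
    """BIC(S) = -2*Ln(Xn|beta_hat_S) + |S|*log(n)/n; heavier penalty than AIC once n >= 8."""
    return -2.0 * avg_loglik_at_mle + model_size * np.log(n) / n

# Example: score one candidate model on a data set of n = 500 observations.
n = 500
print(aic(avg_loglik_at_mle=-0.62, model_size=3, n=n),
      bic(avg_loglik_at_mle=-0.62, model_size=3, n=n))
```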
AIC and BIC represent two very different approaches to model selection. AIC is essentially a data-driven criterion: it selects a model that explains the data well and makes good short-term predictions. Its penalty term has been shown to make AIC asymptotically equivalent to leave-one-out cross-validation. Based on this observation, various cross-validation-based criteria have been proposed in the literature, see, e.g., generalized cross-validation (Wahba and Craven, 1979) and WAIC (Watanabe, 2010). AIC and the cross-validation-based criteria are not asymptotically consistent. BIC is derived under the Bayesian framework, for which differences between BIC values provide an approximation to the log Bayes factor under specific "unit information" prior assumptions (Kass and Raftery, 1995). Under specific conditions, BIC has been shown to be asymptotically consistent, see, e.g., Nishii (1984) and Haughton (1988).
Since AIC, BIC and the cross-validation criteria require an exhaustive evaluation of all possible models for the problem, they are difficult to apply when the model space is large. In the 1990s, the advent of Markov chain Monte Carlo (MCMC) made it possible to fit arbitrarily complex Bayesian models with an arbitrarily large model space. Note that the AIC, BIC and cross-validation criteria essentially fail for these problems. Hence, there was a pressing need for a model selection criterion that is able to select an appropriate model based on the output of MCMC. This need was partially addressed by the deviance information criterion (Spiegelhalter et al., 2002), which is defined by

DIC(S) = −2E_{β_S|X_n,S} L_n(X_n|β_S) + [−2E_{β_S|X_n,S} L_n(X_n|β_S) + 2L_n(X_n|β̄_S)],

where E_{β_S|X_n,S} denotes the expectation with respect to the posterior distribution of β_S for a given model S, and β̄_S = E_{β_S|X_n,S}(β_S) denotes the posterior mean of β_S. When the sample size n is sufficiently large, it can be shown that [−2E_{β_S|X_n,S} L_n(X_n|β_S) + 2L_n(X_n|β̄_S)] ≈ |S|/n, which measures the model complexity and is called the "effective number of parameters" in Spiegelhalter et al. (2002), and that E_{β_S|X_n,S} L_n(X_n|β_S) ≈ L_n(X_n|β̂_S). Therefore, DIC can be viewed as an approximation to AIC based on posterior samples. Twelve years later, Spiegelhalter et al. (2014) acknowledged that DIC has some problems: it is not invariant to reparameterization, it is not asymptotically consistent, it is not based on a proper predictive criterion, and it has a weak theoretical justification.
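To illustrate how DIC is assembled from MCMC output, the sketch below computes the posterior-mean term, the effective number of parameters, and DIC as defined above for a fixed model S. The helper avg_loglik and the array of posterior draws are assumed inputs; this is a minimal illustration, not the authors' implementation.

```python
import numpy as np

def dic_from_samples(beta_samples, avg_loglik):
    """DIC for a fixed model S from posterior draws of beta_S.

    beta_samples : array of shape (T, |S|), draws from the posterior of beta_S
    avg_loglik   : function beta -> Ln(Xn|beta), the averaged log-likelihood
    """
    ll = np.array([avg_loglik(b) for b in beta_samples])   # Ln at each posterior draw
    mean_ll = ll.mean()                                     # E_{beta|Xn,S} Ln(Xn|beta)
    beta_bar = beta_samples.mean(axis=0)                    # posterior mean of beta_S
    p_d = -2.0 * mean_ll + 2.0 * avg_loglik(beta_bar)       # "effective number of parameters" (~ |S|/n here)
    return -2.0 * mean_ll + p_d
```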
To alleviate the difficulties suffered by DIC, we propose a new model selection criterion, the so-called average BIC-like criterion,

ABIC(S) = −2nE_{β_S|X_n,S} L_n(X_n|β_S) + |S| log(n),     (2)

which shares the same model-complexity penalty as BIC and is abbreviated as "average BIC" or "ABIC". Under appropriate conditions, we can show that ABIC is asymptotically consistent.
During the past two decades, research in model selection has flourished again owing to the emergence of high-dimensional data, for which the sample size n is typically smaller than the dimension p. To address the ill-posed issues in modeling high-dimensional data, a variety of variable selection methods have been proposed, in particular for linear and generalized linear models. To name a few, they include Lasso (Tibshirani, 1994), SCAD (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), the Dantzig selector (Candes and Tao, 2007), MCP (Zhang, 2010), rLasso (Song and Liang, 2015a), and EBIC (Chen and Chen, 2008, 2012; Chen and Luo, 2013). Like AIC and BIC, all these methods require finding the global optimizer of a penalized log-likelihood function. However, as shown by Ge et al. (2016) and Huo and Chen (2010), except for Lasso, finding the global optimizer for the other methods is strongly NP-hard due to the concavity of their penalty functions. Therefore, the models selected by these methods are often sub-optimal. Although Lasso is computationally attractive, it is often inconsistent because the required irrepresentable condition (Zhao and Yu, 2006) is hard to satisfy. In practice, Lasso tends to select a model that includes many false variables.
On the other hand, MCMC continues to serve as an indispensable tool in Bayesian modeling for high-dimensional problems. As indicated by recent publications, see, e.g., Johnson and Rossell (2012), Liang et al. (2013) and Song and Liang (2015b), Bayesian methods can often outperform frequentist methods in identifying important variables for high-dimensional problems. Hence, there is now also a need to develop an appropriate criterion for selecting models based on the outputs of MCMC for high-dimensional
problems. To address this issue, we extend EBIC to AEBIC, which is defined by

AEBIC(S) = −2nE_{β_S|X_n,S} L_n(X_n|β_S) + |S| log(n) + 2γ|S| log(p).     (3)

The asymptotic consistency of this criterion can be established under some sparsity and regularity conditions. We would like to point out that our proof of the asymptotic consistency is general and is not restricted to generalized linear models.
The remainder of this article is organized as follows. Section 2 provides an informal derivation of ABIC
and AEBIC. Section 3 establishes the consistency properties of ABIC and AEBIC with the proofs deferred
to the Appendix. Section 4 evaluates the finite sample performance of ABIC and AEBIC through some
simulation studies and a real data example. Section 5 concludes the article with a brief discussion.
2 Intuitions on ABIC and AEBIC
This section gives an informal derivation and intuitive explanation of ABIC and AEBIC. Let X_n = {X^(i) : i = 1, ..., n} denote a dataset of n iid observations. For example, for high-dimensional regression, X^(i) = (y^(i), z^(i)), where y is the response variable and z is a high-dimensional random vector containing p_n predictors. Under the high-dimensional setting, p_n can increase with the sample size n. Let S ⊂ {1, 2, ..., p_n} denote a generic model, let S* denote the true model, and let 𝒮 := {S : S ⊂ {1, 2, ..., p_n}} denote the space of all possible models. For high-dimensional problems, the true model can be sparse, i.e., |S*| = o(n) holds. The problem of model selection is to identify S* from 𝒮 based on the observed data.
Let us consider the problem under the Bayesian framework, in which one chooses the model with the maximum posterior probability. Let π(S) denote the prior distribution imposed on the model space 𝒮 and let π(β_S) denote the prior distribution of β_S. Then the posterior probability of the model S is given by

P(S|X_n) = m(X_n|S)π(S) / Σ_{S∈𝒮} m(X_n|S)π(S),   where   m(X_n|S) = ∫ exp(nL_n(X_n|β_S)) π(β_S) dβ_S.
By choosing appropriate priors, it can be shown that P(S*|X_n) converges to 1 in probability as n goes to infinity (Liang et al., 2013; Shin et al., 2015), which is the so-called global model consistency in Bayesian variable selection (Johnson and Rossell, 2012). This further implies that arg max_S P(S|X_n) = S* holds in probability as n → ∞.
However, in most cases it is impossible to calculate P(S|X_n) exactly, so we need to approximate it. The approximation can be done using MCMC or the Laplace method. We note that the latter has been used in deriving BIC and EBIC. Since

arg max_S P(S|X_n) = arg max_S m(X_n|S)π(S) = arg max_S log{m(X_n|S)π(S)},

we only need to care about log{m(X_n|S)π(S)}, which can be re-expressed as

log{m(X_n|S)π(S)} = log ∫ exp(nL_n(X_n|β_S)) π(β_S) dβ_S + log π(S).     (4)
To deal with the log-integral term in (4), we expand L_n(X_n|β_S) at β̂_S, the MLE of β_S. That is,

L_n(X_n|β_S) ≈ L_n(X_n|β̂_S) + (1/2)(β_S − β̂_S)' [∂²L_n(X_n|β̂_S)/∂β_S ∂β_S^T] (β_S − β̂_S)
             = L_n(X_n|β̂_S) − (1/2)(β_S − β̂_S)' Î(X_n, β̂_S) (β_S − β̂_S),     (5)
where Î(X_n, β̂_S) = −∂²L_n(X_n|β̂_S)/∂β_S ∂β_S^T is the averaged observed information matrix. Therefore,

∫ exp(nL_n(X_n|β_S)) π(β_S) dβ_S ≈ exp(nL_n(X_n|β̂_S)) π(β̂_S) ∫ exp(−(1/2)(β_S − β̂_S)' nÎ(X_n, β̂_S) (β_S − β̂_S)) dβ_S
                                 = exp(nL_n(X_n|β̂_S)) π(β̂_S) (2π/n)^{|S|/2} |Î(X_n, β̂_S)|^{−1/2},

where the approximation is valid provided that the prior π(β_S) is "flat" over a neighborhood of β̂_S and exp(nL_n(X_n|β_S)) is negligible outside that neighborhood. Finally,
log{m(X_n|S)π(S)} = log ∫ exp(nL_n(X_n|β_S)) π(β_S) dβ_S + log π(S)
                  ≈ nL_n(X_n|β̂_S) + log π(β̂_S) + (|S|/2) log(2π) − (|S|/2) log(n) − (1/2) log|Î(X_n, β̂_S)| + log π(S)
                  ≈ nL_n(X_n|β̂_S) − (|S|/2) log(n) + log π(S),

where the second approximation is justified by the fact that, when n is sufficiently large, log(2π) is negligible compared with log(n), |Î(X_n, β̂_S)| converges to a constant, and log π(β̂_S) can be controlled at a level of O(1).
For the prior π(S), if the setting of Liang et al. (2013) is adopted, then we have

π(S) = λ_n^{|S|} (1 − λ_n)^{p_n − |S|},     (6)

where λ_n denotes the prior probability for each variable to be included in S and takes a value of the form

λ_n = 1 / (1 + p_n^γ √(2π)),     (7)
for some parameter γ > 0. If the setting of Chen and Chen (2008) is adopted, then we have

π(S) ∝ (C_{p_n}^{|S|})^{−γ}.     (8)

For the case that |S| is far smaller than p_n, it is easy to verify that log π(S) = −γ|S| log(p_n) + O(1) holds for both prior settings. Therefore,

log{m(X_n|S)π(S)} ≈ nL_n(X_n|β̂_S) − (|S|/2) log(n) − γ|S| log(p_n),     (9)

which is exactly equal to −EBIC/2.
In this article, we propose a new method to approximate the log-integral term in (4). Starting from (9), we further approximate L_n(X_n|β̂_S) by E_{β_S|X_n,S} L_n(X_n|β_S) + |S|/(2n), where E_{β_S|X_n,S} denotes the expectation with respect to the posterior distribution of β_S. This approximation has been discussed in Spiegelhalter et al. (2002) and can be verified as follows. It follows from (5) that, for sufficiently large n, we have

P(β_S|X_n, S) = exp(nL_n(X_n|β_S)) π(β_S) / m(X_n|S)
             ∝ exp(nL_n(X_n|β_S)) π(β_S)
             ≈ exp(nL_n(X_n|β̂_S)) π(β̂_S) exp(−(1/2)(β_S − β̂_S)' nÎ(X_n, β̂_S) (β_S − β̂_S)),

in the neighborhood of β̂_S where L_n(X_n|β_S) is dominant; that is, the posterior of β_S is asymptotically normal with mean β̂_S and covariance matrix (nÎ(X_n, β̂_S))^{−1}. Therefore,

E_{β_S|X_n,S} L_n(X_n|β_S) ≈ E_{β_S|X_n,S} {L_n(X_n|β̂_S) − (1/2)(β_S − β̂_S)' Î(X_n, β̂_S) (β_S − β̂_S)}
                           = L_n(X_n|β̂_S) − (1/2) E_{β_S|X_n,S} (β_S − β̂_S)' Î(X_n, β̂_S) (β_S − β̂_S)
                           = L_n(X_n|β̂_S) − |S|/(2n),
and log{m(X_n|S)π(S)} can be approximated as follows:

log{m(X_n|S)π(S)} ≈ nL_n(X_n|β̂_S) − (|S|/2) log(n) − γ|S| log(p_n)
                  ≈ nE_{β_S|X_n,S} L_n(X_n|β_S) + |S|/2 − (|S|/2) log(n) − γ|S| log(p_n)
                  ≈ nE_{β_S|X_n,S} L_n(X_n|β_S) − (|S|/2) log(n) − γ|S| log(p_n).

Under the large-sample setting, the term |S|/2 is negligible because it is of lower order than (|S|/2) log(n).
As with EBIC, we define AEBIC as −2 times the approximated log{m(X_n|S)π(S)}, i.e.,

AEBIC(S) = −2nE_{β_S|X_n,S} L_n(X_n|β_S) + |S| log(n) + 2γ|S| log(p_n).     (10)
For many problems, E_{β_S|X_n,S} L_n(X_n|β_S) does not have a closed form, so we use MCMC samples to approximate it. This leads to the following model selection procedure based on the outputs of MCMC:

(a) Run an MCMC algorithm to simulate T samples {(β_{S^(t)}, S^(t)) : t = 1, 2, ..., T} from the posterior distribution π(β_S, S|X_n).

(b) Let 𝒮_T denote the set of distinct models appearing in {S^(t) : t = 1, 2, ..., T}. For each S ∈ 𝒮_T, calculate

    \widehat{AEBIC}(S) = −(2n / #{t : S^(t) = S}) Σ_{t: S^(t)=S} L_n(X_n|β_{S^(t)}, S^(t)) + |S| log(n) + 2γ|S| log(p_n).

(c) Set Ŝ = arg min_{S∈𝒮_T} \widehat{AEBIC}(S) as the estimator of the true model S*.
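A minimal sketch of steps (a)-(c) is given below. It assumes the MCMC output has already been reduced to pairs (S^(t), L_n(X_n|β_{S^(t)})); the function and argument names are hypothetical.

```python
import numpy as np
from collections import defaultdict

def select_model_by_aebic(samples, n, p, gamma, min_freq=1):
    """Steps (a)-(c): estimate AEBIC for each distinct model visited by the MCMC run.

    samples  : iterable of (model, avg_loglik) pairs, where `model` is a frozenset of
               predictor indices S^(t) and `avg_loglik` is Ln(Xn | beta_{S^(t)}) at that draw
    n, p     : sample size and number of candidate predictors
    gamma    : EBIC/AEBIC tuning parameter
    min_freq : only score models visited at least this many times
    """
    loglik_by_model = defaultdict(list)
    for model, avg_loglik in samples:
        loglik_by_model[model].append(avg_loglik)

    aebic = {}
    for model, lls in loglik_by_model.items():
        if len(lls) < min_freq:
            continue
        s = len(model)
        aebic[model] = (-2.0 * n * np.mean(lls)          # -2n * posterior-mean averaged log-likelihood
                        + s * np.log(n)                  # BIC-type penalty
                        + 2.0 * gamma * s * np.log(p))   # extra EBIC-type penalty
    best = min(aebic, key=aebic.get)                     # estimator of the true model S*
    return best, aebic
```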
Remarks:

• Compared to AIC, BIC and EBIC, AEBIC has at least two merits. First, it avoids calculating the MLE, which can be very difficult for some problems. Second, it conducts a systematic search over the model space via MCMC, whereas AIC, BIC and EBIC often rely on a heuristic method to search for candidate models, and the search is often limited and incomplete.

• Compared to DIC, AEBIC also has at least two merits. First, AEBIC is invariant to reparameterization, since its penalty term involves only the model size rather than the model parameters; this is not the case for DIC. Second, AEBIC is asymptotically consistent, while DIC is not. The asymptotic consistency of AEBIC for high-dimensional model selection will be rigorously proved in the next section.

• For low-dimensional problems, the penalty term 2γ|S| log(p_n) can be ignored, and AEBIC reduces to ABIC.
3 Asymptotic Consistency of AEBIC
For any model S ⊂ {1, 2, ..., p_n}, we let β_S denote the associated vector of parameters, and let f(x|β_S) denote the likelihood function for a single data point x. Let S* denote the true model, and let β_{S*} denote the parameters of the true model. Assume that |supp(β_{S*})| = s_0 = o(n) holds. Let X_n = {X^(1), ..., X^(n)} denote a set of n iid observations, and let L_n(X_n|β_S) := E_n[log f(x^(i)|β_S)] = (1/n) Σ_{i=1}^{n} log f(x^(i)|β_S) denote the empirical (averaged) log-likelihood function, which is an analog of L(β_S) := E[log f(X|β_S)]. Denote the maximizers of L_n(X_n|β_S) and L(β_S) by β̂_S and β_{S,0}, respectively. For any fixed S with size s := |S| such that s/n → 0, we know that under some general consistency conditions for M-estimators, √n(β̂_S − β_{S,0}) → N(0, V), where V := E[∂² log f(X|β_S)/∂β_S ∂β_S^T |_{β_S=β_{S,0}}]^{−1} is the inverse of the information matrix for model S. Let q denote the maximum size of the models under consideration, which is allowed to increase with n and p. Also, we assume that the size of the true model is no greater than q, i.e., s_0 ≤ q. Let P(q) := {S : S ⊂ {1, 2, ..., p_n}, |S| ≤ q} denote the collection of all models with size smaller than or equal to q.
To establish consistency of AEBIC, we assume the following set of conditions:
(A1) The information matrix I_S = E[∂² log f(X|β_S)/∂β_S ∂β_S^T |_{β_S=β_{S,0}}] has bounded minimum and maximum eigenvalues for all |S| ≤ q. There exists an envelope function J(x) such that J(x) ≥ max_{i,j∈S, S∈P(q)} |∂² log f(x|β_S)/∂β_i ∂β_j| and E[J²(x)] ≤ C for some generic constant C > 0.

(A2) log f(x|β_S) is three times differentiable and locally L1-Lipschitz. There exists a function J_f(x) such that J_f(x) ≥ max_{S∈P(q)} |log f(x|β_S)| and E[J_f(x)²] ≤ C for some generic constant C > 0. The third-order derivative of log f(x|β_S) is uniformly continuous and bounded in β_S for all S ∈ P(q). Define f_1(x|β_S) := ∂ log f(x|β_S)/∂β_S. There exists a function J_1(x) > |f_1(x|β_{S,0})| for all x and S ∈ P(q) such that (i) E[|J_1(x)|²] < C_1 for some generic constant C_1 > 0, and (ii) sup_{x∈X} |J_1(x)| < C_2 q, where X is the domain of the observations X_n and C_2 is a generic constant. The domain X satisfies sup_{i=1,...,n; j=1,...,p_n} |x_j^(i)| < M for some generic constant M, where x_j^(i) denotes the jth component of x^(i).

(A3) For any S ∈ P(q), the prior density function π(β_S|S) is uniformly Lipschitz and bounded from above; i.e., there exists a generic constant L > 0 such that |π(β_S|S) − π(β_S'|S)| ≤ L‖β_S − β_S'‖ for any β_S, β_S' ∈ R^{|S|}. Also, there exists a constant c > 0 such that min_{S∈P(q)} π(β_{S,0}|S) > c.

(A4) There exists an envelope function F(x) > 0 such that f(x|β_S) < F(x) for all β_S with S ∈ P(q), and E[F(x)] < ∞.

(A5) The domain of β_S, denoted by N(S), is compact; i.e., there exists a constant M > 0 such that |β_S|_∞ ≤ M for any model S ∈ P(q). There exists an ε > 0 such that B_ε(β_{S,0}) ⊂ N(S), where B_ε(x) denotes the ε-ball centered at x.

(A6) For any model S ∈ P(q), there exists a unique maximizer β_{S,0} of L(β_S). Furthermore, for any η > 0, sup_{β_S ∉ B_η(β_{S,0})} L(β_S) ≤ L(β_{S,0}) − (cη² ∧ 1) for some constant c > 0, where a ∧ b = min(a, b).

(A7) min_{i∈S*} |β_i| / (q log(p)/n)^{1/4} → ∞ as n → ∞.
Condition (A1) is primarily an extension of the sparse eigenvalue condition stated in Bickel et al. (2009). Condition (A2) is required for uniform convergence of the sample average of the log-likelihood function over different models S ∈ P(q). Here we assume that all the observations in X_n are contained in a compact set. This is essentially a technical assumption, which has often been used in theoretical studies of Bayesian methods, see, e.g., Jiang (2007). Conditions (A3)-(A5) are regularity conditions on f, π and the domain of β_S. Condition (A6) imposes a restriction on the behavior of L(β_S) that enforces consistency of the maximum likelihood estimator uniformly over all S ∈ P(q). Condition (A7) is a minimum-signal-strength condition required for the model selection property.
First, we establish a uniform law of large numbers (ULLN) for L_n(X_n|β_S), which is modified from Lemma 1 of Foygel and Drton (2011). The proofs of all lemmas and theorems are deferred to the Appendix.

Lemma 1 (ULLN). If conditions (A1)-(A6) hold and q² log(p)/n → 0 as n → ∞, then with probability going to 1, uniformly for all S ∈ P(q) and β_S ∈ R^{|S|},

(i) L_n(X_n|β_S) →_p L(β_S), where →_p denotes convergence in probability.

(ii) β̂_S →_p β_{S,0}.

(iii) For any ‖β_S − β_{S,0}‖_2 ≤ l_n/√q, where l_n is a decreasing sequence that converges to 0, Î_S(β_S) converges to I_S(β_{S,0}) in Frobenius norm. Consequently, the maximum and minimum eigenvalues of Î_S(β_S) are bounded away from 0 and bounded from above uniformly for all S ∈ P(q).

(iv) ‖β̂_S − β_{S,0}‖_2 = O((q log(p)/n)^{1/4}). If q³ log(p)/n → 0, then Î_S(β̂_S)'s eigenvalues are bounded away from 0 and bounded from above uniformly for all S ∈ P(q).
Together with condition (A1), statement (iv) of Lemma 1 implies that Î_S(β̂_S) has uniformly bounded minimum and maximum eigenvalues for all S ∈ P(q). By Lemma 1, uniformly for any S and for any β_S ∉ B_{ε_n}(β̂_S), L_n(X_n|β_S) − L_n(X_n|β̂_S) = L(β_S) − L(β_{S,0}) + o_p(1). Define Î_S(β_S) := −E_n[∂² log f(x^(i)|β_S)/∂β_S ∂β_S^T], the averaged observed information matrix evaluated at β_S.
Lemma 2 (Approximation). Suppose conditions (A1)-(A6) hold. If q³ log(p)/n → 0 as n → ∞, then for sufficiently large n and a decreasing sequence l_n that converges to 0, with probability going to 1, uniformly over all S ∈ P(q), we have

(i) L_n(X_n|β_S) − L_n(X_n|β̂_S) ≤ L(β_S) − L(β_{S,0}) + O(√(q log(p)/n)).

(ii) For any β_S such that ‖β_S − β̂_S‖ ≤ l_n/√q,

    L_n(X_n|β_S) − L_n(X_n|β̂_S) = −(1/2)(β_S − β̂_S)^T Î_S(β̂_S)(β_S − β̂_S) + O_p(√q ‖β_S − β̂_S‖³ ∨ √(q² log(p)/n) ‖β_S − β̂_S‖²).
Recall that the posterior density function of β_S, given model S, is given by

P(β_S|X_n, S) = exp{nL_n(X_n|β_S)} π(β_S) / m(X_n|S),

where m(X_n|S) := ∫ exp{nL_n(X_n|β_S)} π(β_S) dβ_S. Without loss of generality, we assume that π(S) = 0 for any S ∉ P(q). Based on the previous two lemmas, we can now establish our main results. Theorem 1 shows that the posterior mean of L_n(X_n|β_S) can be approximated using L_n(X_n|β̂_S), where β̂_S denotes the MLE of β_S. This theorem forms a theoretical foundation for the consistency of AEBIC.
Theorem 1 (Posterior mean approximation of the log-likelihood function). Suppose conditions (A1)-(A6) hold and q⁴ log²(n) log(p)/n → 0 as n → ∞. Then for sufficiently large n, with probability going to 1, uniformly over all S ∈ P(q), we have

E_{β_S|X_n,S}[L_n(X_n|β_S)] = L_n(X_n|β̂_S) − |S|/(2n) + O((|S|/n) √(q⁴ log²(n) log(p)/n)),

where β̂_S denotes the MLE of β_S.
Lemma 3 establishes an upper bound on the maximized likelihood of overfitted models.

Lemma 3. Assume conditions (A1)-(A6) hold. For any model S such that |S| ≤ q, S* ⊂ S, and q⁴ log²(p)/n → 0, the following statement holds uniformly for such models with probability going to 1:

2n(L_n(X_n|β̂_S) − L_n(X_n|β̂_{S*})) ≤ 2(1 + η)|S\S*| log(n^ζ p),

where η > 0 is an arbitrarily small constant, ζ > 1/5 is any fixed constant, and β̂_{S*} denotes the MLE of β_{S*}.
Finally, we have the following theorem for the consistency of AEBIC in model selection.

Theorem 2 (Model Selection Consistency). Assume that conditions (A1)-(A7) are satisfied. If γ > 1, q⁶ log²(n)/(n log(p)) → 0, and q⁴ log²(p)/n → 0, then for sufficiently large n, with probability going to 1, it holds that for all models S with |S| ≤ q and S ≠ S*,

AEBIC(S*) < AEBIC(S),

where AEBIC(S) is as defined in (10).
4 Applications of ABIC and AEBIC to Generalized Linear Models
In this section we consider the applications of ABIC and AEBIC to generalized linear models (GLMs) (McCullagh and Nelder, 1989). Let X = (y, z), where y denotes the response variable and z = (z_1, z_2, ..., z_p) denotes a vector of predictors. Suppose that the distribution function of X is given by

f(y|z, β) = exp{a(θ)y + b(θ) + c(y)},     (11)

where a(·) and b(·) are continuously differentiable functions of θ, c(y) is a function of y only, and θ is the so-called natural parameter that relates y to the predictors via the linear function θ = z^T β = β_1 z_1 + ... + β_p z_p, where β_1, ..., β_p are regression coefficients. Here, the intercept term is treated as a special predictor included in z. The mean function is u = E(y|z) = −b'(θ)/a'(θ) := ψ(θ), where ψ(θ) is the inverse of the chosen link function. This class of GLMs includes regression models with binary, Poisson and Gaussian (with known variance) responses, whose respective link functions are the logit, log and identity links.
For this class of GLMs, we follow Liang et al. (2013) and assume that each model S is subject to the following prior distribution:

π(S) ∝ λ_n^{|S|} (1 − λ_n)^{p_n − |S|} I(|S| ≤ q),     (12)

where λ_n is chosen according to (7). Further, we let β_S follow a Gaussian prior distribution N(0, σ_S² I_{|S|}), where I_{|S|} denotes the |S| × |S| identity matrix and

σ_S² = (1/(2π)) e^{C_0/|S|},     (13)

for some positive constant C_0. As argued in Liang et al. (2013), under such a prior setting we have log π(β_S) = O(1), which satisfies the requirement of unit prior information in the derivation of BIC (Ando, 2010). The unit prior information is also implicitly required in condition (A3). Then, with an appropriate choice of q, which works as a sparsity constraint on the true model, we can verify that the GLM (11) satisfies conditions (A1)-(A7). We note that, under a set of slightly weaker but less general conditions, we have also proved the consistency of AEBIC for GLMs; refer to the supplementary material of the paper for details.
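The prior specification in (7), (12) and (13) involves only two scalar hyperparameters for each model; a small sketch of their computation is given below. The helper names are ours, and the default C_0 = 10 follows the value used in the experiments later in this section.

```python
import numpy as np

def prior_inclusion_prob(p, gamma):
    """lambda_n = 1 / (1 + p^gamma * sqrt(2*pi)), the prior inclusion probability in (7)."""
    return 1.0 / (1.0 + p ** gamma * np.sqrt(2.0 * np.pi))

def prior_coef_variance(model_size, c0=10.0):
    """sigma_S^2 = exp(C0/|S|) / (2*pi), the Gaussian prior variance in (13)."""
    return np.exp(c0 / model_size) / (2.0 * np.pi)

def log_model_prior(model_size, p, gamma, q):
    """log pi(S) under (12), up to the normalizing constant; -inf if |S| exceeds the bound q."""
    if model_size > q:
        return -np.inf
    lam = prior_inclusion_prob(p, gamma)
    return model_size * np.log(lam) + (p - model_size) * np.log(1.0 - lam)
```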
4.1 Low-Dimensional Case
We first work on some low-dimensional problems. In this case, ABIC, which is defined in (2), is used as
the model selection criterion.
Logistic Regression.  Consider a logistic regression with three active predictors:

logit P(y = 1|z) = 2z_1 + z_2 + 3z_3,     (14)
for which the total number of candidate predictors is p = 50 and the sample size is n = 500. All candidate
predictors are generated from the standard Gaussian distribution. For the prior π(S), we set γ = 0 in (7)
and ignored the upper bound q; and for the prior π(βS ), we set C0 = 10 in (13).
To simulate from the posterior distribution of (S, β_S), we implemented the reversible jump MCMC algorithm (Green, 1995), which includes three types of moves: birth, death and parameter updating. In the birth step, we randomly choose a predictor excluded from the current model S and include it in S to produce a larger model S'. The regression coefficient of the new predictor is generated from the Gaussian distribution N(0, 10²). In the death step, we randomly choose a predictor included in the current model S and remove it from S to produce a smaller model S'. In the parameter updating step, we keep S unchanged but randomly choose one coefficient β_i, i ∈ S, to update with a Gaussian random walk proposal whose variance is set to 0.5². To accelerate the convergence of the simulation, an advanced MCMC algorithm, such as stochastic approximation Monte Carlo (Liang et al., 2007), can be used. The algorithm was run for 10^6 iterations, where the first 10^5 iterations were discarded as burn-in and the posterior samples were collected at every 10th iteration from the remaining part of the run. Based on the collected samples, ABIC was calculated for different models according to the procedure given in Section 2. Based on the same set of posterior samples, we also calculated AIC, BIC and DIC for different models. For simplicity, we only considered the models with sampling frequency exceeding 100.
The simulation was repeated 100 times, with a different dataset generated each time. The results are summarized in Table 1. The performance of the methods was evaluated using three metrics: (i) the mean (μ_|Ŝi|) and standard deviation (σ_|Ŝi|) of |Ŝ_i|, where Ŝ_i denotes the model selected in the ith run; (ii) recall = [Σ_{i=1}^{100} |S* ∩ Ŝ_i|] / [|S*| · 100]; and (iii) precision = [Σ_{i=1}^{100} |S* ∩ Ŝ_i|] / [Σ_{i=1}^{100} |Ŝ_i|].
Table 1: Comparison of AIC, BIC, DIC and ABIC for low-dimensional logistic regression.

                        AIC           BIC           DIC           ABIC
  μ_|Ŝi| (σ_|Ŝi|)       8.86 (1.99)   3.72 (0.96)   9.04 (2.15)   3.33 (0.57)
  Recall                1.000         1.000         1.000         1.000
  Precision             0.339         0.806         0.332         0.901
Table 1 indicates that the simulation results agree well with the theory: BIC and ABIC are consistent for model selection, while AIC and DIC are not and tend to select larger models. It is remarkable that ABIC also significantly outperforms BIC in terms of model selection accuracy, i.e., μ_|Ŝi| and precision.
Linear Regression.  Next we consider a linear regression with five active predictors:

y = z_1 + 2.2z_2 − 1.6z_3 + 2z_4 − 1.4z_5 + ε,     (15)
where ε follows the standard normal distribution. For this example, we also set p = 50 and n = 500 and generated all predictors from the standard normal distribution. In addition, we specified the same prior distributions and employed the same MCMC algorithm as for the logistic regression example. Table 2 summarizes the results of 100 independent runs. For this example, conclusions similar to those for the logistic example can be drawn: AIC and DIC tend to select larger models, BIC and ABIC tend to select correct models, and ABIC outperforms all the other criteria.
Table 2: Comparison of AIC, BIC, DIC and ABIC for low-dimensional linear regression.

                        AIC           BIC           DIC           ABIC
  μ_|Ŝi| (σ_|Ŝi|)       8.70 (1.15)   5.59 (0.82)   9.05 (1.08)   5.35 (0.74)
  Recall                1.000         1.000         1.000         1.000
  Precision             0.575         0.894         0.552         0.934
4.2 High-Dimensional Case
For the high-dimensional case, AEBIC is used as the model selection criterion.
Logistic Regression.  Consider the logistic regression (14) again. Here we increased p from 50 to 2000, while keeping the sample size n = 500 unchanged. For the prior π(S), we tried four different γ values, 0.5, 0.6, 0.7 and 0.8, and fixed the upper bound q at 50. For the prior π(β_S), we still set C_0 = 10. The simulations were carried out as for the previous examples. Based on the collected samples, we calculated both EBIC and AEBIC for the models with sampling frequency greater than 20. AIC, BIC and DIC are not applicable to high-dimensional problems. Table 3 summarizes the results of 100 independent runs.
Table 3: Comparison of EBIC and AEBIC for high-dimensional logistic regression.

                EBIC                                      AEBIC
   γ     μ_|Ŝi| (σ_|Ŝi|)   Recall   Precision      μ_|Ŝi| (σ_|Ŝi|)   Recall   Precision
  0.5    3.60 (0.96)       1.000    0.833          3.39 (0.89)       1.000    0.885
  0.6    3.21 (0.46)       1.000    0.935          3.11 (0.31)       1.000    0.965
  0.7    3.15 (0.43)       1.000    0.949          3.06 (0.28)       1.000    0.977
  0.8    3.07 (0.30)       1.000    0.977          3.02 (0.20)       0.997    0.990
The comparison indicates that as γ increases, both EBIC and AEBIC tend to select sparser models. This is because a larger value of γ imposes a heavier penalty on the model size. It is remarkable that AEBIC is uniformly better than EBIC for all values of γ in terms of model size and true discovery rate.
Linear Regression.  We reconsidered the linear regression (15) with an enlarged value of p = 2000, while keeping the other settings unchanged. We adopted the same prior setting as for the high-dimensional logistic regression example. For simulation, the MH algorithm was slightly modified for this example. The modification is in the birth step: we no longer choose new predictors uniformly; instead, we choose each predictor with a probability proportional to |corr(y, x_k)|, where corr(·,·) denotes the Pearson correlation of two random variables. This modification can significantly accelerate the convergence of the simulation.
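A sketch of the modified birth proposal is given below; it simply normalizes the absolute Pearson correlations of the excluded predictors with the response (the function name is ours).

```python
import numpy as np

def birth_proposal_probs(y, Z, current_model):
    """Probability of proposing each excluded predictor in the birth step,
    proportional to |corr(y, z_k)| as described above."""
    excluded = [k for k in range(Z.shape[1]) if k not in current_model]
    weights = np.array([abs(np.corrcoef(y, Z[:, k])[0, 1]) for k in excluded])
    return np.array(excluded), weights / weights.sum()
```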
Table 4 summarizes the results for 100 independent runs. The comparison indicates again that AEBIC
outperforms EBIC in terms of model size and true discovery rate.
Table 4: Comparison of EBIC and AEBIC for high-dimensional linear regression.

                EBIC                                      AEBIC
   γ     μ_|Ŝi| (σ_|Ŝi|)   Recall   Precision      μ_|Ŝi| (σ_|Ŝi|)   Recall   Precision
  0.5    5.30 (0.54)       1.000    0.943          5.16 (0.42)       1.000    0.969
  0.6    5.15 (0.39)       1.000    0.971          5.06 (0.24)       1.000    0.988
  0.7    5.07 (0.26)       1.000    0.986          5.05 (0.22)       1.000    0.990
  0.8    5.03 (0.17)       1.000    0.994          5.01 (0.10)       1.000    0.998
4.3 A Real Data Example
We considered a dataset collected by Singh et al. (2002), which consists of the expression values of 6033 genes on 102 samples: 52 prostate cancer patients and 50 controls. The dataset is downloadable at http://statweb.stanford.edu/~ckirby/brad/LSI/datasets-and-programs/datasets.html, and has been analyzed in quite a few papers, see, e.g., Chen and Chen (2012), Efron (2009), and Liang et al. (2013). Our aim is to identify important genes that are associated with prostate cancer via high-dimensional logistic regression.
For the prior π(S), we set γ = 0.7 and fixed the upper bound q at 50. For the prior π(β_S), we set C_0 = 10. The MH algorithm was slightly modified from the one used before. The modification is in the birth step: we no longer choose new predictors uniformly; instead, we choose each predictor with a probability proportional to the weight w_k = Null Deviance − Deviance_k + 0.1, where k indexes the genes, "Null Deviance" denotes the deviance of the null model, which includes the intercept term only, and Deviance_k denotes the deviance of the model that includes the intercept term and gene k only.
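The weights w_k can be obtained from single-gene logistic regression fits; the sketch below uses statsmodels for the GLM fits, which is our choice of library rather than something specified in the paper.

```python
import numpy as np
import statsmodels.api as sm

def gene_weights(y, X):
    """w_k = NullDeviance - Deviance_k + 0.1, from single-gene logistic regressions."""
    null_dev = sm.GLM(y, np.ones((len(y), 1)), family=sm.families.Binomial()).fit().deviance
    weights = []
    for k in range(X.shape[1]):
        fit = sm.GLM(y, sm.add_constant(X[:, k]), family=sm.families.Binomial()).fit()
        weights.append(null_dev - fit.deviance + 0.1)
    w = np.array(weights)
    return w / w.sum()       # normalized birth-proposal probabilities
```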
The algorithm was run for 10^7 iterations, where the first 10^6 iterations were discarded as burn-in and the samples were collected at every 10th iteration from the remaining part of the run. Based on the collected samples, AEBIC was calculated for each model with sampling frequency greater than 10. For this example, we observed multiple models with very similar AEBIC values. As in Chen and Chen (2012), we ranked genes based on their appearance frequencies in the top 100 models. It is interesting to point out that our ranking overlaps substantially with the ranking given in Chen and Chen (2012), which was established based on EBIC but with a different model search procedure. Table 5 gives the top 10 genes reported in Chen and Chen (2012) and their corresponding ranks by our method.
Table 5: Top 10 genes ranked by EBIC in Chen and Chen (2012) and their corresponding ranks by AEBIC, where the genes are labeled by their column numbers in the dataset.

  Gene     EBIC rank    AEBIC rank
  v610     1            1
  v1720    2            15
  v332     3            8
  v364     4            6
  v1068    5            10
  v914     6            5
  v3940    7            3
  v1077    8            7
  v4331    9            4
  v579     10           19
5 Conclusion
In this paper, we have proposed ABIC and AEBIC as general Bayesian model selection criteria for low- and high-dimensional problems, respectively; established their consistency under mild conditions; and illustrated their applications using generalized linear models. The numerical results indicate that the proposed criteria can significantly outperform the existing ones, such as AIC, BIC, DIC and EBIC, in terms of model selection accuracy.
Compared to AIC, BIC and EBIC, the proposed criteria avoid calculating MLEs, which can be very difficult for some problems. Also, the MCMC simulations required by the proposed criteria provide a systematic way to search for the best ABIC or AEBIC models, whereas the AIC, BIC or EBIC methods often employ a heuristic search over the model space, and the search is often limited and incomplete. ABIC and AEBIC also overcome many limitations of DIC. As explained previously, both are invariant to reparameterization and asymptotically consistent, while DIC is not. Given these attractive features, we expect that ABIC and AEBIC will be widely adopted by researchers for Bayesian model selection in the near future.
Acknowledgement
Liang’s research was partially supported by grants DMS-1612924 and R01-GM117597.
Appendix A: Technical Details
In the appendix, we use X_n = {X^(1), ..., X^(n)} to denote a set of n observations, use P_n to denote the empirical measure of the observations, use L_n(X_n|β_S) to denote the averaged log-likelihood function as defined in (1), use S_n(X_n, β_S) to denote the score function ∂L_n(X_n|β_S)/∂β_S, and use Î_S(β_S) (or Î_S) to denote the observed information matrix −∂²L_n(X_n|β_S)/∂β_S ∂β_S^T. For simplicity, we also suppress the subscript of p_n and write it as p.
A.1 Proof of Lemma 1
Part (i). Define F_S(M) := {log f(x|β_S) 1(F ≤ M) : β_S ∈ R^{|S|}}, where F is defined in (A4). Define F_q(M) := ∪_{|S|≤q, S⊂{1,2,...,p}} F_S(M). It is easy to see that, for any s ≤ s_0, by the Lipschitz condition on log f(x|β) and condition (A6), the covering number of F_S(M) relative to the L_1(P_n)-norm, denoted by N(ε, F_S(M), L_1(P_n)), satisfies

N(ε, F_S(M), L_1(P_n)) ≤ C_0 (1/ε^{|S|}),

for some generic constant C_0 > 0, where P_n denotes the empirical measure of the n observations. Therefore, the covering number of F_q(M) satisfies

N(ε, F_q(M), L_1(P_n)) ≤ Σ_{s=1}^{q} C_p^s C_0 (1/ε^s),

where C_p^s is the number of ways of choosing s items out of p. Thus,

log N(ε, F_q(M), L_1(P_n)) ≤ Σ_{s=1}^{q} log C_p^s + log(C_0) + q log(1/ε) ≲ q log p + q log(1/ε) = o(n).

Therefore, by the uniform law of large numbers (Theorem 2.4.3 of Van der Vaart and Wellner (1997)),

|L_n(X_n|β_S) − L(β_S)| →_p 0

holds uniformly for all S ∈ P(q).
Part (ii) It follows directly from the conclusion of part (i) and condition (A7).
Part (iii). First, by the continuity condition stated in condition (A2), it is easy to see that, uniformly over all S ∈ P(q), ‖I_S(β_S) − I_S(β_{S,0})‖_∞ ≤ L‖β_S − β_{S,0}‖ holds for some constant L. Hence,

‖Î_S(β_S) − Î_S(β_{S,0})‖_2 ≤ √q ‖Î_S(β_S) − Î_S(β_{S,0})‖_∞ ≤ L√q ‖β_S − β_{S,0}‖.     (16)

For any β_S such that √q ‖β_S − β_{S,0}‖ → 0, the eigenvalues of Î_S(β_S) are bounded away from 0 and bounded from above uniformly, provided that Î_S(β_{S,0})'s eigenvalues have the same property. Therefore, it suffices to show a ULLN that Î_S(β_{S,0}) converges in probability to I_S(β_{S,0}) uniformly in Frobenius norm, for any S ∈ P(q).

Again, we show this by first examining the convergence of Î_S(β_{S,0}) to I_S(β_{S,0}) under the L_∞ norm. By condition (A1), J(x) serves as an envelope function of ∂² log f(x|β_{S,0})/∂β_i ∂β_j. Therefore, by the Lipschitz condition on ∂² log f(x|β_{S,0})/∂β_i ∂β_j and the existence of the function J, the covering number of F_I(q) := {∂² log f(x|β_{S,0})/∂β_i ∂β_j : i, j ∈ S, S ∈ P(q)} satisfies

log N(ε, F_I(q), L_2(P_n)) ≲ log(Σ_{s=1}^{q} s² C_p^s) + q log(1/ε),

for any ε > 0. By Theorem 2.5.2 of Van der Vaart and Wellner (1997),

P*(√n max_{S∈P(q)} ‖Î_S(β_{S,0}) − I_S(β_{S,0})‖_∞ > x) ≤ (C/x) ∫_0^1 √(log N(ε, F_I(q), L_2(P_n))) dε ≤ (C_1/x) √(q log(p)),

where P*(A) denotes the outer probability of an event A and C, C_1 are constants that depend only on the function J. Setting x = K√(q log(p)) for any K > 0, we have P*(max_{S∈P(q)} ‖Î_S(β_{S,0}) − I_S(β_{S,0})‖_∞ > K√(q log(p)/n)) < 2C_1/K, i.e., max_{S∈P(q)} ‖Î_S(β_{S,0}) − I_S(β_{S,0})‖_∞ = O_p(√(q log(p)/n)).

Since the Frobenius norm of Î_S(β_{S,0}) − I_S(β_{S,0}) satisfies

‖Î_S(β_{S,0}) − I_S(β_{S,0})‖_2 ≤ √q ‖Î_S(β_{S,0}) − I_S(β_{S,0})‖_∞ = O_p(√(q² log(p)/n)),     (17)

uniformly for all S ∈ P(q), combining inequalities (16) and (17) yields

‖Î_S(β_S) − I_S(β_{S,0})‖_2 ≲ √q ‖β_S − β_{S,0}‖ + √(q² log(p)/n).

By the assumption that q² log(p)/n → 0, statement (iii) holds.
Part (iv). By condition (A6), L(β̂_S) ≤ L(β_{S,0}) − c(‖β̂_S − β_{S,0}‖² ∧ 1). On the other hand, L_n(β̂_S) ≥ L_n(β_{S,0}). By the Lipschitz condition on log f(x|β_S) and the existence of the function J_f, the covering number of F(q) := {log f(x|β_S) : |S| ≤ q} satisfies

log N(ε, F(q), L_2(P_n)) ≲ log(Σ_{s=1}^{q} C_p^s) + q log(1/ε),

for any ε > 0. By Theorem 2.5.2 of Van der Vaart and Wellner (1997),

P*(√n max_{S∈P(q)} |L_n(X_n|β_S) − L(β_S)| > x) ≤ (C_2/x) ∫_0^1 √(log N(ε, F(q), L_2(P_n))) dε ≤ (C_3/x) √(q log(p)),

where P*(A) denotes the outer probability of an event A and C_2, C_3 are constants that depend only on the function J_f. Setting x = K√(q log(p)) for any K > 0, we have

P*(max_{S∈P(q)} |L_n(X_n|β_S) − L(β_S)| > K√(q log(p)/n)) ≤ 2C_3/K,

i.e., max_{S∈P(q)} |L_n(X_n|β_S) − L(β_S)| = O_p(√(q log(p)/n)). Consequently,

L(β̂_S) ≥ L_n(β̂_S) − O_p(√(q log(p)/n)) ≥ L_n(β_{S,0}) − O_p(√(q log(p)/n)) ≥ L(β_{S,0}) − O_p(√(q log(p)/n)),     (18)

which, together with condition (A6), implies that

c‖β̂_S − β_{S,0}‖² = O_p(√(q log(p)/n)),

i.e., ‖β̂_S − β_{S,0}‖ = O_p((q log(p)/n)^{1/4}). If q³ log(p)/n → 0, then (q log(p)/n)^{1/4} = o(1/√q), i.e., the condition of statement (iii) holds. Hence, Î_S(β̂_S)'s eigenvalues are bounded away from 0 and bounded from above uniformly over all S ∈ P(q).
A.2 Proof of Lemma 2
Part (i). By the optimality of β̂_S, we have L_n(X_n|β_S) − L_n(X_n|β̂_S) ≤ L_n(X_n|β_S) − L_n(X_n|β_{S,0}). By the uniform convergence result obtained in the proof of statement (iv) of Lemma 1, L_n(X_n|β_S) − L_n(X_n|β_{S,0}) ≤ L(β_S) − L(β_{S,0}) + O(√(q log(p)/n)). Therefore,

L_n(X_n|β_S) − L_n(X_n|β̂_S) ≤ L(β_S) − L(β_{S,0}) + O(√(q log(p)/n)).

Part (ii). Consider the local Taylor expansion of L_n(X_n|β_S) around β̂_S,

L_n(X_n|β_S) = L_n(X_n|β̂_S) − (1/2)(β_S − β̂_S)^T Î_S(β_S^∗)(β_S − β̂_S),

where β_S^∗ is a point between β_S and β̂_S. By statement (iii) of Lemma 1, Î_S(β_S^∗) = Î_S(β̂_S) + o_p(1) uniformly with respect to the Frobenius norm when ‖β_S − β̂_S‖ = o(1/√q). The proof of statement (iii) of Lemma 1 in fact provides a bound for the Frobenius norm of Î_S(β_S^∗) − Î_S(β̂_S). Therefore,

L_n(X_n|β_S) − L_n(X_n|β̂_S) = −(1/2)(β_S − β̂_S)^T Î_S(β̂_S)(β_S − β̂_S) + ‖β_S − β̂_S‖² O(√q ‖β_S − β̂_S‖ ∨ √(q² log(p)/n)).
A.3 Proof of Theorem 1
For sufficiently large n, with probability going to 1, all statements in Lemma 1 and Lemma 2 hold, and the following proof is based on these statements. First, we note that

E_{β_S|X_n,S}[L_n(X_n|β_S)] − L_n(X_n|β̂_S) = ∫ P(β_S|X_n, S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S,     (19)

where P(β_S|X_n, S) = exp(nL_n(X_n|β_S)) π(β_S)/m(S) and m(S) = ∫ exp(nL_n(X_n|β_S)) π(β_S) dβ_S.

We split the integration domain in (19) into three regions: a small neighborhood around β̂_S, denoted by B_{ε_K}(β̂_S), where ε_K = √(Kq log(n)/n); the area between the small neighborhood B_{ε_K}(β̂_S) and a larger neighborhood B_{ε_n}(β̂_S), denoted by B_{ε_n}(β̂_S) − B_{ε_K}(β̂_S), where ε_n = l_n/√q and l_n = (q³ log(p)/n)^{1/8} → 0; and the remaining area B_{ε_n}^c(β̂_S).
For the neighborhood B_{ε_K}(β̂_S), since q³ log(p)/n → 0, we have ε_K = o(1/√q). Applying Lemma 2, we obtain that, with probability going to 1, for any β_S ∈ B_{ε_K}(β̂_S),

L_n(X_n|β_S) − L_n(X_n|β̂_S) = −(1/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S) + r_n(β_S)‖β_S − β̂_S‖_2²,

where |r_n(β_S)| ≤ C(√q ‖β_S − β̂_S‖_2 ∨ √(q² log(p)/n)) = C√(q² log(p)/n) uniformly, for some constant C > 0. Similarly, we have

exp(nL_n(X_n|β_S)) π(β_S) = exp(nL_n(X_n|β̂_S)) exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) exp(n r_n(β_S)‖β_S − β̂_S‖_2²) π(β_S),

where n r_n(β_S)‖β_S − β̂_S‖_2² ≤ n √(q² log(p)/n) (Kq log(n)/n) ≲ √(q⁴ log²(n) log(p)/n), and thus exp(n r_n ‖β_S − β̂_S‖_2²) = 1 + O(√(q⁴ log²(n) log(p)/n)). Also, for any β_S ∈ B_{ε_K}, by condition (A3), we have

π(β_S) = π(β̂_S) + O(‖β_S − β̂_S‖_2) = π(β̂_S)(1 + O(√(q log(n)/n))).

Since √(q⁴ log²(n) log(p)/n) dominates √(q log(n)/n), we finally have

exp(nL_n(X_n|β_S)) π(β_S) = π(β̂_S) exp(nL_n(X_n|β̂_S) − (n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) (1 + O(√(q⁴ log²(n) log(p)/n))).     (20)
For any β_S ∈ B_{ε_n}(β̂_S) − B_{ε_K}(β̂_S), since ε_n = l_n/√q = o(1/√q), we can apply Lemma 2 again and get |r_n(β_S)| ≤ √q ‖β_S − β̂_S‖_2 → 0. Therefore,

L_n(X_n|β_S) − L_n(X_n|β̂_S) = −(1/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S) + r_n(β_S)‖β_S − β̂_S‖_2² = −(1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S)(1 + o(1)) ≤ −(c_1/4)‖β_S − β̂_S‖²,

where c_1 > 0 is the lower bound of the minimum eigenvalue of Î_S for all S ∈ P(q). Hence, for sufficiently large n, we have

exp(nL_n(X_n|β_S)) π(β_S) ≤ exp(nL_n(X_n|β̂_S)) exp(−(nc_1/4)‖β_S − β̂_S‖²) π(β_S).

For any β_S ∈ B_{ε_n}^c(β̂_S), by statement (i) of Lemma 2, we have

L_n(X_n|β_S) − L_n(X_n|β̂_S) ≤ L(β_S) − L(β_{S,0}) + O(√(q log(p)/n)).

By condition (A6), we also have

L(β_S) − L(β_{S,0}) ≤ −c min(inf_{β_S ∈ B_{ε_n}^c(β̂_S)} ‖β_S − β̂_S‖², 1) ≤ −c l_n²/q.     (21)

Note that the quantity l_n²/q dominates √(q log(p)/n) because l_n = (q³ log(p)/n)^{1/8}. Hence, for any β_S ∈ B_{ε_n}^c(β̂_S), we have

exp(nL_n(X_n|β_S)) π(β_S) ≤ exp(nL_n(X_n|β̂_S)) π(β_S) exp{−n (c l_n²/q)(1 + o(1))} ≤ π(β_S) exp(nL_n(X_n|β̂_S)) exp(−c' n l_n²/q),     (22)

for some generic constant c' > 0.
Now let us calculate m(S) by dividing it into three parts according to the previous partition of the integration region, that is,

m(S) := ∫ exp(nL_n(X_n|β_S)) π(β_S) dβ_S
      = ∫_{β_S ∈ B_{ε_K}(β̂_S)} exp(nL_n(X_n|β_S)) π(β_S) dβ_S + ∫_{β_S ∈ B_{ε_n}(β̂_S) − B_{ε_K}(β̂_S)} exp(nL_n(X_n|β_S)) π(β_S) dβ_S + ∫_{β_S ∈ B_{ε_n}^c(β̂_S)} exp(nL_n(X_n|β_S)) π(β_S) dβ_S     (23)
      = (Int1) + (Int2) + (Int3).
The first term of equation (23) satisfies

Int1 := ∫_{β_S ∈ B_{ε_K}(β̂_S)} exp(nL_n(X_n|β_S)) π(β_S) dβ_S
      ≤ exp(nL_n(X_n|β̂_S)) π(β̂_S)(1 + O(√(q⁴ log²(n) log(p)/n))) ∫_{(β_S−β̂_S)^T Î_S (β_S−β̂_S) ≤ c_2 Kq log(n)/n} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) dβ_S,

where c_2 is the upper bound of the maximum eigenvalue of Î_S. Let ξ = √n Î_S^{1/2}(β_S − β̂_S); then

∫_{(β_S−β̂_S)^T Î_S (β_S−β̂_S) ≤ c_2 Kq log(n)/n} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) dβ_S = (2π/n)^{|S|/2} |Î_S|^{−1/2} ∫_{ξ^Tξ ≤ c_2 Kq log(n)} (2π)^{−|S|/2} exp(−(1/2)ξ^Tξ) dξ ≤ (2π/n)^{|S|/2} |Î_S|^{−1/2},

where the last inequality follows from the fact that (2π)^{−|S|/2} exp(−(1/2)ξ^Tξ) is exactly the density function of N(0, I_{|S|}). Now we obtain the upper bound of Int1:

Int1 ≤ exp(nL_n(X_n|β̂_S)) π(β̂_S)(1 + O(√(q⁴ log²(n) log(p)/n))) (2π/n)^{|S|/2} |Î_S|^{−1/2}.

Similarly, we have

Int1 ≥ exp(nL_n(X_n|β̂_S)) π(β̂_S)(1 − O(√(q⁴ log²(n) log(p)/n))) ∫_{(β_S−β̂_S)^T Î_S (β_S−β̂_S) ≤ c_1 Kq log(n)/n} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) dβ_S,

where c_1 is the lower bound of the minimum eigenvalue of Î_S. Let ξ = √n Î_S^{1/2}(β_S − β̂_S); this time we have

∫_{(β_S−β̂_S)^T Î_S (β_S−β̂_S) ≤ c_1 Kq log(n)/n} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) dβ_S = (2π/n)^{|S|/2} |Î_S|^{−1/2} ∫_{ξ^Tξ ≤ c_1 Kq log(n)} (2π)^{−|S|/2} exp(−(1/2)ξ^Tξ) dξ
      ≥ (2π/n)^{|S|/2} |Î_S|^{−1/2} ∫_{ξ^Tξ ≤ 5|S| log(n)} (2π)^{−|S|/2} exp(−(1/2)ξ^Tξ) dξ ≥ (2π/n)^{|S|/2} |Î_S|^{−1/2} (1 − 1/n^{|S|}),

where the first inequality holds if we choose K > 5/c_1, and the second inequality is based on Lemma 1 of Foygel Barber et al. (2015). Now we obtain the lower bound of Int1:

Int1 ≥ exp(nL_n(X_n|β̂_S)) π(β̂_S)(1 − O(√(q⁴ log²(n) log(p)/n))) (2π/n)^{|S|/2} |Î_S|^{−1/2} (1 − 1/n^{|S|}).

Since √(q⁴ log²(n) log(p)/n) dominates 1/n^{|S|}, we therefore have

Int1 = exp(nL_n(X_n|β̂_S)) π(β̂_S) (2π/n)^{|S|/2} |Î_S|^{−1/2} (1 + O(√(q⁴ log²(n) log(p)/n))).
The second term of equation (23) satisfies

Int2 := ∫_{β_S ∈ B_{ε_n} − B_{ε_K}} exp(nL_n(X_n|β_S)) π(β_S) dβ_S
      ≤ exp(nL_n(X_n|β̂_S)) ∫_{β_S ∈ B_{ε_n} − B_{ε_K}} π(β_S) exp(−(nc_1/4)‖β_S − β̂_S‖²) dβ_S
      ≤ exp(nL_n(X_n|β̂_S)) exp(−(nc_1/4)(Kq log(n)/n)) ∫_{β_S ∈ B_{ε_n} − B_{ε_K}} π(β_S) dβ_S
      ≤ C_5 exp(nL_n(X_n|β̂_S)) n^{−Kqc_1/4}.

The third term of equation (23) satisfies

Int3 := ∫_{β_S ∈ B_{ε_n}^c(β̂_S)} exp(nL_n(X_n|β_S)) π(β_S) dβ_S ≤ exp(nL_n(X_n|β̂_S)) exp(−c' n l_n²/q) ∫ π(β_S) dβ_S ≤ exp(nL_n(X_n|β̂_S)) exp(−c' n l_n²/q).
Combining them together, we have

m(S) = Int1 + Int2 + Int3
     = exp(nL_n(X_n|β̂_S)) π(β̂_S)(2π/n)^{|S|/2}|Î_S|^{−1/2}(1 + O(√(q⁴ log²(n) log(p)/n))) + exp(nL_n(X_n|β̂_S)) O(n^{−Kqc_1/4}) + exp(nL_n(X_n|β̂_S)) O(exp(−c' n l_n²/q))
     = exp(nL_n(X_n|β̂_S)) π(β̂_S)(2π/n)^{|S|/2}|Î_S|^{−1/2} [1 + O(√(q⁴ log²(n) log(p)/n)) + π^{−1}(β̂_S)(n/2π)^{|S|/2}|Î_S|^{1/2} O(n^{−Kqc_1/4}) + π^{−1}(β̂_S)(n/2π)^{|S|/2}|Î_S|^{1/2} O(exp(−c' n l_n²/q))].

By choosing K > 2/c_1, it can be verified that

π^{−1}(β̂_S)(n/2π)^{|S|/2}|Î_S|^{1/2} O(n^{−Kqc_1/4}) = o(√(q⁴ log²(n) log(p)/n)),

and

π^{−1}(β̂_S)(n/2π)^{|S|/2}|Î_S|^{1/2} O(exp(−c' n l_n²/q)) = o(√(q⁴ log²(n) log(p)/n)).

Therefore, we finally have

m(S) = exp(nL_n(X_n|β̂_S)) π(β̂_S)(2π/n)^{|S|/2}|Î_S|^{−1/2}(1 + O(√(q⁴ log²(n) log(p)/n))).
Now we return to the calculation of E_{β_S|X_n,S}[L_n(X_n|β_S)] − L_n(X_n|β̂_S). That is,

∫ P(β_S|X_n, S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S
  = −∫_{B_{ε_K}} P(β_S|X_n, S)(1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S) dβ_S + ∫_{B_{ε_K}} P(β_S|X_n, S) r_n(β_S)‖β_S − β̂_S‖² dβ_S
    + ∫_{B_{ε_n} − B_{ε_K}} P(β_S|X_n, S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S + ∫_{B_{ε_n}^c} P(β_S|X_n, S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S     (24)
  = (I1) + (I2) + (I3) + (I4).
The first component of equation (24) is

(I1) = (1/m(S)) ∫_{B_{ε_K}} exp(nL_n(X_n|β_S)) π(β_S) · (−1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S) dβ_S
     = −(1/m(S)) exp(nL_n(X_n|β̂_S)) π(β̂_S)(1 + O(√(q⁴ log²(n) log(p)/n))) ∫_{B_{ε_K}} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) (1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S) dβ_S.

Let ξ = √n Î_S^{1/2}(β_S − β̂_S); then

∫_{B_{ε_K}} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) (1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S) dβ_S
  ≤ ∫_{(β_S−β̂_S)^T Î_S (β_S−β̂_S) ≤ c_2 Kq log(n)/n} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) (1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S) dβ_S
  = (2π/n)^{|S|/2} |Î_S|^{−1/2} ∫_{ξ^Tξ ≤ c_2 Kq log(n)} (2π)^{−|S|/2} exp(−(1/2)ξ^Tξ) (1/(2n)) ξ^Tξ dξ ≤ (2π/n)^{|S|/2} |Î_S|^{−1/2} (|S|/(2n)).

We also have

∫_{B_{ε_K}} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) (1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S) dβ_S
  ≥ ∫_{(β_S−β̂_S)^T Î_S (β_S−β̂_S) ≤ c_1 Kq log(n)/n} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) (1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S) dβ_S
  = (2π/n)^{|S|/2} |Î_S|^{−1/2} (1/(2n)) ∫_{ξ^Tξ ≤ c_1 Kq log(n)} (2π)^{−|S|/2} exp(−(1/2)ξ^Tξ) ξ^Tξ dξ
  = (2π/n)^{|S|/2} |Î_S|^{−1/2} (1/(2n)) [|S| − ∫_{ξ^Tξ ≥ c_1 Kq log(n)} (2π)^{−|S|/2} exp(−(1/2)ξ^Tξ) ξ^Tξ dξ].

For sufficiently large n, exp((1/4)ξ^Tξ) ≥ ξ^Tξ, so

∫_{ξ^Tξ ≥ c_1 Kq log(n)} (2π)^{−|S|/2} exp(−(1/2)ξ^Tξ) ξ^Tξ dξ ≤ ∫_{ξ^Tξ ≥ c_1 Kq log(n)} (2π)^{−|S|/2} exp(−(1/4)ξ^Tξ) dξ
  ≤ (2π)^{−|S|/2} (4π^{|S|/2}/Γ(|S|/2)) [c_1 Kq log(n)]^{|S|−1} exp(−(1/4)c_1 Kq log(n)) = o(√(q⁴ log²(n) log(p)/n)),

where the second inequality is based on Lemma 2 of Foygel and Drton (2011). Therefore,

∫_{B_{ε_K}} exp(−(n/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S)) (1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S) dβ_S = (2π/n)^{|S|/2} |Î_S|^{−1/2} (|S|/(2n)) (1 + o(√(q⁴ log²(n) log(p)/n))),

and

(I1) = −(1/m(S)) exp(nL_n(X_n|β̂_S)) π(β̂_S)(1 + O(√(q⁴ log²(n) log(p)/n))) (2π/n)^{|S|/2} |Î_S|^{−1/2} (|S|/(2n))
     = −[exp(nL_n(X_n|β̂_S)) π(β̂_S)(1 + O(√(q⁴ log²(n) log(p)/n))) (2π/n)^{|S|/2} |Î_S|^{−1/2} (|S|/(2n))] / [exp(nL_n(X_n|β̂_S)) π(β̂_S)(2π/n)^{|S|/2} |Î_S|^{−1/2}(1 + O(√(q⁴ log²(n) log(p)/n)))]
     = −(|S|/(2n))(1 + O(√(q⁴ log²(n) log(p)/n))).
For the second component of equation (24), we can show that

|I2| = ∫_{B_{ε_K}} P(β_S|X_n, S) r_n(β_S)‖β_S − β̂_S‖² dβ_S ≤ C_6 (q log(n)/n) √(q² log(p)/n) ∫_{B_{ε_K}} P(β_S|X_n, S) dβ_S = (C_6/n) √(q⁴ log²(n) log(p)/n) (1 + o(1)),

for some generic constant C_6.

For any β_S ∈ B_{ε_n}(β̂_S) − B_{ε_K}(β̂_S),

L_n(X_n|β_S) − L_n(X_n|β̂_S) = −(1/2)(β_S − β̂_S)^T Î_S (β_S − β̂_S) + r_n(β_S)‖β_S − β̂_S‖_2² = −(1/2)(β_S − β̂_S)' Î_S (β_S − β̂_S)(1 + o(1)) ≥ −c_2‖β_S − β̂_S‖²,

where c_2 > 0 is the upper bound of the maximum eigenvalue of Î_S for all S ∈ P(q).
Then, for the third component of equation (24), we have

|I3| = |∫_{B_{ε_n} − B_{ε_K}} P(β_S|X_n, S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S|
     ≤ sup_{β_S ∈ B_{ε_n} − B_{ε_K}} |L_n(X_n|β_S) − L_n(X_n|β̂_S)| ∫_{B_{ε_n} − B_{ε_K}} P(β_S|X_n, S) dβ_S
     ≤ c_2 (l_n²/q) (1/m(S)) ∫_{B_{ε_n} − B_{ε_K}} exp(nL_n(X_n|β_S)) π(β_S) dβ_S
     = c_2 (l_n²/q) π^{−1}(β̂_S)(n/2π)^{|S|/2}|Î_S|^{1/2} O(n^{−Kqc_1/4}) (1 + O(√(q⁴ log²(n) log(p)/n)))^{−1}.     (25)

By choosing K > 4/c_1, it can be verified that

π^{−1}(β̂_S)(n/2π)^{|S|/2}|Î_S|^{1/2} O(n^{−Kqc_1/4}) = o((1/n) √(q⁴ log²(n) log(p)/n)).

Therefore,

|I3| = |∫_{B_{ε_n} − B_{ε_K}} P(β_S|X_n, S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S| = o((1/n) √(q⁴ log²(n) log(p)/n)).
To deal with the last component of equation (24), first recall that, for β_S ∈ B_{ε_n}^c,

nL_n(X_n|β_S) − nL_n(X_n|β̂_S) ≤ −c' n l_n²/q.

Therefore, for sufficiently large n, we have

exp(0.5|nL_n(X_n|β_S) − nL_n(X_n|β̂_S)|) > |nL_n(X_n|β_S) − nL_n(X_n|β̂_S)|.

Now the last component of equation (24) becomes

|I4| = |∫_{B_{ε_n}^c} P(β_S|X_n, S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S|
     = (1/m(S)) |∫_{B_{ε_n}^c} exp(nL_n(X_n|β_S)) π(β_S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S|
     = (exp(nL_n(X_n|β̂_S))/(n m(S))) |∫_{B_{ε_n}^c} π(β_S)[exp(nL_n(X_n|β_S) − nL_n(X_n|β̂_S))](nL_n(X_n|β_S) − nL_n(X_n|β̂_S)) dβ_S|
     ≤ (exp(nL_n(X_n|β̂_S))/(n m(S))) ∫_{B_{ε_n}^c} π(β_S)[exp(0.5nL_n(X_n|β_S) − 0.5nL_n(X_n|β̂_S))] dβ_S
     ≤ (exp(nL_n(X_n|β̂_S))/(n m(S))) exp(−c' n l_n²/(2q)) ∫_{B_{ε_n}^c} π(β_S) dβ_S
     ≤ exp(−c' n l_n²/(2q)) / [n π(β̂_S)(2π/n)^{|S|/2}|Î_S|^{−1/2}(1 + O(√(q⁴ log²(n) log(p)/n)))].

Recall that we have already verified that

π^{−1}(β̂_S)(n/2π)^{|S|/2}|Î_S|^{1/2} O(exp(−c' n l_n²/q)) = o(√(q⁴ log²(n) log(p)/n)).

Therefore, we finally have

|∫_{B_{ε_n}^c} P(β_S|X_n, S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S| = o((1/n) √(q⁴ log²(n) log(p)/n)).
Combining all the steps above, we conclude that

E_{β_S|X_n,S}[L_n(X_n|β_S)] − L_n(X_n|β̂_S) = ∫ P(β_S|X_n, S)(L_n(X_n|β_S) − L_n(X_n|β̂_S)) dβ_S = −(|S|/(2n))(1 + O(√(q⁴ log²(n) log(p)/n))).
A.4 Proof of Lemma 3
For sufficiently large n, with probability going to 1, all statements in Lemma 1 and Lemma 2 hold, and the following proof is based on those statements. Without loss of generality, for any S* ⊂ S, we can extend β_{S*} to a vector in R^{|S|} by adding zeros for each j ∈ S\S*. We denote such a vector by β_{S*,S}.

Applying Lemma 1, for any S* ⊂ S with |S| ≤ q, we have

‖β̂_S − β_{S,0}‖_2 = O((q log(p)/n)^{1/4}) = o(1/√q),   and   ‖β̂_{S*} − β_{S*,0}‖_2 = O((q log(p)/n)^{1/4}) = o(1/√q).

Hence,

‖β̂_S − β̂_{S*,S}‖_2 = o(1/√q).
Applying Lemma 2, we obtain

n(L_n(X_n|β̂_S) − L_n(X_n|β̂_{S*})) = (1/2)(β̂_S − β̂_{S*,S})' n Î_S(β̂_S)(β̂_S − β̂_{S*,S}) + n O(√q ‖β̂_S − β̂_{S*,S}‖³ ∨ √(q² log(p)/n) ‖β̂_S − β̂_{S*,S}‖²).     (26)

It follows from Lemma 1 that the smallest eigenvalue of Î_S(β̂_S) is uniformly bounded away from 0. Therefore, it is easy to verify that

n O(√q ‖β̂_S − β̂_{S*,S}‖³ ∨ √(q² log(p)/n) ‖β̂_S − β̂_{S*,S}‖²) = o((1/2)(β̂_S − β̂_{S*,S})' n Î_S(β̂_S)(β̂_S − β̂_{S*,S})),

which implies that the residual term of (26) does not affect our results asymptotically and can thus be ignored.

From the proof of statement (iii) of Lemma 1, we know that

S_n(β̂_S) − S_n(β_{S,0}) = −Î_S(β̂_S)(β̂_S − β_{S,0}) + O(√(q² log(p)/n)) ‖β̂_S − β_{S,0}‖_2.     (27)

Since the score function S_n(β_S) := ∂L_n(X_n|β_S)/∂β_S equals 0 at β_S = β̂_S,

−S_n(β_{S,0}) = −Î_S(β̂_S)(β̂_S − β_{S,0}) + O(√(q² log(p)/n)) ‖β̂_S − β_{S,0}‖_2.

Recall that the minimum eigenvalue of Î_S(β̂_S) is uniformly bounded away from 0, so

O(√(q² log(p)/n)) ‖β̂_S − β_{S,0}‖_2 = o(Î_S(β̂_S)(β̂_S − β_{S,0})),

which implies that the residual term of (27) does not affect our results asymptotically and can be ignored.
In the following, for notational simplicity, we write Î_S(β̂_S) as Î_S and I_S(β_{S,0}) as I_S. Then we have

√n(β̂_S − β_{S,0}) = (1 + o(1)) √n Î_S^{−1} S_n(β_{S,0}) = (1 + o(1)) √n I_S^{−1} S_n(β_{S,0}),     (28)

which holds uniformly for all S ⊃ S* with S ∈ P(q). The same argument holds for S* as well, i.e.,

√n(β̂_{S*} − β_{S*,0}) = (1 + o(1)) √n Î_{S*}^{−1} S_n(β_{S*,0}) = (1 + o(1)) √n I_{S*}^{−1} S_n(β_{S*,0}).     (29)

For a vector x, denote the subvector of x indexed by S as x^S. Without loss of generality, write x = (x^{S*}, x^{S\S*}) for any x ∈ R^{|S|}. By combining equations (28) and (29), we have

√n(β̂_S − β̂_{S*,S}) = (1 + o(1)) [I_S^{−1} √n S_n(β_{S,0}) − (I_{S*}^{−1} √n S_n(β_{S*,0}), 0)].     (30)

Denote v = √n S_n(β_{S,0}) and v = (v^{S*}, v^{S\S*}) := (V_1, V_2). Then v has the following properties:

E(v) = 0,   E[vv'] = nI_S,   E[v^{S*}(v^{S*})'] = nI_{S*}.
Therefore, equation (30) can be re-expressed as

√n(β̂_S − β̂_{S*,S}) = (1 + o(1)) I_S^{−1} [(V_1, V_2) − I_S (I_{S*}^{−1} V_1, 0)] = (1 + o(1)) I_S^{−1} (0, V_2 − E[V_2 V_1'] E[V_1 V_1']^{−1} V_1).

Going back to the major component of equation (26): for a matrix M, define M(S_1, S_2) as the submatrix of M indexed by (S_1, S_2). Then,

(1/2)(β̂_S − β̂_{S*,S})' n Î_S(β̂_S)(β̂_S − β̂_{S*,S}) = (1 + o(1))(1/2)(β̂_S − β̂_{S*,S})' n I_S (β̂_S − β̂_{S*,S})
  = ((1 + o(1))/2)(0, V_2 − E[V_2 V_1'] E[V_1 V_1']^{−1} V_1)' I_S^{−1} I_S I_S^{−1} (0, V_2 − E[V_2 V_1'] E[V_1 V_1']^{−1} V_1)
  = ((1 + o(1))/2)(V_2 − E[V_2 V_1'] E[V_1 V_1']^{−1} V_1)' Σ(S, S*)(V_2 − E[V_2 V_1'] E[V_1 V_1']^{−1} V_1),

where Σ(S, S*) := I_S^{−1}(S\S*, S\S*) is the submatrix of I_S^{−1} indexed by (S\S*, S\S*). Define

H(S, S*) := Σ(S, S*)^{1/2}(V_2 − E[V_2 V_1'] E[V_1 V_1']^{−1} V_1).

Then

(1/2)(β̂_S − β̂_{S*,S})' n Î_S(β̂_S)(β̂_S − β̂_{S*,S}) = ((1 + o(1))/2) H'(S, S*) H(S, S*).
To prove the main result, we only need to show that, for sufficiently large n and any η > 0 and ζ > 1/5,

sup_{S⊃S*, S∈P(q)} H'(S, S*) H(S, S*) / |S\S*| ≤ 2(1 + η) log(n^ζ p)

holds with probability going to 1. Define A_{S,n} := √(2(1 + η) log(n^ζ p) |S\S*|); then

Pr(H'(S, S*) H(S, S*) / |S\S*| ≥ 2(1 + η) log(n^ζ p)) = Pr(‖H(S, S*)‖_2 ≥ A_{S,n}).
First, for any u ∈ R^{|S\S*|} with ‖u‖_2 = 1, we consider bounding Pr(u'H(S, S*) ≥ √(2(1 + ε)|S\S*| log(n^ζ p))). Rewrite

u'H(S, S*) := (1/√n) Σ_{i=1}^{n} u' Σ(S, S*)^{1/2} ((1/√n) V_{2,i} − E[V_2 V_1'] E[V_1 V_1']^{−1} (1/√n) V_{1,i}),

where (1/√n) V_{1,i} = ∂ log f(x_i|β_{S,0})/∂β_{S*} and (1/√n) V_{2,i} = ∂ log f(x_i|β_{S,0})/∂β_{S\S*}. For simplicity, we further define

Y_i(S) := u' Σ(S, S*)^{1/2} ((1/√n) V_{2,i} − E[V_2 V_1'] E[V_1 V_1']^{−1} (1/√n) V_{1,i}).

Then it becomes

u'H(S, S*) = (1/√n) Σ_{i=1}^{n} Y_i(S).

It is easy to see that E[Y_i(S)] = 0. Moreover, since

Σ(S, S*) = n (E[V_2 V_2'] − E[V_2 V_1'] E[V_1 V_1']^{−1} E[V_1 V_2'])^{−1},

we have

E[((1/√n) V_{2,i} − E[V_2 V_1'] E[V_1 V_1']^{−1} (1/√n) V_{1,i})((1/√n) V_{2,i} − E[V_2 V_1'] E[V_1 V_1']^{−1} (1/√n) V_{1,i})'] = Σ(S, S*)^{−1},

which implies

E[Y_i(S)²] = u' I_{|S\S*|} u = 1.

By condition (A2), |Y_i(S)| ≤ c_3^{1/2} C_2 √(q³) := K √(q³), where c_3 is a generic constant that bounds the maximum eigenvalue of Σ(S, S*) from above; by condition (A1), such a constant exists.
By applying Bernstein's inequality, we have

Pr(|Σ_{i=1}^{n} Y_i| > x) ≤ 2 exp(−(1/2) x² / (n + K√(q³) x/3)),

for any x > 0. Plugging in x = √n √(2(1 + ε)|S\S*| log(n^ζ p)), we get

Pr(u'H(S, S*) ≥ √(2(1 + ε)|S\S*| log(n^ζ p))) = Pr(|Σ_{i=1}^{n} Y_i| > x) ≤ 2 exp(−(1/2) · 2(1 + ε) n |S\S*| log(n^ζ p) / (n + K√(q³) x)).

By the assumption that q⁵ log(p)/n → 0, we also have

K√(q³) x ≤ K√n √(2(1 + ε) q⁴ log(n^ζ p)) = o(n).

Hence, for sufficiently large n,

Pr(u'H(S, S*) ≥ √(2(1 + ε)|S\S*| log(n^ζ p))) ≤ 2 exp(−(1 + ε/2)|S\S*| log(n^ζ p)).
To link this inequality to our final objective, $\Pr(\|H(S, S^*)\|_2 \ge A_{S,n})$, we rely on Lemma 2 in Chen and Chen (2012), which states that for a given $\delta_n > 0$, there exists a finite set of unit vectors $U_S(\delta_n) \subset \mathbb{R}^{S\setminus S^*}$ such that $\|v\|_2 \le (1+\delta_n)\max_{u\in U_S(\delta_n)} u'v$ holds for all $v \in \mathbb{R}^{S\setminus S^*}$. Since $U_S(\delta_n)$ is a finite set, following the proof of Lemma 2 of Chen and Chen (2012), its cardinality is bounded by $|U_S(\delta_n)| \le K(1/\delta_n)^{|S\setminus S^*|}$ for some generic constant $K > 0$.
Hence, for a fixed model $S$ with $S \supset S^*$ and $|S| = d \le q$, we can pick $\delta_n$ such that $(1+\delta_n)\sqrt{1+\epsilon} = \sqrt{1+\eta}$.
$$P\{\|H(S, S^*)\|_2 \ge A_{S,n}\} \le \sum_{u\in U_S(\delta_n)} P\left(u'H(S, S^*) \ge \sqrt{2(1+\epsilon)|S\setminus S^*|\log(n^\zeta p)}\right) \le |U_S(\delta_n)|\,2\exp\left(-(1+\epsilon/2)|S\setminus S^*|\log(n^\zeta p)\right).$$
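To spell out the first inequality above (a routine covering-set step, added for completeness): by Lemma 2 of Chen and Chen (2012), $\|H(S, S^*)\|_2 \le (1+\delta_n)\max_{u\in U_S(\delta_n)} u'H(S, S^*)$, so the event $\{\|H(S, S^*)\|_2 \ge A_{S,n}\}$ implies
$$\max_{u\in U_S(\delta_n)} u'H(S, S^*) \ \ge\ \frac{A_{S,n}}{1+\delta_n} \ =\ \frac{\sqrt{2(1+\eta)|S\setminus S^*|\log(n^\zeta p)}}{1+\delta_n} \ =\ \sqrt{2(1+\epsilon)|S\setminus S^*|\log(n^\zeta p)},$$
by the choice $(1+\delta_n)\sqrt{1+\epsilon} = \sqrt{1+\eta}$; a union bound over the finite set $U_S(\delta_n)$ then gives the display.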
Then
\begin{align*}
&P\{\exists\, S \text{ with } S\supset S^* \text{ and } |S|\le q,\ \|H(S, S^*)\|_2 \ge A_{S,n}\} \\
&\le \sum_{S\supset S^*,\,|S|\le q} P\{\|H(S, S^*)\|_2 \ge A_{S,n}\} \\
&\le \sum_{S\supset S^*,\,|S|\le q} |U_S(\delta_n)|\,2\exp\left(-(1+\epsilon/2)|S\setminus S^*|\log(n^\zeta p)\right) \\
&\le \sum_{d'=1}^{q-|S^*|} \binom{p}{d'} |U_S(\delta_n)|\,2\exp\left(-(1+\epsilon/2)d'\log(n^\zeta p)\right) \\
&\le \sum_{d'=1}^{q-|S^*|} 2\exp\left(d'\log(p) + \log(|U_S(\delta_n)|) - (1+\epsilon/2)d'\log(n^\zeta p)\right) \\
&\le \sum_{d'=1}^{q-|S^*|} 2\exp\left(-(\epsilon/2)d'\log(p) - d'\log(n^\zeta) + \log(|U_S(\delta_n)|)\right).
\end{align*}
Since $\log(|U_S(\delta_n)|) \le \log(K) + |S\setminus S^*|\log\Big(\frac{\sqrt{|S\setminus S^*|}}{\delta_n}\Big)$ for any fixed $\delta_n$, we have
$$\exp\left(-(\epsilon/2)d'\log(p) - d'\log(n^\zeta) + \log(|U_S(\delta_n)|)\right) \le C\exp\left(-d'\big((\epsilon/2)\log(p) + \log(n^\zeta) - \log(d')\big)\right),$$
for some constant $C > 0$. Since $\frac{q^5\log(p)}{n} \to 0$, we have $d' \le q = o(n^{1/5})$, and hence $\log(d') \le \frac{1}{5}\log(n)$ for sufficiently large $n$. So if $\zeta > \frac{1}{5}$, then
$$C\exp\left(-d'\big((\epsilon/2)\log(p) + \log(n^\zeta) - \log(d')\big)\right) \le C\exp\left(-d'\log(n^{\zeta-\frac15})\right).$$
Therefore,
\begin{align*}
&P\{\exists\, S \text{ with } S\supset S^* \text{ and } |S|\le q,\ \|H(S, S^*)\|_2 \ge A_{S,n}\} \\
&\le \sum_{d'=1}^{q-|S^*|} 2\exp\left(-(\epsilon/2)d'\log(p) - d'\log(n^\zeta) + \log(|U_S(\delta_n)|)\right) \\
&\le \sum_{d'=1}^{\infty} 2C\exp\left(-d'\log(n^{\zeta-\frac15})\right) \le Cn^{-(\zeta-\frac15)} = o(1).
\end{align*}
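For the last step, the geometric series bound can be spelled out as follows (a routine calculation, added here for completeness): with $r := n^{-(\zeta-\frac15)} < 1$ for large $n$,
$$\sum_{d'=1}^{\infty} 2C\exp\left(-d'\log(n^{\zeta-\frac15})\right) = \sum_{d'=1}^{\infty} 2C\,r^{d'} = \frac{2C\,r}{1-r} \le 4C\,n^{-(\zeta-\frac15)}$$
for all $n$ large enough that $r \le 1/2$; absorbing the factor $4$ into the generic constant $C$ gives the stated bound.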
A.5 Proof of Theorem 2
The proof of Theorem 2 is based on the proof of Theorem 2 of Foygel and Drton (2011). For sufficiently
large n, with probability going to 1, all statements in Lemma 1, Lemma 2 and Lemma 3 hold.
Let S∗ be the true model and define A0 = {S : S∗ ⊂ S, |S| ≤ q} and A1 = {S : S∗ ⊄ S, |S| ≤ q}. We then consider two cases.
Case 1: S ∈ A0 .
In this case, we have
\begin{align*}
&\mathrm{AEBIC}(S) - \mathrm{AEBIC}(S^*) \\
&= -2nE_{\beta_S|X_n,S}[L_n(X_n|\beta_S)] + |S|\log(n) + 2\gamma|S|\log(p) \\
&\quad - \left(-2nE_{\beta_{S^*}|X_n,S^*}[L_n(X_n|\beta_{S^*})] + |S^*|\log(n) + 2\gamma|S^*|\log(p)\right) \\
&= -2nE_{\beta_S|X_n,S}[L_n(X_n|\beta_S)] + 2nE_{\beta_{S^*}|X_n,S^*}[L_n(X_n|\beta_{S^*})] + |S\setminus S^*|\log(n) + 2\gamma|S\setminus S^*|\log(p).
\end{align*}
From Theorem 1, we know
$$E_{\beta_S|X_n,S}[L_n(X_n|\beta_S)] = L_n(X_n|\hat\beta_S) - \frac{|S|}{2n} + O\left(\frac{|S|}{n}\sqrt{\frac{q^4\log^2(n)\log(p)}{n}}\right),$$
and
$$E_{\beta_{S^*}|X_n,S^*}[L_n(X_n|\beta_{S^*})] = L_n(X_n|\hat\beta_{S^*}) - \frac{|S^*|}{2n} + O\left(\frac{|S^*|}{n}\sqrt{\frac{q^4\log^2(n)\log(p)}{n}}\right).$$
Therefore,
$$\mathrm{AEBIC}(S) - \mathrm{AEBIC}(S^*) = |S\setminus S^*| + O\left(q\sqrt{\frac{q^4\log^2(n)\log(p)}{n}}\right) + |S\setminus S^*|\log(n) + 2\gamma|S\setminus S^*|\log(p) - 2n\big(L_n(X_n|\hat\beta_S) - L_n(X_n|\hat\beta_{S^*})\big).$$
From Lemma 3, we also know
2n(Ln (Xn |β̂S ) − Ln (Xn |β̂S ∗ )) ≤ 2(1 + η)|S\S ∗ | log(nζ p),
for any small η and sufficiently large n. Hence,
$$\mathrm{AEBIC}(S) - \mathrm{AEBIC}(S^*) \ge |S\setminus S^*| + O\left(q\sqrt{\frac{q^4\log^2(n)\log(p)}{n}}\right) + |S\setminus S^*|\log(n) + 2\gamma|S\setminus S^*|\log(p) - 2(1+\eta)|S\setminus S^*|\log(n^\zeta p).$$
For $\gamma > 1$, pick $\eta = \min(\frac{\gamma-1}{2}, \xi)$, where $\xi > 0$ is a small constant; then $2\gamma|S\setminus S^*|\log(p) - 2(1+\eta)|S\setminus S^*|\log(p) \ge \frac{\gamma-1}{2}|S\setminus S^*|\log(p)$. In addition, it is obvious that $\log(p)$ dominates $O\big(q\sqrt{\frac{q^4\log^2(n)\log(p)}{n}}\big)$, as $\frac{q^6\log^2(n)}{n\log(p)} \to 0$. Hence, for sufficiently large $n$, we have
$$2\gamma|S\setminus S^*|\log(p) - 2(1+\eta)|S\setminus S^*|\log(p) + O\left(q\sqrt{\frac{q^4\log^2(n)\log(p)}{n}}\right) > 0.$$
Choose $\zeta \in (\frac15, \frac{1}{2+2\xi})$; for sufficiently large $n$, we also have
$$|S\setminus S^*|(1+\log(n)) > 2(1+\eta)|S\setminus S^*|\log(n^\zeta).$$
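To verify this last display (a short check, added for completeness): with $\eta \le \xi$ and $\zeta < \frac{1}{2+2\xi}$,
$$2(1+\eta)\log(n^\zeta) = 2(1+\eta)\,\zeta\log(n) \le \frac{2(1+\xi)}{2+2\xi}\log(n) = \log(n) < 1 + \log(n),$$
and multiplying both sides by $|S\setminus S^*|$ gives the inequality.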
Combining these results, we have proved AEBIC(S) − AEBIC(S∗) > 0 for sufficiently large n.
Case 2: S ∈ A1 .
Let S′ = S ∪ S∗ and let β̂S,S′ denote the vector corresponding to model S′ that is generated by augmenting β̂S with zeros in S′\S. Following a procedure similar to that in Case 1, we have
\begin{align}
\mathrm{AEBIC}(S) - \mathrm{AEBIC}(S^*) &= -2nE_{\beta_S|X_n,S}[L_n(X_n|\beta_S)] + 2nE_{\beta_{S^*}|X_n,S^*}[L_n(X_n|\beta_{S^*})] + (|S|-|S^*|)\log(n) + 2\gamma(|S|-|S^*|)\log(p) \nonumber\\
&= -2n\big(L_n(X_n|\hat\beta_S) - L_n(X_n|\hat\beta_{S^*})\big) + (|S|-|S^*|) + (|S|-|S^*|)\log(n) \nonumber\\
&\quad + 2\gamma(|S|-|S^*|)\log(p) + O\left(q\sqrt{\frac{q^4\log^2(n)\log(p)}{n}}\right) \nonumber\\
&\ge -2n\big(L_n(X_n|\hat\beta_S) - L_n(X_n|\hat\beta_{S^*})\big) - q\big(\log(n) + 2\gamma\log(p) + 1 + o(1)\big) \nonumber\\
&\ge -2n\big(L_n(X_n|\hat\beta_{S,S'}) - L_n(X_n|\hat\beta_{S^*})\big) - q\big(\log(n) + 2\gamma\log(p) + 2\big). \tag{31}
\end{align}
By optimality, $L_n(X_n|\hat\beta_{S^*}) \ge L_n(X_n|\beta_{S^*,0}) = L_n(X_n|\beta_{S',0})$, since $\beta_{S^*,0}$ equals $\beta_{S',0}$ up to components of zeros in $S'\setminus S^*$. Then
−2n(Ln (Xn |β̂S,S ′ ) − Ln (Xn |β̂S ∗ )) ≥ −2n(Ln (Xn |β̂S,S ′ ) − Ln (Xn |βS ′ ,0 )).
From the proof of Lemma 1, we have
$$|L_n(X_n|\hat\beta_S) - L(\hat\beta_S)| = O\left(\sqrt{\frac{q\log(p)}{n}}\right), \qquad |L_n(X_n|\beta_{S^*,0}) - L(\beta_{S^*,0})| = O\left(\sqrt{\frac{q\log(p)}{n}}\right).$$
By conditions (A6) and (A7), we have
$$L(\beta_{S',0}) - L(\hat\beta_{S,S'}) \ge \min\big(1, c\|\hat\beta_{S,S'} - \beta_{S',0}\|_2^2\big) \ge \min\big(1, c[\min_j|\beta_{S^*,0,j}|]^2\big) \succ \min\Big(1, c\Big[\Big(\frac{q\log(p)}{n}\Big)^{\frac14}\Big]^2\Big) = c\sqrt{\frac{q\log(p)}{n}}.$$
Combining these bounds, we obtain
\begin{align*}
&-2n\big(L_n(X_n|\hat\beta_{S,S'}) - L_n(X_n|\beta_{S',0})\big) \\
&\ge -2n\Big(|L_n(X_n|\hat\beta_{S,S'}) - L(\hat\beta_{S,S'})| + |L_n(X_n|\beta_{S',0}) - L(\beta_{S',0})| - \big(L(\beta_{S',0}) - L(\hat\beta_{S,S'})\big)\Big) \\
&\succ n\sqrt{\frac{q\log(p)}{n}} = \sqrt{nq\log(p)}.
\end{align*}
Back to inequality (31): for sufficiently large $n$, we have
$$-2n\big(L_n(X_n|\hat\beta_{S,S'}) - L_n(X_n|\hat\beta_{S^*})\big) - q\big(\log(n) + 2\gamma\log(p) + 2\big) > 0,$$
since the assumption $\frac{q\log(p)}{n} \to 0$ leads to $\frac{\sqrt{nq\log(p)}}{q\log(p)} \to \infty$. Hence, AEBIC(S) > AEBIC(S∗) holds.
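Remark (illustration only, not part of the proof). To make the quantity manipulated above concrete, the following is a minimal sketch of how AEBIC(S) could be approximated from MCMC output for a fixed model S, using the definition AEBIC(S) = −2nE_{βS|Xn,S}[Ln(Xn|βS)] + |S| log(n) + 2γ|S| log(p) with the posterior expectation replaced by an average over posterior draws. The averaged log-likelihood and the draws are assumed to be supplied by the user, and the function and variable names are illustrative, not part of the paper's software.

import numpy as np

def aebic(avg_loglik, beta_draws, n, p, gamma):
    """Monte Carlo approximation of AEBIC(S) for a fixed model S.

    avg_loglik : callable; maps a parameter vector beta_S to the averaged
                 log-likelihood L_n(X_n | beta_S) = (1/n) * sum_i log f(x_i | beta_S).
    beta_draws : numpy array of shape (T, |S|); posterior draws of beta_S given model S.
    n, p       : sample size and total number of candidate covariates.
    gamma      : EBIC-type penalty parameter (gamma > 1 in Theorem 2).
    """
    # E_{beta_S | X_n, S}[L_n(X_n | beta_S)], approximated by averaging over the MCMC draws
    post_mean_loglik = np.mean([avg_loglik(b) for b in beta_draws])
    model_size = beta_draws.shape[1]  # |S|
    return (-2.0 * n * post_mean_loglik
            + model_size * np.log(n)
            + 2.0 * gamma * model_size * np.log(p))

Candidate models would then be compared by their AEBIC values, the model with the smallest value being selected, mirroring the comparison of AEBIC(S) with AEBIC(S∗) carried out in the proof.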
References
Akaike, H. (1974), “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, 19, 716–723.
Ando, T. (2010), Bayesian Model Selection and Statistical Modeling, New York: Chapman & Hall.
Bickel, P., Ritov, Y., and Tsybakov, A. (2009), “Simultaneous analysis of Lasso and Dantzig selector,”
Annals of Statistics, 37, 1705–1732.
Candes, E. and Tao, T. (2007), “The Dantzig selector: Statistical estimation when p is much larger than
n,” Ann. Statist., 35, 2313–2351.
Chen, J. and Chen, Z. (2008), “Extended Bayesian information criteria for model selection with large
model spaces,” Biometrika, 95, 759–771.
— (2012), “Extended BIC for small-n-large-p sparse GLM,” Statistica Sinica, 22, 555–574.
Chen, Z. and Luo, S. (2013), “Selection consistency of EBIC for GLIM with non-canonical links and
diverging number of parameters,” Statistics and Its Interface, 6, 275–284.
Efron, B. (2009), “Empirical Bayes Estimates for Large-Scale Prediction Problems,” Journal of the American Statistical Association, 104, 1015–1028.
Fan, J. and Li, R. (2001), “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties,” Journal of the American Statistical Association, 96, 1348–1360.
Foygel, R. and Drton, M. (2011), “Bayesian model choice and information criteria in sparse generalized
linear models,” ArXiv e-prints.
Foygel Barber, R., Drton, M., and Tan, K. M. (2015), “Laplace Approximation in High-dimensional
Bayesian Regression,” ArXiv e-prints.
Ge, D., Wang, Z., Ye, Y., and Yin, H. (2016), “Strong NP-hardness result for regularized Lq -minimization
problems with concave penalty functions,” arXiv:1501.00622v3.
Green, P. (1995), “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,” Biometrika, 82, 711–732.
Haughton, D. M. A. (1988), “On the Choice of a Model to Fit Data from an Exponential Family,” Ann.
Statist., 16, 342–355.
Huo, X. and Chen, J. (2010), “Complexity of penalized likelihood estimation,” Journal of Statistical
Computation and Simulation, 80, 747–759.
Jiang, W. (2007), “Bayesian variable selection for high dimensional generalized linear models: Convergence
rates of the fitted densities,” Ann. Statist., 35, 1487–1511.
Johnson, V. E. and Rossell, D. (2012), “Bayesian Model Selection in High-Dimensional Settings,” Journal
of the American Statistical Association, 107, 649–660.
Kass, R. and Raftery, A. (1995), “Bayes factors and model uncertainty,” Journal of the American Statistical
Association, 90, 773–795.
Liang, F., Liu, C., and Carroll, R. (2007), “Stochastic Approximation in Monte Carlo Computation,”
Journal of the American Statistical Association, 102, 305–320.
Liang, F., Song, Q., and Yu, K. (2013), “Bayesian Subset Modeling for High-Dimensional Generalized
Linear Models,” Journal of the American Statistical Association, 2013.
McCullagh, P. and Nelder, J. A. (1989), Generalized linear models (Second edition), London: Chapman &
Hall.
Nishii, R. (1984), “Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression,”
Ann. Statist., 12, 758–765.
Schwarz, G. (1978), “Estimating the Dimension of a Model,” Ann. Statist., 6, 461–464.
Shin, M., Bhattacharya, A., and Johnson, V. E. (2015), “Scalable Bayesian Variable Selection Using
Nonlocal Prior Densities in Ultrahigh-Dimensional Settings,” ArXiv e-prints.
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A.,
D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W., Golub, T. R., and Sellers, W. R.
(2002), “Gene expression correlates of clinical prostate cancer behavior,” Cancer Cell, 1, 203 – 209.
Song, Q. and Liang, F. (2015a), “High-Dimensional Variable Selection With Reciprocal L1-Regularization,”
Journal of the American Statistical Association, 110, 1607–1620.
— (2015b), “A Split-and-Merge Bayesian Variable Selection Approach for Ultra-high dimensional Regression,” Journal of the Royal Statistical Society, Series B, 77, 947–972.
Spiegelhalter, D., Best, N., Carlin, B., and van der Linde, A. (2014), “The deviance information criterion: 12
years on,” Journal of the Royal Statistical Society, Series B, 76, 485–493.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), “Bayesian measures of model
complexity and fit,” Journal of the Royal Statistical Society, Series B, 64, 583–639.
Tibshirani, R. (1996), “Regression Shrinkage and Selection Via the Lasso,” Journal of the Royal Statistical
Society, Series B, 58, 267–288.
Wahba, G. and Craven, P. (1979), “Smoothing Noisy Data with Spline Functions. Estimating the Correct
Degree of Smoothing by the Method of Generalized Cross-Validation.” Numerische Mathematik, 31,
377–404.
Watanabe, S. (2010), “Asymptotic equivalence of Bayes cross validation and widely applicable information
criterion in singular learning theory,” Journal of Machine Learning Research, 11, 3571–3594.
Zhang, C.-H. (2010), “Nearly unbiased variable selection under minimax concave penalty,” Ann. Statist.,
38, 894–942.
Zhao, P. and Yu, B. (2006), “On model selection consistency of Lasso,” Journal of Machine Learning
Research, 7, 2541–2563.
Zou, H. and Hastie, T. (2005), “Regularization and Variable Selection via the Elastic Net,” Journal of the
Royal Statistical Society. Series B (Statistical Methodology), 67, 301–320.