Abstract
Varying coefficient models are flexible and interpretable, and they are widely used in data analysis. Although feature screening procedures have been proposed for ultra-high dimensional varying coefficient models, none of these procedures includes structure identification. That is, existing feature screening procedures for varying coefficient models do not tell us which of the selected covariates have constant coefficients and which have non-constant coefficients. Hence, these procedures do not explicitly select partially linear varying coefficient models, which are much simpler than general varying coefficient models. Motivated by this issue, we propose a forward selection procedure for simultaneous feature screening and structure identification in varying coefficient models. Unlike existing feature screening procedures, our method classifies all covariates into three groups: covariates with constant coefficients, covariates with non-constant coefficients, and covariates with zero coefficients. Thus, our procedure can explicitly select partially linear varying coefficient models. Our procedure selects covariates sequentially until the extended BIC (EBIC) increases. We establish screening consistency for our method under some conditions. Numerical studies and real data examples support the utility of our procedure.
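The selection loop described in the abstract can be sketched in code. The following Python sketch is only a schematic illustration, not the paper's implementation: it uses a polynomial basis as a stand-in for the spline basis, plain least squares, and the EBIC form of Chen and Chen (2008); the function names (`ebic`, `rss`, `forward_select`) and the simplified candidate search are illustrative assumptions.

```python
import numpy as np

def ebic(rss_val, df, n, p, gamma=1.0):
    # EBIC (Chen and Chen, 2008): n*log(RSS/n) + df*(log n + 2*gamma*log p)
    return n * np.log(rss_val / n) + df * (np.log(n) + 2.0 * gamma * np.log(p))

def rss(W, Y):
    # residual sum of squares of the least-squares fit of Y on the columns of W
    beta = np.linalg.lstsq(W, Y, rcond=None)[0]
    return float(np.sum((Y - W @ beta) ** 2))

def forward_select(Y, X, Z, L=4, gamma=1.0):
    """Greedily add covariates, classifying each one as having a constant
    coefficient ('c') or a varying coefficient ('v'); stop once EBIC rises."""
    n, p = X.shape
    basis = np.vander(Z, L, increasing=True)   # 1, z, ..., z^{L-1}: stand-in basis
    cols, selected = [np.ones((n, 1))], {}
    best = ebic(rss(np.hstack(cols), Y), 1, n, p, gamma)
    while len(selected) < p:
        cand = None
        for j in range(p):
            if j in selected:
                continue
            for kind in ("c", "v"):
                new = X[:, [j]] if kind == "c" else X[:, [j]] * basis
                W = np.hstack(cols + [new])
                crit = ebic(rss(W, Y), W.shape[1], n, p, gamma)
                if cand is None or crit < cand[0]:
                    cand = (crit, j, kind, new)
        crit, j, kind, new = cand
        if crit >= best:                       # EBIC increased: stop selecting
            break
        best, selected[j] = crit, kind
        cols.append(new)
    return selected
```

At each step the covariate-type pair that most improves the EBIC is added, so a covariate enters either as a constant-coefficient or a varying-coefficient term; the unselected covariates form the zero-coefficient group, mirroring the three-group classification above.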
Availability of Data and Material
The data and material used in this paper are available from the public data repository at http://doi.org/10.18129/B9.bioc.seventyGeneData.
References
Breheny, P. and Huang, J. (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat. Comput. 25, 173–187.
Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771.
Cheng, M.Y., Feng, S., Li, G. and Lian, H. (2018). Greedy forward regression for variable screening. Austral. New Zealand J. Stat. 60, 2–42.
Cheng, M.Y., Honda, T., Li, J. and Peng, H. (2014). Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Ann. Stat. 42, 1819–1849.
Cheng, M.Y., Honda, T. and Zhang, J.T. (2016). Forward variable selection for sparse ultra-high dimensional varying coefficient models. J. Am. Stat. Assoc. 111, 1209–1221.
Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc. 106, 544–557.
Fan, J., Ma, Y. and Dai, W. (2014). Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J. Am. Stat. Assoc. 109, 1270–1284.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360.
Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space. J. R. Stat. Soc. Series B 70, 849–911.
Greene, W.H. (2012). Econometric Analysis, 7th edn. Pearson Education, Harlow.
Honda, T., Ing, C.K. and Wu, W.Y. (2019). Adaptively weighted group Lasso for semiparametric quantile regression models. Bernoulli 25, 3311–3338.
Honda, T. and Lin, C.-T. (2021). Forward variable selection for sparse ultra-high dimensional generalized varying coefficient models. Japanese J. Stat. Data Sci. 4, 151–179.
Honda, T. and Yabe, R. (2017). Variable selection and structure identification for varying coefficient Cox models. J. Multivar. Anal. 161, 103–122.
Horn, R.A. and Johnson, C.R. (2013). Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge.
Huber, W., Carey, V.J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B.S., Bravo, H.C., Davis, S., Gatto, L., Girke, T., Gottardo, R., Hahne, F., Hansen, K.D., Irizarry, R.A., Lawrence, M., Love, M.I., MacDonald, J., Obenchain, V., Oleś, A.K., Pagès, H., Reyes, A., Shannon, P., Smyth, G.K., Tenenbaum, D., Waldron, L. and Morgan, M. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121.
Josse, J. and Husson, F. (2016). missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Softw. 70, 1–31.
Lee, E.R., Noh, H. and Park, B.U. (2014). Model selection via Bayesian information criterion for quantile regression models. J. Am. Stat. Assoc. 109, 216–229.
Li, G., Peng, H., Zhang, J. and Zhu, L. (2012a). Robust rank correlation based screening. Ann. Stat. 40, 1846–1877.
Li, R., Zhong, W. and Zhu, L. (2012b). Feature screening via distance correlation learning. J. Am. Stat. Assoc. 107, 1129–1139.
Liu, J., Li, R. and Wu, R. (2014). Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J. Am. Stat. Assoc. 109, 266–274.
Liu, J.Y., Zhong, W. and Li, R.Z. (2015). A selective overview of feature screening for ultrahigh-dimensional data. Sci. China Math. 58, 2033–2054.
Marchionni, L., Afsari, B., Geman, D. and Leek, J.T. (2013). A simple and reproducible breast cancer prognostic test. BMC Genomics 14.
Luo, S. and Chen, Z. (2014). Sequential Lasso cum EBIC for feature selection with ultra-high dimensional feature space. J. Am. Stat. Assoc. 109, 1229–1240.
Mai, Q. and Zou, H. (2015). The fused Kolmogorov filter: a nonparametric model-free screening method. Ann. Stat. 43, 1471–1497.
Serban, N. (2011). A space-time varying coefficient model: the equity of service accessibility. Ann. Appl. Stat. 5, 2024–2051.
Song, R., Yi, F. and Zou, H. (2014). On varying-coefficient independence screening for high-dimensional varying-coefficient models. Stat. Sin. 24, 1735–1752.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Series B 58, 267–288.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
van’t Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R. and Friend, S.H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.
Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. J. Am. Stat. Assoc. 104, 1512–1524.
Wang, H. and Xia, Y. (2009). Shrinkage estimation of the varying coefficient model. J. Am. Stat. Assoc. 104, 747–757.
Wang, L., Li, H. and Huang, J.Z. (2008). Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J. Am. Stat. Assoc. 103, 1556–1569.
Wang, K. and Lin, L. (2016). Robust structure identification and variable selection in partial linear varying coefficient models. J. Stat. Plann. Inf. 174, 153–168.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B 68, 49–67.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942.
Zhong, W., Duan, S. and Zhu, L. (2020). Forward additive regression for ultrahigh-dimensional nonparametric additive models. Stat. Sin. 30, 175–192.
Zhu, L.P., Li, L., Li, R. and Zhu, L.X. (2011). Model-free feature screening for ultra-high dimensional data. J. Am. Stat. Assoc. 106, 1464–1475.
Acknowledgements
The author would like to thank Professor Toshio Honda for providing code to generate the orthonormal basis and for giving helpful advice. He is also grateful to the Editor and the three anonymous reviewers for their constructive comments, which significantly improved the presentation of this paper.
Funding
The author received no financial support for this research.
Ethics declarations
Conflict of Interests
The author declares no conflict of interest.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Proofs
To prove the main theorems, we need Lemma A.1.
Lemma A.1.
Suppose that Assumptions 1-3 hold with \(M=O(n^{\xi })\) for some \(0<\xi <1/10\). Then, there exists some sequence of positive numbers \(\{\delta _{1n}\}\) tending to zero sufficiently slowly such that, with probability tending to one, we have
for any \(S\subset \mathcal {F}\) satisfying |S|≤ M.
Proof of Lemma A.1.
Note that there exists some positive constant \(\alpha \) such that \(\xi +\alpha <1/10\). Let \(\delta _{1n}=n^{-\alpha }\). We will show
Note that the eigenvalues of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and \(\boldsymbol {{\varSigma }}_{S}\) are nonnegative because these matrices are positive semidefinite. We also note that, for any positive semidefinite matrix, the kth largest singular value is equal to the kth largest eigenvalue. Hence, from Corollary 7.3.5 of Horn and Johnson (2013), we get
where \(\{\lambda _{k}(\widehat {\boldsymbol {{\varSigma }}}_{S})\}\) and \(\{\lambda _{k}(\boldsymbol {{\varSigma }}_{S})\}\) are the eigenvalues of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and \(\boldsymbol {{\varSigma }}_{S}\), respectively, arranged in nonincreasing order. Thus, we obtain
Note that the numbers of rows and columns of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and \(\boldsymbol {{\varSigma }}_{S}\) are both equal to \(|\mathcal {F}_{c}\cap S|+|\mathcal {F}_{v}\cap S|(L-1)\). Since \(|\mathcal {F}_{v}\cap S|\leq M\) and \(|\mathcal {F}_{c} \cap S|\leq M\), we have \(\|\widehat {\boldsymbol {{\varSigma }}}_{S}-\boldsymbol {{\varSigma }}_{S}\|\leq ML\|\widehat {\boldsymbol {{\varSigma }}}_{S}-\boldsymbol {{\varSigma }}_{S}\|_{\infty }\) by the Cauchy-Schwarz inequality, and
for any S satisfying |S|≤ M. Under Assumption 3, Lemmas A.1 and A.2 of Fan et al. (2014) hold. Thus, for k, j = 1,…,p, s, l = 1,…,L, and m ≥ 2, we have
where K is a positive constant. Recall that B(z) = A0B0(z), and \(C_{3}\leq \lambda _{\max \limits }({\boldsymbol {A}_{0}^{T}}\boldsymbol {A}_{0})\leq C_{4}\) for some positive constants C3 and C4. See Appendix A of Honda and Yabe (2017). Hence, for l = 1,…,L, we can denote \(B_{l}(z)={\sum }_{s=1}^{L}a_{ls}B_{0s}(z)\) where als is the (l, s) element of A0. Note that \({\sum }_{k=1}^{L}B_{0k}(z)=1\) due to the properties of the B-spline basis. Therefore, for s, l = 1,…,L, we obtain
As a result, we get \(E[|B_{s}(Z_{i})B_{l}(Z_{i})|^{m}] \leq {C_{4}}^{m}\), and the right hand side of the last inequality in Eq. A.6 can be bounded by \((2KC_{4})^{m-2}\,m!\,8K^{2}{C_{4}}^{2}e/2\) for any m ≥ 2. Thus, from Bernstein’s inequality (see Lemma 2.2.11 of van der Vaart and Wellner (1996)), we get
Since \(M=O(n^{\xi })\), we have \(M\leq \nu _{m}n^{\xi }\) for some \(\nu _{m}>0\). Thus, the right hand side of Eq. A.8 can be bounded by
where \(K^{*}=\max \limits \{16K^{2}{C_{4}^{2}}e{\nu _{m}^{2}}{C_{L}^{4}}\tau _{\min \limits }^{-1}, 4KC_{4}\nu _{m} {C_{L}^{2}}\}\). Let \(A=\{j: 1_{j} \in S\}\). Note that \(\{j: -1_{j} \in S\}\subset \{j: 1_{j} \in S\}\). Since \(|A|\leq |S|\leq M\), the right hand side of Eq. A.5 can be further bounded by
The right hand side of the last inequality in Eq. A.10 converges to zero as \(n \rightarrow \infty \). Thus, Eq. A.2 holds. We can also get
in the same way. Eqs. A.2, A.11, and Assumption 2 imply
for any S satisfying |S|≤ M. This completes the proof of Lemma A.1. □
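The step in the proof that passes from Corollary 7.3.5 of Horn and Johnson (2013) to a bound on the eigenvalue differences uses the standard perturbation fact that each ordered eigenvalue of a symmetric matrix moves by at most the spectral norm of the perturbation. A quick numerical check (the matrices below are arbitrary illustrative examples):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T                                  # positive semidefinite "truth"
E = 1e-2 * rng.standard_normal((p, p))
E = (E + E.T) / 2                                # symmetric perturbation
Sigma_hat = Sigma + E                            # perturbed estimate

lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]        # nonincreasing order
lam_hat = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]
spec_norm = np.linalg.norm(E, 2)                      # spectral norm of E

# Weyl's inequality: every ordered eigenvalue moves by at most ||E||_2
assert np.all(np.abs(lam_hat - lam) <= spec_norm + 1e-12)
```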
Corollary A.1 follows immediately from Lemma A.1.
Corollary A.1.
Suppose that Assumptions 1-3 hold with \(M=O(n^{\xi })\) for some \(0<\xi <1/10\). Then, with probability tending to one, we have
for any \(S\subset \mathcal {F}\) and \(E \subset \mathcal {F}\) satisfying |E ∪ S|≤ M, where QS is the orthogonal projection defined in Eq. A.20.
Proof of Corollary A.1.
Note that \(\boldsymbol {W}_{E\cup S}=(\boldsymbol {W}_{E}, \boldsymbol {W}_{S})\). By (A-74) in Greene (2012), \(({\boldsymbol {W}_{E}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{E})^{-1}\) is the upper left block of \((\boldsymbol {W}_{E\cup S}^{T}\boldsymbol {W}_{E\cup S})^{-1}\). Notice that \(\lambda _{\max \limits }(\boldsymbol {A}^{-1})=\lambda _{\min \limits }(\boldsymbol {A})^{-1}\) and \(\lambda _{\min \limits }(\boldsymbol {A}^{-1})=\lambda _{\max \limits }(\boldsymbol {A})^{-1}\) for any symmetric positive definite matrix \(\boldsymbol {A}\). Hence, we get
and
By Lemma A.1, we have
with probability tending to one. Hence, the proof of Corollary A.1 is completed. □
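The partitioned-inverse identity cited from (A-74) of Greene (2012) — that \(({\boldsymbol {W}_{E}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{E})^{-1}\) is the upper left block of the inverse of the full Gram matrix — can be verified directly with random design matrices (the dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, pE, pS = 50, 3, 4
WE = rng.standard_normal((n, pE))
WS = rng.standard_normal((n, pS))
W = np.hstack([WE, WS])                           # W_{E ∪ S} = (W_E, W_S)

# Q_S: orthogonal projection onto the orthogonal complement of span(W_S)
QS = np.eye(n) - WS @ np.linalg.inv(WS.T @ WS) @ WS.T

upper_left = np.linalg.inv(W.T @ W)[:pE, :pE]     # upper left block of (W^T W)^{-1}
assert np.allclose(upper_left, np.linalg.inv(WE.T @ QS @ WE))
```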
Proof of Theorem 1.
We can prove Theorem 1 in almost the same way as in Cheng et al. (2016). However, there are some differences because we also consider structure identification, so we give the proof of Theorem 1 below. First, we can represent Eq. 2.1 by
where
as in Honda et al. (2019). If \(\boldsymbol {\gamma }_{j}\) is suitably chosen, there exists some positive constant \(C_{g}^{\prime }\) such that
under Assumption 4. Note that \(|g_{cj}|^{2}=|\gamma _{1j}|^{2}L^{-1}\) and \(\|g_{vj}\|^{2}=\|\boldsymbol {\gamma }_{-1j}\|^{2}L^{-1}+O(L^{-4})\). See Appendix A of Honda and Yabe (2017) for properties of the orthonormal basis. Then, there exist some positive constants \(C_{1}\) and \(C_{2}\) such that \(C_{1}L \|g_{vj}\|^{2} \leq \|{\boldsymbol {\gamma }}_{-1j}\|^{2} \leq C_{2}L\|g_{vj}\|^{2}\). Under Assumption 6, we have
The matrix form of Eq. A.15 is represented by
Let us denote
for any S satisfying \(|S \cup \mathcal {S}_{0}|\leq M\) and \(\mathcal {S}_{0} \not \subset S\). We can easily show that \(\boldsymbol {H}_{lS}\boldsymbol {Q}_{S}=\boldsymbol {H}_{lS}\) and \(n\hat {\sigma }^{2}_{S} - n\hat {\sigma }^{2}_{S(l)} =\|\boldsymbol {H}_{lS} \boldsymbol {Y}\|^{2}\). From the triangle inequality, we have
We evaluate each part in the right hand side of Eq. A.21. Since HlS is a projection matrix, we have
Hence, we have \(\|\boldsymbol {H}_{lS}\boldsymbol {\delta }\|=O_{P}(n^{1/10})\). Next, we evaluate the third term in Eq. A.21. Under Assumption 5, Proposition 3 of Zhang (2010) holds. Thus, we have
for all x > 0 and for any \(l\in \mathcal {F}\cap S^{c}\), where m = rank(HlS). We take \(x=\log n\). If \(l\in \mathcal {F}_{c}\), then rank(HlS) = 1 and \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2} =O_{P}(\log n)\), which implies \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2}=O_{P}(L\log n)\). If \(l \in \mathcal {F}_{v}\), then rank(HlS) = (L − 1) and \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2} =O_{P}(L\log n)\). Thus, we have
Next, we evaluate \(\max \limits _{l \in \mathcal {S}_{0} \cap S^{c}}\| \boldsymbol {H}_{lS}\boldsymbol {W}_{\mathcal {S}_{0}}\boldsymbol {\gamma }_{\mathcal {S}_{0}}\|\). Notice that
Note that \(\lambda _{\min \limits }((\widetilde {\boldsymbol {W}}_{lS}^{T} \widetilde {\boldsymbol {W}}_{lS})^{-1})=\lambda _{\min \limits }(({\boldsymbol {W}_{l}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{l})^{-1})\) and |{l}∪ S|≤ M. From Corollary A.1, we get
with probability tending to one. We evaluate \(\max \limits _{l \in \mathcal {S}_{0}\cap S^{c}} \|\widetilde {\boldsymbol {W}}_{lS}^{T}\boldsymbol {W}_{\mathcal {S}_{0}}\boldsymbol {\gamma }_{\mathcal {S}_{0}} \|^{2} \). We can see that
Notice that \(|g_{cj}|^{2}+\|g_{vj}\|^{2}_{2}=\|g_{j}\|^{2}_{2} \leq \|g_{j}\|_{\infty }^{2}\). From the Cauchy-Schwarz inequality, we get
with probability tending to one. Thus, we get
Since \(|\mathcal {S}_{0}\cup S|\leq M\), we have
by Corollary A.1. Notice that
with probability tending to one. Note that \(\min \{\max _{j \in \mathcal {S}_{c}}|g_{cj}|^{2}, \max _{j \in \mathcal {S}_{v}}\|g_{vj}\|_{2}^{2}\}>C_{\min }\) under Assumption 4. From Eqs. A.26, A.28, and A.30, we get
with probability tending to one. Equation A.31 implies that the first term in the right hand side of Eq. A.21 dominates the other two terms. Thus, we have
with probability tending to one. By taking \(\delta _{2n}=1-(1-\delta _{1n})^{2}(1+\delta _{1n})^{-1}\), we get
for any S satisfying \(|S\cup \mathcal {S}_{0}|\leq M\) with probability tending to one, and the Proof of Theorem 1 is completed. □
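The bounds \(\|\boldsymbol {H}_{lS}\boldsymbol {\epsilon }\|^{2}=O_{P}(\log n)\) and \(O_{P}(L\log n)\) obtained from Proposition 3 of Zhang (2010) reflect the fact that, for Gaussian noise, the squared norm of a rank-m orthogonal projection of \(\boldsymbol {\epsilon }\) is exactly \(\chi ^{2}_{m}\), with mean m. A small Monte Carlo check (the sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, reps = 40, 5, 4000
B = rng.standard_normal((n, m))
H = B @ np.linalg.inv(B.T @ B) @ B.T       # orthogonal projection of rank m
assert np.allclose(H @ H, H)               # idempotent, as a projection must be

eps = rng.standard_normal((n, reps))       # standard Gaussian noise vectors
sq = np.sum((H @ eps) ** 2, axis=0)        # ||H eps||^2 over the replications

# ||H eps||^2 ~ chi^2_m, so its average should be close to m = rank(H)
assert np.isclose(sq.mean(), m, rtol=0.1)
```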
Proof of Theorem 2.
Assume Theorem 2 is false. Then, we have \(\mathcal {S}_{v} \not \subset F_{k}\) or \(\mathcal {S}_{c} \not \subset E_{k}\) for \(k=1,\dots ,T^{*}\). This implies that \(\mathcal {S}_{0} \not \subset S_{k}\) and \(|S_{k} \cup \mathcal {S}_{0}|\leq M\) for \(k=1,\dots ,T^{*}\). Note that we have \(n^{-1}{\sum }_{i=1}^{n}{Y_{i}^{2}} \geq \hat {\sigma }_{S_{k}(l)}^{2}\) and \(\hat \sigma ^{2}_{S_{k}}-\hat \sigma ^{2}_{S_{k}\cup \{l\}} \leq n^{-1}{\sum }_{i=1}^{n}{Y_{i}^{2}}\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\), and \(\log (1+x)\geq x(1+x)^{-1}\) for all \(x>-1\). Hence, we have
for any \(l \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Hence, if \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\), we obtain
since \(j^{*}=\arg \min \limits _{l \in \mathcal {F}\cap {S_{k}^{c}}} \hat {\sigma }^{2}_{S_{k}\cup \{l\}}\). If \(j^{*} \in \mathcal {F}_{v} \cap {S_{k}^{c}}\), we can also show that
in the same way. The second terms of the right hand side in Eqs. A.35 and A.36 are o(n) under Assumption 1. In addition, we have \(\frac {1}{n}{\sum }_{i=1}^{n}{Y_{i}^{2}} \xrightarrow {P} E[Y^{2}]\). Therefore, Theorem 1 implies that the first terms of the right hand side in Eqs. A.35 and A.36 dominate the second terms, and we get
with probability tending to one. Equation A.37 implies that our forward selection procedure selects some index at each step \(k=1,\dots ,T^{*}+1\). Thus, \(\{S_{k} \}_{k=1}^{T^{*}+1}\) is a monotone increasing sequence and \(\{\hat \sigma ^{2}_{S_{k}}\}_{k=1}^{T^{*}+1}\) is a nonincreasing sequence. Note that
where \(\bar {Y}=n^{-1}{\sum }_{i=1}^{n}Y_{i}\). Theorem 1 implies \(\text {Var}(Y)\geq {\sum }_{k=1}^{T^{*}}(1-\delta _{2n})D^{*} = T^{*}(1-\delta _{2n})D^{*}\) with probability tending to one. Thus, we have
with probability tending to one, and a contradiction occurs. This completes the proof. □
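The elementary inequality \(\log (1+x)\geq x(1+x)^{-1}\) for \(x>-1\), used at several points in this proof, is easy to confirm numerically on a grid:

```python
import numpy as np

x = np.linspace(-0.99, 50.0, 100_001)             # grid covering part of x > -1
assert np.all(np.log1p(x) >= x / (1.0 + x) - 1e-12)
assert np.log1p(0.0) == 0.0                        # equality holds at x = 0
```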
Proof of Theorem 3.
Since \(\mathcal {S}_{0} \subset S_{k}\), we have
We can see that the first term and the second term of the right hand side are oP(1) by Eq. A.22 and the Cauchy-Schwarz inequality. Note that \(\|\boldsymbol {Q}_{S_{k}}\boldsymbol {\epsilon }\|^{2}=\|\boldsymbol {\epsilon }\|^{2}-\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}\), and \(\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}=O_{P}(m\log n)\) by Proposition 3 of Zhang (2010), where \(m=\text {rank}(\boldsymbol {P}_{S_{k}})= |\mathcal {F}_{c}\cap S_{k}|+|\mathcal {F}_{v}\cap S_{k}|(L-1)\). Notice that \(|S_{k}|\leq 2k \leq 2M\), and \(M\leq \nu _{m}n^{\xi }\) for some \(\nu _{m}>0\), since \(M=O(n^{\xi })\). Thus, we have \(n^{-1}m\log n \leq n^{-1}2ML\log n \leq 2C_{L}\nu _{m} n^{\xi +1/5-1}\log n \rightarrow 0\) as \(n\rightarrow \infty ,\) and we get \(\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}/n=o_{P}(1)\). Since \(\|\boldsymbol {\epsilon }\|^{2}/n\xrightarrow {P}\sigma ^{2}\), we obtain \(\hat {\sigma }_{S_{k}}^{2}=\sigma ^{2}+o_{P}(1)\). We can also show that \(\hat {\sigma }_{S_{k}\cup \{l\}}^{2}=\sigma ^{2}+o_{P}(1)\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\) in the same way. Hence, we have \(\hat {\sigma }_{S_{k}}^{2}-\hat {\sigma }_{S_{k}\cup \{l\}}^{2}=o_{P}(1)\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\). Now, let \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Since \(-1<-(\hat \sigma ^{2}_{S_{k}}-\hat \sigma ^{2}_{S_{k}\cup \{j^{*}\}})(\hat \sigma ^{2}_{S_{k}})^{-1}\) and \(\log (1+x)\geq x(1+x)^{-1}\) for all \(x>-1\), we get
Note that \(\|\boldsymbol {H}_{lS_{k}}\boldsymbol {Y}\|^{2} \leq 2\|\boldsymbol {H}_{lS_{k}}\boldsymbol {\delta }\|^{2} +2\|\boldsymbol {H}_{lS_{k}}\boldsymbol {\epsilon }\|^{2}\). From Eq. A.22 and Proposition 3 of Zhang (2010), we have
The first term of the right hand side in the last equality of Eq. A.41 is \(O_{P}(n^{1/5}\log n)\) because its denominator converges to one in probability. On the other hand, the second term in the same equality is \(O_{P}(n^{\xi _{0}})\), and \(O_{P}(n^{1/5-\xi _{0}}\log n)=o_{P}(1)\). Hence, we have
when \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Next, let \(j^{*} \in \mathcal {F}_{v} \cap {S_{k}^{c}}\). By the same argument, we get
with probability tending to one. Equation A.44 implies that our forward selection procedure stops variable selection at the kth step. This completes the proof of Theorem 3. □
Cite this article
Shinkyu, A. Forward Selection for Feature Screening and Structure Identification in Varying Coefficient Models. Sankhya A 85, 485–511 (2023). https://doi.org/10.1007/s13171-021-00261-4