
Forward Selection for Feature Screening and Structure Identification in Varying Coefficient Models


Abstract

Varying coefficient models are flexible and interpretable, and they are widely used in data analysis. Although feature screening procedures have been proposed for ultra-high dimensional varying coefficient models, none of them includes structure identification. That is, existing feature screening procedures for varying coefficient models do not tell us which of the selected covariates have constant coefficients and which have non-constant coefficients. Hence, these procedures do not explicitly select partially linear varying coefficient models, which are much simpler than general varying coefficient models. Motivated by this issue, we propose a forward selection procedure for simultaneous feature screening and structure identification in varying coefficient models. Unlike existing feature screening procedures, our method classifies all covariates into three groups: covariates with constant coefficients, covariates with non-constant coefficients, and covariates with zero coefficients. Thus, our procedure can explicitly select partially linear varying coefficient models. It selects covariates sequentially until the extended BIC (EBIC) increases. We establish the screening consistency of our method under some conditions. Numerical studies and real data examples support the utility of our procedure.
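As a concrete illustration of the procedure described above, here is a minimal Python sketch with a usage example. It is not the author's implementation: it replaces the orthonormal spline basis with a raw polynomial basis, uses a simplified move set (add a covariate with a constant coefficient, or give it a varying coefficient), and takes EBIC(S) = n log σ̂²_S + df(S)(log n + 2η log p), consistent with the EBIC differences used in Appendix A; the names `forward_select`, `ebic`, and `varying_basis` are ours.

```python
import numpy as np

def varying_basis(z, L):
    # Stand-in basis for the varying part: [z, z^2, ..., z^{L-1}].
    # (The paper uses an orthonormalized spline basis; this is only a sketch.)
    return np.column_stack([z ** l for l in range(1, L)])

def ebic(y, cols, eta, p):
    # EBIC(S) = n log(sigma_hat_S^2) + df * (log n + 2*eta*log p), where df is
    # the number of design columns; no intercept is used, matching the sketch.
    n = len(y)
    if cols:
        W = np.hstack(cols)
        beta, *_ = np.linalg.lstsq(W, y, rcond=None)
        rss, df = float(np.sum((y - W @ beta) ** 2)), W.shape[1]
    else:
        rss, df = float(np.sum(y ** 2)), 0
    return n * np.log(rss / n) + df * (np.log(n) + 2.0 * eta * np.log(p))

def forward_select(X, Z, y, L=4, eta=1.0):
    n, p = X.shape
    B = varying_basis(Z, L)
    const, vary, cols = set(), set(), []
    current = ebic(y, cols, eta, p)
    while True:
        best_move, best_val = None, current
        for j in range(p):
            if j not in const and j not in vary:
                # Move 1: give X_j a constant coefficient (one column).
                val = ebic(y, cols + [X[:, [j]]], eta, p)
                if val < best_val:
                    best_move, best_val = ("const", j, [X[:, [j]]]), val
            if j not in vary:
                # Move 2: give X_j a varying coefficient (basis columns,
                # plus the constant column if it is not in the model yet).
                cand = ([] if j in const else [X[:, [j]]]) + [X[:, [j]] * B]
                val = ebic(y, cols + cand, eta, p)
                if val < best_val:
                    best_move, best_val = ("vary", j, cand), val
        if best_move is None:            # every move increases EBIC: stop
            return sorted(const - vary), sorted(vary)
        kind, j, cand = best_move
        cols += cand
        (vary if kind == "vary" else const).add(j)
        current = best_val

# Usage on simulated data: X_0 has a constant coefficient, X_1 a varying one.
rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.standard_normal((n, p))
Z = rng.uniform(size=n)
y = 2.0 * X[:, 0] + np.sin(2 * np.pi * Z) * X[:, 1] + rng.standard_normal(n)
print(forward_select(X, Z, y))  # ideally ([0], [1])
```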


Availability of Data and Material

The data and material used in this paper are available from the public data repository at http://doi.org/10.18129/B9.bioc.seventyGeneData.

References

  • Breheny, P. and Huang, J. (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat. Comput. 25, 173–187.

  • Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771.

  • Cheng, M.Y., Feng, S., Li, G. and Lian, H. (2018). Greedy forward regression for variable screening. Austral. New Zealand J. Stat. 60, 20–42.

  • Cheng, M.Y., Honda, T., Li, J. and Peng, H. (2014). Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Ann. Stat. 42, 1819–1849.

  • Cheng, M.Y., Honda, T. and Zhang, J.T. (2016). Forward variable selection for sparse ultra-high dimensional varying coefficient models. J. Am. Stat. Assoc. 111, 1209–1221.

  • Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc. 106, 544–557.

  • Fan, J., Ma, Y. and Dai, W. (2014). Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J. Am. Stat. Assoc. 109, 1270–1284.

  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360.

  • Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space. J. R. Stat. Soc. Series B 70, 849–911.

  • Greene, W.H. (2012). Econometric Analysis, 7th edn. Pearson Education, Harlow.

  • Honda, T., Ing, C.K. and Wu, W.Y. (2019). Adaptively weighted group Lasso for semiparametric quantile regression models. Bernoulli 25, 3311–3338.

  • Honda, T. and Lin, C.-T. (2021). Forward variable selection for sparse ultra-high dimensional generalized varying coefficient models. Japanese J. Stat. Data Sci. 4, 151–179.

  • Honda, T. and Yabe, R. (2017). Variable selection and structure identification for varying coefficient Cox models. J. Multivar. Anal. 161, 103–122.

  • Horn, R.A. and Johnson, C.R. (2013). Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge.

  • Huber, W., Carey, V.J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B.S., Bravo, H.C., Davis, S., Gatto, L., Girke, T., Gottardo, R., Hahne, F., Hansen, K.D., Irizarry, R.A., Lawrence, M., Love, M.I., MacDonald, J., Obenchain, V., Oleś, A.K., Pagès, H., Reyes, A., Shannon, P., Smyth, G.K., Tenenbaum, D., Waldron, L. and Morgan, M. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121.

  • Josse, J. and Husson, F. (2016). missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Softw. 70, 1–31.

  • Lee, E.R., Noh, H. and Park, B.U. (2014). Model selection via Bayesian information criterion for quantile regression models. J. Am. Stat. Assoc. 109, 216–229.

  • Li, G., Peng, H., Zhang, J. and Zhu, L. (2012a). Robust rank correlation based screening. Ann. Stat. 40, 1846–1877.

  • Li, R., Zhong, W. and Zhu, L. (2012b). Feature screening via distance correlation learning. J. Am. Stat. Assoc. 107, 1129–1139.

  • Liu, J., Li, R. and Wu, R. (2014). Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J. Am. Stat. Assoc. 109, 266–274.

  • Liu, J.Y., Zhong, W. and Li, R.Z. (2015). A selective overview of feature screening for ultrahigh-dimensional data. Sci. China Math. 58, 2033–2054.

  • Luo, S. and Chen, Z. (2014). Sequential Lasso cum EBIC for feature selection with ultra-high dimensional feature space. J. Am. Stat. Assoc. 109, 1229–1240.

  • Mai, Q. and Zou, H. (2015). The fused Kolmogorov filter: a nonparametric model-free screening method. Ann. Stat. 43, 1471–1497.

  • Marchionni, L., Afsari, B., Geman, D. and Leek, J.T. (2013). A simple and reproducible breast cancer prognostic test. BMC Genomics 14, 336.

  • Serban, N. (2011). A space-time varying coefficient model: the equity of service accessibility. Ann. Appl. Stat. 5, 2024–2051.

  • Song, R., Yi, F. and Zou, H. (2014). On varying-coefficient independence screening for high-dimensional varying-coefficient models. Stat. Sin. 24, 1735–1752.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Series B 58, 267–288.

  • van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer, New York.

  • van’t Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R. and Friend, S.H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.

  • Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. J. Am. Stat. Assoc. 104, 1512–1524.

  • Wang, H. and Xia, Y. (2009). Shrinkage estimation of the varying coefficient model. J. Am. Stat. Assoc. 104, 747–757.

  • Wang, L., Li, H. and Huang, J.Z. (2008). Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J. Am. Stat. Assoc. 103, 1556–1569.

  • Wang, K. and Lin, L. (2016). Robust structure identification and variable selection in partial linear varying coefficient models. J. Stat. Plann. Inf. 174, 153–168.

  • Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B 68, 49–67.

  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942.

  • Zhong, W., Duan, S. and Zhu, L. (2020). Forward additive regression for ultrahigh-dimensional nonparametric additive models. Stat. Sin. 30, 175–192.

  • Zhu, L.P., Li, L., Li, R. and Zhu, L.X. (2011). Model-free feature screening for ultra-high dimensional data. J. Am. Stat. Assoc. 106, 1464–1475.


Acknowledgements

The author would like to thank Professor Toshio Honda for providing code to generate the orthonormal basis and for his helpful advice. He is also grateful to the Editor and the three anonymous reviewers for their constructive comments, which significantly improved the presentation of this paper.

Funding

The author received no financial support for this research.

Author information

Correspondence to Akira Shinkyu.

Ethics declarations

Conflict of Interest

The author declares no conflict of interest.


Appendix A: Proofs

To prove the main theorems, we need Lemma A.1.

Lemma A.1.

Suppose that Assumptions 1-3 hold with M = O(nξ) for some 0 < ξ < 1/10. Then, there exists some sequence of positive numbers {δ1n} tending to zero sufficiently slowly such that with probability tending to one, we have

$$ \frac{\tau_{\min}}{L}(1-\delta_{1n}) \leq \lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S}) \leq \lambda_{\max}(\widehat{\boldsymbol{{\varSigma}}}_{S}) \leq \frac{\tau_{\max}}{L}(1+\delta_{1n}), $$
(A.1)

for any \(S\subset \mathcal {F}\) satisfying |S|≤ M.

Proof of Lemma A.1.

Note that there exists some positive constant α such that ξ + α < 1/10. Let $\delta_{1n} = n^{-\alpha}$. We will show

$$ P\left( |\lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S})-\lambda_{\min}(\boldsymbol{{\varSigma}}_{S}) |>\frac{\tau_{\min}}{L}\delta_{1n}\right)\rightarrow 0. $$
(A.2)

Note that the eigenvalues of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and ΣS are nonnegative because these matrices are positive semidefinite. We also note that the kth largest singular value is equal to the kth largest eigenvalue for any positive semidefinite matrix. Hence, from Corollary 7.3.5 of Horn and Johnson (2013), we get

$$ \left|\lambda_{k}(\widehat{\boldsymbol{{\varSigma}}}_{S})-\lambda_{k}(\boldsymbol{{\varSigma}}_{S}) \right|\leq \|\widehat{\boldsymbol{{\varSigma}}}_{S}-\boldsymbol{{\varSigma}}_{S} \| \ \ \text{for } k=1, \ldots, |\mathcal{F}_{c}\cap S|+|\mathcal{F}_{v}\cap S|(L-1), $$
(A.3)

where \(\{\lambda _{k}(\widehat {\boldsymbol {{\varSigma }}}_{S})\}\) and {λk(ΣS)} are the nonincreasingly ordered eigenvalues of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and ΣS, respectively. Thus, we obtain

$$ |\lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S})- \lambda_{\min}(\boldsymbol{{\varSigma}}_{S})|\leq \|\widehat{\boldsymbol{{\varSigma}}}_{S}-\boldsymbol{{\varSigma}}_{S}\|. $$
(A.4)
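The two matrix inequalities at work here (the eigenvalue perturbation bound in Eqs. A.3 and A.4, and the bound of the spectral norm by the entrywise maximum norm used just below) can be checked numerically. The following sketch is an illustration only, with simulated symmetric matrices standing in for \(\widehat{\boldsymbol{\varSigma}}_{S}\) and ΣS.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
A = rng.standard_normal((d, d))
A = A @ A.T                                   # PSD, plays the role of Sigma_S
E = 1e-2 * rng.standard_normal((d, d))
E = (E + E.T) / 2                             # symmetric perturbation
B = A + E                                     # plays the role of Sigma_hat_S

spec = np.linalg.norm(E, 2)                   # spectral norm of the difference
eig_A = np.linalg.eigvalsh(A)                 # eigenvalues in sorted order
eig_B = np.linalg.eigvalsh(B)
assert np.all(np.abs(eig_A - eig_B) <= spec + 1e-12)   # Eqs. A.3-A.4
assert spec <= d * np.abs(E).max() + 1e-12    # spectral norm vs. entrywise max
```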

Note that the numbers of rows and columns of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and ΣS are equal to \(|\mathcal {F}_{c}\cap S|+|\mathcal {F}_{v}\cap S|(L-1)\). Since \(|\mathcal {F}_{v}\cap S|\leq M\) and \(|\mathcal {F}_{c} \cap S|\leq M\), we have \(\|\widehat {\boldsymbol {{\varSigma }}}_{S}-\boldsymbol {{\varSigma }}_{S}\|\leq ML\|\widehat {\boldsymbol {{\varSigma }}}_{S}-\boldsymbol {{\varSigma }}_{S}\|_{\infty }\) by the Cauchy-Schwarz inequality, and

$$ P\left( |\lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S})-\lambda_{\min}(\boldsymbol{{\varSigma}}_{S})|> \frac{\tau_{\min}}{L}\delta_{1n} \right)\leq P\left( \|\widehat{\boldsymbol{{\varSigma}}}_{S}- \boldsymbol{{\varSigma}}_{S}\|_{\infty}> \frac{\tau_{\min}}{ML^{2}}\delta_{1n} \right), $$
(A.5)

for any S satisfying |S|≤ M. Under Assumption 3, Lemmas A.1 and A.2 of Fan et al. (2014) hold. Thus, for k, j = 1,…,p, s, l = 1,…,L, and m ≥ 2, we have

$$ \begin{array}{@{}rcl@{}} &&E\left( |X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij} -E[X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij}]|^{m}\right) \\ &\leq& 2^{m}e{K}^{m}m!E[|B_{s}(Z_{i})B_{l}(Z_{i})|^{m}], \end{array} $$
(A.6)

where K is a positive constant. Recall that $\boldsymbol{B}(z) = \boldsymbol{A}_{0}\boldsymbol{B}_{0}(z)$ and \(C_{3}\leq \lambda _{\max \limits }({\boldsymbol {A}_{0}^{T}}\boldsymbol {A}_{0})\leq C_{4}\) for some positive constants C3 and C4. See Appendix A of Honda and Yabe (2017). Hence, for l = 1,…,L, we can write \(B_{l}(z)={\sum }_{s=1}^{L}a_{ls}B_{0s}(z)\), where als is the (l, s) element of A0. Note that \({\sum }_{k=1}^{L}B_{0k}(z)=1\) due to the properties of the B-spline basis. Therefore, for s, l = 1,…,L, we obtain

$$ \begin{array}{@{}rcl@{}} |B_{s}(Z_{i})B_{l}(Z_{i})|&=&\left|\left\{\sum\limits_{j=1}^{L}a_{sj}B_{0j}(Z_{i})\right\}\left\{\sum\limits_{k=1}^{L}a_{lk}B_{0k}(Z_{i})\right\}\right| \\ &\leq& \max_{1\leq j\leq L}|a_{sj}|\times \max_{1\leq k\leq L}|a_{lk}| \leq \|\boldsymbol{A}_{0}\|_{\infty}^{2} \leq \|\boldsymbol{A}_{0}\|^{2}\leq C_{4}. \end{array} $$
(A.7)
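The partition-of-unity property \({\sum }_{k=1}^{L}B_{0k}(z)=1\) invoked above is easy to verify numerically. The sketch below builds a cubic B-spline basis with SciPy; the knot placement is an assumption for illustration, not the paper's construction.

```python
import numpy as np
from scipy.interpolate import BSpline

degree, n_interior = 3, 5                      # cubic splines, 5 interior knots
interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
knots = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])
L = len(knots) - degree - 1                    # number of basis functions

z = np.linspace(0.0, 1.0, 200, endpoint=False)
B0 = np.column_stack(
    [BSpline(knots, np.eye(L)[k], degree)(z) for k in range(L)]
)
assert np.allclose(B0.sum(axis=1), 1.0)        # sum_k B_0k(z) = 1 on [0, 1)
```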

As a result, we get \(E[|B_{s}(Z_{i})B_{l}(Z_{i})|^{m}] \leq {C_{4}}^{m}\), and the right hand side of the last inequality in Eq. A.6 can be bounded by $(2KC_{4})^{m-2}m!\,8K^{2}C_{4}^{2}e\,2^{-1}$ for any m ≥ 2. Thus, from Bernstein’s inequality (see Lemma 2.2.11 of van der Vaart and Wellner (1996)), we get

$$ \begin{array}{@{}rcl@{}} &&P\left( \left|\frac{1}{n}\sum\limits_{i=1}^{n}\{X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij} -E[X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij}]\}\right|>\frac{\tau_{\min}}{ML^{2}}\delta_{1n}\right)\\ &\leq& 2\exp\left( \frac{-n\tau_{\min}\delta_{1n}}{16M^{2}L^{4}K^{2}{C_{4}^{2}}e(\tau_{\min}\delta_{1n})^{-1}+4ML^{2}KC_{4}}\right). \end{array} $$
(A.8)
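A small simulation illustrates the flavor of such a tail bound. One caveat: the sketch uses the classical Bernstein inequality for bounded variables, $P(|\bar{X}_n|>t)\leq 2\exp\{-nt^{2}/(2\sigma^{2}+2bt/3)\}$, rather than the moment-condition version of Lemma 2.2.11 of van der Vaart and Wellner (1996) applied in Eq. A.8.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, t = 500, 5000, 0.05
# Centered bounded summands: Uniform(-1, 1), so |X_i| <= b = 1, Var(X_i) = 1/3.
X = rng.uniform(-1.0, 1.0, size=(reps, n))
empirical = float(np.mean(np.abs(X.mean(axis=1)) > t))

sigma2, b = 1.0 / 3.0, 1.0
bound = 2.0 * np.exp(-n * t**2 / (2.0 * (sigma2 + b * t / 3.0)))
print(empirical, bound)   # the empirical tail sits below the Bernstein bound
```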

Since M = O(nξ), we have $M \leq \nu_{m}n^{\xi}$ for some νm > 0. Thus, the right hand side of Eq. A.8 can be bounded by

$$ \leq 2\exp\left( \frac{-\tau_{\min}n^{1/5-2\xi-2\alpha}}{K^{*}\{1+n^{-2/5-\xi-\alpha}\}}\right), $$
(A.9)

where \(K^{*}=\max \limits \{16K^{2}{C_{4}^{2}}e{\nu _{m}^{2}}{C_{L}^{4}}\tau _{\min \limits }^{-1}, 4KC_{4}\nu _{m} {C_{L}^{2}}\}\). Let $A = \{j : 1j \in S\}$. Note that $\{j : -1j \in S\} \subset \{j : 1j \in S\}$. Since |A|≤|S|≤ M, the right hand side of Eq. A.5 can be further bounded by

$$ \begin{array}{@{}rcl@{}} &\leq& \sum\limits_{k,j \in A} \sum\limits_{1\leq s,l\leq L}2\exp\left( \frac{-\tau_{\min}n^{1/5-2\xi-2\alpha}}{K^{*}\{1+n^{-2/5-\xi-\alpha}\}}\right) \\ &\leq& 2{\nu_{m}^{2}}{C_{L}^{2}}n^{2\xi+2/5}\exp\left( \frac{-\tau_{\min}n^{1/5-2\xi-2\alpha}}{K^{*}\{1+n^{-2/5-\xi-\alpha}\}}\right). \end{array} $$
(A.10)

The right hand side of the last inequality in Eq. A.10 converges to zero as \(n \rightarrow \infty \). Thus, Eq. A.2 holds. We can also get

$$ P\left( |\lambda_{\max}(\widehat{\boldsymbol{{\varSigma}}}_{S})-\lambda_{\max}(\boldsymbol{{\varSigma}}_{S}) |>\frac{\tau_{\max}}{L}\delta_{1n}\right)\rightarrow 0 , $$
(A.11)

in the same way. Eqs. A.2, A.11, and Assumption 2 imply

$$ P\left( \frac{\tau_{\min}}{L}(1-\delta_{1n}) \leq \lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S}) \leq \lambda_{\max}(\widehat{\boldsymbol{{\varSigma}}}_{S}) \leq \frac{\tau_{\max}}{L}(1+\delta_{1n})\right)\rightarrow 1, $$
(A.12)

for any S satisfying |S|≤ M. This completes the proof of Lemma A.1. □

Corollary A.1 follows immediately from Lemma A.1.

Corollary A.1.

Suppose that Assumptions 1-3 hold with M = O(nξ) for some 0 < ξ < 1/10. Then, with probability tending to one, we have

$$ \frac{\tau_{\min}}{L}(1-\delta_{1n}) \leq\lambda_{\min}\left( \frac{1}{n}{\boldsymbol{W}_{E}^{T}}\boldsymbol{Q}_{S}\boldsymbol{W}_{E}\right)\leq\lambda_{\max}\left( \frac{1}{n}{\boldsymbol{W}_{E}^{T}}\boldsymbol{Q}_{S}\boldsymbol{W}_{E}\right)\leq \frac{\tau_{\max}}{L}(1+\delta_{1n}), $$
(A.13)

for any \(S\subset \mathcal {F}\) and \(E \subset \mathcal {F}\) satisfying |ES|≤ M, where QS is the orthogonal projection defined in Eq. A.20.

Proof of Corollary A.1.

Note that $\boldsymbol{W}_{E\cup S} = (\boldsymbol{W}_{E}, \boldsymbol{W}_{S})$. By (A-74) in Greene (2012), \(({\boldsymbol {W}_{E}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{E})^{-1}\) is the upper left block of $(\boldsymbol{W}_{E\cup S}^{T}\boldsymbol{W}_{E\cup S})^{-1}$. Notice that \(\lambda _{\max \limits }(\boldsymbol {A}^{-1})=\lambda _{\min \limits }(\boldsymbol {A})^{-1}\) and \(\lambda _{\min \limits }(\boldsymbol {A}^{-1})=\lambda _{\max \limits }(\boldsymbol {A})^{-1}\) for any symmetric and invertible matrix A. Hence, we get

$$ \lambda_{\min}(\boldsymbol{W}_{E \cup S}^{T}\boldsymbol{W}_{E \cup S})\leq \lambda_{\min}({\boldsymbol{W}_{E}^{T}}\boldsymbol{Q}_{S}\boldsymbol{W}_{E}), $$

and

$$ \lambda_{\max}({\boldsymbol{W}_{E}^{T}}\boldsymbol{Q}_{S}\boldsymbol{W}_{E})\leq \lambda_{\max}(\boldsymbol{W}_{E \cup S}^{T}\boldsymbol{W}_{E \cup S}). $$

By Lemma A.1, we have

$$ \frac{n\tau_{\min}}{L}(1-\delta_{1n})\!\leq\! \lambda_{\min}(\boldsymbol{W}_{E\cup S}^{T}\boldsymbol{W}_{E\cup S})\!\leq\! \lambda_{\max}(\boldsymbol{W}_{E\cup S}^{T}\boldsymbol{W}_{E\cup S}) \!\leq\! \frac{n\tau_{\max}}{L}(1+\delta_{1n}) $$
(A.14)

with probability tending to one. Hence, the proof of Corollary A.1 is completed. □
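The block-inverse identity from Greene (2012) used in this proof admits a quick numerical sanity check; the sketch below, with simulated design blocks standing in for WE and WS, verifies that \(({\boldsymbol {W}_{E}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{E})^{-1}\) equals the upper left block of $(\boldsymbol{W}_{E\cup S}^{T}\boldsymbol{W}_{E\cup S})^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dE, dS = 200, 3, 4                          # dimensions are illustrative
W_E = rng.standard_normal((n, dE))
W_S = rng.standard_normal((n, dS))
W = np.hstack([W_E, W_S])                      # W_{E u S} = (W_E, W_S)

P_S = W_S @ np.linalg.inv(W_S.T @ W_S) @ W_S.T
Q_S = np.eye(n) - P_S                          # projection orthogonal to W_S

lhs = np.linalg.inv(W_E.T @ Q_S @ W_E)
rhs = np.linalg.inv(W.T @ W)[:dE, :dE]         # upper left block
assert np.allclose(lhs, rhs)
```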

Proof of Theorem 1.

We can prove Theorem 1 in almost the same way as in Cheng et al. (2016). However, because we also consider structure identification, there are some differences, and so we give the proof of Theorem 1 below. First, we can rewrite Eq. 2.1 as

$$ \begin{array}{@{}rcl@{}} Y_{i}&=&\sum\limits_{j=1}^{p}X_{ij}g_{cj}+\sum\limits_{j=1}^{p}X_{ij}g_{vj}(Z_{i})+\epsilon_{i}=\sum\limits_{j\in \mathcal{S}_{c}}X_{ij}g_{cj}+\sum\limits_{j\in \mathcal{S}_{v}}X_{ij}g_{vj}(Z_{i})+\epsilon_{i}\\ &=&\sum\limits_{j\in\mathcal{S}_{c}}X_{ij}B_{1}(Z_{i})\gamma_{1j}+\sum\limits_{j\in \mathcal{S}_{v}}X_{ij}\boldsymbol{B}_{-1}(Z_{i})^{T}\boldsymbol{\gamma}_{-1j}+\delta_{i} +\epsilon_{i}, \end{array} $$
(A.15)

where

$$ \delta_{i}=\sum\limits_{j \in\mathcal{S}_{v}}X_{ij}\{g_{vj}(Z_{i})-\boldsymbol{B}_{-1}(Z_{i})^{T}\boldsymbol{\gamma}_{-1j}\}, $$
(A.16)

as in Honda et al. (2019). If γj is suitably chosen, there exists some positive \(C_{g}^{\prime }\) such that

$$ \sum\limits_{j=1}^{p}\|g_{j} -\boldsymbol{B}^{T}\boldsymbol{\gamma}_{j}\|_{\infty} \leq C_{g}^{\prime}L^{-2}, $$
(A.17)

under Assumption 4. Note that $|g_{cj}|^{2} = |\gamma_{1j}|^{2}L^{-1}$ and \(\|g_{vj}\|^{2}=\|\boldsymbol {\gamma }_{-1j}\|^{2}L^{-1}+O(L^{-4})\). See Appendix A of Honda and Yabe (2017) for properties of the orthonormal basis. Then, there exist some positive constants C1 and C2 such that \(C_{1}L \|g_{vj}\|^{2} \leq \|{\boldsymbol {\gamma }}_{-1j}\|^{2} \leq C_{2}L\|g_{vj}\|^{2}\). Under Assumption 6, we have

$$ |\delta_{i}|\leq C_{X}C_{g}^{\prime}L^{-2}\rightarrow 0. $$
(A.18)

In matrix form, Eq. A.15 can be written as

$$ \boldsymbol{Y}=\sum\limits_{j\in\mathcal{S}_{c}}\boldsymbol{W}_{1j}{\gamma}_{1j}+\sum\limits_{j\in\mathcal{S}_{v}}\boldsymbol{W}_{-1j}\boldsymbol{\gamma}_{-1j}+\boldsymbol{\delta}+\boldsymbol{\epsilon} =\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} + \boldsymbol{\delta} + \boldsymbol{\epsilon}. $$
(A.19)

Let us denote

$$ \boldsymbol{P}_{S}=\boldsymbol{W}_{S}({\boldsymbol{W}_{S}^{T}} \boldsymbol{W}_{S})^{-1} {\boldsymbol{W}_{S}^{T}},\ \ \boldsymbol{Q}_{S}= \boldsymbol{I} -\boldsymbol{P}_{S}, $$
(A.20)
$$ \widetilde{\boldsymbol{W}}_{lS}= \boldsymbol{Q}_{S}\boldsymbol{W}_{l}, \boldsymbol{H}_{lS} =\widetilde{\boldsymbol{W}}_{lS}(\widetilde{\boldsymbol{W}}_{lS}^{T} \widetilde{\boldsymbol{W}}_{lS} )^{-1}\widetilde{\boldsymbol{W}}_{lS}^{T}, $$

for any S satisfying \(|S \cup \mathcal {S}_{0}|\leq M\) and \(\mathcal {S}_{0} \not \subset S\). We can easily show that HlSQS = HlS and $n \hat{\sigma}^{2}_{S} - n\hat{\sigma}^{2}_{S(l)} = \|\boldsymbol{H}_{lS} \boldsymbol{Y}\|^{2}$. From the triangle inequality, we have

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|\boldsymbol{H}_{lS}\boldsymbol{Y}\|\geq \max_{l \in \mathcal{S}_{0} \cap S^{c}}\| \boldsymbol{H}_{lS}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}}\|- \|\boldsymbol{H}_{lS}\boldsymbol{\delta}\| - \| \boldsymbol{H}_{lS}\boldsymbol{\epsilon} \|. $$
(A.21)
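The identity $n\hat{\sigma}^{2}_{S}-n\hat{\sigma}^{2}_{S(l)}=\|\boldsymbol{H}_{lS}\boldsymbol{Y}\|^{2}$ behind Eq. A.21 is the standard decomposition of a residual sum of squares; the following sketch checks it numerically with simulated blocks (an illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
n, dS, dl = 150, 3, 2                          # dimensions are illustrative
W_S = rng.standard_normal((n, dS))
W_l = rng.standard_normal((n, dl))
Y = rng.standard_normal(n)

def rss(W):                                    # n * sigma_hat^2 for design W
    beta, *_ = np.linalg.lstsq(W, Y, rcond=None)
    return np.sum((Y - W @ beta) ** 2)

Q_S = np.eye(n) - W_S @ np.linalg.inv(W_S.T @ W_S) @ W_S.T
W_tilde = Q_S @ W_l                            # W-tilde_{lS}
H = W_tilde @ np.linalg.inv(W_tilde.T @ W_tilde) @ W_tilde.T   # H_{lS}
assert np.isclose(rss(W_S) - rss(np.hstack([W_S, W_l])), np.sum((H @ Y) ** 2))
```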

We evaluate each part in the right hand side of Eq. A.21. Since HlS is a projection matrix, we have

$$ \|\boldsymbol{H}_{lS} \boldsymbol{\delta}\| \leq \|\boldsymbol{\delta} \| \leq \sqrt{n} C_{X} C_{g}^{\prime} L^{-2}. $$
(A.22)

Hence, we have $\|\boldsymbol{H}_{lS}\boldsymbol{\delta}\| = O_{P}(n^{1/10})$. Next, we evaluate the third term in Eq. A.21. Under Assumption 5, Proposition 3 of Zhang (2010) holds. Thus, we have

$$ P\left( \frac{\|\boldsymbol{H}_{lS}\boldsymbol{\epsilon}\|^{2}}{m{\sigma^{2}_{1}}} \geq \frac{1+x}{\{1-2/(e^{x/2}\sqrt{1+x}-1)\}^{2}_{+}} \right)\leq e^{-mx/2}(1+x)^{m/2}, $$
(A.23)

for all x > 0 and for any \(l\in \mathcal {F}\cap S^{c}\), where m = rank(HlS). We take \(x=\log n\). If \(l\in \mathcal {F}_{c}\), then rank(HlS) = 1 and \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2} =O_{P}(\log n)\), which implies \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2}=O_{P}(L\log n)\). If \(l \in \mathcal {F}_{v}\), then rank(HlS) = (L − 1) and \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2} =O_{P}(L\log n)\). Thus, we have

$$ \|\boldsymbol{H}_{lS} \boldsymbol{\epsilon}\|=O_{P}(L^{1/2}(\log n)^{1/2} )=O_{P}(n^{1/10}(\log n)^{1/2}). $$
(A.24)

Next, we evaluate \(\max \limits _{l \in \mathcal {S}_{0} \cap S^{c}}\| \boldsymbol {H}_{lS}\boldsymbol {W}_{\mathcal {S}_{0}}\boldsymbol {\gamma }_{\mathcal {S}_{0}}\|\). Notice that

$$ \|\boldsymbol{H}_{lS}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \lambda_{\min} ((\widetilde{\boldsymbol{W}}_{lS}^{T} \widetilde{\boldsymbol{W}}_{lS} )^{-1}) \|\widetilde{{\boldsymbol{W}}}_{lS}^{T}{\boldsymbol{W}}_{\mathcal{S}_{0}}{\boldsymbol{\gamma}}_{\mathcal{S}_{0}} \|^{2}. $$
(A.25)

Note that \(\lambda _{\min \limits }((\widetilde {\boldsymbol {W}}_{lS}^{T} \widetilde {\boldsymbol {W}}_{lS})^{-1})=\lambda _{\min \limits }(({\boldsymbol {W}_{l}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{l})^{-1})\) and |{l}∪ S|≤ M. From Corollary A.1, we get

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|\boldsymbol{H}_{lS}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \frac{L}{n \tau_{\max}(1+\delta_{1n})} \max_{l \in \mathcal{S}_{0}\cap S^{c}}\|\widetilde{{\boldsymbol{W}}}_{lS}^{T}{\boldsymbol{W}}_{\mathcal{S}_{0}}{\boldsymbol{\gamma}}_{\mathcal{S}_{0}} \|^{2}, $$
(A.26)

with probability tending to one. We evaluate \(\max \limits _{l \in \mathcal {S}_{0}\cap S^{c}} \|\widetilde {\boldsymbol {W}}_{lS}^{T}\boldsymbol {W}_{\mathcal {S}_{0}}\boldsymbol {\gamma }_{\mathcal {S}_{0}} \|^{2} \). We can see that

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c} } \|\widetilde{\boldsymbol{W}}_{lS}^{T}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} =\max_{l \in \mathcal{S}_{0} \cap S^{c} } \|{\boldsymbol{W}_{l}^{T}} \boldsymbol{Q}_{S} \boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2}. $$

Notice that \(|g_{cj}|^{2}+\|g_{vj}\|^{2}_{2}=\|g_{j}\|^{2}_{2} \leq \|g_{j}\|_{\infty }^{2}\). From the Cauchy-Schwarz inequality, we get

$$ \begin{array}{@{}rcl@{}} &&\|\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \leq \|\boldsymbol{\gamma}_{\mathcal{S}_{0}}\| \| \boldsymbol{W}^{T}_{\mathcal{S}_{0}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}}\|\\ &=&\left( \sum\limits_{j\in\mathcal{S}_{c}}|\gamma_{1j}|^{2} +\sum\limits_{j \in \mathcal{S}_{v}}\| \boldsymbol{\gamma}_{-1j}\|^{2} \right)^{1/2} \left( \sum\limits_{j\in \mathcal{S}_{0}\cap S^{c} } \|{\boldsymbol{W}^{T}_{j}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \right)^{1/2}\\ &\leq& \left( L\sum\limits_{j\in\mathcal{S}_{c}} |g_{cj}|^{2} +C_{2}L\sum\limits_{j \in \mathcal{S}_{v}} \|g_{vj}\|_{2}^{2} \right)^{1/2} \left( \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|{\boldsymbol{W}^{T}_{l}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} 2p_{0} \right)^{1/2}\\ &\leq& 2^{1/2}p_{0}^{1/2} L^{1/2}C_{g}(1+C_{2})^{1/2} \left( \max_{l \in \mathcal{S}_{0} \cap S^{c}} \|{\boldsymbol{W}^{T}_{l}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \right)^{1/2}, \end{array} $$
(A.27)

with probability tending to one. Thus, we get

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|{\boldsymbol{W}^{T}_{l}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \frac{\|\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{4}}{2p_{0}L{C_{g}^{2}}(1+C_{2})}. $$
(A.28)

Since \(|\mathcal {S}_{0}\cup S|\leq M\), we have

$$ \|\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \lambda_{\min}\left( \boldsymbol{W}_{\mathcal{S}_{0}}^{T}\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}}\right) \|\boldsymbol{\gamma}_{\mathcal{S}_{0}}\|^{2} \geq \frac{n\tau_{\min}}{L}(1-\delta_{1n})\|\boldsymbol{\gamma}_{\mathcal{S}_{0}}\|^{2}, $$
(A.29)

by Corollary A.1. Notice that

$$ \begin{array}{@{}rcl@{}} &&\|\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{4}\geq \frac{n^{2}\tau_{\min}^{2}}{L^{2}}(1-\delta_{1n})^{2} \|\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{4}\\ &&\geq\frac{n^{2}\tau_{\min}^{2}}{L^{2}}(1-\delta_{1n})^{2}\left( L\max_{j \in \mathcal{S}_{c}}|g_{cj}|^{2}+LC_{1}\max_{j \in \mathcal{S}_{v}}\|g_{vj}\|^{2}_{2}\right)^{2}\\ &&\geq {n^{2}\tau_{\min}^{2}}(1-\delta_{1n})^{2}C_{\min}^{2}(1+C_{1})^{2}, \end{array} $$
(A.30)

with probability tending to one. Note that $\min\{\max_{j \in \mathcal{S}_{c}}|g_{cj}|^{2}, \max_{j \in \mathcal{S}_{v}}\|g_{vj}\|_{2}^{2}\} > C_{\min}$ under Assumption 4. From Eqs. A.26, A.28, and A.30, we get

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|\boldsymbol{H}_{lS} \boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \frac{n(1-\delta_{1n})^{2} }{(1+\delta_{1n})} \frac{\tau_{\min}^{2}C_{\min}^{2} (1+C_{1})^{2}}{\tau_{\max} {C_{g}^{2}}(1+C_{2})2p_{0}}, $$
(A.31)

with probability tending to one. Equation A.31 implies that the first term in the right hand side of Eq. A.21 dominates the other two terms. Thus, we have

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|\boldsymbol{H}_{lS}\boldsymbol{Y}\|^{2}\geq 4^{-1}\max_{l \in \mathcal{S}_{0} \cap S^{c}}\| \boldsymbol{H}_{lS}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}}\|^{2}, $$
(A.32)

with probability tending to one. By taking $\delta_{2n} = 1-(1-\delta_{1n})^{2}(1+\delta_{1n})^{-1}$, we get

$$ \max_{l\in \mathcal{F}\cap S^{c}}\{ n \hat{\sigma}_{S}^{2} -n \hat{\sigma}^{2}_{S\cup\{l\}}\}\geq n(1-\delta_{2n})\frac{\tau_{\min}^{2} C_{\min}^{2}(1+C_{1})^{2}}{4\tau_{\max}{C_{g}^{2}}(1+C_{2})2p_{0}} =n(1-\delta_{2n})D^{*}, $$
(A.33)

for any S satisfying \(|S\cup \mathcal {S}_{0}|\leq M\) with probability tending to one. This completes the proof of Theorem 1. □

Proof of Theorem 2.

Suppose that Theorem 2 is false. Then, we have \(\mathcal {S}_{v} \not \subset F_{k}\) or \(\mathcal {S}_{c} \not \subset E_{k}\) for k = 1,⋯ ,T^*. This implies that \(\mathcal {S}_{0} \not \subset S_{k}\) and \(|S_{k} \cup \mathcal {S}_{0}|\leq M\) for k = 1,⋯ ,T^*. Note that we have \(n^{-1}{\sum }_{i=1}^{n}{Y_{i}^{2}} \geq \hat {\sigma }_{S_{k}(l)}^{2}\) and \(\hat \sigma ^{2}_{S_{k}}-\hat \sigma ^{2}_{S_{k}\cup \{l\}} \leq n^{-1}{\sum }_{i=1}^{n}{Y_{i}^{2}}\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\), and that \(\log (1+x)\geq x(1+x)^{-1}\) for all x > − 1. Hence, we have

$$ \begin{array}{@{}rcl@{}} &&\text{EBIC}(S_{k}) -\text{EBIC}(S_{k}\cup\{l\}) =n\log \left( \frac{\hat\sigma^{2}_{S_{k}}}{\hat\sigma^{2}_{S_{k}\cup\{l\}}} \right) -(\log n +2\eta \log p)\\ &&\geq n\log \left( 1+ \frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{l\}}}{\frac{1}{n}\sum\limits_{i=1}^{n}{Y_{i}^{2}}} \right) -(\log n +2\eta \log p)\\ &&\geq \frac{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}}}{\frac{2}{n}\sum\limits_{i=1}^{n}{Y_{i}^{2}}} -(\log n +2\eta \log p), \end{array} $$
(A.34)

for any \(l \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Hence, if \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\), we obtain

$$ \text{EBIC}(S_{k}) -\text{EBIC}(S_{k}\cup\{j^{*}\}) \!\geq\! \frac{\max_{l \in \mathcal{F}\cap {S_{k}^{c}}}\{n\hat\sigma^{2}_{S_{k}} - n\hat\sigma^{2}_{S_{k}\cup\{l\}}\}}{\frac{2}{n}\sum\limits_{i=1}^{n}{Y_{i}^{2}}} -(\log n +2\eta \log p), $$
(A.35)

since \(j^{*}=\arg \min \limits _{l \in \mathcal {F}\cap {S_{k}^{c}}} \hat {\sigma }^{2}_{S_{k}\cup \{l\}}\). If \(j^{*} \in \mathcal {F}_{v} \cap {S_{k}^{c}}\), we can also show that

$$ \text{EBIC}(S_{k}) -\text{EBIC}(S_{k}\cup\{j^{*}\})\geq \frac{\max_{l \in \mathcal{F}\cap {S_{k}^{c}}}\{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}}\}}{\frac{2}{n}\sum\limits_{i=1}^{n}{Y_{i}^{2}}} -(L-1)(\log n +2\eta \log p), $$
(A.36)

in the same way. The second terms of the right hand side in Eqs. A.35 and A.36 are o(n) under Assumption 1. In addition, we have \(\frac {1}{n}{\sum }_{i=1}^{n}{Y_{i}^{2}} \xrightarrow {P} E[Y^{2}]\). Therefore, Theorem 1 implies that the first terms of the right hand side in Eqs. A.35 and A.36 dominate the second terms, and we get

$$ \text{EBIC}(S_{k}) \geq \text{EBIC}(S_{k}\cup\{j^{*}\}) \ \ \text{for } k =1,\cdots, T^{*}, $$
(A.37)

with probability tending to one. Equation A.37 implies that our forward selection procedure has selected some index at each kth step for k = 1,⋯ ,T^*. Thus, \(\{S_{k} \}_{k=1}^{T^{*}+1}\) is monotonically increasing and \(\{\hat \sigma _{S_{k}}\}_{k=1}^{T^{*}+1}\) is nonincreasing. Note that

$$ \frac{1}{n}\sum\limits_{i=1}^{n} (Y_{i} -\bar{Y})^{2} \geq \hat\sigma^{2}_{S_{1}} -\hat\sigma^{2}_{S_{T^{*}+1}} =\sum\limits_{k=1}^{T^{*}}(\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k+1}}) =\sum\limits_{k=1}^{T^{*}} \max_{l \in \mathcal{F}\cap {S^{c}_{k}}}\{ \hat\sigma^{2}_{S_{k}} -\hat\sigma^{2}_{S_{k}\cup\{l\}}\}, $$
(A.38)

where \(\bar {Y}=n^{-1}{\sum }_{i=1}^{n}Y_{i}\). Theorem 1 implies \(\text {Var}(Y)\geq {\sum }_{k=1}^{T^{*}}(1-\delta _{2n})D^{*} = T^{*}(1-\delta _{2n})D^{*}\) with probability tending to one. Thus, we have

$$ T^{*}\leq\frac{\text{Var}(Y)}{(1-\delta_{2n})D^{*}} <\frac{\text{Var}(Y)}{(1-2\delta_{2n})D^{*}}\leq T^{*}, $$
(A.39)

with probability tending to one, and a contradiction occurs. This completes the proof. □
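For concreteness, the EBIC bookkeeping used in this proof can be written out in a few lines. The sketch below assumes EBIC(S) = n log σ̂²_S + df(S)(log n + 2η log p) with df(S) = |Fc ∩ S| + (L − 1)|Fv ∩ S|, which is consistent with the differences in Eqs. A.34-A.36; the residual sums of squares are illustrative numbers, not data.

```python
import numpy as np

def ebic(rss, df, n, p, eta):
    # EBIC(S) = n log(sigma_hat_S^2) + df * (log n + 2*eta*log p),
    # with df = |F_c in S| + (L - 1)|F_v in S| design columns.
    return n * np.log(rss / n) + df * (np.log(n) + 2.0 * eta * np.log(p))

n, p, eta, L = 500, 1000, 1.0, 5
rss_S, rss_Sl = 620.0, 540.0                 # illustrative values, not data
# Adding a constant-coefficient index costs one penalty unit (Eq. A.35);
# a varying-coefficient index costs L - 1 units (Eq. A.36).
diff_const = ebic(rss_S, 7, n, p, eta) - ebic(rss_Sl, 7 + 1, n, p, eta)
diff_vary = ebic(rss_S, 7, n, p, eta) - ebic(rss_Sl, 7 + (L - 1), n, p, eta)
print(diff_const, diff_vary)
```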

Proof of Theorem 3.

Since \(\mathcal {S}_{0} \subset S_{k}\), we have

$$ \hat{\sigma}_{S_{k}}^{2}=n^{-1}\|\boldsymbol{Q}_{S_{k}}\boldsymbol{\delta}\|^{2}+2n^{-1}\boldsymbol{\delta}^{T}\boldsymbol{Q}_{S_{k}}\boldsymbol{\epsilon}+n^{-1}\|\boldsymbol{Q}_{S_{k}}\boldsymbol{\epsilon}\|^{2}. $$

We can see that the first term and the second term of the right hand side are oP(1) by Eq. A.22 and the Cauchy-Schwarz inequality. Note that \(\|\boldsymbol {Q}_{S_{k}}\boldsymbol {\epsilon }\|^{2}=\|\boldsymbol {\epsilon }\|^{2}-\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}\), and \(\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}=O_{P}(m\log n)\) by Proposition 3 of Zhang (2010), where \(m=\text {rank}(\boldsymbol {P}_{S_{k}})= |\mathcal {F}_{c}\cap S_{k}|+|\mathcal {F}_{v}\cap S_{k}|(L-1)\). Notice that |Sk|≤ 2k ≤ 2M, and \(M\leq \nu _{m}n^{\xi }\) for some νm > 0, since M = O(nξ). Thus, we have \(n^{-1}m\log n \leq n^{-1}2ML\log n \leq 2C_{L}\nu _{m} n^{\xi +1/5-1}\log n \rightarrow 0\) as \(n\rightarrow \infty \), and we get \(\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}/n=o_{P}(1)\). Since \(\|\boldsymbol {\epsilon }\|^{2}/n\xrightarrow {P}\sigma ^{2}\), we obtain \(\hat {\sigma }_{S_{k}}^{2}=\sigma ^{2}+o_{P}(1)\). We can also show that \(\hat {\sigma }_{S_{k}\cup \{l\}}^{2}=\sigma ^{2}+o_{P}(1)\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\) in the same way. Hence, we have \(\hat {\sigma }_{S_{k}}^{2}-\hat {\sigma }_{S_{k}\cup \{l\}}^{2}=o_{P}(1)\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\). Now, let \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Since \(-1<-(\hat \sigma ^{2}_{S_{k}}-\hat \sigma ^{2}_{S_{k}\cup \{j^{*}\}})(\hat \sigma ^{2}_{S_{k}})^{-1}\) and \(\log (1+x)\geq x(1+x)^{-1}\) for all x > − 1, we get

$$ \begin{array}{@{}rcl@{}} &&\text{EBIC}(S_{k}\cup\{j^{*}\})-\text{EBIC}(S_{k})\\ &&=n \log\left( 1-\frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right) +(\log n + 2 \eta \log p)\\ &&\geq \left( -\frac{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right)\left( 1-\frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right)^{-1}+(\log n + 2 \eta \log p)\\ &&=\left( -\frac{\max_{l \in \mathcal{F}\cap {S_{k}^{c}}}\{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}}\}}{\hat\sigma^{2}_{S_{k}}}\right)\left( 1-\frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right)^{-1}+(\log n + 2 \eta \log p). \end{array} $$
(A.40)
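The two facts driving this step, namely \(\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}\) growing only like its rank and \(\|\boldsymbol {\epsilon }\|^{2}/n\xrightarrow {P}\sigma ^{2}\), can be seen in a quick simulation. Proposition 3 of Zhang (2010) covers more general errors; the sketch below is only the Gaussian special case, where \(\|\boldsymbol {P}\boldsymbol {\epsilon }\|^{2}\) is exactly σ² times a chi-squared variable with m degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 5000, 20, 1.5
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))   # orthonormal columns
eps = sigma * rng.standard_normal(n)

proj_sq = np.sum((Q.T @ eps) ** 2)   # ||P eps||^2 = sigma^2 * chi^2_m here
print(proj_sq / (sigma**2 * m))      # ~ 1: grows with the rank m only
print(np.sum(eps**2) / n)            # ~ sigma^2 = 2.25
```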

Note that \(\|\boldsymbol {H}_{lS_{k}}\boldsymbol {Y}\|^{2} \leq 2\|\boldsymbol {H}_{lS_{k}}\boldsymbol {\delta }\|^{2} +2\|\boldsymbol {H}_{lS_{k}}\boldsymbol {\epsilon }\|^{2}\). From Eq. A.22 and Proposition 3 of Zhang (2010), we have

$$ \max_{l \in \mathcal{F}\cap {S_{k}^{c}}} \{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}} \} = O_{P}(n^{1/5}\log n). $$
(A.41)

The first term of the right hand side in the last equality of Eq. A.40 is \(O_{P}(n^{1/5}\log n)\) by Eq. A.41 because its denominator converges to one in probability. On the other hand, the second term in the same equality is \(O_{P}(n^{\xi _{0}})\), and \(O_{P}(n^{1/5-\xi _{0}}\log n)=o_{P}(1)\). Hence, we have

$$ \begin{array}{@{}rcl@{}} \text{EBIC}(S_{k}\cup\{j^{*}\})-\text{EBIC}(S_{k}) &\geq& (\log n +2\eta \log p)\{1- O_{P}(n^{1/5-\xi_{0}}\log n)\}\\ &=& (\log n +2\eta \log p )\{1-o_{P}(1)\}, \end{array} $$
(A.42)

when \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Next, let \(j^{*} \in \mathcal {F}_{v} \cap {S_{k}^{c}}\). By the same argument, we get

$$ \begin{array}{@{}rcl@{}} &&\text{EBIC}(S_{k}\cup\{j^{*}\})-\text{EBIC}(S_{k}) \\ &&\geq \left( -\frac{\max_{l \in \mathcal{F}\cap {S_{k}^{c}}}\{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}}\}}{\hat\sigma^{2}_{S_{k}}}\right)\left( 1-\frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right)^{-1}\\ &&\quad +(L-1)(\log n + 2 \eta \log p) \\ &&=(L-1)(\log n +2\eta \log p)\{1-O_{P}(n^{-\xi_{0}}\log n)\}\\ &&=(L-1)(\log n +2\eta \log p)\{1-o_{P}(1)\} . \end{array} $$
(A.43)

Eqs. A.42 and A.43 imply

$$ \text{EBIC}(S_{k}\cup\{j^{*}\})-\text{EBIC}(S_{k})>0 \ \ \text{for } j^{*}\in \mathcal{F}\cap {S_{k}^{c}}, $$
(A.44)

with probability tending to one. Eq. A.44 implies that our forward selection procedure stops variable selection at the kth step. This completes the proof of Theorem 3. □

About this article

Cite this article

Shinkyu, A. Forward Selection for Feature Screening and Structure Identification in Varying Coefficient Models. Sankhya A 85, 485–511 (2023). https://doi.org/10.1007/s13171-021-00261-4
