Abstract
Varying coefficient models are flexible and interpretable, and they are widely used in data analysis. Although feature screening procedures have been proposed for ultra-high dimensional varying coefficient models, none of these procedures includes structure identification. That is, existing feature screening procedures for varying coefficient models do not tell us which of the selected covariates have constant coefficients and which have non-constant coefficients. Hence, these procedures do not explicitly select partially linear varying coefficient models, which are much simpler than general varying coefficient models. Motivated by this issue, we propose a forward selection procedure for simultaneous feature screening and structure identification in varying coefficient models. Unlike existing feature screening procedures, our method classifies all covariates into three groups: covariates with constant coefficients, covariates with non-constant coefficients, and covariates with zero coefficients. Thus, our procedure can explicitly select partially linear varying coefficient models. Our procedure selects covariates sequentially until the extended BIC (EBIC) increases. We establish screening consistency for our method under some conditions. Numerical studies and real data examples support the utility of our procedure.
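The selection loop described in the abstract can be sketched in code. The following Python sketch is only a schematic illustration, not the paper's implementation: it uses a polynomial basis as a stand-in for the spline basis, plain least squares, and the EBIC form of Chen and Chen (2008); the function names (`ebic`, `rss`, `forward_select`) and the simplified candidate search are illustrative assumptions.

```python
import numpy as np

def ebic(rss_val, df, n, p, gamma=1.0):
    # EBIC (Chen and Chen, 2008): n*log(RSS/n) + df*(log n + 2*gamma*log p)
    return n * np.log(rss_val / n) + df * (np.log(n) + 2.0 * gamma * np.log(p))

def rss(W, Y):
    # residual sum of squares of the least-squares fit of Y on the columns of W
    beta = np.linalg.lstsq(W, Y, rcond=None)[0]
    return float(np.sum((Y - W @ beta) ** 2))

def forward_select(Y, X, Z, L=4, gamma=1.0):
    """Greedily add covariates, classifying each one as having a constant
    coefficient ('c') or a varying coefficient ('v'); stop once EBIC rises."""
    n, p = X.shape
    basis = np.vander(Z, L, increasing=True)   # 1, z, ..., z^{L-1}: stand-in basis
    cols, selected = [np.ones((n, 1))], {}
    best = ebic(rss(np.hstack(cols), Y), 1, n, p, gamma)
    while len(selected) < p:
        cand = None
        for j in range(p):
            if j in selected:
                continue
            for kind in ("c", "v"):
                new = X[:, [j]] if kind == "c" else X[:, [j]] * basis
                W = np.hstack(cols + [new])
                crit = ebic(rss(W, Y), W.shape[1], n, p, gamma)
                if cand is None or crit < cand[0]:
                    cand = (crit, j, kind, new)
        crit, j, kind, new = cand
        if crit >= best:                       # EBIC increased: stop selecting
            break
        best, selected[j] = crit, kind
        cols.append(new)
    return selected
```

At each step the covariate-type pair that most improves the EBIC is added, so a covariate enters either as a constant-coefficient or a varying-coefficient term; the unselected covariates form the zero-coefficient group, mirroring the three-group classification above.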
Availability of Data and Material
The data and material used in this paper are available from the public data repository at http://doi.org/10.18129/B9.bioc.seventyGeneData.
References
Breheny, P. and Huang, J. (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat. Comput. 25, 173–187.
Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771.
Cheng, M.Y., Feng, S., Li, G. and Lian, H. (2018). Greedy forward regression for variable screening. Austral. New Zealand J. Stat. 60, 2–42.
Cheng, M.Y., Honda, T., Li, J. and Peng, H. (2014). Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Ann. Stat. 42, 1819–1849.
Cheng, M.Y., Honda, T. and Zhang, J.T. (2016). Forward variable selection for sparse ultra-high dimensional varying coefficient models. J. Am. Stat. Assoc. 111, 1209–1221.
Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc. 106, 544–557.
Fan, J., Ma, Y. and Dai, W. (2014). Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J. Am. Stat. Assoc. 109, 1270–1284.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360.
Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space. J. R. Stat. Soc. Series B 70, 849–911.
Greene, W.H. (2012). Econometric Analysis, 7th edn. Pearson Education, Harlow.
Honda, T., Ing, C.K. and Wu, W.Y. (2019). Adaptively weighted group Lasso for semiparametric quantile regression models. Bernoulli 25, 3311–3338.
Honda, T. and Lin, C.-T. (2021). Forward variable selection for sparse ultra-high dimensional generalized varying coefficient models. Japanese J. Stat. Data Sci. 4, 151–179.
Honda, T. and Yabe, R. (2017). Variable selection and structure identification for varying coefficient Cox models. J. Multivar. Anal. 161, 103–122.
Horn, R.A. and Johnson, C.R. (2013). Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge.
Huber, W., Carey, V.J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B.S., Bravo, H.C., Davis, S., Gatto, L., Girke, T., Gottardo, R., Hahne, F., Hansen, K.D., Irizarry, R.A., Lawrence, M., Love, M.I., MacDonald, J., Obenchain, V., Oleś, A.K., Pagès, H., Reyes, A., Shannon, P., Smyth, G.K., Tenenbaum, D., Waldron, L. and Morgan, M. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121.
Josse, J. and Husson, F. (2016). missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Softw. 70, 1–31.
Lee, E.R., Noh, H. and Park, B.U. (2014). Model selection via Bayesian information criterion for quantile regression models. J. Am. Stat. Assoc. 109, 216–229.
Li, G., Peng, H., Zhang, J. and Zhu, L. (2012a). Robust rank correlation based screening. Ann. Stat. 40, 1846–1877.
Li, R., Zhong, W. and Zhu, L. (2012b). Feature screening via distance correlation learning. J. Am. Stat. Assoc. 107, 1129–1139.
Liu, J., Li, R. and Wu, R. (2014). Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J. Am. Stat. Assoc. 109, 266–274.
Liu, J.Y., Zhong, W. and Li, R.Z. (2015). A selective overview of feature screening for ultrahigh-dimensional data. Sci. China Math. 58, 2033–2054.
Marchionni, L., Afsari, B., Geman, D. and Leek, J.T. (2013). A simple and reproducible breast cancer prognostic test. BMC Genomics 14.
Luo, S. and Chen, Z. (2014). Sequential Lasso cum EBIC for feature selection with ultra-high dimensional feature space. J. Am. Stat. Assoc. 109, 1229–1240.
Mai, Q. and Zou, H. (2015). The fused Kolmogorov filter: a nonparametric model-free screening method. Ann. Stat. 43, 1471–1497.
Serban, N. (2011). A space-time varying coefficient model: the equity of service accessibility. Ann. Appl. Stat. 5, 2024–2051.
Song, R., Yi, F. and Zou, H. (2014). On varying-coefficient independence screening for high-dimensional varying-coefficient models. Stat. Sin. 24, 1735–1752.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Series B 58, 267–288.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
van’t Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R. and Friend, S.H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.
Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. J. Am. Stat. Assoc. 104, 1512–1524.
Wang, H. and Xia, Y. (2009). Shrinkage estimation of the varying coefficient model. J. Am. Stat. Assoc. 104, 747–757.
Wang, L., Li, H. and Huang, J.Z. (2008). Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J. Am. Stat. Assoc. 103, 1556–1569.
Wang, K. and Lin, L. (2016). Robust structure identification and variable selection in partial linear varying coefficient models. J. Stat. Plann. Inf. 174, 153–168.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B 68, 49–67.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942.
Zhong, W., Duan, S. and Zhu, L. (2020). Forward additive regression for ultrahigh-dimensional nonparametric additive models. Stat. Sin. 30, 175–192.
Zhu, L.P., Li, L., Li, R. and Zhu, L.X. (2011). Model-free feature screening for ultra-high dimensional data. J. Am. Stat. Assoc. 106, 1464–1475.
Acknowledgements
The author would like to thank Professor Toshio Honda for providing code to generate the orthonormal basis and for giving helpful advice. He is also grateful to the Editor and the three anonymous reviewers for their constructive comments, which significantly improved the presentation of this paper.
Funding
The author received no financial support for this research.
Ethics declarations
Conflict of Interests
The author declares no conflict of interest.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Proofs
To prove the main theorems, we need Lemma A.1.
Lemma A.1.
Suppose that Assumptions 1-3 hold with \(M=O(n^{\xi })\) for some \(0<\xi <1/10\). Then, there exists some sequence of positive numbers \(\{\delta _{1n}\}\) tending to zero sufficiently slowly such that, with probability tending to one, we have
for any \(S\subset \mathcal {F}\) satisfying |S|≤ M.
Proof of Lemma A.1.
Note that there exists some positive constant \(\alpha \) such that \(\xi +\alpha <1/10\). Let \(\delta _{1n}=n^{-\alpha }\). We will show
Note that the eigenvalues of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and \(\boldsymbol {{\varSigma }}_{S}\) are nonnegative because these matrices are positive semidefinite. We also note that, for any positive semidefinite matrix, the kth largest singular value is equal to the kth largest eigenvalue. Hence, from Corollary 7.3.5 of Horn and Johnson (2013), we get
where \(\{\lambda _{k}(\widehat {\boldsymbol {{\varSigma }}}_{S})\}\) and \(\{\lambda _{k}(\boldsymbol {{\varSigma }}_{S})\}\) are the eigenvalues of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and \(\boldsymbol {{\varSigma }}_{S}\), respectively, arranged in nonincreasing order. Thus, we obtain
Note that the numbers of rows and columns of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and \(\boldsymbol {{\varSigma }}_{S}\) are both equal to \(|\mathcal {F}_{c}\cap S|+|\mathcal {F}_{v}\cap S|(L-1)\). Since \(|\mathcal {F}_{v}\cap S|\leq M\) and \(|\mathcal {F}_{c} \cap S|\leq M\), we have \(\|\widehat {\boldsymbol {{\varSigma }}}_{S}-\boldsymbol {{\varSigma }}_{S}\|\leq ML\|\widehat {\boldsymbol {{\varSigma }}}_{S}-\boldsymbol {{\varSigma }}_{S}\|_{\infty }\) by the Cauchy-Schwarz inequality, and
for any S satisfying |S|≤ M. Under Assumption 3, Lemmas A.1 and A.2 of Fan et al. (2014) hold. Thus, for k, j = 1,…,p, s, l = 1,…,L, and m ≥ 2, we have
where K is a positive constant. Recall that B(z) = A0B0(z), and \(C_{3}\leq \lambda _{\max \limits }({\boldsymbol {A}_{0}^{T}}\boldsymbol {A}_{0})\leq C_{4}\) for some positive constants C3 and C4. See Appendix A of Honda and Yabe (2017). Hence, for l = 1,…,L, we can denote \(B_{l}(z)={\sum }_{s=1}^{L}a_{ls}B_{0s}(z)\) where als is the (l, s) element of A0. Note that \({\sum }_{k=1}^{L}B_{0k}(z)=1\) due to the properties of the B-spline basis. Therefore, for s, l = 1,…,L, we obtain
As a result, we get \(E[|B_{s}(Z_{i})B_{l}(Z_{i})|^{m}] \leq {C_{4}}^{m}\), and the right hand side of the last inequality in Eq. A.6 can be bounded by \((2KC_{4})^{m-2}\,m!\,8K^{2}{C_{4}}^{2}e/2\) for any m ≥ 2. Thus, from Bernstein’s inequality (see Lemma 2.2.11 of van der Vaart and Wellner (1996)), we get
Since \(M=O(n^{\xi })\), we have \(M\leq \nu _{m}n^{\xi }\) for some \(\nu _{m}>0\). Thus, the right hand side of Eq. A.8 can be bounded by
where \(K^{*}=\max \limits \{16K^{2}{C_{4}^{2}}e{\nu _{m}^{2}}{C_{L}^{4}}\tau _{\min \limits }^{-1}, 4KC_{4}\nu _{m} {C_{L}^{2}}\}\). Let \(A=\{j: 1_{j} \in S\}\). Note that \(\{j: -1_{j} \in S\}\subset \{j: 1_{j} \in S\}\). Since \(|A|\leq |S|\leq M\), the right hand side of Eq. A.5 can be further bounded by
The right hand side of the last inequality in Eq. A.10 converges to zero as \(n \rightarrow \infty \). Thus, Eq. A.2 holds. We can also get
in the same way. Eqs. A.2, A.11, and Assumption 2 imply
for any S satisfying |S|≤ M. This completes the proof of Lemma A.1. □
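The step in the proof that passes from Corollary 7.3.5 of Horn and Johnson (2013) to a bound on the eigenvalue differences uses the standard perturbation fact that each ordered eigenvalue of a symmetric matrix moves by at most the spectral norm of the perturbation. A quick numerical check (the matrices below are arbitrary illustrative examples):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T                                  # positive semidefinite "truth"
E = 1e-2 * rng.standard_normal((p, p))
E = (E + E.T) / 2                                # symmetric perturbation
Sigma_hat = Sigma + E                            # perturbed estimate

lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]        # nonincreasing order
lam_hat = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]
spec_norm = np.linalg.norm(E, 2)                      # spectral norm of E

# Weyl's inequality: every ordered eigenvalue moves by at most ||E||_2
assert np.all(np.abs(lam_hat - lam) <= spec_norm + 1e-12)
```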
Corollary A.1 follows immediately from Lemma A.1.
Corollary A.1.
Suppose that Assumptions 1-3 hold with \(M=O(n^{\xi })\) for some \(0<\xi <1/10\). Then, with probability tending to one, we have
for any \(S\subset \mathcal {F}\) and \(E \subset \mathcal {F}\) satisfying |E ∪ S|≤ M, where QS is the orthogonal projection defined in Eq. A.20.
Proof of Corollary A.1.
Note that \(\boldsymbol {W}_{E\cup S}=(\boldsymbol {W}_{E}, \boldsymbol {W}_{S})\). By (A-74) in Greene (2012), \(({\boldsymbol {W}_{E}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{E})^{-1}\) is the upper left block of \((\boldsymbol {W}_{E\cup S}^{T}\boldsymbol {W}_{E\cup S})^{-1}\). Notice that \(\lambda _{\max \limits }(\boldsymbol {A}^{-1})=\lambda _{\min \limits }(\boldsymbol {A})^{-1}\) and \(\lambda _{\min \limits }(\boldsymbol {A}^{-1})=\lambda _{\max \limits }(\boldsymbol {A})^{-1}\) for any symmetric positive definite matrix \(\boldsymbol {A}\). Hence, we get
and
By Lemma A.1, we have
with probability tending to one. Hence, the proof of Corollary A.1 is completed. □
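The partitioned-inverse identity cited from (A-74) of Greene (2012) — that \(({\boldsymbol {W}_{E}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{E})^{-1}\) is the upper left block of the inverse of the full Gram matrix — can be verified directly with random design matrices (the dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, pE, pS = 50, 3, 4
WE = rng.standard_normal((n, pE))
WS = rng.standard_normal((n, pS))
W = np.hstack([WE, WS])                           # W_{E ∪ S} = (W_E, W_S)

# Q_S: orthogonal projection onto the orthogonal complement of span(W_S)
QS = np.eye(n) - WS @ np.linalg.inv(WS.T @ WS) @ WS.T

upper_left = np.linalg.inv(W.T @ W)[:pE, :pE]     # upper left block of (W^T W)^{-1}
assert np.allclose(upper_left, np.linalg.inv(WE.T @ QS @ WE))
```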
Proof of Theorem 1.
We can prove Theorem 1 in almost the same way as in Cheng et al. (2016). However, there are some differences because we also consider structure identification, so we give the proof of Theorem 1 below. First, we can represent Eq. 2.1 by
where
as in Honda et al. (2019). If \(\boldsymbol {\gamma }_{j}\) is suitably chosen, there exists some positive constant \(C_{g}^{\prime }\) such that
under Assumption 4. Note that \(|g_{cj}|^{2}=|\gamma _{1j}|^{2}L^{-1}\) and \(\|g_{vj}\|^{2}=\|\boldsymbol {\gamma }_{-1j}\|^{2}L^{-1}+O(L^{-4})\). See Appendix A of Honda and Yabe (2017) for properties of the orthonormal basis. Then, there exist some positive constants \(C_{1}\) and \(C_{2}\) such that \(C_{1}L \|g_{vj}\|^{2} \leq \|{\boldsymbol {\gamma }}_{-1j}\|^{2} \leq C_{2}L\|g_{vj}\|^{2}\). Under Assumption 6, we have
The matrix form of Eq. A.15 is represented by
Let us denote
for any S satisfying \(|S \cup \mathcal {S}_{0}|\leq M\) and \(\mathcal {S}_{0} \not \subset S\). We can easily show that \(\boldsymbol {H}_{lS}\boldsymbol {Q}_{S}=\boldsymbol {H}_{lS}\) and \(n\hat {\sigma }^{2}_{S} - n\hat {\sigma }^{2}_{S(l)} =\|\boldsymbol {H}_{lS} \boldsymbol {Y}\|^{2}\). From the triangle inequality, we have
We evaluate each part in the right hand side of Eq. A.21. Since HlS is a projection matrix, we have
Hence, we have \(\|\boldsymbol {H}_{lS}\boldsymbol {\delta }\|=O_{P}(n^{1/10})\). Next, we evaluate the third term in Eq. A.21. Under Assumption 5, Proposition 3 of Zhang (2010) holds. Thus, we have
for all x > 0 and for any \(l\in \mathcal {F}\cap S^{c}\), where m = rank(HlS). We take \(x=\log n\). If \(l\in \mathcal {F}_{c}\), then rank(HlS) = 1 and \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2} =O_{P}(\log n)\), which implies \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2}=O_{P}(L\log n)\). If \(l \in \mathcal {F}_{v}\), then rank(HlS) = (L − 1) and \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2} =O_{P}(L\log n)\). Thus, we have
Next, we evaluate \(\max \limits _{l \in \mathcal {S}_{0} \cap S^{c}}\| \boldsymbol {H}_{lS}\boldsymbol {W}_{\mathcal {S}_{0}}\boldsymbol {\gamma }_{\mathcal {S}_{0}}\|\). Notice that
Note that \(\lambda _{\min \limits }((\widetilde {\boldsymbol {W}}_{lS}^{T} \widetilde {\boldsymbol {W}}_{lS})^{-1})=\lambda _{\min \limits }(({\boldsymbol {W}_{l}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{l})^{-1})\) and |{l}∪ S|≤ M. From Corollary A.1, we get
with probability tending to one. We evaluate \(\max \limits _{l \in \mathcal {S}_{0}\cap S^{c}} \|\widetilde {\boldsymbol {W}}_{lS}^{T}\boldsymbol {W}_{\mathcal {S}_{0}}\boldsymbol {\gamma }_{\mathcal {S}_{0}} \|^{2} \). We can see that
Notice that \(|g_{cj}|^{2}+\|g_{vj}\|^{2}_{2}=\|g_{j}\|^{2}_{2} \leq \|g_{j}\|_{\infty }^{2}\). From the Cauchy-Schwarz inequality, we get
with probability tending to one. Thus, we get
Since \(|\mathcal {S}_{0}\cup S|\leq M\), we have
by Corollary A.1. Notice that
with probability tending to one. Note that \(\min \{\max _{j \in \mathcal {S}_{c}}|g_{cj}|^{2}, \max _{j \in \mathcal {S}_{v}}\|g_{vj}\|_{2}^{2}\}>C_{\min }\) under Assumption 4. From Eqs. A.26, A.28, and A.30, we get
with probability tending to one. Equation A.31 implies that the first term in the right hand side of Eq. A.21 dominates the other two terms. Thus, we have
with probability tending to one. By taking \(\delta _{2n}=1-(1-\delta _{1n})^{2}(1+\delta _{1n})^{-1}\), we get
for any S satisfying \(|S\cup \mathcal {S}_{0}|\leq M\) with probability tending to one, and the Proof of Theorem 1 is completed. □
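The bounds \(\|\boldsymbol {H}_{lS}\boldsymbol {\epsilon }\|^{2}=O_{P}(\log n)\) and \(O_{P}(L\log n)\) obtained from Proposition 3 of Zhang (2010) reflect the fact that, for Gaussian noise, the squared norm of a rank-m orthogonal projection of \(\boldsymbol {\epsilon }\) is exactly \(\chi ^{2}_{m}\), with mean m. A small Monte Carlo check (the sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, reps = 40, 5, 4000
B = rng.standard_normal((n, m))
H = B @ np.linalg.inv(B.T @ B) @ B.T       # orthogonal projection of rank m
assert np.allclose(H @ H, H)               # idempotent, as a projection must be

eps = rng.standard_normal((n, reps))       # standard Gaussian noise vectors
sq = np.sum((H @ eps) ** 2, axis=0)        # ||H eps||^2 over the replications

# ||H eps||^2 ~ chi^2_m, so its average should be close to m = rank(H)
assert np.isclose(sq.mean(), m, rtol=0.1)
```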
Proof of Theorem 2.
Assume Theorem 2 is false. Then, we have \(\mathcal {S}_{v} \not \subset F_{k}\) or \(\mathcal {S}_{c} \not \subset E_{k}\) for \(k=1,\dots ,T^{*}\). This implies that \(\mathcal {S}_{0} \not \subset S_{k}\) and \(|S_{k} \cup \mathcal {S}_{0}|\leq M\) for \(k=1,\dots ,T^{*}\). Note that we have \(n^{-1}{\sum }_{i=1}^{n}{Y_{i}^{2}} \geq \hat {\sigma }_{S_{k}(l)}^{2}\) and \(\hat \sigma ^{2}_{S_{k}}-\hat \sigma ^{2}_{S_{k}\cup \{l\}} \leq n^{-1}{\sum }_{i=1}^{n}{Y_{i}^{2}}\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\), and \(\log (1+x)\geq x(1+x)^{-1}\) for all \(x>-1\). Hence, we have
for any \(l \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Hence, if \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\), we obtain
since \(j^{*}=\arg \min \limits _{l \in \mathcal {F}\cap {S_{k}^{c}}} \hat {\sigma }^{2}_{S_{k}\cup \{l\}}\). If \(j^{*} \in \mathcal {F}_{v} \cap {S_{k}^{c}}\), we can also show that
in the same way. The second terms of the right hand side in Eqs. A.35 and A.36 are o(n) under Assumption 1. In addition, we have \(\frac {1}{n}{\sum }_{i=1}^{n}{Y_{i}^{2}} \xrightarrow {P} E[Y^{2}]\). Therefore, Theorem 1 implies that the first terms of the right hand side in Eqs. A.35 and A.36 dominate the second terms, and we get
with probability tending to one. Equation A.37 implies that our forward selection procedure selects some index at each step \(k=1,\dots ,T^{*}+1\). Thus, \(\{S_{k} \}_{k=1}^{T^{*}+1}\) is a monotone increasing sequence and \(\{\hat \sigma ^{2}_{S_{k}}\}_{k=1}^{T^{*}+1}\) is a nonincreasing sequence. Note that
where \(\bar {Y}=n^{-1}{\sum }_{i=1}^{n}Y_{i}\). Theorem 1 implies \(\text {Var}(Y)\geq {\sum }_{k=1}^{T^{*}}(1-\delta _{2n})D^{*} = T^{*}(1-\delta _{2n})D^{*}\) with probability tending to one. Thus, we have
with probability tending to one, and a contradiction occurs. This completes the proof. □
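The elementary inequality \(\log (1+x)\geq x(1+x)^{-1}\) for \(x>-1\), used at several points in this proof, is easy to confirm numerically on a grid:

```python
import numpy as np

x = np.linspace(-0.99, 50.0, 100_001)             # grid covering part of x > -1
assert np.all(np.log1p(x) >= x / (1.0 + x) - 1e-12)
assert np.log1p(0.0) == 0.0                        # equality holds at x = 0
```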
Proof of Theorem 3.
Since \(\mathcal {S}_{0} \subset S_{k}\), we have
We can see that the first term and the second term of the right hand side are oP(1) by Eq. A.22 and the Cauchy-Schwarz inequality. Note that \(\|\boldsymbol {Q}_{S_{k}}\boldsymbol {\epsilon }\|^{2}=\|\boldsymbol {\epsilon }\|^{2}-\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}\), and \(\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}=O_{P}(m\log n)\) by Proposition 3 of Zhang (2010), where \(m=\text {rank}(\boldsymbol {P}_{S_{k}})= |\mathcal {F}_{c}\cap S_{k}|+|\mathcal {F}_{v}\cap S_{k}|(L-1)\). Notice that \(|S_{k}|\leq 2k \leq 2M\), and \(M\leq \nu _{m}n^{\xi }\) for some \(\nu _{m}>0\), since \(M=O(n^{\xi })\). Thus, we have \(n^{-1}m\log n \leq n^{-1}2ML\log n \leq 2C_{L}\nu _{m} n^{\xi +1/5-1}\log n \rightarrow 0\) as \(n\rightarrow \infty ,\) and we get \(\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}/n=o_{P}(1)\). Since \(\|\boldsymbol {\epsilon }\|^{2}/n\xrightarrow {P}\sigma ^{2}\), we obtain \(\hat {\sigma }_{S_{k}}^{2}=\sigma ^{2}+o_{P}(1)\). We can also show that \(\hat {\sigma }_{S_{k}\cup \{l\}}^{2}=\sigma ^{2}+o_{P}(1)\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\) in the same way. Hence, we have \(\hat {\sigma }_{S_{k}}^{2}-\hat {\sigma }_{S_{k}\cup \{l\}}^{2}=o_{P}(1)\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\). Now, let \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Since \(-1<-(\hat \sigma ^{2}_{S_{k}}-\hat \sigma ^{2}_{S_{k}\cup \{j^{*}\}})(\hat \sigma ^{2}_{S_{k}})^{-1}\) and \(\log (1+x)\geq x(1+x)^{-1}\) for all \(x>-1\), we get
Note that \(\|\boldsymbol {H}_{lS_{k}}\boldsymbol {Y}\|^{2} \leq 2\|\boldsymbol {H}_{lS_{k}}\boldsymbol {\delta }\|^{2} +2\|\boldsymbol {H}_{lS_{k}}\boldsymbol {\epsilon }\|^{2}\). From Eq. A.22 and Proposition 3 of Zhang (2010), we have
The first term of the right hand side in the last equality of Eq. A.41 is \(O_{P}(n^{1/5}\log n)\) because its denominator converges to one in probability. On the other hand, the second term in the same equality is \(O_{P}(n^{\xi _{0}})\), and \(O_{P}(n^{1/5-\xi _{0}}\log n)=o_{P}(1)\). Hence, we have
when \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Next, let \(j^{*} \in \mathcal {F}_{v} \cap {S_{k}^{c}}\). By the same argument, we get
with probability tending to one. Equation A.44 implies that our forward selection procedure stops variable selection at the kth step. This completes the proof of Theorem 3. □
Cite this article
Shinkyu, A. Forward Selection for Feature Screening and Structure Identification in Varying Coefficient Models. Sankhya A 85, 485–511 (2023). https://doi.org/10.1007/s13171-021-00261-4