Beyond support in two-stage variable selection

Abstract

Numerous variable selection methods rely on a two-stage procedure, where a sparsity-inducing penalty is used in the first stage to predict the support, which is then conveyed to the second stage for estimation or inference purposes. In this framework, the first stage screens variables to find a set of possibly relevant variables, and the second stage operates on this set of candidate variables to improve estimation accuracy or to assess the uncertainty associated with the selection of variables. We advocate that more information can be conveyed from the first stage to the second: we use the magnitude of the coefficients estimated in the first stage to define an adaptive penalty that is applied at the second stage. We give the example of an inference procedure that benefits greatly from the proposed transfer of information. The procedure is precisely analyzed in a simple setting, and our large-scale experiments empirically demonstrate that actual benefits can be expected in much more general situations, with sensitivity gains ranging from 50 to 100% compared to the state of the art.
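
To fix ideas, the following is a minimal sketch of the screen-then-clean pipeline described above, assuming a standard linear model and using scikit-learn's Lasso for the screening stage; the function name, tuning parameters, and the plain adaptive-ridge refit used here for the cleaning stage are illustrative only, and the data splitting and inference machinery of the full procedure are omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def screen_then_clean(X, y, lam_screen=0.1, lam_clean=1.0):
    """Illustrative two-stage estimate: Lasso screening, adaptive-ridge cleaning.

    The penalty applied to variable j at the cleaning stage is
    lam_clean / |beta_j^screen|, so the magnitudes estimated at the
    screening stage (not only the support) are passed on to stage two.
    """
    # Stage 1: a sparsity-inducing penalty screens the variables.
    screening = Lasso(alpha=lam_screen).fit(X, y)
    support = np.flatnonzero(screening.coef_)
    tau = np.abs(screening.coef_[support])   # transferred magnitudes

    # Stage 2: adaptive ridge on the selected variables,
    # beta = (X_S' X_S + Lambda)^{-1} X_S' y with Lambda = diag(lam_clean / tau).
    X_S = X[:, support]
    Lam = np.diag(lam_clean / tau)
    beta = np.linalg.solve(X_S.T @ X_S + Lam, X_S.T @ y)
    return support, beta
```

In screen-and-clean inference procedures, the two stages typically operate on separate portions of the data; the sketch omits this and only illustrates how the first-stage magnitudes define the adaptive penalty of the second stage.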

Notes

  1. In their two-stage procedure, Liu and Yu (2013) also proposed to construct confidence regions and to conduct hypothesis tests by bootstrapping residuals. Their approach fundamentally differs from that of Wasserman and Roeder (2009), in that inference does not rely on the two-stage procedure itself, but on the properties of the estimator obtained at the second stage.

  2. Though many sparsity-inducing penalties, such as the Elastic-Net, the group-Lasso, or the fused-Lasso, lend themselves to the approach proposed here, the paper is restricted to the simple Lasso penalty.

  3. Note that there is no finite-sample exact permutation test in multiple linear regression (Anderson and Robinson 2001). A test based on partial residuals (under the null hypothesis regression model) is asymptotically exact for unpenalized regression, but it does not apply to penalized regression.

References

  • Ambroise, C., McLachlan, G.J.: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. 99(10), 6562–6566 (2002)

  • Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol. 11(10), R106 (2010)

  • Anderson, M.J., Robinson, J.: Permutation tests for linear models. Austral. N. Z. J. Stat. 43(1), 75–88 (2001)

  • Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2012)

  • Balding, D.: A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 7(10), 781–791 (2006)

  • Belloni, A., Chernozhukov, V.: Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2), 521–547 (2013)

  • Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57(1), 289–300 (1995)

  • Boulesteix, A.L., Schmid, M.: Machine learning versus statistical modeling. Biom. J. 56, 588–593 (2014)

  • Bühlmann, P.: Statistical significance in high-dimensional linear models. Bernoulli 19, 1212–1242 (2013)

  • Candès, E., Tao, T.: The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann. Stat. 35, 2313–2351 (2007)

  • Chatterjee, A., Lahiri, S.N.: Rates of convergence of the adaptive lasso estimators to the oracle distribution and higher order refinements by the bootstrap. Ann. Stat. 41(3), 1232–1259 (2013)

  • Chong, I.G., Jun, C.H.: Performance of some variable selection methods when multicollinearity is present. Chemom. Intel. Lab. Syst. 78(1–2), 103–112 (2005)

  • Cule, E., Vineis, P., De Iorio, M.: Significance testing in ridge regression for genetic data. BMC Bioinf. 12(372), 1–15 (2011)

  • Dalmasso, C., Carpentier, W., Meyer, L., Rouzioux, C., Goujard, C., Chaix, M.L., Lambotte, O., Avettand-Fenoel, V., Le Clerc, S., Denis de Senneville, L., Deveau, C., Boufassa, F., Debre, P., Delfraissy, J.F., Broet, P., Theodorou, I.: Distinct genetic loci control plasma HIV-RNA and cellular HIV-DNA levels in HIV-1 infection: the ANRS genome wide association 01 study. PLoS One 3(12), e3907 (2008)

  • Dudoit, S., Van der Laan, M.: Multiple Testing Procedures with Applications to Genomics. Springer, New York (2008)

  • Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)

  • Grandvalet, Y.: Least absolute shrinkage is equivalent to quadratic penalization. In: Niklasson L, Bodén M, Ziemske T (eds) ICANN’98, Perspectives in Neural Computing, vol 1, Springer, New York, pp. 201–206 (1998)

  • Grandvalet, Y., Canu, S.: Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. In: Kearns MS, Solla SA, Cohn DA (eds) Advances in Neural Information Processing Systems 11 (NIPS 1998), MIT Press, Cambridge, pp. 445–451 (1999)

  • Halawa, A.M., El Bassiouni, M.Y.: Tests of regression coefficients under ridge regression models. J. Stat. Comput. Simul. 65(1), 341–356 (1999)

  • Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models, Monographs on Statistics and Applied Probability, vol. 43. Chapman & Hall, London (1990)

  • Huang, J., Horowitz, J.L., Ma, S.: Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Stat. 36(2), 587–613 (2008)

  • Kyung, M., Gill, J., Ghosh, M., Casella, G.: Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 5(2), 369–411 (2010)

  • Liu, H., Yu, B.: Asymptotic properties of lasso+mls and lasso+ridge in sparse high-dimensional linear regression. Electr. J. Stat. 7, 3124–3169 (2013)

  • Lockhart, R., Taylor, J., Tibshirani, R.J., Tibshirani, R.: A significance test for the lasso. Ann. Stat. 42(2), 413–468 (2014)

  • Meinshausen, N.: Relaxed lasso. Comput. Stat. Data Anal. 52(1), 374–393 (2007)

  • Meinshausen, N., Meier, L., Bühlmann, P.: \(p\)-values for high-dimensional regression. J. Am. Stat. Assoc. 104(488), 1671–1681 (2009)

  • Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K.A., Grill, J., Frouin, V.: Variable selection for generalized canonical correlation analysis. Biostatistics 15(3), 569–583 (2014)

  • Tibshirani, R.J.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  • Verzelen, N.: Minimax risks for sparse regressions: ultra-high dimensional phenomenons. Electr. J. Stat. 6, 38–90 (2012)

  • Wang, Y., Yang, J., Yin, W., Zhang, W.: A new alternating minimization algorithm for total variation image reconstruction. SIAM J. Imaging Sci. 1(3), 248–272 (2008)

  • Wasserman, L., Roeder, K.: High-dimensional variable selection. Ann. Stat. 37(5A), 2178–2201 (2009)

  • Xing, E.P., Jordan, M.I., Karp, R.M.: Feature selection for high-dimensional genomic microarray data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 601–608 (2001)

  • Zhang, C.H., Zhang, S.S.: Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B 76(1), 217–242 (2014)

Acknowledgments

This work was supported by the UTC Foundation for Innovation through the ToxOnChip program. It was carried out within the framework of the Labex MS2T (ANR-11-IDEX-0004-02), as part of the “Investments for the Future” program managed by the National Agency for Research.

Author information

Corresponding author

Correspondence to Yves Grandvalet.

Appendices

Variational equivalence

We show below that the quadratic penalty in \(\varvec{\beta }\) in Problem (2) acts as the Lasso penalty \(\lambda \left\| \varvec{\beta }\right\| _{1}\).

Proof

The Lagrangian of Problem (2) is:

$$\begin{aligned} L(\varvec{\beta }) = J(\varvec{\beta }) + \lambda \sum _{j=1}^p \frac{1}{\tau _{j}} \beta _{j}^{2} + \nu _0 \bigg ( \sum _{j=1}^p \tau _j -\left\| \varvec{\beta }\right\| _{1} \bigg ) - \sum _{j=1}^p \nu _j \tau _j. \end{aligned}$$

Thus, the first order optimality conditions for \(\tau _j\) are

$$\begin{aligned} \frac{\partial L}{\partial \tau _j}(\tau _j^\star ) = 0&\Leftrightarrow - \lambda \frac{\beta _{j}^2}{{\tau _j^\star }^2} + \nu _0 - \nu _j = 0 \\&\Leftrightarrow - \lambda \beta _{j}^2 + \nu _0 \,{\tau _j^\star }^2 - \nu _j \,{\tau _j^\star }^2 = 0 \\&\Rightarrow - \lambda \beta _{j}^2 + \nu _0 \,{\tau _j^\star }^2 = 0 , \end{aligned}$$

where the term in \(\nu _j\) vanishes due to complementary slackness, which implies here \(\nu _j \tau _j^\star = 0\). Together with the constraints of Problem (2), the last equation implies \(\tau _j^\star = \left| \beta _{j}\right| \), hence Problem (2) is equivalent to

$$\begin{aligned} \min _{\varvec{\beta }\in \mathbb {R}^p} J (\varvec{\beta }) + \lambda \left\| \varvec{\beta }\right\| _{1} , \end{aligned}$$

which is the original Lasso formulation. \(\square \)
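
As a quick numerical sanity check of this variational argument (not part of the original proof; plain NumPy, with arbitrary choices of \(\lambda \) and \(\varvec{\beta }\)), one can verify that the weighted quadratic penalty evaluated at \(\tau _j^\star = \left| \beta _{j}\right| \) coincides with the Lasso penalty, and that any other feasible \(\varvec{\tau }\) can only increase it:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(size=5)      # all entries nonzero almost surely
lam = 0.7

# At tau_j = |beta_j|, the penalty lam * sum_j beta_j^2 / tau_j
# reduces to the Lasso penalty lam * ||beta||_1.
tau_star = np.abs(beta)
quad_at_optimum = lam * np.sum(beta**2 / tau_star)
lasso_penalty = lam * np.sum(np.abs(beta))
assert np.isclose(quad_at_optimum, lasso_penalty)

# Any other feasible tau (positive, summing to ||beta||_1) is no better.
for _ in range(1000):
    tau = rng.dirichlet(np.ones(beta.size)) * np.sum(np.abs(beta))
    assert lam * np.sum(beta**2 / tau) >= quad_at_optimum - 1e-10
```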

Efficient implementation

Permutation tests rely on the simulation of numerous datasets sampled under the null hypothesis distribution. The number of replications must be large to estimate the rather extreme quantiles we are typically interested in. Here, we use \(B=1000\) replications for each of the \(q=|\mathscr {S}_{\hat{\lambda }}|\) variables selected at the screening stage. Since each replication involves fitting a model, the total computational cost of solving these B systems of size q for each of the q selected variables is \(O(Bq(q^{3}+q^{2}n))\). However, block-wise decompositions and inversions bring computational savings by a factor of q.

First, we recall that the adaptive-ridge estimate of the cleaning stage is computed as

$$\begin{aligned} {\,\hat{\varvec{\beta }}} = \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \mathbf {X}^{\top }\mathbf {y}, \end{aligned}$$

where \(\varvec{{\varLambda }}\) is the diagonal adaptive-penalty matrix defined at the screening stage, whose jth diagonal entry is \({\lambda }/{\tau _{j}^\star }\), as defined in (13).
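
For reference, a naive implementation of the permutation refits described next would recompute this estimate from scratch for every permuted copy of the tested variable. The sketch below (illustrative only, with hypothetical function names, and with X and \(\varvec{{\varLambda }}\) restricted to the q selected variables) makes the \(O(q^{3}+q^{2}n)\) cost per refit explicit:

```python
import numpy as np

def adaptive_ridge(X, y, Lam):
    """Cleaning-stage estimate: beta = (X'X + Lambda)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + Lam, X.T @ y)

def naive_permutation_fits(X, y, Lam, j, B=1000, seed=0):
    """Refit the full model for each of the B permutations of column j.

    Each refit costs O(q^3 + q^2 n); repeating it for B permutations and
    for each of the q tested variables gives the O(B q (q^3 + q^2 n))
    total cost discussed above.
    """
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(B):
        X_b = X.copy()
        X_b[:, j] = rng.permutation(X[:, j])      # permute only variable j
        fits.append(adaptive_ridge(X_b, y, Lam))  # full refit: the bottleneck
    return np.asarray(fits)
```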

In the F-statistic (4), the permutation affects the calculation of the larger model \(\hat{\mathbf {y}}_{1}\), which is denoted \(\hat{\mathbf {y}}_{1}^{(b)}\) for the bth permutation. Using a similar notation convention for the design matrix and the estimated parameters, we have \(\hat{\mathbf {y}}_{1}^{(b)}={\mathbf {X}^{(b)}}\hat{\varvec{\beta }}^{(b)}\). When testing the relevance of variable j, \({\mathbf {X}^{(b)}}\) is defined as the concatenation of the permuted variable \(\mathbf {x}_{j}^{(b)}\) and the other original variables: \({\mathbf {X}^{(b)}}= (\mathbf {x}_{j}^{(b)}, \mathbf {X}_{\backslash j}) = (\mathbf {x}_{j}^{(b)}, \mathbf {x}_1, ..., \mathbf {x}_{j-1}, \mathbf {x}_{j+1}, ..., \mathbf {x}_{p})\). Then, \(\hat{\varvec{\beta }}^{(b)}\) can be efficiently computed by using \(a^{(b)}\in \mathbb {R}\), \(\mathbf {v}^{(b)}\in \mathbb {R}^{q-1}\) and \(\hat{\varvec{\beta }}_{\backslash j}\in \mathbb {R}^{q-1}\) defined as follows:

$$\begin{aligned} a^{(b)}&= (\Vert \mathbf {x}_{j}^{(b)}\Vert ^2_2 + {\varLambda }_{jj} ) - {\mathbf {x}_{j}^{(b)}}^{\top } \mathbf {X}_{\backslash j}{(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1} \mathbf {X}_{\backslash j}^\top \mathbf {x}_{j}^{(b)}\\ \mathbf {v}^{(b)}&= -{ (\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1} \mathbf {X}_{\backslash j}^\top \mathbf {x}_{j}^{(b)}\\ \hat{\varvec{\beta }}_{\backslash j}&= {(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1} \mathbf {X}_{\backslash j}^\top \mathbf {y}. \end{aligned}$$

Indeed, using the Schur complement, one writes \(\hat{\varvec{\beta }}^{(b)}\) as follows:

$$\begin{aligned} \hat{\varvec{\beta }}^{(b)}= \frac{1}{a^{(b)}} \begin{pmatrix} 1 \\ \mathbf {v}^{(b)}\end{pmatrix} \begin{pmatrix} 1&{\mathbf {v}^{(b)}}^{\top } \end{pmatrix} \begin{pmatrix} {\mathbf {x}_{j}^{(b)}}^{\top } \mathbf {y}\\ \mathbf {X}_{\backslash j}^{\top } \mathbf {y}\end{pmatrix} + \begin{pmatrix} 0 \\ \hat{\varvec{\beta }}_{\backslash j}\end{pmatrix} . \end{aligned}$$

Hence, \(\hat{\varvec{\beta }}^{(b)}\) can be obtained as a correction of the vector of coefficients \(\hat{\varvec{\beta }}_{\backslash j}\) obtained under the smaller model. The key observation here is that \(\mathbf {x}_{j}^{(b)}\) does not enter the expression \({(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1}\), which is the bottleneck in the computation of \(a^{(b)}\), \(\mathbf {v}^{(b)}\) and \(\hat{\varvec{\beta }}_{\backslash j}\). This inversion can therefore be performed once for all permutations. Additionally, \({(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1}\) can be cheaply computed from \(\left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1}\) as follows:

$$\begin{aligned}&{(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1} = \left[ \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \right] _{\backslash j \backslash j} \\&\qquad -\left[ \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \right] _{\backslash j j} \left[ \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \right] _{j j}^{-1} \\&\qquad \times \left[ \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \right] _{j{\backslash j}} . \end{aligned}$$

Thus, we compute \(\left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1}\) once, then correct first for the removal of variable j and second for permutation b, eventually requiring \(O(B(q^{3}+q^{2}n))\) operations.
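
The resulting computations can be sketched as follows (an illustrative NumPy implementation of the update above, with hypothetical function names): the inverse of \(\mathbf {X}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\) is formed once, corrected for the removal of variable j by the block formula, and each permutation then only requires \(O(q^{2}+qn)\) additional operations.

```python
import numpy as np

def efficient_permutation_fits(X, y, Lam, j, B=1000, seed=0):
    """Permutation refits for variable j using the Schur-complement update.

    (X'X + Lambda)^{-1} is computed once and corrected for the removal of
    variable j; each permutation then costs O(q^2 + q n) instead of a
    full O(q^3 + q^2 n) refit.
    """
    rng = np.random.default_rng(seed)
    n, q = X.shape
    A_inv = np.linalg.inv(X.T @ X + Lam)           # computed once

    rest = np.arange(q) != j
    X_rest = X[:, rest]
    # (X_rest' X_rest + Lam_rest)^{-1} from the full inverse (block formula).
    M = (A_inv[np.ix_(rest, rest)]
         - np.outer(A_inv[rest, j], A_inv[j, rest]) / A_inv[j, j])
    Xty_rest = X_rest.T @ y
    beta_rest = M @ Xty_rest                       # smaller-model coefficients

    fits = []
    for _ in range(B):
        x_jb = rng.permutation(X[:, j])            # permuted variable j
        u = X_rest.T @ x_jb                        # O(q n)
        Mu = M @ u                                 # O(q^2)
        a = x_jb @ x_jb + Lam[j, j] - u @ Mu       # Schur complement a^(b)
        w = np.concatenate(([1.0], -Mu))           # (1, v^(b))
        rhs = np.concatenate(([x_jb @ y], Xty_rest))
        beta_b = (w @ rhs) / a * w + np.concatenate(([0.0], beta_rest))
        fits.append(beta_b)                        # ordered as (x_j^(b), X_rest)
    return np.asarray(fits)
```

With the same permutations, the result matches the naive refits sketched earlier up to numerical precision and up to the ordering of the coefficients, since the permuted variable comes first here; reusing M across permutations is what yields the factor-q saving.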

Cite this article

Bécu, JM., Grandvalet, Y., Ambroise, C. et al. Beyond support in two-stage variable selection. Stat Comput 27, 169–179 (2017). https://doi.org/10.1007/s11222-015-9614-1
