Beyond support in two-stage variable selection

Abstract

Numerous variable selection methods rely on a two-stage procedure, where a sparsity-inducing penalty is used in the first stage to predict the support, which is then conveyed to the second stage for estimation or inference purposes. In this framework, the first stage screens variables to find a set of possibly relevant variables, and the second stage operates on this set of candidate variables to improve estimation accuracy or to assess the uncertainty associated with the selection of variables. We advocate that more information can be conveyed from the first stage to the second: we use the magnitude of the coefficients estimated in the first stage to define an adaptive penalty that is applied at the second stage. We give the example of an inference procedure that benefits greatly from the proposed transfer of information. The procedure is precisely analyzed in a simple setting, and our large-scale experiments empirically demonstrate that actual benefits can be expected in much more general situations, with sensitivity gains ranging from 50 to 100% compared to the state of the art.
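
To fix ideas, the following is a minimal sketch of the screen-then-clean pipeline described above, assuming a standard linear model and using scikit-learn's Lasso for the screening stage; the function name, tuning parameters, and the plain adaptive-ridge refit used here for the cleaning stage are illustrative only, and the data splitting and inference machinery of the full procedure are omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def screen_then_clean(X, y, lam_screen=0.1, lam_clean=1.0):
    """Illustrative two-stage estimate: Lasso screening, adaptive-ridge cleaning.

    The penalty applied to variable j at the cleaning stage is
    lam_clean / |beta_j^screen|, so the magnitudes estimated at the
    screening stage (not only the support) are passed on to stage two.
    """
    # Stage 1: a sparsity-inducing penalty screens the variables.
    screening = Lasso(alpha=lam_screen).fit(X, y)
    support = np.flatnonzero(screening.coef_)
    tau = np.abs(screening.coef_[support])   # transferred magnitudes

    # Stage 2: adaptive ridge on the selected variables,
    # beta = (X_S' X_S + Lambda)^{-1} X_S' y with Lambda = diag(lam_clean / tau).
    X_S = X[:, support]
    Lam = np.diag(lam_clean / tau)
    beta = np.linalg.solve(X_S.T @ X_S + Lam, X_S.T @ y)
    return support, beta
```

In screen-and-clean inference procedures, the two stages typically operate on separate portions of the data; the sketch omits this and only illustrates how the first-stage magnitudes define the adaptive penalty of the second stage.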

Notes

  1. In their two-stage procedure, Liu and Yu (2013) also proposed to construct confidence regions and to conduct hypothesis tests by bootstrapping residuals. Their approach fundamentally differs from that of Wasserman and Roeder (2009), in that inference does not rely on the two-stage procedure itself, but on the properties of the estimator obtained at the second stage.

  2. Though many sparsity-inducing penalties, such as the Elastic-Net, the group-Lasso, or the fused-Lasso, lend themselves to the approach proposed here, the paper is restricted to the simple Lasso penalty.

  3. Note that there is no finite-sample exact permutation test in multiple linear regression (Anderson and Robinson 2001). A test based on partial residuals (under the null hypothesis regression model) is asymptotically exact for unpenalized regression, but it does not apply to penalized regression.

References

  • Ambroise, C., McLachlan, G.J.: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. 99(10), 6562–6566 (2002)

  • Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol. 11(10), R106 (2010)

  • Anderson, M.J., Robinson, J.: Permutation tests for linear models. Austral. N. Z. J. Stat. 43(1), 75–88 (2001)

  • Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2012)

  • Balding, D.: A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 7(10), 781–791 (2006)

  • Belloni, A., Chernozhukov, V.: Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2), 521–547 (2013)

  • Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57(1), 289–300 (1995)

  • Boulesteix, A.L., Schmid, M.: Machine learning versus statistical modeling. Biom. J. 56, 588–593 (2014)

  • Bühlmann, P.: Statistical significance in high-dimensional linear models. Bernoulli 19, 1212–1242 (2013)

  • Candès, E., Tao, T.: The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann. Stat. 35, 2313–2351 (2007)

  • Chatterjee, A., Lahiri, S.N.: Rates of convergence of the adaptive lasso estimators to the oracle distribution and higher order refinements by the bootstrap. Ann. Stat. 41(3), 1232–1259 (2013)

  • Chong, I.G., Jun, C.H.: Performance of some variable selection methods when multicollinearity is present. Chemom. Intel. Lab. Syst. 78(1–2), 103–112 (2005)

  • Cule, E., Vineis, P., De Iorio, M.: Significance testing in ridge regression for genetic data. BMC Bioinf. 12(372), 1–15 (2011)

  • Dalmasso, C., Carpentier, W., Meyer, L., Rouzioux, C., Goujard, C., Chaix, M.L., Lambotte, O., Avettand-Fenoel, V., Le Clerc, S., Denis de Senneville, L., Deveau, C., Boufassa, F., Debre, P., Delfraissy, J.F., Broet, P., Theodorou, I.: Distinct genetic loci control plasma HIV-RNA and cellular HIV-DNA levels in HIV-1 infection: the ANRS genome wide association 01 study. PLoS One 3(12), e3907 (2008)

  • Dudoit, S., Van der Laan, M.: Multiple Testing Procedures with Applications to Genomics. Springer, New York (2008)

  • Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)

  • Grandvalet, Y.: Least absolute shrinkage is equivalent to quadratic penalization. In: Niklasson L, Bodén M, Ziemske T (eds) ICANN’98, Perspectives in Neural Computing, vol 1, Springer, New York, pp. 201–206 (1998)

  • Grandvalet, Y., Canu, S.: Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. In: Kearns MS, Solla SA, Cohn DA (eds) Advances in Neural Information Processing Systems 11 (NIPS 1998), MIT Press, Cambridge, pp. 445–451 (1999)

  • Halawa, A.M., El Bassiouni, M.Y.: Tests of regression coefficients under ridge regression models. J. Stat. Comput. Simul. 65(1), 341–356 (1999)

  • Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models, Monographs on Statistics and Applied Probability, vol. 43. Chapman & Hall, London (1990)

  • Huang, J., Horowitz, J.L., Ma, S.: Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Stat. 36(2), 587–613 (2008)

  • Kyung, M., Gill, J., Ghosh, M., Casella, G.: Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 5(2), 369–411 (2010)

  • Liu, H., Yu, B.: Asymptotic properties of lasso+mls and lasso+ridge in sparse high-dimensional linear regression. Electr. J. Stat. 7, 3124–3169 (2013)

  • Lockhart, R., Taylor, J., Tibshirani, R.J., Tibshirani, R.: A significance test for the lasso. Ann. Stat. 42(2), 413–468 (2014)

  • Meinshausen, N.: Relaxed lasso. Comput. Stat. Data Anal. 52(1), 374–393 (2007)

  • Meinshausen, N., Meier, L., Bühlmann, P.: \(p\)-values for high-dimensional regression. J. Am. Stat. Assoc. 104(488), 1671–1681 (2009)

  • Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K.A., Grill, J., Frouin, V.: Variable selection for generalized canonical correlation analysis. Biostatistics 15(3), 569–583 (2014)

  • Tibshirani, R.J.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  • Verzelen, N.: Minimax risks for sparse regressions: ultra-high dimensional phenomenons. Electr. J. Stat. 6, 38–90 (2012)

  • Wang, Y., Yang, J., Yin, W., Zhang, W.: A new alternating minimization algorithm for total variation image reconstruction. SIAM J. Imaging Sci. 1(3), 248–272 (2008)

  • Wasserman, L., Roeder, K.: High-dimensional variable selection. Ann. Stat. 37(5A), 2178–2201 (2009)

  • Xing, E.P., Jordan, M.I., Karp, R.M.: Feature selection for high-dimensional genomic microarray data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 601–608 (2001)

  • Zhang, C.H., Zhang, S.S.: Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B 76(1), 217–242 (2014)

Acknowledgments

This work was supported by the UTC Foundation for Innovation through the ToxOnChip program. It was carried out within the framework of the Labex MS2T (ANR-11-IDEX-0004-02), as part of the “Investments for the Future” program managed by the National Agency for Research.

Author information

Corresponding author

Correspondence to Yves Grandvalet.

Appendices

Variational equivalence

We show below that the quadratic penalty in \(\varvec{\beta }\) in Problem (2) acts as the Lasso penalty \(\lambda \left\| \varvec{\beta }\right\| _{1}\).

Proof

The Lagrangian of Problem (2) is:

$$\begin{aligned} L(\varvec{\beta }) = J(\varvec{\beta }) + \lambda \sum _{j=1}^p \frac{1}{\tau _{j}} \beta _{j}^{2} + \nu _0 \bigg ( \sum _{j=1}^p \tau _j -\left\| \varvec{\beta }\right\| _{1} \bigg ) - \sum _{j=1}^p \nu _j \tau _j. \end{aligned}$$

Thus, the first order optimality conditions for \(\tau _j\) are

$$\begin{aligned} \frac{\partial L}{\partial \tau _j}(\tau _j^\star ) = 0&\Leftrightarrow - \lambda \frac{\beta _{j}^2}{{\tau _j^\star }^2} + \nu _0 - \nu _j = 0 \\&\Leftrightarrow - \lambda \beta _{j}^2 + \nu _0 \,{\tau _j^\star }^2 - \nu _j \,{\tau _j^\star }^2 = 0 \\&\Rightarrow - \lambda \beta _{j}^2 + \nu _0 \,{\tau _j^\star }^2 = 0 , \end{aligned}$$

where the term in \(\nu _j\) vanishes due to complementary slackness, which implies here \(\nu _j \tau _j^\star = 0\). Together with the constraints of Problem (2), the last equation implies \(\tau _j^\star = \left| \beta _{j}\right| \), hence Problem (2) is equivalent to

$$\begin{aligned} \min _{\varvec{\beta }\in \mathbb {R}^p} J (\varvec{\beta }) + \lambda \left\| \varvec{\beta }\right\| _{1} , \end{aligned}$$

which is the original Lasso formulation. \(\square \)
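
As a quick numerical sanity check of this variational argument (not part of the original proof; plain NumPy, with arbitrary choices of \(\lambda \) and \(\varvec{\beta }\)), one can verify that the weighted quadratic penalty evaluated at \(\tau _j^\star = \left| \beta _{j}\right| \) coincides with the Lasso penalty, and that any other feasible \(\varvec{\tau }\) can only increase it:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(size=5)      # all entries nonzero almost surely
lam = 0.7

# At tau_j = |beta_j|, the penalty lam * sum_j beta_j^2 / tau_j
# reduces to the Lasso penalty lam * ||beta||_1.
tau_star = np.abs(beta)
quad_at_optimum = lam * np.sum(beta**2 / tau_star)
lasso_penalty = lam * np.sum(np.abs(beta))
assert np.isclose(quad_at_optimum, lasso_penalty)

# Any other feasible tau (positive, summing to ||beta||_1) is no better.
for _ in range(1000):
    tau = rng.dirichlet(np.ones(beta.size)) * np.sum(np.abs(beta))
    assert lam * np.sum(beta**2 / tau) >= quad_at_optimum - 1e-10
```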

Efficient implementation

Permutation tests rely on the simulation of numerous datasets sampled under the null hypothesis distribution. The number of replications must be large to estimate the rather extreme quantiles we are typically interested in. Here, we use \(B=1000\) replications for each of the \(q=|\mathscr {S}_{\hat{\lambda }}|\) variables selected at the screening stage. Since each replication involves fitting a model, the total computational cost of solving these B systems of size q for each of the q selected variables is \(O(Bq(q^{3}+q^{2}n))\). However, block-wise decompositions and inversions bring computational savings by a factor of q.

First, we recall that the adaptive-ridge estimate of the cleaning stage is computed as

$$\begin{aligned} {\,\hat{\varvec{\beta }}} = \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \mathbf {X}^{\top }\mathbf {y}, \end{aligned}$$

where \(\varvec{{\varLambda }}\) is the diagonal adaptive-penalty matrix defined at the screening stage, whose jth diagonal entry is \({\lambda }/{\tau _{j}^\star }\), as defined in (13).
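
For reference, a naive implementation of the permutation refits described next would recompute this estimate from scratch for every permuted copy of the tested variable. The sketch below (illustrative only, with hypothetical function names, and with X and \(\varvec{{\varLambda }}\) restricted to the q selected variables) makes the \(O(q^{3}+q^{2}n)\) cost per refit explicit:

```python
import numpy as np

def adaptive_ridge(X, y, Lam):
    """Cleaning-stage estimate: beta = (X'X + Lambda)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + Lam, X.T @ y)

def naive_permutation_fits(X, y, Lam, j, B=1000, seed=0):
    """Refit the full model for each of the B permutations of column j.

    Each refit costs O(q^3 + q^2 n); repeating it for B permutations and
    for each of the q tested variables gives the O(B q (q^3 + q^2 n))
    total cost discussed above.
    """
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(B):
        X_b = X.copy()
        X_b[:, j] = rng.permutation(X[:, j])      # permute only variable j
        fits.append(adaptive_ridge(X_b, y, Lam))  # full refit: the bottleneck
    return np.asarray(fits)
```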

In the F-statistic (4), the permutation affects the calculation of the larger model \(\hat{\mathbf {y}}_{1}\), which is denoted \(\hat{\mathbf {y}}_{1}^{(b)}\) for the bth permutation. Using a similar notation convention for the design matrix and the estimated parameters, we have \(\hat{\mathbf {y}}_{1}^{(b)}={\mathbf {X}^{(b)}}\hat{\varvec{\beta }}^{(b)}\). When testing the relevance of variable j, \({\mathbf {X}^{(b)}}\) is defined as the concatenation of the permuted variable \(\mathbf {x}_{j}^{(b)}\) and the other original variables: \({\mathbf {X}^{(b)}}= (\mathbf {x}_{j}^{(b)}, \mathbf {X}_{\backslash j}) = (\mathbf {x}_{j}^{(b)}, \mathbf {x}_1, ..., \mathbf {x}_{j-1}, \mathbf {x}_{j+1}, ..., \mathbf {x}_{p})\). Then, \(\hat{\varvec{\beta }}^{(b)}\) can be efficiently computed by using \(a^{(b)}\in \mathbb {R}\), \(\mathbf {v}^{(b)}\in \mathbb {R}^{q-1}\) and \(\hat{\varvec{\beta }}_{\backslash j}\in \mathbb {R}^{q-1}\) defined as follows:

$$\begin{aligned} a^{(b)}&= (\Vert \mathbf {x}_{j}^{(b)}\Vert ^2_2 + {\varLambda }_{jj} ) - {\mathbf {x}_{j}^{(b)}}^{\top } \mathbf {X}_{\backslash j}{(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1} \mathbf {X}_{\backslash j}^\top \mathbf {x}_{j}^{(b)}\\ \mathbf {v}^{(b)}&= -{ (\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1} \mathbf {X}_{\backslash j}^\top \mathbf {x}_{j}^{(b)}\\ \hat{\varvec{\beta }}_{\backslash j}&= {(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1} \mathbf {X}_{\backslash j}^\top \mathbf {y}. \end{aligned}$$

Indeed, using the Schur complement, one writes \(\hat{\varvec{\beta }}^{(b)}\) as follows:

$$\begin{aligned} \hat{\varvec{\beta }}^{(b)}= \frac{1}{a^{(b)}} \begin{pmatrix} 1 \\ \mathbf {v}^{(b)}\end{pmatrix} \begin{pmatrix} 1&{\mathbf {v}^{(b)}}^{\top } \end{pmatrix} \begin{pmatrix} {\mathbf {x}_{j}^{(b)}}^{\top } \mathbf {y}\\ \mathbf {X}_{\backslash j}^{\top } \mathbf {y}\end{pmatrix} + \begin{pmatrix} 0 \\ \hat{\varvec{\beta }}_{\backslash j}\end{pmatrix} . \end{aligned}$$

Hence, \(\hat{\varvec{\beta }}^{(b)}\) can be obtained as a correction of the vector of coefficients \(\hat{\varvec{\beta }}_{\backslash j}\) obtained under the smaller model. The key observation here is that \(\mathbf {x}_{j}^{(b)}\) does not enter the expression \({(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1}\), which is the bottleneck in the computation of \(a^{(b)}\), \(\mathbf {v}^{(b)}\) and \(\hat{\varvec{\beta }}_{\backslash j}\). This inversion can therefore be performed once for all permutations. Additionally, \({(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1}\) can be cheaply computed from \(\left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1}\) as follows:

$$\begin{aligned}&{(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1} = \left[ \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \right] _{\backslash j \backslash j} \\&\qquad -\left[ \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \right] _{\backslash j j} \left[ \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \right] _{j j}^{-1} \\&\qquad \times \left[ \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} \right] _{j{\backslash j}} . \end{aligned}$$

Thus, we compute \(\left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1}\) once, then correct first for the removal of variable j and second for permutation b, eventually requiring \(O(B(q^{3}+q^{2}n))\) operations.
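
The resulting computations can be sketched as follows (an illustrative NumPy implementation of the update above, with hypothetical function names): the inverse of \(\mathbf {X}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\) is formed once, corrected for the removal of variable j by the block formula, and each permutation then only requires \(O(q^{2}+qn)\) additional operations.

```python
import numpy as np

def efficient_permutation_fits(X, y, Lam, j, B=1000, seed=0):
    """Permutation refits for variable j using the Schur-complement update.

    (X'X + Lambda)^{-1} is computed once and corrected for the removal of
    variable j; each permutation then costs O(q^2 + q n) instead of a
    full O(q^3 + q^2 n) refit.
    """
    rng = np.random.default_rng(seed)
    n, q = X.shape
    A_inv = np.linalg.inv(X.T @ X + Lam)           # computed once

    rest = np.arange(q) != j
    X_rest = X[:, rest]
    # (X_rest' X_rest + Lam_rest)^{-1} from the full inverse (block formula).
    M = (A_inv[np.ix_(rest, rest)]
         - np.outer(A_inv[rest, j], A_inv[j, rest]) / A_inv[j, j])
    Xty_rest = X_rest.T @ y
    beta_rest = M @ Xty_rest                       # smaller-model coefficients

    fits = []
    for _ in range(B):
        x_jb = rng.permutation(X[:, j])            # permuted variable j
        u = X_rest.T @ x_jb                        # O(q n)
        Mu = M @ u                                 # O(q^2)
        a = x_jb @ x_jb + Lam[j, j] - u @ Mu       # Schur complement a^(b)
        w = np.concatenate(([1.0], -Mu))           # (1, v^(b))
        rhs = np.concatenate(([x_jb @ y], Xty_rest))
        beta_b = (w @ rhs) / a * w + np.concatenate(([0.0], beta_rest))
        fits.append(beta_b)                        # ordered as (x_j^(b), X_rest)
    return np.asarray(fits)
```

With the same permutations, the result matches the naive refits sketched earlier up to numerical precision and up to the ordering of the coefficients, since the permuted variable comes first here; reusing M across permutations is what yields the factor-q saving.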

Cite this article

Bécu, JM., Grandvalet, Y., Ambroise, C. et al. Beyond support in two-stage variable selection. Stat Comput 27, 169–179 (2017). https://doi.org/10.1007/s11222-015-9614-1
