Abstract

Enriching predictive models with new biomolecular markers is an important task in high-dimensional omic applications. Increasingly, clinical studies collect several sets of omic markers for each patient, each measuring a different level of biological variation. As a result, one of the main challenges in predictive research is integrating different sources of omic biomarkers for the prediction of health traits. We review several approaches for combining omic markers in the context of binary outcome prediction, all based on double cross-validation and regularized regression models. We evaluate their performance in terms of calibration and discrimination, and we compare them with predictions based on a single omic source. We illustrate the methods through the analysis of two real datasets. First, we consider the combination of two fractions of proteomic mass spectrometry for the calibration of a diagnostic rule for the detection of early-stage breast cancer. Second, we consider transcriptomics and metabolomics as predictors of obesity, using data from the Dietary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome (DILGOM) study, a population-based cohort from Finland.
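To make the phrase "double cross-validation and regularized regression" concrete, the following is a minimal sketch of the general idea, not the chapter's own implementation. It assumes a single predictor matrix, a ridge-penalized logistic model fitted by plain gradient descent, and simulated data; all function names, the penalty grid, and the fold counts are illustrative assumptions. The inner loop chooses the ridge penalty using only the training part of each outer fold, so the outer-fold predictions are never used for tuning, and discrimination is then summarized by the AUC of those predictions.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_ridge_logistic(X, y, lam, steps=100, lr=0.3):
    # Ridge-penalized logistic regression fitted by plain gradient descent
    # (intercept unpenalized). Chosen for brevity, not efficiency.
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            resid = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += resid / n
            for j in range(p):
                grad_w[j] += resid * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, X):
    w, b = model
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def neg_log_lik(y, probs):
    # Summed negative Bernoulli log-likelihood, clipped to avoid log(0).
    eps = 1e-12
    return -sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
                for yi, pi in zip(y, probs))


def k_folds(indices, k):
    # Split a list of indices into k interleaved folds.
    return [indices[i::k] for i in range(k)]


def double_cv(X, y, lambdas, k_outer=5, k_inner=4, seed=0):
    # Outer loop: produce one out-of-sample prediction per observation.
    # Inner loop: pick the penalty using the outer training part only.
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    preds = [None] * len(y)
    for test in k_folds(idx, k_outer):
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        best_lam, best_loss = None, float("inf")
        for lam in lambdas:
            loss = 0.0
            for val in k_folds(train, k_inner):
                val_set = set(val)
                fit_idx = [i for i in train if i not in val_set]
                model = fit_ridge_logistic([X[i] for i in fit_idx],
                                           [y[i] for i in fit_idx], lam)
                loss += neg_log_lik([y[i] for i in val],
                                    predict(model, [X[i] for i in val]))
            if loss < best_loss:
                best_lam, best_loss = lam, loss
        model = fit_ridge_logistic([X[i] for i in train], [y[i] for i in train], best_lam)
        for i, pi in zip(test, predict(model, [X[i] for i in test])):
            preds[i] = pi
    return preds


def auc(y, probs):
    # Probability that a random case is scored above a random control (ties count 1/2).
    pos = [pi for yi, pi in zip(y, probs) if yi == 1]
    neg = [pi for yi, pi in zip(y, probs) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Simulated data (an assumption for illustration): only the first marker is informative.
rng = random.Random(1)
n, p = 60, 5
y = [i % 2 for i in range(n)]
X = [[(1.5 if yi else -1.5) * (j == 0) + rng.gauss(0, 1) for j in range(p)] for yi in y]

preds = double_cv(X, y, lambdas=[0.01, 0.1, 1.0])
score = auc(y, preds)
print(f"double cross-validated AUC: {score:.3f}")
```

Because every entry of `preds` comes from a model that saw neither that observation nor any tuning based on it, such predictions could in principle be compared across omic sources or passed on to a second-stage model; how the chapter actually combines sources is not shown here.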

Notes

  1. The authors “Mar Rodríguez-Girondo” and “Alexia Kakourou” contributed equally to this work.

Acknowledgements

The work is supported by funding from the European Community’s Seventh Framework Programme FP7/2011: Marie Curie Initial Training Network MEDIASRES with the Grant Agreement Number 290025 and by funding from the European Union’s Seventh Framework Programme FP7/Health/F5/2012: MIMOmics under the Grant Agreement Number 305280. We thank Sigrid Jusélius Foundation and Yrjö Jahnsson Foundation for providing the DILGOM expression data and Yrjö Jahnsson Foundation for providing DILGOM NMR metabolomics.

Author information

Corresponding author

Correspondence to Mar Rodríguez-Girondo.

Copyright information

© 2017 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Rodríguez-Girondo, M. et al. (2017). On the Combination of Omics Data for Prediction of Binary Outcomes. In: Datta, S., Mertens, B. (eds) Statistical Analysis of Proteomics, Metabolomics, and Lipidomics Data Using Mass Spectrometry. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-45809-0_14
