Abstract
Metabolomics is the science of characterizing and quantifying small-molecule metabolites in biological systems. These metabolites give organisms their biochemical characteristics and provide a link between genotype, environment, and phenotype. The opportunities this creates come with data challenges, such as compound annotation, missing values, and batch effects. We present the steps of a general pipeline for processing untargeted mass spectrometry data that addresses the latter two challenges. We assume the data are given as a matrix of metabolite abundances, with metabolites in rows and samples in columns. The steps in the pipeline are summarizing technical replicates (if available), filtering, imputing, transforming, and normalizing the data. In each step, the method and its parameters should be chosen based on the assumptions one is willing to make, the question of interest, and diagnostic tools. Besides giving a general pipeline that the reader can adapt, our goal is to review diagnostic tools and criteria that help in making decisions at each step and in assessing the effectiveness of normalization and batch correction. We conclude with a list of useful packages and a discussion of alternative approaches that might be more appropriate for the reader’s data.
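The order of the five steps named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function name `preprocess`, the `max_missing` threshold, and the specific method used in each step (mean of replicates, missingness filter, half-minimum imputation, log2 transform, per-sample median centering) are assumptions chosen only for illustration. In practice each choice should follow from the data, the question of interest, and the diagnostics the chapter reviews. The sketch also assumes strictly positive abundances, with missing values coded as NaN.

```python
import numpy as np

def preprocess(abund, replicate_of, max_missing=0.5):
    """Illustrative pipeline on a metabolites x injections matrix (NaN = missing).

    replicate_of gives, for each column, the label of the biological sample
    that the injection belongs to. All method choices here are assumptions.
    """
    abund = np.asarray(abund, dtype=float)
    labels = np.asarray(replicate_of)
    if labels.shape[0] != abund.shape[1]:
        raise ValueError("need one sample label per column")
    if not 0.0 <= max_missing < 1.0:
        raise ValueError("max_missing must be in [0, 1)")
    samples = list(dict.fromkeys(replicate_of))  # unique labels, first-seen order

    # 1. Summarize technical replicates: mean of the observed values per sample.
    summarized = np.full((abund.shape[0], len(samples)), np.nan)
    for j, s in enumerate(samples):
        block = abund[:, labels == s]
        observed = ~np.isnan(block)
        counts = observed.sum(axis=1)
        sums = np.where(observed, block, 0.0).sum(axis=1)
        summarized[:, j] = np.divide(
            sums, counts, out=np.full(sums.shape, np.nan), where=counts > 0
        )

    # 2. Filter: keep metabolites missing in at most max_missing of the samples.
    keep = np.isnan(summarized).mean(axis=1) <= max_missing
    filtered = summarized[keep]

    # 3. Impute: replace missing values with half the metabolite's minimum.
    row_min = np.nanmin(filtered, axis=1, keepdims=True)
    imputed = np.where(np.isnan(filtered), row_min / 2.0, filtered)

    # 4. Transform: log2 (requires strictly positive abundances).
    logged = np.log2(imputed)

    # 5. Normalize: subtract each sample's median so columns are comparable.
    normalized = logged - np.median(logged, axis=0, keepdims=True)
    return normalized, keep, samples

# Toy data: 4 metabolites, 2 samples (A, B) with 2 technical replicates each.
toy = [
    [100.0, 120.0, 200.0, 220.0],
    [np.nan, np.nan, 50.0, 70.0],
    [np.nan, np.nan, np.nan, np.nan],
    [8.0, 8.0, 16.0, 16.0],
]
normalized, keep, samples = preprocess(toy, ["A", "A", "B", "B"])
```

With the toy data, the third metabolite is missing in every sample and is dropped by the filter, while the second is kept and its missing value for sample A is imputed. Swapping any single step for another method, for example a different imputation, leaves the surrounding steps unchanged, which is the sense in which the pipeline can be adapted.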
Acknowledgments
D.R., S.J., and K.K. were supported by NIH/NHLBI Grant Number P20 HL113445. D.R., D.G., and K.K. were supported by NIH/NCATS Colorado CTSA Grant Number UL1 TR002535. H.P.-L. was supported by training grant T15 LM009451. Contents are the authors’ sole responsibility and do not necessarily represent official NIH views.
Copyright information
© 2019 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Reinhold, D., Pielke-Lombardo, H., Jacobson, S., Ghosh, D., Kechris, K. (2019). Pre-analytic Considerations for Mass Spectrometry-Based Untargeted Metabolomics Data. In: D'Alessandro, A. (eds) High-Throughput Metabolomics. Methods in Molecular Biology, vol 1978. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9236-2_20
DOI: https://doi.org/10.1007/978-1-4939-9236-2_20
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9235-5
Online ISBN: 978-1-4939-9236-2
eBook Packages: Springer Protocols