Guidelines for Bioinformatics and the Statistical Analysis of Omic Data

Bhattacharya, Surajit; Gordish-Dressman, Heather

doi:10.1007/978-1-4939-9802-9_4

Part of the book series: Methods in Physiology (METHPHYS)

Abstract

This chapter is a resource for those designing omics experiments and those analyzing the data from such experiments. It is organized into two parts, one with a focus on bioinformatics tools and techniques, and the other with a focus on statistical analyses. It is intended to be a high-level instructional chapter for those who are interested in performing their own analyses, not a comprehensive discussion of either area. The first section discusses the bioinformatics tools and algorithms used in genomics and transcriptomics. It describes typical workflows and the tools available for performing an omic experiment and underscores the importance of both the tools being used and a clear understanding of the underlying algorithm. The second section describes general study design principles that should be taken into account before an experiment is begun. It describes some basic principles of statistical analysis and commonly used methods. It is not a comprehensive discussion of statistical theory nor does it describe more complex statistical models. The guidance of a statistician is advised for complex study designs, hypotheses, or statistical models.

References

Hood, L., & Galas, D. (2003). The digital code of DNA. Nature, 421(6921), 444–448.
Article PubMed CAS Google Scholar
Dahm, R. (2008). Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Human Genetics, 122(6), 565–581.
Article CAS PubMed Google Scholar
Levy, S. E., & Myers, R. M. (2016). Advancements in next-generation sequencing. Annual Review of Genomics and Human Genetics, 17(1), 95–115.
Article CAS PubMed Google Scholar
Reis-Filho, J. S. (2009). Next-generation sequencing. Breast Cancer Research, 11(S3), S12.
Article PubMed CAS PubMed Central Google Scholar
Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L., & Rice, P. M. (2010). The sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 38(6), 1767–1771.
Article CAS PubMed Google Scholar
Ewing, B., Hillier, L., Wendl, M. C., & Green, P. (1998). Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research, 8(3), 175–185.
Article CAS PubMed Google Scholar
Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 8(3), 186–194.
Article CAS PubMed Google Scholar
Andrews, S. (2010). FastQC a quality control tool for high throughput sequence data. Retrieved November 25, 2018 from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet Journal, 17(1), 10.
Article Google Scholar
Joshi, N. A., & Fass, J. N. (2011). Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files.
Google Scholar
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760.
Article CAS PubMed PubMed Central Google Scholar
Adjeroh, D., Bell, T., & Mukherjee, A. (2008). The Burrows-Wheeler transform: Data compression, suffix arrays, and pattern matching. New York: Springer.
Book Google Scholar
Lam, T. W., Sung, W. K., Tam, S. L., Wong, C. K., & Yiu, S. M. (2008). Compressed indexing and local alignment of DNA. Bioinformatics, 24(6), 791–797.
Article CAS PubMed Google Scholar
Li, H., et al. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078–2079.
Article PubMed PubMed Central CAS Google Scholar
McKenna, A., et al. (2010). The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303.
Article CAS PubMed PubMed Central Google Scholar
Garrison, E., & Marth, G. (2016). Haplotype-based variant detection from short-read sequencing.
Google Scholar
Kobayashi, M., et al. (2017). Heap: A highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data. DNA Research, 24(4), 397–405.
Article CAS PubMed PubMed Central Google Scholar
Tattini, L., D’Aurizio, R., & Magi, A. (2015). Detection of genomic structural variants from next-generation sequencing data. Frontiers in Bioengineering and Biotechnology, 3, 92.
Article PubMed PubMed Central Google Scholar
Chen, K., et al. (2009). BreakDancer: An algorithm for high-resolution mapping of genomic structural variation. Nature Methods, 6(9), 677–681.
Article CAS PubMed PubMed Central Google Scholar
Korbel, J. O., et al. (2009). PEMer: A computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biology, 10(2), R23.
Article PubMed PubMed Central CAS Google Scholar
Lee, S., Hormozdiari, F., Alkan, C., & Brudno, M. (2009). MoDIL: Detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods, 6(7), 473–474.
Article CAS PubMed Google Scholar
Magi, A., Tattini, L., Pippucci, T., Torricelli, F., & Benelli, M. (2012). Read count approach for DNA copy number variants detection. Bioinformatics, 28(4), 470–478.
Article CAS PubMed Google Scholar
Magi, A., et al. (2013). EXCAVATOR: Detecting copy number variants from whole-exome sequencing data. Genome Biology, 14(10), R120.
Article PubMed PubMed Central Google Scholar
Abyzov, A., Urban, A. E., Snyder, M., & Gerstein, M. (2011). CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research, 21(6), 974–984.
Article CAS PubMed PubMed Central Google Scholar
Schröder, J., et al. (2014). Socrates: Identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads. Bioinformatics, 30(8), 1064–1072.
Article PubMed PubMed Central CAS Google Scholar
Karakoc, E., et al. (2012). Detection of structural variants and indels within exome data. Nature Methods, 9(2), 176–178.
Article CAS Google Scholar
Earl, D., et al. (2011). Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research, 21(12), 2224–2241.
Article CAS PubMed PubMed Central Google Scholar
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., & McVean, G. (2012). De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics, 44(2), 226–232.
Article CAS PubMed PubMed Central Google Scholar
Nijkamp, J. F., van den Broek, M. A., Geertman, J.-M. A., Reinders, M. J. T., Daran, J.-M. G., & de Ridder, D. (2012). De novo detection of copy number variation by co-assembly. Bioinformatics, 28(24), 3195–3202.
Article CAS PubMed Google Scholar
Rausch, T., Zichner, T., Schlattl, A., Stutz, A. M., Benes, V., & Korbel, J. O. (2012). DELLY: Structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28(18), i333–i339.
Article CAS PubMed PubMed Central Google Scholar
Layer, R. M., Chiang, C., Quinlan, A. R., & Hall, I. M. (2014). LUMPY: a probabilistic framework for structural variant discovery. Genome Biology, 15(6), R84.
Article PubMed PubMed Central Google Scholar
Wong, K., Keane, T. M., Stalker, J., & Adams, D. J. (2010). Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biology, 11(12), R128.
Article PubMed PubMed Central Google Scholar
Jeffares, D. C., et al. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nature Communications, 8, 14061.
Article CAS PubMed PubMed Central Google Scholar
English, A. C., et al. (2015). Assessing structural variation in a personal genome—towards a human reference diploid genome. BMC Genomics, 16(1), 286.
Article PubMed PubMed Central CAS Google Scholar
Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16), e164.
Article PubMed PubMed Central CAS Google Scholar
Sherry, S. T., et al. (2001). dbSNP: The NCBI database of genetic variation. Nucleic Acids Research, 29(1), 308–311.
Article CAS PubMed PubMed Central Google Scholar
MacDonald, J. R., Ziman, R., Yuen, R. K. C., Feuk, L., & Scherer, S. W. (2014). The database of genomic variants: A curated collection of structural variation in the human genome. Nucleic Acids Research, 42(Database issue), D986–D992.
Article CAS PubMed Google Scholar
Landrum, M. J., et al. (2018). ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Research, 46(D1), D1062–D1067.
Article CAS PubMed Google Scholar
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J., & Kircher, M. (2018). CADD: Predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Research, 47(D1), D886–D894.
Article PubMed Central CAS Google Scholar
Kircher, M., Witten, D. M., Jain, P., O’Roak, B. J., Cooper, G. M., & Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics, 46(3), 310–315.
Article CAS PubMed PubMed Central Google Scholar
Cingolani, P., et al. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly, 6(2), 80–92.
Article CAS PubMed PubMed Central Google Scholar
Cingolani, P., et al. (2012). Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Frontiers in Genetics, 3, 35.
Article PubMed PubMed Central Google Scholar
Geoffroy, V., et al. (2018). AnnotSV: An integrated tool for structural variations annotation. Bioinformatics, 34(20), 3572–3574.
Article CAS PubMed Google Scholar
Freeman, W. M., Walker, S. J., & Vrana, K. E. (1999). Quantitative RT-PCR: Pitfalls and potential. BioTechniques, 26(1), 112–125.
Article CAS PubMed Google Scholar
Bumgarner, R. (2013). Overview of DNA microarrays: Types, applications, and their future. Current Protocols in Molecular Biology, 101(1), 22–21.
Google Scholar
Solomon, M. J., Larsen, P. L., & Varshavsky, A. (1988). Mapping protein-DNA interactions in vivo with formaldehyde: Evidence that histone H4 is retained on a highly transcribed gene. Cell, 53(6), 937–947.
Article CAS PubMed Google Scholar
Van Gelder, R. N., von Zastrow, M. E., Yool, A., Dement, W. C., Barchas, J. D., & Eberwine, J. H. (1990). Amplified RNA synthesized from limited quantities of heterogeneous cDNA. Proceedings of the National Academy of Sciences of the United States of America, 87(5), 1663–1667.
Article PubMed PubMed Central Google Scholar
Shalon, D., Smith, S. J., & Brown, P. O. (1996). A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 6(7), 639–645.
Article CAS PubMed Google Scholar
Ritchie, M. E., et al. (2015). Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47.
Article PubMed PubMed Central CAS Google Scholar
Gautier, L., Cope, L., Bolstad, B. M., & Irizarry, R. A. (2004). Affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3), 307–315.
Article CAS PubMed Google Scholar
Dunning, M. J., Smith, M. L., Ritchie, M. E., & Tavare, S. (2007). Beadarray: R classes and methods for Illumina bead-based data. Bioinformatics, 23(16), 2183–2184.
Article CAS PubMed Google Scholar
Bolstad, B. M., Irizarry, R., Astrand, M., & Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185–193.
Article CAS PubMed Google Scholar
Carvalho, B. S., & Irizarry, R. A. (2010). A framework for oligonucleotide microarray preprocessing. Bioinformatics, 26(19), 2363–2367.
Article CAS PubMed PubMed Central Google Scholar
Warnes, G. R., Bolker, B., Bonebakker, L., Gentleman, R., Huber, W., & Liaw, A. (2009). gplots: Various R programming tools for plotting data. R Packag. version 2.
Google Scholar
Student. (1908). The probable error of a mean. Biometrika. Retreived May 07, 2016, from http://seismo.berkeley.edu/~kirchner/eps_120/Odds_n_ends/Students_original_paper.pdf.
Fisher, R. A. (1919). XV.—The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52(02), 399–433.
Article Google Scholar
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1), 289–300.
Google Scholar
Bonferroni, C. (1936). Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8, 3–62.
Google Scholar
Schadt, E. E., Turner, S., & Kasarskis, A. (2010). A window into third-generation sequencing. Human Molecular Genetics, 19(R2), R227–R240.
Article CAS PubMed Google Scholar
Mikheyev, A. S., & Tin, M. M. Y. (2014). A first look at the Oxford Nanopore MinION sequencer. Molecular Ecology Resources, 14(6), 1097–1102.
Article CAS PubMed Google Scholar
Eisenstein, M. (2012). Oxford Nanopore announcement sets sequencing sector abuzz. Nature Biotechnology, 30(4), 295–296.
Article CAS PubMed Google Scholar
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013). TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4), R36.
Article PubMed PubMed Central CAS Google Scholar
Trapnell, C., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562–578.
Article CAS PubMed PubMed Central Google Scholar
Trapnell, C., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5), 511–515.
Article CAS PubMed PubMed Central Google Scholar
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359.
Article CAS PubMed PubMed Central Google Scholar
Ferragina, P., & Manzini, G. (2001). An experimental study of a compressed index. Information Sciences, 135(1–2), 13–28.
Article Google Scholar
Dobin, A., et al. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21.
Article CAS PubMed Google Scholar
Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: A fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357–360.
Article CAS PubMed PubMed Central Google Scholar
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols, 11(9), 1650–1667.
Article CAS PubMed PubMed Central Google Scholar
Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T.-C., Mendell, J. T., & Salzberg, S. L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology, 33(3), 290–295.
Article CAS PubMed PubMed Central Google Scholar
Frazee, A. C., Pertea, G., Jaffe, A. E., Langmead, B., Salzberg, S. L., & Leek, J. T. (2015). Ballgown bridges the gap between transcriptome assembly and expression analysis. Nature Biotechnology, 33(3), 243–246.
Article CAS PubMed PubMed Central Google Scholar
Wang, L., Wang, S., & Li, W. (2012). RSeQC: Quality control of RNA-seq experiments. Bioinformatics, 28(16), 2184–2185.
Article CAS PubMed Google Scholar
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7), 621–628.
Article CAS PubMed Google Scholar
Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A., & Dewey, C. N. (2010). RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4), 493–500.
Article PubMed CAS Google Scholar
Li, B., & Dewey, C. N. (2011). RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12(1), 323.
Article CAS PubMed PubMed Central Google Scholar
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1976). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.
Google Scholar
Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), 166–169.
Article CAS PubMed Google Scholar
Liao, Y., Smyth, G. K., & Shi, W. (2014). featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7), 923–930.
Article CAS PubMed Google Scholar
Lawrence, M., et al. (2013). Software for computing and annotating genomic ranges. PLoS Computational Biology, 9(8), e1003118.
Article CAS PubMed PubMed Central Google Scholar
Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq: Transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521.
Article PubMed CAS Google Scholar
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140.
Article CAS PubMed Google Scholar
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
Article PubMed PubMed Central CAS Google Scholar
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135(3), 370.
Article Google Scholar
Wald, A. (1945). Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2), 117–186.
Article Google Scholar
Feng, J., Meyer, C. A., Wang, Q., Liu, J. S., Shirley Liu, X., & Zhang, Y. (2012). GFOLD: A generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics, 28(21), 2782–2788.
Article CAS PubMed Google Scholar
Tarazona, S., et al. (2015). Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Research, 43(21), e140.
PubMed PubMed Central Google Scholar
Toedling, J., & Huber, W. (2008). Analyzing ChIP-chip data using bioconductor. PLoS Computational Biology, 4(11), e1000227.
Article PubMed PubMed Central CAS Google Scholar
Toedling, J., Sklyar, O., & Huber, W. (2007). Ringo – an R/bioconductor package for analyzing ChIP-chip readouts. BMC Bioinformatics, 8(1), 221.
Article PubMed PubMed Central CAS Google Scholar
Durinck, S., et al. (2005). BioMart and bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics, 21(16), 3439–3440.
Article CAS PubMed Google Scholar
Alexa, A., Rahnenfuhrer, J., & Lengauer, T. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22(13), 1600–1607.
Article CAS PubMed Google Scholar
Zhang, Y., et al. (2008). Model-based Analysis of ChIP-Seq (MACS). Genome Biology, 9(9), R137.
Article PubMed PubMed Central CAS Google Scholar
Xu, S., Grullon, S., Ge, K., & Peng, W. (2014). Spatial clustering for identification of ChIP-enriched regions (SICER) to map regions of histone methylation patterns in embryonic stem cells. Methods in Molecular Biology, 1150, 97.
Article CAS PubMed Google Scholar
Hayatsu, H. (2008). Discovery of bisulfite-mediated cytosine conversion to uracil, the key reaction for DNA methylation analysis – a personal account. Proceedings of the Japan Academy. Series B, Physical and Biological Sciences, 84(8), 321–330.
Article CAS PubMed PubMed Central Google Scholar
Morris, T. J., et al. (2014). ChAMP: 450k chip analysis methylation pipeline. Bioinformatics, 30(3), 428–430.
Article CAS PubMed Google Scholar
Tian, Y., et al. (2017). ChAMP: Updated methylation analysis pipeline for illumina BeadChips. Bioinformatics, 33(24), 3982–3984.
Article CAS PubMed PubMed Central Google Scholar
Aryee, M. J., et al. (2014). Minfi: A flexible and comprehensive bioconductor package for the analysis of infinium DNA methylation microarrays. Bioinformatics, 30(10), 1363–1369.
Article CAS PubMed PubMed Central Google Scholar
Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E., & Storey, J. D. (2012). The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28(6), 882–883.
Article CAS PubMed PubMed Central Google Scholar
Carson Sievert, P. T. I., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., & Despouy, P. (2018). Create interactive web graphics via ‘plotly.js’ [R package plotly version 4.8.0]. Comprehensive R Archive Network (CRAN).
Google Scholar
Krueger, F., & Andrews, S. R. (2011). Bismark: A flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics, 27(11), 1571–1572.
Article CAS PubMed PubMed Central Google Scholar
Chen, P.-Y., Cokus, S. J., & Pellegrini, M. (2010). BS Seeker: Precise mapping for bisulfite sequencing. BMC Bioinformatics, 11(1), 203.
Article CAS PubMed PubMed Central Google Scholar
Kreck, B., Marnellos, G., Richter, J., Krueger, F., Siebert, R., & Franke, A. (2012). B-SOLANA: An approach for the analysis of two-base encoding bisulfite sequencing data. Bioinformatics, 28(3), 428–429.
Article CAS PubMed Google Scholar
Frith, M. C., Mori, R., & Asai, K. (2012). A mostly traditional approach improves alignment of bisulfite-converted DNA. Nucleic Acids Research, 40(13), e100.
Article CAS PubMed PubMed Central Google Scholar
Saito, Y., Tsuji, J., & Mituyama, T. (2014). Bisulfighter: Accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Research, 42(6), e45.
Article CAS PubMed PubMed Central Google Scholar
Xi, Y., & Li, W. (2009). BSMAP: Whole genome bisulfite sequence MAPping program. BMC Bioinformatics, 10(1), 232.
Article PubMed PubMed Central CAS Google Scholar
Assenov, Y., Müller, F., Lutsik, P., Walter, J., Lengauer, T., & Bock, C. (2014). Comprehensive analysis of DNA methylation data with RnBeads. Nature Methods, 11(11), 1138–1140.
Article CAS PubMed PubMed Central Google Scholar
Saito, Y., & Mituyama, T. (2015). Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics, 16(Suppl 12), S3.
Article PubMed PubMed Central CAS Google Scholar
Song, Q., Decato, B., Hong, E. E., Zhou, M., & Fang, F. (2013). A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PLoS One, 8(12), 81148.
Article CAS Google Scholar
Hansen, K. D., Langmead, B., & Irizarry, R. A. (2012). BSmooth: From whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 13(10), R83.
Article PubMed PubMed Central Google Scholar
Hebestreit, K., Dugas, M., & Klein, H.-U. (2013). Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics, 29(13), 1647–1653.
Article CAS PubMed Google Scholar
Wreczycka, K., Gosdschan, A., Yusuf, D., Grüning, B., Assenov, Y., & Akalin, A. (2017). Strategies for analyzing bisulfite sequencing data. Journal of Biotechnology, 261, 105–115.
Article CAS PubMed Google Scholar
Tsuji, J., & Weng, Z. (2015). Evaluation of preprocessing, mapping and postprocessing algorithms for analyzing whole genome bisulfite sequencing data. Briefings in Bioinformatics, 17(6), bbv103.
Article Google Scholar
Eberwine, J., et al. (1992). Analysis of gene expression in single live neurons. Proceedings of the National Academy of Sciences of the United States of America, 89(7), 3010–3014.
Article CAS PubMed PubMed Central Google Scholar
Hwang, B., Lee, J. H., & Bang, D. (2018). Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental and Molecular Medicine, 50(8), 96.
Article CAS PubMed Central Google Scholar
Van Der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Google Scholar
Butler, A., Hoffman, P., Smibert, P., Papalexi, E., & Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5), 411–420.
Article CAS PubMed PubMed Central Google Scholar
Afgan, E., et al. (2018). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Research, 46(W1), W537–W544.
Article CAS PubMed PubMed Central Google Scholar
Ashburner, M., et al. (2000). Gene ontology: Tool for the unification of biology. Nature Genetics, 25(1), 25–29.
Article CAS PubMed PubMed Central Google Scholar
Huang, D. W., Sherman, B. T., & Lempicki, R. A. (2009). Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research, 37(1), 1–13.
Article CAS Google Scholar
Fisher, R. A. (1922). On the interpretation of χ² from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85(1), 87.
Article Google Scholar
Ludbrook, J. (2008). Analysis of 2 × 2 tables of frequencies: Matching test to experimental design. International Journal of Epidemiology, 37(6), 1430–1435.
Article PubMed Google Scholar
Huang, D. W., Sherman, B. T., & Lempicki, R. A. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1), 44–57.
Article CAS Google Scholar
Falcon, S., & Gentleman, R. (2007). Using GOstats to test gene lists for GO term association. Bioinformatics, 23(2), 257–258.
Article CAS PubMed Google Scholar
Maere, S., Heymans, K., & Kuiper, M. (2005). BiNGO: A cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics, 21(16), 3448–3449.
Article CAS PubMed Google Scholar
Subramanian, A., et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550.
Article CAS Google Scholar
Lee, H. K., Braynen, W., Keshav, K., & Pavlidis, P. (2005). ErmineJ: Tool for functional analysis of gene expression data sets. BMC Bioinformatics, 6(1), 269.
Article PubMed PubMed Central CAS Google Scholar
Al-Shahrour, F., et al. (2007). From genes to functional classes in the study of biological systems. BMC Bioinformatics, 8, 114.
Article PubMed PubMed Central CAS Google Scholar
Nam, D., Kim, S.-B., Kim, S.-K., Yang, S., Kim, S.-Y., & Chu, I.-S. (2006). ADGO: Analysis of differentially expressed gene sets using composite GO annotation. Bioinformatics, 22(18), 2249–2253.
Article CAS PubMed Google Scholar
Nogales-Cadenas, R., et al. (2009). GeneCodis: Interpreting gene lists through enrichment analysis and integration of diverse biological information. Nucleic Acids Research, 37(Web Server issue), W317–W322.
Article CAS PubMed PubMed Central Google Scholar
Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30.
Article CAS PubMed PubMed Central Google Scholar
Finn, R. D., et al. (2014). Pfam: The protein families database. Nucleic Acids Research, 42(Database issue), D222–D230.
Article CAS PubMed Google Scholar
Matys, V., et al. (2003). TRANSFAC: Transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31(1), 374–378.
Article CAS PubMed PubMed Central Google Scholar
Warde-Farley, D., et al. (2010). The GeneMANIA prediction server: Biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research, 38(Web Server issue), W214–W220.
Article CAS PubMed PubMed Central Google Scholar
Stark, C., Breitkreutz, B.-J., Reguly, T., Boucher, L., Breitkreutz, A., & Tyers, M. (2006). BioGRID: A general repository for interaction datasets. Nucleic Acids Research, 34(Database issue), D535–D539.
Article CAS PubMed Google Scholar
Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(1), Article17.
Article PubMed Google Scholar
Langfelder, P., & Horvath, S. (2008). WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics, 9(1), 559.
Article PubMed PubMed Central CAS Google Scholar
Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048.
Article CAS PubMed PubMed Central Google Scholar
Gregory, R., Warnes, R., Bolker, B., Bonebakker, L., Gentleman, M., Liaw, W. H. A., Lumley, T., Maechler, B., Magnusson, A., Moeller, S., Schwartz, M., & Venables, B. (2016). Various R programming tools for plotting data. R Package Version, 2(4), 1.
Google Scholar
Walter, W., Sánchez-Cabo, F., & Ricote, M. (2015). GOplot: An R package for visually combining expression data with functional analysis. Bioinformatics, 31(17), 2912–2914.
Article CAS PubMed Google Scholar
Ghosh, D., & Poisson, L. M. (2009). “Omics” data and levels of evidence for biomarker discovery. Genomics, 93, 13–16.
Article CAS PubMed Google Scholar
Wheelock, A. M., & Wheelock, C. E. (2013). Trials and tribulations of ‘omics data analysis: Assessing quality of SIMCA-based multivariate models using examples from pulmonary medicine. Molecular BioSystems, 9, 2589.
Article CAS PubMed Google Scholar
Kraus, L. (2015). Editorial: Would you like a hypothesis with those data? Omics and the age of discovery science. Molecular Endocrinology, 29(11), 1531–1534.
Article PubMed PubMed Central CAS Google Scholar
Vaux, D. L., Fidler, F., & Cumming, G. (2012). Replicates and repeats—What is the difference and is it significant? A brief discussion of statistics and experimental design. EMBO Reports, 13(4), 291.
Article CAS PubMed PubMed Central Google Scholar
Bell, G. (2016). Comment: Replicates and repeats. BMC Biology, 14, 28.
Article PubMed PubMed Central CAS Google Scholar
Whitley, E., & Ball, J. (2002). Statistics review 4: Sample size calculations. Critical Care, 6(4), 335.
Article PubMed PubMed Central Google Scholar
Billoir, E., Navratil, V., & Blaise, B. J. (2015). Sample size calculation in metabolic phenotyping studies. Briefings in Bioinformatics, 16(5), 813–819.
Article PubMed Google Scholar
Urdan, T. C. (2010). Statistics in plain English (3rd ed.). New York: Routledge.
Google Scholar
Pett, M. A. (1997). Nonparametric statistics for health care research: Statistics for small samples and unusual distributions. Thousand Oaks, CA: Sage.
Google Scholar
Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B, 64(Part 3), 479–498.
Article Google Scholar
Feise, R. J. (2002). Do multiple outcome measures require p-value adjustment? BMC Medical Research Methodology, 2, 8.
Article PubMed PubMed Central Google Scholar
Chen, S. Y., Feng, Z., & Yi, X. (2017). A general introduction to adjustment for multiple comparisons. Journal of Thoracic Disease, 9(6), 1725–1729.
Forshed, J. (2017). Experimental design in clinical ‘omics biomarker discovery. Journal of Proteome Research, 16, 3954–3960.
Guyatt, G., Jaeschke, R., Heddle, N., Cook, D., Shannon, H., & Walter, S. (1995). Basic statistics for clinicians: 1. Hypothesis testing. CMAJ, 152(1), 27–32.
Guyatt, G., Jaeschke, R., Heddle, N., Cook, D., Shannon, H., & Walter, S. (1995). Basic statistics for clinicians: 2. Interpreting study results: Confidence intervals. CMAJ, 152(2), 169–173.
Guyatt, G., Walter, S., Shannon, H., Cook, D., Jaeschke, R., & Heddle, N. (1995). Basic statistics for clinicians: 4. Correlation and regression. CMAJ, 152(4), 497–504.
Hanley, J. A., & Moodie, E. E. M. (2011). Sample size, precision and power calculations: A unified approach. Journal of Biometrics and Biostatistics, 2, 5.
Ioannidis, J. P. A., Tarone, R., & McLaughlin, J. K. (2011). The false-positive to false-negative ratio in epidemiologic studies. Epidemiology, 22(4), 450–456.
Jaeschke, R., Guyatt, G., Shannon, H., Walter, S., Cook, D., & Heddle, N. (1995). Basic statistics for clinicians: 3. Assessing the effects of treatment: Measures of association. CMAJ, 152(3), 351–357.
Mazzocchi, F. (2015). Could big data be the end of theory in science? A few remarks on the epistemology of data-driven science. EMBO Reports, 16(10), 1250–1255.
Rajasundaram, D., & Selbig, J. (2016). More effort — More results: Recent advances in integrative ‘omics’ data analysis. Current Opinion in Plant Biology, 30, 57–61.
Senn, S., & Bretz, F. (2007). Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics, 6, 161–170.
Altmäe, S., Esteban, F. J., Stavreus-Evers, A., Simon, C., Giudice, L., Lessey, B. A., Horcajadas, J. A., Macklon, N. S., D’Hooghe, T., Campoy, C., Fauser, B. C., Salamonsen, L. A., & Salumets, A. (2014). Guidelines for the design, analysis and interpretation of ‘omics’ data: Focus on human endometrium. Human Reproduction Update, 20(1), 12–28.


Author information

Authors and Affiliations

Center for Genetic Medicine Research, Children’s National Medical Center, Washington, DC, USA
Surajit Bhattacharya
Center for Translational Research, Children’s National Medical Center, Washington, DC, USA
Heather Gordish-Dressman
Department of Pediatrics, The George Washington University School of Medicine and Health Sciences, Washington, DC, USA
Heather Gordish-Dressman

Authors

Surajit Bhattacharya
Heather Gordish-Dressman

Corresponding author

Correspondence to Heather Gordish-Dressman.

Editor information

Editors and Affiliations

Research Institute for Sport and Exercise Sciences, Liverpool John Moores University, Liverpool, Merseyside, UK
Jatin George Burniston
Children’s National Health System, George Washington University, Washington, DC, USA
Yi-Wen Chen

About this chapter

Cite this chapter

Bhattacharya, S., Gordish-Dressman, H. (2019). Guidelines for Bioinformatics and the Statistical Analysis of Omic Data. In: Burniston, J., Chen, YW. (eds) Omics Approaches to Understanding Muscle Biology. Methods in Physiology. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-9802-9_4

DOI: https://doi.org/10.1007/978-1-4939-9802-9_4
Published: 06 November 2019
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-9801-2
Online ISBN: 978-1-4939-9802-9
eBook Packages: Biomedical and Life Sciences (R0)