Abstract
The annotation and characterization of tissue-specific cis-regulatory elements (CREs) in non-coding DNA remains an open challenge in computational genomics. Several prior works show that machine learning methods, using epigenetic or spectral features directly extracted from DNA sequences, can predict active promoters and enhancers in specific tissues or cell lines. In particular, deep-learning techniques have recently achieved state-of-the-art results on this task. In this study, we provide additional evidence that feed-forward neural networks (FFNN) trained on epigenetic data and one-dimensional convolutional neural networks (CNN) trained on DNA sequence data can successfully predict active regulatory regions in different cell lines. We show that model selection by means of Bayesian optimization, applied to both FFNN and CNN models, can significantly improve deep neural network performance by automatically finding models that best fit the data. Further, we show that techniques applied to balance active and non-active regulatory regions in the human genome in training and test data may lead to over-optimistic or poor predictions. We recommend evaluating generalization performance on actual imbalanced data that was not used to train the models.
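The effect of test-set balancing on reported performance can be illustrated with a minimal sketch. It does not use the paper's data, models, or pipeline: the classifier scores are synthetic draws from two Gaussians, the 1:100 class ratio is an assumed value chosen only for illustration, and `average_precision` is a small helper written for this example (a step-wise estimate of the area under the precision-recall curve).

```python
import numpy as np


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    order = np.argsort(-scores)              # rank examples by decreasing score
    y = y_true[order]
    true_positives = np.cumsum(y)
    precision = true_positives / np.arange(1, len(y) + 1)
    return float(precision[y == 1].mean())


rng = np.random.default_rng(0)

# Assumed, illustrative class sizes: 200 active vs 20000 non-active regions.
n_pos, n_neg = 200, 20000

# Synthetic scores: active regions score higher on average than non-active ones.
scores = np.concatenate([rng.normal(1.0, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)])
labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)]).astype(int)

# Evaluation on the full, imbalanced test set.
ap_imbalanced = average_precision(labels, scores)

# Evaluation after balancing the test set by undersampling the negatives.
neg_idx = rng.choice(np.arange(n_pos, n_pos + n_neg), size=n_pos, replace=False)
keep = np.concatenate([np.arange(n_pos), neg_idx])
ap_balanced = average_precision(labels[keep], scores[keep])

print(f"AUPRC estimate, imbalanced test set: {ap_imbalanced:.3f}")
print(f"AUPRC estimate, balanced test set:   {ap_balanced:.3f}")
```

The scores are identical in both evaluations; only the composition of the test set changes. Under these assumptions the balanced estimate comes out higher, which is the kind of over-optimism the abstract warns about.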
Notes
- 1. ENCODE data at ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC.
- 2. ENCODE data at ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC; ENCODE fold-change values are described at https://sites.google.com/site/anshulkundaje.
- 3.
- 4.
- 5.
- 6. For computing Multiple Correspondence Analysis we used the Python package available at https://github.com/esafak/mca.
- 7.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Cappelletti, L. et al. (2020). Bayesian Optimization Improves Tissue-Specific Prediction of Active Regulatory Regions with Deep Neural Networks. In: Rojas, I., Valenzuela, O., Rojas, F., Herrera, L., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2020. Lecture Notes in Computer Science, vol 12108. Springer, Cham. https://doi.org/10.1007/978-3-030-45385-5_54
DOI: https://doi.org/10.1007/978-3-030-45385-5_54
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45384-8
Online ISBN: 978-3-030-45385-5
eBook Packages: Computer Science, Computer Science (R0)