Abstract
Discovering DNA-protein binding sites, also known as motif discovery, is the foundation for further analyses of transcription factors (TFs). Deep learning algorithms such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been introduced to the motif discovery task and have achieved state-of-the-art performance. However, these methods still have limitations, such as neglecting the context information in large-scale sequencing data. Thus, inspired by the similarity between DNA sequences and human language, in this paper we propose a hierarchical attention network for predicting DNA-protein binding sites, based on a natural language processing method for document classification. The proposed method is tested on real ChIP-seq datasets, and the experimental results show a considerable improvement compared with two well-tested deep learning-based sequence models, DeepBind and DeepSEA.
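To make the document-classification analogy concrete, the following is a minimal sketch of two-level attention pooling over a DNA sequence. It is not the authors' implementation: the abstract does not say how sequences are tokenized, so treating overlapping 3-mers as "words" and fixed-size groups of them as "sentences" is an assumption, as are the random embeddings and context vectors and all function names. A full hierarchical attention network would also pass each level through a recurrent encoder before attention; that step is omitted here for brevity.

```python
import itertools
import math
import random


def kmer_words(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (assumed "words")."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def chunk(items, size):
    """Group consecutive words into fixed-size blocks (assumed "sentences")."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Score each vector against a context vector, then return the
    softmax-weighted sum of the vectors together with the weights."""
    scores = [sum(a * b for a, b in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors))
              for d in range(len(context))]
    return pooled, weights


def encode_sequence(seq, embed, word_ctx, sent_ctx, k=3, sent_len=4):
    """Word-level attention inside each sentence, then sentence-level
    attention across sentences, giving one vector per sequence."""
    sentences = chunk(kmer_words(seq, k), sent_len)
    sent_vecs = [attention_pool([embed[w] for w in s], word_ctx)[0]
                 for s in sentences]
    return attention_pool(sent_vecs, sent_ctx)


# Hypothetical, untrained parameters; in practice these would be learned.
rng = random.Random(0)
dim = 4
embed = {"".join(p): [rng.uniform(-1, 1) for _ in range(dim)]
         for p in itertools.product("ACGT", repeat=3)}
word_ctx = [rng.uniform(-1, 1) for _ in range(dim)]
sent_ctx = [rng.uniform(-1, 1) for _ in range(dim)]

seq_vec, sent_weights = encode_sequence("ACGTGCATGACCTGAA", embed,
                                        word_ctx, sent_ctx)
```

In a trained model, `seq_vec` would feed a classifier that predicts bound versus unbound, and `sent_weights` would indicate which regions of the sequence the model attends to most.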
Acknowledgements
This work was supported by the grants of the National Science Foundation of China, Nos. 61861146002, 61520106006, 61772370, 61873270, 61702371, 61672382, 61672203, 61572447, 61772357, and 61732012, China Post-doctoral Science Foundation Grant, No. 2017M611619, and supported by “BAGUI Scholar” Program and the Scientific & Technological Base and Talent Special Program, GuiKe AD18126015 of the Guangxi Zhuang Autonomous Region of China.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Yu, W., Yuan, CA., Qin, X., Huang, ZK., Shang, L. (2019). Hierarchical Attention Network for Predicting DNA-Protein Binding Sites. In: Huang, DS., Jo, KH., Huang, ZK. (eds) Intelligent Computing Theories and Application. ICIC 2019. Lecture Notes in Computer Science, vol 11644. Springer, Cham. https://doi.org/10.1007/978-3-030-26969-2_35
DOI: https://doi.org/10.1007/978-3-030-26969-2_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26968-5
Online ISBN: 978-3-030-26969-2
eBook Packages: Computer Science, Computer Science (R0)