Abstract
A fast and highly memory-efficient implementation of the TRIBE-MCL clustering algorithm is proposed for classifying huge protein sequence data sets on an ordinary PC. The improvements over previous versions come from suitably chosen data structures that enable efficient handling of symmetric sparse matrices. The proposed algorithm was tested on huge synthetic protein sequence data sets. Validation showed that the proposed method extends the data size that can be processed on a regular PC from the previously reported 250 thousand items to one million. For the same data sizes, the algorithm needs 10–20% less time than previous efficient Markov clustering algorithms, with no loss of partition quality. The proposed solution is open to further improvement via parallel data processing.
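For readers unfamiliar with Markov clustering, the following is a minimal sketch of the generic MCL iteration (expansion, inflation, pruning) on a sparse matrix stored column by column. It is not the data structure or implementation proposed in this paper: the function names (`mcl`, `expand`, `inflate`), the dict-of-dicts storage, the self-loop weight of 1.0, and the parameters `r`, `eps`, `tol` are all assumptions chosen for illustration. The only use of symmetry here is that the input similarities are given once per pair (upper triangle).

```python
from collections import defaultdict

def normalize(cols):
    """Scale every sparse column so that its entries sum to 1."""
    return {j: {i: v / sum(col.values()) for i, v in col.items()}
            for j, col in cols.items()}

def expand(cols):
    """Expansion: square the matrix. Column j of M*M is sum_k M[k][j] * column k."""
    out = {}
    for j, col in cols.items():
        acc = defaultdict(float)
        for k, w in col.items():
            for i, v in cols[k].items():
                acc[i] += w * v
        out[j] = dict(acc)
    return out

def inflate(cols, r, eps):
    """Inflation: raise entries to power r, prune tiny entries, renormalize."""
    out = {}
    for j, col in cols.items():
        powered = {i: v ** r for i, v in col.items()}
        total = sum(powered.values())
        kept = {i: v for i, v in powered.items() if v / total >= eps}
        if not kept:  # never empty a column: keep its largest entry
            best = max(powered, key=powered.get)
            kept = {best: powered[best]}
        out[j] = kept
    return normalize(out)

def max_change(a, b):
    """Largest absolute entry-wise difference between two sparse matrices."""
    return max(abs(a[j].get(i, 0.0) - b[j].get(i, 0.0))
               for j in a for i in set(a[j]) | set(b[j]))

def mcl(edges, n, r=2.0, eps=1e-4, tol=1e-9, max_iter=100):
    """Cluster nodes 0..n-1 from symmetric similarities given as (i, j, s), i < j."""
    cols = {j: {j: 1.0} for j in range(n)}  # self-loops (assumed weight 1.0)
    for i, j, s in edges:                   # mirror the upper triangle
        cols[i][j] = s
        cols[j][i] = s
    cols = normalize(cols)
    for _ in range(max_iter):
        new = inflate(expand(cols), r, eps)
        done = max_change(cols, new) < tol
        cols = new
        if done:
            break
    # Nodes whose columns share surviving rows (attractors) form one cluster.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for j, col in cols.items():
        for i in col:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for node in range(n):
        groups[find(node)].append(node)
    return sorted(sorted(g) for g in groups.values())

# Two tightly connected triples joined by one weak similarity.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]
clusters = mcl(edges, 6)
```

Pruning entries below `eps` after each inflation is what keeps the matrix sparse from one iteration to the next; in this sketch it is a simple threshold, and a memory-efficient implementation would additionally need a compact storage layout rather than nested dictionaries.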
Research supported by the Hungarian National Research Funds (OTKA), Project no. PD103921. S. M. Szilágyi is a Bolyai Fellow of the Hungarian Academy of Sciences.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Szilágyi, L., Nagy, L.L., Szilágyi, S.M. (2015). Recent Advances in Improving the Memory Efficiency of the TRIBE MCL Algorithm. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9490. Springer, Cham. https://doi.org/10.1007/978-3-319-26535-3_4
DOI: https://doi.org/10.1007/978-3-319-26535-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26534-6
Online ISBN: 978-3-319-26535-3
eBook Packages: Computer Science (R0)