Abstract
Clustering genes on the basis of their expression profiles is generally the first step in understanding how a class of genes behaves in a biological process. A number of supervised and unsupervised algorithms are available in the statistics and machine learning literature for clustering microarray data, but they offer limited means of evaluating whether the resulting clusters are biologically meaningful. If two gene sequences are similar, we would expect their expression profiles to be similar and the genes to be similarly annotated in the Gene Ontology (GO) databases. Hence, comparing the expression-level similarity of two genes against the similarity of their GO annotations can test this expectation. Semantic similarity has become a valuable tool for validating the results of biomedical studies such as gene clustering and gene expression data analysis. This paper builds on our previous work on meta-ensembles using cancer datasets, in which the outputs of several clustering algorithms are fed to a consensus-building process to generate a stable set of cluster results. These cluster results are then refined through a sequence of biological validation steps applied to each gene pair of a given cluster, using semantic similarity and sequence similarity. We have tested our approach on several benchmark cancer datasets in an attempt to provide a more accurate biological analysis of the clusters, and the results have been found to be satisfactory.
References
Altschul SF et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410
Azuaje F, Bodenreider O (2004) Incorporating ontology-driven similarity knowledge into functional genomics: an exploratory study. In: Proceedings IEEE Fourth Symp. Bioinformatics and Bioeng. (BIBE 2004). Taichung, Taiwan, 2004
Bhattacherjee V et al (2007) Neural crest and mesoderm lineage-dependent gene expression in orofacial development. Differentiation 75(5):463–477
Cheng J et al (2004) A knowledge-based clustering algorithm driven by gene ontology. J Biopharm Stat 14:687–700
Chenna R et al (2003) Multiple sequence alignment with clustal series of programs. Nucleic Acids Res 31(13):3497–3500
Chu S et al (1998) The transcriptional program of sporulation in budding yeast. Science 282(5389):699–705
Couto FM, Silva MJ, Coutinho P (2003) Implementation of a functional semantic similarity measure between gene-products. Technical report. Univ. of Lisbon, Lisbon
Couto FM, Silva MJ, Coutinho P (2005) Semantic similarity over the gene ontology: Family correlation and selecting disjunctive ancestors. In: Proceedings of the ACM Conference in Information and Knowledge Management, 2005
Couto FM, Silva MJ, Coutinho P (2007) Measuring semantic similarity between gene ontology terms. Data Knowl Eng 61:137–152
Datta S, Datta S (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4):459–466
Datta S, Datta S (2006) Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinform 7:397
DeRisi JL, Iyer VR, Brown PO (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278(5338):680–686
Dopazo J, Carazo JM (1997) Phylogenetic reconstruction using a growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol 44(2):226–233
Dunn JC (1974) Well separated clusters and fuzzy partitions. J Cybern 4:95–104
Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. In: Proceedings Natl Acad Sci USA., 1998
Fraley C, Raftery AE (2001) Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 17:126–136
Gentleman RC, Carey VJ, Bates DM (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
Handl J, Knowles J, Kell DB (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15):3201–3212
Hartigan JA, Wong MA (1979) A K-means clustering algorithm. Appl Stat 28:100–108
Herrero J, Valencia A, Dopazo J (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17(2):126–136
Jiang J, Conrath D (1997) Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the 10th International Conference on Research on Computational Linguistics, Taiwan, 1997
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
Kent W et al (2002) The human genome browser at UCSC. Genome Res 12(6):996–1006
Kohonen T (1997) Self-organizing maps, 2nd edn. Springer-Verlag, Berlin
Lam TW et al (2008) Compressed indexing and local alignment of DNA. Bioinformatics 24(6):791–797
Larkin M et al (2007) Clustal W and clustal X version 2.0. Bioinformatics 23(21):2947–2948
Lee HK et al (2004) Coexpression analysis of human genes across many microarray data sets. Genome Res 14:1085–1094
Li J, Liu H (2002) Kent ridge bio-medical dataset repository (Online). Available at: http://sdmc.lit.org.sg/GEDatasets/Datasets.html
Li J, Gong B, Chen X et al (2011) DOSim: an R package for similarity between diseases based on disease ontology. BMC Bioinform 12:266
Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA, Morgan Kaufmann, 1998
Lord PW, Stevens RD, Brass A, Goble CA (2003a) Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19:1275–1283
Lord PW, Stevens RD, Brass A, Goble CA (2003b) Semantic similarity measures as tools for exploring the gene ontology. In: Proceedings of the 8th Pacific Symposium on Biocomputing. 2003
Nagi S, Bhattacharyya DK (2013) Classification of microarray cancer data using ensemble approach. Netw Model Anal Health 2(3):159–173
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453
Newberg LA (2008) Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. Bioinformatics 26(16):1772–1778
Othman R, Deris S, Illias R (2007) A genetic similarity algorithm for searching the gene ontology terms and annotating anonymous protein sequences. J Biomed Inf 23:529–538
Pekar V, Staab S (2002) Taxonomy learning: factoring the structure of a taxonomy into a semantic classification decision. In: Proceedings of the 19th international conference on Computational linguistics. Morristown, NJ, USA, 2002
Rada R, Mili H, Bicknell E, Blettner M (1989) Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 1989
Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence. 1995
Resnik P (1999) Semantic similarity in a taxonomy: an information based measure and its application to problems of ambiguity in natural language. J Artif Intell Res 11:95–130
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T (2006) A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform 7:302
Sevilla JL et al (2005) Correlation between gene expression and GO semantic similarity. IEEE/ACM Trans Comput Biol Bioinf 2(4):330–337
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
Stuart JM, Segal E, Koller D, Kim SK (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302(5643):249–255
Su AI et al (2002) Large-scale analysis of the human and mouse transcriptomes. In: Proceedings of the National Academy of Science, USA, 2002
R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: http://www.R-project.org
The Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res 11(8):1425–1433
van’t Veer LJ (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415:530–536
Wang JZ et al (2007) A new method to measure the semantic similarity of GO terms. Bioinformatics
Wang H, Azuaje F, Bodenreider O, Dopazo J (2004) Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. In: Proceedings Computational Intelligence in Bioinformatics and Computational Biology. CA, USA, 2004
Wang H, Azuaje F, Bodenreider O (2005) An ontology-driven clustering method for supporting gene expression analysis, computer-based medical systems. In: Proceedings IEEE Symposium on Computer-based Medical Systems. 2005
Wu H et al (2005) Prediction of functional modules based on comparative genome analysis and gene ontology application. Nucleic Acids Res 33:2822–2837. Available at: http://www.view.ncbi.nlm.nih.gov/pubmed/15901854
Wu Z, Palmer MS (1994) Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994). 1994
Wu X et al (2006) Prediction of yeast protein-protein interaction network: insights from the gene ontology and annotations. Nucleic Acids Res 34:2137–2150
Yeung KY, Haynor DR, Ruzzo WL (2001) Validating clustering for gene expression data. Bioinformatics 17(4):309–318
Yu H, Gao L, Tu K, Guo Z (2005) Broadly predicting specific gene functions with expression similarity and taxonomy similarity. Gene 352:75–81
Yu G et al (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26(7):976–978
Zheng H, Azuaje F, Wang H (2010) seGOsa: software environment for Gene Ontology-driven similarity assessment. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM’10). 2010
Appendix
The results for the experiments involving semantic similarity, sequence similarity, internal measures, stability measures and biological measures on the (a) Breast Cancer Dataset (b) Lymphoma Dataset and (c) Embryonal Tumours of the Central Nervous System (CNS) dataset are given here.
1.1 Appendix A: The pair-wise gene expression similarity matrix
The pair-wise gene expression similarity is calculated using the Pearson correlation for (a) the Breast Cancer Dataset, (b) the Lymphoma Dataset and (c) the Embryonal Tumours of the Central Nervous System (CNS) dataset. The similarity matrix values for some of the genes are shown in Fig. 8 below, along with plots of the values. These expression similarity values are used subsequently for comparison with the semantic similarity values of the corresponding gene pairs in Appendix C.
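As a minimal sketch of how such a pair-wise matrix can be built, the following computes the Pearson correlation for every pair of genes in pure Python. The gene names and expression values are hypothetical toy data, not values from the datasets above, and the function names are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    norm = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sum(a * b for a, b in zip(dx, dy)) / norm


def expression_similarity(profiles):
    """Return {(gene_i, gene_j): r} for every unordered pair of genes."""
    genes = sorted(profiles)
    return {
        (g, h): pearson(profiles[g], profiles[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


# Hypothetical toy profiles (not data from the paper).
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
toy_similarity = expression_similarity(toy_profiles)
for pair, r in toy_similarity.items():
    print(pair, round(r, 3))
```

In this toy example geneA and geneB rise together and correlate positively, while geneC falls as geneA rises and correlates negatively.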
1.2 Appendix B: The pair-wise semantic similarity matrix
The pair-wise semantic similarity matrices for the Lin, Jiang and Conrath, and Wang measures are calculated for the Breast Cancer Dataset; the semantic similarity values and plots for some of the genes are shown in Fig. 9 below.
The pair-wise semantic similarity matrix given in Fig. 10 shows the values for Lin, Jiang and Conrath and Wang measures for the Lymphoma Dataset along with the plots.
Figure 11 below shows the pair-wise semantic similarity values and plots of some of the genes for the Lin, Jiang and Conrath and Wang measures for the Embryonal Tumours of the Central Nervous System Dataset.
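To make the information-content-based measures concrete, here is a small sketch of the Lin similarity and the Jiang and Conrath distance between two ontology terms. The ontology, the term names and the annotation probabilities are hypothetical. The Wang measure is graph-based and is not shown, and the step that aggregates term-level scores into a gene-level score is also omitted. Jiang and Conrath define a distance; the conversion to a similarity varies between implementations and is not assumed here.

```python
from math import log

# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
# Hypothetical probabilities p(t): share of annotations at term t or below it.
PROB = {"root": 1.0, "A": 0.5, "B": 0.5, "A1": 0.25, "A2": 0.25}


def information_content(term):
    """IC(t) = -log p(t); rarer terms are more informative."""
    return -log(PROB[term])


def ancestors(term):
    """All ancestors of a term, including the term itself."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def mica_ic(t1, t2):
    """IC of the most informative common ancestor of two terms."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


def lin_similarity(t1, t2):
    """Lin: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    denom = information_content(t1) + information_content(t2)
    if denom <= 0:
        return 1.0  # assumption: two zero-IC terms (the root) count as identical
    return 2 * mica_ic(t1, t2) / denom


def jiang_conrath_distance(t1, t2):
    """Jiang and Conrath: IC(t1) + IC(t2) - 2 * IC(MICA)."""
    return information_content(t1) + information_content(t2) - 2 * mica_ic(t1, t2)


print(round(lin_similarity("A1", "A2"), 3))
print(round(jiang_conrath_distance("A1", "A2"), 3))
```

Sibling terms under a shared informative parent (A1 and A2) score higher than terms whose only common ancestor is the root (A1 and B).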
1.3 Appendix C: Comparison of pair-wise gene expression similarity and semantic similarity
Figure 12 below gives a comparison of expression similarity and semantic similarity of four sample gene pairs of the Breast Cancer Dataset, suggesting that gene products with similar expression patterns might have similarly annotated profiles.
Figures 13 and 14 show the comparison of expression similarity and semantic similarity for four sample gene pairs of the Lymphoma dataset and the Embryonal Tumours of the Central Nervous System dataset, respectively. The assumption that gene products with similar expression patterns might have similarly annotated profiles also seems to hold for these datasets: the graphs obtained by plotting the values for the genes exhibit a similar trend.
1.4 Appendix D: Comparison of pair-wise gene expression similarity, semantic similarity and sequence similarity
The gene expression similarity, the Lin, Jiang and Conrath, and Wang semantic similarities, and the sequence similarity of some of the gene pairs of the Breast Cancer Dataset are given in Fig. 15. The graph shows that the pair-wise scores of the genes follow a common trend across the measures, indicating a correlation between gene expression similarity, semantic similarity and sequence similarity.
Figure 16 depicts the same measures for some of the gene pairs of the Lymphoma Dataset. Again, the pair-wise scores follow a common trend across the measures, indicating a correlation between gene expression similarity, semantic similarity and sequence similarity. This is also borne out by Fig. 17 for the gene pairs of the Embryonal Tumours of the Central Nervous System Dataset.
1.5 Appendix E: Result of internal validation
The internal validation measures of connectivity, Dunn index and silhouette width for each of the eight algorithms for the Breast Cancer dataset are shown in Table 11. We notice that hierarchical clustering with two clusters performs the best in the case of connectivity and silhouette width and with four clusters in case of Dunn index. The plots of the connectivity, Dunn index, and silhouette width are given in Fig. 18, which indicates that hierarchical clustering outperforms the other clustering algorithms under each validation measure and hence appears to be the method of choice.
For the Lymphoma dataset, the connectivity, Dunn index and silhouette width were calculated for each of the eight algorithms, and the results are shown in Table 12. As in Table 11, hierarchical clustering with two clusters performs best for connectivity and silhouette width, and with three clusters for the Dunn index. Hierarchical clustering thus outperforms the other clustering algorithms under each validation measure, and the plots of connectivity, Dunn index and silhouette width in Fig. 19 likewise suggest that it is the method of choice.
The internal validation scores for connectivity, Dunn index and silhouette width for the Embryonal Tumours of the Central Nervous System dataset are shown in Table 13. The optimal scores show that hierarchical clustering with two clusters performs best in each case, which is confirmed by the plots of connectivity, Dunn index and silhouette width in Fig. 20. SOTA does not perform well, as it could not uncover clusters in the range of four to six clusters.
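For readers unfamiliar with these internal measures, the following sketch computes the Dunn index and the average silhouette width for a given partition, using Euclidean distance on hypothetical toy points. It assumes at least two clusters and at least one cluster with two distinct points; connectivity is not shown. Higher values are better for both measures.

```python
from itertools import combinations
from math import dist  # available from Python 3.8


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    min_separation = min(
        dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )
    max_diameter = max(
        dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


def silhouette_width(clusters):
    """Average silhouette value over all points (singleton clusters score 0)."""
    scores = []
    for k, cluster in enumerate(clusters):
        for i, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other members of the point's own cluster
            a = sum(dist(p, q) for j, q in enumerate(cluster) if j != i)
            a /= len(cluster) - 1
            # b: mean distance to the nearest other cluster
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for m, other in enumerate(clusters)
                if m != k
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy partition: two compact, well-separated clusters.
toy_clusters = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(5.0, 0.0), (5.0, 1.0)],
]
print(dunn_index(toy_clusters))
print(round(silhouette_width(toy_clusters), 3))
```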
1.6 Appendix F: Result of stability measures
The results of APN, AD, ADM and FOM for the Breast Cancer dataset are given in Table 14.
For the APN and ADM measures, values close to zero are preferred. The optimal scores in Table 14 show that hierarchical clustering with four clusters gives the best score for these measures, as was also the case for internal validation. For the other two measures, AD and FOM, model-based clustering with six clusters has the best score. It is instructive to visualize each of the validation measures graphically.
The plots of APN, AD and ADM are given in Fig. 21. The APN measure shows an interesting trend: it initially stabilizes from two to four clusters for all the clustering methods except SOM and SOTA, but increases marginally afterwards. Although hierarchical clustering with four clusters has the best score, Diana with six clusters is a close second. The AD and FOM measures tend to decrease as the number of clusters increases. Here model-based clustering with six clusters has the best overall score, though the other algorithms have similar scores. The plot of the FOM measure is very similar to that of the AD measure, so we have omitted it from the figure. For the ADM measure, hierarchical clustering with four clusters again has the best score.
For the Lymphoma dataset, the plots of APN, AD and ADM are given in Fig. 22. Although the APN plot shows Diana as the most favourable algorithm, one must keep in mind that Diana is a special case of the hierarchical algorithm. Hierarchical clustering with two clusters thus gives the best score, in line with the internal validation results and with the optimal scores given in Table 15. For the AD and FOM measures, PAM with six clusters has the best overall score, but over the entire range of cluster numbers evaluated, SOM, K-means and Diana have comparable performance. For the ADM measure, hierarchical clustering is more stable and performs better.
The plots of APN, AD and ADM for the Embryonal Tumours of the Central Nervous System Dataset are given in Fig. 23. Hierarchical clustering with two clusters achieves the best score, followed by Diana, which confirms the findings of internal validation. This is also confirmed by the optimal scores given in Table 16. PAM and k-means also perform well. For the AD and FOM measures, PAM with six clusters has the best overall score, along with k-means. SOTA does not perform well in this case, as it could not uncover clusters in the range of four to six clusters.
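The stability measures compare a clustering of the full data with clusterings obtained after deleting one column at a time. The text above does not spell out the formulas, so the sketch below assumes the commonly used definition of APN (average proportion of non-overlap): for each observation, one minus the fraction of its full-data cluster that stays in its cluster after a column is removed, averaged over all observations and all removed columns. The label vectors are hypothetical and would in practice come from re-running a clustering algorithm.

```python
def apn(full_labels, reduced_labelings):
    """Average proportion of non-overlap (assumed definition, see lead-in).

    full_labels: cluster label per observation, from the full data.
    reduced_labelings: one label list per deleted column, same observation order.
    """
    n = len(full_labels)
    total = 0.0
    for reduced in reduced_labelings:
        for i in range(n):
            full_cluster = {j for j in range(n) if full_labels[j] == full_labels[i]}
            reduced_cluster = {j for j in range(n) if reduced[j] == reduced[i]}
            overlap = len(full_cluster & reduced_cluster)
            total += 1 - overlap / len(full_cluster)
    return total / (n * len(reduced_labelings))


# Hypothetical labels: one reduced clustering agrees fully, one moves a gene.
full = [0, 0, 1, 1]
reduced_runs = [[0, 0, 1, 1], [0, 1, 1, 1]]
print(apn(full, reduced_runs))
```

Because only co-membership is compared, the result does not depend on how cluster labels are numbered, and a value of zero means the clustering is unchanged by column deletion.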
1.7 Appendix G: Results of BHI and BSI
The BHI and BSI values were computed for each clustering algorithm for two to six clusters. We first consider the breast cancer data. Table 17 shows the scores for the Breast Cancer Dataset: Diana has the highest BHI score, for six clusters, and the hierarchical algorithm has the highest BSI score, for two clusters, indicating that the hierarchical algorithm clusters genes with similar biological functionality most consistently.
Figure 24 shows the plots of BHI for the eight clustering algorithms. They reveal that Diana produces the most homogeneous biological clusters for this data set, and the results are statistically significant when the number of clusters is between four and six.
The plots of BSI are shown in Fig. 25. The hierarchical algorithm seems to be the most stable in producing biologically alike clusters from reduced data sets. Considering both indices, we would say that the hierarchical algorithm is the best choice for this data set to maximize biological homogeneity, and that Diana is worth considering if six clusters are desired.
The scores for the Lymphoma Dataset are shown in Table 18. SOM has the highest BHI score, for six clusters, and the highest BSI score is again achieved by the hierarchical algorithm, for two clusters, which indicates its consistency in producing homogeneous biological clusters for this data set.
Figure 26 shows that SOM produces the most homogeneous biological clusters when six clusters are required and that the hierarchical algorithm is the most consistent of all the algorithms. The plots of BSI are shown in Fig. 27: the hierarchical algorithm appears to be the most stable in producing biologically alike clusters, and model-based clustering appears to be the least stable. Considering both indices, we conclude that the hierarchical algorithm seems to be the best choice for this data set to maximize biological homogeneity.
Finally, for the Embryonal Tumours of the Central Nervous System dataset, the BHI and BSI scores are shown in Table 19; in both cases, the hierarchical algorithm scores highest for producing biologically significant clusters.
Although the hierarchical algorithm shows a marked increase for five or six clusters, SOM can be the algorithm of choice when four clusters are desired, as shown in Fig. 28. Comparing the plots of BSI in Fig. 29, we see that all the clustering algorithms produce significantly consistent results except SOTA, which could not generate clusters in the range of four to six clusters. The hierarchical algorithm seems to be the most stable in producing biologically relevant clusters, and considering both indices, we conclude that it is the best choice for this data set to maximize biological homogeneity.
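As an illustration of what the BHI (biological homogeneity index) measures, the sketch below computes, for each cluster, the fraction of annotated gene pairs that share at least one functional class, and then averages over clusters. This follows the usual definition rather than a formula stated above. The gene names and functional classes are hypothetical, and skipping clusters with fewer than two annotated genes is an assumption of this sketch.

```python
from itertools import combinations


def bhi(clusters, annotations):
    """Average, over clusters, of the share of gene pairs with a common class.

    clusters: list of lists of gene identifiers.
    annotations: gene -> set of functional classes (unannotated genes are ignored).
    """
    per_cluster = []
    for cluster in clusters:
        annotated = [g for g in cluster if annotations.get(g)]
        pairs = list(combinations(annotated, 2))
        if not pairs:
            continue  # assumption: clusters with < 2 annotated genes are skipped
        shared = sum(1 for g, h in pairs if annotations[g] & annotations[h])
        per_cluster.append(shared / len(pairs))
    return sum(per_cluster) / len(per_cluster) if per_cluster else 0.0


# Hypothetical clusters and functional classes.
toy_gene_clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
toy_annotations = {
    "g1": {"x"},
    "g2": {"x"},
    "g3": {"y"},
    "g4": {"z"},
    "g5": {"z"},
}
print(round(bhi(toy_gene_clusters, toy_annotations), 3))
```

The BSI combines this kind of functional-class comparison with the column-deletion scheme used for the stability measures and is not shown here.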
Cite this article
Nagi, S., Bhattacharyya, D.K. Cluster analysis of cancer data using semantic similarity, sequence similarity and biological measures. Netw Model Anal Health Inform Bioinforma 3, 67 (2014). https://doi.org/10.1007/s13721-014-0067-9
DOI: https://doi.org/10.1007/s13721-014-0067-9