Abstract
Document clustering is generally the first step in topic identification. Since many clustering methods operate on the similarities between documents, it is important to build representations of these documents that preserve their semantics as much as possible and are also suitable for efficient similarity calculation. As we describe in Koopman et al. (Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf, 2015), the metadata of articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and consequently supports the contextual exploration of these entities in LittleAriadne. However, this semantic matrix does not allow similarities between articles to be calculated directly. In this paper, we describe in detail how we build a semantic representation for an article from the entities that are associated with it. Based on these semantic representations of articles, we apply two standard clustering methods, K-Means and the Louvain community detection algorithm, which lead to our two clustering solutions, labelled OCLC-31 (K-Means) and OCLC-Louvain (Louvain). We also give the implementation details and a basic comparison with other clustering solutions that are reported in this special issue.
Notes
Since the sum of squares is the squared Euclidean distance, this is intuitively the “nearest” mean.
CWTS-C5 and UMSI0 are clustering solutions generated by two different methods, Infomap and the Smart Local Moving Algorithm (SLMA) respectively, applied to the direct citation network of articles. The two ECOOM clustering solutions are generated by applying the Louvain method to find communities among bibliographically coupled articles; ECOOM-NLP11 also incorporates keyword information. The STS-RG clusters are generated by first projecting the small Astro dataset onto the full Scopus database and then collecting the cluster assignments of its articles after the full set of Scopus articles has been clustered using SLMA on the direct citation network. A more detailed account can be found in Velden et al. (2017).
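The "nearest mean" assignment mentioned in the first note can be illustrated with a minimal sketch. It is not the implementation used for OCLC-31. The function names, the toy two-dimensional points, and the two fixed means are all assumptions made for illustration, and only the assignment step of K-Means is shown, not the update of the means.

```python
def squared_euclidean(x, y):
    """Sum of squared coordinate differences, i.e. the squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def assign_to_nearest_mean(points, means):
    """Return, for each point, the index of the mean with the smallest squared distance."""
    labels = []
    for p in points:
        distances = [squared_euclidean(p, m) for m in means]
        labels.append(distances.index(min(distances)))
    return labels


# Toy data (hypothetical): four 2-D "article vectors" and two cluster means.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (4.8, 5.1)]
means = [(0.0, 0.0), (5.0, 5.0)]

labels = assign_to_nearest_mean(points, means)
print(labels)
```

Because minimising the sum of squares and minimising the Euclidean distance select the same mean, the square root can be skipped in the assignment step.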
References
Achlioptas, D. (2003). Database-friendly random projections: Johnson–Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687. doi:10.1016/S0022-0000(03)00025-4.
Béjar, J. (2013). K-means vs mini batch k-means: A comparison. Tech. rep., Universitat Politècnica de Catalunya. http://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf.
Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, P10008. (12pp).
Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374.
Boyack, K. W., Small, H., & Klavans, R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759–1767. doi:10.1002/asi.22896.
Bruckner, E., Ebeling, W., & Scharnhorst, A. (1990). The application of evolution models in scientometrics. Scientometrics, 18(1–2), 21–41. doi:10.1007/BF02019160.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. Studies in Linguistic Analysis, pp. 1–32.
Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1983). Statistical semantics: Analysis of the potential performance of keyword information systems. Bell System Technical Journal, 62(6), 1753–1806. doi:10.1002/j.1538-7305.1983.tb03513.x.
Garfield, E. (1983). Citation indexing—Its theory and application in science, technology and humanities. Philadelphia: ISI Press.
Glänzel, W., & Czerwon, H. J. (1996). A new methodological approach to bibliographic coupling and its application to the national, regional and institutional level. Scientometrics, 37, 195–221.
Glänzel, W., & Thijs, B. (2017). Using hybrid methods and ‘core documents’ for the representation of clusters and topics: The astronomy dataset. Scientometrics. doi:10.1007/s11192-017-2301-6.
Gläser, J., Glänzel, W., & Scharnhorst, A. (2017). Same data: different results? Towards a comparative approach to the identification of thematic structures in science. Scientometrics. doi:10.1007/s11192-017-2296-z.
Harris, Z. (1954). Distributional structure. Word, 10(2–3), 146–162.
Johnson, W., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
Koopman, R., Wang, S., & Scharnhorst, A. (2015). Contextualization of topics—browsing through terms, authors, journals and cluster allocations. In A. A. Salah, Y. Tonta, A. A. A. Salah, C. R. Sugimoto, & U. Al (Eds.), Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf.
Koopman, R., Wang, S., & Scharnhorst, A. (2017). Contextualization of topics—browsing through the universe of bibliographic information. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics.
Koopman, R., Wang, S., Scharnhorst, A., & Englebienne, G. (2015). Ariadne’s thread: Interactive navigation in a world of networked information. In B. Begole, J. Kim, K. Inkpen, & W. Woo (Eds.), Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, CHI 2015 Extended Abstracts, Seoul, Republic of Korea, April 18–23, 2015, pp. 1833–1838. ACM. doi:10.1145/2702613.2732781.
Leydesdorff, L. (1989). Words and co-words as indicators of intellectual organization. Research Policy, 18(4), 209–223. doi:10.1016/0048-7333(89)90016-4.
Leydesdorff, L., & Hellsten, I. (2006). Measuring the meaning of words in contexts: An automated analysis of controversies about ‘monarch butterflies’, ‘frankenfoods’, and ‘stem cells’. Scientometrics, 67(2), 231–258.
MacKay, D. (2003). Information theory, inference and learning algorithms, Chapter 20: An example inference task: Clustering, pp. 284–292. Cambridge University Press.
Newman, M. E. (2006). Modularity and community structure in networks. Proc Natl Acad Sci USA, 103(23), 8577–8582. doi:10.1073/pnas.0601602103. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=pubmed&list_uids=16723398&dopt=AbstractPlus.
Rip, A., & Courtial, J. P. (1984). Co-word maps of biotechnology: An example of cognitive scientometrics. Scientometrics, 6(6), 381–400.
Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53–65. doi:10.1016/0377-0427(87)90125-7.
Sahlgren, M. (2008). The distributional hypothesis. Rivista di Linguistica, 20(1), 33–53.
Sculley, D. (2016). Web scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178. Raleigh, NC, USA.
Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24, 265–269.
Sugimoto, C. R., & Weingart, S. (2015). The kaleidoscope of disciplinarity. Journal of Documentation, 71(4), 775–794. doi:10.1108/JD-06-2014-0082. http://www.scopus.com/inward/record.url?eid=2-s2.0-84933503812&partnerID=tZOtx3y1.
Velden, T., Boyack, K., van Eck, N., Glänzel, W., Gläser, J., Havemann, F., Heinz, M., Koopman, R., Scharnhorst, A., Thijs, B., & Wang, S. (2017). Comparison of topic extraction approaches and their results. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics.
Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837–2854.
Weaver, W. (1955). Translation. In W. Locke & D. Booth (Eds.), Machine translation of languages (pp. 15–23). Cambridge, Massachusetts: MIT Press.
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques (3rd ed.). The Morgan Kaufmann series in data management systems. Burlington: Morgan Kaufmann.
Zhang, L., Liu, X., Janssens, F., Liang, L., & Glänzel, W. (2010). Subject clustering analysis based on ISI category classification. Journal of Informetrics, 4(2), 185–193. doi:10.1016/j.joi.2009.11.005. http://www.sciencedirect.com/science/article/pii/S1751157709000832.
Acknowledgements
Part of this work has been funded by the COST Action TD1210 Knowescape. We would like to thank Jochen Gläser and Andrea Scharnhorst for extended comments on earlier versions of the text. We would also like to thank the internal reviewer Michael Heinz as well as the anonymous external referees for their valuable comments and suggestions.
Cite this article
Wang, S., Koopman, R. Clustering articles based on semantic similarity. Scientometrics 111, 1017–1031 (2017). https://doi.org/10.1007/s11192-017-2298-x
DOI: https://doi.org/10.1007/s11192-017-2298-x