Clustering articles based on semantic similarity

Wang, Shenghui; Koopman, Rob

doi:10.1007/s11192-017-2298-x

Published: 27 February 2017

Scientometrics, volume 111, pages 1017–1031 (2017)

Abstract

Document clustering is generally the first step in topic identification. Since many clustering methods operate on the similarities between documents, it is important to build representations of these documents that preserve their semantics as much as possible and are also suitable for efficient similarity calculation. As we describe in Koopman et al. (Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf, 2015), the metadata of articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and consequently supports the contextual exploration of these entities in LittleAriadne. However, this semantic matrix does not allow similarities between articles to be calculated directly. In this paper, we describe in detail how we build a semantic representation for an article from the entities that are associated with it. Based on such semantic representations of articles, we apply two standard clustering methods, K-Means and the Louvain community detection algorithm, which lead to our two clustering solutions labelled OCLC-31 (standing for K-Means) and OCLC-Louvain (standing for Louvain). We also give the implementation details and a basic comparison with other clustering solutions that are reported in this special issue.
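
The idea of deriving an article representation from the entities associated with it can be illustrated with a minimal sketch. The abstract does not say how entity vectors are combined, so the plain averaging used here is an assumption, and the entity names and vector values are invented for illustration only.

```python
import math

# Hypothetical entity vectors; names and values are invented for illustration.
entity_vectors = {
    "subject:galaxies": [0.9, 0.1, 0.0],
    "subject:stars": [0.7, 0.3, 0.1],
    "journal:ApJ": [0.6, 0.2, 0.2],
    "subject:instrumentation": [0.0, 0.2, 0.9],
}


def article_vector(entities):
    """Average the vectors of an article's entities (an assumed combination scheme)."""
    vectors = [entity_vectors[e] for e in entities]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


article_a = article_vector(["subject:galaxies", "journal:ApJ"])
article_b = article_vector(["subject:stars", "journal:ApJ"])
article_c = article_vector(["subject:instrumentation"])

sim_ab = cosine_similarity(article_a, article_b)
sim_ac = cosine_similarity(article_a, article_c)
print(f"sim(a, b) = {sim_ab:.3f}, sim(a, c) = {sim_ac:.3f}")
```

Once every article has a vector in the same space, article-to-article similarities of this kind can be fed to a clustering method such as K-Means or to a similarity graph for community detection.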

Notes

Since the sum of squares is the squared Euclidean distance, this is intuitively the “nearest” mean.
https://en.wikipedia.org/wiki/F1_score.
http://scikit-learn.org/.
https://networkx.github.io/.
http://perso.crans.org/aynaud/communities/.
CWTS-C5 and UMSI0 are the clustering solutions generated by two different methods, Infomap and the Smart Local Moving Algorithm (SLMA) respectively, applied to the direct citation network of articles. The two ECOOM clustering solutions are generated by applying the Louvain method to find communities among bibliographically coupled articles, where ECOOM-NLP11 also incorporates keyword information. The STS-RG clusters are generated by first projecting the small Astro dataset onto the full Scopus database and then collecting the cluster assignments of its articles after the full set of Scopus articles has been clustered using SLMA on the direct citation network. A more detailed account can be found in Velden et al. (2017).
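
The first note above, on the "nearest" mean, describes the assignment step of K-Means. A minimal sketch of one assignment and update iteration is given below; the points and initial means are invented for illustration, and the code is not the implementation used in the paper.

```python
def squared_euclidean(a, b):
    """Sum of squared coordinate differences, i.e. the squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_nearest_mean(points, means):
    """Label each point with the index of the mean at the smallest squared distance."""
    return [
        min(range(len(means)), key=lambda k: squared_euclidean(p, means[k]))
        for p in points
    ]


def update_means(points, labels, means):
    """Recompute each mean as the centroid of its members; keep empty clusters unchanged."""
    new_means = []
    for k, old_mean in enumerate(means):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_means.append(old_mean)
            continue
        new_means.append([sum(coord) / len(members) for coord in zip(*members)])
    return new_means


# Toy data, invented for illustration.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
means = [[0.0, 0.0], [5.0, 5.0]]

labels = assign_to_nearest_mean(points, means)
means = update_means(points, labels, means)
print(labels, means)
```

In a full K-Means run these two steps alternate until the labels stop changing.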

References

Achlioptas, D. (2003). Database-friendly random projections: Johnson–Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687. doi:10.1016/S0022-0000(03)00025-4.
Béjar, J. (2013). K-means vs mini batch k-means: A comparison. Tech. rep., Universitat Politècnica de Catalunya. http://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf.
Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, P10008 (12 pp.).
Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374.
Boyack, K. W., Small, H., & Klavans, R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759–1767. doi:10.1002/asi.22896.
Bruckner, E., Ebeling, W., & Scharnhorst, A. (1990). The application of evolution models in scientometrics. Scientometrics, 18(1–2), 21–41. doi:10.1007/BF02019160.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. Studies in Linguistic Analysis, pp. 1–32.
Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1983). Statistical semantics: Analysis of the potential performance of keyword information systems. Bell System Technical Journal, 62(6), 1753–1806. doi:10.1002/j.1538-7305.1983.tb03513.x.
Garfield, E. (1983). Citation indexing—Its theory and application in science, technology and humanities. Philadelphia: ISI Press.
Glänzel, W., & Czerwon, H. J. (1996). A new methodological approach to bibliographic coupling and its application to the national, regional and institutional level. Scientometrics, 37, 195–221.
Glänzel, W., & Thijs, B. (2017). Using hybrid methods and ‘core documents’ for the representation of clusters and topics: The astronomy dataset. Scientometrics. doi:10.1007/s11192-017-2301-6.
Gläser, J., Glänzel, W., & Scharnhorst, A. (2017). Same data: different results? Towards a comparative approach to the identification of thematic structures in science. Scientometrics. doi:10.1007/s11192-017-2296-z.
Harris, Z. (1954). Distributional structure. Word, 10(2–3), 146–162.
Johnson, W., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
Koopman, R., Wang, S., & Scharnhorst, A. (2015). Contextualization of topics—browsing through terms, authors, journals and cluster allocations. In A. A. Salah, Y. Tonta, A. A. A. Salah, C. R. Sugimoto, & U. Al (Eds.), Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf.
Koopman, R., Wang, S., & Scharnhorst, A. (2017). Contextualization of topics—browsing through the universe of bibliographic information. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics.
Koopman, R., Wang, S., Scharnhorst, A., & Englebienne, G. (2015). Ariadne’s thread: Interactive navigation in a world of networked information. In B. Begole, J. Kim, K. Inkpen, & W. Woo (Eds.), Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, CHI 2015 Extended Abstracts, Seoul, Republic of Korea, April 18–23, 2015, pp. 1833–1838. ACM. doi:10.1145/2702613.2732781.
Leydesdorff, L. (1989). Words and co-words as indicators of intellectual organization. Research Policy, 18(4), 209–223. doi:10.1016/0048-7333(89)90016-4.
Leydesdorff, L., & Hellsten, I. (2006). Measuring the meaning of words in contexts: An automated analysis of controversies about ‘monarch butterflies’, ‘frankenfoods’, and ‘stem cells’. Scientometrics, 67(2), 231–258.
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms, chap. 20, An Example Inference Task: Clustering, pp. 284–292. Cambridge University Press.
Newman, M. E. (2006). Modularity and community structure in networks. Proc Natl Acad Sci USA, 103(23), 8577–8582. doi:10.1073/pnas.0601602103. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=pubmed&list_uids=16723398&dopt=AbstractPlus.
Rip, A., & Courtial, J. P. (1984). Co-word maps of biotechnology: An example of cognitive scientometrics. Scientometrics, 6(6), 381–400.
Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53–65. doi:10.1016/0377-0427(87)90125-7.
Sahlgren, M. (2008). The distributional hypothesis. Rivista di Linguistica, 20(1), 33–53.
Sculley, D. (2016). Web scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178. Raleigh, NC, USA.
Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24, 265–269.
Sugimoto, C. R., & Weingart, S. (2015). The kaleidoscope of disciplinarity. Journal of Documentation, 71(4), 775–794. doi:10.1108/JD-06-2014-0082. http://www.scopus.com/inward/record.url?eid=2-s2.0-84933503812&partnerID=tZOtx3y1.
Velden, T., Boyack, K., van Eck, N., Glänzel, W., Gläser, J., Havemann, F., Heinz, M., Koopman, R., Scharnhorst, A., Thijs, B., & Wang, S. (2017). Comparison of topic extraction approaches and their results. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics.
Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837–2854.
Weaver, W. (1955). Translation. In W. Locke & D. Booth (Eds.), Machine translation of languages (pp. 15–23). Cambridge, Massachusetts: MIT Press.
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques (3rd ed.). The Morgan Kaufmann series in data management systems. Burlington: Morgan Kaufmann.
Zhang, L., Liu, X., Janssens, F., Liang, L., & Glänzel, W. (2010). Subject clustering analysis based on ISI category classification. Journal of Informetrics, 4(2), 185–193. doi:10.1016/j.joi.2009.11.005. http://www.sciencedirect.com/science/article/pii/S1751157709000832.

Acknowledgements

Part of this work has been funded by the COST Action TD1210 Knowescape. We would like to thank Jochen Gläser and Andrea Scharnhorst for extended comments on earlier versions of the text. We would also like to thank the internal reviewer Michael Heinz as well as the anonymous external referees for their valuable comments and suggestions.

Author information

Authors and Affiliations

OCLC Research, Schipholweg 99, Leiden, The Netherlands
Shenghui Wang & Rob Koopman

Corresponding author

Correspondence to Shenghui Wang.

About this article

Cite this article

Wang, S., Koopman, R. Clustering articles based on semantic similarity. Scientometrics 111, 1017–1031 (2017). https://doi.org/10.1007/s11192-017-2298-x

Received: 06 June 2016
Published: 27 February 2017
Issue Date: May 2017
DOI: https://doi.org/10.1007/s11192-017-2298-x
