Clustering articles based on semantic similarity

Scientometrics

Abstract

Document clustering is generally the first step in topic identification. Since many clustering methods operate on the similarities between documents, it is important to build representations of these documents that preserve as much of their semantics as possible and are also suitable for efficient similarity calculation. As we describe in Koopman et al. (Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf, 2015), the metadata of articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and consequently supports the contextual exploration of these entities in LittleAriadne. However, this semantic matrix does not allow similarities between articles to be calculated directly. In this paper, we describe in detail how we build a semantic representation for an article from the entities that are associated with it. Based on these semantic representations of articles, we apply two standard clustering methods, K-Means and the Louvain community detection algorithm, which lead to our two clustering solutions, labelled OCLC-31 (K-Means) and OCLC-Louvain (Louvain). We also give the implementation details and a basic comparison with the other clustering solutions reported in this special issue.
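
The idea of deriving an article representation from the entities associated with it can be illustrated with a minimal sketch. It assumes, purely for illustration, that an article vector is the unweighted average of its entity vectors and that articles are compared by cosine similarity; the weighting actually used is described in the full paper and is not reproduced here. The entity names and the three-dimensional vectors are hypothetical toy data.

```python
import math

def article_vector(entity_vectors):
    """Average the vectors of the entities attached to one article.

    Assumption: a plain unweighted mean; the paper's own weighting may differ.
    """
    n = len(entity_vectors)
    dim = len(entity_vectors[0])
    return [sum(v[i] for v in entity_vectors) / n for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical entity vectors (toy values, not taken from the Astro dataset).
entities = {
    "galaxy": [1.0, 0.0, 0.0],
    "redshift": [0.8, 0.2, 0.0],
    "exoplanet": [0.0, 1.0, 0.0],
    "transit": [0.0, 0.9, 0.1],
}

article_a = article_vector([entities["galaxy"], entities["redshift"]])
article_b = article_vector([entities["exoplanet"], entities["transit"]])
article_c = article_vector([entities["galaxy"], entities["exoplanet"]])

sim_ab = cosine(article_a, article_b)  # articles with no shared theme
sim_ac = cosine(article_a, article_c)  # articles sharing the "galaxy" entity
```

In this toy setting, `sim_ac` comes out larger than `sim_ab`, because articles A and C share an entity while A and B do not. Article-to-article similarities of this kind are what a clustering method such as K-Means or Louvain would then operate on.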

Notes

  1. Since the sum of squares is the squared Euclidean distance, this is intuitively the “nearest” mean.

  2. https://en.wikipedia.org/wiki/F1_score.

  3. http://scikit-learn.org/.

  4. https://networkx.github.io/.

  5. http://perso.crans.org/aynaud/communities/.

  6. CWTS-C5 and UMSI0 are the clustering solutions generated by two different methods, Infomap and the Smart Local Moving Algorithm (SLMA) respectively, applied to the direct citation network of articles. The two ECOOM clustering solutions are generated by applying the Louvain method to find communities among bibliographically coupled articles, where ECOOM-NLP11 also incorporates keyword information. The STS-RG clusters are generated by first projecting the small Astro dataset onto the full Scopus database and collecting the cluster assignments after the full Scopus articles are clustered using SLMA on the direct citation network. A more detailed account can be found in Velden et al. (2017).
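
Note 1 refers to the assignment step of K-Means: each point goes to the mean with the smallest sum of squared coordinate differences, which is the squared Euclidean distance. The following is a minimal pure-Python sketch of one assignment step and one update step, not the scikit-learn implementation referenced in note 3; the function names and the toy points are illustrative assumptions.

```python
def squared_distance(x, mean):
    """Sum of squared coordinate differences, i.e. squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, mean))

def assign_to_nearest_mean(points, means):
    """Assignment step: label each point with the index of its nearest mean."""
    return [
        min(range(len(means)), key=lambda k: squared_distance(p, means[k]))
        for p in points
    ]

def update_means(points, labels, means):
    """Update step: move each mean to the centroid of its assigned points."""
    new_means = []
    for k, old in enumerate(means):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            # Keep an empty cluster's mean unchanged (a simplifying assumption).
            new_means.append(list(old))
            continue
        dim = len(old)
        new_means.append(
            [sum(p[i] for p in members) / len(members) for i in range(dim)]
        )
    return new_means

# Hypothetical toy data: two well-separated groups in two dimensions.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
means = [[0.0, 0.0], [10.0, 10.0]]

labels = assign_to_nearest_mean(points, means)
new_means = update_means(points, labels, means)
```

Repeating these two steps until the labels stop changing gives the standard K-Means iteration.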

References

  • Achlioptas, D. (2003). Database-friendly random projections: Johnson–Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687. doi:10.1016/S0022-0000(03)00025-4.

  • Béjar, J. (2013). K-means vs mini batch k-means: A comparison. Tech. rep., Universitat Politècnica de Catalunya. http://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf.

  • Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.

  • Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374.

  • Boyack, K. W., Small, H., & Klavans, R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759–1767. doi:10.1002/asi.22896.

  • Bruckner, E., Ebeling, W., & Scharnhorst, A. (1990). The application of evolution models in scientometrics. Scientometrics, 18(1–2), 21–41. doi:10.1007/BF02019160.

  • Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. Studies in Linguistic Analysis, pp. 1–32.

  • Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1983). Statistical semantics: Analysis of the potential performance of keyword information systems. Bell System Technical Journal, 62(6), 1753–1806. doi:10.1002/j.1538-7305.1983.tb03513.x.

  • Garfield, E. (1983). Citation indexing—Its theory and application in science, technology and humanities. Philadelphia: ISI Press.

  • Glänzel, W., & Czerwon, H. J. (1996). A new methodological approach to bibliographic coupling and its application to the national, regional and institutional level. Scientometrics, 37, 195–221.

  • Glänzel, W., & Thijs, B. (2017). Using hybrid methods and ‘core documents’ for the representation of clusters and topics: The astronomy dataset. Scientometrics. doi:10.1007/s11192-017-2301-6.

  • Gläser, J., Glänzel, W., & Scharnhorst, A. (2017). Same data: different results? Towards a comparative approach to the identification of thematic structures in science. Scientometrics. doi:10.1007/s11192-017-2296-z.

  • Harris, Z. (1954). Distributional structure. Word, 10(2–3), 146–162.

  • Johnson, W., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.

  • Koopman, R., Wang, S., & Scharnhorst, A. (2015). Contextualization of topics—browsing through terms, authors, journals and cluster allocations. In A. A. Salah, Y. Tonta, A. A. A. Salah, C. R. Sugimoto, & U. Al (Eds.), Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf.

  • Koopman, R., Wang, S., & Scharnhorst, A. (2017). Contextualization of topics—browsing through the universe of bibliographic information. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics.

  • Koopman, R., Wang, S., Scharnhorst, A., & Englebienne, G. (2015). Ariadne’s thread: Interactive navigation in a world of networked information. In B. Begole, J. Kim, K. Inkpen, & W. Woo (Eds.), Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, CHI 2015 Extended Abstracts, Seoul, Republic of Korea, April 18–23, 2015 (pp. 1833–1838). ACM. doi:10.1145/2702613.2732781.

  • Leydesdorff, L. (1989). Words and co-words as indicators of intellectual organization. Research Policy, 18(4), 209–223. doi:10.1016/0048-7333(89)90016-4.

  • Leydesdorff, L., & Hellsten, I. (2006). Measuring the meaning of words in contexts: An automated analysis of controversies about ‘monarch butterflies’, ‘frankenfoods’, and ‘stem cells’. Scientometrics, 67(2), 231–258.

  • MacKay, D. (2003). Chapter 20. An example inference task: Clustering. In Information theory, inference and learning algorithms (pp. 284–292). Cambridge University Press.

  • Newman, M. E. (2006). Modularity and community structure in networks. Proc Natl Acad Sci USA, 103(23), 8577–8582. doi:10.1073/pnas.0601602103. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=pubmed&list_uids=16723398&dopt=AbstractPlus.

  • Rip, A., & Courtial, J. P. (1984). Co-word maps of biotechnology: An example of cognitive scientometrics. Scientometrics, 6(6), 381–400.

  • Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53–65. doi:10.1016/0377-0427(87)90125-7.

  • Sahlgren, M. (2008). The distributional hypothesis. Rivista di Linguistica, 20(1), 33–53.

  • Sculley, D. (2016). Web scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web (pp. 1177–1178). Raleigh, NC, USA.

  • Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24, 265–269.

  • Sugimoto, C. R., & Weingart, S. (2015). The kaleidoscope of disciplinarity. Journal of Documentation, 71(4), 775–794. doi:10.1108/JD-06-2014-0082. http://www.scopus.com/inward/record.url?eid=2-s2.0-84933503812&partnerID=tZOtx3y1.

  • Velden, T., Boyack, K., van Eck, N., Glänzel, W., Gläser, J., Havemann, F., Heinz, M., Koopman, R., Scharnhorst, A., Thijs, B., & Wang, S. (2017). Comparison of topic extraction approaches and their results. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics.

  • Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837–2854.

  • Weaver, W. (1955). Translation. In W. Locke & D. Booth (Eds.), Machine translation of languages (pp. 15–23). Cambridge, Massachusetts: MIT Press.

  • Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques (3rd ed.). The Morgan Kaufmann Series in Data Management Systems. Burlington: Morgan Kaufmann.

  • Zhang, L., Liu, X., Janssens, F., Liang, L., & Glänzel, W. (2010). Subject clustering analysis based on ISI category classification. Journal of Informetrics, 4(2), 185–193. doi:10.1016/j.joi.2009.11.005. http://www.sciencedirect.com/science/article/pii/S1751157709000832.

Acknowledgements

Part of this work has been funded by the COST Action TD1210 Knowescape. We would like to thank Jochen Gläser and Andrea Scharnhorst for extended comments on earlier versions of the text. We would also like to thank the internal reviewer Michael Heinz as well as the anonymous external referees for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Shenghui Wang.

About this article

Cite this article

Wang, S., Koopman, R. Clustering articles based on semantic similarity. Scientometrics 111, 1017–1031 (2017). https://doi.org/10.1007/s11192-017-2298-x

  • DOI: https://doi.org/10.1007/s11192-017-2298-x
