A Graph-Based Approach to Topic Clustering for Online Comments to News

Aker, Ahmet; Kurtic, Emina; Balamurali, A. R.; Paramita, Monica; Barker, Emma; Hepple, Mark; Gaizauskas, Rob

doi:10.1007/978-3-319-30671-1_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9626)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

This paper investigates graph-based approaches to labeled topic clustering of reader comments in online news. For graph-based clustering we propose a linear regression model of similarity between the graph nodes (comments) based on similarity features and weights trained using automatically derived training data. To label the clusters our graph-based approach makes use of DBPedia to abstract topics extracted from the clusters. We evaluate the clustering approach against gold standard data created by human annotators and compare its results against LDA – currently reported as the best method for the news comment clustering task. Evaluation of cluster labelling is set up as a retrieval task, where human annotators are asked to identify the best cluster given a cluster label. Our clustering approach significantly outperforms the LDA baseline and our evaluation of abstract cluster labels shows that graph-based approaches are a promising method of creating labeled clusters of news comments, although we still find cases where the automatically generated abstractive labels are insufficient to allow humans to correctly associate a label with its cluster.
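
The graph construction described in the abstract can be pictured with a minimal sketch. Everything concrete below is an assumption made for illustration: the two features (cosine and Jaccard word overlap), the weights, the intercept, and the edge threshold are hypothetical placeholders, not the features or trained weights used in the paper. The sketch only shows the general shape of the idea: a linear model scores each pair of comments, and sufficiently similar pairs become weighted edges of a graph that a clustering algorithm can then partition.

```python
from collections import Counter
from math import sqrt


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical regression parameters (assumed values, NOT taken from the paper).
WEIGHTS = {"cosine": 0.6, "jaccard": 0.3}
INTERCEPT = 0.0
THRESHOLD = 0.3  # assumed cut-off below which no edge is created


def similarity(a: str, b: str) -> float:
    """Linear combination of the similarity features for one comment pair."""
    return INTERCEPT + WEIGHTS["cosine"] * cosine(a, b) + WEIGHTS["jaccard"] * jaccard(a, b)


def build_graph(comments: list[str], threshold: float = THRESHOLD) -> dict[tuple[int, int], float]:
    """Return weighted edges (i, j) -> score for pairs scoring at or above the threshold."""
    edges = {}
    for i in range(len(comments)):
        for j in range(i + 1, len(comments)):
            score = similarity(comments[i], comments[j])
            if score >= threshold:
                edges[(i, j)] = score
    return edges


if __name__ == "__main__":
    demo = [
        "the council should fix the roads",
        "the council must fix roads now",
        "my cat likes fish",
    ]
    print(build_graph(demo))
```

In the demo, the two comments about the council share most of their words and end up connected, while the unrelated third comment stays isolated. In the paper's setting the weights would come from regression training on automatically derived data rather than being set by hand.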

Notes

1. http://goo.gl/3f8Hqu.
2. http://www.theguardian.com/commentisfree/2014/aug/10/readers-editor-online-abuse-women-issues.
3. http://wiki.dbpedia.org/.
4. Soft clustering methods allow one data item to be assigned to multiple clusters.
5. I.e. one comment can be assigned to only one cluster.
6. http://tagme.di.unipi.it/.
7. MCL runs a predefined number of iterations. We ran MCL with 5000 iterations.
8. https://opennlp.apache.org/.
9. We used the Weka (http://www.cs.waikato.ac.nz/ml/weka/) implementation of linear regression.
10. The number of topics (k) to assign was determined empirically, i.e. we varied \(2 < k < 10\), and chose \(k = 5\) based on the clarity of the labels generated.
11. We take the most common sense. The 10-word limit is to reduce noise. Fewer than 10 DBPedia concepts may be identified, as not all topic words have an identically titled DBPedia concept.
12. To limit noise, we reduce the relation set cf. Hulpus et al. to include only skos:broader, skos:broaderOf, rdfs:subClassOf, rdfs. Graph expansion is limited to two hops.
13. Several graph-centrality metrics were explored: betweeness_centrality, load_centrality, degree_centrality, closeness_centrality, of which the last was used for the results reported here.
14. Hulpus et al. [8] merge together the graphs of multiple topics, so as to derive a single label to encompass them. We have found it preferable to provide a separate label for each topic, so that the overall label for a cluster comprises 5 label terms, one for each individual topic.
15. We use the LDA implementation from http://jgibblda.sourceforge.net/.
16. The difference in these results is significant at the Bonferroni-corrected level of significance of \(p<0.0125\), adjusted for a 4-way comparison between the human-to-human and all automatic conditions.
17. We apply both models to comments regardless of whether they contain quotes. However, for graph-Human-quotesRemoved, before the model is applied to the testing data we make sure that quotes are also removed from any comments that contain them.
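
Note 13 states that closeness centrality was the metric used to pick labels from the expanded concept graph. As a rough, dependency-free illustration of that step, the sketch below computes closeness centrality by breadth-first search over a small undirected graph and returns the most central node as the candidate label. The concept names and edges are invented for the example and are not DBPedia data; the helper names `to_adjacency` and `closeness_centrality` are likewise hypothetical and do not reproduce the authors' code or any particular library.

```python
from collections import deque

# Hypothetical concept graph (made-up nodes and edges, for illustration only).
EDGES = [
    ("Politics", "Elections"),
    ("Politics", "Parliament"),
    ("Politics", "Government"),
    ("Government", "Taxation"),
]


def to_adjacency(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Build an undirected adjacency map from a list of edges."""
    graph: dict[str, set[str]] = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def closeness_centrality(graph: dict[str, set[str]]) -> dict[str, float]:
    """Closeness of a node = (number of reachable nodes) / (sum of shortest-path lengths).

    Distances are found by breadth-first search, so every edge counts as length 1.
    For a connected graph this is the usual (n - 1) / sum-of-distances definition.
    """
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


GRAPH = to_adjacency(EDGES)

if __name__ == "__main__":
    scores = closeness_centrality(GRAPH)
    label = max(scores, key=scores.get)  # most central concept as candidate label
    print(label, scores)
```

In this toy graph "Politics" is at most two hops from every other node, so it receives the highest closeness score and would be chosen as the label. A node that is, on average, near all the concepts linked to a topic's words is a plausible abstraction of that topic, which is the intuition behind using a centrality score for labelling.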

References

1. Aker, A., Kurtic, E., Hepple, M., Gaizauskas, R., Di Fabbrizio, G.: Comment-to-article linking in the online news domain. In: Proceedings of MultiLing, SigDial 2015 (2015)
2. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retrieval 12(4), 461–486 (2009)
3. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
5. Carpineto, C., Osiński, S., Romano, G., Weiss, D.: A survey of web clustering engines. ACM Comput. Surv. (CSUR) 41(3), 17 (2009)
6. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378 (1971)
7. Hüllermeier, E., Rifqi, M., Henzgen, S., Senge, R.: Comparing fuzzy partitions: a generalization of the rand index and related measures. IEEE Trans. Fuzzy Syst. 20(3), 546–556 (2012)
8. Hulpus, I., Hayes, C., Karnstedt, M., Greene, D.: Unsupervised graph-based topic labelling using DBpedia. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, pp. 465–474, NY, USA (2013). http://doi.acm.org/10.1145/2433396.2433454
9. Jurgens, D., Klapaftis, I.: SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), vol. 2, pp. 290–299 (2013)
10. Khabiri, E., Caverlee, J., Hsu, C.F.: Summarizing user-contributed comments. In: ICWSM (2011)
11. Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 1536–1545. Association for Computational Linguistics (2011)
12. Lau, J.H., Newman, D., Karimi, S., Baldwin, T.: Best topic word selection for topic labelling. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 605–613. Association for Computational Linguistics (2010)
13. Liu, C., Tseng, C., Chen, M.: Incrests: Towards real-time incremental short text summarization on comment streams from social network services. IEEE Trans. Knowl. Data Eng. 27, 2986–3000 (2015)
14. Llewellyn, C., Grover, C., Oberlander, J.: Summarizing newspaper comments. In: Eighth International AAAI Conference on Weblogs and Social Media (2014)
15. Ma, Z., Sun, A., Yuan, Q., Cong, G.: Topic-driven reader comments summarization. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 265–274. ACM (2012)
16. Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadina, I., Tadić, M., Gornostay, T.: Term extraction, tagging, and mapping tools for under-resourced languages. In: Proceedings of the 10th Conference on Terminology and Knowledge Engineering (TKE 2012), pp. 20–21 (2012)
17. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 399–408. ACM (2015)
18. Salton, G., Lesk, E.M.: Computer evaluation of indexing and text processing. J. ACM 15, 8–36 (1968)
19. Scaiella, U., Ferragina, P., Marino, A., Ciaramita, M.: Topical clustering of search results. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 223–232. ACM (2012)
20. Van Dongen, S.M.: Graph clustering by flow simulation (2001)

Acknowledgements

The research leading to these results has received funding from the EU Seventh Framework Program (FP7/2007–2013) under grant agreement no. 610916 SENSEI.

Author information

Authors and Affiliations

University of Sheffield, Sheffield, UK
Ahmet Aker, Emina Kurtic, Monica Paramita, Emma Barker, Mark Hepple & Rob Gaizauskas
LIF-CNRS Marseille, Marseille, France
A. R. Balamurali

Corresponding author

Correspondence to Ahmet Aker.

Editor information

Editors and Affiliations

Department of Information Engineering, University of Padua, Padova, Italy
Nicola Ferro
Faculty of Informatics, University of Lugano (USI), Lugano, Switzerland
Fabio Crestani
Department of Computer Science, Katholieke Universiteit Leuven, Heverlee, Belgium
Marie-Francine Moens
Systèmes d’informations, Big Data et Recherche d’Information, Institut de Recherche en Informatique de Toulouse IRIT/équipe SIG, Toulouse Cedex 04, France
Josiane Mothe
Yahoo! Labs London, London, UK
Fabrizio Silvestri
Department of Information Engineering, University of Padua, Padova, Italy
Giorgio Maria Di Nunzio
TU Delft - EWI/ST/WIS, Delft, The Netherlands
Claudia Hauff
Department of Information Engineering, University of Padua, Padova, Italy
Gianmaria Silvello

About this paper

Cite this paper

Aker, A. et al. (2016). A Graph-Based Approach to Topic Clustering for Online Comments to News. In: Ferro, N., et al. Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_2

DOI: https://doi.org/10.1007/978-3-319-30671-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30670-4
Online ISBN: 978-3-319-30671-1
eBook Packages: Computer Science, Computer Science (R0)
