A Graph-Based Approach to Topic Clustering for Online Comments to News

  • Conference paper
Advances in Information Retrieval (ECIR 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9626)

Abstract

This paper investigates graph-based approaches to labelled topic clustering of reader comments in online news. For graph-based clustering, we propose a linear regression model of similarity between the graph nodes (comments), based on similarity features whose weights are trained using automatically derived training data. To label the clusters, our graph-based approach uses DBpedia to abstract topics extracted from the clusters. We evaluate the clustering approach against gold standard data created by human annotators and compare its results against LDA, currently reported as the best method for the news comment clustering task. Evaluation of cluster labelling is set up as a retrieval task in which human annotators are asked to identify the best cluster given a cluster label. Our clustering approach significantly outperforms the LDA baseline. Our evaluation of abstract cluster labels shows that graph-based approaches are a promising method of creating labelled clusters of news comments, although we still find cases where the automatically generated abstractive labels are insufficient to allow humans to correctly associate a label with its cluster.
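
As a rough illustration of the clustering idea described in the abstract, the following minimal sketch scores each pair of comments with a linear combination of similarity features and links pairs whose score passes a cut-off. Everything specific here is an assumption made for illustration: the two features (cosine and Jaccard over whitespace tokens), the weights, the intercept and the threshold are hypothetical stand-ins rather than the trained model, and connected components are used as a simple substitute for the graph clustering algorithm.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical values; in the paper the weights come from a trained
# linear regression model, not from hand-picked constants.
WEIGHTS = {"cosine": 0.6, "jaccard": 0.4}
INTERCEPT = 0.0
THRESHOLD = 0.3  # hypothetical cut-off for adding an edge


def tokens(text):
    """Lower-case whitespace tokenisation (an illustrative simplification)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard overlap between the token sets of two comments."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def similarity(x, y):
    """Linear model: intercept plus weighted sum of the similarity features."""
    a, b = tokens(x), tokens(y)
    features = {"cosine": cosine(a, b), "jaccard": jaccard(a, b)}
    return INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())


def cluster(comments):
    """Link comments whose similarity passes THRESHOLD, then return the
    connected components of that graph as lists of comment indices."""
    adjacency = {i: set() for i in range(len(comments))}
    for i, j in combinations(range(len(comments)), 2):
        if similarity(comments[i], comments[j]) >= THRESHOLD:
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


comments = [
    "the tax rise hurts families",
    "families hurt by the tax rise",
    "great goal in the match",
    "the match had a great goal",
]
clusters = cluster(comments)
```

With these toy comments the two tax comments and the two football comments end up in separate components. Connected components give a hard partition, so a method that needs overlapping clusters would require a different final step.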

Notes

  1. http://goo.gl/3f8Hqu.

  2. http://www.theguardian.com/commentisfree/2014/aug/10/readers-editor-online-abuse-women-issues.

  3. http://wiki.dbpedia.org/.

  4. Soft clustering methods allow one data item to be assigned to multiple clusters.

  5. I.e. one comment can be assigned to only one cluster.

  6. http://tagme.di.unipi.it/.

  7. MCL runs a predefined number of iterations. We ran MCL with 5000 iterations.

  8. https://opennlp.apache.org/.

  9. We used the Weka (http://www.cs.waikato.ac.nz/ml/weka/) implementation of linear regression.

  10. The number of topics (\(k\)) to assign was determined empirically, i.e. we varied \(2 < k < 10\) and chose \(k = 5\) based on the clarity of the labels generated.

  11. We take the most common sense. The 10-word limit is to reduce noise. Fewer than 10 DBpedia concepts may be identified, as not all topic words have an identically titled DBpedia concept.

  12. To limit noise, we reduce the relation set (cf. Hulpus et al.) to include only skos:broader, skos:broaderOf, rdfs:subClassOf, rdfs. Graph expansion is limited to two hops.

  13. Several graph-centrality metrics were explored: betweenness_centrality, load_centrality, degree_centrality and closeness_centrality, of which the last was used for the results reported here.

  14. Hulpus et al. [8] merge the graphs of multiple topics so as to derive a single label that encompasses them. We have found it preferable to provide a separate label for each topic, so that the overall label for a cluster comprises 5 label terms for the individual topics.

  15. We use the LDA implementation from http://jgibblda.sourceforge.net/.

  16. The difference in these results is significant at the Bonferroni-corrected level of significance of \(p < 0.0125\), adjusted for a 4-way comparison between the human-to-human and all automatic conditions.

  17. We apply both models to comments regardless of whether they contain quotes. However, in the case of graph-Human-quotesRemoved, before the model is applied to the test data we make sure that quotes are also removed from any test comments that contain them.
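
Note 13 names closeness centrality as the metric behind the reported results. The sketch below shows one way such a score could be computed and used to pick the most central concept in a small graph. It is an illustration under stated assumptions: the toy concept graph and the `pick_label` helper are hypothetical, the graph is treated as unweighted, undirected and connected, and this is not the authors' labelling pipeline.

```python
from collections import deque


def closeness_centrality(graph, node):
    """Closeness of `node` in an unweighted, undirected graph given as an
    adjacency dict: (reachable nodes - 1) / sum of shortest-path lengths."""
    distance = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest path lengths
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    total = sum(distance.values())
    return (len(distance) - 1) / total if total else 0.0


def pick_label(graph):
    """Hypothetical helper: return the node with the highest closeness,
    breaking ties alphabetically."""
    return max(sorted(graph, reverse=True), key=lambda n: closeness_centrality(graph, n))


# Hypothetical toy graph of concepts linked by broader/narrower relations.
concept_graph = {
    "Taxation": {"Public finance"},
    "Government budget": {"Public finance"},
    "Public finance": {"Taxation", "Government budget", "Economics"},
    "Economics": {"Public finance"},
}

label = pick_label(concept_graph)
```

In this toy graph "Public finance" is one hop from every other node, so it has the highest closeness and is selected as the label.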

References

  1. Aker, A., Kurtic, E., Hepple, M., Gaizauskas, R., Di Fabbrizio, G.: Comment-to-article linking in the online news domain. In: Proceedings of MultiLing, SigDial 2015 (2015)

  2. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retrieval 12(4), 461–486 (2009)

  3. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)

  4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  5. Carpineto, C., Osiński, S., Romano, G., Weiss, D.: A survey of web clustering engines. ACM Comput. Surv. (CSUR) 41(3), 17 (2009)

  6. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378 (1971)

  7. Hüllermeier, E., Rifqi, M., Henzgen, S., Senge, R.: Comparing fuzzy partitions: a generalization of the Rand index and related measures. IEEE Trans. Fuzzy Syst. 20(3), 546–556 (2012)

  8. Hulpus, I., Hayes, C., Karnstedt, M., Greene, D.: Unsupervised graph-based topic labelling using DBpedia. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, pp. 465–474, NY, USA (2013). http://doi.acm.org/10.1145/2433396.2433454

  9. Jurgens, D., Klapaftis, I.: SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), vol. 2, pp. 290–299 (2013)

  10. Khabiri, E., Caverlee, J., Hsu, C.F.: Summarizing user-contributed comments. In: ICWSM (2011)

  11. Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 1536–1545. Association for Computational Linguistics (2011)

  12. Lau, J.H., Newman, D., Karimi, S., Baldwin, T.: Best topic word selection for topic labelling. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 605–613. Association for Computational Linguistics (2010)

  13. Liu, C., Tseng, C., Chen, M.: IncreSTS: Towards real-time incremental short text summarization on comment streams from social network services. IEEE Trans. Knowl. Data Eng. 27, 2986–3000 (2015)

  14. Llewellyn, C., Grover, C., Oberlander, J.: Summarizing newspaper comments. In: Eighth International AAAI Conference on Weblogs and Social Media (2014)

  15. Ma, Z., Sun, A., Yuan, Q., Cong, G.: Topic-driven reader comments summarization. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 265–274. ACM (2012)

  16. Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadina, I., Tadić, M., Gornostay, T.: Term extraction, tagging, and mapping tools for under-resourced languages. In: Proceedings of the 10th Conference on Terminology and Knowledge Engineering (TKE 2012), pp. 20–21 (2012)

  17. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 399–408. ACM (2015)

  18. Salton, G., Lesk, M.E.: Computer evaluation of indexing and text processing. J. ACM 15, 8–36 (1968)

  19. Scaiella, U., Ferragina, P., Marino, A., Ciaramita, M.: Topical clustering of search results. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 223–232. ACM (2012)

  20. Van Dongen, S.M.: Graph clustering by flow simulation (2001)

Acknowledgements

The research leading to these results has received funding from the EU Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 610916 (SENSEI).

Corresponding author

Correspondence to Ahmet Aker.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Aker, A. et al. (2016). A Graph-Based Approach to Topic Clustering for Online Comments to News. In: Ferro, N., et al. Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_2

  • DOI: https://doi.org/10.1007/978-3-319-30671-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30670-4

  • Online ISBN: 978-3-319-30671-1

  • eBook Packages: Computer Science, Computer Science (R0)
