Combining Thesaurus Knowledge and Probabilistic Topic Models

Loukachevitch, Natalia; Nokel, Michael; Ivanov, Kirill

doi:10.1007/978-3-319-73013-4_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10716)

Included in the following conference series:

International Conference on Analysis of Images, Social Networks and Texts

Abstract

In this paper we present an approach for introducing thesaurus knowledge into probabilistic topic models. The approach rests on the assumption that the frequencies of semantically related words and phrases that occur in the same texts should be boosted, so that they contribute more to the topics found in those texts. Our experiments with several thesauri show that domain-specific knowledge is useful for improving topic models. When a general thesaurus such as WordNet is used, the thesaurus-based improvement can be achieved by excluding hyponymy relations from the combined topic models.
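The frequency-boosting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy thesaurus `RELATED`, the function name `enhance_counts`, and the additive boosting rule are all assumptions made for illustration. Each word's count in a document is increased by the counts of its thesaurus-related words that occur in the same document, and unrelated words are left unchanged.

```python
from collections import Counter

# Hypothetical toy thesaurus (an assumption, not data from the paper):
# each word maps to the set of words it is semantically related to.
RELATED = {
    "car": {"automobile", "vehicle"},
    "automobile": {"car", "vehicle"},
    "vehicle": {"car", "automobile"},
}


def enhance_counts(tokens, related=RELATED):
    """Return per-document word counts boosted by related co-occurring words.

    Assumed rule: enhanced(w) = count(w) + sum of count(r) over every
    thesaurus relative r of w that occurs in the same document.
    """
    counts = Counter(tokens)
    enhanced = {}
    for word, freq in counts.items():
        boost = sum(
            counts[r]
            for r in related.get(word, ())
            if r != word and r in counts
        )
        enhanced[word] = freq + boost
    return enhanced


doc = ["car", "car", "automobile", "road", "vehicle"]
enhanced = enhance_counts(doc)
```

The enhanced counts could then be passed to a topic model in place of the raw counts, so that related words occurring together weigh more in the topics of that document. Restricting `RELATED` to a subset of relation types (for example, leaving out hyponymy) would be one way to mimic the WordNet setting mentioned in the abstract.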

Acknowledgments

This study is supported by the Russian Science Foundation in the part concerning the combined approach uniting thesaurus information and probabilistic topic models (project No. 16-18-02074). The study on the application of the approach to content analysis of Islam sites is supported by the Russian Foundation for Basic Research (project No. 16-29-09606).

Author information

Authors and Affiliations

Lomonosov Moscow State University, Moscow, Russia
Natalia Loukachevitch & Kirill Ivanov
Yandex, Moscow, Russia
Michael Nokel

Corresponding author

Correspondence to Natalia Loukachevitch.

Editor information

Editors and Affiliations

Eindhoven University of Technology, Eindhoven, The Netherlands
Wil M.P. van der Aalst
National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov
Krasovsky Institute of Mathematics and Mechanics, Ekaterinburg, Russia
Michael Khachay
National Research University Higher School of Economics, Moscow, Russia
Sergei O. Kuznetsov
Skolkovo Institute of Science and Technology, Moscow, Russia
Victor Lempitsky
National Research University Higher School of Economics, Moscow, Russia
Irina A. Lomazova
Moscow State University, Moscow, Russia
Natalia Loukachevitch
LORIA, Campus Scientifique, Vandœuvre lès Nancy, France
Amedeo Napoli
University of Hamburg, Hamburg, Germany
Alexander Panchenko
University of Florida, Gainesville, Florida, USA
Panos M. Pardalos
National Research University Higher School of Economics, Nizhny Novgorod, Russia
Andrey V. Savchenko
Indiana University, Bloomington, Indiana, USA
Stanley Wasserman

About this paper

Cite this paper

Loukachevitch, N., Nokel, M., Ivanov, K. (2018). Combining Thesaurus Knowledge and Probabilistic Topic Models. In: van der Aalst, W., et al. Analysis of Images, Social Networks and Texts. AIST 2017. Lecture Notes in Computer Science, vol 10716. Springer, Cham. https://doi.org/10.1007/978-3-319-73013-4_6

DOI: https://doi.org/10.1007/978-3-319-73013-4_6
Published: 21 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73012-7
Online ISBN: 978-3-319-73013-4
eBook Packages: Computer Science, Computer Science (R0)