Abstract
We propose Hetero-Labeled LDA (hLLDA), a novel semi-supervised topic model, which can learn from multiple types of labels such as document labels and feature labels (i.e., heterogeneous labels), and also accommodate labels for only a subset of classes (i.e., partial labels). This addresses two major limitations in existing semi-supervised learning methods: they can incorporate only one type of domain knowledge (e.g., document labels or feature labels), and they assume that the provided labels cover all the classes in the problem space. These limitations restrict their applicability in real-life situations, where domain knowledge for labeling comes in different forms from different groups of domain experts and some classes may not have labels. hLLDA resolves both the label heterogeneity and label partialness problems in a unified generative process.
hLLDA can leverage different forms of supervision and discover semantically coherent topics by exploiting domain knowledge mutually reinforced by different types of labels. Experiments with three document collections (Reuters, 20 Newsgroups, and Delicious) validate that our model generates a better set of topics and efficiently discovers additional latent topics not covered by the labels, resulting in better classification and clustering accuracy than existing supervised or semi-supervised topic models. The empirical results demonstrate that learning from multiple forms of domain knowledge in a unified process creates an enhanced combined effect that is greater than the sum of multiple models learned separately with one type of supervision.
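The following sketch illustrates the general mechanism the abstract describes, not the paper's actual hLLDA inference: document labels restrict which topics a (possibly partially labeled) document may use, while feature labels act as a soft prior pulling a word toward a topic. The function name, label encodings, and hyperparameters here are illustrative assumptions.

```python
import random
from collections import defaultdict

# Illustrative sketch only -- NOT the paper's hLLDA algorithm.
# doc_labels[d]: set of topics document d may use (absent = unrestricted,
#   which models label partialness); word_labels[w]: topic a labeled
#   feature (word) is seeded toward, modeling label heterogeneity.
def sample_topics(docs, n_topics, doc_labels, word_labels,
                  alpha=0.1, beta=0.01, eta=1.0, n_iters=20, seed=0):
    """Collapsed-Gibbs-style sampler with heterogeneous label constraints."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})
    dt = [[0] * n_topics for _ in docs]        # document-topic counts
    wt = defaultdict(lambda: [0] * n_topics)   # word-topic counts
    tt = [0] * n_topics                        # per-topic totals
    z = []
    for d, doc in enumerate(docs):             # random init within allowed topics
        allowed = list(doc_labels.get(d, range(n_topics)))
        zd = []
        for w in doc:
            t = rng.choice(allowed)
            zd.append(t); dt[d][t] += 1; wt[w][t] += 1; tt[t] += 1
        z.append(zd)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            allowed = list(doc_labels.get(d, range(n_topics)))
            for i, w in enumerate(doc):
                t = z[d][i]                    # remove current assignment
                dt[d][t] -= 1; wt[w][t] -= 1; tt[t] -= 1
                weights = []
                for k in allowed:
                    # a labeled feature gets an extra pseudo-count for its topic
                    prior = beta + (eta if word_labels.get(w) == k else 0.0)
                    weights.append((dt[d][k] + alpha) *
                                   (wt[w][k] + prior) / (tt[k] + beta * V))
                t = rng.choices(allowed, weights=weights)[0]
                z[d][i] = t; dt[d][t] += 1; wt[w][t] += 1; tt[t] += 1
    return z
```

Both label types act inside one sampling step, which mirrors the paper's point that heterogeneous supervision can be combined in a single unified process rather than in separately trained models.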
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kang, D., Park, Y., Chari, S.N. (2014). Hetero-Labeled LDA: A Partially Supervised Topic Model with Heterogeneous Labels. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science, vol 8724. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44848-9_41
DOI: https://doi.org/10.1007/978-3-662-44848-9_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44847-2
Online ISBN: 978-3-662-44848-9