Multi-label dataless text classification with topic modeling

Zha, Daochen; Li, Chenliang

doi:10.1007/s10115-018-1280-0

Regular Paper
Published: 08 December 2018

Knowledge and Information Systems, Volume 61, pages 137–160 (2019)

Abstract

Manually labeling documents is tedious and expensive, yet it is essential for training a traditional text classifier. In recent years, a few dataless text classification techniques have been proposed to address this problem. However, existing work mainly centers on single-label classification, in which each document is restricted to a single category. In this paper, we propose a novel Seed-guided Multi-label Topic Model, named SMTM. Given a few seed words relevant to each category, SMTM conducts multi-label classification for a collection of documents without any labeled documents. In SMTM, each category is associated with a single category-topic that covers the meaning of the category. To accommodate multi-label documents, we explicitly model category sparsity in SMTM using a spike-and-slab prior and a weak smoothing prior. As a result, SMTM automatically selects the relevant categories for each document without any threshold tuning. To incorporate the supervision of the seed words, we propose a seed-guided biased GPU (i.e., generalized Pólya urn) sampling procedure to guide the topic inference of SMTM. Experiments on two public datasets show that SMTM achieves better classification accuracy than state-of-the-art alternatives and even outperforms supervised solutions in some scenarios.
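
To make the generalized Pólya urn idea concrete, the following is a minimal Python sketch of a GPU-style count update, not the sampling procedure of SMTM itself. In a generalized Pólya urn, drawing a ball of one color returns it together with extra balls of related colors; in a topic model this is commonly realized by also promoting related words whenever a word is assigned to a topic. The class name `GpuCounts`, the promotion weight of 0.3, and the rule that a seed word promotes the other seed words of the same category are all assumptions made for illustration. The seed words are taken from the Delicious seed list in the appendix.

```python
from collections import defaultdict

# Seed words for two categories, copied from the Delicious appendix table.
SEEDS = {
    "Politics": ["politics", "government", "political", "democracy", "senate"],
    "Programming": ["programming", "php", "javascript", "python", "ruby"],
}


class GpuCounts:
    """Category-word pseudo-counts with a GPU-style promotion (illustrative only)."""

    def __init__(self, seeds, promotion=0.3):
        self.seeds = {cat: set(words) for cat, words in seeds.items()}
        self.promotion = promotion  # assumed weight, not a value from the paper
        self.counts = defaultdict(lambda: defaultdict(float))

    def _update(self, category, word, sign):
        # The assigned word always contributes one full count.
        self.counts[category][word] += sign * 1.0
        # Assumption: a seed word also promotes the other seeds of its category.
        if word in self.seeds.get(category, ()):
            for other in self.seeds[category] - {word}:
                self.counts[category][other] += sign * self.promotion

    def add(self, category, word):
        self._update(category, word, +1)

    def remove(self, category, word):
        # Undo an earlier add, as a Gibbs sampler does before resampling a word.
        self._update(category, word, -1)


urn = GpuCounts(SEEDS)
urn.add("Politics", "senate")    # seed word: promotes the other Politics seeds
urn.add("Politics", "election")  # non-seed word: plain count update
```

After these two assignments, "senate" and "election" each hold a full count under Politics, while the remaining Politics seed words hold only the promotion weight. A sampler built on such counts would therefore favor assigning other seed words of a category to that category's topic.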

Notes

Category and category-topic are considered equivalent and exchangeable in this work when the context has no ambiguity.
https://github.com/WHUIR/SMTM.
http://disi.unitn.it/moschitti/corpora.htm.
http://nlp.uned.es/social-tagging/delicioust140/.
https://nlp.stanford.edu/software/tmt/tmt-0.4/.
NLTK is used to split the documents into sentences.
https://github.com/hsoleimani/MLTM.
https://code.google.com/archive/p/word2vec/.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Nos. 61872278, 61502344), the Natural Science Foundation of Hubei Province (No. 2017CFB502), and the Natural Scientific Research Program of Wuhan University (No. 2042017kf0225). Chenliang Li is the corresponding author.

Author information

Authors and Affiliations

School of Computer Science, Wuhan University, Wuhan, 430072, China
Daochen Zha
School of Cyber Science and Engineering, Wuhan University, Wuhan, 430072, China
Chenliang Li

Corresponding author

Correspondence to Chenliang Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A. Seed words for evaluation

We manually selected seed words for Delicious and Ohsumed based on a standard LDA model. The seed words for Delicious are listed as follows:

Category	Seed words
Politics	Politics, government, political, democracy, senate
Design	Design, css, gallery, designers, designer, graphic
Programming	Programming, php, javascript, python, ruby
java	java, eclipse, tomcat, applet
Reference	Reference
internet	internet, traffic
Computer	Computer, mac, drive, desktop, screen, hardware
Education	Education, students, learning, school, teachers
web	web, html, ajax
Language	Language, languages, French
Science	Science, scientific, brain, scientists, researchers
Writing	Writing, fiction, tales
Culture	Culture, art, music
History	History, collections, historical, ancient
Philosophy	Philosophy, ethics
Books	Books, book, chapter, reading, authors, readers
English	English
Religion	Religion, Christian, church, religious, fathers, testament, Jesus
Grammar	Grammar, idioms, verbs, verb, sentence, clause, punctuation
Style	Style

The seed words for Ohsumed are listed as follows:

Category	Seed words
Bacterial Infections and Mycoses	Bacterial, infections, mycoses, sepsis
Virus Diseases	Virus, viral, measles, herpes, influenza
Parasitic Diseases	Parasite, parasites, malaria, falciparum, leishmaniasis
Neoplasms	Neoplasms, neoplasm, cancer, carcinoma, tumor
Musculoskeletal Diseases	Musculoskeletal, spine, osteomyelitis
Digestive System Diseases	Digestive, gastric, hepatitis, bowel, biliary
Stomatognathic Diseases	Stomatitis, teeth, parotid, periodontal
Respiratory Tract Diseases	Respiratory, lung, pneumonia, bronchial
Otorhinolaryngologic Diseases	Otolaryngologist, ear, hearing, otitis
Nervous System Diseases	Nervous, nerve, neurologic, dementia, neurological
Eye Diseases	Eye, eyes, cataract
Urologic and Male Genital Diseases	Urologic, urological, genital, bladder, prostate, prostatic
Female Genital Diseases and Pregnancy Complications	Genital, pregnancy, endometrial, endometriosis
Cardiovascular Diseases	Cardiovascular, ventricular, heart, cardiac, hypertension
Hemic and Lymphatic Diseases	Lymphadenopathy, anemia, sickle, thrombocytopenia
Neonatal Diseases and Abnormalities	Neonatal, neonates, abnormalities, congenital, anomalies
Skin and Connective Tissue Diseases	Skin, connective, tissue, rheumatoid, psoriasis, dermal
Nutritional and Metabolic Diseases	Nutritional, nutrition, metabolic, glucose, insulin, diabetes, diabetic
Endocrine Diseases	Endocrine, thyroid, parathyroid
Immunologic Diseases	Immunologic, immunodeficiency, leukemia
Disorders of Environmental Origin	Disorders, injuries, trauma, fracture
Animal Diseases	Animal, animals
Pathological Conditions, Signs and Symptoms	Pathological, postoperative
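
As an illustration of how seed lists like those in the tables above can be held in code, the following Python sketch stores a few Ohsumed rows in a dictionary and counts seed-word occurrences per category in a document. This is a naive keyword-matching baseline shown only to make the data structure concrete; it is not the SMTM inference procedure. The names `SEED_WORDS` and `seed_hits` and the example sentence are hypothetical.

```python
import re

# A few rows copied from the Ohsumed table above, lowercased for matching.
SEED_WORDS = {
    "Virus Diseases": ["virus", "viral", "measles", "herpes", "influenza"],
    "Respiratory Tract Diseases": ["respiratory", "lung", "pneumonia", "bronchial"],
    "Eye Diseases": ["eye", "eyes", "cataract"],
}


def seed_hits(text, seed_words=SEED_WORDS):
    """Return {category: number of seed-word tokens found in text}."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = {}
    for category, seeds in seed_words.items():
        seed_set = set(seeds)
        n = sum(1 for tok in tokens if tok in seed_set)
        if n:  # keep only categories with at least one matching seed word
            hits[category] = n
    return hits


doc = "Influenza virus infection often leads to viral pneumonia in the lung."
print(seed_hits(doc))
# {'Virus Diseases': 3, 'Respiratory Tract Diseases': 2}
```

The example document matches seed words from two categories at once, which is the multi-label situation the seed lists are meant to cover. Exact matching of this kind misses documents that contain no seed word at all, which is one reason a topic model is used to propagate the seed supervision to other words.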

About this article

Cite this article

Zha, D., Li, C. Multi-label dataless text classification with topic modeling. Knowl Inf Syst 61, 137–160 (2019). https://doi.org/10.1007/s10115-018-1280-0

Received: 08 November 2017
Revised: 08 July 2018
Accepted: 24 November 2018
Published: 08 December 2018
Issue Date: 01 October 2019
DOI: https://doi.org/10.1007/s10115-018-1280-0
