Improving Text Categorization Using Domain Knowledge

Zhu, Jingbo; Chen, Wenliang

doi:10.1007/11428817_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3513)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

This paper proposes an approach to improving document classification with domain knowledge. We first introduce NEUKD, a domain knowledge dictionary, and propose two models that use domain knowledge as textual features for text categorization. The first, the BOTW model, uses domain-associated terms together with conventional words as textual features. The second, the BOF model, uses domain features as textual features. Because the size of the domain knowledge dictionary is limited, we apply a machine learning technique to address this limitation and propose the BOL model, which can be regarded as an extension of the BOF model. In the comparison experiments, a naïve Bayes system based on the BOW model serves as the baseline. Experimental results for naïve Bayes systems based on the four models (BOW, BOTW, BOF and BOL) show that domain knowledge is very useful for improving text categorization. The BOTW model outperforms the BOW model, and the BOL and BOF models outperform the BOW model when the number of features is small. By learning new features with the machine learning technique, the BOL model outperforms the BOF model.
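
As a rough illustration of how the BOW, BOTW and BOF feature sets described above might differ, the following Python sketch extracts each kind of feature and feeds it to a small multinomial naïve Bayes classifier. It is a sketch under stated assumptions, not the authors' implementation: the abstract does not describe the format of NEUKD, so `TOY_DICT`, its entries, the greedy matching of multi-word terms, the add-one smoothing, and all function names are hypothetical. The BOL model is not covered, because the abstract does not say how its new features are learned.

```python
import math
from collections import Counter

# Hypothetical stand-in for a domain knowledge dictionary:
# domain-associated term -> domain feature. Not the real NEUKD.
TOY_DICT = {
    "goalkeeper": "football",
    "penalty kick": "football",
    "interest rate": "finance",
    "stock": "finance",
}

def bow_features(text):
    """BOW: every whitespace-separated word is a feature."""
    return text.lower().split()

def botw_features(text, dictionary):
    """BOTW (assumed reading): dictionary terms plus conventional words.
    Multi-word terms are matched greedily, longest first."""
    tokens = text.lower().split()
    max_len = max((len(term.split()) for term in dictionary), default=1)
    feats, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):
            cand = " ".join(tokens[i:i + n])
            if i + n <= len(tokens) and cand in dictionary:
                feats.append(cand)
                i += n
                break
        else:
            # No multi-word term starts here; keep the single word.
            feats.append(tokens[i])
            i += 1
    return feats

def bof_features(text, dictionary):
    """BOF (assumed reading): keep only the domain features of matched terms."""
    return [dictionary[f] for f in botw_features(text, dictionary)
            if f in dictionary]

def train_nb(docs):
    """docs: list of (feature_list, label) pairs. Collects the counts
    needed by a multinomial naive Bayes classifier."""
    class_docs = Counter(label for _, label in docs)
    feat_counts = {label: Counter() for label in class_docs}
    for feats, label in docs:
        feat_counts[label].update(feats)
    vocab = {f for feats, _ in docs for f in feats}
    return class_docs, feat_counts, vocab

def predict_nb(model, feats):
    """Return the label with the highest add-one-smoothed log-probability.
    Features never seen in training are ignored."""
    class_docs, feat_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for label in class_docs:
        total = sum(feat_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for f in feats:
            if f in vocab:
                score += math.log((feat_counts[label][f] + 1)
                                  / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Toy training set, purely illustrative.
train_texts = [
    ("the goalkeeper saved a penalty kick", "sports"),
    ("the stock fell as the interest rate rose", "economy"),
]
bof_model = train_nb([(bof_features(t, TOY_DICT), y) for t, y in train_texts])
print(predict_nb(bof_model, bof_features("a late penalty kick", TOY_DICT)))
```

In this toy setup the BOF representation collapses each document to a handful of domain features, which is one plausible reading of why such a model could do well when the number of features is small; the sketch makes no claim about the paper's actual experimental setup.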

Author information

Authors and Affiliations

Natural Language Processing Lab, Institute of Computer Software and Theory, Northeastern University, Shenyang, 110004, P.R. China
Jingbo Zhu & Wenliang Chen

Editor information

Editors and Affiliations

Department of Software and Computing Systems, University of Alicante, Spain
Andrés Montoyo
Grupo de investigación del Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
Rafael Muñoz
Lab. CEDRIC, CNAM, Paris, France
Elisabeth Métais

Cite this paper

Zhu, J., Chen, W. (2005). Improving Text Categorization Using Domain Knowledge. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_10

DOI: https://doi.org/10.1007/11428817_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science, Computer Science (R0)
