Handling of Imbalanced Data in Text Classification: Category-Based Term Weights

Liu, Ying; Loh, Han Tong; Kamal, Youcef-Toumi; Tor, Shu Beng

doi:10.1007/978-1-84628-754-1_10

Abstract

Learning from imbalanced data has emerged as a new challenge for the machine learning (ML), data mining (DM) and text mining (TM) communities. Two workshops, at the AAAI conference in 2000 [17] and at the ICML conference in 2003 [7], and a special issue of ACM SIGKDD Explorations [8] were dedicated to this topic. It has attracted growing interest among researchers and practitioners seeking ways to handle imbalanced data. An excellent review of the state of the art is given by Gary Weiss [43].

References

Baeza-Yates R. & Ribeiro-Neto B. (1999) Modern information retrieval. Addison-Wesley Longman Publishing Co. Inc., Boston, MA, USA
Baoli L., Qin L. & Shiwen Y. (2004) An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP) 3:215–226
Blum A. & Mitchell T. (1998) Combining Labeled and Unlabeled Data with Co-Training. In: COLT: Proceedings of the Workshop on Computational Learning Theory
Brank J., Grobelnik M., Milic-Frayling N. & Mladenic D. (2003) Training text classifiers with SVM on very few positive examples. Report MSR-TR-2003-34
Burges C. J. C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2:121–167
Castillo M. D. d. & Serrano J. I. (2004) A multistrategy approach for digital text categorization from imbalanced documents. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:70–79
Chawla N., Japkowicz N. & Kolcz A. (eds) (2003) Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Data Sets
Chawla N., Japkowicz N. & Kolcz A. (eds) (2004) Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter 6
Debole F. & Sebastiani F. (2003) Supervised term weighting for automated text categorization. In: Proceedings of the 2003 ACM Symposium on Applied computing
Dietterich T., Margineantu D., Provost F. & Turney P. (eds) (2000) Proceedings of the ICML’2000 Workshop on Cost-sensitive Learning
Dumais S. & Chen H. (2000) Hierarchical classification of Web content. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR2000)
Elkan C. (2001) The Foundations of Cost-Sensitive Learning. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI’01)
Fan W., Yu P. S. & Wang H. (2004) Mining Extremely Skewed Trading Anomalies. In: Advances in Database Technology-EDBT 2004: 9th International Conference on Extending Database Technology
Forman G. (2003) An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, Special Issue on Variable and Feature Selection 3:1289–1305
Ghani R. (2002) Combining Labeled and Unlabeled Data for MultiClass Text Categorization. In: International Conference on Machine Learning (ICML 2002)
Goldman S. & Zhou Y. (2000) Enhancing Supervised Learning with Unlabeled Data. In: Proceedings of 17th International Conference on Machine Learning
Japkowicz N. (ed) (2000) Proceedings of the AAAI’2000 Workshop on Learning from Imbalanced Data Sets. AAAI Tech Report WS-00-05, AAAI
Japkowicz N., Myers C. & Gluck M. A. (1995) A Novelty Detection Approach to Classification. In: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95)
Joachims T. (1998) Text categorization with Support Vector Machines: Learning with many relevant features. In: ECML-98, Tenth European Conference on Machine Learning
Joachims T. (2001) A Statistical Learning Model of Text Classification with Support Vector Machines. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Joachims T. (2002) Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers
Leopold E. & Kindermann J. (2002) Text Categorization with Support Vector Machines-How to Represent Texts in Input Space. Machine Learning 46:423–444
Lewis D. D. & Gale W. A. (1994) A Sequential Algorithm for Training Text Classifiers. In: Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval
Lewis D. D., Yang Y., Rose T. G. & Li F. (2004) RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5:361–397
Liu A. Y. C. (2004) The effect of oversampling and undersampling on classifying imbalanced text datasets. Master's thesis. University of Texas at Austin
Liu B., Dai Y., Li X., Lee W. S. & Yu P. (2003) Building Text Classifiers Using Positive and Unlabeled Examples. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM’03)
Liu Y., Loh H. T. & Tor S. B. (2004) Building a Document Corpus for Manufacturing Knowledge Retrieval. In: Proceedings of the Singapore MIT Alliance Symposium 2004
Liu Y., Loh H. T., Youcef-Toumi K. & Tor S. B. (2005) MCV1: An Engineering Paper Corpus for Manufacturing Knowledge Retrieval. submitted to the Journal of Knowledge and Information System (KAIS)
Man L. (2004) A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with Support Vector Machines. In: Text Seminar of CHIME Group at the National University of Singapore
Manevitz L. M. & Yousef M. (2002) One-class SVMs for document classification. The Journal of Machine Learning Research 2:139–154
Mladenic D. & Grobelnik M. (1999) Feature Selection for Unbalanced Class Distribution and Naive Bayes. In: Proceedings of the Sixteenth International Conference on Machine Learning, ICML’99
Ng H. T., Goh W. B. & Low K. L. (1997) Feature selection, perceptron learning, and a usability case study for text categorization. In: ACM SIGIR Forum, Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Nickerson A., Japkowicz N. & Milios E. (2001) Using Unsupervised Learning to Guide Re-Sampling in Imbalanced Data Sets. In: Proceedings of the Eighth International Workshop on AI and Statistics
Nigam K. P. (2001) Using unlabeled data to improve text classification. PhD thesis. Carnegie Mellon University
Raskutti B. & Kowalczyk A. (2004) Extreme re-balancing for SVMs: a case study. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:60–69
Rijsbergen C. J. v. (1979) Information Retrieval. 2nd edn. Butterworths, London, UK
Ruiz M. E. & Srinivasan P. (2002) Hierarchical Text Categorization Using Neural Networks. Information Retrieval 5:87–118
Salton G. & Buckley C. (1988) Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24:513–523
Salton G. & McGill M. J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York, USA
Sebastiani F. (2002) Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR) 34:1–47
Sun A., Lim E. P., Ng W. K. & Srivastava J. (2004) Blocking Reduction Strategies in Hierarchical Text Classification. IEEE Transactions on Knowledge and Data Engineering (TKDE) 16:1305–1308
Vapnik V. N. (1999) The Nature of Statistical Learning Theory. 2nd edn. Springer-Verlag, New York
Weiss G. M. (2004) Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:7–19
Weiss G. M. & Provost F. (2003) Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19:315–354
Yang Y. (1996) Sampling Strategies and Learning Efficiency in Text Categorization. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access
Yang Y. & Liu X. (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Yang Y. & Pedersen J. O. (1997) A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning
Yu H., Zhai C. & Han J. (2003) Text Classification from Positive and Unlabeled Documents. In: Proceedings of the twelfth international conference on Information and knowledge management (CIKM 2003)
Zelikovitz S. & Hirsh H. (2000) Improving Short Text Classification Using Unlabeled Background Knowledge. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML2000)
Zheng Z., Wu X. & Srihari R. (2004) Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:80–89

Author information

Authors and Affiliations

Singapore MIT Alliance, National University of Singapore, Singapore, 117576
Ying Liu
Dept. of Mechanical Engineering, National University of Singapore, Singapore, 119260
Han Tong Loh
Dept. of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
Youcef-Toumi Kamal
School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore, 117576
Shu Beng Tor

Editor information

Editors and Affiliations

Bellevue, WA, 98008, USA
Anne Kao BA, MA, MS, PhD & Stephen R. Poteet BA, MA, CPhil

About this chapter

Cite this chapter

Liu, Y., Loh, H.T., Kamal, YT., Tor, S.B. (2007). Handling of Imbalanced Data in Text Classification: Category-Based Term Weights. In: Kao, A., Poteet, S.R. (eds) Natural Language Processing and Text Mining. Springer, London. https://doi.org/10.1007/978-1-84628-754-1_10

DOI: https://doi.org/10.1007/978-1-84628-754-1_10
Publisher Name: Springer, London
Print ISBN: 978-1-84628-175-4
Online ISBN: 978-1-84628-754-1
eBook Packages: Computer Science, Computer Science (R0)
