Abstract
Learning from imbalanced data has emerged as a new challenge for the machine learning (ML), data mining (DM) and text mining (TM) communities. Two recent workshops, at AAAI in 2000 [17] and at ICML in 2003 [7], and a special issue of ACM SIGKDD Explorations [8] were dedicated to this topic. The problem has attracted growing interest among researchers and practitioners seeking ways to handle imbalanced data. An excellent review of the state of the art is given by Gary Weiss [43].
References
Baeza-Yates R. & Ribeiro-Neto B. (1999) Modern information retrieval. Addison-Wesley Longman Publishing Co. Inc., Boston, MA, USA
Baoli L., Qin L. & Shiwen Y. (2004) An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP) 3:215–226
Blum A. & Mitchell T. (1998) Combining Labeled and Unlabeled Data with Co-Training. In: COLT: Proceedings of the Workshop on Computational Learning Theory
Brank J., Grobelnik M., Milic-Frayling N. & Mladenic D. (2003) Training text classifiers with SVM on very few positive examples. Report MSR-TR-2003-34
Burges C. J. C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2:121–167
Castillo M. D. d. & Serrano J. I. (2004) A multistrategy approach for digital text categorization from imbalanced documents. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:70–79
Chawla N., Japkowicz N. & Kolcz A. (eds) (2003) Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Data Sets
Chawla N., Japkowicz N. & Kolcz A. (eds) (2004) Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter 6
Debole F. & Sebastiani F. (2003) Supervised term weighting for automated text categorization. In: Proceedings of the 2003 ACM Symposium on Applied computing
Dietterich T., Margineantu D., Provost F. & Turney P. (eds) (2000) Proceedings of the ICML’2000 Workshop on Cost-sensitive Learning
Dumais S. & Chen H. (2000) Hierarchical classification of Web content. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR2000)
Elkan C. (2001) The Foundations of Cost-Sensitive Learning. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI’01)
Fan W., Yu P. S. & Wang H. (2004) Mining Extremely Skewed Trading Anomalies. In: Advances in Database Technology-EDBT 2004: 9th International Conference on Extending Database Technology
Forman G. (2003) An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, Special Issue on Variable and Feature Selection 3:1289–1305
Ghani R. (2002) Combining Labeled and Unlabeled Data for MultiClass Text Categorization. In: International Conference on Machine Learning (ICML 2002)
Goldman S. & Zhou Y. (2000) Enhancing Supervised Learning with Unlabeled Data. In: Proceedings of 17th International Conference on Machine Learning
Japkowicz N. (ed.) (2000) Proceedings of the AAAI’2000 Workshop on Learning from Imbalanced Data Sets. AAAI Tech Report WS-00–05, AAAI
Japkowicz N., Myers C. & Gluck M. A. (1995) A Novelty Detection Approach to Classification. In: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95)
Joachims T. (1998) Text categorization with Support Vector Machines: Learning with many relevant features. In: ECML-98, Tenth European Conference on Machine Learning
Joachims T. (2001) A Statistical Learning Model of Text Classification with Support Vector Machines. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Joachims T. (2002) Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers
Leopold E. & Kindermann J. (2002) Text Categorization with Support Vector Machines-How to Represent Texts in Input Space. Machine Learning 46:423–444
Lewis D. D. & Gale W. A. (1994) A Sequential Algorithm for Training Text Classifiers. In: Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval
Lewis D. D., Yang Y., Rose T. G. & Li F. (2004) RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5:361–397
Liu A. Y. C. (2004) The effect of oversampling and undersampling on classifying imbalanced text datasets. Masters thesis. University of Texas at Austin
Liu B., Dai Y., Li X., Lee W. S. & Yu P. (2003) Building Text Classifiers Using Positive and Unlabeled Examples. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM’03)
Liu Y., Loh H. T. & Tor S. B. (2004) Building a Document Corpus for Manufacturing Knowledge Retrieval. In: Proceedings of the Singapore MIT Alliance Symposium 2004
Liu Y., Loh H. T., Youcef-Toumi K. & Tor S. B. (2005) MCV1: An Engineering Paper Corpus for Manufacturing Knowledge Retrieval. submitted to the Journal of Knowledge and Information System (KAIS)
Man L. (2004) A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with Support Vector Machines. In: Text Seminar of CHIME Group at the National University of Singapore
Manevitz L. M. & Yousef M. (2002) One-class svms for document classification. The Journal of Machine Learning Research 2:139–154
Mladenic D. & Grobelnik M. (1999) Feature Selection for Unbalanced Class Distribution and Naive Bayes. In: Proceedings of the Sixteenth International Conference on Machine Learning, ICML’99
Ng H. T., Goh W. B. & Low K. L. (1997) Feature selection, perceptron learning, and a usability case study for text categorization. In: ACM SIGIR Forum, Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Nickerson A., Japkowicz N. & Milios E. (2001) Using Unsupervised Learning to Guide Re-Sampling in Imbalanced Data Sets. In: Proceedings of the Eighth International Workshop on AI and Statistics
Nigam K. P. (2001) Using unlabeled data to improve text classification. PhD thesis. Carnegie Mellon University
Raskutti B. & Kowalczyk A. (2004) Extreme re-balancing for SVMs: a case study. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:60–69
Rijsbergen C. J. v. (1979) Information Retrieval. 2nd edn. Butterworths, London, UK
Ruiz M. E. & Srinivasan P. (2002) Hierarchical Text Categorization Using Neural Networks. Information Retrieval 5:87–118
Salton G. & Buckley C. (1988) Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24:513–523
Salton G. & McGill M. J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York, USA
Sebastiani F. (2002) Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR) 34:1–47
Sun A., Lim E. P., Ng W. K. & Srivastava J. (2004) Blocking Reduction Strategies in Hierarchical Text Classification. IEEE Transactions on Knowledge and Data Engineering (TKDE) 16:1305–1308
Vapnik V. N. (1999) The Nature of Statistical Learning Theory. 2nd edn. Springer-Verlag, New York
Weiss G. M. (2004) Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:7–19
Weiss G. M. & Provost F. (2003) Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19:315–354
Yang Y. (1996) Sampling Strategies and Learning Efficiency in Text Categorization. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access
Yang Y. & Liu X. (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Yang Y. & Pedersen J. O. (1997) A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning
Yu H., Zhai C. & Han J. (2003) Text Classification from Positive and Unlabeled Documents. In: Proceedings of the twelfth international conference on Information and knowledge management (CIKM 2003)
Zelikovitz S. & Hirsh H. (2000) Improving Short Text Classification Using Unlabeled Background Knowledge. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000)
Zheng Z., Wu X. & Srihari R. (2004) Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:80–89
Copyright information
© 2007 Springer-Verlag London Limited
About this chapter
Cite this chapter
Liu, Y., Loh, H.T., Kamal, YT., Tor, S.B. (2007). Handling of Imbalanced Data in Text Classification: Category-Based Term Weights. In: Kao, A., Poteet, S.R. (eds) Natural Language Processing and Text Mining. Springer, London. https://doi.org/10.1007/978-1-84628-754-1_10
DOI: https://doi.org/10.1007/978-1-84628-754-1_10
Publisher Name: Springer, London
Print ISBN: 978-1-84628-175-4
Online ISBN: 978-1-84628-754-1
eBook Packages: Computer Science (R0)