Refinement Method of Post-processing and Training for Improvement of Automated Text Classification

Choi, Yun Jeong; Park, Seung Soo

doi:10.1007/11751588_32

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3981)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

The paper presents a method for improving text classification by using examples that are difficult to classify. Research on improving text categorization performance has generally focused on enhancing existing classification models and algorithms themselves, but the scope of such improvements has been limited by the feature-based statistical methodology. In this paper, we propose a new method that improves accuracy and performance through refinement training and post-processing. In particular, we focus on complex documents that are generally considered hard to classify. The proposed method differs in style from traditional classification methods and adopts a data mining strategy and a fault-tolerant system approach. In experiments, we applied our system to documents that usually receive low classification accuracy because they lie on a decision boundary. The results show that our system achieves high accuracy and stability under actual conditions.
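
The notion of documents that "lie on a decision boundary" can be illustrated with a minimal sketch. The abstract does not say how such documents are identified, so everything below is an assumption made for illustration: the margin criterion (gap between the two highest class scores), the threshold of 0.2, the function names `posterior_margin` and `split_by_margin`, and the toy scores. It is not the authors' procedure.

```python
from typing import Dict, List, Tuple


def posterior_margin(posteriors: Dict[str, float]) -> float:
    """Gap between the two highest class scores (hypothetical criterion).

    A small gap means the classifier barely prefers one class over another,
    which is one common way to describe a document near a decision boundary.
    """
    top = sorted(posteriors.values(), reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]


def split_by_margin(
    scored_docs: List[Tuple[str, Dict[str, float]]],
    threshold: float = 0.2,  # assumed value, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Separate confidently classified documents from boundary documents."""
    confident: List[str] = []
    boundary: List[str] = []
    for doc_id, posteriors in scored_docs:
        if posterior_margin(posteriors) < threshold:
            boundary.append(doc_id)
        else:
            confident.append(doc_id)
    return confident, boundary


# Toy class scores, invented for this example only.
scored = [
    ("doc-a", {"sports": 0.90, "politics": 0.10}),
    ("doc-b", {"sports": 0.55, "politics": 0.45}),
    ("doc-c", {"sports": 0.48, "politics": 0.52}),
]
confident, boundary = split_by_margin(scored)
```

Under these assumptions, documents placed in `boundary` are the candidates that a refinement or post-processing stage would examine further, while those in `confident` keep their first-pass label.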

Author information

Authors and Affiliations

Department of Computer Science & Engineering, Ewha Womans University, Seoul, 127-150, Korea
Yun Jeong Choi & Seung Soo Park

Editor information

Editors and Affiliations

Department of Computer Science, University of Calgary, 2500 University Drive N.W., T2N 1N4, Calgary, AB, Canada
Marina L. Gavrilova
Department of Mathematics and Computer Science, University of Perugia, via Vanvitelli, 1, I-06123, Perugia, Italy
Osvaldo Gervasi
William Norris Professor, Head of the Computer Science and Engineering Department, University of Minnesota, USA
Vipin Kumar
OptimaNumerics Ltd., Cathedral House, 23-31 Waring Street, BT1 2DX, Belfast, UK
C. J. Kenneth Tan
Clayton School of IT, Monash University, 3800, Clayton, Australia
David Taniar
Department of Chemistry, University of Perugia, Via Elce di Sotto, 8, I-06123, Perugia, Italy
Antonio Laganá
School of Computing, Soongsil University, Seoul, Korea
Youngsong Mun
School of Information and Communication Engineering, Sungkyunkwan University, Korea
Hyunseung Choo

About this paper

Cite this paper

Choi, Y.J., Park, S.S. (2006). Refinement Method of Post-processing and Training for Improvement of Automated Text Classification. In: Gavrilova, M.L., et al. Computational Science and Its Applications - ICCSA 2006. ICCSA 2006. Lecture Notes in Computer Science, vol 3981. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11751588_32

DOI: https://doi.org/10.1007/11751588_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34072-0
Online ISBN: 978-3-540-34074-4
eBook Packages: Computer Science, Computer Science (R0)
