Abstract
Good feature selection is essential for text classification, both to make the task tractable for machine learning and to improve classification performance. This study benchmarks twelve feature selection metrics across 229 text classification problems drawn from the Reuters, OHSUMED, and TREC corpora, using Support Vector Machines. The results are analyzed against several objectives. For best accuracy, F-measure, or recall, the findings reveal an outstanding new feature selection metric, "Bi-Normal Separation" (BNS). For precision alone, however, Information Gain (IG) was superior. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner who seeks to choose one or two metrics that are most likely to perform best on the single dataset at hand. This analysis determined, for example, that IG and Chi-Squared have correlated failures for precision, and that IG paired with BNS is therefore a better choice of pair to try.
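The BNS metric scores a term by the separation between its true-positive rate and false-positive rate after each is mapped through the inverse of the standard normal cumulative distribution function. A minimal sketch in Python follows; the rate-clipping constant `eps` is an illustrative assumption (some small clip is needed because the inverse normal CDF diverges at 0 and 1), not a value taken from this paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0, std dev 1

def bns(tp: int, fp: int, pos: int, neg: int, eps: float = 0.0005) -> float:
    """Bi-Normal Separation of a term: |F^-1(tpr) - F^-1(fpr)|,
    where F^-1 is the inverse standard normal CDF.

    tp:  positive documents containing the term
    fp:  negative documents containing the term
    pos: total positive documents
    neg: total negative documents
    """
    # Clip rates away from 0 and 1, where F^-1 is unbounded.
    tpr = min(max(tp / pos, eps), 1.0 - eps)
    fpr = min(max(fp / neg, eps), 1.0 - eps)
    return abs(_STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr))

# A term concentrated in the positive class scores higher than an
# uninformative term that appears at the same rate in both classes.
discriminative = bns(90, 10, 100, 1000)   # tpr=0.9, fpr=0.01
uninformative = bns(50, 500, 100, 1000)   # tpr=0.5, fpr=0.5
```

Terms would be ranked by this score and the top-k kept as features.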
© 2002 Springer-Verlag Berlin Heidelberg
Cite this paper
Forman, G. (2002). Choose Your Words Carefully: An Empirical Study of Feature Selection Metrics for Text Classification. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0