Evaluating Mutual Information and Chi-Square Metrics in Text Features Selection Process: A Study Case Applied to the Text Classification in PubMed

Párraga-Valle, José; García-Bermúdez, Rodolfo; Rojas, Fernando; Torres-Morán, Christian; Simón-Cuevas, Alfredo

doi:10.1007/978-3-030-45385-5_57

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12108)

Included in the following conference series:

International Work-Conference on Bioinformatics and Biomedical Engineering

Abstract

This work compares mutual information and Chi-square as metrics for evaluating the relevance of terms extracted from documents related to “software design” retrieved from the PubMed database. The comparison was carried out in two contexts: using the set of terms obtained from vectorizing the corpus of abstracts, and using only the terms from the vocabulary defined by the ISO/IEC/IEEE 24765 standard. A search was conducted on the subject “software” over the last 6 years, and the Medical Subject Headings (MeSH) term “software design” of the articles was used to label them. Mutual information and Chi-square were then computed to sort and select features. Chi-square obtained the highest accuracy scores in document classification using a multinomial naive Bayes classifier. Although these results suggest that Chi-square is better than mutual information for estimating feature relevance in the context of this work, further research is necessary to establish a consistent foundation for this conclusion.
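
The two metrics compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary term presence, a binary label ("software design" or not), and the standard 2x2 contingency-table forms of Chi-square and mutual information. The toy corpus, the labels, and the helper names (`contingency`, `chi_square`, `mutual_information`, `rank_terms`) are hypothetical and exist only for illustration. The classification step with multinomial naive Bayes is not shown.

```python
import math
from collections import Counter

def contingency(docs, labels, term):
    """Count documents by (term present?, label positive?) -> n11, n10, n01, n00."""
    counts = Counter((term in doc, bool(label)) for doc, label in zip(docs, labels))
    return (counts[(True, True)], counts[(True, False)],
            counts[(False, True)], counts[(False, False)])

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class table (0.0 if a margin is empty)."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if den == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / den

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and class label."""
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),  # term present, positive class
        (n10, n11 + n10, n10 + n00),  # term present, negative class
        (n01, n01 + n00, n11 + n01),  # term absent, positive class
        (n00, n01 + n00, n10 + n00),  # term absent, negative class
    )
    # Empty cells contribute nothing (0 * log 0 is taken as 0).
    return sum((joint / n) * math.log2(n * joint / (row * col))
               for joint, row, col in cells if joint > 0)

def rank_terms(docs, labels, score):
    """Sort the vocabulary by a scoring function, most relevant term first."""
    vocabulary = sorted(set().union(*docs))
    return sorted(vocabulary,
                  key=lambda t: score(*contingency(docs, labels, t)),
                  reverse=True)

# Hypothetical toy corpus: each document is a set of tokens;
# label 1 stands for "tagged with the MeSH term", 0 for "not tagged".
docs = [
    {"software", "design", "pattern"},
    {"software", "architecture", "design"},
    {"software", "gene", "sequence"},
    {"software", "protein", "alignment"},
]
labels = [1, 1, 0, 0]

chi_ranking = rank_terms(docs, labels, chi_square)
mi_ranking = rank_terms(docs, labels, mutual_information)
```

In this toy corpus both rankings place "design" first, while "software", which occurs in every document, scores zero under both metrics. Real corpora are where the two metrics can disagree, which is the situation the chapter examines.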

Author information

Authors and Affiliations

Universidad Técnica de Manabí, Portoviejo, Ecuador
José Párraga-Valle, Rodolfo García-Bermúdez & Christian Torres-Morán
Architecture and Technology, CITIC University of Granada, Granada, Spain
Fernando Rojas
Universidad Tecnológica de la Habana José Antonio Echeverría, Havana, Cuba
Alfredo Simón-Cuevas

Corresponding author

Correspondence to Rodolfo García-Bermúdez.

Editor information

Editors and Affiliations

University of Granada, Granada, Spain
Ignacio Rojas
University of Granada, Granada, Spain
Olga Valenzuela
University of Granada, Granada, Spain
Fernando Rojas
University of Granada, Granada, Spain
Luis Javier Herrera
University of Chicago and Fundacion Progreso y Salud, Granada, Spain
Francisco Ortuño

About this paper

Cite this paper

Párraga-Valle, J., García-Bermúdez, R., Rojas, F., Torres-Morán, C., Simón-Cuevas, A. (2020). Evaluating Mutual Information and Chi-Square Metrics in Text Features Selection Process: A Study Case Applied to the Text Classification in PubMed. In: Rojas, I., Valenzuela, O., Rojas, F., Herrera, L., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2020. Lecture Notes in Computer Science, vol 12108. Springer, Cham. https://doi.org/10.1007/978-3-030-45385-5_57

DOI: https://doi.org/10.1007/978-3-030-45385-5_57
Published: 30 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45384-8
Online ISBN: 978-3-030-45385-5
eBook Packages: Computer Science, Computer Science (R0)
