Abstract
The aim of this work was to compare mutual information and Chi-square as metrics for evaluating the relevance of terms extracted from documents related to “software design” retrieved from the PubMed database. The comparison was carried out in two contexts: using the set of terms obtained by vectorizing the corpus of abstracts, and using only the terms found in the vocabulary defined by the ISO/IEC/IEEE 24765 standard. A search was conducted on the subject “software” over the last 6 years, and the Medical Subject Headings (MeSH) term “software design” of the articles was used to label them. Mutual information and Chi-square were then computed to rank and select features. Chi-square obtained the highest accuracy scores in document classification with a multinomial naive Bayes classifier. Although these results suggest that Chi-square is better than mutual information for estimating feature relevance in the context of this work, further research is needed to put this conclusion on a firm footing.
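To make the two metrics concrete, the following is a minimal sketch of how Chi-square and mutual information can be computed for a single term from document counts. It is not the authors' implementation: it assumes the common binary formulation (term present or absent, document labeled “software design” or not), and the toy corpus, the labels, and the function names `contingency`, `chi_square`, `mutual_information`, and `rank_terms` are all hypothetical.

```python
import math

def contingency(docs, labels, term):
    """Count documents by (term present?, positive label?).

    Returns (n11, n10, n01, n00): the first index is term presence,
    the second is class membership.
    """
    n11 = n10 = n01 = n00 = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label:
            n11 += 1
        elif present:
            n10 += 1
        elif label:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of the 2x2 term/class table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        # An empty row or column: the term carries no class information.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and class."""
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, term-marginal count, class-marginal count).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, term_total, class_total in cells:
        if joint > 0:  # a zero cell contributes nothing
            mi += joint / n * math.log2(n * joint / (term_total * class_total))
    return mi

def rank_terms(docs, labels, metric):
    """Sort the vocabulary by descending score (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    scores = {t: metric(*contingency(docs, labels, t)) for t in vocabulary}
    return sorted(scores, key=lambda t: (-scores[t], t))

# Hypothetical toy corpus: each document is a set of tokens;
# label 1 stands for "indexed with the MeSH term software design".
docs = [
    {"software", "design", "architecture"},
    {"design", "pattern", "software"},
    {"software", "genome", "alignment"},
    {"protein", "software", "structure"},
]
labels = [1, 1, 0, 0]

chi_ranking = rank_terms(docs, labels, chi_square)
mi_ranking = rank_terms(docs, labels, mutual_information)
```

In this toy corpus, “design” occurs in exactly the positive documents and is ranked first by both metrics, while “software” occurs in every document and scores zero under both. On real data the two rankings can diverge, which is the kind of difference the study examines.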
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Párraga-Valle, J., García-Bermúdez, R., Rojas, F., Torres-Morán, C., Simón-Cuevas, A. (2020). Evaluating Mutual Information and Chi-Square Metrics in Text Features Selection Process: A Study Case Applied to the Text Classification in PubMed. In: Rojas, I., Valenzuela, O., Rojas, F., Herrera, L., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2020. Lecture Notes in Computer Science, vol 12108. Springer, Cham. https://doi.org/10.1007/978-3-030-45385-5_57
DOI: https://doi.org/10.1007/978-3-030-45385-5_57
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45384-8
Online ISBN: 978-3-030-45385-5
eBook Packages: Computer Science, Computer Science (R0)