Evaluating Mutual Information and Chi-Square Metrics in Text Features Selection Process: A Study Case Applied to the Text Classification in PubMed

  • Conference paper
Bioinformatics and Biomedical Engineering (IWBBIO 2020)

Abstract

The aim of this work was to compare mutual information and Chi-square as metrics for evaluating the relevance of terms extracted from documents related to “software design” retrieved from the PubMed database. The metrics were tested in two contexts: using the set of terms obtained from the vectorization of the corpus of abstracts, and using only the terms from the vocabulary defined by the IEEE standard ISO/IEC/IEEE 24765. A search was conducted on the subject “software” over the last 6 years, and the articles were labeled according to the Medical Subject Headings (MeSH) term “software design”. Mutual information and Chi-square were then computed to rank and select features. Chi-square obtained the highest accuracy scores in document classification with a multinomial naive Bayes classifier. Although these results suggest that Chi-square is better than mutual information for estimating feature relevance in the context of this work, further research is necessary to give this conclusion a consistent foundation.
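
The ranking step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the toy corpus, the function names, and the choice of the standard document-level 2x2 contingency formulas for Chi-square and mutual information are all assumptions, since the abstract does not state which variants were used. The classification step with multinomial naive Bayes is left out.

```python
import math

# Toy corpus invented for illustration; it is NOT data from the study.
# Label 1 = article tagged with the MeSH term "software design", 0 = not tagged.
DOCS = [
    ("software design architecture pattern", 1),
    ("design requirements architecture model", 1),
    ("software design model validation", 1),
    ("gene expression software pipeline", 0),
    ("protein sequence alignment software", 0),
    ("gene sequence database", 0),
]


def contingency(term, docs):
    """Count documents by term presence and class.

    Returns (n11, n10, n01, n00): first index = term present (1) or absent (0),
    second index = class 1 or class 0.
    """
    n11 = n10 = n01 = n00 = 0
    for text, label in docs:
        present = term in text.split()
        if present and label == 1:
            n11 += 1
        elif present:
            n10 += 1
        elif label == 1:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class table (assumed variant)."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def mutual_information(n11, n10, n01, n00):
    """Mutual information in bits between term presence and class (assumed variant)."""
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),  # term present, class 1
        (n10, n11 + n10, n10 + n00),  # term present, class 0
        (n01, n01 + n00, n11 + n01),  # term absent, class 1
        (n00, n01 + n00, n10 + n00),  # term absent, class 0
    ]
    mi = 0.0
    for n_tc, n_t, n_c in cells:
        if n_tc > 0:  # empty cells contribute nothing
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi


def rank_terms(docs, score_fn, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    vocab = sorted({tok for text, _ in docs for tok in text.split()})
    scored = [(score_fn(*contingency(t, docs)), t) for t in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


if __name__ == "__main__":
    print("chi-square top 3:", rank_terms(DOCS, chi_square, 3))
    print("mutual information top 3:", rank_terms(DOCS, mutual_information, 3))
```

In this toy corpus the term "software" occurs equally often in both classes, so both metrics score it at zero, while "design" separates the classes perfectly and ranks first under both. On real data the two rankings can diverge, which is the kind of difference the study examines.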

Author information

Corresponding author

Correspondence to Rodolfo García-Bermúdez.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Párraga-Valle, J., García-Bermúdez, R., Rojas, F., Torres-Morán, C., Simón-Cuevas, A. (2020). Evaluating Mutual Information and Chi-Square Metrics in Text Features Selection Process: A Study Case Applied to the Text Classification in PubMed. In: Rojas, I., Valenzuela, O., Rojas, F., Herrera, L., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2020. Lecture Notes in Computer Science, vol 12108. Springer, Cham. https://doi.org/10.1007/978-3-030-45385-5_57

  • DOI: https://doi.org/10.1007/978-3-030-45385-5_57

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-45384-8

  • Online ISBN: 978-3-030-45385-5

  • eBook Packages: Computer Science (R0)
