Explainable Artificial Intelligence for Breast Tumour Classification: Helpful or Harmful

Conference paper in: Interpretability of Machine Intelligence in Medical Image Computing (iMIMIC 2022)

Abstract

Explainable Artificial Intelligence (XAI) is the field of AI dedicated to promoting trust in machine learning models by helping us to understand how they make their decisions. For example, image explanations show us which pixels or segments were deemed most important by a model for a particular classification decision. This research focuses on image explanations generated by LIME, RISE and SHAP for a model which classifies breast mammograms as either benign or malignant. We assess these XAI techniques based on (1) the extent to which they agree with each other, as decided by One-Way ANOVA, Kendall’s Tau and RBO statistical tests, and (2) their agreement with the diagnostically important areas as identified by a radiologist on a small subset of mammograms. The main contribution of this research is the discovery that the three techniques consistently disagree both with each other and with the medical truth. We argue that using these off-the-shelf techniques in a medical context is not a feasible approach, and discuss possible causes of this problem, as well as some potential solutions.

References

  1. Alvarez-Melis, D., Jaakkola, T.S.: On the robustness of interpretability methods (2018). https://arxiv.org/abs/1806.08049

  2. Arun, N., et al.: Assessing the (un)trustworthiness of saliency maps for localizing abnormalities in medical imaging (2020). https://arxiv.org/abs/2008.02766

  3. Heath, M.D., Bowyer, K., Kopans, D.B., Moore, R.H.: The digital database for screening mammography (2007)

  4. Hedström, A., et al.: Quantus: an explainable AI toolkit for responsible evaluation of neural network explanations (2022). https://arxiv.org/abs/2202.06861

  5. Holzinger, A., Saranti, A., Molnar, C., Biecek, P., Samek, W.: Explainable AI methods - a brief overview. In: Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.R., Samek, W. (eds.) xxAI 2020. LNCS, vol. 13200, pp. 13–38. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-04083-2_2

  6. Huang, L.: An integrated method for cancer classification and rule extraction from microarray data. J. Biomed. Sci. 16(1), 25 (2009)

  7. scikit-image.org: scikit-image documentation, skimage.segmentation module. https://scikit-image.org/docs/stable/api/skimage.segmentation.html

  8. Jia, X., Ren, L., Cai, L.: Clinical implementation of AI techniques will require interpretable AI models. Med. Phys. 47, 1–4 (2020)

  9. Kendall, M.: A new measure of rank correlation. Biometrika 30, 81–89 (1938)

  10. King, B.: Artificial intelligence and radiology: what will the future hold? J. Am. College Radiol. 15(3 Part B), 501–503 (2018)

  11. Knapič, S., Malhi, A., Saluja, R., Främling, K.: Explainable artificial intelligence for human decision support system in the medical domain. Mach. Learn. Knowl. Extr. 3(3), 740–770 (2021)

  12. Lin, T., Huang, M.: Dataset of breast mammography images with masses. Mendeley Data, V5 (2020)

  13. Lin, T., Huang, M.: Dataset of breast mammography images with masses. Data Brief 31, 105928 (2020)

  14. Lundberg, S., Lee, S.: A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777 (2017)

  15. MohamedAliHabib: Brain tumour detection, Github repository. GitHub (2019). https://github.com/MohamedAliHabib/Brain-Tumor-Detection

  16. Moreira, I., Amaral, I., Domingues, I., Cardoso, A.J.O., Cardoso, M.J., Cardoso, J.S.: Inbreast: toward a full-field digital mammographic database. Acad. Radiol. 19(2), 236–248 (2012)

  17. Park, J., Jo, K., Gwak, D., Hong, J., Choo, J., Choi, E.: Evaluation of out-of-distribution detection performance of self-supervised learning in a controllable environment (2020). https://doi.org/10.48550/ARXIV.2011.13120. https://arxiv.org/abs/2011.13120

  18. Petsiuk, V., Das, A., Saenko, K.: RISE: randomized input sampling for explanation of black-box models. arXiv:1806.07421 (2018)

  19. Recht, M., Bryan, R.: Artificial intelligence: threat or boon to radiologists? J. Am. College Radiol. 14(11), 1476–1480 (2017)

  20. Ribeiro, M., Singh, S., Guestrin, C.: “Why should I trust you?”: explaining the predictions of any classifier. arXiv:1602.04938v3 (2016)

  21. Rodriguez-Sampaio, M., Rincón, M., Valladares-Rodriguez, S., Bachiller-Mayoral, M.: Explainable artificial intelligence to detect breast cancer: a qualitative case-based visual interpretability approach. In: Ferrández Vicente, J.M., Álvarez-Sánchez, J.R., de la Paz López, F., Adeli, H. (eds.) Artificial Intelligence in Neuroscience: Affective Analysis and Health Applications. LNCS, vol. 13258, pp. 557–566. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-06242-1_55

  22. Ross, A., Willson, V.L.: One-Way Anova, pp. 21–24. SensePublishers, Rotterdam (2017)

  23. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215 (2019)

  24. Selvaraju, R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. arXiv:1610.02391 (2017)

  25. Seyedeh, P., Zhaoyi, C., Pablo, R.: Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review. J. Am. Med. Inform. Assoc. 27, 1173–1185 (2020)

  26. Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences (2017). https://arxiv.org/abs/1704.02685

  27. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualizing image classification models and saliency maps. https://arxiv.org/abs/1312.6034 (2014)

  28. Suckling, J., Parker, J., Dance, D.: Mammographic image analysis society (MIAS) database v1.21 (2015). https://www.repository.cam.ac.uk/handle/1810/250394

  29. Sun, Y., Chockler, H., Huang, X., Kroening, D.: Explaining image classifiers using statistical fault localization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12373, pp. 391–406. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58604-1_24

  30. Sun, Y., Chockler, H., Kroening, D.: Explanations for occluded images. In: International Conference on Computer Vision (ICCV), pp. 1234–1243. IEEE (2021)

  31. van der Velden, B.H., Kuijf, H.J., Gilhuijs, K.G., Viergever, M.A.: Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Med. Image Anal. 102470 (2022)

  32. Webber, W., Moffat, A., Zobel, J.: A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. 28(4), 1–38 (2010)

  33. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53

  34. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. arXiv:1512.04150 (2016)

Author information

Correspondence to Amy Rafferty.

A Appendix

A.1 Model Training Results

Table 1 shows the results of the experiment used to choose the 75-epoch model when considering the impact of overfitting on our CNN, as discussed in Sect. 3.2 of this report.

Table 1. Validation accuracy and F1 Score for CNNs trained with different numbers of epochs.

A.2 Choosing L Parameter for LIME

Figure 4 shows the results of the experiment used to choose L, as discussed in Sect. 4.1 of this report.

Fig. 4. Average % pixel agreement values between techniques, taken over 30 images from the validation set.

A.3 One-Way ANOVA Results

We present here the statistical hypotheses used for the One-Way ANOVA test, as well as the results gathered. This statistical test and its implications are discussed in Sect. 5.1 of this report. The results are shown in Table 2.

The hypotheses for One-Way ANOVA are as follows:

  • H0: There is no statistically significant difference between the means of the groups.

  • H1: There is a statistically significant difference between the means of the groups.

Table 2. Results of One-Way ANOVA tests as described in the text. Bold results are statistically significant (alpha value 0.05).
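
As an illustration only, the following is a minimal sketch of how a One-Way ANOVA of this kind could be run with SciPy. The three score arrays are hypothetical random placeholders standing in for per-image values from LIME, RISE and SHAP, not the data used in this work.

```python
# Hypothetical sketch of a One-Way ANOVA over three groups of per-image scores.
# The arrays are random placeholders, not the paper's actual measurements.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
lime_scores = rng.normal(0.30, 0.05, size=30)   # placeholder LIME values
rise_scores = rng.normal(0.32, 0.05, size=30)   # placeholder RISE values
shap_scores = rng.normal(0.29, 0.05, size=30)   # placeholder SHAP values

f_stat, p_value = f_oneway(lime_scores, rise_scores, shap_scores)
# Reject H0 (equal group means) when the p-value falls below alpha = 0.05.
print(f"F = {f_stat:.3f}, p = {p_value:.3f}, reject H0: {p_value < 0.05}")
```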

A.4 Pixel Agreement Statistics

Figure 5 presents a box plot representation of the % pixel agreement values between XAI techniques, taken over all images in our test set. These results are discussed in Sect. 5.1 of this report. Table 3 also summarises these agreement values.
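
As a rough illustration of how such a % pixel agreement value can be computed, the sketch below compares the top-n pixels of two importance maps. The helper function and the maps are hypothetical, assuming each technique yields a per-pixel importance score of the same shape; the exact procedure used in our experiments may differ.

```python
# Hypothetical sketch of a top-n pixel agreement measure between two
# importance maps; the maps below are random placeholders.
import numpy as np

def pixel_agreement(map_a: np.ndarray, map_b: np.ndarray, n: int) -> float:
    """Percentage overlap between the n most important pixels of two maps."""
    top_a = set(np.argsort(map_a, axis=None)[-n:])   # flat indices of the top-n pixels
    top_b = set(np.argsort(map_b, axis=None)[-n:])
    return 100.0 * len(top_a & top_b) / n

rng = np.random.default_rng(0)
lime_map = rng.random((224, 224))   # placeholder importance maps
rise_map = rng.random((224, 224))
print(pixel_agreement(lime_map, rise_map, n=1000))
```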

Fig. 5. % pixel agreement between techniques for the n most important pixels. Medians are orange lines, means are green triangles. (Color figure online)

Table 3. Statistical overview of percentage pixel agreements for all method comparisons.

A.5 Rank-Biased Overlap (RBO) Results

RBO [32] weights each rank position by considering the depth of the ranking being examined, minimising the effect of the least important pixels. Taking two ranked lists as inputs, RBO outputs a value between 0 and 1, where 0 indicates that the lists are disjoint, and 1 indicates that they are identical. The results of RBO depend on the tuneable parameter p [32]. Small p values place more weight on items at the top of an ordered list. While this is desirable, we must consider the difference in pixel importance value allocation methods between techniques. RISE applies a decimal score to each pixel. SHAP applies the same decimal score to each pixel within a given image segment. LIME uses binary values indicating whether the pixels are in the top 6 most important features. We use large p values to properly encompass similarities between larger groups of pixels with identical values.
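
To make the measure concrete, the sketch below is a simplified, finite-depth implementation of the weighted-overlap form of RBO from Webber et al. [32] (without the extrapolation term), applied to two hypothetical pixel index rankings; it is not the exact implementation used in our experiments.

```python
# Simplified finite-depth rank-biased overlap (RBO); the input rankings
# are hypothetical pixel indices ordered from most to least important.
def rbo(list_a, list_b, p=0.99, depth=None):
    if depth is None:
        depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        overlap = len(seen_a & seen_b) / d   # agreement over the top-d items
        score += (p ** (d - 1)) * overlap    # smaller p weights the top more heavily
    return (1 - p) * score                   # in [0, 1]; approaches 1 for identical lists at large depth

ranking_a = [3, 1, 4, 6, 5]   # placeholder rankings
ranking_b = [3, 1, 5, 9, 2]
print(rbo(ranking_a, ranking_b, p=0.9))
```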

Table 4 shows the average, minimum and maximum RBO values for each pairwise pixel list comparison. The average RBO values for each comparison tell us that the pixel lists are almost disjoint. This is expected due to the differing pixel importance allocation methods discussed above. Instead we consider the maximum values: LIME and SHAP generate lists that are highly similar for at least one instance in the test set, with maximum RBO values in the range 0.69–0.78. The other pairwise comparisons do not come close to these numbers. This observation supports the Kendall's Tau results: both tests identify LIME and SHAP as the pair of techniques with the highest agreement on pixel orderings.

Table 4. RBO results computed on the full ordered pixel importance lists for each technique, with differing p values. Values are shown to 3 decimal places; we note that these values are never exactly zero, just extremely small.

A.6 Kendall's Tau Results

We present here the statistical hypotheses used for the Kendall's Tau test, as well as the results gathered. This statistical test and its implications are discussed in Sect. 5.2 of this report. The results are shown in Table 5.

The following hypotheses are used:

  • H0: There is no statistically significant correlation, the lists are independent.

  • H1: There is a statistically significant correlation in pixel orderings between lists, they are not independent.

Table 5. Kendall’s Tau comparison results. n is defined in the text. Values are averages taken over the test set, shown to 3 decimal places. Bold results are statistically significant.
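
A minimal sketch of such a test, using SciPy's kendalltau on two hypothetical pixel importance arrays (random placeholders, not our actual importance values), is shown below.

```python
# Hypothetical sketch of a Kendall's Tau test between two pixel importance
# orderings; the score arrays are random placeholders.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
scores_a = rng.random(500)                            # e.g. flattened importance values
scores_b = scores_a + rng.normal(0.0, 0.3, size=500)  # correlated placeholder

tau, p_value = kendalltau(scores_a, scores_b)
# Reject H0 (independent orderings) when the p-value falls below 0.05.
print(f"tau = {tau:.3f}, p = {p_value:.3f}")
```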

A.7 Radiologist Opinions

Here we present the results received from two independent radiologists, as well as the definitions of the scoring system used to evaluate the explanations.

We requested that each explanation be scored between 0 and 3 to represent its agreement with radiologist-identified image regions. The definitions of the scores provided to the radiologists are as follows:

0 = Explanation completely differs from expert opinion

1 = Explanation has some similarities, but mostly differs from expert opinion

2 = Explanation mostly agrees with expert opinion, though some areas differ

3 = Explanation and expert opinion completely agree

Table 6. Radiologist evaluation regarding explanations generated on a subset of 10 images. B denotes benign, and M denotes malignant.
Table 7. Second radiologist evaluation regarding explanations generated on a subset of 10 images. B denotes benign, M denotes malignant.

We note that the opinions of the two radiologists above do not entirely agree with each other; this is because identifying all cancerous regions by eye, especially on benign mammograms, is extremely difficult. The scans are also fairly noisy and in parts blurry by nature. The purpose of this form of evaluation was not to have radiologists perfectly highlight all cancerous regions; the goal was simply to analyse their responses to the explanations generated by each XAI technique, in order to judge the usefulness of the techniques as diagnostic tools.

Fig. 6. Examples of LIME explanations generated for benign (Ben) and malignant (Mal) breast mammograms.

A.8 Explanation Examples

This section contains examples of the image explanations generated by LIME, RISE and SHAP, as described in this report. Figure 6 shows LIME explanations, Fig. 7 shows RISE explanations, and Fig. 8 shows SHAP explanations.
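
For reference, the sketch below shows one way an explanation like those in Fig. 6 could be produced with the lime package. The mammogram and classifier are dummy placeholders (a random image and a toy probability function), not the CNN or preprocessing used in this work, and the LIME settings shown are illustrative.

```python
# Hypothetical sketch of generating a LIME image explanation.
# The image and classifier below are placeholders, not the paper's CNN.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def classifier_fn(batch: np.ndarray) -> np.ndarray:
    """Dummy stand-in returning [benign, malignant] probabilities per image."""
    scores = batch.mean(axis=(1, 2, 3))
    return np.stack([1.0 - scores, scores], axis=1)

image = np.random.rand(224, 224, 3)        # placeholder mammogram
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, classifier_fn, top_labels=2, num_samples=1000)
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True,
    num_features=6, hide_rest=False)       # top 6 segments, as mentioned in A.5
overlay = mark_boundaries(img, mask)       # explanation overlay to display or save
```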

Fig. 7. Examples of RISE explanations generated for benign (Ben) and malignant (Mal) breast mammograms.

Fig. 8. Examples of SHAP explanations generated for benign (Ben) and malignant (Mal) breast mammograms.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rafferty, A., Nenutil, R., Rajan, A. (2022). Explainable Artificial Intelligence for Breast Tumour Classification: Helpful or Harmful. In: Reyes, M., Henriques Abreu, P., Cardoso, J. (eds) Interpretability of Machine Intelligence in Medical Image Computing. iMIMIC 2022. Lecture Notes in Computer Science, vol 13611. Springer, Cham. https://doi.org/10.1007/978-3-031-17976-1_10

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-17976-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17975-4

  • Online ISBN: 978-3-031-17976-1

  • eBook Packages: Computer Science, Computer Science (R0)
