Explainable Artificial Intelligence for Breast Tumour Classification: Helpful or Harmful

Conference paper in: Interpretability of Machine Intelligence in Medical Image Computing (iMIMIC 2022)

Abstract

Explainable Artificial Intelligence (XAI) is the field of AI dedicated to promoting trust in machine learning models by helping us to understand how they make their decisions. For example, image explanations show us which pixels or segments were deemed most important by a model for a particular classification decision. This research focuses on image explanations generated by LIME, RISE and SHAP for a model which classifies breast mammograms as either benign or malignant. We assess these XAI techniques based on (1) the extent to which they agree with each other, as decided by One-Way ANOVA, Kendall’s Tau and RBO statistical tests, and (2) their agreement with the diagnostically important areas as identified by a radiologist on a small subset of mammograms. The main contribution of this research is the discovery that the three techniques consistently disagree both with each other and with the medical truth. We argue that using these off-the-shelf techniques in a medical context is not a feasible approach, and discuss possible causes of this problem, as well as some potential solutions.

References

  1. Alvarez-Melis, D., Jaakkola, T.S.: On the robustness of interpretability methods (2018). https://arxiv.org/abs/1806.08049

  2. Arun, N., et al.: Assessing the (un)trustworthiness of saliency maps for localizing abnormalities in medical imaging (2020). https://arxiv.org/abs/2008.02766

  3. Heath, M.D., Bowyer, K., Kopans, D.B., Moore, R.H.: The digital database for screening mammography (2007)

  4. Hedström, A., et al.: Quantus: an explainable AI toolkit for responsible evaluation of neural network explanations (2022). https://arxiv.org/abs/2202.06861

  5. Holzinger, A., Saranti, A., Molnar, C., Biecek, P., Samek, W.: Explainable AI methods - a brief overview. In: Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.R., Samek, W. (eds.) xxAI 2020. LNCS, vol. 13200, pp. 13–38. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-04083-2_2

  6. Huang, L.: An integrated method for cancer classification and rule extraction from microarray data. J. Biomed. Sci. 16(1), 25 (2009)

  7. scikit-image.org: scikit-image documentation, skimage.segmentation module. https://scikit-image.org/docs/stable/api/skimage.segmentation.html

  8. Jia, X., Ren, L., Cai, L.: Clinical implementation of AI techniques will require interpretable AI models. Med. Phys. 47, 1–4 (2020)

  9. Kendall, M.: A new measure of rank correlation. Biometrika 30, 81–89 (1938)

  10. King, B.: Artificial intelligence and radiology: what will the future hold? J. Am. College Radiol. 15(3 Part B), 501–503 (2018)

  11. Knapič, S., Malhi, A., Saluja, R., Främling, K.: Explainable artificial intelligence for human decision support system in the medical domain. Mach. Learn. Knowl. Extr. 3(3), 740–770 (2021)

  12. Lin, T., Huang, M.: Dataset of breast mammography images with masses. Mendeley Data, V5 (2020)

  13. Lin, T., Huang, M.: Dataset of breast mammography images with masses. Data Brief 31, 105928 (2020)

  14. Lundberg, S., Lee, S.: A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777 (2017)

  15. MohamedAliHabib: Brain tumour detection, Github repository. GitHub (2019). https://github.com/MohamedAliHabib/Brain-Tumor-Detection

  16. Moreira, I., Amaral, I., Domingues, I., Cardoso, A.J.O., Cardoso, M.J., Cardoso, J.S.: Inbreast: toward a full-field digital mammographic database. Acad. Radiol. 19(2), 236–248 (2012)

  17. Park, J., Jo, K., Gwak, D., Hong, J., Choo, J., Choi, E.: Evaluation of out-of-distribution detection performance of self-supervised learning in a controllable environment (2020). https://doi.org/10.48550/ARXIV.2011.13120. https://arxiv.org/abs/2011.13120

  18. Petsiuk, V., Das, A., Saenko, K.: RISE: randomized input sampling for explanation of black-box models. arXiv:1806.07421 (2018)

  19. Recht, M., Bryan, R.: Artificial intelligence: threat or boon to radiologists? J. Am. College Radiol. 14(11), 1476–1480 (2017)

  20. Ribeiro, M., Singh, S., Guestrin, C.: “Why should I trust you?”: explaining the predictions of any classifier. arXiv:1602.04938v3 (2016)

  21. Rodriguez-Sampaio, M., Rincón, M., Valladares-Rodriguez, S., Bachiller-Mayoral, M.: Explainable artificial intelligence to detect breast cancer: a qualitative case-based visual interpretability approach. In: Ferrández Vicente, J.M., Álvarez-Sánchez, J.R., de la Paz López, F., Adeli, H. (eds.) Artificial Intelligence in Neuroscience: Affective Analysis and Health Applications. LNCS, vol. 13258, pp. 557–566. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-06242-1_55

  22. Ross, A., Willson, V.L.: One-Way Anova, pp. 21–24. SensePublishers, Rotterdam (2017)

  23. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215 (2019)

  24. Selvaraju, R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. arXiv:1610.02391 (2017)

  25. Seyedeh, P., Zhaoyi, C., Pablo, R.: Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review. J. Am. Med. Inform. Assoc. 27, 1173–1185 (2020)

  26. Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences (2017). https://arxiv.org/abs/1704.02685

  27. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualizing image classification models and saliency maps. https://arxiv.org/abs/1312.6034 (2014)

  28. Suckling, J., Parker, J., Dance, D.: Mammographic image analysis society (MIAS) database v1.21 (2015). https://www.repository.cam.ac.uk/handle/1810/250394

  29. Sun, Y., Chockler, H., Huang, X., Kroening, D.: Explaining image classifiers using statistical fault localization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12373, pp. 391–406. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58604-1_24

  30. Sun, Y., Chockler, H., Kroening, D.: Explanations for occluded images. In: International Conference on Computer Vision (ICCV), pp. 1234–1243. IEEE (2021)

  31. van der Velden, B.H., Kuijf, H.J., Gilhuijs, K.G., Viergever, M.A.: Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Med. Image Anal. 102470 (2022)

  32. Webber, W., Moffat, A., Zobel, J.: A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. 28(4), 1–38 (2010)

  33. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53

  34. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. arXiv:1512.04150 (2016)

Author information

Correspondence to Amy Rafferty.

A Appendix

A.1 Model Training Results

Table 1 shows the results of the experiment used to choose the 75-epoch model when considering the impact of overfitting on our CNN, as discussed in Sect. 3.2 of this report.

Table 1. Validation accuracy and F1 Score for CNNs trained with different numbers of epochs.

A.2 Choosing L Parameter for LIME

Figure 4 shows the results of the experiment used to choose L, as discussed in Sect. 4.1 of this report.

Fig. 4. Average % pixel agreement values between techniques, taken over 30 images from the validation set.

A.3 One-Way ANOVA Results

We present here the statistical hypotheses used for the One-Way ANOVA test, as well as the results gathered. This statistical test and its implications are discussed in Sect. 5.1 of this report. The results are shown in Table 2.

The hypotheses for One-Way ANOVA are as follows:

  • H0: There is no statistically significant difference between the means of the groups.

  • H1: There is a statistically significant difference between the means of the groups.

Table 2. Results of One-Way ANOVA tests as described in the text. Bold results are statistically significant (alpha value 0.05).
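
As an illustration only, the following is a minimal sketch of how a One-Way ANOVA of this kind could be run with SciPy. The three score arrays are hypothetical random placeholders standing in for per-image values from LIME, RISE and SHAP, not the data used in this work.

```python
# Hypothetical sketch of a One-Way ANOVA over three groups of per-image scores.
# The arrays are random placeholders, not the paper's actual measurements.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
lime_scores = rng.normal(0.30, 0.05, size=30)   # placeholder LIME values
rise_scores = rng.normal(0.32, 0.05, size=30)   # placeholder RISE values
shap_scores = rng.normal(0.29, 0.05, size=30)   # placeholder SHAP values

f_stat, p_value = f_oneway(lime_scores, rise_scores, shap_scores)
# Reject H0 (equal group means) when the p-value falls below alpha = 0.05.
print(f"F = {f_stat:.3f}, p = {p_value:.3f}, reject H0: {p_value < 0.05}")
```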

A.4 Pixel Agreement Statistics

Figure 5 presents a box plot representation of the % pixel agreement values between XAI techniques, taken over all images in our test set. These results are discussed in Sect. 5.1 of this report. Table 3 also summarises these agreement values.
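
As a rough illustration of how such a % pixel agreement value can be computed, the sketch below compares the top-n pixels of two importance maps. The helper function and the maps are hypothetical, assuming each technique yields a per-pixel importance score of the same shape; the exact procedure used in our experiments may differ.

```python
# Hypothetical sketch of a top-n pixel agreement measure between two
# importance maps; the maps below are random placeholders.
import numpy as np

def pixel_agreement(map_a: np.ndarray, map_b: np.ndarray, n: int) -> float:
    """Percentage overlap between the n most important pixels of two maps."""
    top_a = set(np.argsort(map_a, axis=None)[-n:])   # flat indices of the top-n pixels
    top_b = set(np.argsort(map_b, axis=None)[-n:])
    return 100.0 * len(top_a & top_b) / n

rng = np.random.default_rng(0)
lime_map = rng.random((224, 224))   # placeholder importance maps
rise_map = rng.random((224, 224))
print(pixel_agreement(lime_map, rise_map, n=1000))
```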

Fig. 5. % pixel agreement between techniques for the n most important pixels. Medians are orange lines, means are green triangles. (Color figure online)

Table 3. Statistical overview of percentage pixel agreements for all method comparisons.

A.5 Rank-Biased Overlap (RBO) Results

RBO [32] weights each rank position by considering the depth of the ranking being examined, minimising the effect of the least important pixels. Taking two ranked lists as inputs, RBO outputs a value between 0 and 1, where 0 indicates that the lists are disjoint, and 1 indicates that they are identical. The results of RBO depend on the tuneable parameter p [32]. Small p values place more weight on items at the top of an ordered list. While this is desirable, we must consider the difference in pixel importance value allocation methods between techniques. RISE applies a decimal score to each pixel. SHAP applies the same decimal score to each pixel within a given image segment. LIME uses binary values indicating whether the pixels are in the top 6 most important features. We use large p values to properly encompass similarities between larger groups of pixels with identical values.
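
To make the measure concrete, the sketch below is a simplified, finite-depth implementation of the weighted-overlap form of RBO from Webber et al. [32] (without the extrapolation term), applied to two hypothetical pixel index rankings; it is not the exact implementation used in our experiments.

```python
# Simplified finite-depth rank-biased overlap (RBO); the input rankings
# are hypothetical pixel indices ordered from most to least important.
def rbo(list_a, list_b, p=0.99, depth=None):
    if depth is None:
        depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        overlap = len(seen_a & seen_b) / d   # agreement over the top-d items
        score += (p ** (d - 1)) * overlap    # smaller p weights the top more heavily
    return (1 - p) * score                   # in [0, 1]; approaches 1 for identical lists at large depth

ranking_a = [3, 1, 4, 6, 5]   # placeholder rankings
ranking_b = [3, 1, 5, 9, 2]
print(rbo(ranking_a, ranking_b, p=0.9))
```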

Table 4 shows the average, minimum and maximum RBO values for each pairwise pixel list comparison. The average RBO values for each comparison tell us that the pixel lists are almost disjoint. This is expected due to the differing pixel importance allocation methods discussed above. Instead we consider the maximum values: LIME and SHAP generate lists that are highly similar for at least one instance in the test set, with maximum RBO values in the range 0.69–0.78. The other pairwise comparisons do not come close to these numbers. This observation supports the Kendall's Tau results: both tests identify LIME and SHAP as the pair of techniques with the highest agreement on pixel orderings.

Table 4. RBO results computed on the full ordered pixel importance lists for each technique, with differing p values. Values are shown to 3 decimal places; we note that these values are never exactly zero, just extremely small.

A.6 Kendall's Tau Results

We present here the statistical hypotheses used for the Kendall's Tau test, as well as the results gathered. This statistical test and its implications are discussed in Sect. 5.2 of this report. The results are shown in Table 5.

The following hypotheses are used:

  • H0: There is no statistically significant correlation, the lists are independent.

  • H1: There is a statistically significant correlation in pixel orderings between lists, they are not independent.

Table 5. Kendall’s Tau comparison results. n is defined in the text. Values are averages taken over the test set, shown to 3 decimal places. Bold results are statistically significant.
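
A minimal sketch of such a test, using SciPy's kendalltau on two hypothetical pixel importance arrays (random placeholders, not our actual importance values), is shown below.

```python
# Hypothetical sketch of a Kendall's Tau test between two pixel importance
# orderings; the score arrays are random placeholders.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
scores_a = rng.random(500)                            # e.g. flattened importance values
scores_b = scores_a + rng.normal(0.0, 0.3, size=500)  # correlated placeholder

tau, p_value = kendalltau(scores_a, scores_b)
# Reject H0 (independent orderings) when the p-value falls below 0.05.
print(f"tau = {tau:.3f}, p = {p_value:.3f}")
```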

A.7 Radiologist Opinions

Here we present the results received from two independent radiologists, as well as the definitions of the scoring system used to evaluate the explanations.

We requested that each explanation be scored between 0 and 3 to represent its agreement with radiologist-identified image regions. The definitions of the scores provided to the radiologists are as follows:

0 = Explanation completely differs from expert opinion

1 = Explanation has some similarities, but mostly differs from expert opinion

2 = Explanation mostly agrees with expert opinion, though some areas differ

3 = Explanation and expert opinion completely agree

Table 6. Radiologist evaluation regarding explanations generated on a subset of 10 images. B denotes benign, and M denotes malignant.
Table 7. Second radiologist evaluation regarding explanations generated on a subset of 10 images. B denotes benign, M denotes malignant.

We note that the opinions of the two radiologists above do not entirely agree with each other; this is because identifying all cancerous regions by eye, especially on benign mammograms, is extremely difficult. The scans are also fairly noisy and in parts blurry by nature. The purpose of this form of evaluation was not to have radiologists perfectly highlight all cancerous regions; the goal was simply to analyse their responses to the explanations generated by each XAI technique, in order to judge the usefulness of the techniques as diagnostic tools.

Fig. 6. Examples of LIME explanations generated for benign (Ben) and malignant (Mal) breast mammograms.

A.8 Explanation Examples

This section contains examples of the image explanations generated by LIME, RISE and SHAP, as described in this report. Figure 6 shows LIME explanations, Fig. 7 shows RISE explanations, and Fig. 8 shows SHAP explanations.
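
For reference, the sketch below shows one way an explanation like those in Fig. 6 could be produced with the lime package. The mammogram and classifier are dummy placeholders (a random image and a toy probability function), not the CNN or preprocessing used in this work, and the LIME settings shown are illustrative.

```python
# Hypothetical sketch of generating a LIME image explanation.
# The image and classifier below are placeholders, not the paper's CNN.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def classifier_fn(batch: np.ndarray) -> np.ndarray:
    """Dummy stand-in returning [benign, malignant] probabilities per image."""
    scores = batch.mean(axis=(1, 2, 3))
    return np.stack([1.0 - scores, scores], axis=1)

image = np.random.rand(224, 224, 3)        # placeholder mammogram
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, classifier_fn, top_labels=2, num_samples=1000)
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True,
    num_features=6, hide_rest=False)       # top 6 segments, as mentioned in A.5
overlay = mark_boundaries(img, mask)       # explanation overlay to display or save
```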

Fig. 7. Examples of RISE explanations generated for benign (Ben) and malignant (Mal) breast mammograms.

Fig. 8. Examples of SHAP explanations generated for benign (Ben) and malignant (Mal) breast mammograms.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rafferty, A., Nenutil, R., Rajan, A. (2022). Explainable Artificial Intelligence for Breast Tumour Classification: Helpful or Harmful. In: Reyes, M., Henriques Abreu, P., Cardoso, J. (eds) Interpretability of Machine Intelligence in Medical Image Computing. iMIMIC 2022. Lecture Notes in Computer Science, vol 13611. Springer, Cham. https://doi.org/10.1007/978-3-031-17976-1_10

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-17976-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17975-4

  • Online ISBN: 978-3-031-17976-1

  • eBook Packages: Computer Science, Computer Science (R0)
