
Automated identification of chest radiographs with referable abnormality with deep learning: need for recalibration

  • Imaging Informatics and Artificial Intelligence
  • Published in European Radiology

Abstract

Objectives

To evaluate the calibration of a deep learning (DL) model in a diagnostic cohort and to improve the model's calibration through recalibration procedures.

Methods

Chest radiographs (CRs) from 1135 consecutive patients (M:F = 582:553; mean age, 52.6 years) who visited our emergency department were included. A commercially available DL model was used to identify abnormal CRs, producing a continuous probability score for each CR. After evaluation of the model's calibration, eight different methods were used to recalibrate the original model on the basis of its probability scores. The original model outputs were recalibrated using 681 randomly sampled CRs and validated on the remaining 454 CRs. The Brier score (overall performance), the average and maximum calibration error and absolute Spiegelhalter's Z (calibration), and the area under the receiver operating characteristic curve (AUROC; discrimination) were evaluated over 1000 repeated random splits of the dataset.
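For reference, the two calibration statistics named above have standard closed forms. A brief sketch of those conventional definitions, assuming (the abstract does not spell this out) that the study uses them as usually defined, with p_i the predicted abnormality probability and y_i ∈ {0, 1} the reference label of the i-th CR:

```latex
\text{Brier score} \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(p_i - y_i\bigr)^2,
\qquad
Z \;=\; \frac{\sum_{i=1}^{n}\bigl(y_i - p_i\bigr)\bigl(1 - 2p_i\bigr)}
             {\sqrt{\sum_{i=1}^{n}\bigl(1 - 2p_i\bigr)^2\, p_i\bigl(1 - p_i\bigr)}}
```

Under perfect calibration, Z behaves approximately as a standard normal variable, so an absolute Z of 2.349 (as reported in the Results) exceeds the usual 1.96 threshold and indicates significant miscalibration.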

Results

The original model tended to overestimate the likelihood of abnormality, with an average and maximum calibration error of 0.069 and 0.179, respectively; an absolute Spiegelhalter's Z of 2.349; and an AUROC of 0.949. After recalibration, significant improvements in the average (range, 0.015–0.036) and maximum (range, 0.057–0.172) calibration errors were observed for eight and five of the methods, respectively. A significant improvement in absolute Spiegelhalter's Z (range, 0.809–4.439) was observed for only one method (the recalibration constant). Discrimination was preserved with six methods (AUROC, 0.909–0.949).
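The abstract does not enumerate the eight recalibration methods, but logistic (Platt-style) recalibration is a standard representative of this family of simple score-remapping procedures. A minimal Python sketch under stated assumptions: scores lie in [0, 1], labels are binary NumPy arrays, and all function names (platt_recalibrate, calibration_errors) are hypothetical helpers, not functions from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def platt_recalibrate(p_cal, y_cal, p_val, eps=1e-7):
    """Fit a two-parameter logistic (Platt-style) recalibration on the
    calibration split and remap the validation-split scores."""
    def logit(p):
        p = np.clip(p, eps, 1 - eps)  # avoid log(0) at the score boundaries
        return np.log(p / (1 - p))
    lr = LogisticRegression(C=1e6)  # large C: effectively unregularized fit
    lr.fit(logit(p_cal).reshape(-1, 1), y_cal)
    return lr.predict_proba(logit(p_val).reshape(-1, 1))[:, 1]

def calibration_errors(p, y, n_bins=10):
    """Average and maximum absolute calibration error over equal-width bins.
    One common binned definition; the paper's exact binning is not given."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    errs = [abs(p[bins == b].mean() - y[bins == b].mean())
            for b in range(n_bins) if np.any(bins == b)]
    return float(np.mean(errs)), float(np.max(errs))
```

Fitting on the 681-CR calibration sample and scoring the remaining 454 CRs would then look like p_recal = platt_recalibrate(p_cal, y_cal, p_val), followed by brier_score_loss(y_val, p_recal) and calibration_errors(p_recal, y_val) on the held-out scores.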

Conclusion

The calibration of a DL algorithm can be improved through simple recalibration procedures. Improved calibration may enhance the interpretability and credibility of the model for its users.

Key Points

A deep learning model tended to overestimate the likelihood of abnormality in chest radiographs.

Simple recalibration of the deep learning model using its output scores could improve the calibration of the model while maintaining discrimination (see the evaluation sketch after these key points).

Improved calibration of a deep learning model may enhance the interpretability and credibility of the model for users.
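To make the 1000-times repeated random-split evaluation from the Methods concrete, here is a minimal sketch. It reuses the hypothetical platt_recalibrate helper above; the 681/454 split sizes come from the abstract, while everything else (names, seeding) is illustrative.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

def repeated_split_eval(scores, labels, n_repeats=1000, n_cal=681, seed=0):
    """Repeat the random 681/454 split: recalibrate on the first part,
    then collect Brier score and AUROC on the held-out part."""
    rng = np.random.default_rng(seed)
    briers, aurocs = [], []
    for _ in range(n_repeats):
        idx = rng.permutation(len(scores))
        cal, val = idx[:n_cal], idx[n_cal:]
        p_val = platt_recalibrate(scores[cal], labels[cal], scores[val])
        briers.append(brier_score_loss(labels[val], p_val))
        aurocs.append(roc_auc_score(labels[val], p_val))
    return np.asarray(briers), np.asarray(aurocs)
```

Summarizing the resulting distributions (e.g., their means and percentile intervals) gives split-robust estimates of the performance measures reported above.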


Abbreviations

AUROC: Area under the receiver operating characteristic curve
CI: Confidence interval
CR: Chest radiograph
DL: Deep learning
NPV: Negative predictive value
PPV: Positive predictive value


Funding

This study received funding from the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (grant number: 2019R1A2C1087960); the Seoul National University Hospital Research Fund (grant number: 03-2019-0190); and the Seoul Research & Business Development Program (grant number: FI170002). Neither the funding sources nor Lunit Inc. had any role in the design of the study; the acquisition, analysis, or interpretation of the data; or the preparation of the manuscript.

Author information

Corresponding author

Correspondence to Chang Min Park.

Ethics declarations

Guarantor

The scientific guarantor of this publication is Chang Min Park.

Conflict of interest

The authors of this manuscript declare relationships with Lunit Inc.

Eui Jin Hwang, Hyungjin Kim, Jin Mo Goo, and Chang Min Park report grants from Lunit Inc., outside the present study.

Statistics and biometry

No complex statistical methods were necessary for this paper.

Informed consent

Written informed consent was waived by the institutional review board.

Ethical approval

Institutional review board approval was obtained.

Study subjects or cohorts overlap

The study cohort of the present study (1135/1135 patients) has been reported in a previous study (Hwang EJ, et al. Radiology 293(3):573–580); however, the purpose of that previous study was to evaluate the discrimination of a DL algorithm and its potential to improve physicians' diagnostic performance, which was entirely different from the purpose of the present study.

Methodology

• retrospective

• diagnostic or prognostic study

• performed at one institution

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

ESM 1

(DOCX 28 kb)

About this article

Cite this article

Hwang, E.J., Kim, H., Lee, J.H. et al. Automated identification of chest radiographs with referable abnormality with deep learning: need for recalibration. Eur Radiol 30, 6902–6912 (2020). https://doi.org/10.1007/s00330-020-07062-7
