Privacy-Preserving Knowledge Transfer with Bootstrap Aggregation of Teacher Ensembles

Yoon, Hong-Jun; Klasky, Hilda B.; Durbin, Eric B.; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Coyle, Linda; Penberthy, Lynne; Stanley, Christopher; Christian, J. Blair; Tourassi, Georgia D.

doi:10.1007/978-3-030-71055-2_9

Hong-Jun Yoon ORCID: orcid.org/0000-0002-5450-5878¹⁶,
Hilda B. Klasky ORCID: orcid.org/0000-0001-7235-2521¹⁶,
Eric B. Durbin¹⁷,
Xiao-Cheng Wu¹⁸,
Antoinette Stroup¹⁹,
Jennifer Doherty²⁰,
Linda Coyle²¹,
Lynne Penberthy²²,
Christopher Stanley¹⁶,
J. Blair Christian¹⁶ &
…
Georgia D. Tourassi²³

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12633)

Included in the following conference series:

VLDB Workshop on Data Management and Analytics for Medicine and Healthcare
VLDB Workshop on Polystore Systems for Heterogeneous Data in Multiple Databases with Privacy and Security Assurances

Abstract

Institutions and organizations need to transfer knowledge to one another, both to save annotation and labeling effort and to improve task performance. However, knowledge transfer is difficult because of restrictions in place to ensure data security and privacy: institutions are not allowed to exchange data or perform any activity that may expose personal information. Leveraging a differential privacy algorithm in a high-performance computing environment, we propose a new training protocol, Bootstrap Aggregation of Teacher Ensembles (BATE), which is applicable to various types of machine learning models. The BATE algorithm builds on and enhances the PATE (Private Aggregation of Teacher Ensembles) algorithm, maintaining competitive task performance scores on complex datasets with underrepresented class labels.
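The abstract does not spell out how BATE works, so the following is only a minimal sketch of the two ingredients its name points to, not the authors' implementation. It shows the Laplace noisy-max vote aggregation used by PATE-style teacher ensembles, plus a generic bootstrap resampling step (sampling with replacement) of the kind bootstrap aggregation relies on. The function names, the `gamma` privacy parameter, and the toy vote counts are all assumptions made for illustration.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_max_label(teacher_votes, num_classes, gamma, rng):
    """PATE-style aggregation: add Laplace(1/gamma) noise to each
    per-class vote count and return the class with the largest noisy count.
    `gamma` is a hypothetical name for the inverse noise scale."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    counts = Counter(teacher_votes)
    noisy = [counts.get(c, 0) + laplace_noise(1.0 / gamma, rng)
             for c in range(num_classes)]
    return max(range(num_classes), key=noisy.__getitem__)


def bootstrap_sample(records, rng):
    """Generic bootstrap resampling: draw len(records) items with
    replacement. How BATE assigns such samples to teachers is not
    stated in the abstract; this is an illustrative assumption."""
    return [rng.choice(records) for _ in records]


rng = random.Random(0)

# Toy example: 100 hypothetical teachers vote on one unlabeled record.
votes = [2] * 90 + [0] * 6 + [1] * 4
label = noisy_max_label(votes, num_classes=3, gamma=1.0, rng=rng)

# Toy example: one bootstrap replicate of a 10-record training set.
replicate = bootstrap_sample(list(range(10)), rng)
```

When the teachers agree strongly, as in the toy votes above, the added noise rarely changes the winning label. When votes are close, as can happen for underrepresented classes, the noise is more likely to flip the outcome, which is the tension between privacy and task performance that the abstract refers to.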

We conducted a proof-of-concept study of information extraction from cancer pathology reports from four cancer registries and compared four scenarios: no collaboration, collaboration without privacy preservation, the PATE algorithm, and the proposed BATE algorithm. The results showed that the BATE algorithm maintained competitive macro-averaged F1 scores, demonstrating that the proposed algorithm is an effective yet privacy-preserving method for machine learning and deep learning solutions.
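For readers unfamiliar with the metric, the sketch below shows how a macro-averaged F1 score is computed and why it is sensitive to underrepresented classes: each class contributes equally to the average regardless of how many samples it has. The function name and the toy labels are illustrative assumptions, not data or code from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels that
    appear in either the true or the predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class
        # is never predicted and never present.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: a classifier that always predicts the majority class.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8, while the macro-averaged F1 is 4/9 (about 0.44), because the minority class "b" scores an F1 of 0 and counts as much as the majority class.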

This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy (DOE) Office of Science and the National Nuclear Security Administration. This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by DOE and the National Cancer Institute of the National Institutes of Health. This work was performed under the auspices of DOE by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and ORNL under Contract DE-AC05-00OR22725.

KCR data were collected with funding from NCI Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201800013I), the CDC National Program of Cancer Registries (NPCR) (U58DP00003907) and the Commonwealth of Kentucky.

LTR data were collected using funding from NCI and the Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201800007I), the CDC’s National Program of Cancer Registries (NPCR) (NU58DP006332-02-00) as well as the State of Louisiana.

NJSCR data were collected using funding from NCI and the Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201300021I), the CDC’s National Program of Cancer Registries (NPCR) (NU58DP006279-02-00) as well as the State of New Jersey and the Rutgers Cancer Institute of New Jersey.

The Utah Cancer Registry is funded by the National Cancer Institute’s SEER Program, Contract No. HHSN261201800016I, and the US Centers for Disease Control and Prevention’s National Program of Cancer Registries, Cooperative Agreement No. NU58DP0063200, with additional support from the University of Utah and Huntsman Cancer Foundation.

The study was supported by the Laboratory Directed Research and Development (LDRD) program of Oak Ridge National Laboratory, under LDRD project No. 9831.

This research used resources of the Oak Ridge Leadership Computing Facility at ORNL, which is supported by the DOE Office of Science under Contract No. DE-AC05-00OR22725.

Author information

Authors and Affiliations

Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37830, USA
Hong-Jun Yoon, Hilda B. Klasky, Christopher Stanley & J. Blair Christian
College of Medicine, University of Kentucky, Lexington, KY, 40536, USA
Eric B. Durbin
Louisiana Tumor Registry, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, LA, 70112, USA
Xiao-Cheng Wu
New Jersey State Cancer Registry, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, 08901, USA
Antoinette Stroup
Utah Cancer Registry, Huntsman Cancer Institute, University of Utah, Salt Lake City, UT, 84132, USA
Jennifer Doherty
Information Management Services Inc., Calverton, MD, 20705, USA
Linda Coyle
Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD, 20814, USA
Lynne Penberthy
National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, 37830, USA
Georgia D. Tourassi

Corresponding author

Correspondence to Hong-Jun Yoon.

Editor information

Editors and Affiliations

Massachusetts Institute of Technology, Lexington, MA, USA
Vijay Gadepally
Intel Corporation, Portland, OR, USA
Timothy Mattson
Massachusetts Institute of Technology, Cambridge, MA, USA
Michael Stonebraker
Massachusetts Institute of Technology, Cambridge, MA, USA
Tim Kraska
Stony Brook University, Stony Brook, NY, USA
Fusheng Wang
University of Washington, Seattle, WA, USA
Gang Luo
Georgia State University, Atlanta, GA, USA
Jun Kong
Lucerne University of Applied Sciences, Rotkreuz, Switzerland
Alevtina Dubovitskaya

About this paper

Cite this paper

Yoon, HJ., et al. (2021). Privacy-Preserving Knowledge Transfer with Bootstrap Aggregation of Teacher Ensembles. In: Gadepally, V., et al. (eds.) Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH 2020, Poly 2020. Lecture Notes in Computer Science, vol 12633. Springer, Cham. https://doi.org/10.1007/978-3-030-71055-2_9

DOI: https://doi.org/10.1007/978-3-030-71055-2_9
Published: 04 March 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71054-5
Online ISBN: 978-3-030-71055-2
eBook Packages: Computer Science, Computer Science (R0)
