
Knowledge distillation for BERT unsupervised domain adaptation

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

The pre-trained language model BERT has brought significant performance improvements across a range of natural language processing tasks. Because the model is trained on a large corpus covering diverse topics, it is relatively robust to domain shift, in which the data distributions at training time (source data) and test time (target data) differ while sharing similarities. Despite its large improvements over previous models, however, BERT still suffers from performance degradation under domain shift. To mitigate this problem, we propose a simple but effective unsupervised domain adaptation method, adversarial adaptation with distillation (AAD), which combines the adversarial discriminative domain adaptation (ADDA) framework with knowledge distillation. We evaluate our approach on the task of cross-domain sentiment classification over 30 domain pairs, advancing the state-of-the-art performance for unsupervised domain adaptation in text sentiment classification.
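As a reading aid, the following is a minimal PyTorch sketch of how an ADDA-style adversarial objective can be combined with a knowledge distillation term in the way the abstract describes. The names (src_encoder, tgt_encoder, classifier, discriminator), the temperature T, and the choice to distill on source-domain batches are assumptions made for illustration only; the authors' actual implementation is linked in the notes below.

    import torch
    import torch.nn.functional as F

    def aad_losses(src_encoder, tgt_encoder, classifier, discriminator,
                   src_batch, tgt_batch, T=2.0):
        # Source features come from the frozen source encoder (the teacher).
        with torch.no_grad():
            src_feat = src_encoder(src_batch)
        tgt_feat = tgt_encoder(tgt_batch)

        # Discriminator loss: label source features 1 and target features 0.
        d_src = discriminator(src_feat)
        d_tgt = discriminator(tgt_feat.detach())
        disc_loss = (F.binary_cross_entropy(d_src, torch.ones_like(d_src))
                     + F.binary_cross_entropy(d_tgt, torch.zeros_like(d_tgt)))

        # Target-encoder loss: fool the discriminator, as in ADDA.
        gen_loss = F.binary_cross_entropy(discriminator(tgt_feat),
                                          torch.ones_like(d_tgt))

        # Knowledge distillation (see note 1): keep the adapted encoder's
        # softened predictions on source data close to those of the frozen
        # source model, measured by KL divergence.
        with torch.no_grad():
            teacher_probs = F.softmax(classifier(src_feat) / T, dim=-1)
        student_log_probs = F.log_softmax(
            classifier(tgt_encoder(src_batch)) / T, dim=-1)
        kd_loss = F.kl_div(student_log_probs, teacher_probs,
                           reduction="batchmean") * (T * T)

        return disc_loss, gen_loss + kd_loss

In an actual training loop, disc_loss would update only the discriminator and gen_loss + kd_loss only the target encoder, alternating the two steps while the source encoder and classifier stay frozen.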


Notes

  1. Given probability distributions p and q, the KL divergence \(\mathrm{KL}(p \parallel q)\) is defined as \(-\sum _j{p_j\log (q_j/p_j)}\).

  2. https://github.com/huggingface/transformers.

  3. linear(i, o) denotes a fully connected layer with an \(i \times o\) weight matrix, leakyReLU(\(\alpha \)) denotes a leaky ReLU activation layer with negative slope \(\alpha \), and sigmoid() denotes a sigmoid activation layer; a minimal PyTorch sketch of this notation follows the notes below.

  4. https://github.com/bzantium/bert-AAD.
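To make the layer notation in note 3 concrete, here is a minimal PyTorch sketch of a discriminator written in that notation, e.g. linear(768, 3072) -> leakyReLU(0.2) -> linear(3072, 3072) -> leakyReLU(0.2) -> linear(3072, 1) -> sigmoid(). The input width 768 matches the BERT-base hidden size, but the hidden width 3072 and the slope 0.2 are illustrative assumptions, not values taken from the paper.

    import torch.nn as nn

    def make_discriminator(in_dim=768, hidden=3072, slope=0.2):
        # linear(in_dim, hidden) -> leakyReLU(slope) ->
        # linear(hidden, hidden) -> leakyReLU(slope) ->
        # linear(hidden, 1) -> sigmoid(); widths and slope are illustrative.
        return nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LeakyReLU(slope),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(slope),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )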

References

  1. Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 conference on empirical methods in natural language processing, Association for Computational Linguistics, Sydney, Australia, pp 120–128, https://www.aclweb.org/anthology/W06-1615

  2. Blitzer J, Dredze M, Pereira F (2007) Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: Proceedings of the 45th annual meeting of the association of computational linguistics, Association for Computational Linguistics, Prague, Czech Republic, pp 440–447, https://www.aclweb.org/anthology/P07-1056

  3. Chadha A, Andreopoulos Y (2018) Improving adversarial discriminative domain adaptation. arXiv:1809.03625

  4. Devlin J, Chang M, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805, http://arxiv.org/abs/1810.04805, arxiv:1810.04805

  5. Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, March M, Lempitsky V (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(59):1–35, http://jmlr.org/papers/v17/15-239.html

  6. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems 27, Curran Associates, Inc., pp 2672–2680, http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

  7. Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: a survey. Int J Comput Vis 129(6):1789–1819


  8. Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. In: NIPS deep learning and representation learning workshop, http://arxiv.org/abs/1503.02531

  9. Hoffman J, Tzeng E, Park T, Zhu JY, Isola P, Saenko K, Efros AA, Darrell T (2018) CyCADA: cycle-consistent adversarial domain adaptation. In: Dy JG, Krause A (eds) ICML, PMLR, vol 80, pp 1994–2003, http://dblp.uni-trier.de/db/conf/icml/icml2018.html#HoffmanTPZISED18

  10. Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2019) SpanBERT: improving pre-training by representing and predicting spans. CoRR abs/1907.10529, http://arxiv.org/abs/1907.10529, arxiv:1907.10529

  11. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arxiv:1412.6980

  12. Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R (2016) Overcoming catastrophic forgetting in neural networks. arxiv:1612.00796

  13. Kumar A, Sattigeri P, Fletcher PT (2017) Improved semi-supervised learning with GANs using manifold invariances. CoRR abs/1705.08850, http://arxiv.org/abs/1705.08850, arxiv:1705.08850

  14. Liu MY, Tuzel O (2016) Coupled generative adversarial networks. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems 29, Curran Associates, Inc., pp 469–477, http://papers.nips.cc/paper/6544-coupled-generative-adversarial-networks.pdf

  15. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692, http://arxiv.org/abs/1907.11692, arxiv:1907.11692

  16. Long M, Cao Y, Wang J, Jordan MI (2015) Learning transferable features with deep adaptation networks. In: Proceedings of the 32nd international conference on international conference on machine learning, vol 37, JMLR org, ICML’15, pp 97–105, http://dl.acm.org/citation.cfm?id=3045118.3045130

  17. Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Association for Computational Linguistics, Portland, Oregon, USA, pp 142–150, http://www.aclweb.org/anthology/P11-1015

  18. Nguyen Q (2015) The airline review dataset. Scraped from www.airlinequality.com, https://github.com/quankiquanki/skytrax-reviews-dataset

  19. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in PyTorch. In: NIPS autodiff workshop

  20. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. CoRR abs/1802.05365, http://arxiv.org/abs/1802.05365, arxiv:1802.05365

  21. Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training. OpenAI. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf

  22. Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: NeurIPS EMC2 workshop

  23. Sun B, Saenko K (2016) Deep CORAL: correlation alignment for deep domain adaptation. CoRR abs/1607.01719, http://arxiv.org/abs/1607.01719, arxiv:1607.01719

  24. Sun B, Feng J, Saenko K (2016) Return of frustratingly easy domain adaptation. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, AAAI Press, AAAI’16, pp 2058–2065, http://dl.acm.org/citation.cfm?id=3016100.3016186

  25. Tzeng E, Hoffman J, Zhang N, Saenko K, Darrell T (2014) Deep domain confusion: maximizing for domain invariance. CoRR abs/1412.3474, http://arxiv.org/abs/1412.3474, arxiv:1412.3474

  26. Tzeng E, Hoffman J, Saenko K, Darrell T (2017) Adversarial discriminative domain adaptation. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 2962–2971

  27. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Lu, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 5998–6008, http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

  28. Wang L, Yoon K-J (2021) Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. IEEE Trans Pattern Anal Mach Intell 44(6):3048–3068. https://doi.org/10.1109/TPAMI.2021.3055564


  29. Weng R, Yu H, Huang S, Cheng S, Luo W (2020) Acquiring knowledge from pre-trained model to neural machine translation. Proceedings of the AAAI conference on artificial intelligence 34:9266–9273

  30. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Brew J (2019) HuggingFace's Transformers: state-of-the-art natural language processing. arxiv:1910.03771

  31. Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237, http://arxiv.org/abs/1906.08237, arxiv:1906.08237

  32. Ziser Y, Reichart R (2017) Neural structural correspondence learning for domain adaptation. In: Proceedings of the 21st conference on computational natural language learning (CoNLL 2017), Association for Computational Linguistics, Vancouver, Canada, pp 400–410, 10.18653/v1/K17-1040, https://www.aclweb.org/anthology/K17-1040

  33. Ziser Y, Reichart R (2018) Pivot based language modeling for improved neural domain adaptation. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, vol 1 (long papers), Association for Computational Linguistics, New Orleans, Louisiana, pp 1241–1251, 10.18653/v1/N18-1112, https://www.aclweb.org/anthology/N18-1112


Acknowledgements

This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2020R1F1A1076278). This work was also supported by the “Human Resources Program in Energy Technology” of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), with financial resources granted by the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20204010600090).

Author information

Corresponding author

Correspondence to Kichun Lee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was done while the author was a graduate student in the Department of Industrial Engineering, Hanyang University.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ryu, M., Lee, G. & Lee, K. Knowledge distillation for BERT unsupervised domain adaptation. Knowl Inf Syst 64, 3113–3128 (2022). https://doi.org/10.1007/s10115-022-01736-y


