Abstract
A pre-trained language model, BERT, has brought significant performance improvements across a range of natural language processing tasks. Since the model is trained on a large corpus of diverse topics, it shows robust performance for domain shift problems in which data distributions at training (source data) and testing (target data) differ while sharing similarities. Despite its great improvements compared to previous models, it still suffers from performance degradation due to domain shifts. To mitigate such problems, we propose a simple but effective unsupervised domain adaptation method, adversarial adaptation with distillation (AAD), which combines the adversarial discriminative domain adaptation (ADDA) framework with knowledge distillation. We evaluate our approach in the task of cross-domain sentiment classification on 30 domain pairs, advancing the state-of-the-art performance for unsupervised domain adaptation in text sentiment classification.
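The two training signals described above, an adversarial loss that aligns target features with source features and a distillation loss against a source-trained teacher, can be sketched as follows. This is a minimal illustration of the general AAD idea, not the authors' implementation; the temperature value, label convention, and loss shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs.

    The temperature (2.0 here, an illustrative choice) smooths both
    distributions so the student also learns from the teacher's
    non-argmax probabilities ("dark knowledge").
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

def adversarial_encoder_loss(discriminator, target_features):
    """ADDA-style encoder objective: make target features look like source.

    The target encoder is updated to fool the domain discriminator, i.e.
    its features are pushed toward the discriminator's 'source' label
    (taken to be 1 by convention in this sketch).
    """
    preds = discriminator(target_features)
    source_labels = torch.ones_like(preds)
    return F.binary_cross_entropy(preds, source_labels)
```

In an AAD-style loop, the discriminator would alternately be trained to separate source from target features, while the target encoder minimizes the adversarial loss plus the distillation term.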
Notes
Given probability distributions p and q, KL divergence of q from p is defined to be \(KL(p \parallel q)=-\sum _j{p_j\log (q_j/p_j)}\).
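As a quick numerical check of the definition above (the example distributions are arbitrary):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = -sum_j p_j * log(q_j / p_j), per the definition above.

    Terms with p_j = 0 contribute nothing and are skipped.
    """
    return -sum(pj * math.log(qj / pj) for pj, qj in zip(p, q) if pj > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# KL is zero when the distributions are identical and positive otherwise.
```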
linear(i, o) stands for a fully connected layer with a matrix of \(i \times o\), and leakyReLU(\(\alpha \)) represents the leakyReLU activation layer with negative slope \(\alpha \). sigmoid() represents a sigmoid activation layer.
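In PyTorch, the notation above maps onto standard modules as in the sketch below. The layer sizes (768 in, 3072 hidden, 1 out) and the slope 0.2 are illustrative assumptions, not values taken from the paper:

```python
import torch
import torch.nn as nn

# linear(768, 3072) -> leakyReLU(0.2) -> linear(3072, 1) -> sigmoid()
discriminator = nn.Sequential(
    nn.Linear(768, 3072),   # linear(i, o): fully connected, i x o weight matrix
    nn.LeakyReLU(0.2),      # leakyReLU(alpha): negative slope alpha
    nn.Linear(3072, 1),
    nn.Sigmoid(),           # sigmoid(): squashes output into (0, 1)
)
```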
References
Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 conference on empirical methods in natural language processing, Association for Computational Linguistics, Sydney, Australia, pp 120–128, https://www.aclweb.org/anthology/W06-1615
Blitzer J, Dredze M, Pereira F (2007) Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: Proceedings of the 45th annual meeting of the association of computational linguistics, Association for Computational Linguistics, Prague, Czech Republic, pp 440–447, https://www.aclweb.org/anthology/P07-1056
Chadha A, Andreopoulos Y (2018) Improving adversarial discriminative domain adaptation. arXiv:1809.03625
Devlin J, Chang M, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, March M, Lempitsky V (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(59):1–35, http://jmlr.org/papers/v17/15-239.html
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems 27, Curran Associates, Inc., pp 2672–2680, http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: a survey. Int J Comput Vis 129(6):1789–1819
Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. In: NIPS deep learning and representation learning workshop, http://arxiv.org/abs/1503.02531
Hoffman J, Tzeng E, Park T, Zhu JY, Isola P, Saenko K, Efros AA, Darrell T (2018) CyCADA: cycle-consistent adversarial domain adaptation. In: Dy JG, Krause A (eds) ICML, PMLR, vol 80, pp 1994–2003, http://dblp.uni-trier.de/db/conf/icml/icml2018.html#HoffmanTPZISED18
Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2019) SpanBERT: improving pre-training by representing and predicting spans. arXiv:1907.10529
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R (2016) Overcoming catastrophic forgetting in neural networks. arXiv:1612.00796
Kumar A, Sattigeri P, Fletcher PT (2017) Improved semi-supervised learning with GANs using manifold invariances. arXiv:1705.08850
Liu MY, Tuzel O (2016) Coupled generative adversarial networks. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems 29, Curran Associates, Inc., pp 469–477, http://papers.nips.cc/paper/6544-coupled-generative-adversarial-networks.pdf
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692
Long M, Cao Y, Wang J, Jordan MI (2015) Learning transferable features with deep adaptation networks. In: Proceedings of the 32nd international conference on international conference on machine learning, vol 37, JMLR org, ICML’15, pp 97–105, http://dl.acm.org/citation.cfm?id=3045118.3045130
Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Association for Computational Linguistics, Portland, Oregon, USA, pp 142–150, http://www.aclweb.org/anthology/P11-1015
Nguyen Q (2015) The airline review dataset. Scraped from www.airlinequality.com, https://github.com/quankiquanki/skytrax-reviews-dataset
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in PyTorch. In: NIPS autodiff workshop
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. arXiv:1802.05365
Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training. OpenAI. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: NeurIPS EMC2 workshop
Sun B, Saenko K (2016) Deep CORAL: correlation alignment for deep domain adaptation. arXiv:1607.01719
Sun B, Feng J, Saenko K (2016) Return of frustratingly easy domain adaptation. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, AAAI Press, AAAI’16, pp 2058–2065, http://dl.acm.org/citation.cfm?id=3016100.3016186
Tzeng E, Hoffman J, Zhang N, Saenko K, Darrell T (2014) Deep domain confusion: maximizing for domain invariance. arXiv:1412.3474
Tzeng E, Hoffman J, Saenko K, Darrell T (2017) Adversarial discriminative domain adaptation. 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 2962–2971
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Lu, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 5998–6008, http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Wang L, Yoon K-J (2021) Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. IEEE Trans Pattern Anal Mach Intell 44(6):3048–3068. https://doi.org/10.1109/TPAMI.2021.3055564
Weng R, Yu H, Huang S, Cheng S, Luo W (2020) Acquiring knowledge from pre-trained model to neural machine translation. Proceedings of the AAAI conference on artificial intelligence 34:9266–9273
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Brew J (2019) HuggingFace's Transformers: state-of-the-art natural language processing. arXiv:1910.03771
Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237
Ziser Y, Reichart R (2017) Neural structural correspondence learning for domain adaptation. In: Proceedings of the 21st conference on computational natural language learning (CoNLL 2017), Association for Computational Linguistics, Vancouver, Canada, pp 400–410, https://doi.org/10.18653/v1/K17-1040
Ziser Y, Reichart R (2018) Pivot based language modeling for improved neural domain adaptation. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, vol 1 (long papers), Association for Computational Linguistics, New Orleans, Louisiana, pp 1241–1251, https://doi.org/10.18653/v1/N18-1112
Acknowledgements
This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2020R1F1A1076278). This work was also supported by the "Human Resources Program in Energy Technology" of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), with financial resources granted by the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20204010600090).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was done while the author was a graduate student in the Department of Industrial Engineering, Hanyang University.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ryu, M., Lee, G. & Lee, K. Knowledge distillation for BERT unsupervised domain adaptation. Knowl Inf Syst 64, 3113–3128 (2022). https://doi.org/10.1007/s10115-022-01736-y