
Knowledge distillation for BERT unsupervised domain adaptation

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

The pre-trained language model BERT has brought significant performance improvements across a range of natural language processing tasks. Because the model is trained on a large corpus covering diverse topics, it is relatively robust to domain shift, in which the data distributions at training time (source data) and test time (target data) differ while sharing similarities. Despite its large improvements over previous models, however, BERT still suffers from performance degradation under domain shift. To mitigate this problem, we propose a simple but effective unsupervised domain adaptation method, adversarial adaptation with distillation (AAD), which combines the adversarial discriminative domain adaptation (ADDA) framework with knowledge distillation. We evaluate our approach on the task of cross-domain sentiment classification over 30 domain pairs, advancing the state-of-the-art performance for unsupervised domain adaptation in text sentiment classification.
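As a reading aid, the following is a minimal PyTorch sketch of how an ADDA-style adversarial objective can be combined with a knowledge distillation term in the way the abstract describes. The names (src_encoder, tgt_encoder, classifier, discriminator), the temperature T, and the choice to distill on source-domain batches are assumptions made for illustration only; the authors' actual implementation is linked in the notes below.

    import torch
    import torch.nn.functional as F

    def aad_losses(src_encoder, tgt_encoder, classifier, discriminator,
                   src_batch, tgt_batch, T=2.0):
        # Source features come from the frozen source encoder (the teacher).
        with torch.no_grad():
            src_feat = src_encoder(src_batch)
        tgt_feat = tgt_encoder(tgt_batch)

        # Discriminator loss: label source features 1 and target features 0.
        d_src = discriminator(src_feat)
        d_tgt = discriminator(tgt_feat.detach())
        disc_loss = (F.binary_cross_entropy(d_src, torch.ones_like(d_src))
                     + F.binary_cross_entropy(d_tgt, torch.zeros_like(d_tgt)))

        # Target-encoder loss: fool the discriminator, as in ADDA.
        gen_loss = F.binary_cross_entropy(discriminator(tgt_feat),
                                          torch.ones_like(d_tgt))

        # Knowledge distillation (see note 1): keep the adapted encoder's
        # softened predictions on source data close to those of the frozen
        # source model, measured by KL divergence.
        with torch.no_grad():
            teacher_probs = F.softmax(classifier(src_feat) / T, dim=-1)
        student_log_probs = F.log_softmax(
            classifier(tgt_encoder(src_batch)) / T, dim=-1)
        kd_loss = F.kl_div(student_log_probs, teacher_probs,
                           reduction="batchmean") * (T * T)

        return disc_loss, gen_loss + kd_loss

In an actual training loop, disc_loss would update only the discriminator and gen_loss + kd_loss only the target encoder, alternating the two steps while the source encoder and classifier stay frozen.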


Notes

  1. Given probability distributions p and q, the KL divergence \(\mathrm{KL}(p \parallel q)\) is defined as \(-\sum _j{p_j\log (q_j/p_j)}\).

  2. https://github.com/huggingface/transformers.

  3. linear(i, o) denotes a fully connected layer with an \(i \times o\) weight matrix, leakyReLU(\(\alpha \)) denotes a leaky ReLU activation layer with negative slope \(\alpha \), and sigmoid() denotes a sigmoid activation layer; a minimal PyTorch sketch of this notation follows the notes below.

  4. https://github.com/bzantium/bert-AAD.
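To make the layer notation in note 3 concrete, here is a minimal PyTorch sketch of a discriminator written in that notation, e.g. linear(768, 3072) -> leakyReLU(0.2) -> linear(3072, 3072) -> leakyReLU(0.2) -> linear(3072, 1) -> sigmoid(). The input width 768 matches the BERT-base hidden size, but the hidden width 3072 and the slope 0.2 are illustrative assumptions, not values taken from the paper.

    import torch.nn as nn

    def make_discriminator(in_dim=768, hidden=3072, slope=0.2):
        # linear(in_dim, hidden) -> leakyReLU(slope) ->
        # linear(hidden, hidden) -> leakyReLU(slope) ->
        # linear(hidden, 1) -> sigmoid(); widths and slope are illustrative.
        return nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LeakyReLU(slope),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(slope),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )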

References

  1. Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 conference on empirical methods in natural language processing, Association for Computational Linguistics, Sydney, Australia, pp 120–128, https://www.aclweb.org/anthology/W06-1615

  2. Blitzer J, Dredze M, Pereira F (2007) Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: Proceedings of the 45th annual meeting of the association of computational linguistics, Association for Computational Linguistics, Prague, Czech Republic, pp 440–447, https://www.aclweb.org/anthology/P07-1056

  3. Chadha A, Andreopoulos Y (2018) Improving adversarial discriminative domain adaptation. arXiv:1809.03625

  4. Devlin J, Chang M, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805, http://arxiv.org/abs/1810.04805, arxiv:1810.04805

  5. Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, March M, Lempitsky V (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(59):1–35, http://jmlr.org/papers/v17/15-239.html

  6. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems 27, Curran Associates, Inc., pp 2672–2680, http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

  7. Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: a survey. Int J Comput Vis 129(6):1789–1819


  8. Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. In: NIPS deep learning and representation learning workshop, http://arxiv.org/abs/1503.02531

  9. Hoffman J, Tzeng E, Park T, Zhu JY, Isola P, Saenko K, Efros AA, Darrell T (2018) CyCADA: cycle-consistent adversarial domain adaptation. In: Dy JG, Krause A (eds) ICML, PMLR, vol 80, pp 1994–2003, http://dblp.uni-trier.de/db/conf/icml/icml2018.html#HoffmanTPZISED18

  10. Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2019) SpanBERT: improving pre-training by representing and predicting spans. CoRR abs/1907.10529, http://arxiv.org/abs/1907.10529, arxiv:1907.10529

  11. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arxiv:1412.6980

  12. Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R (2016) Overcoming catastrophic forgetting in neural networks. arxiv:1612.00796

  13. Kumar A, Sattigeri P, Fletcher PT (2017) Improved semi-supervised learning with GANs using manifold invariances. CoRR abs/1705.08850, http://arxiv.org/abs/1705.08850, arxiv:1705.08850

  14. Liu MY, Tuzel O (2016) Coupled generative adversarial networks. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems 29, Curran Associates, Inc., pp 469–477, http://papers.nips.cc/paper/6544-coupled-generative-adversarial-networks.pdf

  15. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692, http://arxiv.org/abs/1907.11692, arxiv:1907.11692

  16. Long M, Cao Y, Wang J, Jordan MI (2015) Learning transferable features with deep adaptation networks. In: Proceedings of the 32nd international conference on international conference on machine learning, vol 37, JMLR org, ICML’15, pp 97–105, http://dl.acm.org/citation.cfm?id=3045118.3045130

  17. Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Association for Computational Linguistics, Portland, Oregon, USA, pp 142–150, http://www.aclweb.org/anthology/P11-1015

  18. Nguyen Q (2015) The airline review dataset. Scraped from www.airlinequality.com, https://github.com/quankiquanki/skytrax-reviews-dataset

  19. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in PyTorch. In: NIPS autodiff workshop

  20. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. CoRR abs/1802.05365, http://arxiv.org/abs/1802.05365, arxiv:1802.05365

  21. Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training. OpenAI. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf

  22. Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: NeurIPS EMC2 workshop

  23. Sun B, Saenko K (2016) Deep CORAL: correlation alignment for deep domain adaptation. CoRR abs/1607.01719, http://arxiv.org/abs/1607.01719, arxiv:1607.01719

  24. Sun B, Feng J, Saenko K (2016) Return of frustratingly easy domain adaptation. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, AAAI Press, AAAI’16, pp 2058–2065, http://dl.acm.org/citation.cfm?id=3016100.3016186

  25. Tzeng E, Hoffman J, Zhang N, Saenko K, Darrell T (2014) Deep domain confusion: maximizing for domain invariance. CoRR abs/1412.3474, http://arxiv.org/abs/1412.3474, arxiv:1412.3474

  26. Tzeng E, Hoffman J, Saenko K, Darrell T (2017) Adversarial discriminative domain adaptation. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 2962–2971

  27. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Lu, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 5998–6008, http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

  28. Wang L, Yoon K-J (2021) Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. IEEE Trans Pattern Anal Mach Intell 44(6):3048–3068. https://doi.org/10.1109/TPAMI.2021.3055564


  29. Weng R, Yu H, Huang S, Cheng S, Luo W (2020) Acquiring knowledge from pre-trained model to neural machine translation. Proceedings of the AAAI conference on artificial intelligence 34:9266–9273

  30. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Brew J (2019) HuggingFace's Transformers: state-of-the-art natural language processing. arxiv:1910.03771

  31. Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237, http://arxiv.org/abs/1906.08237, arxiv:1906.08237

  32. Ziser Y, Reichart R (2017) Neural structural correspondence learning for domain adaptation. In: Proceedings of the 21st conference on computational natural language learning (CoNLL 2017), Association for Computational Linguistics, Vancouver, Canada, pp 400–410, 10.18653/v1/K17-1040, https://www.aclweb.org/anthology/K17-1040

  33. Ziser Y, Reichart R (2018) Pivot based language modeling for improved neural domain adaptation. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, vol 1 (long papers), Association for Computational Linguistics, New Orleans, Louisiana, pp 1241–1251, 10.18653/v1/N18-1112, https://www.aclweb.org/anthology/N18-1112


Acknowledgements

This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2020R1F1A1076278). This work was also supported by the “Human Resources Program in Energy Technology” of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), with financial resources granted by the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20204010600090).

Author information

Corresponding author

Correspondence to Kichun Lee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was done while the author was a graduate student in the Department of Industrial Engineering, Hanyang University.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ryu, M., Lee, G. & Lee, K. Knowledge distillation for BERT unsupervised domain adaptation. Knowl Inf Syst 64, 3113–3128 (2022). https://doi.org/10.1007/s10115-022-01736-y


