Towards Similar User Utterance Augmentation for Out-of-Domain Detection

Azpeitia, Andoni; Serras, Manex; García-Sardiña, Laura; Fernández-Bhogal, Mikel D.; del Pozo, Arantza

doi:10.1007/978-981-15-8395-7_22

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 704)

Abstract

Data scarcity is a common issue when developing Dialogue Systems from scratch, since suitable dialogue data is difficult to find. The problem is more likely to arise when the system’s language is not English. This paper proposes a first text augmentation approach that selects samples similar to annotated user utterances from existing corpora, even if those corpora differ in style, domain or content, in order to improve the detection of Out-of-Domain (OOD) user inputs. Three sampling methods based on word vectors extracted from the BERT language representation model are compared. The evaluation uses a Spanish chatbot corpus for OOD utterance detection, artificially reduced to simulate scenarios with different amounts of data. The approach is shown to enhance the detection of OOD user utterances, with greater improvements when less annotated data is available.
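The abstract does not spell out the three sampling methods, so the following is only a minimal sketch of the general idea of similarity-based sample selection, not the authors' procedure. It assumes each utterance has already been mapped to a fixed-size vector (for instance by pooling BERT word vectors, a step not shown here), and the names `cosine` and `select_similar` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_similar(
    seed_vectors: List[Sequence[float]],
    candidates: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[str]:
    """Return the k candidate texts closest to any annotated seed utterance.

    Assumption: a candidate's score is its maximum cosine similarity to the
    seed vectors; the paper may score and sample differently.
    """
    scored = []
    for text, vector in candidates:
        score = max(cosine(vector, seed) for seed in seed_vectors)
        scored.append((score, text))
    # Highest similarity first; keep only the top k texts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real utterance embeddings.
    seeds = [[1.0, 0.0]]
    pool = [("a", [1.0, 0.1]), ("b", [0.0, 1.0]), ("c", [0.9, 0.0])]
    print(select_similar(seeds, pool, k=2))
```

In a setting like the one described, the candidate pool would come from an external corpus, and the selected texts would be added to the training data of the OOD detector.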

Notes

1. www.opensubtitles.org: a website hosting subtitles for a large number of movies in many different languages.

Acknowledgments

This work has been partially supported by the HAZITEK program (CONTACT ZL-2020/00237) of the Economic Development and Infrastructure department of the Basque Government.

Author information

Authors and Affiliations

Vicomtech, Parque Científico y Tecnológico de Gipuzkoa, Paseo Mikeletegi 57, Donostia/San Sebastián, Spain
Andoni Azpeitia, Manex Serras, Laura García-Sardiña, Mikel D. Fernández-Bhogal & Arantza del Pozo

Corresponding author

Correspondence to Andoni Azpeitia.

Editor information

Editors and Affiliations

Speech Technology Group - Information Processing and Telecommunications Center (IPTC), Universidad Politécnica de Madrid, Madrid, Spain
Luis Fernando D'Haro
Department of Languages and Computer Systems, Universidad de Granada, CITIC-UGR, Granada, Spain
Zoraida Callejas
Information Science, Nara Institute of Science and Technology, Ikoma, Japan
Satoshi Nakamura

About this chapter

Cite this chapter

Azpeitia, A., Serras, M., García-Sardiña, L., Fernández-Bhogal, M.D., del Pozo, A. (2021). Towards Similar User Utterance Augmentation for Out-of-Domain Detection. In: D'Haro, L.F., Callejas, Z., Nakamura, S. (eds) Conversational Dialogue Systems for the Next Decade. Lecture Notes in Electrical Engineering, vol 704. Springer, Singapore. https://doi.org/10.1007/978-981-15-8395-7_22

DOI: https://doi.org/10.1007/978-981-15-8395-7_22
Published: 25 October 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-8394-0
Online ISBN: 978-981-15-8395-7
eBook Packages: Engineering, Engineering (R0)