Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation

Lu, Shixiang; Peng, Xingyuan; Chen, Zhenbiao; Xu, Bo

doi:10.1007/978-3-642-41644-6_12

Part of the book series: Communications in Computer and Information Science (CCIS, volume 400)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

This paper explores efficient domain adaptation for statistical machine translation by extracting sentence pairs from a large-scale general-domain web bilingual corpus. The extracted pairs form pseudo in-domain subcorpora, that is, the parts of the general-domain corpus most relevant to the in-domain corpora. The sentences are selected by our proposed unsupervised phrase-based data selection model. Compared with traditional bag-of-words models, the phrase-based model is more effective because it captures contextual information by modeling the selection of a phrase as a whole rather than the selection of single words in isolation. The pseudo in-domain subcorpora can then be used to train a small domain-adapted spoken language translation system that outperforms the system trained on the entire corpus by 1.6 BLEU points. Performance improves further when the pseudo in-domain corpus/model is combined with the true in-domain corpus/model, with gains of 4.5 and 3.9 BLEU points over the single in-domain and general-domain baseline systems, respectively.
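
The general idea of pseudo in-domain selection can be illustrated with a minimal sketch. This is not the phrase-based model proposed in the paper, whose details are not given in the abstract. Instead it swaps in a simpler word-level criterion, the cross-entropy difference between an in-domain and a general-domain language model, here with add-one smoothed unigram models. The function names, the toy corpora, and the `keep` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return token counts and the total token count of a tokenized corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())


def cross_entropy(sentence, counts, total, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model."""
    if not sentence:
        return 0.0
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in sentence
    )
    return -log_prob / len(sentence)


def select_pseudo_in_domain(in_domain, general, keep):
    """Rank general-domain sentences by H_in - H_gen and keep the `keep` lowest."""
    in_counts, in_total = train_unigram(in_domain)
    gen_counts, gen_total = train_unigram(general)
    # Shared vocabulary, plus one slot reserved for unseen tokens.
    vocab_size = len(set(in_counts) | set(gen_counts)) + 1

    def score(sent):
        h_in = cross_entropy(sent, in_counts, in_total, vocab_size)
        h_gen = cross_entropy(sent, gen_counts, gen_total, vocab_size)
        return h_in - h_gen  # lower means "more in-domain than general"

    return sorted(general, key=score)[:keep]


# Toy data (hypothetical): a travel-style in-domain set and a mixed general set.
in_domain = [
    ["where", "is", "the", "station"],
    ["how", "much", "is", "the", "ticket"],
]
general = [
    ["where", "is", "the", "hotel"],
    ["the", "parliament", "adopted", "the", "resolution"],
    ["stock", "markets", "fell", "sharply"],
]
selected = select_pseudo_in_domain(in_domain, general, keep=1)
print(selected)
```

In this toy setup the travel-style sentence ranks first. A word-level score like this one treats tokens independently, which is the limitation the abstract attributes to bag-of-words selection; a phrase-based selector would score multi-word units instead.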

Author information

Authors and Affiliations

Interactive Digital Media Technology Research Center (IDMTech), Institute of Automation, Chinese Academy of Sciences, Beijing, China
Shixiang Lu, Xingyuan Peng, Zhenbiao Chen & Bo Xu

Editor information

Editors and Affiliations

Soochow University, 1 Shizi Street, 215006, Suzhou, China
Guodong Zhou
Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Juanzi Li
Institute of Computer Science & Technology, Peking University, 100871, Beijing, China
Dongyan Zhao & Yansong Feng

About this paper

Cite this paper

Lu, S., Peng, X., Chen, Z., Xu, B. (2013). Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation. In: Zhou, G., Li, J., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2013. Communications in Computer and Information Science, vol 400. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41644-6_12

DOI: https://doi.org/10.1007/978-3-642-41644-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41643-9
Online ISBN: 978-3-642-41644-6
eBook Packages: Computer Science, Computer Science (R0)
