Introduction

Skadiņa, Inguna; Gaizauskas, Robert; Vasiļjevs, Andrejs; Paramita, Monica Lestari

doi:10.1007/978-3-319-99004-0_1

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

This book addresses the full set of questions that arise when attempting to exploit comparable corpora to overcome the bottleneck of insufficient parallel corpora that affects any data-driven machine translation approach, particularly in relation to under-resourced languages and narrow domains. It describes methods and tools for identifying and assessing comparability, for gathering comparable corpora from the Web, and for extracting translation equivalents from comparable texts. It also discusses the evaluation of this pipeline of methods and tools, carried out by incorporating their outputs into a machine translation system and assessing its performance in real application settings.

Author information

Authors and Affiliations

Tilde, Riga, Latvia
Inguna Skadiņa & Andrejs Vasiļjevs
University of Sheffield, Sheffield, UK
Robert Gaizauskas & Monica Lestari Paramita

Corresponding author

Correspondence to Inguna Skadiņa.

Editor information

Editors and Affiliations

Tilde, Riga, Latvia
Inguna Skadiņa
Department of Computer Science, University of Sheffield, Sheffield, UK
Robert Gaizauskas
School of Modern Languages & Cultures, University of Leeds, Leeds, UK
Bogdan Babych
Faculty of Humanities & Social Sciences, University of Zagreb, Zagreb, Croatia
Nikola Ljubešić
Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
Dan Tufiş
Tilde, Riga, Latvia
Andrejs Vasiļjevs

About this chapter

Cite this chapter

Skadiņa, I., Gaizauskas, R., Vasiļjevs, A., Paramita, M.L. (2019). Introduction. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_1

DOI: https://doi.org/10.1007/978-3-319-99004-0_1
Published: 07 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science (R0)
