Abstract
This book addresses the full set of questions that arise when exploiting comparable corpora to overcome the shortage of parallel corpora, a bottleneck for any data-driven machine translation approach, particularly for under-resourced languages and narrow domains. It describes methods and tools for identifying and assessing comparability, for gathering comparable corpora from the Web, and for extracting translation equivalents from comparable texts. It also discusses the evaluation of this pipeline of methods and tools by incorporating their outputs into a machine translation system and assessing its performance in real application settings.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Skadiņa, I., Gaizauskas, R., Vasiļjevs, A., Paramita, M.L. (2019). Introduction. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_1
DOI: https://doi.org/10.1007/978-3-319-99004-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science, Computer Science (R0)