The hare and the tortoise: speed and accuracy in translation retrieval

Baldwin, Timothy

doi:10.1007/s10590-009-9064-7

Published: 26 February 2010

Volume 23, pages 195–240 (2009)
Machine Translation

Timothy Baldwin¹

Abstract

This research looks at the effects of segment order and segmentation on translation retrieval performance for an experimental Japanese–English translation memory system. We implement a number of both bag-of-words and segment-order-sensitive string comparison methods, and test each over character-based and word-based indexing using n-grams of various orders. To evaluate accuracy, we propose an automatic method which identifies the target-language string(s) which would lead to the optimal translation for a given input, based on analysis of the held-out translation and the current contents of the translation memory. Our results indicate that character-based indexing is superior to word-based indexing, and also that bag-of-words methods are equivalent to segment-order-sensitive methods in terms of accuracy but vastly superior in terms of retrieval speed, suggesting that word segmentation and segment-order sensitivity are unnecessary luxuries for translation retrieval.
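
To make the contrast in the abstract concrete, the following is a minimal sketch of the two families of string comparison it names, applied to character-based indexing. It is not the paper's implementation: the Dice coefficient over character n-gram multisets stands in for a bag-of-words method, character-level edit distance stands in for a segment-order-sensitive method, and the function names and the two-entry translation memory are hypothetical, chosen only for illustration.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Multiset of overlapping character n-grams; no word segmentation required."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def bag_similarity(a, b, n=2):
    """Dice coefficient over character n-gram multisets (ignores segment order)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    overlap = sum((ga & gb).values())  # multiset intersection
    return 2 * overlap / total


def edit_distance(a, b):
    """Character-level edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def order_similarity(a, b):
    """Edit distance normalised to [0, 1] (sensitive to segment order)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def retrieve(query, memory, similarity):
    """Return the (source, target) pair whose source best matches the query."""
    return max(memory, key=lambda pair: similarity(query, pair[0]))


# Hypothetical two-entry translation memory, for illustration only.
memory = [
    ("東京に行きます", "I will go to Tokyo"),
    ("本を読みます", "I read a book"),
]
query = "明日東京に行きます"

print(retrieve(query, memory, bag_similarity))
print(retrieve(query, memory, order_similarity))
```

The bag-of-n-grams score needs only n-gram counts, which can be precomputed and stored in an inverted index, whereas the edit-distance score fills a table whose size is the product of the two string lengths for every candidate. That difference in cost is one plausible reading of the speed gap the abstract reports, though the sketch itself measures nothing.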

Author information

Authors and Affiliations

Department of Computer Science and Software Engineering, University of Melbourne, Melbourne, VIC, 3010, Australia
Timothy Baldwin

Corresponding author

Correspondence to Timothy Baldwin.

About this article

Cite this article

Baldwin, T. The hare and the tortoise: speed and accuracy in translation retrieval. Machine Translation 23, 195–240 (2009). https://doi.org/10.1007/s10590-009-9064-7

Received: 28 January 2009
Accepted: 24 November 2009
Published: 26 February 2010
Issue Date: November 2009
DOI: https://doi.org/10.1007/s10590-009-9064-7
