Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora

Building and Using Comparable Corpora

Abstract

The beginning of the 1990s marked a radical turn in various NLP applications towards using large collections of texts.

“That is not said right,” said the Caterpillar. “Not quite right, I’m afraid,” said Alice, timidly: “some of the words have got altered.” Lewis Carroll, Alice’s Adventures in Wonderland.

Notes

  1. http://www.tausdata.org/

Author information

Corresponding author

Correspondence to Serge Sharoff.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Sharoff, S., Rapp, R., Zweigenbaum, P. (2013). Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_1

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science (R0)
