Translingual Mining from Text Data

Chapter in: Mining Text Data

Abstract

Like full-text translation, cross-language information retrieval (CLIR) is a task that requires some form of knowledge transfer across languages. Although robust translation resources are critical for constructing high-quality translation tools, manually constructed resources are limited both in their coverage and in their adaptability to a wide range of applications. Automatic mining of translingual knowledge makes it possible to complement hand-curated resources. This chapter describes a growing body of work that seeks to mine translingual knowledge from text data, in particular from data found on the Web. We review a number of mining and filtering strategies and consider them in the context of statistical machine translation, showing that these techniques can be effective in collecting the large quantities of translingual knowledge that CLIR requires.

Corresponding author

Correspondence to Jian-Yun Nie.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Nie, JY., Gao, J., Cao, G. (2012). Translingual Mining from Text Data. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_10

  • DOI: https://doi.org/10.1007/978-1-4614-3223-4_10

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4614-3222-7

  • Online ISBN: 978-1-4614-3223-4

  • eBook Packages: Computer Science (R0)
