Comparable Multilingual Patents as Large-Scale Parallel Corpora

Building and Using Comparable Corpora

Abstract

Parallel corpora are critical resources for building many NLP applications, ranging from machine translation (MT) to cross-lingual information retrieval. In this chapter, we explore a new but important area, patents, by investigating the potential of cultivating large-scale parallel corpora from comparable multilingual patents. We investigate two major issues: (1) how to build large-scale corpora of comparable patents involving many languages, and (2) how to mine high-quality parallel sentences from these comparable patents. Four parallel corpora are presented as examples, and some preliminary statistical machine translation (SMT) experiments are reported. We further show the considerable potential of cultivating large-scale parallel corpora from multilingual patents for a wide variety of languages, such as English, Chinese, Japanese, Korean, and German, which would to some extent reduce the parallel data acquisition bottleneck in multilingual information processing.

This chapter is based on the authors’ previous work described in Lu et al. (2009, 2010a, 2010b, 2011).

Notes

  1. http://www.itl.nist.gov/iad/mig/tests/mt/

  2. Anyone interested in the corpus is invited to contact the authors for more details.

  3. Retrieved March 2010, from http://www.collinslanguage.com/.

  4. Retrieved April 2010, from http://www.wipo.int/pctdb/en/. The data below involving PCT patents come from the WIPO website.

  5. http://www.sipo.gov.cn/

  6. Some content is in image format; these images were OCRed, and the recognized characters were manually verified.

  7. Some content of the English patents was OCRed by WIPO.

  8. http://www.ipdl.inpit.go.jp/homepg.ipdl

  9. http://chasen.naist.jp/hiki/ChaSen/

  10. http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm

  11. Correct means the English sentence is exactly the literal translation of the Chinese one, or the content overlap between them is above 80%, with no need to consider phrasal reordering during translation; partially correct means the Chinese and English sentences are not literal translations of each other, but the content of each covers more than 50% of the other; incorrect means the contents of the two sentences are unrelated, or more than 50% of the content of one sentence is not translated in the other. See [17] for more details.

  12. http://en.wikipedia.org/wiki/Wikipedia_database
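
The three-way judgment defined in note 11 can be illustrated with a small sketch. This is not the authors' evaluation tool: the judgments in the chapter were made by human annotators, and the function name `judge_pair`, its parameters, and the idea of passing in annotator-estimated coverage fractions are all hypothetical. Two further assumptions are made here: the "overlap above 80%" condition is applied to both translation directions, and a coverage of exactly 50% is treated as incorrect, since the note does not specify that boundary. The clause about phrasal reordering is not modeled.

```python
def judge_pair(zh_covered: float, en_covered: float, literal: bool = False) -> str:
    """Map annotator-estimated coverage fractions to the three labels of note 11.

    zh_covered: fraction of the Chinese sentence's content found in the English one.
    en_covered: fraction of the English sentence's content found in the Chinese one.
    literal:    True if the annotator judged the pair an exact literal translation.
    """
    for value in (zh_covered, en_covered):
        if not 0.0 <= value <= 1.0:
            raise ValueError("coverage must be a fraction between 0 and 1")

    weakest = min(zh_covered, en_covered)  # the less-covered direction decides
    if literal or weakest > 0.8:  # assumption: the 80% threshold applies in both directions
        return "correct"
    if weakest > 0.5:  # each sentence covers more than half of the other
        return "partially correct"
    return "incorrect"  # assumption: exactly 50% coverage falls here


if __name__ == "__main__":
    print(judge_pair(0.9, 0.85))
    print(judge_pair(0.7, 0.6))
    print(judge_pair(0.9, 0.4))
```

Taking the minimum of the two fractions reflects the note's wording that a pair is incorrect as soon as more than half of either sentence is left untranslated.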

References

  1. Adafre, S.F., de Rijke, M.: Finding similar sentences across multiple languages in wikipedia. In: Proceedings of EACL, pp. 62–69 (2006)

  2. Brown, P.F., Lai, J.C., Mercer, R.L.: Aligning sentences in parallel corpora. In: Proceedings of ACL, pp. 169–176 (1991)

  3. Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)

  4. Cao, G., Gao, J., Nie, J.: A system to mine large-scale bilingual dictionaries from monolingual web pages. In: Proceedings of MT Summit, pp. 57–64 (2007)

  5. Chen, S.F.: Aligning sentences in bilingual corpora using lexical information. In: Proceedings of ACL, pp. 9–16 (1993)

  6. Chiang, D.: Hierarchical phrase-based translation. Comput. Linguist. 33(2), 201–228 (2007)

  7. Fujii, A., Utiyama, M., Yamamoto, M., Utsuro, T.: Overview of the patent translation task at the NTCIR-7 workshop. In: Proceedings of the NTCIR-7 Workshop, pp. 389–400. Tokyo, Japan (2008)

  8. Fujii, A., Utiyama, M., Yamamoto, M., Utsuro, T., Ehara, T., Echizen-ya, H., Shimohata, S.: Overview of the patent translation task at the NTCIR-8 workshop. In: Proceedings of the NTCIR-8 Workshop. Tokyo, Japan (2010)

  9. Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora. In: Proceedings of ACL, pp. 79–85 (1991)

  10. Ha, L.A., Fernandez, G., Mitkov, R., Corpas, G.: Mutual bilingual terminology extraction. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC), pp. 28–30 (2008)

  11. Higuchi, S., Fukui, M., Fujii, A., Ishikawa, T.: PRIME: a system for multi-lingual patent retrieval. In: Proceedings of MT Summit VIII, pp. 163–167 (2001)

  12. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT Summit X (2005)

  13. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., et al.: Moses: open source toolkit for statistical machine translation. In: Proceedings of ACL Demo Session, pp. 177–180 (2007)

  14. Kupiec, J.: An algorithm for finding noun phrase correspondences in bilingual corpora. In: Proceedings of ACL-93, pp. 17–22 (1993)

  15. Lin, D., Zhao, S., Van Durme, B., Pasca, M.: Mining parenthetical translations from the web by word alignment. In: Proceedings of ACL-08, pp. 994–1002 (2008)

  16. Jiang, L., Yang, S., Zhou, M., Liu, X., Zhu, Q.: Mining bilingual data from the web with adaptively learnt patterns. In: Proceedings of ACL-IJCNLP, pp. 870–878 (2009)

  17. Lu, B., Tsou, B.K., Zhu, J., Jiang, T., Kwong, O.Y.: The construction of an English-Chinese patent parallel corpus. In: Proceedings of MT Summit XII 3rd Workshop on Patent Translation (2009)

  18. Lu, B., Tsou, B.K.: Towards bilingual term extraction in comparable patents. In: Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation (PACLIC’23), pp. 755–762 (2009)

  19. Lu, B., Tsou, B.K., Jiang, T., Kwong, O.Y., Zhu, J.: Mining large-scale parallel corpora from multilingual patents: an English-Chinese example and its application to SMT. In: Proceedings of the 1st CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP-2010). Beijing, China. August, 2010 (2010a)

  20. Lu, B., Jiang, T., Chow, K., Tsou, B.K.: Building a large English-Chinese parallel corpus from comparable patents and its experimental application to SMT. In: Proceedings of Workshop on Building and Using Comparable Corpora. Malta (2010b)

  21. Lu, B., Chow, K.P., Tsou, B.K.: The cultivation of a trilingual Chinese-English-Japanese parallel corpus from comparable patents. In: Proceedings of Machine Translation Summit XIII (MT Summit-XIII). Xiamen (2011a)

  22. Lu, B., Tsou, B.K., Jiang, T., Zhu, J., Kwong, O.: Mining parallel knowledge from comparable patents. In: Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances. IGI Global (2011b)

  23. Ma, X.: Champollion: a robust parallel text sentence aligner. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC). Genova, Italy (2006)

  24. Moore, R.C.: Fast and accurate sentence alignment of bilingual corpora. In: Proceedings of AMTA, pp. 135–144 (2002)

  25. Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31(4), 477–504 (2005)

  26. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)

  27. Och, F.J., Ney, H.: The alignment template approach to machine translation. Comput. Linguist. 30(4), 417–449 (2004)

  28. Resnik, P., Smith, N.A.: The web as a parallel corpus. Comput. Linguist. 29(3), 349–380 (2003)

  29. Smith, J.R., Quirk, C., Toutanova, K.: Extracting parallel sentences from comparable corpora using document level alignment. In: Proceedings of NAACL-HLT, pp. 403–411 (2010)

  30. Simard, M., Plamondon, P.: Bilingual sentence alignment: balancing robustness and accuracy. Mach. Transl. 13(1), 59–80 (1998)

  31. Utiyama, M., Isahara, H.: A Japanese-English patent parallel corpus. In: Proceedings of MT Summit XI, pp. 475–482 (2007)

  32. Wu, D., Fung, P.: Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In: Proceedings of IJCNLP2005 (2005)

  33. Wu, D., Xia, X.: Learning an English-Chinese lexicon from a parallel corpus. In: Proceedings of the First Conference of the Association for Machine Translation in the Americas (1994)

  34. Zhao, B., Vogel, S.: Adaptive parallel sentences mining from web bilingual news collection. In: Proceedings of Second IEEE International Conference on Data Mining (ICDM-02) (2002)

Acknowledgments

We wish to thank our colleagues, Dr. Kataoka S., Mr. Wrong B., and others, for their help in evaluating the sampled sentence pairs and triplets.

Author information

Corresponding author

Correspondence to Bin Lu.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Lu, B., Chow, K.P., Tsou, B.K. (2013). Comparable Multilingual Patents as Large-Scale Parallel Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_9

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science (R0)
