Comparable Multilingual Patents as Large-Scale Parallel Corpora

Lu, Bin; Chow, Ka Po; Tsou, Benjamin K.

doi:10.1007/978-3-642-20128-8_9

Abstract

Parallel corpora are critical resources for building many NLP applications, ranging from machine translation (MT) to cross-lingual information retrieval. In this chapter, we explore a new but important area involving patents by investigating the potential of cultivating large-scale parallel corpora from comparable multilingual patents. We investigate two major issues concerning multilingual patents: (1) how to build large-scale corpora of comparable patents involving many languages, and (2) how to mine high-quality parallel sentences from these comparable patents. Four parallel corpora are presented as examples, and some preliminary SMT experiments are reported. We further investigate and show the considerable potential of cultivating large-scale parallel corpora from multilingual patents for a wide variety of languages, such as English, Chinese, Japanese, Korean, and German, which would to some extent reduce the parallel data acquisition bottleneck in multilingual information processing.

This chapter is based on the authors’ previous work described in Lu et al. (2009, 2010a, 2010b, 2011).

Notes

1. http://www.itl.nist.gov/iad/mig/tests/mt/
2. Anyone interested in the corpus is invited to contact the authors for more details.
3. Retrieved March 2010 from http://www.collinslanguage.com/.
4. Retrieved April 2010 from http://www.wipo.int/pctdb/en/. The data below involving PCT patents come from the WIPO website.
5. http://www.sipo.gov.cn/
6. Some contents are in image format. These images were OCRed, and the recognized characters were manually verified.
7. Some contents of the English patents were OCRed by WIPO.
8. http://www.ipdl.inpit.go.jp/homepg.ipdl
9. http://chasen.naist.jp/hiki/ChaSen/
10. http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm
11. Correct means the English sentence is exactly the literal translation of the Chinese one, or the content overlap between them is above 80% with no need to consider phrasal reordering during the translation; partially correct means the Chinese sentence and the English one are not literal translations of each other, but the content of each sentence covers more than 50% of the other; incorrect means the contents of the Chinese sentence and the English one are not related, or more than 50% of the content of one sentence is not translated in the other. Please see [17] for more details.
12. http://en.wikipedia.org/wiki/Wikipedia_database
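
The three-way judgement scheme described in note 11 can be read as a simple threshold rule. The sketch below is only an illustration of that reading, not the authors' evaluation procedure, which was carried out by human judges. The function name `judge_pair`, its parameters, and the choice to take the smaller of the two coverage values as the "content overlap" are assumptions made here. The note's condition on phrasal reordering is not modelled.

```python
def judge_pair(cov_zh_in_en: float, cov_en_in_zh: float, literal: bool = False) -> str:
    """Map a judge's coverage estimates for one sentence pair to a label.

    cov_zh_in_en: fraction of the Chinese sentence's content found in the English one.
    cov_en_in_zh: fraction of the English sentence's content found in the Chinese one.
    literal:      True if the judge considers the pair an exact literal translation.

    All three inputs are hypothetical; the note does not say how overlap is measured.
    """
    for value in (cov_zh_in_en, cov_en_in_zh):
        if not 0.0 <= value <= 1.0:
            raise ValueError("coverage must lie between 0 and 1")

    # Assumption: the pair's overlap is the weaker of the two directions.
    overlap = min(cov_zh_in_en, cov_en_in_zh)

    if literal or overlap > 0.80:
        return "correct"
    if overlap > 0.50:
        # Each sentence covers more than half of the other.
        return "partially correct"
    # Unrelated, or at least half of one sentence is left untranslated.
    return "incorrect"
```

Under these assumptions, a pair judged at 0.7 and 0.6 coverage would be labelled partially correct, while a pair at 0.9 and 0.4 would be labelled incorrect because one direction falls below the 50% threshold.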

References

Adafre, S.F., de Rijke, M.: Finding similar sentences across multiple languages in Wikipedia. In: Proceedings of EACL, pp. 62–69 (2006)
Brown, P.F., Lai, J.C., Mercer, R.L.: Aligning sentences in parallel corpora. In: Proceedings of ACL, pp. 169–176 (1991)
Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
Cao, G., Gao, J., Nie, J.: A system to mine large-scale bilingual dictionaries from monolingual web pages. In: Proceedings of MT Summit, pp. 57–64 (2007)
Chen, S.F.: Aligning sentences in bilingual corpora using lexical information. In: Proceedings of ACL, pp. 9–16 (1993)
Chiang, D.: Hierarchical phrase-based translation. Comput. Linguist. 33(2), 201–228 (2007)
Fujii, A., Utiyama, M., Yamamoto, M., Utsuro, T.: Overview of the patent translation task at the NTCIR-7 workshop. In: Proceedings of the NTCIR-7 Workshop, pp. 389–400. Tokyo, Japan (2008)
Fujii, A., Utiyama, M., Yamamoto, M., Utsuro, T., Ehara, T., Echizen-ya, H., Shimohata, S.: Overview of the patent translation task at the NTCIR-8 workshop. In: Proceedings of the NTCIR-8 Workshop. Tokyo, Japan (2010)
Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora. In: Proceedings of ACL, pp. 79–85 (1991)
Ha, L.A., Fernandez, G., Mitkov, R., Corpas, G.: Mutual bilingual terminology extraction. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC), pp. 28–30 (2008)
Higuchi, S., Fukui, M., Fujii, A., Ishikawa, T.: PRIME: a system for multi-lingual patent retrieval. In: Proceedings of MT Summit VIII, pp. 163–167 (2001)
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT Summit X (2005)
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., et al.: Moses: open source toolkit for statistical machine translation. In: Proceedings of ACL Demo Session, pp. 177–180 (2007)
Kupiec, J.: An algorithm for finding noun phrase correspondences in bilingual corpora. In: Proceedings of ACL-93, pp. 17–22 (1993)
Lin, D., Zhao, S., Van Durme, B., Pasca, M.: Mining parenthetical translations from the web by word alignment. In: Proceedings of ACL-08, pp. 994–1002 (2008)
Jiang, L., Yang, S., Zhou, M., Liu, X., Zhu, Q.: Mining bilingual data from the web with adaptively learnt patterns. In: Proceedings of ACL-IJCNLP, pp. 870–878 (2009)
Lu, B., Tsou, B.K., Zhu, J., Jiang, T., Kwong, O.Y.: The construction of an English-Chinese patent parallel corpus. In: Proceedings of MT Summit XII 3rd Workshop on Patent Translation (2009)
Lu, B., Tsou, B.K.: Towards bilingual term extraction in comparable patents. In: Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation (PACLIC’23), pp. 755–762 (2009)
Lu, B., Tsou, B.K., Jiang, T., Kwong, O.Y., Zhu, J.: Mining large-scale parallel corpora from multilingual patents: an English-Chinese example and its application to SMT. In: Proceedings of the 1st CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP-2010). Beijing, China. August, 2010 (2010a)
Lu, B., Jiang, T., Chow, K., Tsou, B.K.: Building a large English-Chinese parallel corpus from comparable patents and its experimental application to SMT. In: Proceedings of Workshop on Building and Using Comparable Corpora. Malta (2010b)
Lu, B., Chow, K.P., Tsou, B.K.: The cultivation of a trilingual Chinese-English-Japanese parallel corpus from comparable patents. In: Proceedings of Machine Translation Summit XIII (MT Summit-XIII). Xiamen (2011a)
Lu, B., Tsou, B.K., Jiang, T., Zhu, J., Kwong, O.: Mining parallel knowledge from comparable patents. In: Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances. IGI Global (2011b)
Ma, X.: Champollion: a robust parallel text sentence aligner. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC). Genova, Italy (2006)
Moore, R.C.: Fast and accurate sentence alignment of bilingual corpora. In: Proceedings of AMTA, pp. 135–144 (2002)
Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31(4), 477–504 (2005)
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
Och, F.J., Ney, H.: The alignment template approach to machine translation. Comput. Linguist. 30(4), 417–449 (2004)
Resnik, P., Smith, N.A.: The web as a parallel corpus. Comput. Linguist. 29(3), 349–380 (2003)
Smith, J.R., Quirk, C., Toutanova, K.: Extracting parallel sentences from comparable corpora using document level alignment. In: Proceedings of NAACL-HLT, pp. 403–411 (2010)
Simard, M., Plamondon, P.: Bilingual sentence alignment: balancing robustness and accuracy. Mach. Transl. 13(1), 59–80 (1998)
Utiyama, M., Isahara, H.: A Japanese-English patent parallel corpus. In: Proceedings of MT Summit XI, pp. 475–482 (2007)
Wu, D., Fung, P.: Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In: Proceedings of IJCNLP2005 (2005)
Wu, D., Xia, X.: Learning an English-Chinese lexicon from a parallel corpus. In: Proceedings of the First Conference of the Association for Machine Translation in the Americas (1994)
Zhao, B., Vogel, S.: Adaptive parallel sentences mining from web bilingual news collection. In: Proceedings of Second IEEE International Conference on Data Mining (ICDM-02) (2002)

Acknowledgments

We wish to thank our colleagues, Dr. Kataoka S., Mr. Wrong B., and others, for their help in evaluating the sampled sentence pairs and triplets.

Author information

Authors and Affiliations

Department of Chinese, Translation and Linguistics, City University of Hong Kong, Kowloon, Hong Kong
Bin Lu & Benjamin K. Tsou
Research Centre on Linguistics and Language Information Sciences, Hong Kong Institute of Education, New Territories, Hong Kong
Bin Lu, Ka Po Chow & Benjamin K. Tsou
Hong Kong Institute of Education, New Territories, Hong Kong
Ka Po Chow

Corresponding author

Correspondence to Bin Lu.

Editor information

Editors and Affiliations

Centre for Translation Studies, University of Leeds, Leeds, United Kingdom
Serge Sharoff
University of Mainz, Mainz, Germany
Reinhard Rapp
Université de Paris-Sud LIMSI-CNRS, Orsay, France
Pierre Zweigenbaum
Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Pascale Fung

About this chapter

Cite this chapter

Lu, B., Chow, K.P., Tsou, B.K. (2013). Comparable Multilingual Patents as Large-Scale Parallel Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_9

DOI: https://doi.org/10.1007/978-3-642-20128-8_9
Published: 14 December 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science (R0)