TermFinder: log-likelihood comparison and phrase-based statistical machine translation models for bilingual terminology extraction

  • Original Paper
  • Language Resources and Evaluation

Abstract

Bilingual termbanks are important for many natural language processing applications, especially in translation workflows in industrial settings. In this paper, we apply a log-likelihood comparison method to extract monolingual terminology from the source and target sides of a parallel corpus. The initial candidate term list is prepared by taking all n-gram word sequences from the corpus. A well-known statistical measure, the Dice coefficient, is then employed to remove multi-word candidates with weak associations from the list, after which the log-likelihood comparison method is applied to rank the phrasal candidate terms. Next, using a phrase-based statistical machine translation model, we create a bilingual terminology from the extracted monolingual term lists. We integrate an external knowledge source, the Wikipedia cross-language link databases, into the terminology extraction (TE) model to assist two processes: (a) the ranking of the extracted terminology list, and (b) the selection of appropriate target terms for a source term. First, we report the performance of our monolingual TE model compared to a number of state-of-the-art TE models on English-to-Turkish and English-to-Hindi data sets. Then, we evaluate our novel bilingual TE model on an English-to-Turkish data set, and report the automatic evaluation results. We also manually evaluate our novel TE model on English-to-Spanish and English-to-Hindi data sets, and observe excellent performance for all domains.
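
The monolingual ranking step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the widely used two-corpus log-likelihood formulation (as in Rayson and Garside 2000), uses the token count as the corpus size for n-grams of every length, and omits the Dice filtering step. The function names `ngrams`, `log_likelihood` and `rank_candidates`, and the toy corpora, are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """Yield every contiguous n-gram (as a tuple) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def log_likelihood(a, b, c, d):
    """Two-corpus log-likelihood (assumed formulation).

    a, b: candidate frequency in the domain and reference corpus.
    c, d: sizes of the domain and reference corpus.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the domain corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll


def rank_candidates(domain_tokens, reference_tokens, max_n=2):
    """Rank n-grams that are relatively more frequent in the domain corpus."""
    dom = Counter(ngrams(domain_tokens, max_n))
    ref = Counter(ngrams(reference_tokens, max_n))
    # Simplification: token counts stand in for corpus size at every n.
    c, d = len(domain_tokens), len(reference_tokens)
    scored = []
    for cand, a in dom.items():
        b = ref[cand]  # Counter returns 0 for unseen candidates
        if a / c > b / d:
            scored.append((cand, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpora, invented purely for illustration.
domain = "the phrase table stores phrase pairs and the phrase table is large".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()
ranked = rank_candidates(domain, reference)
for cand, score in ranked[:3]:
    print(" ".join(cand), round(score, 2))
```

In this toy run the candidates that occur only in the domain text rise to the top, while a word such as "the", which is relatively more frequent in the reference text, is not ranked at all.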

Notes

  1. Before the TE process begins, we apply a number of preprocessing methods including tokenisation to the input domain corpus and the generic reference corpus. We introduce the notion of generic corpus in Sect. 3.1.3.

  2. For example, if the values of \(|x|\), \(|y|\) and \(|x\,y|\) are 1, 1 and 1 in one case and 100, 100 and 100 in another, the DC measure gives the same score (i.e. 1) in both cases.

  3. Equation (11) is derived from the right-hand side of Eq. (10) for a single source–target phrase-pair.

  4. http://ufallab.ms.mff.cuni.cz/~bojar/hindencorp/.

  5. http://www.statmt.org/wmt14/.

  6. http://opus.lingfil.uu.se/KDE4.php.

  7. English-to-Indian Language Machine Translation (EILMT) is a Ministry of IT, Govt. of India sponsored project.

  8. http://opus.lingfil.uu.se/MultiUN.php.

  9. http://opus.lingfil.uu.se.

  10. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl.

  11. https://code.google.com/p/jatetoolkit/.

  12. Note that this reference set was created by expert linguists familiar with this particular dataset. While the number of terms might seem small given that the initial corpus has over 57K sentences, in general the number of terms does not grow linearly with corpus size, as the corpus may be repetitive, with only a core set of terms being used to describe a large number of items.

  13. We cannot pick a single best system from M1, M2 and M3 as the evaluation results from the English and Turkish monolingual TE tasks show that there are no significant differences in the evaluation scores of M1, M2 and M3. For analysis we arbitrarily select M3 from the best-performing TE systems (M1, M2, and M3).

  14. In this case, the top-four ranked target terms were considered for evaluation. Therefore, the TP score is increased if any of the four source–target term-pairs matches with the reference set. We discuss why we considered a maximum of four target terms for a source term later in this section.

  15. We cannot pick a single best system from B1, B2, B3 and B4 as the evaluation results from the two tasks (English-to-Turkish and Turkish-to-English bilingual TE) show that there are no significant differences in the evaluation scores of B1, B2, B3 and B4. For analysis we arbitrarily select B4 from the best-performing four bilingual TE systems (B1, B2, B3 and B4).
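
The observation in Note 2 can be checked with a few lines of code. This is a minimal sketch that assumes the standard two-word form of the Dice coefficient, 2|x y| / (|x| + |y|); the function name `dice` is hypothetical and the counts are the ones given in the note.

```python
def dice(freq_x: int, freq_y: int, freq_xy: int) -> float:
    """Dice coefficient for a word pair: 2|xy| / (|x| + |y|) (assumed form)."""
    return 2 * freq_xy / (freq_x + freq_y)


# The two cases from Note 2: a pair seen once and a pair seen 100 times.
rare = dice(1, 1, 1)
frequent = dice(100, 100, 100)
print(rare, frequent)
```

Both calls return the same score, so the measure on its own cannot separate a pair seen once from a pair seen a hundred times.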

References

  • Ahmad, K., Gillam, L., & Tostevin, L. (1999). University of Surrey participation in TREC8: Weirdness indexing for logical document extrapolation and retrieval (WILDER). In The eighth text retrieval conference (TREC-8), Gaithersburg, MD (pp. 717–724).

  • Aker, A., Paramita, M., & Gaizauskas, R. (2013). Extracting bilingual terminologies from comparable corpora. In Proceedings of the 51st annual meeting of the association for computational linguistics, Sofia, Bulgaria (pp. 402–411).

  • Ananiadou, S. (1994). A methodology for automatic term recognition. In Proceedings of COLING: 15th international conference on computational linguistics, Kyoto, Japan (pp. 1034–1038).

  • Arcan, M., Giuliano, C., Turchi, M., & Buitelaar, P. (2014a). Identification of bilingual terms from monolingual documents for statistical machine translation. In Proceedings of the 4th international workshop on computational terminology (Computerm), Dublin, Ireland (pp. 22–31).

  • Arcan, M., Turchi, M., Tonelli, S., & Buitelaar, P. (2014b). Enhancing statistical machine translation with bilingual terminology in a CAT environment. In Proceedings of the 11th biennial conference of the association for machine translation in the Americas (AMTA 2014), Vancouver, BC, Canada (pp. 54–68).

  • Basili, R., Moschitti, A., Pazienza, M. T., & Zanzotto, F. M. (2001). A contrastive approach to term extraction. In Proceedings of the 4th conference on terminology and artificial intelligence (TIA 2001), Nancy, France (pp. 119–128).

  • Bojar, O., Diatka, V., Rychlý, P., Straňák, P., Suchomel, V., Tamchyna, A., & Zeman, D. (2014). HindEnCorp—Hindi-English and Hindi-only corpus for machine translation. In Proceedings of the ninth international language resources and evaluation conference (LREC’14), Reykjavik, Iceland (pp. 3550–3555).

  • Bouamor, D., Semmar, N., & Zweigenbaum, P. (2012). Identifying bilingual multi-word expressions for statistical machine translation. Language resources and evaluation (LREC) (pp. 674–679). Istanbul, Turkey.

  • Church, K. W., & Gale, W. A. (1995). Inverse document frequency (IDF): A measure of deviation from Poisson. In Proceedings of the 3rd workshop on very large corpora, Cambridge, MA (pp. 121–130).

  • Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.

  • Daille, B., Gaussier, É., & Langé, J. M. (1994). Towards automatic extraction of monolingual and bilingual terminology. COLING 94, The 15th international conference on computational linguistics (pp. 515–521). Kyoto, Japan.

  • Damerau, F. J. (1993). Generating and evaluating domain-oriented multi-word terms from texts. Information Processing & Management, 29(4), 433–447.

  • Deane, P. (2005). A nonparametric method for extraction of candidate phrasal terms. In Proceedings of the 43rd annual meeting of the association for computational linguistics, Ann Arbor, Michigan, USA (pp. 605–613).

  • Désilets, A., Melançon, C., Patenaude, G., & Brunette, L. (2009). How translators use tools and resources to resolve translation problems: An ethnographic study. In MT Summit XII Workshop: Beyond translation memories: New tools for translators MT, Ontario, Canada.

  • Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3), 297–302.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

  • Ehrmann, M., Turchi, M., & Steinberger, R. (2011). Building a multilingual named entity-annotated corpus using annotation projection. In Proceedings of recent advances in natural language processing, Hissar, Bulgaria (pp. 118–124).

  • Fellbaum, C. (1998). WordNet: An electronic lexical database. Language, speech, and communication. Cambridge, MA: MIT Press. https://books.google.co.uk/books?id=Rehu8OOzMIMC.

  • Frantzi, K., Ananiadou, S., & Mima, H. (2000). Automatic recognition of multi-word terms: The C-value/NC-value method. International Journal of Digital Libraries, 3(2), 115–130.

  • Frantzi, K. T., & Ananiadou, S. (1996). Extracting nested collocations. In Proceedings of the 16th international conference on computational linguistics, Copenhagen, Denmark (pp. 41–46).

  • Gaussier, É. (1998). Flow network models for word alignment and terminology extraction from bilingual corpora. In COLING-ACL’98, 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics (pp. 444–450). Montreal, QC, Canada.

  • Gelbukh, A., Sidorov, G., Lavin-Villa, E., & Chanona-Hernandez, L. (2010). Automatic term extraction using log-likelihood based comparison with general reference corpus. In NLDB’10: Proceedings of the Natural language processing and information systems, and 15th international conference on applications of natural language to information systems, Cardiff University, Cardiff, Wales, UK (pp. 248–255).

  • Gonçalo Oliveira, H., & Gomes, P. (2010). Towards the automatic creation of a wordnet from a term-based lexical network. In Proceedings of the ACL workshop TextGraphs-5: Graph-based methods for natural language processing (pp. 10–18). Uppsala, Sweden: ACL Press.

  • Ha, L. A., Fernandez, G., Mitkov, R., & Corpas, G. (2008). Mutual bilingual terminology extraction. In LREC 2008: 6th language resources and evaluation conference (pp. 1818–1824). Marrakech, Morocco.

  • Haque, R., Penkale, S., & Way, A. (2014). Bilingual termbank creation via log-likelihood comparison and phrase-based statistical machine translation. In Proceedings of 4th international workshop on computational terminology (Computerm), COLING 2014, Dublin, Ireland (pp. 42–51).

  • Hasler, E., Haddow, B., & Koehn, P. (2011). Margin infused relaxed algorithm for Moses. The Prague Bulletin of Mathematical Linguistics, 96, 69–78.

  • He, T., Zhang, X., & Ye, X. (2006). An approach to automatically constructing domain ontology. In Proceedings of the 20th Pacific Asia conference on language, information and computation, PACLIC 2006, Wuhan, China (pp. 150–157).

  • Itagaki, M., Aikawa, T., & He, X. (2007). Automatic validation of terminology translation consistency with statistical method. In Proceedings of MT Summit XI, Copenhagen, Denmark (pp. 269–274).

  • Justeson, J. S., & Katz, S. M. (1995). Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1), 9–27.

  • Kim, S. N., Baldwin, T., & Kan, M. Y. (2009). An unsupervised approach to domain-specific term extraction. In Proceedings of the Australasian language technology association workshop 2009, Sydney, Australia (pp. 94–98).

  • Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit X: The Tenth Machine Translation Summit (pp. 79–86). Phuket, Thailand.

  • Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. In HLT-NAACL 2003: Conference combining human language technology conference series and the North American chapter of the association for computational linguistics conference series (pp. 48–54). Edmonton, Canada.

  • Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al. (2007). Moses: Open source toolkit for statistical machine translation. In ACL 2007, Proceedings of the Interactive Poster and Demonstration Sessions, Prague, Czech Republic (pp. 177–180).

  • Kozakov, L., Park, Y., Fin, T., Drissi, Y., Doganata, Y., & Cofino, T. (2004). Glossary extraction and utilization in the information search and delivery system for IBM technical support. IBM Systems Journal, 43(3), 546–563.

  • Kupiec, J. (1993). An algorithm for finding noun phrase correspondences in bilingual corpora. In 31st annual meeting of the association for computational linguistics, Proceedings of the Conference, Columbus, OH, USA (pp. 17–22).

  • Lefever, E., Macken, L., & Hoste, V. (2009). Language-independent bilingual terminology extraction from a multilingual parallel corpus. In EACL ’09 Proceedings of the 12th conference of the European chapter of the association for computational linguistics, Athens, Greece (pp. 496–504).

  • Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics doklady, 10, 707–710.

  • Lu, B., & Tsou, B. K. (2009). Towards bilingual term extraction in comparable patents. In Proceedings of the 23rd Pacific Asia conference on language, information and computation, Hong Kong (pp. 755–762).

  • Och, F. J., & Ney, H. (2002). Discriminative training and maximum entropy models for statistical machine translation. In 40th annual meeting of the association for computational linguistics, Proceedings of the Conference, Philadelphia, PA, USA (pp. 295–302).

  • Pantel, P., & Lin, D. (2001). A statistical corpus-based term extractor. In E. Stroulia & S. Matwin (Eds.), Advances in artificial intelligence (Lecture notes in computer science, Vol. 2056, pp. 36–46). Berlin, Heidelberg: Springer.

  • Park, Y., Byrd, R. J., & Boguraev, B. K. (2002). Automatic glossary extraction: Beyond terminology identification. In Proceedings of the 19th international conference on computational linguistics, association for computational linguistics (ACL), Taipei, Taiwan (pp. 772–778).

  • Pinnis, M., Ljubešić, N., Skadiņa, I., Tadić, M., & Gornostay, T. (2012). Term extraction, tagging, and mapping tools for under-resourced languages. In Proceedings of the terminology and knowledge engineering (TKE2012) conference, Madrid, Spain (pp. 193–208).

  • Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the workshop on comparing corpora, held in conjunction with the 38th annual meeting of the association for computational linguistics (ACL 2000), Hong Kong (pp. 1–6).

  • Ren, Z., Lü, Y., Cao, J., Liu, Q., & Huang, Y. (2009). Improving statistical machine translation using domain bilingual multiword expressions. In Proceedings of the workshop on multiword expressions: Identification, interpretation, disambiguation and applications, association for computational linguistics, Suntec, Singapore (pp. 47–54).

  • Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523.

  • Sclano, F., & Velardi, P. (2007). TermExtractor: A web application to learn the shared terminology of emergent web communities. In Proceedings of the 3rd international conference on interoperability for enterprise software and applications (I-ESA 2007), Funchal, Madeira Island, Portugal (pp. 287–290).

  • Tiedemann, J. (2009). News from OPUS: A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, G. Angelova & R. Mitkov (Eds.), Recent advances in natural language processing V (pp. 237–248). Amsterdam/Philadelphia: John Benjamins Publishing. doi:10.1075/cilt.309.

  • Vintar, V., & Fišer, D. (2008). Harvesting multi-word expressions from parallel corpora. In Proceedings of the sixth international conference on language resources and evaluation, (LREC 2008), Marrakech, Morocco (pp. 1091–1096).

  • Wu, X., Haque, R., Okita, T., Arora, P., Way, A., & Liu, Q. (2014). DCU-Lingo24 participation in WMT 2014 Hindi-English translation task. In Proceedings of WMT 2014: The ninth workshop on statistical machine translation, Baltimore, MD (pp. 215–220).

  • Yeh, A. (2000). More accurate tests for the statistical significance of result differences. In Proceedings of the 18th conference on computational linguistics—volume 2, COLING 2000, Stroudsburg, PA (pp. 947–953).

  • Zhang, Z., Iria, J., Brewster, C., & Ciravegna, F. (2008). A comparative evaluation of term recognition algorithms. In Proceedings of the sixth international conference on language resources and evaluation, (LREC 2008), Marrakech, Morocco (pp. 2108–2113).

Author information

Corresponding author

Correspondence to Sergio Penkale.

About this article

Cite this article

Haque, R., Penkale, S. & Way, A. TermFinder: log-likelihood comparison and phrase-based statistical machine translation models for bilingual terminology extraction. Lang Resources & Evaluation 52, 365–400 (2018). https://doi.org/10.1007/s10579-018-9412-4

  • DOI: https://doi.org/10.1007/s10579-018-9412-4
