Abstract

Comparable corpora may contain various types of single-word and multi-word phrases that are translations of each other, and such translation pairs can benefit many natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project for (1) identifying terms, named entities (NEs), and other lexical units in comparable corpora, and (2) cross-lingually mapping the identified single-word and multi-word phrases to build automatically extracted bilingual dictionaries. These dictionaries can then be utilised in machine translation, question answering, indexing, and other areas where bilingual dictionaries are useful.

Notes

  1. Published as part of TildeNER in the ‘Toolkit for multi-level alignment and information extraction from comparable corpora’, public deliverable of the project ACCURAT, 2011.

  2. A detailed list of available workflows is given in deliverable D2.6 of the ACCURAT project.

  3. StanfordNER English model from Stanford University: ‘conll.distsim.iob2.crf.ser.gz’, available for download from: http://nlp.stanford.edu/software/crf-faq.shtml (point 11).

  4. As reported by Stanford University at: http://nlp.stanford.edu/software/crf-faq.shtml (point 11).

  5. If many L2 candidates were correct translations of the L1 lexeme, it would be more reasonable to use mean average precision (MAP).
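To make the last note concrete, the following is a minimal sketch of how MAP could be computed when an L1 lexeme has several correct L2 translations. It uses the standard information-retrieval definition of average precision, which the note itself does not spell out. The function names and the toy English-Spanish data are hypothetical and do not come from the chapter.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], correct: Set[str]) -> float:
    """Average precision of one ranked L2 candidate list.

    Precision is taken at every rank where a correct translation
    appears and is averaged over the number of correct translations.
    """
    if not correct:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in correct:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(correct)


def mean_average_precision(
    rankings: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
) -> float:
    """Mean of the per-lexeme average precision over all gold L1 lexemes."""
    if not gold:
        return 0.0
    total = sum(
        average_precision(rankings.get(l1, []), refs)
        for l1, refs in gold.items()
    )
    return total / len(gold)


# Hypothetical toy data: ranked L2 candidates per L1 lexeme.
rankings = {
    "house": ["casa", "hogar", "perro"],
    "dog": ["gato", "perro"],
}
# Hypothetical gold standard: all correct L2 translations per L1 lexeme.
gold = {
    "house": {"casa", "hogar"},
    "dog": {"perro"},
}

map_score = mean_average_precision(rankings, gold)
print(map_score)
```

Unlike a measure that only looks at the first correct candidate, this score rewards a ranking for placing every correct translation near the top of the list.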

Additional information

Chapter editors: Mārcis Pinnis and Nikola Ljubešić

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Pinnis, M. et al. (2019). Extracting Data from Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_4

  • DOI: https://doi.org/10.1007/978-3-319-99004-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99003-3

  • Online ISBN: 978-3-319-99004-0

  • eBook Packages: Computer Science, Computer Science (R0)
