Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making

Building and Using Comparable Corpora

Abstract

In this article we describe two tools we have built: one for compiling comparable corpora from the Internet, and another for extracting bilingual terminology from comparable corpora. We also describe an evaluation of both tools. Bilingual terminology was extracted from automatically collected domain-comparable web corpora in Basque and English, and the resulting terminology lists were validated automatically against a specialized dictionary to assess their quality. The evaluation thus measures how useful the two automatic tools are when combined in a real-world task, namely specialized dictionary making.

Author information

Corresponding author

Correspondence to Antton Gurrutxaga.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Gurrutxaga, A., Leturia, I., Saralegi, X., Vicente, I.S. (2013). Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_3

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science, Computer Science (R0)
