Abstract

Parallel corpora are scarce, especially for under-resourced languages and narrow domains. In contrast, the number of comparable documents in these areas that are freely available on the Web is continuously increasing. Algorithmic approaches to identifying these documents on the Web are needed in order to build comparable corpora automatically for these under-resourced languages and domains. How do we identify these comparable documents? What approaches should be used to collect them from different Web sources? In this chapter, we first review previous techniques developed for collecting comparable documents from the Web. We then describe in detail three new techniques for gathering comparable documents from three different types of Web sources: Wikipedia, news articles, and narrow domains.

Notes

  1. For example, “letter of credit” in English may be translated into Dutch as “accreditief” or “kredietbrief” (according to EuroWordNet).

  2. Wikipedia inter-language links connect documents in different languages that describe the same topic.

  3. Good-quality articles are those that senior Wikipedia moderators and the Romanian Wikipedia community consider to be complete, well written, with good references, etc.

  4. We used a clean dictionary containing more than 1.5 million entries for RO-EN; for other language pairs, dictionaries were built in the ACCURAT project using GIZA++.

  5. The Wikipedia Extractor tool is available for download from the ACCURAT project website (Paramita et al. 2012).

  6. For named entity parsing, we use OpenNLP tools: http://incubator.apache.org/opennlp/

  7. Boilerpipe (http://code.google.com/p/boilerpipe/) is used to extract the textual content from the URL.

  8. Titles with fewer than five content words are not taken into consideration.

  9. http://search.cpan.org/~ambs/Lingua-Identify-0.56/lib/Lingua/Identify.pm

  10. http://bixo.101tec.com

  11. Have all topic-core-terms been included? Have other terms effectively pointing to the topic also been included? Does the topic definition file contain multi-word strong topic indicators? Have all terms been ranked consistently?

  12. Do seed URLs in the source language and seed URLs in the target language address highly comparable Web documents? Have multilingual sites, if any, been included?

  13. The longer the better, especially in cases where there are not too many Web documents relevant to the topic selected.

  14. For example, increase “Minimum unique terms that must exist in clean content” from the default value of 3 to 5.

  15. LDA modeling can abstract a model from a relatively small corpus, and a tenth of the original Reuters corpus is much more manageable in terms of memory and requirements.
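The title filter mentioned in note 8 can be illustrated with a minimal sketch. It assumes that a "content word" is any token not found in a stopword list; the tiny `STOPWORDS` set and the function names `content_words` and `keep_title` are hypothetical and are not taken from the chapter, which may define content words differently.

```python
import re

# Hypothetical, deliberately small stopword list (an assumption for this
# sketch); a real system would use a fuller, language-specific list.
STOPWORDS = frozenset({
    "a", "an", "and", "the", "of", "in", "on", "to",
    "for", "is", "as", "at", "by", "with",
})

# Threshold from note 8: titles with fewer than five content words are skipped.
MIN_CONTENT_WORDS = 5


def content_words(title):
    """Return the lowercased tokens of a title that are not stopwords."""
    tokens = re.findall(r"\w+", title.lower())
    return [t for t in tokens if t not in STOPWORDS]


def keep_title(title, min_words=MIN_CONTENT_WORDS):
    """Return True if the title has at least min_words content words."""
    return len(content_words(title)) >= min_words


if __name__ == "__main__":
    for t in ["Bank of England raises interest rates again",
              "The state of the economy"]:
        print(t, "->", keep_title(t))
```

Short titles carry too few terms to be matched reliably across languages, which is presumably why such a threshold is applied before any pairing step.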

References

  • ACCURAT Deliverable: D3.3, D3.4, D3.5.

  • Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. Proceedings of the EACL Workshop on New Text, Trento, Italy.

  • Aker, A., Kanoulas, E., & Gaizauskas, R. (2012). A light way to collect comparable corpora from the Web. Proceedings of LREC 2012, 21–27 May, Istanbul, Turkey.

  • Ardö, A., & Golub, K. (2007). Documentation for the Combine (Focused) Crawling System. http://combine.it.lth.se/documentation/DocMain/

  • Argaw, A. A., & Asker, L. (2005). Web mining for an Amharic-English bilingual corpus. Proceedings of the 1st International Conference on Web Information Systems and Technologies, WEBIST ’05 (pp. 239–246). INSTICC Press.

  • Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the Web. Proceedings of LREC 2004 (pp. 1313–1316).

  • Barzilay, R., & McKeown, K. R. (2001). Extracting paraphrases from a parallel corpus. ACL ’01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (pp. 50–57). Association for Computational Linguistics, Morristown, NJ.

  • Bharadwaj, R. G., & Varma, V. (2011). Language independent identification of parallel sentences using Wikipedia. Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11 (pp. 11–12), ACM, New York, NY.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.

  • Braschler, M., & Schäuble, P. (1998). Multilingual information retrieval based on document alignment techniques. Research and Advanced Technology for Digital Libraries: Second European Conference, ECDL’98, Heraklion, Crete, Greece, September 21–23, 1998: Proceedings, 183. Springer.

  • Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.

  • Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 17–24). Association for Computational Linguistics, Morristown, NJ.

  • Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (pp. 161–175).

  • Chakrabarti, S., Punera, K., & Subramanyam, M. (2002, May). Accelerated focused crawling through online relevance feedback. Proceedings of the 11th International Conference on World Wide Web (pp. 148–159). ACM.

  • Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1–7), 161–172.

  • De Bra, P. M. E., & Post, R. D. J. (1994). Information retrieval in the World-Wide Web: Making client-based searching feasible. Computer Networks and ISDN Systems, 27(2), 183–192.

  • Dimalen, D. M. D., & Roxas, R. (2007). AutoCor: A query based automatic acquisition of corpora of closely-related languages. Proceedings of the 21st PACLIC (pp. 146–154).

  • Esplà-Gomis, M., & Forcada, M. L. (2010). Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with bitextor. The Prague Bulletin of Mathematical Linguistics, 93, 77–86.

  • Filatova, E. (2009). Directions for exploiting asymmetries in multilingual Wikipedia. Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3 ’09).

  • Fung, P., & Cheung, P. (2004). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP ’04 (pp. 57–63), Citeseer.

  • Gamallo, P., & Garcia, M. (2012). Extraction of bilingual cognates from Wikipedia. Computational Processing of the Portuguese Language (pp. 63–72). Springer.

  • Ghani, R., Jones, R., & Mladenic, D. (2005). Building minority language corpora by learning to generate web search queries. Knowledge and Information Systems, 7(1), 56–83.

  • Hassan, A., Fahmy, H., & Hassan, H. (2007). Improving named entity translation by exploiting comparable and parallel corpora. Proceedings of the 2007 Conference on Recent Advances in Natural Language Processing (RANLP), AMML Workshop.

  • Hersovici, M., Jacovi, M., Maarek, Y. S., Pelleg, D., Shtalhaim, M., & Ur, S. (1998). The sharksearch algorithm—An application: Tailored Web site mapping. Computer Networks and ISDN Systems, 30(1–7), 317–326.

  • Huang, D., Zhao, L., Li, L., & Yu, H. (2010). Mining large-scale comparable corpora from Chinese-English news collections. Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 472–480). Association for Computational Linguistics.

  • Ion, R., Tufiş, D., Boroş, T., Ceauşu, A., & Ştefănescu, D. (2010). On-line compilation of comparable corpora and their evaluation. Proceedings of the 7th International Conference Formal Approaches to South Slavic and Balkan Languages (FASSBL7) (pp. 29–34). Croatian Language Technologies Society – Faculty of Humanities and Social Sciences, University of Zagreb, Dubrovnik, Croatia, October 2010.

  • Kauchak, D., & Barzilay, R. (2006). Paraphrasing for automatic evaluation. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 455–462). Association for Computational Linguistics, Morristown, NJ.

  • Koehn, P. (2009). Statistical machine translation. Cambridge University Press.

  • Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate detection using shallow text features. The Third ACM International Conference on Web Search and Data Mining.

  • Kumano, T., Tanaka, H., & Tokunaga, T. (2007). Extracting phrasal alignments from comparable corpora by using joint probability SMT model. Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07) (pp. 95–103).

  • Lü, Y., Huang, J., & Liu, Q. (2007, June). Improving statistical machine translation performance by training data selection and optimization. Proceedings of EMNLP-CoNLL 2007 (pp. 343–350).

  • Marton, Y., Callison-Burch, C., & Resnik, P. (2009). Improved statistical machine translation using monolingually-derived paraphrases. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 381–390). Association for Computational Linguistics.

  • Mastropavlos, N., & Papavassiliou, V. (2011). Automatic acquisition of bilingual language resources. Proceedings of the 10th International Conference on Greek Linguistics, Komotini, Greece.

  • Menczer, F., & Belew, R. (2000). Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2–3), 203–242.

  • Munteanu, D. S., & Marcu, D. (2002). Processing comparable corpora with bilingual suffix trees. EMNLP ’02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (pp. 289–295). Association for Computational Linguistics, Morristown, NJ.

  • Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.

  • Munteanu, D. S., & Marcu, D. (2006). Extracting parallel sub-sentential fragments from non-parallel corpora. ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 81–88). Association for Computational Linguistics, Morristown, NJ.

  • Nakov, P. (2008). Paraphrasing verbs for noun compound interpretation. Proceedings of the Workshop on Multiword Expressions, LREC-2008.

  • Paramita, M., Clough, P., Aker, A., & Gaizauskas, R. (2012). Correlation between similarity measures for inter-language linked Wikipedia articles. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012) (pp. 790–797), Istanbul, Turkey.

  • Passerini, A., Frasconi, P., & Soda, G. (2001). Evaluation methods for focused crawling, Lecture Notes in Computer Science 2175, pp. 33–45.

  • Phan, X. H., Nguyen, L. M., & Horiguchi, S. (2008, April). Learning to classify short and sparse text and web with hidden topics from large-scale data collections. Proceedings of the 17th International Conference on World Wide Web (pp. 91–100). ACM.

  • Pinkerton, B. (1994). Finding what people want: Experiences with the Web Crawler. Proceedings of the 2nd International World Wide Web Conference.

  • Preiss, J. (2012). Identifying comparable corpora using LDA. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT ’12) (pp. 558–562). Association for Computational Linguistics, Stroudsburg, PA.

  • Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 519–526). Association for Computational Linguistics.

  • Resnik, P. (1998). Parallel strands: A preliminary investigation into mining the web for bilingual text. In D. Farwell, L. Gerber, & E. Hovy (Eds.), Machine Translation and the Information Soup: Third Conference of the Association for Machine Translation in the Americas (AMTA-98), Langhorne, PA, Lecture Notes in Artificial Intelligence 1529, Springer, October, 1998.

  • Resnik, P. (1999). Mining the web for bilingual text. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 527–534). Association for Computational Linguistics.

  • Rose, T. G., Stevenson, M., & Whitehead, M. (2002). The Reuters corpus volume 1 – from yesterday’s news to tomorrow’s language resources. Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 827–832).

  • Sharoff, S., Babych, B., & Hartley, A. (2006). Using comparable corpora to solve problems difficult for human translators. Proceedings of the COLING/ACL on Main Conference Poster Sessions (pp. 739–746). Association for Computational Linguistics, Morristown, NJ.

  • Simard, M., Foster, G. F., & Isabelle, P. (1993). Using cognates to align sentences in bilingual corpora. In A. Gawman, E. Kidd, & P-Å. Larson (Eds.), Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing (CASCON ’93) (Vol. 2, pp. 1071–1082). IBM Press.

  • Smith, J. R., Quirk, C., & Toutanova, K. (2010). Extracting parallel sentences from comparable corpora using document level alignment. In NAACL-HLT (pp. 403–411).

  • Steinberger, R., Pouliquen, B., & Ignat, C. (2005). Navigating multilingual news collections using automatically extracted information. Journal of Computing and Information Technology, 13(4), 257–264.

  • Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M., & Laurikkala, J. (2008). Focused web crawling in the acquisition of comparable corpora. Information Retrieval, 11(5), 427–445.

  • Theobald, M., Siddharth, J., & Paepcke, A. (2008). SpotSigs: Robust and efficient near duplicate detection in large web collections. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008).

  • Tomás, J., Bataller, J., Casacuberta, F., & Lloret, J. (2001). Mining Wikipedia as a parallel and comparable corpus. Language Forum (Vol. 34, No. 1, pp. 123–137). Bahri Publications.

  • Uszkoreit, J., Ponte, J. M., Popat, A. C., & Dubiner, M. (2010, August). Large scale parallel document mining for machine translation. Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1101–1109). Association for Computational Linguistics.

  • Yu, K., & Tsujii, J. (2009). Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers (pp. 121–124). Association for Computational Linguistics, Stroudsburg, PA.

  • Zhao, S., Niu, C., Zhou, M., Liu, T., & Li, S. (2008, June). Combining multiple resources to improve SMT-based paraphrasing model. Proceedings of ACL-08: HLT (pp. 1021–1029). Association for Computational Linguistics, Columbus, OH.

  • Zhang, Y., Wu, K., Gao, J., & Vines, P. (2006). Automatic acquisition of Chinese-English parallel corpus from the web. Proceedings of 28th European Conference on Information Retrieval ECIR 2006, April 10–12, 2006, London.

Author information

Corresponding author

Correspondence to Robert Gaizauskas.

Additional information

Chapter editors: Robert Gaizauskas and Monica Lestari Paramita

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Paramita, M.L. et al. (2019). Collecting Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_3

  • DOI: https://doi.org/10.1007/978-3-319-99004-0_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99003-3

  • Online ISBN: 978-3-319-99004-0

  • eBook Packages: Computer Science, Computer Science (R0)
