Collecting Comparable Corpora

Paramita, Monica Lestari; Aker, Ahmet; Clough, Paul; Gaizauskas, Robert; Glaros, Nikos; Mastropavlos, Nikos; Yannoutsou, Olga; Ion, Radu; Ștefănescu, Dan; Ceauşu, Alexandru; Tufiș, Dan; Preiss, Judita

doi:10.1007/978-3-319-99004-0_3

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

The availability of parallel corpora is limited, especially for under-resourced languages and narrow domains. On the other hand, the number of comparable documents in these areas that are freely available on the Web is continuously increasing. Algorithmic approaches to identifying these documents on the Web are needed to build comparable corpora automatically for these under-resourced languages and domains. How do we identify these comparable documents? What approaches should be used to collect them from different Web sources? In this chapter, we first review previous techniques developed for collecting comparable documents from the Web. We then describe in detail three new techniques for gathering comparable documents from three different types of Web sources: Wikipedia, news articles, and narrow domains.

Notes

1. For example, “letter of credit” in English may be translated into Dutch as “accreditief” or “kredietbrief” (according to EuroWordNet).
2. Wikipedia inter-language links connect documents in different languages that describe the same topic.
3. Good-quality articles are those that senior Wikipedia moderators and the Romanian Wikipedia community consider to be complete, well written, well referenced, etc.
4. We used a clean dictionary containing more than 1.5 million entries for RO-EN; for other language pairs, dictionaries were built in the ACCURAT project using GIZA++.
5. The Wikipedia Extractor tool is available for download from the ACCURAT project website (Paramita et al. 2012).
6. For named entity parsing, we use OpenNLP tools: http://incubator.apache.org/opennlp/
7. Boilerpipe (http://code.google.com/p/boilerpipe/) is used to extract the textual content from the URL.
8. Titles that have fewer than five content words are not taken into consideration.
9. http://search.cpan.org/~ambs/Lingua-Identify-0.56/lib/Lingua/Identify.pm
10. http://bixo.101tec.com
11. Have all topic-core-terms been included? Have other terms effectively pointing to the topic also been included? Does the topic definition file contain multi-word strong topic indicators? Have all terms been ranked consistently?
12. Do seed URLs in the source language and seed URLs in the target language address highly comparable Web documents? Have multilingual sites, if any, been included?
13. The longer the better, especially in cases where there are not too many Web documents relevant to the topic selected.
14. For example, increase “Minimum unique terms that must exist in clean content” from the default value of 3 to 5.
15. LDA modeling can abstract a model from a relatively small corpus, and a tenth of the original Reuters corpus is much more manageable in terms of memory and requirements.
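
The title filter described in note 8 can be illustrated with a short Python sketch. Only the threshold of five content words comes from the note. The stop-word list, the tokenisation rule, and the function names `content_words` and `keep_title` are assumptions made for illustration and are not the chapter's implementation.

```python
import re

# Hypothetical, deliberately small stop-word list; the note does not say
# which function words are excluded when counting content words.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for", "and",
              "or", "is", "are", "was", "were", "by", "with", "as", "from"}

MIN_CONTENT_WORDS = 5  # threshold stated in note 8


def content_words(title):
    """Return the lowercase alphabetic tokens of a title that are not stop words."""
    tokens = re.findall(r"[^\W\d_]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def keep_title(title):
    """Return True if the title has at least MIN_CONTENT_WORDS content words."""
    return len(content_words(title)) >= MIN_CONTENT_WORDS
```

Under these assumptions, a short title such as "The state of the union" would be discarded, because it contains only two content words. A production system would presumably use a language-specific stop-word list for each language in the pair.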

References

ACCURAT Deliverable: D3.3, D3.4, D3.5.
Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. Proceedings of the EACL Workshop on New Text, Trento, Italy.
Aker, A., Kanoulas, E., & Gaizauskas, R. (2012). A light way to collect comparable corpora from the Web. Proceedings of LREC 2012, 21–27 May, Istanbul, Turkey.
Ardö, A., & Golub, K. (2007). Documentation for the Combine (Focused) Crawling System. http://combine.it.lth.se/documentation/DocMain/
Argaw, A. A., & Asker, L. (2005). Web mining for an Amharic-English bilingual corpus. Proceedings of the 1st International Conference on Web Information Systems and Technologies, WEBIST ’05 (pp. 239–246). INSTICC Press.
Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the Web. Proceedings of LREC 2004 (pp. 1313–1316).
Barzilay, R., & McKeown, K. R. (2001). Extracting paraphrases from a parallel corpus. ACL ’01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (pp. 50–57). Association for Computational Linguistics, Morristown, NJ.
Bharadwaj, R. G., & Varma, V. (2011). Language independent identification of parallel sentences using Wikipedia. Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11 (pp. 11–12), ACM, New York, NY.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
Braschler, M., & Schäuble, P. (1998). Multilingual information retrieval based on document alignment techniques. Research and Advanced Technology for Digital Libraries: Second European Conference, ECDL’98, Heraklion, Crete, Greece, September 21–23, 1998: Proceedings, 183. Springer.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.
Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 17–24). Association for Computational Linguistics, Morristown, NJ.
Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (pp. 161–175).
Chakrabarti, S., Punera, K., & Subramanyam, M. (2002, May). Accelerated focused crawling through online relevance feedback. Proceedings of the 11th International Conference on World Wide Web (pp. 148–159). ACM.
Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1–7), 161–172.
De Bra, P. M. E., & Post, R. D. J. (1994). Information retrieval in the World-Wide Web: Making client-based searching feasible. Computer Networks and ISDN Systems, 27(2), 183–192.
Dimalen, D. M. D., & Roxas, R. (2007). AutoCor: A query based automatic acquisition of corpora of closely-related languages. Proceedings of the 21st PACLIC (pp. 146–154).
Esplà-Gomis, M., & Forcada, M. L. (2010). Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with bitextor. The Prague Bulletin of Mathematical Linguistics, 93, 77–86.
Filatova, E. (2009). Directions for exploiting asymmetries in multilingual Wikipedia. Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3 ’09).
Fung, P., & Cheung, P. (2004). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP ’04 (pp. 57–63), Citeseer.
Gamallo, P., & Garcia, M. (2012). Extraction of bilingual cognates from Wikipedia. Computational Processing of the Portuguese Language (pp. 63–72). Springer.
Ghani, R., Jones, R., & Mladenic, D. (2005). Building minority language corpora by learning to generate web search queries. Knowledge and Information Systems, 7(1), 56–83.
Hassan, A., Fahmy, H., & Hassan, H. (2007). Improving named entity translation by exploiting comparable and parallel corpora. Proceedings of the 2007 Conference on Recent Advances in Natural Language Processing (RANLP), AMML Workshop.
Hersovici, M., Jacovi, M., Maarek, Y. S., Pelleg, D., Shtalhaim, M., & Ur, S. (1998). The sharksearch algorithm—An application: Tailored Web site mapping. Computer Networks and ISDN Systems, 30(1–7), 317–326.
Huang, D., Zhao, L., Li, L., & Yu, H. (2010). Mining large-scale comparable corpora from Chinese-English news collections. Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 472–480). Association for Computational Linguistics.
Ion, R., Tufiş, D., Boroş, T., Ceauşu, A., & Ştefănescu, D. (2010). On-line compilation of comparable corpora and their evaluation. Proceedings of the 7th International Conference Formal Approaches to South Slavic and Balkan Languages (FASSBL7) (pp. 29–34). Croatian Language Technologies Society – Faculty of Humanities and Social Sciences, University of Zagreb, Dubrovnik, Croatia, October 2010.
Kauchak, D., & Barzilay, R. (2006). Paraphrasing for automatic evaluation. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 455–462). Association for Computational Linguistics, Morristown, NJ.
Koehn, P. (2009). Statistical machine translation. Cambridge University Press.
Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate detection using shallow text features. The Third ACM International Conference on Web Search and Data Mining.
Kumano, T., Tanaka, H., & Tokunaga, T. (2007). Extracting phrasal alignments from comparable corpora by using joint probability SMT model. Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07) (pp. 95–103).
Lü, Y., Huang, J., & Liu, Q. (2007, June). Improving statistical machine translation performance by training data selection and optimization. Proceedings of EMNLP-CoNLL 2007 (pp. 343–350).
Marton, Y., Callison-Burch, C., & Resnik, P. (2009). Improved statistical machine translation using monolingually-derived paraphrases. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 381–390). Association for Computational Linguistics.
Mastropavlos, N., & Papavassiliou, V. (2011). Automatic acquisition of bilingual language resources. Proceedings of the 10th International Conference on Greek Linguistics, Komotini, Greece.
Menczer, F., & Belew, R. (2000). Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2–3), 203–242.
Munteanu, D. S., & Marcu, D. (2002). Processing comparable corpora with bilingual suffix trees. EMNLP ’02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (pp. 289–295). Association for Computational Linguistics, Morristown, NJ.
Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.
Munteanu, D. S., & Marcu, D. (2006). Extracting parallel sub-sentential fragments from non-parallel corpora. ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 81–88). Association for Computational Linguistics, Morristown, NJ.
Nakov, P. (2008). Paraphrasing verbs for noun compound interpretation. Proceedings of the Workshop on Multiword Expressions, LREC-2008.
Paramita, M., Clough, P., Aker, A., & Gaizauskas, R. (2012). Correlation between similarity measures for inter-language linked Wikipedia articles. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012) (pp. 790–797), Istanbul, Turkey.
Passerini, A., Frasconi, P., & Soda, G. (2001). Evaluation methods for focused crawling, Lecture Notes in Computer Science 2175, pp. 33–45.
Phan, X. H., Nguyen, L. M., & Horiguchi, S. (2008, April). Learning to classify short and sparse text and web with hidden topics from large-scale data collections. Proceedings of the 17th International Conference on World Wide Web (pp. 91–100). ACM.
Pinkerton, B. (1994). Finding what people want: Experiences with the Web Crawler. Proceedings of the 2nd International World Wide Web Conference.
Preiss, J. (2012). Identifying comparable corpora using LDA. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT ’12) (pp. 558–562). Association for Computational Linguistics, Stroudsburg, PA.
Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 519–526). Association for Computational Linguistics.
Resnik, P. (1998). Parallel strands: A preliminary investigation into mining the web for bilingual text. In D. Farwell, L. Gerber, & E. Hovy (Eds.), Machine Translation and the Information Soup: Third Conference of the Association for Machine Translation in the Americas (AMTA-98), Langhorne, PA, Lecture Notes in Artificial Intelligence 1529, Springer, October, 1998.
Resnik, P. (1999). Mining the web for bilingual text. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 527–534). Association for Computational Linguistics.
Rose, T. G., Stevenson, M., & Whitehead, M. (2002). The Reuters corpus volume 1 – from yesterday’s news to tomorrow’s language resources. Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 827–832).
Sharoff, S., Babych, B., & Hartley, A. (2006). Using comparable corpora to solve problems difficult for human translators. Proceedings of the COLING/ACL on Main Conference Poster Sessions (pp. 739–746). Association for Computational Linguistics, Morristown, NJ.
Simard, M., Foster, G. F., & Isabelle, P. (1993). Using cognates to align sentences in bilingual corpora. In A. Gawman, E. Kidd, & P-Å. Larson (Eds.), Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing (CASCON ’93) (Vol. 2, pp. 1071–1082). IBM Press.
Smith, J. R., Quirk, C., & Toutanova, K. (2010). Extracting parallel sentences from comparable corpora using document level alignment. In NAACL-HLT (pp. 403–411).
Steinberger, R., Pouliquen, B., & Ignat, C. (2005). Navigating multilingual news collections using automatically extracted information. Journal of Computing and Information Technology, 13(4), 257–264.
Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M., & Laurikkala, J. (2008). Focused web crawling in the acquisition of comparable corpora. Information Retrieval, 11(5), 427–445.
Theobald, M., Siddharth, J., & Paepcke, A. (2008). SpotSigs: Robust and efficient near duplicate detection in large web collections. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008).
Tomás, J., Bataller, J., Casacuberta, F., & Lloret, J. (2001). Mining Wikipedia as a parallel and comparable corpus. Language Forum (Vol. 34, No. 1, pp. 123–137). Bahri Publications.
Uszkoreit, J., Ponte, J. M., Popat, A. C., & Dubiner, M. (2010, August). Large scale parallel document mining for machine translation. Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1101–1109). Association for Computational Linguistics.
Yu, K., & Tsujii, J. (2009). Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers (pp. 121–124). Association for Computational Linguistics, Stroudsburg, PA.
Zhao, S., Niu, C., Zhou, M., Liu, T., & Li, S. (2008, June). Combining multiple resources to improve SMT-based paraphrasing model. Proceedings of ACL-08: HLT (pp. 1021–1029). Association for Computational Linguistics, Columbus, OH.
Zhang, Y., Wu, K., Gao, J., & Vines, P. (2006). Automatic acquisition of Chinese-English parallel corpus from the web. Proceedings of 28th European Conference on Information Retrieval ECIR 2006, April 10–12, 2006, London.

Author information

Authors and Affiliations

University of Sheffield, Sheffield, UK
Monica Lestari Paramita, Ahmet Aker, Paul Clough, Robert Gaizauskas & Judita Preiss
Institute for Language and Speech Processing (ILSP), Athens, Greece
Nikos Glaros, Nikos Mastropavlos & Olga Yannoutsou
Research Institute for Artificial Intelligence, Romanian Academy Center for Artificial Intelligence (RACAI), Bucharest, Romania
Radu Ion, Dan Ștefănescu, Alexandru Ceauşu & Dan Tufiș

Corresponding author

Correspondence to Robert Gaizauskas.

Editor information

Editors and Affiliations

Tilde, Riga, Latvia
Inguna Skadiņa
Department of Computer Science, University of Sheffield, Sheffield, UK
Robert Gaizauskas
School of Modern Languages & Cultures, University of Leeds, Leeds, UK
Bogdan Babych
Faculty of Humanities & Social Sciences, University of Zagreb, Zagreb, Croatia
Nikola Ljubešić
Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
Dan Tufiş
Tilde, Riga, Latvia
Andrejs Vasiļjevs

Additional information

Chapter editors: Robert Gaizauskas and Monica Lestari Paramita

About this chapter

Cite this chapter

Paramita, M.L. et al. (2019). Collecting Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_3

DOI: https://doi.org/10.1007/978-3-319-99004-0_3
Published: 07 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science, Computer Science (R0)