hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene

Ljubešić, Nikola; Erjavec, Tomaž

doi:10.1007/978-3-642-23538-2_50

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6836)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a standard “Web as Corpus” pipeline, modified to account for the limited amount of available web data. The paper describes these modifications, focusing on content extraction from HTML pages, which combines high precision of the extracted language content with decent recall. It also investigates the text types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.

Author information

Authors and Affiliations

Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
Nikola Ljubešić
Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Tomaž Erjavec

Editor information

Editors and Affiliations

Department of Computer Sciences, University of West Bohemia, Univerzitní 22, 306 14, Pilsen, Czech Republic
Ivan Habernal
Faculty of Applied Sciences, Dept. of Computer Science and Engineering, University of West Bohemia, Univerzitni 8, 306 14, Pilsen, Czech Republic
Václav Matoušek

Cite this paper

Ljubešić, N., Erjavec, T. (2011). hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene. In: Habernal, I., Matoušek, V. (eds) Text, Speech and Dialogue. TSD 2011. Lecture Notes in Computer Science, vol 6836. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23538-2_50

DOI: https://doi.org/10.1007/978-3-642-23538-2_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23537-5
Online ISBN: 978-3-642-23538-2
eBook Packages: Computer Science, Computer Science (R0)
