Automatic Authorship Attribution for Texts in Croatian Language Using Combinations of Features

Reicher, Tomislav; Krišto, Ivan; Belša, Igor; Šilić, Artur

doi:10.1007/978-3-642-15390-7_3


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6277)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems


Abstract

In this work we investigate the use of various character-level, lexical, and syntactic features, and their combinations, in automatic authorship attribution. Since the majority of text representation features are language specific, we examine their application to texts written in the Croatian language. Our work differs from similar work in at least three respects. First, we use a slightly different set of features than previously proposed. Second, we use four different data sets and compare the same features across those data sets to draw stronger conclusions. The data sets consist of articles, blogs, books, and forum posts written in Croatian. Finally, we employ a classification method based on a strong classifier: we use Support Vector Machines to learn classifiers that achieve excellent results for longer texts, namely 91% accuracy and F₁ measure for blogs, 93% accuracy and F₁ for articles, and 99% accuracy and F₁ for books. Experiments conducted on forum posts show that more complex features need to be employed for shorter texts.
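
As an illustration of what combining character-level and lexical features can look like, the following minimal Python sketch concatenates character n-gram frequencies, function-word frequencies, and punctuation frequencies into a single feature vector. It is not the feature set used in the paper, which the abstract does not specify: the function-word list, the punctuation list, the n-gram vocabulary, and all function names are hypothetical choices made for this example. Syntactic features are omitted because they would require a part-of-speech tagger for Croatian.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lists chosen for illustration only;
# they are NOT the lists used in the paper.
FUNCTION_WORDS = ["i", "u", "je", "se", "na", "da", "za", "od"]
PUNCTUATION = [".", ",", "!", "?", ";", ":"]


def char_ngram_freqs(text, n, vocabulary):
    """Relative frequency of each character n-gram listed in `vocabulary`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams) or 1  # avoid division by zero for very short texts
    return [counts[g] / total for g in vocabulary]


def function_word_freqs(text):
    """Relative frequency of each function word among all word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def punctuation_freqs(text):
    """Occurrences of each punctuation mark per character of text."""
    total = len(text) or 1
    return [text.count(p) / total for p in PUNCTUATION]


def combined_features(text, ngram_vocabulary, n=3):
    """Concatenate the three feature groups into one flat vector."""
    return (char_ngram_freqs(text.lower(), n, ngram_vocabulary)
            + function_word_freqs(text)
            + punctuation_freqs(text))


if __name__ == "__main__":
    vocabulary = ["je ", " da"]  # hypothetical character trigram vocabulary
    vector = combined_features("Ovo je test, i to je sve.", vocabulary)
    print(len(vector), vector)
```

Vectors built this way, one per document, could then be passed to any Support Vector Machine implementation together with the author labels. In practice the n-gram vocabulary would be selected from the training texts, for example by keeping the most frequent n-grams.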



Author information

Authors and Affiliations

Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000, Zagreb, Croatia
Tomislav Reicher, Ivan Krišto, Igor Belša & Artur Šilić


Editor information

Editors and Affiliations

School of Engineering, The Parade, Cardiff University, CF24 3AA, Cardiff, UK
Rossitza Setchi
Dept. of Computer Science and Software Engineering, Buckingham Building, Lion Terrace, University of Portsmouth, PO1 3HE, Portsmouth, UK
Ivan Jordanov
KES International, 145-157 St. John Street, EC1V 4PY, London, UK
Robert J. Howlett
School of Electrical and Information Engineering, University of South Australia, Adelaide, Mawson Lakes Campus, 5095, SA, Australia
Lakhmi C. Jain


About this paper

Cite this paper

Reicher, T., Krišto, I., Belša, I., Šilić, A. (2010). Automatic Authorship Attribution for Texts in Croatian Language Using Combinations of Features. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15390-7_3


DOI: https://doi.org/10.1007/978-3-642-15390-7_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15389-1
Online ISBN: 978-3-642-15390-7
eBook Packages: Computer Science, Computer Science (R0)
