Abstract
In this work we investigate the use of character-, lexical-, and syntactic-level features, and their combinations, in automatic authorship attribution. Since most text representation features are language specific, we examine their application to texts written in Croatian. Our work differs from similar work in at least three respects. First, we use a slightly different set of features than previously proposed. Second, we use four different data sets and compare the same features across them to draw stronger conclusions. The data sets consist of articles, blogs, books, and forum posts written in Croatian. Finally, we employ a classification method based on a strong classifier: we use Support Vector Machines to train classifiers that achieve excellent results for longer texts: 91% accuracy and F1 measure for blogs, 93% accuracy and F1 for articles, and 99% accuracy and F1 for books. Experiments conducted on forum posts show that more complex features need to be employed for shorter texts.
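As a rough illustration of what combining character-level and lexical-level features can look like, the following is a minimal sketch, not the authors' implementation. The function-word list, the n-gram size, the tokenization, and all function names are assumptions made for this example; the abstract does not specify the actual feature set, and it does not say how F1 is averaged (macro-averaging is assumed here). The resulting vectors are the kind of input that could be passed to an SVM implementation, which is left out of the sketch.

```python
from collections import Counter

# Hypothetical, very small list of Croatian function words, for illustration
# only; the feature set used in the paper is not given in the abstract.
FUNCTION_WORDS = ["i", "u", "je", "se", "na", "da"]


def char_ngram_freqs(text, n=3):
    """Relative frequencies of the character n-grams found in `text`."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def function_word_freqs(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each listed function word among all tokens."""
    # Naive whitespace tokenization with punctuation stripped (an assumption).
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


def combine(text, ngram_vocabulary, n=3):
    """Concatenate character-level and lexical-level features into one vector."""
    ngrams = char_ngram_freqs(text, n)
    char_part = [ngrams.get(gram, 0.0) for gram in ngram_vocabulary]
    return char_part + function_word_freqs(text)


def accuracy(gold, predicted):
    """Fraction of documents whose predicted author matches the true author."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of per-author F1 scores (averaging scheme assumed)."""
    scores = []
    for label in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    vector = combine("Ona je na moru.", ngram_vocabulary=["na ", "je "])
    print(len(vector), vector)
```

Short texts such as forum posts yield few n-grams and few tokens, so relative frequencies of this kind become sparse and noisy, which is consistent with the abstract's observation that shorter texts need more complex features.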
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Reicher, T., Krišto, I., Belša, I., Šilić, A. (2010). Automatic Authorship Attribution for Texts in Croatian Language Using Combinations of Features. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15390-7_3
DOI: https://doi.org/10.1007/978-3-642-15390-7_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15389-1
Online ISBN: 978-3-642-15390-7
eBook Packages: Computer Science, Computer Science (R0)