Improving Cross-Topic Authorship Attribution: The Role of Pre-Processing

Markov, Ilia; Stamatatos, Efstathios; Sidorov, Grigori

doi:10.1007/978-3-319-77116-8_21

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10762)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing

Abstract

The effectiveness of character n-gram features for representing the stylistic properties of a text has been demonstrated in various independent Authorship Attribution (AA) studies. Moreover, it has been shown that some categories of character n-grams perform better than others under both single-topic and cross-topic AA conditions. In this work, we present an improved algorithm for cross-topic AA. We demonstrate that the effectiveness of the character n-gram representation can be significantly enhanced by performing simple pre-processing steps and appropriately tuning the number of features, especially in cross-topic conditions.
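
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the pre-processing steps, so the digit-masking step below is an assumed example, and the function names (`preprocess`, `char_ngrams`, `build_vocabulary`, `vectorize`) and the `min_freq` threshold are hypothetical. The sketch shows how a simple normalisation applied before n-gram extraction can make topic-specific strings collapse to the same features, and how a frequency threshold controls the number of features.

```python
import re
from collections import Counter


def preprocess(text):
    # Assumed example step (not taken from the paper): mask every digit
    # with the same placeholder so specific numbers stop acting as topic cues.
    return re.sub(r"\d", "0", text)


def char_ngrams(text, n=3):
    # All overlapping character n-grams of the text, in order.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(docs, n=3, min_freq=2):
    # Count n-grams over the pre-processed training documents and keep only
    # those seen at least `min_freq` times; raising `min_freq` shrinks the
    # feature set, which is the tuning knob this sketch exposes.
    counts = Counter()
    for doc in docs:
        counts.update(char_ngrams(preprocess(doc), n))
    return sorted(g for g, c in counts.items() if c >= min_freq)


def vectorize(doc, vocab, n=3):
    # Represent a document as raw n-gram counts over the fixed vocabulary.
    counts = Counter(char_ngrams(preprocess(doc), n))
    return [counts[g] for g in vocab]
```

With this sketch, `preprocess("in 1999")` and `preprocess("in 2017")` yield the same string, so both texts contribute identical n-grams; the resulting count vectors could then be passed to any standard classifier.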

Notes

1. When large sets of HFWs are replaced by distinct symbols, the size of the feature set increases.
2. http://www.nltk.org [last access: 12.01.2017].
3. We also examined a naive Bayes classifier, which produced worse results but similar behaviour (not shown).

Acknowledgments

This work was partially supported by the Mexican Government (CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20161947, 20161958, 20162204, 20162064, 20171813, 20171344, and 20172008).

Author information

Authors and Affiliations

Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
Ilia Markov & Grigori Sidorov
Department of Information and Communication Systems Engineering, University of the Aegean, Karlovassi, Samos, Greece
Efstathios Stamatatos

Corresponding author

Correspondence to Ilia Markov.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Markov, I., Stamatatos, E., Sidorov, G. (2018). Improving Cross-Topic Authorship Attribution: The Role of Pre-Processing. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham. https://doi.org/10.1007/978-3-319-77116-8_21

DOI: https://doi.org/10.1007/978-3-319-77116-8_21
Published: 10 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77115-1
Online ISBN: 978-3-319-77116-8
eBook Packages: Computer Science, Computer Science (R0)