Abstract
Assessing whether two documents were written by the same author is a crucial task, especially in the Internet age, with possible applications to philology and forensics. The problem has been tackled in the literature using frequency-based approaches, numeric techniques or writing style analysis. Focusing on the last of these perspectives, this paper proposes a novel technique that takes into account the structure of sentences, on the assumption that it is closely related to the author’s writing style. Specifically, a text (or collection of texts) in natural language written by a given author is translated into a set of First-Order Logic descriptions, and a model of the author’s writing habits is obtained by clustering these descriptions. Then, if the models of a known author and of an unknown one overlap, it can be concluded that they are the same person. Among its advantages, the approach does not need a training phase, and it also performs well on short texts and/or small collections.
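The pipeline outlined in the abstract (describe sentences, cluster the descriptions, test the two models for overlap) can be illustrated with a minimal sketch. This is not the paper's implementation: the atom format, the dependency-style labels, the Jaccard similarity, the greedy clustering, the average-link overlap test and the 0.5 threshold are all assumptions chosen only to keep the example small, and every function name is hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Assumption: a sentence is described by a set of ground atoms of the form
# (relation, head_category, dependent_category). The labels are illustrative.
Atom = Tuple[str, str, str]
Description = FrozenSet[Atom]
Cluster = List[Description]


def jaccard(a: Description, b: Description) -> float:
    """Set-overlap similarity; a stand-in for a first-order similarity measure."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(descriptions: List[Description], threshold: float = 0.5) -> List[Cluster]:
    """Greedy clustering: a description joins the first cluster in which it is
    similar enough to every member; otherwise it starts a new cluster."""
    clusters: List[Cluster] = []
    for d in descriptions:
        for c in clusters:
            if all(jaccard(d, m) >= threshold for m in c):
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters


def cluster_similarity(c1: Cluster, c2: Cluster) -> float:
    """Average pairwise similarity between two non-empty clusters."""
    total = sum(jaccard(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def models_overlap(known: List[Cluster], unknown: List[Cluster],
                   threshold: float = 0.5) -> bool:
    """Assumed overlap test: some cluster of one model is close to some
    cluster of the other."""
    return any(cluster_similarity(c1, c2) >= threshold
               for c1 in known for c2 in unknown)


# Toy sentence descriptions (hypothetical data).
s1 = frozenset({("nsubj", "verb", "noun"), ("dobj", "verb", "noun"),
                ("det", "noun", "article")})
s2 = frozenset({("nsubj", "verb", "noun"), ("dobj", "verb", "noun"),
                ("amod", "noun", "adjective")})
s3 = frozenset({("advcl", "verb", "verb"), ("mark", "verb", "conjunction")})
s4 = frozenset({("prep", "noun", "preposition"),
                ("pobj", "preposition", "noun")})

known_model = cluster([s1, s2, s3])
print(models_overlap(known_model, cluster([s2])))
print(models_overlap(known_model, cluster([s4])))
```

No step in the sketch fits parameters to labelled examples, which mirrors the abstract's remark that no training phase is needed. The threshold remains a free parameter that would have to be chosen for real data.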
Acknowledgments
The author would like to thank Fabio Leuzzi and Fulvio Rotella for their contribution to implementing the proposed approach, and for their useful hints in setting up the strategy. This work was partially funded by the Italian PON 2007-2013 project PON02_00563_3489339 “Puglia@Service”.
About this article
Cite this article
Ferilli, S. A sentence structure-based approach to unsupervised author identification. J Intell Inf Syst 46, 1–19 (2016). https://doi.org/10.1007/s10844-014-0349-9
DOI: https://doi.org/10.1007/s10844-014-0349-9