A sentence structure-based approach to unsupervised author identification

Journal of Intelligent Information Systems

Abstract

Assessing whether two documents were written by the same author is a crucial task, especially in the Internet age, with possible applications to philology and forensics. The problem has been tackled in the literature using frequency-based approaches, numeric techniques, or writing style analysis. Focusing on the last of these, this paper proposes a novel technique that takes into account the structure of sentences, on the assumption that it is closely related to the author’s writing style. Specifically, a text (or collection of texts) in natural language written by a given author is translated into a set of First-Order Logic descriptions, and a model of the author’s writing habits is obtained by clustering these descriptions. Then, if the models of a known author and of an unknown one overlap, it can be concluded that they are the same person. Among its advantages, this approach needs no training phase and performs well even on short texts and/or small collections.
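The pipeline outlined in the abstract (describe sentences, cluster the descriptions, test whether two models overlap) can be illustrated with a toy sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration. Each sentence is reduced to a set of variable-free atoms with hypothetical relation labels, Jaccard similarity stands in for the similarity measure between First-Order Logic descriptions, clustering is a greedy single pass, and the threshold and the overlap criterion are arbitrary.

```python
from typing import FrozenSet, List, Tuple

Atom = Tuple[str, ...]           # hypothetical structural atom, e.g. ("subj", "verb", "noun")
Description = FrozenSet[Atom]    # one sentence described as a set of atoms


def jaccard(a: Description, b: Description) -> float:
    """Stand-in similarity: shared atoms divided by all atoms."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(descriptions: List[Description], threshold: float) -> List[List[Description]]:
    """Greedy single-pass clustering (an assumption, not the paper's algorithm).

    A description joins the first cluster whose representative (its first
    member) is similar enough; otherwise it starts a new cluster.
    """
    clusters: List[List[Description]] = []
    for d in descriptions:
        for c in clusters:
            if jaccard(d, c[0]) >= threshold:
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters


def models_overlap(known: List[List[Description]],
                   unknown: List[List[Description]],
                   threshold: float) -> bool:
    """Assumed overlap test: some pair of cluster representatives is similar enough."""
    return any(jaccard(k[0], u[0]) >= threshold for k in known for u in unknown)


# Invented sentence descriptions for a known author.
known = [
    frozenset({("subj", "verb", "noun"), ("obj", "verb", "noun"), ("mod", "noun", "adj")}),
    frozenset({("subj", "verb", "noun"), ("obj", "verb", "noun")}),
    frozenset({("subj", "verb", "pron"), ("comp", "verb", "verb")}),
]
# Two invented unknown texts: one structurally close to the known author, one not.
same_style = [frozenset({("subj", "verb", "noun"), ("obj", "verb", "noun"), ("mod", "noun", "adj")})]
other_style = [frozenset({("prep", "noun", "noun"), ("conj", "adj", "adj")})]

THRESHOLD = 0.6  # arbitrary illustrative value
known_model = cluster(known, THRESHOLD)
print(models_overlap(known_model, cluster(same_style, THRESHOLD), THRESHOLD))
print(models_overlap(known_model, cluster(other_style, THRESHOLD), THRESHOLD))
```

No step in the sketch fits parameters to labelled examples, which mirrors the abstract's point that no training phase is needed. The quality of the decision would depend entirely on the sentence descriptions and the similarity measure, both of which are placeholders here.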

Acknowledgments

The author would like to thank Fabio Leuzzi and Fulvio Rotella for their contribution in implementing the proposed approach, and for the useful hints in setting up the strategy. This work was partially funded by the Italian PON 2007-2013 project PON02_00563_3489339 “Puglia@Service”.

Author information

Corresponding author

Correspondence to Stefano Ferilli.

Cite this article

Ferilli, S. A sentence structure-based approach to unsupervised author identification. J Intell Inf Syst 46, 1–19 (2016). https://doi.org/10.1007/s10844-014-0349-9

  • DOI: https://doi.org/10.1007/s10844-014-0349-9
