Author Clustering with an Adaptive Threshold

  • Conference paper
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2017)

Abstract

This paper describes and evaluates an unsupervised author clustering model called Spatium. The strategy adapts readily to different natural languages (such as Dutch, English, and Greek) and to different text genres (newspaper articles, reviews, excerpts of novels, etc.). As features, we suggest using the m most frequent terms of each text (isolated words and punctuation symbols, with m set to at most 200). Applying a distance measure, we determine whether there is enough evidence that two texts were written by the same author. The evaluations are based on six test collections (PAN Author Clustering task at CLEF 2016). A more detailed analysis shows the strengths of our approach, but also indicates its problems and provides reasons for some of the potential failures of the Spatium model.
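The feature selection and pairwise decision described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the distance measure or say how the threshold is derived, so the L1 (Manhattan) distance over relative frequencies, the tokenizer, and the names `profile`, `l1_distance`, and `same_author` are all assumptions made for illustration. The threshold is passed in as a plain parameter rather than adapted to the collection.

```python
import re
from collections import Counter

# Assumed tokenizer: isolated words, or single punctuation symbols.
TOKEN = re.compile(r"\w+|[^\w\s]")


def profile(text, m=200):
    """Relative frequencies of the m most frequent terms of one text."""
    tokens = TOKEN.findall(text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    # most_common(m) yields nothing for an empty text, so no division by zero.
    return {term: n / total for term, n in counts.most_common(m)}


def l1_distance(text_a, text_b, m=200):
    """L1 distance over the m most frequent terms of text_a.

    Assumption: the terms are selected from text_a only, so the
    measure is not symmetric in its two arguments.
    """
    freq_a = profile(text_a, m)
    tokens_b = TOKEN.findall(text_b.lower())
    counts_b = Counter(tokens_b)
    total_b = len(tokens_b) or 1
    return sum(abs(p - counts_b[t] / total_b) for t, p in freq_a.items())


def same_author(text_a, text_b, threshold, m=200):
    """True when the distance falls below the (hypothetical) threshold."""
    return l1_distance(text_a, text_b, m) < threshold
```

A call such as `same_author(doc_1, doc_2, threshold=0.5)` would then flag a pair as a candidate link between two documents; the value 0.5 is arbitrary and would have to be tuned or derived from the distribution of distances in the collection at hand.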



Acknowledgments

The authors want to thank the task coordinators for their valuable effort to promote test collections in authorship attribution. This research was supported, in part, by the NSF under Grant #200021_149665/1.

Author information

Corresponding author

Correspondence to Mirco Kocher.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Kocher, M., Savoy, J. (2017). Author Clustering with an Adaptive Threshold. In: Jones, G., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2017. Lecture Notes in Computer Science, vol 10456. Springer, Cham. https://doi.org/10.1007/978-3-319-65813-1_19

  • DOI: https://doi.org/10.1007/978-3-319-65813-1_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-65812-4

  • Online ISBN: 978-3-319-65813-1

  • eBook Packages: Computer Science, Computer Science (R0)
