Authorship Attribution with Very Few Labeled Data: A Co-training Approach

  • Conference paper
Web-Age Information Management (WAIM 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8485)

Abstract

Authorship attribution is the task of identifying the authors of a set of documents. Early studies in this area either used book-length texts or assumed that a large number of training documents were available. The focus of modern authorship attribution has shifted to the analysis of short online texts. This setting is realistic, since in real life it is hard to collect training texts. However, the small size of the training data makes authorship attribution much more difficult. In this paper, we present a novel co-training method that iteratively labels a few unlabeled documents and adds them to the training set. Specifically, each document is first partitioned into two distinct views, a lexical view and a syntactic view. A two-view semi-supervised method, co-training, is then adopted to exploit the large amount of unlabeled documents. Our experimental results on real data show that the proposed method can effectively exploit unlabeled data to improve classification performance.
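
The two-view loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the per-view classifier, so a nearest-centroid classifier is swapped in here as a stand-in, and the function names (`centroids`, `predict`, `co_train`), the parameters `rounds` and `per_round`, and the toy feature vectors are all assumptions made for illustration.

```python
from math import dist

def centroids(vectors, labels):
    """Mean vector per label (stand-in for any per-view classifier)."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def predict(cents, vec):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and second-nearest centroid."""
    ranked = sorted((dist(c, vec), lab) for lab, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin

def co_train(view_a, view_b, labels, rounds=10, per_round=1):
    """labels maps document index -> author for the few labeled documents.
    Each round, a classifier trained on each view labels its most confident
    unlabeled documents, which then join the shared training set."""
    labeled = dict(labels)
    for _ in range(rounds):
        unlabeled = [i for i in range(len(view_a)) if i not in labeled]
        if not unlabeled:
            break
        new = {}
        for view in (view_a, view_b):
            idx = list(labeled)
            cents = centroids([view[i] for i in idx], [labeled[i] for i in idx])
            scored = sorted(((predict(cents, view[i]), i) for i in unlabeled),
                            key=lambda t: -t[0][1])
            for (lab, _), i in scored[:per_round]:
                new.setdefault(i, lab)  # first view to claim a document wins
        labeled.update(new)
    return labeled

# Hypothetical toy data: six documents, two authors, one labeled document each.
lexical = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.7, 0.3], [0.3, 0.7]]
syntactic = [[0.2, 0.8], [0.8, 0.2], [0.3, 0.7], [0.7, 0.3], [0.25, 0.75], [0.75, 0.25]]
result = co_train(lexical, syntactic, {0: "A", 1: "B"})
print(result)
```

On this toy data the two seed labels are enough for the loop to assign an author to all six documents. With real texts, the lexical view might hold word or character n-gram counts and the syntactic view parse-derived features, and a stronger classifier would replace the centroid stand-in.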

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Fan, M., Qian, T., Chen, L., Liu, B., Zhong, M., He, G. (2014). Authorship Attribution with Very Few Labeled Data: A Co-training Approach. In: Li, F., Li, G., Hwang, Sw., Yao, B., Zhang, Z. (eds) Web-Age Information Management. WAIM 2014. Lecture Notes in Computer Science, vol 8485. Springer, Cham. https://doi.org/10.1007/978-3-319-08010-9_70

  • DOI: https://doi.org/10.1007/978-3-319-08010-9_70

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08009-3

  • Online ISBN: 978-3-319-08010-9

  • eBook Packages: Computer Science (R0)
