Authorship Attribution with Very Few Labeled Data: A Co-training Approach

Fan, Mengdi; Qian, Tieyun; Chen, Li; Liu, Bin; Zhong, Ming; He, Guoliang

doi:10.1007/978-3-319-08010-9_70

Mengdi Fan²⁰,
Tieyun Qian²⁰,
Li Chen²⁰,
Bin Liu²⁰,
Ming Zhong²⁰ &
…
Guoliang He²⁰

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8485))

Included in the following conference series:

International Conference on Web-Age Information Management

5829 Accesses
2 Citations

Abstract

Authorship attribution refers to the task of identifying the authors of a set of documents. Early studies in this area either used book length texts or assumed that there were a large number of training documents. The focus of modern authorship attribution has been shifted to the analysis on small online texts. This is realistic since in the real life it is hard to collect the training texts. However, the small size of training data makes the authorship attribution much more difficult. In this paper, we present a novel co-training method to iteratively recognize a few unlabeled data to augment the training set. Specifically, each document is first partitioned into two distinct views, i.e., lexical and syntactic view. And then, a two view semi-supervised method, co-training, is adopted to exploit the large amount of unlabeled documents. Our experiment results based on real data show that the proposed method can effectively exploit unlabeled data to improve the classification performance.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Argamon, S., Levitan, S.: Measuring the usefulness of function words for authorship attribution. In: Literary and Linguistic Computing pp. 1–3 (2004)
Google Scholar
Argamon, S., Whitelaw, C., Chase, P., Hota, S.R., Garg, N., Levitan, S.: Stylistic text classification using functional lexical features: Research articles. J. Am. Soc. Inf. Sci. Technol. 58, 802–822 (2007)
Article Google Scholar
Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100 (1998)
Google Scholar
Burrows, J.: All the way through: Testing for authorship in different frequency data. Literary and Linguistic Computing 22, 27–47 (2007)
Article Google Scholar
Diederich, J., Kindermann, J., Leopold, E., Paass, G., Informationstechnik, G.F., Augustin, D.S.: Authorship attribution with support vector machines. Applied Intelligence 19, 109–123 (2000)
Article Google Scholar
Escalante, H.J., Solorio, T., Montes-y Gómez, M.: Local histograms of character n-grams for authorship attribution. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 288–298 (2011)
Google Scholar
Gamon, M.: Linguistic correlates of style: authorship classification with deep linguistic analysis features. In: Proceedings of the 20th International Conference on Computational Linguistics (2004)
Google Scholar
Graham, N., Hirst, G., Marthi, B.: Segmenting documents by stylistic character. Natural Language Engineering 11, 397–415 (2005)
Article Google Scholar
Grieve, J.: Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing 22, 251–270 (2007)
Article Google Scholar
van Halteren, H.: Author verification by linguistic profiling: An exploration of the parameter space. ACM Transactions on Speech and Language Processing 4, 1–17 (2007)
Article Google Scholar
van Halteren, H., Tweedie, F., Baayen, H.: Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing 11, 121–132 (1996)
Article Google Scholar
Hedegaard, S., Simonsen, J.G.: Lost in translation: authorship attribution using frame semantics. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, vol. 2, pp. 65–70. Human Language Technologies (2011)
Google Scholar
Hirst, G., Feiguina, O.: Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing 22, 405–417 (2007)
Article Google Scholar
Hoover, D.L.: Statistical stylistics and authorship attribution: an empirical investigation. Literary and Linguistic Computing 16, 421–424 (2001)
Article Google Scholar
Joachims, T.: (2007), http://www.cs.cornell.edu/people/tj/svm_light/old/svmmulticlass_v2.12.html
Kaster, A., Siersdorfer, S., Weikum, G.: Combining text and linguistic document representations for authorship attribution. In: SIGIR Workshop: Stylistic Analysis of Text for Information Access (STYLE), pp. 27–35 (2005)
Google Scholar
Kim, S., Kim, H., Weninger, T., Han, J., Kim, H.D.: Authorship classification: a discriminative syntactic tree mining approach. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 455–464 (2011)
Google Scholar
Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423–430 (2003)
Google Scholar
Koppel, M., Schler, J.: Authorship verification as a one-class classification problem. In: Proceedings of the Twenty-First International Conference on Machine Learning (2004)
Google Scholar
Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)
Article Google Scholar
Koppel, M., Schler, J., Argamon, S.: Authorship attribution in the wild. Lang. Resources & Evaluation 45, 83–94 (2011)
Article Google Scholar
Kourtis, I., Stamatatos, E.: Author identification using semi-supervised learning. In: Notebook for PAN at CLEF 2011 (2011)
Google Scholar
Li, J., Zheng, R., Chen, H.: From fingerprint to writeprint. Communications of the ACM 49, 76–82 (2006)
Article Google Scholar
Luyckx, K., Daelemans, W.: Authorship attribution and verification with many authors and limited data. In: Proceedings of the 22nd International Conference on Computational Linguistics, pp. 513–520 (2008)
Google Scholar
Maria-Florina, B., Avrim Blum, K.Y.: Co-training and expansion: Towards bridging theory and practice. In: Advances in Neural Information Processing Systems (2004)
Google Scholar
Mosteller, F.W.: Inference and disputed authorship: The Federalist. Addison-Wesley (1964)
Google Scholar
Nigam, K., Analyzing, G.R.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the 9th International Conference on Information and Knowledge Management, pp. 86–93 (2000)
Google Scholar
Sanderson, C., Guenter, S.: Short text authorship attribution via sequence kernels, markov chains and author unmasking: an investigation. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 482–491 (2006)
Google Scholar
Seroussi, Y., Bohnert, F., Zukerman, I.: Authorship attribution with author-aware topic models. In: Proc. of The 50th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 264–269 (2012)
Google Scholar
Seroussi, Y., Zukerman, I., Bohnert, F.: Collaborative inference of sentiments from texts. In: De Bra, P., Kobsa, A., Chin, D. (eds.) UMAP 2010. LNCS, vol. 6075, pp. 195–206. Springer, Heidelberg (2010)
Chapter Google Scholar
Solorio, T., Pillay, S., Raghavan, S., Montes Y Gómez, M.: Modality specific meta features for authorship attribution in web forum posts. In: Proceedings of the 5th International Joint Conference on Natural Language Processing, pp. 156–164 (2011)
Google Scholar
Stamatatos, E.: Ensemble-based author identification using character n-grams. In: Proc. of the 3rd Int. Workshop on Textbased Information Retrieval, pp. 41–46 (2003)
Google Scholar
Stamatatos, E.: Author identification using imbalanced and limited training texts. In: Proc. of the 4th International Workshop on Text-based Information Retrieval, pp. 237–241 (2007)
Google Scholar
Stamatatos, E.: A survey of modern authorship attribution methods. Journal of The American Society for Information Science and Technology 60, 538–556 (2009)
Article Google Scholar
Stamatatos, E., Kokkinakis, G., Fakotakis, N.: Automatic text categorization in terms of genre and author. Comput. Linguist. 26, 471–495 (2000)
Article Google Scholar
Uzuner, Ö., Katz, B.: A comparative study of language models for book and author recognition. In: Proceedings of the 2nd International Joint Conference on Natural Language Processing, pp. 969–980 (2005)
Google Scholar
de Vel, O., Anderson, A., Corney, M., Mohay, G.: Mining email content for author identification forensics. Sigmod Record 30, 55–64 (2001)
Article Google Scholar
Zhao, Y., Zobel, J.: Effective and scalable authorship attribution using function words. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.-H. (eds.) AIRS 2005. LNCS, vol. 3689, pp. 174–189. Springer, Heidelberg (2005)
Chapter Google Scholar
Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society of Information Science and Technology 57, 378–393 (2006)
Article Google Scholar

Download references

Author information

Authors and Affiliations

State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China
Mengdi Fan, Tieyun Qian, Li Chen, Bin Liu, Ming Zhong & Guoliang He

Authors

Mengdi Fan
View author publications
You can also search for this author in PubMed Google Scholar
Tieyun Qian
View author publications
You can also search for this author in PubMed Google Scholar
Li Chen
View author publications
You can also search for this author in PubMed Google Scholar
Bin Liu
View author publications
You can also search for this author in PubMed Google Scholar
Ming Zhong
View author publications
You can also search for this author in PubMed Google Scholar
Guoliang He
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

School of Computing, University of Utah, 50 S. Central Campus Drive, 84112, Salt Lake City,, UT, USA
Feifei Li
Department of Computer Science, Tsinghua University, 100084, Beijing, China
Guoliang Li
POSTECH, Republic of Korea
Seung-won Hwang
Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering,, Shanghai Jiao Tong University, China
Bin Yao
Advanced Digital Sciences Center (ADSC), 138632, Singapore, Singapore
Zhenjie Zhang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Fan, M., Qian, T., Chen, L., Liu, B., Zhong, M., He, G. (2014). Authorship Attribution with Very Few Labeled Data: A Co-training Approach. In: Li, F., Li, G., Hwang, Sw., Yao, B., Zhang, Z. (eds) Web-Age Information Management. WAIM 2014. Lecture Notes in Computer Science, vol 8485. Springer, Cham. https://doi.org/10.1007/978-3-319-08010-9_70

Download citation

DOI: https://doi.org/10.1007/978-3-319-08010-9_70
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08009-3
Online ISBN: 978-3-319-08010-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics