Authorship Attribution With Few Training Samples

Iqbal, Farkhund; Debbabi, Mourad; Fung, Benjamin C. M.

doi:10.1007/978-3-030-61675-5_6

Part of the book series: International Series on Computer Entertainment and Media Technology (ISCEMT)

Abstract

This chapter discusses authorship attribution when only a few training samples are available. The problem addressed here differs in two ways from the traditional authorship identification problem discussed in the earlier chapters of this book. Firstly, traditional authorship attribution studies [63, 65] only work when each candidate author has a training set large enough to build a classification model. In this chapter, the emphasis is on using only a few training samples for each suspect. In some scenarios no training samples exist at all, and the suspects may be asked (usually through court orders) to produce a writing sample for investigation purposes. Secondly, in traditional authorship studies the goal is to attribute a single anonymous document to its true author. In this chapter, we look at cases where more than one anonymous message needs to be attributed to the true author(s). A perpetrator may, for example, create a ghost e-mail account or hack an existing account and then use it to send illegitimate messages while remaining anonymous. To address these shortfalls, the authorship attribution problem is redefined as follows: given a collection of anonymous messages potentially written by a set of suspects {S₁, ..., Sₙ}, a cybercrime investigator first wants to identify the major groups of messages based on stylometric features; intuitively, each message group is written by one suspect. The investigator then wants to identify the author of each group of anonymous messages from among the given candidate suspects. To address the newly defined problem, the stylometric pattern-based approach of AuthorMiner1 (described previously in Sect. 5.4.1) is extended and called AuthorMinerSmall. When applying this approach, the stylometric features are first extracted from the given anonymous message collection Ω.
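
The two-step problem stated in the abstract (group the anonymous messages by writing style, then assign each group to a suspect) can be illustrated with a minimal Python sketch. This is not AuthorMinerSmall, whose pattern-based details are not given in this abstract. The feature set, the function names (`features`, `cluster`, `attribute`), the k-means-style grouping, and the toy messages are all assumptions made only for illustration.

```python
import math
import string
from collections import Counter

# Hypothetical, deliberately tiny feature set; the chapter's actual
# stylometric features are not listed in the abstract.
FUNCTION_WORDS = ("the", "and", "of", "to", "i", "you")

def features(text):
    """Map one message to a small stylometric feature vector."""
    words = [t.strip(string.punctuation) for t in text.lower().split()]
    words = [w for w in words if w]
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    counts = Counter(words)
    vec = [
        sum(len(w) for w in words) / n_words / 10.0,            # scaled average word length
        sum(c in string.punctuation for c in text) / n_chars,   # punctuation ratio
        sum(c.isupper() for c in text) / n_chars,               # upper-case ratio
    ]
    vec += [counts[w] / n_words for w in FUNCTION_WORDS]        # function-word frequencies
    return vec

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster(vectors, k, iterations=20):
    """Stand-in grouping step: k-means with deterministic farthest-first seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(dist(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: dist(v, centroids[j])) for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids

def attribute(anonymous, suspects, k):
    """Group the anonymous messages, then give each group to the nearest suspect profile."""
    labels, centroids = cluster([features(m) for m in anonymous], k)
    profiles = {name: mean([features(s) for s in samples])
                for name, samples in suspects.items()}
    owner = [min(profiles, key=lambda name: dist(c, profiles[name])) for c in centroids]
    return [owner[lab] for lab in labels]

# Invented toy data: four anonymous messages and one sample per suspect.
anonymous = [
    "SEND ME THE MONEY NOW!!!",
    "kindly forward the requested documentation at your earliest convenience.",
    "PAY UP NOW OR ELSE!!!",
    "please review the attached quarterly statement regarding outstanding invoices.",
]
suspects = {
    "alice": ["I WANT THE CASH TODAY!!!"],
    "bob": ["we appreciate your continued cooperation regarding the pending transaction."],
}
result = attribute(anonymous, suspects, k=2)
```

With these toy inputs, `result` holds one suspect name per anonymous message, in input order. Clustering before attribution lets all messages in a group contribute to the comparison with a suspect's few samples, rather than matching each short message on its own.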

References

R. Zheng, J. Li, H. Chen, Z. Huang, A framework for authorship identification of online messages: writing-style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 57(3), 378–393 (2006)
O. De Vel, A. Anderson, M. Corney, G. Mohay, Mining e-mail content for author identification forensics. ACM SIGMOD Rec. 30(4), 55–64 (2001)
B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using frequent itemsets, in Proceedings of the 2003 SIAM International Conference on Data Mining (2003), pp. 59–70
A. Kulkarni, T. Pedersen, Name discrimination and email clustering using unsupervised clustering and labeling of similar contexts, in IICAI (2005), pp. 703–722
Enron Email Dataset. [Online]. https://www.cs.cmu.edu/~enron/. Accessed 5 May 2020
F. Iqbal, H. Binsalleeh, B.C.M. Fung, M. Debbabi, Mining writeprints from anonymous e-mails for forensic investigation. Digit. Investig. 7(1–2), 56–64 (2010)
H. Li, D. Shen, B. Zhang, Z. Chen, Q. Yang, Adding semantics to email clustering, in Sixth International Conference on Data Mining, 2006. ICDM’06 (2006), pp. 938–942
J.A. Hartigan, M.A. Wong, Algorithm AS 136: a k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 28(1), 100–108 (1979)
A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–22 (1977)
Scam Victim Stories, Scammer’s Exposed (2017). [Online]. https://scammer419.wordpress.com/scam-victim-stories/
R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases. ACM SIGMOD Rec. 22(2), 207–216 (1993)
F. Iqbal, R. Hadjidj, B.C.M. Fung, M. Debbabi, A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digit. Investig. 5, S42–S51 (2008)
B. Larsen, C. Aone, Fast and effective text mining using linear-time document clustering, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1999), pp. 16–22
Author information

Authors and Affiliations

Zayed University, Abu Dhabi, United Arab Emirates
Farkhund Iqbal
School of Engineering & Computer Science, Concordia University, Montreal, QC, Canada
Mourad Debbabi
School of Information Studies, McGill University, Montreal, QC, Canada
Benjamin C. M. Fung

About this chapter

Cite this chapter

Iqbal, F., Debbabi, M., Fung, B.C.M. (2020). Authorship Attribution With Few Training Samples. In: Machine Learning for Authorship Attribution and Cyber Forensics. International Series on Computer Entertainment and Media Technology. Springer, Cham. https://doi.org/10.1007/978-3-030-61675-5_6

DOI: https://doi.org/10.1007/978-3-030-61675-5_6
Published: 05 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61674-8
Online ISBN: 978-3-030-61675-5
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
