Vietnamese Text’s Writing Styles Based Authorship Identification Model

Dong, Khoa Dang; Nguyen, Dang Tuan

doi:10.1007/978-981-19-8069-5_23


Part of the book series: Communications in Computer and Information Science (CCIS, volume 1688)

Included in the following conference series:

International Conference on Future Data and Security Engineering


Abstract

Authorship identification is a research topic in natural language processing that has attracted growing interest in recent years. Earlier work analyzed texts with a wide variety of feature extraction methods to identify the author of a piece of content. More recently, approaches based on deep learning have been applied to authorship attribution. This paper introduces a new model called ViBert4Author (V4A), a fine-tuned version of the pre-trained PhoBERT language model extended with a dense layer and a softmax layer. A feature extraction method is also used for author classification in Vietnamese literature. In addition, we introduce VN-Literature, a dataset collected with self-developed tools that comprises over 800 works by 8 authors. To evaluate the model, we also ran many tests on English datasets (blogs and emails published on Kaggle) and with pre-trained multilingual models. We give a comprehensive analysis of the advantages and disadvantages of the proposed method. We further evaluate the extraction of additional features (stylometric and hybrid features) across the approaches, using the F1-score as the measure. The results show that the proposed model improves on previous methods, and that the variant combining stylistic features with modern methods performs particularly well.
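
The abstract names three building blocks: a dense layer with softmax on top of an encoder, hybrid (encoder plus stylometric) features, and evaluation by F1-score. The sketch below illustrates how these pieces could fit together in plain Python. It is not the authors' implementation: the encoder embedding is taken as a given input vector rather than computed by PhoBERT, the two stylometric features (mean word length and comma rate) are hypothetical stand-ins, the weights are supplied by the caller rather than trained, and macro averaging of F1 is an assumption, since the abstract does not say which averaging is used.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(features: Sequence[float],
          weights: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """One fully connected layer: one weight row and one bias per author."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def stylometric_features(text: str) -> List[float]:
    """Two toy style features (hypothetical): mean word length, comma rate."""
    words = text.split()
    if not words:
        return [0.0, 0.0]
    mean_len = sum(len(w) for w in words) / len(words)
    comma_rate = text.count(",") / len(words)
    return [mean_len, comma_rate]


def predict_author(embedding: Sequence[float], text: str,
                   weights: Sequence[Sequence[float]],
                   bias: Sequence[float]) -> int:
    """Concatenate encoder and style features, then dense, softmax, argmax."""
    hybrid = list(embedding) + stylometric_features(text)
    probs = softmax(dense(hybrid, weights, bias))
    return max(range(len(probs)), key=probs.__getitem__)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In a real setup the embedding would come from the fine-tuned encoder and the weights from training; the point here is only the data flow from hybrid features to an author label and on to a per-class F1 average.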



Author information

Authors and Affiliations

University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Khoa Dang Dong
Saigon University, Ho Chi Minh City, Vietnam
Dang Tuan Nguyen


Corresponding author

Correspondence to Dang Tuan Nguyen.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Food Industry, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Johannes Kepler University of Linz, Linz, Austria
Josef Küng
Sungkyunkwan University, Seoul, Korea (Republic of)
Tai M. Chung

About this paper

Cite this paper

Dong, K.D., Nguyen, D.T. (2022). Vietnamese Text’s Writing Styles Based Authorship Identification Model. In: Dang, T.K., Küng, J., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2022. Communications in Computer and Information Science, vol 1688. Springer, Singapore. https://doi.org/10.1007/978-981-19-8069-5_23

DOI: https://doi.org/10.1007/978-981-19-8069-5_23
Published: 20 November 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8068-8
Online ISBN: 978-981-19-8069-5
eBook Packages: Computer Science, Computer Science (R0)
