Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1688))

Abstract

Authorship identification is a research topic in natural language processing that has attracted growing interest in recent years. Earlier work relied on a wide variety of feature extraction methods to identify the author of a text, and more recently deep learning approaches have been applied to authorship attribution. This paper introduces a new model called ViBert4Author (V4A), a fine-tuned version of the pre-trained PhoBERT language model extended with a dense layer and a softmax layer. The model is used as a feature extraction method for author classification in Vietnamese literature. We also introduce VN-Literature, a dataset collected with self-developed tools that comprises over 800 works by 8 authors. To evaluate the model further, we ran experiments on English datasets (blogs and emails published on Kaggle) and tested pre-trained multilingual models. We give a comprehensive analysis of the advantages and disadvantages of the proposed method. In addition, we evaluate the contribution of additional features (stylometric and hybrid features), using the F1-score as the measure. The results show that the proposed model improves on previous methods, and that the model combining stylistic features with modern methods achieves outstanding performance.
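As a rough illustration of the hybrid setup described in the abstract, the sketch below concatenates a sentence embedding with a few stylometric features, passes the result through one dense layer followed by softmax, and scores predictions with F1. It is not the authors' implementation: the three style markers, the placeholder embedding standing in for a PhoBERT output, the weights, and the choice of macro averaging are all assumptions made for illustration.

```python
import math

def stylometric_features(text):
    """Three simple style markers (illustrative, not the paper's feature set)."""
    words = text.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    avg_word_len = sum(len(w) for w in words) / n_words
    type_token_ratio = len({w.lower() for w in words}) / n_words
    punct_ratio = sum(ch in ".,;:!?" for ch in text) / n_chars
    return [avg_word_len, type_token_ratio, punct_ratio]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_softmax(features, weights, biases):
    """One dense layer (one weight row per author) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical usage: a 2-dimensional placeholder stands in for the encoder output.
embedding = [0.2, -0.1]
hybrid = embedding + stylometric_features("Xin chào, thế giới!")
probs = dense_softmax(hybrid, weights=[[0.1] * 5, [-0.1] * 5], biases=[0.0, 0.0])
```

In a real pipeline the placeholder embedding would come from the fine-tuned encoder and the dense layer's weights would be learned; the point here is only how stylistic and learned features can share one classification head.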

Notes

  1. https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus

  2. https://www.cs.cmu.edu/~enron

  3. https://fasttext.cc/docs/en/crawl-vectors.html

Author information

Corresponding author

Correspondence to Dang Tuan Nguyen.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Dong, K.D., Nguyen, D.T. (2022). Vietnamese Text’s Writing Styles Based Authorship Identification Model. In: Dang, T.K., Küng, J., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2022. Communications in Computer and Information Science, vol 1688. Springer, Singapore. https://doi.org/10.1007/978-981-19-8069-5_23

  • DOI: https://doi.org/10.1007/978-981-19-8069-5_23

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8068-8

  • Online ISBN: 978-981-19-8069-5

  • eBook Packages: Computer Science (R0)
