Contextualized BERT Sentence Embeddings for Author Profiling: The Cost of Performances

  • Conference paper
Computational Science and Its Applications – ICCSA 2020 (ICCSA 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12252)

Abstract

Knowing the real identity behind an online subject is a highly relevant issue in User Profiling, especially for analyses based on digital sources such as social media. A user's digital identity does not always expose explicit data about her offline life, such as age, gender, and occupation, which makes user profiling complex and its results incomplete. For many years this issue has received considerable attention from the research community, which has developed several solutions, some based on machine learning, to estimate user characteristics. The growing diffusion of deep learning approaches has, on the one hand, considerably increased predictive performance; on the other hand, it has produced models that cannot be interpreted and that require very high computational power. Given the effectiveness of language models pre-trained on extensive data for many natural language processing and classification tasks, we propose a BERT-based approach (BERT-DNN) for the author profiling task as well. We first compare the results obtained by our model with those of more classical approaches. We then carry out a critical analysis of the advantages and disadvantages of these approaches, including the resources needed to run them. The results obtained by our model are encouraging in terms of reliability but very disappointing in terms of the computational power required to run it.

Notes

  1. https://github.com/UKPLab/sentence-transformers

  2. https://github.com/CyberZHG/keras-self-attention

  3. https://pan.webis.de/clef19/pan19-web/celebrity-profiling.html

  4. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

  5. The source code of the project can be found at the following GitHub repository: https://github.com/marcopoli/ICCSA2020_author_profiling

  6. https://colab.research.google.com/

Acknowledgment

This research has been funded by Regione Puglia under the programme “INNOLABS” Sostegno alla creazione di soluzioni innovative finalizzate a specifici problemi di rilevanza sociale. POR Puglia FESR – FSE 2014-2020. Asse prioritario 1 – Ricerca, sviluppo tecnologico, innovazione. Azione 1.4.b - Project: Feel@Home cod. UIKTJF3.

Author information

Corresponding author

Correspondence to Marco Polignano.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Polignano, M., de Gemmis, M., Semeraro, G. (2020). Contextualized BERT Sentence Embeddings for Author Profiling: The Cost of Performances. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2020. ICCSA 2020. Lecture Notes in Computer Science, vol 12252. Springer, Cham. https://doi.org/10.1007/978-3-030-58811-3_10

  • DOI: https://doi.org/10.1007/978-3-030-58811-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58810-6

  • Online ISBN: 978-3-030-58811-3

  • eBook Packages: Computer Science, Computer Science (R0)
