Age Identification of Twitter Users: Classification Methods and Sociolinguistic Analysis

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2016)

Abstract

In this article, we address the problem of identifying the age of Twitter users from their online text. We used a set of text mining, sociolinguistically based, and content-related text features, and we evaluated a number of well-known and widely used machine learning classification algorithms to examine their suitability for this task. The experimental results showed that the Random Forest algorithm offered the best performance, achieving an accuracy of 61%. We ranked the classification features by informativeness using the ReliefF algorithm, and we analyzed the results in terms of the sociolinguistic principles of age-related linguistic variation.
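
As a rough illustration of the feature-ranking step, the sketch below implements basic two-class Relief (Kira and Rendell), the simpler ancestor of ReliefF, in plain Python. It is not the authors' pipeline: the toy data, the class labels "younger" and "older", and the feature names feature_a and feature_b are hypothetical, and ReliefF itself additionally averages over k nearest neighbours and handles multiple classes.

```python
def relief_weights(X, y):
    """Basic two-class Relief over every instance (assumes each instance has
    at least one same-class and one other-class neighbour)."""
    n, d = len(X), len(X[0])

    # Per-feature value range, used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        same = [k for k in range(n) if k != i and y[k] == y[i]]
        other = [k for k in range(n) if y[k] != y[i]]
        hit = min(same, key=lambda k: distance(X[i], X[k]))    # nearest same-class
        miss = min(other, key=lambda k: distance(X[i], X[k]))  # nearest other-class
        for j in range(d):
            # Reward features that separate classes, penalise those that
            # vary within a class.
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n
    return weights


# Hypothetical toy data: feature_a tracks the class, feature_b does not.
feature_names = ["feature_a", "feature_b"]
X = [[0.10, 0.9], [0.20, 0.1], [0.15, 0.5],
     [0.90, 0.8], [0.80, 0.2], [0.85, 0.5]]
y = ["younger", "younger", "younger", "older", "older", "older"]

weights = relief_weights(X, y)
ranking = sorted(zip(feature_names, weights), key=lambda p: p[1], reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

Features with higher weights are those whose values differ more between an instance and its nearest neighbour of the other class than between it and its nearest neighbour of the same class, which is the sense of "informativeness" a Relief-family ranking captures.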

Author information

Corresponding author

Correspondence to Vasiliki Simaki.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Simaki, V., Mporas, I., Megalooikonomou, V. (2018). Age Identification of Twitter Users: Classification Methods and Sociolinguistic Analysis. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_30

  • DOI: https://doi.org/10.1007/978-3-319-75487-1_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75486-4

  • Online ISBN: 978-3-319-75487-1

  • eBook Packages: Computer Science, Computer Science (R0)
