Abstract
In recent decades, the amount of text available for organizational science research has grown tremendously. Despite the availability of text and advances in text analysis methods, many of these techniques remain largely segmented by discipline. Moreover, although a growing number of open-source tools (e.g., R, Python) support text analysis, social science researchers, who likely have limited programming knowledge and exposure to computational methods, cannot easily take advantage of them. In this article, we compare quantitative and qualitative text analysis methods used across the social sciences. We describe basic terminology and the overlooked, but critically important, steps in pre-processing raw text (e.g., selection of stop words; stemming). Next, we provide an exploratory analysis of open-ended responses from a prototypical survey dataset using topic modeling with R. We provide a list of best-practice recommendations for text analysis focused on (1) hypothesis and question formation, (2) design and data collection, (3) data pre-processing, and (4) topic modeling. We also discuss the creation of scale scores for more traditional correlation and regression analyses. All the data are available in an online repository for the interested reader to practice with, along with a reference list for additional reading, an R Markdown file, and an open-source interactive topic model tool (topicApp; see https://github.com/wesslen/topicApp, https://github.com/wesslen/text-analysis-org-science, https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/R4W7ZS).
Notes
Changes from pre-registered protocol: The final sample size (n = 585) was lower than expected (n = 1000), but was dictated by our prespecified budgetary limit. Also, we originally planned to ask participants about their time working with the leader, but dropped the question due to space concerns. Finally, we had planned to examine how occupation related to leader–member exchange (LMX). However, the majority of the occupations had too few respondents (n < 20); with groups this small, there is not adequate power to detect even a small effect (e.g., d = .30). When we aggregated the occupations, the information became redundant with our industry question. Hence, our question about how LMX varied by occupation was dropped.
Start words also exist: in this case, a researcher specifies that only certain words be included in an analysis.
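The contrast between stop words (an exclusion list) and start words (an inclusion list) can be illustrated with a minimal Python sketch. The word lists, the example response, and the function names below are hypothetical and chosen only for illustration; they are not taken from the article's data or from its R code.

```python
import re

# Hypothetical word lists, chosen only for illustration.
STOP_WORDS = {"the", "a", "is", "my", "and", "to", "of"}
START_WORDS = {"leader", "support", "feedback", "trust"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Exclusion list: keep every token that is NOT a stop word."""
    return [t for t in tokens if t not in stop_words]


def keep_start_words(tokens, start_words=START_WORDS):
    """Inclusion list: keep ONLY the tokens that are start words."""
    return [t for t in tokens if t in start_words]


# Hypothetical open-ended survey response.
response = "My leader is a great listener and gives honest feedback"
tokens = tokenize(response)

print(remove_stop_words(tokens))
print(keep_start_words(tokens))
```

With a stop-word list, everything not explicitly excluded survives pre-processing; with a start-word list, only the researcher-specified vocabulary does, which makes the second approach far more restrictive.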
Additional information
We dedicate this article to Jared Borns for his insight, patience, and guidance in the data collection process. We thank the three reviewers at Journal of Business and Psychology as well as John Batchelor, Wenwen Dou, Katherine Frear, Tiffany Gallicano, Andy Loignon, Aaron McKenny, Bob Muenchen, Ernest O’Boyle, Jeremy Short, Anne Smith, Allison Toth, and Christopher Whelpley for their feedback on previous versions of the manuscript and our analysis. The article was pre-registered via the Open Science Framework (https://osf.io/g9wjy/?view_only=045606c4e42843f7b3d131de6d0908d0).
About this article
Cite this article
Banks, G.C., Woznyj, H.M., Wesslen, R.S. et al. A Review of Best Practice Recommendations for Text Analysis in R (and a User-Friendly App). J Bus Psychol 33, 445–459 (2018). https://doi.org/10.1007/s10869-017-9528-3
DOI: https://doi.org/10.1007/s10869-017-9528-3