Psychotherapy involves goal-directed conversations where people are able to explore their emotions, experiences, and distress. For over a century, researchers and practitioners have consistently acknowledged the central role emotions play in psychotherapy (Freud & Breuer, 1895; Lane, Ryan, Nadel, & Greenberg, 2015). Emotion is directly involved in psychotherapeutic process and outcome, including the formation of the therapeutic alliance (Chui, Hill, Kline, Kuo, & Mohr, 2016; Safran & Muran, 2000), client decision making (Bar-On, Tranel, Denburg, & Bechara, 2004; Isen, 2008), behavior change (Lang & Bradley, 2010), personality style (Mischel, 2013), and happiness (which of course, is an emotion; Diener, Scollon, & Lucas, 2009). More generally, emotion is implicated in human memory (Schacter, 1999), and its expression and perception are building blocks of empathy (Elliott, Bohart, Watson, & Greenberg, 2011; Zaki, Bolger, & Ochsner, 2008). The particular role of emotion in different psychotherapy theories varies from accessing and releasing suppressed emotions (as in psychoanalysis; e.g., Kohut, 2013), to identifying the impact of cognitions on emotions (as in cognitive behavioral therapy; Beck, Rush, Shaw & Emery, 1979), to deepening understanding and accepting emotions as a fundamental part of life (i.e., Emotion Focused Therapy; Greenberg, 2015). Researchers have employed various methods to examine the relationship between communication of emotions, therapy processes, and outcomes. Many of these methodologies focus on examining emotional valence, as processing and experiencing of both positive and negative affect is often a crucial component of therapy that spans different treatment modalities (Sloan & Kring, 2007).

Self-report ratings are often used to study client and therapist emotion (e.g., the Positive and Negative Affect Schedule [PANAS]; Crawford & Henry, 2004; the Self-Assessment Manikin [SAM]; Bradley & Lang, 1994). For example, the PANAS has been used to assess positive and negative affect of clients, which are important indicators of symptom improvement for clients with anxiety and depression (e.g., Kring, Persons, & Thomas, 2007; Watson, 2005). Similarly, researchers have used the SAM to assess the processing of traumatic memories in clients engaged in exposure-based treatments (Sloan, Marx, & Epstein, 2005). Therapist detection of client emotional valence is important, as lower therapist empathic accuracy for positive emotions can predict greater symptom severity in subsequent sessions (Atzil-Slonim et al., 2018). Similarly, researchers have also used observer ratings of different theorized emotional processes to detect positively and negatively valenced emotions (e.g., see Kramer, Pascual-Leone, Despland, & de Roten, 2015; Kramer, Pascual-Leone, Rohde, & Sachse, 2016; Luedke, Peluso, Diaz, Freund, & Baker, 2017). Across client self-report and observer measures of emotion, a recent meta-analysis found that both client and therapist emotional expression were associated with improved treatment outcomes (Peluso & Freund, 2018). Furthermore, researchers have utilized physiological measures such as skin conductance (Messina et al., 2013) and vocal tone (Imel et al., 2014) to index emotional arousal in psychotherapy.

While self-report measures, behavioral coding, and physiological measures have been important methods for understanding the degree, arousal, and valence of emotional expression, each has problems as a means of assessing emotion: (1) Self-report measures are easy to obtain and allow access to internal private experiences, but they rely on retrospective client or therapist insight (see Joormann & Stanton, 2016 for a review of emotion reporting in individuals with depression) and do not capture moment-to-moment fluctuations in emotion during a session. (2) Observer ratings can be more objective and can capture these finer-grained changes during a session, but they require time-intensive coding (see Gottman, Swanson, & Swanson, 2002 for an example of coding procedures). (3) Physiological measures have historically been impractical, and they typically capture only the magnitude of arousal rather than emotional valence (e.g., Juslin & Scherer, 2005; Imel et al., 2014). Thus, until very recently, the exploration of emotion in psychotherapy has been limited by the lack of methods for examining emotion during sessions in direct and scalable ways. However, recent advances in machine learning and natural language processing provide potential solutions that facilitate the coding of emotion during psychotherapy (Gonçalves, Araújo, Benevenuto, & Cha, 2013) and may better account for within-session valence that has otherwise gone unexplored.

Natural language processing and sentiment analysis

Machine learning is a field of computer science concerned with creating algorithms that allow computers to learn patterns from input data without being explicitly programmed with large collections of manually developed rules (see Samuel, 1962). Machine learning has provided innovative and critical methodologies to support various domains of mental health research (Aafjes-van Doorn, Kamsteeg, Bate, & Aafjes, 2020). For example, machine learning algorithms have been applied to session notes to assess treatment of post-traumatic stress disorder among veterans (Shiner et al., 2013). Furthermore, researchers have used network analysis to examine symptom clusters of individuals whose depression and anxiety symptoms relapsed or went into remission (Lorimer, Delgadillo, Kellett, & Brown, 2019). Similarly, machine learning algorithms have been used to estimate alliance-outcome associations for individual patients (Rubel, Zilcha-Mano, Giesemann, Prinz, & Lutz, 2020). The emergence of research focusing on machine learning has laid the foundation for exploring different ways of examining and improving mental health care.

Natural language processing (NLP) is a subfield of machine learning whose goal is to computationally “learn, understand, and produce human language content” (Hirschberg & Manning, 2015, p. 261; Hladka & Holub, 2015). For example, researchers have implemented automated speech analysis and machine learning methods to predict the onset of schizophrenia (Bedi et al., 2015) and have produced language in the form of conversational dialogue (Vinyals & Le, 2015). NLP techniques have already been used to extract topics of conversation between therapists and clients (Atkins et al., 2012; Imel, Steyvers, & Atkins, 2015) and to examine the empathy of therapists (Xiao, Imel, Georgiou, Atkins, & Narayanan, 2015). Currently, a major focus in NLP is developing methods that correctly identify emotion-related phenomena in passages using only the written words—often called sentiment analysis in computer science (for a review, see Pang & Lee, 2008). This field is broad, encompassing classification of emoticons (Read, 2005), the tone of movie reviews (Socher, Pennington, Huang, Ng, & Manning, 2011), and product reviews (Nasukawa & Yi, 2003).

A common approach before the more widespread usage of NLP techniques was to rely solely on hand-compiled lists of positive or negative words (e.g., texts with more positive words have more positive sentiment; Baccianella, Esuli, & Sebastiani, 2010; see Linguistic Inquiry and Word Count; LIWC; Pennebaker, Francis, & Booth, 2001). Word-count-based programs such as LIWC have been utilized to investigate the relationship of word usage in populations with mental health diagnoses. For people suffering from depression, research has shown first-person-singular pronoun usage to be positively correlated with symptoms of depression following treatment (Zimmermann, Brockmeyer, Hunn, Schauenburg, & Wolf, 2017). Similarly, another study demonstrated that for participants diagnosed with anorexia and bulimia nervosa, usage of first-person-singular pronouns during the recall of negative memories was positively correlated with self-reported symptoms of depression and anxiety (Brockmeyer et al., 2015). LIWC is a dictionary-based classification method, whereby the emotion word categories are based on a list of 915 positive and negative affect words (Pennebaker, Booth, Boyd, & Francis, 2015). LIWC has been utilized in a variety of ways, including understanding people’s emotional reaction to the terrorist attack on September 11, 2001 (Cohn, Mehl, & Pennebaker, 2004), illustrating changes in emotional expression in published books over time (Acerbi, Garnet, Lampos, & Bentley, 2013), and predicting elections with Twitter data (Tumasjan, Sprenger, Sandner, & Welpe, 2010). Emotion word dictionaries can identify positive and negative affect at a level competitive with human coding of emotional responses (Tausczik & Pennebaker, 2010; see also Kahn, Tobin, Massey, & Anderson, 2007). These dictionary-based techniques benefit from simplicity and interpretability, but require researchers to compile the word lists to create a comprehensive inventory of all positive and negative words. In addition, this technique does not allow a model to improve with more data.
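To make the dictionary-based approach concrete, the following minimal Python sketch counts hits against small, hypothetical positive and negative word lists (stand-ins for illustration only, not the actual LIWC dictionary) and assigns a coarse label based on which count is larger.

```python
# Toy illustration of a dictionary-based sentiment count in the spirit of LIWC.
# The word lists are hypothetical stand-ins, not the actual LIWC dictionary.
import re

POSITIVE_WORDS = {"happy", "hope", "love", "calm", "proud"}
NEGATIVE_WORDS = {"sad", "angry", "afraid", "hopeless", "hurt"}

def dictionary_sentiment(text: str) -> str:
    """Label text positive/negative/neutral by counting dictionary hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(dictionary_sentiment("I am feeling very sad and hopeless lately."))  # negative
```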

Since the creation of dictionary-based programs, a number of new methods have been developed for performing text analysis. Using datasets in which sentences have been labeled by humans as positive or negative, researchers have begun to use statistical techniques to model how the presence of words and phrases changes the probability of a passage being labeled as positive, negative, or neutral (Jurafsky & Martin, 2008; Gonçalves et al., 2013; Pak & Paroubek, 2015). In practice, statistical NLP methods have been shown to be superior to lexical-based dictionary methods such as LIWC (Gonçalves et al., 2013), which are typically used by psychology researchers. For example, Bantum and Owen (2009) demonstrated that, when analyzing an Internet-based psychological intervention for women with breast cancer, LIWC overidentified emotional expression in comparison with human raters. As larger collections of text labeled for sentiment have become available, NLP researchers have begun to use models that rely on advanced machine learning methods to identify sentiment (e.g., recursive neural networks; see the data analysis section for a description; Socher et al., 2013).

Detecting emotion in psychotherapy with natural language processing

At present, text-based methods for evaluating emotion in psychotherapy are reliant on dictionary-based methods. Mergenthaler (1996) was one of the first researchers to create a quantitative method for measuring emotional expression in psychotherapy. Mergenthaler and Bucci (1999) hypothesized that key moments in the psychotherapy process involved client expression of both high emotional content and high verbal abstraction. To test this hypothesis, Mergenthaler used a dictionary-based method similar to LIWC; that is, a list of words that expressed either positive or negative emotional tones that were specific to psychotherapy-based text (Mergenthaler, 1996). In a similar study, Anderson, Bein, Pinnell, and Strupp (1999) assessed therapist verb usage in high versus low affect segments in therapy sessions and found that therapists who used more cognitively oriented verbs in high affect sessions had worse outcomes. Similar to the limitations of LIWC, described above, these methods use a priori identification of positive and negative words, as opposed to empirical measurement learned from human ratings.

There are existing, publicly available tools that use statistical NLP tools to rate the valence of passages of text. However, these tools have been trained on immense text corpora obtained from domains other than psychotherapy such as classic literature (Liu & Zhang, 2012; Qiu, Liu, Bu, & Chen, 2011; Yussupova, Bogdanova, & Boyko, 2012), news articles (see Pang & Lee, 2008 for a list of databases), and social media text (for examples, see Bohlouli, Dalter, Dronhofer, Zenkert, & Fathi, 2015; Gokulakrishnan, Priyanthan, Ragavan, Prasath, & Perera, 2012; Pak & Paroubek, 2015). Additionally, researchers have used a variety of models to harvest data from the Internet, including a live feed of tweets and posts from social media outlets such as Twitter and Facebook (Bohlouli et al., 2015). Given the availability of these trained models, it is reasonable to wonder whether it might be useful to simply utilize one of these models to label emotion in psychotherapy—emotion in one domain is not necessarily different from emotion in another. However, this seems unlikely, as “domain adaptation” is a major subfield of machine learning, wherein researchers explore whether models developed in one domain can be meaningfully applied to another (e.g., Is a sentence parser developed on newspaper articles accurate with conversational text; see Dredze et al., 2007). Pang and Lee (2008) have argued that sentiment analysis is quite likely domain-specific. For example, if one were reviewing a movie and wrote that “the movie was very effective emotionally, deeply sad,” the review might be rated as a very positive statement. But in a therapy session, the word “sad” would be more likely to be used in the context “I am feeling very sad.” Moreover, there are many emotion-relevant words that might be extremely rare in other datasets, but are very common in psychotherapy. For example, “Zoloft” (an antidepression medication) may never occur in a movie review corpus, but it is said 381 times in the collection of transcripts we use in the current study (http://alexanderstreet.com/). Moreover, psychotherapy text comes from spoken language, not written communication. Modeling strategies that work well on written text may perform poorly on spoken language (Jurafsky & Martin, 2008). For example, methods that require that language be structured based upon grammar (i.e., parse tree) may have difficulties analyzing disfluencies, or fillers and fragments that occur frequently in dialogue. Virtually all of the databases for training sentiment analysis models are written, and none come from mental health domains.

With large and highly flexible deep neural networks, performance improvements are limited not by model selection but by the quantity of labeled training data. To address this issue, researchers have investigated how to extract meaningful representations from unlabeled textual data. Some early work achieved reasonable success by structuring the problem as learning a representation for words based on their context (e.g., Mikolov, Sutskever, Chen, Corrado, & Dean, 2013; Pennington, Socher, & Manning, 2014). Recently, strides have been made in combining a number of these ideas and advances in NLP into one system—Bidirectional Encoder Representations from Transformers (BERT; Devlin, Chang, Lee, & Toutanova, 2018). BERT utilizes massive quantities of unlabeled data to learn useful representations of language and linguistic concepts by masking portions of the input and trying to predict which word was in fact masked. As such, BERT can learn powerful representations of human language from billions of sentences. This massive pre-training makes it possible to fine-tune BERT on specific tasks by introducing only minor task-specific modifications to the model while leveraging the knowledge acquired through extensive pre-training, and it allows BERT to outperform traditional models.
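As an illustration of the masked-word pre-training objective, the Hugging Face transformers library exposes a fill-mask interface to a pre-trained BERT model. This is purely illustrative and is not the TensorFlow Hub checkpoint used in the present study (described in the Method).

```python
# Illustration of BERT's masked-word pre-training objective using the
# Hugging Face transformers library (an illustrative assumption; the study
# itself used a TensorFlow Hub checkpoint, described in the Method section).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the hidden word from its bidirectional context.
for prediction in fill_mask("I have been feeling very [MASK] since the divorce."):
    print(prediction["token_str"], round(prediction["score"], 3))
```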

Given the abundance of linguistic information present in psychotherapy transcripts, modern NLP techniques have the potential to become highly efficient alternatives to relying solely on human ratings to gather emotion data present during psychotherapy. Statistical programs designed to conduct sentiment analysis can consume hundreds of hours of session data and produce analyses almost immediately, whereas behavioral coding often takes months and requires substantial human resources. Mental health researchers have already demonstrated the capacity to develop and train more complex NLP models (see Imel et al., 2019). These methods may allow researchers to explore new and more complex questions about the role of emotion in psychotherapy and how it interacts with other psychotherapy processes (e.g., alliance, cultural discussions) and client outcomes (e.g., distress indices, satisfaction, dropout).

In summary, emotion is clearly important to the process of psychotherapy, but empirical research on this subject has been lacking. One of the primary reasons is methodological shortcomings: self-report is easy to obtain but coarse, and lexical methods have been limited to dictionary-based techniques, which can be insensitive to context. Advances within the field of NLP have provided methods that can improve the way emotions are measured in a psychotherapy session. In an effort to use NLP to rate emotion in psychotherapy, Tanana et al. (2016) compiled nearly 100,000 human-labeled utterances and developed, tested, and compared four different sentiment models. However, that initial study was brief, its models have not been compared to an existing psychotherapy dictionary-based model, and they require an update with emergent NLP technology. In the current study, we extended the findings of Tanana et al. (2016) by comparing those sentiment models to LIWC and to an innovative NLP model, BERT. We hypothesized that the prior NLP models from Tanana et al. (2016) would outperform LIWC, and that BERT would outperform all models.

Method

Data sources

The raw data was obtained from a large corpus of psychotherapy transcripts that were published by Alexander Street Press (http://alexanderstreet.com). Tanana et al. (2016) provides only a brief description of the data and coding procedure, so the following description provides more detail. The original transcripts come from a variety of different therapists (e.g., Carl Rogers, Albert Ellis) and theoretical perspectives (person-centered, rational emotive, psychodynamic, experiential/humanistic, and cognitive behavioral), and also include drug therapy/medication management sessions. Data was from real therapy sessions that were anonymized for confidentiality. The data included patient demographics, number of sessions, patient symptoms, and general topic of each therapy session. These transcripts are available through a public university library subscription. As a result, they can be accessed more easily than typical psychotherapy datasets (see Imel et al., 2014 for additional details). The Alexander Street Press corpus has been utilized in prior research to train machine learning models on psychotherapy data (e.g., Xiao et al., 2015; Xiao, Huang, Imel, Atkins, Georgiou, & Narayanan, 2016; Tanana, Hallgren, Imel, Atkins, & Srikumar, 2016; Gaut, Steyvers, Imel, Atkins, & Smyth, 2017).

At the time of developing the initial sentiment model, there were 2354 session transcripts available, with 514,118 talk turns. The dataset includes speaker-identified talk turns, which are continuous periods where one speaker talks until the other speaker interrupts or responds. Before sampling from the dataset, we segmented talk turns on the punctuation indicating sentence boundaries (e.g., periods, exclamation, and question marks indicated by the transcriber). We refer to these discrete units as utterances. We also excluded any talk turns that were shorter than 15 characters (a large part of the dataset consists of short filler text like “mm-hmm,” “yeah,” “ok” that are typically neutral in nature). We retained nonverbal indicators that were transcribed, like “(laugh)” or “(sigh),” because they might be useful indicators of the sentiment of the sentence. We randomly sampled 97,497 (19%) from the entire dataset of utterances that met the criteria for length, without any stratification by session.
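A minimal sketch of this preprocessing, with illustrative function and variable names of our own choosing, might look as follows: talk turns are split at sentence-final punctuation, utterances shorter than 15 characters are dropped, and transcribed nonverbal indicators such as “(laugh)” are retained.

```python
# A minimal sketch of the preprocessing described above: split talk turns into
# utterances at sentence-final punctuation, drop very short utterances, and keep
# transcribed nonverbal indicators such as "(laugh)". Names are illustrative.
import re

MIN_LENGTH = 15  # characters, as in the exclusion rule described above

def talk_turn_to_utterances(talk_turn: str) -> list[str]:
    # Split after periods, exclamation points, and question marks (keeping them).
    pieces = re.split(r"(?<=[.!?])\s+", talk_turn.strip())
    return [piece for piece in pieces if len(piece) >= MIN_LENGTH]

print(talk_turn_to_utterances(
    "Mm-hmm. I just feel stuck, you know? (sigh) It has been a hard week."
))
# ['I just feel stuck, you know?', '(sigh) It has been a hard week.']
```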

Tanana et al. (2016) used naïve coders to identify the basic valence of emotion (e.g., positive, negative, neutral) in a large corpus of utterances from psychotherapy sessions. Valence has been identified as a primary component of emotion; according to the widely researched circumplex theory of emotion, emotions can be conceptualized as occurring on a continuum of positive and negative valence and high and low arousal (Russell, 2003). Focusing on valence allowed for comparison with existing computer science models of sentiment (Pang & Lee, 2008) and with positive and negative emotion categories in dictionary-based programs (Tausczik & Pennebaker, 2010). Naïve coding was utilized because previous research suggests that naïve coders are a viable alternative for identifying basic aspects of emotion such as valence, and they require less training than expert coders. Naïve coders are used almost exclusively in the field of computer science for tasks involving the coding of positive/negative emotion in text (Pang & Lee, 2008). In the field of psychology, naïve coders who received little training in identifying common emotions have been found to have adequate interrater reliability (Albright, Kenny, & Malloy, 1988; Ambady & Rosenthal, 1993) and to produce ratings similar to those of trained coders (Waldinger, Schulz, Hauser, Allen, & Crowell, 2004).

Amazon Mechanical Turk (MTurk) workers were recruited to code the dataset for sentiment. Researchers have found that workers on MTurk are more diverse than typical college samples, as well as other types of internet samples (Buhrmester, Kwang, & Gosling, 2011). Workers were limited to individuals in the United States so that ratings came only from US English speakers, reducing variability in the ratings. In addition, workers were all required to be “master”-certified by the system, meaning they had a track record of successfully performing other tasks. Each utterance was presented with a set of seven others that were all rated at the same time (though all were selected randomly and were not in order). Workers were told that the utterances came from transcripts of spoken dialogue, and as a result were sometimes messy, but that they were to try their best to rate each one. For each rating, workers were given the following five options: negative, somewhat negative, neutral, somewhat positive, or positive. Each utterance in the main dataset was rated by one person.

From the overall collection of 97,497 ratings, utterances were randomly split into training, development, and test subsets. This is a standard approach in machine learning to prevent overfitting the model to the training data. Tanana et al. (2016) allocated 60% of the data to the training set (58,496 ratings), 20% to the development set (19,503), and 20% to the test set (19,498). The training set was used to estimate model parameters, and the development set was used to periodically monitor performance and compare model variations on data that was not used for training. Because repeated evaluation on the development set can itself capitalize on chance, the model was run only once on the test set so that the final performance estimate would accurately represent how the model would perform on similar, unseen data (Hastie, Tibshirani, & Friedman, 2009).
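A minimal sketch of such a 60/20/20 split using scikit-learn is shown below, with placeholder data rather than the actual 97,497 rated utterances; applying train_test_split twice first holds out 40% of the data and then divides that portion evenly into development and test sets.

```python
# A sketch of the 60/20/20 split using scikit-learn; the lists below are
# placeholders for the 97,497 rated utterances and their sentiment labels.
from sklearn.model_selection import train_test_split

utterances = [f"utterance {i}" for i in range(100)]
labels = ["neutral"] * 60 + ["positive"] * 20 + ["negative"] * 20

# First hold out 40% of the data, then split that 40% evenly into dev and test.
train_x, rest_x, train_y, rest_y = train_test_split(
    utterances, labels, test_size=0.40, random_state=42)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, random_state=42)

print(len(train_x), len(dev_x), len(test_x))  # 60 20 20
```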

Interrater dataset

In addition to the main dataset, where one worker rated each utterance, a separate dataset was created in which a random selection of 100 utterances were each rated by 75 workers (i.e., a single utterance was rated 75 times by different people). The purpose of this dataset was (1) to estimate the interrater reliability of human coding of sentiment in psychotherapy, and (2) to estimate the distribution of sentiment ratings for different utterances, providing a direct estimate of the inherent uncertainty in making judgments about sentiment. Several decisions were made that resulted in a more conservative—though we would argue more accurate—estimate of interrater reliability. In contrast to many other studies identifying emotions from therapy sessions or text (e.g., Herrmann et al., 2016; Greenberg et al., 2007; Bantum & Owen, 2009; Choi, Pos, & Magnusson, 2016), the sample of utterances was not restricted to those predetermined to have high emotional content. Moreover, utterances with little content (less than 15 characters), which prior work suggests raters reliably agree are neutral in valence, were removed; including them would have artificially inflated reliability. Finally, interrater reliability was examined at the utterance, rather than the session, level, using the first four raters and the intraclass correlation coefficient (ICC; Shrout & Fleiss, 1979). Using a two-way random effects model for absolute agreement, treating the data as ordinal, ICC = .81 (95% CI [.74, .86]).
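The ICC reported above was not computed with the code below; as one illustration, a two-way random-effects ICC for absolute agreement can be obtained from long-format ratings with the pingouin library. The data frame here is a toy example, not the study’s rating data.

```python
# One possible way to compute a two-way random-effects ICC for absolute agreement
# using the pingouin library (illustrative; not the original analysis code).
import pandas as pd
import pingouin as pg

# Long-format toy data: one row per (utterance, rater) pair, ratings coded 1-5.
df = pd.DataFrame({
    "utterance": sorted(list(range(1, 7)) * 4),
    "rater": ["a", "b", "c", "d"] * 6,
    "rating": [2, 2, 3, 2,  4, 5, 4, 4,  3, 3, 3, 2,
               1, 2, 1, 1,  5, 4, 5, 5,  3, 4, 3, 3],
})

icc = pg.intraclass_corr(data=df, targets="utterance", raters="rater", ratings="rating")
print(icc[icc["Type"] == "ICC2"])  # two-way random effects, absolute agreement, single rater
```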

In this study, we report the interrater reliability for individual utterances. It should be noted that this is a very different approach from other studies that typically report on ratings aggregated over longer passages or time periods (Auszra et al., 2013; Bantum & Owen, 2009; Greenberg et al., 2007; Herrmann et al., 2014), or entire sessions (Denecke & Deng, 2015). As a result, the interrater reliability may appear to be lower than other studies. Despite this choice, our interrater reliability remained in the moderate range. This finding suggests that we should not expect perfect performance from sentiment analysis models because not even humans completely agree on these types of ratings. However, it is reasonable to expect the best models to approach this level of performance.

Procedure

For the current study, we compared LIWC and BERT (described below) to the previous four NLP models from Tanana et al. (2016)—unigram, bigram, trigram, and recursive neural net (RNN) models. We provide a more comprehensive description of each model below than the original study to aid with the comparison of otherwise complex models. Prior results from Tanana et al. (2016) are presented in Table 1, along with results from the current study.

Table 1 Model performance on test set

Features/predictors

A common strategy for classifying language into categories in NLP is to break a sentence into what are called n-grams (for a comprehensive tutorial, see Jurafsky and Martin, 2008). N-grams represent individual vocabulary words and short phrases as a set of indicator variables, which are often several thousand in total. As a result, this type of NLP model represents any sentence as a very large list of possible n-grams (i.e., sparse vector). We tested several n-gram combinations in this study. N-grams were created by parsing on word boundaries, without separating out contractions. For example, the word “don’t” would be left as a single gram. Each model was tested with (1) unigram features, (2) unigram + bigram features, and (3) unigram, bigram, and trigram features. (Note: in the results, when we write “tri-gram,” this is short for uni-grams + bi-grams + tri-grams).
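As an illustration (in Python with scikit-learn, rather than the R tooling used in the original study), the following sketch converts two utterances into sparse unigram + bigram + trigram count features while keeping contractions such as “don’t” as single tokens.

```python
# A minimal sketch of turning utterances into sparse unigram + bigram + trigram
# features with scikit-learn (the original study used R; this is illustrative).
from sklearn.feature_extraction.text import CountVectorizer

utterances = [
    "I don't feel hopeful anymore.",
    "Things have been going really well this week.",
]

# token_pattern keeps contractions such as "don't" as a single token,
# mirroring the parsing rule described above.
vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"[a-zA-Z']+")
features = vectorizer.fit_transform(utterances)

print(features.shape)                          # (2, number of distinct n-grams)
print(vectorizer.get_feature_names_out()[:5])  # a few of the learned n-grams
```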

Classifier models

N-gram models

Tanana et al. (2016) used a maximum entropy (MaxEnt) classifier model (Jaynes, 1990). This is a predictive model that is mathematically identical to logistic or multinomial regression. Specifically, we used the n-grams present in a given sentence to estimate the probability that a new sentence fits into one of three categories (negative, neutral, positive). The original variable had five categories (very negative, negative, neutral, positive, and very positive), but was reduced due to the lack of usage of the very negative and very positive codes. However, in this situation there are several important differences between the use of logistic regression and what might be typical in a classic statistical scenario. First, the number of possible predictors in a typical NLP scenario is much, much larger than in a typical statistical analysis (e.g., 40,374 predictors for the unigram model). Second, in a typical statistical scenario that most psychology readers would be familiar with, logistic regression is fit simultaneously (though iteratively) to the entire dataset at once, which would take a prohibitive amount of time with so many predictors. To solve this problem, an approach called stochastic gradient descent (SGD; see Bottou, 2012) was used, which considers the error for one case, or a small number of cases, makes a small update to the model parameters, and then iterates throughout the entire training set. Third, with NLP, the statistical significance of any one predictor is not of interest, so no confidence intervals are estimated around each of the features. Finally, when estimating a model with thousands of predictors, it is easy to overfit the training data. Overfitting refers to a model that learns how to classify a phenomenon accurately in the training data set, but performs poorly on new and unseen data. To solve this problem, researchers use what is called a regularizer. Instead of training a model only to minimize error on the training set, the model makes a compromise between minimizing error and minimizing the size of each parameter. This helps prevent the model from overfitting the training data and improves performance on new data (Tibshirani, 1996). All MaxEnt models were estimated using R version 3.3.2, with the RTextTools package (Jurka, Collingwood, Boydstun, Grossman, & van Atteveldt, 2013) and the maxent package (Jurka et al., 2013). Tanana et al. (2016) found that although the trigram model had slightly higher accuracy, it was outperformed on kappa by the unigram model (kappa = 0.308).
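A rough scikit-learn analogue of this setup (multinomial logistic regression over n-gram features, fit with stochastic gradient descent and an L2 regularizer) is sketched below. It is illustrative only, with placeholder training data, and is not the R code used to estimate the original models.

```python
# A rough scikit-learn analogue of the MaxEnt approach described above:
# multinomial logistic regression over n-gram features, fit with stochastic
# gradient descent and an L2 regularizer. Illustrative only; the original
# models were estimated in R with the RTextTools and maxent packages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

train_texts = [
    "I feel really good about the progress we've made.",
    "I'm so angry and hurt by what happened.",
    "We met on Tuesday at the clinic.",
]
train_labels = ["positive", "negative", "neutral"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),           # unigram features
    SGDClassifier(loss="log_loss", penalty="l2"),  # logistic loss + L2 regularization
)
model.fit(train_texts, train_labels)
print(model.predict(["I am feeling very sad."]))
```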

Recursive neural network (RNN)

Tanana et al. (2016) also tested a version of the sentiment model from the Stanford NLP toolkit (Manning et al., 2014), which was based on an RNN, but was trained on movie reviews instead of psychotherapy transcripts (Socher et al., 2013). This model used neural networks organized in a tree structure to predict the sentiment of a sentence from the bottom of the tree (words) up (combinations of words). This model was one of the best-performing models in the NLP literature that was publicly available and could be tested on the dataset. Thus, it is a state-of-the-art NLP model for sentiment, but one that had not been adapted to the psychotherapy domain. The RNN was outperformed by the n-gram models, and in particular the unigram model. See Fig. 1 for an example of how a therapy session annotated for sentiment appears.

Fig. 1 Example therapy session annotated for sentiment

Comparison models

We compared the performance of the four MaxEnt models trained on the Alexander Street Press dataset to two other models: (1) the LIWC coding method, which is commonly used in the psychology literature (Tausczik & Pennebaker, 2010) and is similar in function to dictionary-based methods that have been used in psychotherapy (Mergenthaler, 1996), and (2) an innovative deep learning approach called Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2018), also trained on written text (e.g., English Wikipedia).

Linguistic Inquiry and Word Count (LIWC; Pennebaker, Francis, & Booth, 2001)

LIWC, as mentioned in the literature review, uses the frequency of words in a document to classify the text on a number of different dimensions (e.g., affect, cognition, biological processes). We used the positive and negative emotion dimensions to categorize whether a statement was generally positive, negative, or neutral. The LIWC coding system does not give any specific guidelines of how to turn the continuous rating into the categories of positive, negative, and neutral (Tausczik & Pennebaker, 2010). To create these categories, we subtracted the negative emotion dimension from the positive dimension. Any statement with a positive value was then classified as positive, a negative value as negative, and a zero value as neutral.
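The following minimal sketch restates this categorization rule; the positive and negative emotion scores shown are placeholders rather than actual LIWC output.

```python
# A minimal sketch of the categorization rule described above: subtract the LIWC
# negative-emotion score from the positive-emotion score and map the sign to a
# class. The scores shown are placeholders, not actual LIWC output.
def liwc_to_category(posemo: float, negemo: float) -> str:
    difference = posemo - negemo
    if difference > 0:
        return "positive"
    if difference < 0:
        return "negative"
    return "neutral"

print(liwc_to_category(posemo=4.2, negemo=1.1))  # positive
print(liwc_to_category(posemo=0.0, negemo=0.0))  # neutral
```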

Bidirectional Encoder Representations from Transformers (BERT)

Finally, in order to test a more recent innovation in NLP, we used Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2018). BERT is a type of deep neural network that attends to other words in a particular sentence depending on the current state of the network (i.e., an attention mechanism). The model is composed of 12 stacked transformer layers and was initially trained on the Books Corpus (800 million words; Zhu et al., 2015) and English Wikipedia (2.5 billion words). By leveraging prior language knowledge from these vast text corpora, BERT provides a highly capable language representation model. It should be noted that BERT is not pretrained on labeled datasets; it simply learns from unlabeled English text by masking random words in a sentence and trying to predict which word best fits each blank.

We used a pre-trained version of BERT that can be downloaded by other researchers from TensorFlow Hub (https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1). This model comes with a tokenizer that uses byte-pair encoding (BPE) to convert words into discrete numeric representations. We used a process known as “fine-tuning,” where we begin with the pre-trained BERT model weights, but then allow them to change as they learn from our dataset. For the experiments in this paper, we allowed all layers of the model to learn from the data (not just the final classification layer). In using BERT to predict our sentiment classes, we used a softmax classifier stacked above the 768 hidden units that were output above the <cls> token (a special token that signals to the model that we are performing a classification task). Our training used the Adam optimizer (Kingma & Ba, 2014). We used 10% dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) as a regularizer to prevent our model from overfitting the data. The model was trained for 9140 steps with a batch size of 32. We used a 10% linear warm-up period to condition the model to the new dataset before moving on to a larger learning rate. The warm-up period followed a linear schedule of (step/warmup_period * lr). Over the course of 914 steps, the learning rate increased linearly from 2.2e−8 to 2e−5; training then continued with the Adam optimizer using beta_1 = 0.9, beta_2 = 0.999, a weight decay of 0.01, and an epsilon of 1e−6.
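To make the learning-rate schedule concrete, the following pure-Python restatement reproduces the linear warm-up described above (step / warmup_period × peak rate for the first 914 of 9140 steps). It is an illustration, not the original training code.

```python
# A pure-Python restatement of the linear warm-up schedule described above
# (step / warmup_period * peak_lr for the first 914 of 9140 steps), included
# only to make the schedule concrete; this is not the original training code.
TOTAL_STEPS = 9140
WARMUP_STEPS = 914   # 10% linear warm-up
PEAK_LR = 2e-5

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        return (step / WARMUP_STEPS) * PEAK_LR
    return PEAK_LR   # constant afterwards, with Adam handling per-parameter adaptation

print(learning_rate(1))     # ~2.2e-08, the starting rate reported above
print(learning_rate(914))   # 2e-05, the peak rate
print(learning_rate(5000))  # 2e-05
```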

Evaluation

All of the models were evaluated on how well they predicted the coarse human-generated sentiment labels: “positive,” “negative,” and “neutral.” We used several metrics: (1) Overall accuracy in predicting labels, which is the number of correctly predicted labels divided by the total number of predictions. (2) The F1 score for each of the individual labels. F1 is a common metric in machine learning that is designed to be a compromise between recall and precision. Recall is the percentage of utterances with a given label that the model correctly identifies. Precision is the percentage of utterances the model assigns to a label that actually belong to that label. F1 is the harmonic mean of these two metrics. (3) Cohen’s kappa (weighted), which gives a measure of correctly identified labels after correcting for chance guessing. Because the base rate for neutral was high in our dataset, the kappa metric gives the best overall measure of the performance of these models. Although accuracy is reported, kappa is a better metric because, in our dataset, an accuracy of .59 could be achieved simply by guessing the neutral class for every utterance.
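For illustration, all three metrics can be computed with scikit-learn as sketched below; the labels shown are toy values, not the study’s actual predictions.

```python
# A minimal sketch of the three evaluation metrics using scikit-learn
# (illustrative labels, not the study's actual predictions).
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

true_labels = ["neutral", "neutral", "positive", "negative", "neutral", "positive"]
predicted   = ["neutral", "positive", "positive", "negative", "neutral", "neutral"]

print(accuracy_score(true_labels, predicted))                       # overall accuracy
print(f1_score(true_labels, predicted, average=None,
               labels=["negative", "neutral", "positive"]))         # per-label F1
print(cohen_kappa_score(true_labels, predicted, weights="linear"))  # weighted kappa
```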

Results

As noted above, models were trained on the training subset and tuned on the development subset of the data. Results are reported for a single, final evaluation on the test subset, which none of the models were allowed to train on. Table 1 shows the relative performance of the different models on the test set. All of the MaxEnt models significantly outperformed the LIWC model, and the RNN and LIWC models both had lower accuracy than all of the MaxEnt models. BERT performed significantly better than any of the other models tested, with an overall kappa of .48.

In order to compare our model performance with human reliability, we computed the average pairwise Cohen’s kappa for coarse (positive, negative, neutral) human sentiment ratings. The average kappa for this set was .42. Based on this computation, our best model exceeded average human performance on the test set by 14%. It may be surprising that a model can exceed human performance on this task, but note that we are comparing to the average human–human agreement (some rater pairs had agreement as high as .54). Moreover, the model we tested has exceeded the performance of individual humans on a number of other NLP tasks (Devlin et al., 2018).
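For illustration, average pairwise kappa can be computed by applying Cohen’s kappa to every pair of raters and averaging, as in the following sketch with a toy ratings matrix (raters × utterances) rather than the study’s data.

```python
# A minimal sketch of the average pairwise Cohen's kappa computation, with an
# illustrative ratings matrix (raters x utterances), not the study's data.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

ratings = [  # each row: one rater's coarse labels for the same utterances
    ["neutral", "positive", "negative", "neutral", "positive"],
    ["neutral", "positive", "neutral",  "neutral", "positive"],
    ["positive", "positive", "negative", "neutral", "neutral"],
]

pairwise = [cohen_kappa_score(a, b) for a, b in combinations(ratings, 2)]
print(sum(pairwise) / len(pairwise))  # average pairwise kappa
```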

Discussion

This study compared existing sentiment models (Tanana et al., 2016) with the LIWC coding system, as well as with an innovative deep learning technique, BERT (Devlin et al., 2018). We found that the newer NLP method (BERT), which can leverage large existing language datasets, outperformed the prior n-gram and RNN models from Tanana et al., as well as the commonly used dictionary model LIWC. In practice, statistical NLP methods have been shown to be superior to lexical-based dictionary methods such as LIWC (Gonçalves et al., 2013). At present, psychotherapy researchers have been restricted to dictionary-based attempts to model emotion with linguistic data. Our results suggest that these dictionary-based methods (e.g., LIWC) perform similarly to models trained on other domains (e.g., the RNN), and that the linguistic detection of emotion in psychotherapy can be improved by newer methods that leverage both prior knowledge about language in general and training on an in-domain dataset (BERT). BERT also outperformed the prior n-gram models, suggesting that modeling context, and potentially the variety of training text available, improves ratings of emotion. For example, one major difficulty for sentiment analysis methods is contrastive conjunctions (Socher et al., 2013): passages that contain two clauses with opposite sentiment, such as “I sometimes like my boyfriend, but I’ve had it with this relationship.” Dictionary-based methods and n-gram models may have difficulty with these types of passages and may over- or underestimate the sentiment present.

Psychotherapy is often an emotional process, and many theories of psychotherapy involve hypotheses about emotional expression as a potential catalyst for change. However, the methodologies available to explore these processes have been limited. One important reason for this gap in the literature is that it is time-consuming and expensive for human coders to rate every utterance in a session for emotional expression. Additionally, some emotion-coding systems typically used in psychotherapy science (e.g., LIWC) are expensive programs and may not be widely utilized due to financial restrictions. In general, our results indicate that modern machine learning methods perform better at predicting sentiment in psychotherapy dialogue than the dictionary-based methods like LIWC that have been used to evaluate psychotherapy, as well as models trained on other domains. The methods presented here could serve as free, open-source solutions for emotion coding in psychotherapy, and thus as a competitive alternative to traditional methods. These results extend current sentiment analysis research within the psychotherapy speech domain (e.g., Tanana et al., 2016) and provide methods for continued innovation in the field.

Limitations

Our paper relies on a broad conceptualization of emotion wherein raters were simply asked to rate the positive or negative sentiment of a set of text. However, it is unclear how the sentiment perceived by raters from reading the text maps onto the internal emotional state of clients or therapists in psychotherapy. It does not appear that this is well defined in the field of computer science. While sentiment appears to be the attitudes being expressed by an individual, it is unclear if that maps onto an internal emotional state. For example, rating the attitude of an individual that states “this movie was terrible!” is relatively straightforward. The individual feels something negative towards the movie that they were assessing. However, in psychotherapy, a patient may state that “I don’t need therapy anymore,” which may reflect a positive internal emotional state, if true. The same statement, however, may be indicative of a negative internal emotional state if the client is expressing resistance to treatment. As a result, assessing sentiment in psychotherapy may need clearer definitions and instructions than methods used for domains like restaurant or movie reviews. The lower rating of human reliability in this study may also suggest that for psychotherapy, researchers may need to create more specific rating systems than those used for movies or Twitter messages. The instructions “rate the sentiment of the following phrase” may be clear when applied to movie reviews, but may be unclear for psychotherapy. Future research might experiment with several different rating systems and compare the interrater reliability of each type. For example, researchers may test the rating system used in this paper against others like “please rate how you think the person who said these words feels.” Alternatively, it may be that in some cases, the emotional expression of the speaker is truly ambiguous, and thus a distribution of ratings may be a better representation of the emotional state of the target as compared to a specific single point on a scale. Future work should explore the differences between rating sentiment and rating emotional expression.

Future directions in psychotherapy research

Our results suggest that NLP-based models can be useful tools for more nuanced examinations of psychotherapy processes, given that the NLP models we used were able to predict sentiment of therapist and client utterances more accurately than traditional dictionary-based methods of sentiment analysis (Mergenthaler, 1996). While constructs such as the working alliance (Horvath & Symonds, 1991), empathy (Elliott et al., 2011), and cultural awareness (Tao, Owen, Pace, & Imel, 2015) have been identified as important to client symptom improvement, there is still limited understanding of how these processes unfold in sessions. More specifically, despite the importance of emotion in psychotherapy, there is little focus on how immediate changes in sentiment are associated with important therapeutic constructs. Applying sentiment analysis to psychotherapy transcripts could also allow researchers to better understand patterns of emotional interactions (e.g., countertransference; Dahl, Røssberg, Bøgwald, Gabbard, & Høglend, 2012) that contribute to positive or negative perceptions of important therapeutic processes by examining changes in sentiment in each therapist–client conversation exchange. Prior research has demonstrated that reviewing sessions with computer-based recording systems can increase personal reflection and aid in supervision and collaboration during psychotherapy training (Murphy et al., 2019; Slovák et al., 2015). Future studies could investigate clinician experiences of utilizing BERT during supervision, whereby trainees could report on their perceptions of client sentiment, and review BERT outputs with their supervisor to reflect on their experiences, and potentially on broader psychotherapy processes happening. Integrating NLP-based models with both self-report measures of therapeutic processes and physiological measures of arousal could lend further understanding of how certain emotions, arousal, and perceptions of therapeutic processes interact together, using larger datasets and fewer resources.

In addition to better understanding psychotherapy processes, future research can also focus on exploring how changes in emotional expression in therapist–client interactions are associated with symptom improvement and outcomes. While researchers have started examining how in-session emotional expressions and interactions are associated with symptom improvement (e.g., McCarthy, Mergenthaler, Schneider, & Grenyer, 2011; Walter et al., 2009), the small sample sizes in these studies limit their generalizability. Examining changes in sentiment using large psychotherapy datasets that encompass different treatments could help researchers identify patterns of therapist and client emotional expression that are associated with improved clinical outcomes across treatments. Similarly, future research could also examine therapist variability in eliciting emotional change in clients and its relationship with clinical outcomes. Although therapists are significant sources of variability in client outcomes (Del Re, Flückiger, Horvath, Symonds, & Wampold, 2012; Laska, Smith, Wislocki, Minami, & Wampold, 2013; Wampold & Brown, 2005), to date there are no large-scale studies examining in-session therapist behaviors that account for these differences. Given the importance of communication and expression of emotions in therapy, examining how therapists vary in eliciting emotions in their clients, and how these differences relate to symptom improvement, could shed insight into therapist behaviors that account for clinical change. Future research could examine how therapists differ in eliciting emotions in clients, how these differences are associated with symptom improvement, and other factors (e.g., client and therapist demographics, client presenting concerns) that may interact with emotional expression and symptom improvement using NLP models.

Applications to clinical practice

In addition to research implications, NLP models predicting sentiment can also facilitate supervision of clinicians-in-training. Combined with systems that can automatically transcribe entire psychotherapy sessions (Georgiou, Black, Lammert, Baucom, & Narayanan, 2011), methods employed in this paper could help supervisors and supervisees better focus on specific parts of therapy sessions. Currently, supervision is based mainly on verbal reports from supervisees, without any quantitative indicators of important moments during the hours of psychotherapy that may have taken place over a week (Amerikaner & Rose, 2012; Goodyear & Nelson, 1997). Visualizations of coded therapy sessions for sentiment (like the ones presented in this paper) could allow supervisors to select session recordings that appear emotionally salient (e.g., very positive, negative, or emotionally variable) to review with their supervisees. For instance, it may be particularly important for therapists to identify negative emotions to better understand client avoidance of negative emotions (e.g., Acceptance and Commitment Therapy; Hayes, Luoma, Bond, Masuda, & Lillis, 2006). While NLP models of sentiment may not accurately label all therapist and client statements, these models could still allow supervisors to focus on specific moments in therapy that may be of interest, see how their supervisees perform, and process with their supervisees what occurred. Using a methodology like this may help supervisors use their time more efficiently and enable them to listen to portions of psychotherapy sessions that go beyond those that were selected by their supervisees.

Automatically transcribing a psychotherapy session, identifying speakers, and automatically coding the sentiment of every statement may seem unrealistic and impractical on the surface, but these capabilities have been well studied and are generally available. Methods like the ones used in Georgiou et al. (2011) to transcribe and identify speakers in long recordings are now commercially available from companies like Amazon (https://aws.amazon.com/transcribe/). The methods presented in this paper to identify sentiment from text are relatively standard methods in the field of natural language processing (Jurafsky & Martin, 2008). With the hope of continuing to advance psychotherapy science and clinical practice (e.g., Barnard, 2004), NLP represents one innovative interdisciplinary channel through which therapists may find additional resources.

Conclusion

Psychotherapy often revolves around the discussion of emotionally charged topics, and most theories of psychotherapy involve some idea of how emotions influence future behavior. However, it has been extremely difficult to study these processes empirically because manually coding sessions for emotional content is expensive and time-consuming. Psychotherapy researchers have typically relied on LIWC in an attempt to automate this laborious coding, but this method has serious limitations. More modern NLP methods exist, but they have been trained on out-of-domain datasets and do not perform well on psychotherapy data. This study addresses these problems by training on a large dataset of psychotherapy transcripts, producing a model that outperformed both LIWC and a modern, publicly available NLP method trained on out-of-domain data. Although further improvement is needed to match the consistency of the best human raters, this approach represents an important first step towards allowing researchers to conduct a more rigorous study of emotion in psychotherapy.