Introduction

With rapid growth in popularity among learners and broad involvement from accredited organizations, massive open online courses (MOOCs) have witnessed great success since their appearance in 2008 (Almatrafi et al. 2018; Babori et al. 2019; Xing et al. 2016). By December 2019, more than 110 million students had enrolled in MOOCs across 13.5 thousand courses through mainstream MOOC platforms such as Coursera and edX (Shah 2019). Among the learning resources within MOOCs, such as video lectures and homework, the discussion forum stands out as a valuable platform for students' learning through knowledge exchange (Almatrafi et al. 2018; Wang et al. 2015; Xing et al. 2019). Studies have found that peer support on various dimensions of discussion forums has positive effects on students' engagement and learning outcomes (Moore et al. 2019; Sunar et al. 2016). However, peer interactions on MOOC discussion forums are not prevalent (Chiu and Hew 2018). The lack of interaction among MOOC learners can lead to a sense of isolation, suppressing learners' desire to share knowledge and concerns. In consequence, low engagement on discussion forums contributes, to some extent, to MOOCs' long-standing issues of low participation and high dropout rates (Lee and Choi 2011; Ortega-Arranz et al. 2019; Xing and Du 2019).

As a result, extensive studies have explored pedagogical innovations to spark MOOC learners' participation in the discussion forum. Conventionally, support for MOOC learners on discussion forums has emphasized the roles of instructors and students. Researchers have investigated pedagogical approaches in which instructors or teaching assistants (TAs) served as facilitators who connected with students and helped the course co-evolve with learners through the adaptive generation of resources and learning places (Kop et al. 2011; Masters 2011; Ruey 2010). Other studies focused on supporting MOOC learners through their peers from a connectivist perspective. From this perspective, students are encouraged to receive and provide support by demonstrating autonomy, openness, connection with resources and people, and interactivity, typically in the form of discussion forum activities (Dubosson and Emad 2015; Mackness et al. 2013; Wise et al. 2017). Learning occurs through the evolution of networks among students (Goldie 2016).

However, it is difficult, if not impossible, for instructors and learners to manually provide support to all participants on discussion forums given the vast number of learners in MOOCs (Almatrafi et al. 2018; Wise et al. 2017; Xing 2019). A recent trend has therefore been to support MOOC learners on discussion forums with automatic text analysis methods. For example, Almatrafi et al. (2018) used machine learning models to automatically detect posts that needed urgent attention so that instructors or TAs could assist more efficiently. Similarly, Wise et al. (2017) adopted machine learning algorithms to automatically classify whether posts contained meaningful course-related content, helping instructors and students find information aligned with their purposes more efficiently.

Among the automated methods of supporting the MOOC community, natural language generation (NLG) has shown potential in providing students with support in online learning settings (Caballé and Conesa 2019; Kumar and Rose 2010; Mittal et al. 2018). Although definitions vary slightly (Evans et al. 2002), in this study NLG is defined as a subset of natural language processing (NLP) techniques that determine what is to be said meaningfully in response to language contexts in a particular language (McKeown 1982). Researchers have used NLG for a series of tasks such as conversation building, question answering, and reading comprehension (Benamara and Saint-Dizier 2004; Dale 2016; Du et al. 2017). In the study of Mittal et al. (2018), the researchers built a chatbot with NLG for MOOC learners and found that the texts generated by the chatbot offered informative answers to students' questions. MOOC learners were also found to benefit from NLG techniques in another study by Ferschke et al. (2015), who found that students with access to a chatroom enhanced by an automated conversational agent were less likely to drop out as the course proceeded. Although studies have explored the effects of using NLG to support MOOC learners, to the best of our knowledge, most did not analyze the types of support NLG can offer to online learners compared with support from human peers. Furthermore, few studies have taken advantage of the rapid advancements in NLG with deep learning algorithms.

In this study, we explore the possibilities of using NLG with deep learning models to provide automatic support for students on MOOC discussion forums. We examined two state-of-the-art algorithms for generating replies to given posts: the recurrent neural network (RNN) and the generative pretrained transformer 2 (GPT-2). Studies have shown the potential of RNNs to generate high-quality texts (Indurthi et al. 2017; Tang et al. 2016; Wang et al. 2018; Wen et al. 2015). In a study by Tang et al. (2016), an RNN showed promising results in providing students with writing support through automatic word suggestions. GPT-2 is a state-of-the-art language model by OpenAI; its authors initially described the complete version as too dangerous to release because of its capability to generate texts close to what can be produced by human writers (Radford et al. 2019). Therefore, we compared the text generation performance of these two promising NLG models to find an ideal candidate, with which we can analyze the possibility of automatically supporting students in MOOCs. Overall, this research aims to examine the extent to which deep-learning-based NLG can offer responses to learners in MOOC forums that are similar to human-generated responses.

Background

Theoretical Foundations

Social support theory is a theoretical framework that helps researchers explore the resources exchanged by individuals, with the perception by the provider or the receiver that such exchange would be beneficial to the receiver (Shumaker and Brownell 1984). Studies have shown a positive relationship between learning outcomes and social support (Hsu et al. 2018; Lin and Anol 2008; Tsai et al. 2017). Conventionally, research on social support was conducted within physical environments to explore relationships between individuals and family members. However, with the increasing infiltration of digital devices and tools, much research has begun to apply social support theory in digital environments (Tsai et al. 2017).

Social support theory can be distilled into the following categories: informational support, emotional support, and community support (Cutrona and Russell 1990; House 1981; Silverman 1999). Informational support entails the provision of advice, suggestions, and useful information to individuals (Wills 1991). Many studies have investigated informational support in online settings and have found that it contributes to achieving intended outcomes (Deetjen and Powell 2016; Wu et al. 2019; Xing et al. 2018). For example, Xing et al. (2018) examined the effects of informational support on participants' continuing involvement with an online healthcare community. The findings suggested that informational support is positively associated with periphery users' commitment. Emotional support provides help in the form of empathy, concern, acceptance, and encouragement (Langford et al. 1997). In the context of MOOCs, studies found students' emotions directly related to their engagement and persistence (Hew 2016; Jung and Lee 2018; Wen et al. 2014). At the same time, emotional support contributes substantially to how students perceive social presence and online learning (Cleveland-Innes and Campbell 2012). Finally, community support creates a sense of belonging through engagement in shared social activities (Wills 1991). Studies have suggested that a sense of community is positively correlated with students' engagement, motivation, and learning outcomes (Tomkin and Charlevoix 2014; Zumbrunn et al. 2014). A lack of belonging can lead to a sense of isolation that increases student dropout (Almatrafi et al. 2018; Lee and Choi 2011). Through the lens of social support theory, this study examines these three types of support in texts generated with deep learning models.

Support of NLG for MOOC Learners

NLG techniques have been studied extensively in MOOC settings, mainly in the form of conversational agents (Caballé and Conesa 2019; Kumar and Rose 2010; Mittal et al. 2018). Conversational agents, also known as chatbots, engage users through natural language with the aid of algorithms (Shawar and Atwell 2007a). Much research has suggested that conversational agents have the potential to support students in online learning (Pereira 2016; Pereira et al. 2019; Radziwill and Benton 2017; Shawar and Atwell 2007b). For example, Pereira et al. (2019) conducted a study with 77 students in a MOOC. Their chatbot was designed to engage students by challenging them with questions and by streamlining the peer review of assignments. The study found an improvement in students' motivation, with 90% of participants stating they would recommend the use of a chatbot in the future. In another study by Pereira (2016), the researcher designed a chatbot to help students with self-guided learning. Among the 23 participants, 89% regarded the chatbot as a good companion for learning, and 72% suggested that using the chatbot could help enhance engagement.

However, current research on the use of NLG to support learning has its limits. In a review of the literature on chatbots, Io and Lee (2017) pointed out that most research developed chatbots with classical NLP techniques, while few studies tapped into the recent advancements of deep learning in NLG. Traditional NLG algorithms relied on topic extraction and generated bodies of text predefined by humans (Demetriadis et al. 2018; Tegos et al. 2019). However, MOOCs are large-scale learning environments with huge numbers of participants, and texts generated from a limited set of topics and variations can hardly meet learners' needs. There is a need to empower conversational agents with responsiveness and creativity instead of relying solely on defined rules. The end-to-end nature of deep learning, which requires few formal specifications from human operators to achieve end goals (Goodfellow et al. 2016), makes deep learning an excellent candidate for supporting learning with NLG. The next section presents a review of deep learning for text generation.

NLG with Deep Learning Models

Deep learning often refers to deep neural networks. A deep neural network is "composed of multiple processing layers to learn representations of data with multiple levels of abstraction" (LeCun et al. 2015, p. 436). Compared with traditional machine learning approaches, deep learning shows a more promising future in that it has proven able to process natural data in raw form and requires little manual engineering to achieve desired results (Goodfellow et al. 2016; LeCun et al. 2015). Extensive research has explored NLG with deep learning, in which RNNs have stood out with promising results (Indurthi et al. 2017; Tang et al. 2016; Wang et al. 2018; Wen et al. 2015). For example, in the study of Indurthi et al. (2017), the researchers used an RNN to generate question-answer (QA) pairs given contexts (e.g., a body of text describing a fact). The generated QA pairs were then compared with those produced by a non-deep-learning method. Their results suggested that the RNN performed significantly better than the traditional method, and the generated QA pairs achieved good overall quality. In another study exploring NLG with RNN, Wen et al. (2015) demonstrated that RNNs could generate human-like texts on different topics.

In 2017, Vaswani et al. proposed the Transformer deep learning architecture for NLP, and continuous breakthroughs in NLP have followed since then (Devlin et al. 2018; Radford et al. 2019; Yang et al. 2019). The transformer architecture addresses the computational bottleneck of prior deep learning models such as RNNs and convolutional neural networks (CNNs) (Vaswani et al. 2017). The ability to vectorize computation in the transformer architecture yields better computational efficiency and allows researchers to experiment with models containing many more parameters. Recent models that achieved state-of-the-art performance on NLP tasks are almost all based on the transformer architecture, with hundreds of millions of parameters (Lan et al. 2019). GPT-2 is one such transformer-based deep learning model, having achieved state-of-the-art performance in a series of NLP tasks such as question answering, reading comprehension, and summarization (Radford et al. 2019). The powerful language generation capability of GPT-2 has also been examined in studies across diverse domains (Budzianowski and Vulić 2019; Lee and Hsiang 2019; Zhang et al. 2020). For example, Lee and Hsiang (2019) adopted GPT-2 for augmented inventing. Specifically, the researchers trained a GPT-2 model to generate patent claims with 555,890 entries of sample claims. Their results showed that the patent claims generated by GPT-2 were coherent and of reasonable quality. In another example, Zhang et al. (2020) built a dialogue system with GPT-2. In the study, the researchers trained GPT-2 and RNN models with conversational data from a public Reddit dataset to generate short responses to prompts. Through automatic and human evaluation, the researchers found that texts generated by GPT-2 could achieve performance close to that of humans. In the comparison between GPT-2 and RNN, the results suggested GPT-2 outperformed RNN in terms of consistency with the dialogue context.

Furthermore, the transformer architecture benefits researchers through pretrained models. Pretrained models are usually trained on a large amount of data by individuals or corporations with sufficient computational power, and their parameters are then published for open use. Researchers can fine-tune pretrained models on custom datasets and inherit strong base performance on tasks such as natural language inference and paraphrasing, which are essential for text generation (Fedus et al. 2018). Research has suggested that even models with limited training data can benefit from pretrained models (Lan et al. 2019). In the study of Budzianowski and Vulić (2019), the researchers found that the scarcity of training data for an NLG task can be mitigated by leveraging the pretrained GPT-2 model. The researchers evaluated the results from GPT-2 automatically and manually and found that GPT-2 could generate high-quality texts even with limited training data.

From the literature reviewed, RNN and GPT-2 have shown great success in generating readable and coherent texts. However, limited research has been conducted to guide the choice between RNN and GPT-2 for NLG. To the best of our knowledge, only one paper has addressed this topic (Zhang et al. 2020), and none has examined these two models in an educational setting. In order to select the more appropriate NLG model to better support students, in this study, we compared the performance of RNN and GPT-2 using automatic metrics.

Methods

Dataset and Research Context

The research context is a creative thinking course offered by a large public university in the northeastern United States and hosted on Coursera. The course enrolled 7,066 students, who generated 42,307 posts and replies on the discussion forum during the course's first offering in 2013. The course lasted six weeks and was intended to introduce students to the creative thinking mindset so that they would be empowered to bring changes to their lives and society through innovation. The course asked students to use the discussion forum to share their ideas on innovation and to form groups for group projects. The discussion forum also served as the place where students shared their group project results so as to receive peer feedback.

Data Preprocessing

In preparation for model training, the following data removal and cleaning procedures were conducted: (1) Removed posts without replies. The purpose of the modeling was to generate replies to posts with deep learning, and posts without replies would not be helpful for model training. (2) Normalized textual content. Posts and replies were stored in Hypertext Markup Language (HTML) format, and we used the Python package BeautifulSoup to clean up posts and replies so that HTML tags in the content would be stripped away. For example, a raw post such as "Hi<br />I think its great.<br />" would be converted to "Hi\nI think its great.\n", where <br /> denotes a line break in HTML and is replaced with the newline symbol \n. (3) Ensured 8-bit Unicode Transformation Format (UTF-8) encoding. UTF-8 is one of the most widely used text encoding formats and supports not only common characters but also non-character symbols such as emojis. To ensure all symbols in sentences could be accurately preserved, we used the Python package ftfy to encode textual content in UTF-8. (4) Masked web links. The deep learning models may learn the pattern of sharing web resources such as links when generating texts. However, links generated by the models would either be taken from others' work or fabricated, which could be misleading and confusing. Therefore, we used regular expressions to find web links and replaced them with a special token [LINK] so that the models could be instructed to ignore these tokens during training.
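The sketch below illustrates these cleaning steps in Python with the packages named above (BeautifulSoup, ftfy, and regular expressions); it is a minimal illustration, not the authors' exact script, and the example post is taken from the text.

```python
# Minimal preprocessing sketch for one raw HTML post (illustrative, not the exact script).
import re
from bs4 import BeautifulSoup
import ftfy

LINK_PATTERN = re.compile(r"https?://\S+")  # pattern used to mask web links

def preprocess(raw_html: str) -> str:
    # (2) Strip HTML tags; <br /> becomes a newline via the separator argument.
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator="\n")
    # (3) Repair mis-encoded characters and keep the text as valid UTF-8.
    text = ftfy.fix_text(text)
    # (4) Replace web links with the special [LINK] token.
    text = LINK_PATTERN.sub("[LINK]", text)
    return text

print(preprocess("Hi<br />I think its great.<br />"))  # prints the cleaned text with <br /> turned into newlines
```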

After the data preprocessing, there were 13,850 entries of post-reply pairs left. Each entry of a post-reply pair contains one post and one corresponding reply. Posts would have duplicates since a post could have more than one reply (see Table 1).

Table 1 Example of post-reply pairs

Automatic Text Generation with NLG and Deep Learning

Overview

The deep-learning-based NLG models involved three stages (see Fig. 1). In stage 1, data preprocessing filtered and transformed the data to prepare it for model training using the steps described above. In stage 2, the processed dataset was split into training, validation, and testing sets with proportions of 0.6 (N = 8310), 0.2 (N = 2770), and 0.2 (N = 2770), respectively. An RNN and a GPT-2 model were trained with the training set, meaning model parameters were "learned" during this phase. The model architectures of RNN and GPT-2 are explained in the next sections. The validation set was used to estimate the models' performance on unseen data and to assess whether underfitting or overfitting occurred during training. The testing set was used to provide contexts for the models to generate texts, the results of which were then evaluated quantitatively. In the last stage, models were evaluated based on word perplexity, readability, and cosine similarity with contexts (posts) and were compared with the metrics of the original replies. Based on the quantitative metrics, one of the two models was then selected for further qualitative analysis to understand the types of support NLG can offer to MOOC learners.

Fig. 1 Modeling procedures for NLG
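The 60/20/20 split described in stage 2 can be expressed as in the following sketch; it assumes the post-reply pairs are held in a pandas DataFrame named pairs, which is an assumption for illustration rather than the authors' actual code.

```python
# Hedged sketch of the 60/20/20 train/validation/test split (illustrative only).
from sklearn.model_selection import train_test_split

train, holdout = train_test_split(pairs, test_size=0.4, random_state=42)  # 60% training
valid, test = train_test_split(holdout, test_size=0.5, random_state=42)   # 20% validation, 20% testing
# For the 13,850 post-reply pairs this yields roughly 8310 / 2770 / 2770 entries.
```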

NLG with Recurrent Neural Network (RNN)

RNN has been widely applied to explore its capability for text generation and has shown great potential in NLG tasks (Fedus et al. 2018; Mikolov et al. 2010; Potash et al. 2015; Sordoni et al. 2015). RNN is a variation of the feedforward neural network, which has three layers: an input layer, a hidden layer potentially consisting of multiple layers, and an output layer. RNN differs from a traditional feedforward neural network in that it preserves data contexts through recurrent connections in the hidden layer, which significantly increases its performance on sequential data (Mikolov et al. 2010). The contexts in RNN, also known as hidden states, are computed from the current input (e.g., a word) and the information retained previously. This reliance on previous information to proceed defines the recurrence of RNN, which requires the model to process a series of data points one by one (see Fig. 2). As shown in the figure, RNN takes a sequence of inputs X, with t denoting the current timestep. To compute the hidden state h at time t, RNN takes the input x(t) and the previous hidden state h(t − 1). The output y at time t is then generated based on x(t) and h(t). The output is a probability distribution over possible next values of the sequence.

Fig. 2 Illustration of regular recurrent neural network for language generation (sequence-to-sequence)
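In one common formulation (our notation for illustration, not necessarily the exact equations of the cited works), the recurrence described above can be written as

$$ {h}^{(t)}=\tanh \left({W}_{xh}\,{x}^{(t)}+{W}_{hh}\,{h}^{\left(t-1\right)}+{b}_h\right),\kern2em {y}^{(t)}=\operatorname{softmax}\left({W}_{hy}\,{h}^{(t)}+{b}_y\right) $$

, where the W matrices and b vectors are learned parameters, and y(t) is the probability distribution over the next word.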

However, training standard RNNs on problems with long-term temporal dependencies (e.g., long sentences) can be challenging. A variation of RNN known as long short-term memory (LSTM) has been found to tackle the issue of long dependencies effectively (Graves 2013). LSTM improves the regular RNN by introducing a memory cell, which controls when information should be preserved, when information should be output, and when information can be forgotten. Recent work using LSTM for language generation has suggested its great potential (Potash et al. 2015). Therefore, in this study, we used an LSTM-based RNN model for the NLG task.
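As a concrete illustration, a minimal LSTM-based language model of the kind described here can be set up in Keras (the framework used for our RNN) as follows; the vocabulary size and layer dimensions are illustrative assumptions, not the hyperparameters reported in this study.

```python
# Hedged Keras sketch of an LSTM-based next-word prediction model (illustrative sizes).
import tensorflow as tf

VOCAB_SIZE = 20000   # assumed vocabulary size
EMBED_DIM = 256      # assumed embedding dimension
HIDDEN_DIM = 512     # assumed LSTM hidden-state size

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True),   # hidden state carried across timesteps
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),   # probability distribution over the next word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```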

NLG with Generative Pretrained Transformer 2 (GPT-2)

Generative pretrained transformer 2 (GPT-2) is a transformer-based deep learning model that achieves state-of-the-art performance on language generation (Radford et al. 2019). The GPT-2 model was pretrained with 40 gigabytes of text by its authors, and four sizes of pretrained models have been released: (1) small, with 12 layers and 117 million parameters; (2) medium, with 24 layers and 345 million parameters; (3) large, with 36 layers and 774 million parameters; and (4) full size, with 48 layers and more than 1.5 billion parameters. Although larger GPT-2 models are presumed to perform better, due to limited computational power we chose the medium GPT-2 pretrained model with 24 layers and 345 million parameters for this research.
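With Huggingface's Transformers (the library used for our GPT-2 model), loading the medium pretrained checkpoint can be sketched as follows; this is a generic illustration and omits the fine-tuning on our MOOC data.

```python
# Load the medium (24-layer, ~345M-parameter) pretrained GPT-2 model and its tokenizer.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.eval()  # inference mode; fine-tuning on post-reply pairs would happen before this step
```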

The transformer-based architecture of GPT-2 differs from RNN in that it does not use recurrence during training to preserve contextual information (Vaswani et al. 2017). Because of this, the transformer architecture does not need to process the words of a sentence one by one during training, which makes the training process much more efficient. However, many transformer-based models, including GPT-2, use auto-regression when generating texts, which resembles the mechanism of RNN. Auto-regression means that words are output sequentially: each newly generated word is concatenated with the previously generated words to form a new sequence, which then serves as the input for generating the next word (Radford et al. 2019). Since this study addresses NLG, this section focuses on text generation rather than the training mechanism of GPT-2.
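A simplified auto-regressive decoding loop is sketched below to make this mechanism concrete; it uses greedy decoding for brevity, and the prompt is the example from Fig. 3, so it should not be read as the sampling procedure used in this study.

```python
# Hedged sketch of auto-regressive generation: the newly predicted token is appended
# to the input before predicting the next one.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.eval()

input_ids = tokenizer.encode("John loves MOOCs and he", return_tensors="pt")
for _ in range(20):                                    # generate 20 more tokens
    with torch.no_grad():
        logits = model(input_ids)[0]                   # shape: (1, sequence_length, vocab_size)
    next_id = logits[0, -1].argmax().reshape(1, 1)     # most probable next token (greedy choice)
    input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0]))
```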

Figure 3 illustrates the architecture of GPT-2 for text generation; the values in the figure are made up for illustration purposes. In the left portion of the graph, an incomplete sentence, "John loves MOOCs and he", has been generated, and the model is predicting the next word after "he". Each generated word first goes through the embeddings layer, which transforms the word into numeric vectors. The embeddings layer contains two types of embeddings, a word embedding and a positional embedding. The two embeddings are matrices learned during training; the model looks up a word, or a word's position in the sequence, to retrieve the associated vector values. The word embedding has 50,257 rows, each corresponding to a unique word in the vocabulary of the pretrained GPT-2 model. The number of columns varies depending on the size of the pretrained GPT-2 model. For example, the medium version has 1024 columns, meaning a word is transformed into a vector of length 1024. Words that cannot be found in the matrix use the vector of the special token "Unknown". The positional embedding works similarly except that its rows are position numbers. For example, the word "he" in the figure is in position 5, so its value is taken from row 5 of the positional embedding. For each word, the embeddings layer sums the two vectors from the two embeddings and passes the result to a stack of decoders.

Fig. 3 Illustration of the GPT-2 model for language generation

Each decoder in the stack has the same architecture but different weights, which are calculated during training. Each decoder has two components, a self-attention layer and a feedforward neural network. The right part of Fig. 3 shows the mechanism of the self-attention layer. Conceptually, the self-attention layer gathers the contextual information related to the currently processed word. Specifically, each word is assigned an attention score, with greater scores indicating greater importance to the currently processed word. Furthermore, each word has a vector value. For the first decoder (the bottom one) in the stack, the vector values of words are the outputs of the embeddings layer; for other decoders, the vector values are the outputs of the feedforward neural network in the previous decoder. The self-attention layer sums each word's vector multiplied by its attention score, and the result is fed into the feedforward neural network for feature extraction and dimension reduction. The mechanism of self-attention is said to retain each word's contextual information in a vector representation (Vaswani et al. 2017). In the example of Fig. 3, "he" refers to "John", so the word "John" is expected to have the highest attention score; since "he" is the currently processed word, its attention score tends to be small. After repeating the computation through each decoder in order, the model outputs a probability distribution over each word in its vocabulary. In our example, the word "has" achieves the highest probability and is thus predicted as the next word.
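The weighted-sum step of self-attention described above can be illustrated with a toy numpy example; the scores are made up (as in Fig. 3), and in the actual transformer they are computed from learned query and key projections rather than assigned directly.

```python
# Toy illustration of the attention-weighted sum: softmax-normalized scores weight each word vector.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

word_vectors = np.random.rand(5, 8)              # vectors for "John loves MOOCs and he" (dimension 8, made up)
scores = np.array([2.5, 0.3, 0.4, 0.1, 0.2])     # "John" receives the highest attention score
weights = softmax(scores)
context = weights @ word_vectors                  # context vector passed on to the feedforward network
print(weights.round(2), context.shape)
```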

Evaluation

After training, the RNN and GPT-2 models were evaluated on the testing dataset with three metrics to aid model selection. We first generated replies with RNN and GPT-2 using the 2770 posts in the testing dataset. The posts in the testing set were used as "prompts", meaning the RNN and GPT-2 models generated replies corresponding to those prompts. Word perplexity, Flesch-Kincaid (F-K) grade level, and cosine similarity were then computed for evaluation. The F-K grade level and cosine similarity of the original replies were also computed to serve as benchmarks. Since word perplexity is a metric specific to NLG models, no such measure exists for the original replies.

Word Perplexity

Word perplexity is a well-established metric for evaluating model performance in NLG (Bengio et al. 2003), which describes how surprised the model is when encountering the predicted next word (Evermann et al. 2016). For example, a word perplexity of 30 indicates the model proposes 30 words to choose from when generating the next word. In practice, lower word perplexity is preferred as it suggests a model is more confident in knowing what the next word should be (Serban et al. 2016). Word perplexity is defined as (Pietquin and Hastie 2013)

$$ Perplexity={e}^{-\frac{1}{N}\sum \limits_{i=1}^N\log \left({p}_m\left({x}_i\right)\right)} $$

, where pm(xi) is the probability the model assigns to the word xi, and N is the number of words in the evaluated sequence. Because the models compute logarithms with base e (the natural logarithm), the exponential function is used in the formula instead of a base of 2.
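For illustration, word perplexity under this natural-log formulation can be computed from per-token probabilities as in the sketch below, where token_probs stands for the probabilities pm(xi) produced by the evaluated model (an assumed input for the example).

```python
# Hedged sketch: word perplexity as the exponential of the average negative log-probability.
import math

def word_perplexity(token_probs):
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

print(word_perplexity([0.2, 0.05, 0.1]))  # higher assigned probabilities give lower perplexity
```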

Flesch-Kincaid (F-K) Grade Level for Readability

Initially designed for the U.S. Navy to test readability of English texts (Kincaid et al. 1975), Flesch-Kincaid (F-K) grade level has been widely used in NLP as well as educational settings (Dufty et al. 2006; Feng et al. 2010; Xu et al. 2016). The F-K grade level metric is defined as

$$ F\text{-}K\ Grade\ Level=0.39\cdotp \frac{total\ words}{total\ sentences}+11.8\cdotp \frac{total\ syllables}{total\ words}-15.59 $$

The F-K grade level is an approximation to the grade level required to understand a given text. In this study, we used the Python package textstat to compute F-K grade levels of generated texts from the testing dataset.
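As a small usage example, the textstat package mentioned above computes the metric directly; the sample sentence is made up for illustration.

```python
# Compute the Flesch-Kincaid grade level of a generated reply with textstat.
import textstat

reply = "I think your idea for the group project is great, and I would love to join."
print(textstat.flesch_kincaid_grade(reply))  # approximate U.S. grade level needed to understand the text
```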

Cosine Similarity for Semantic Coherence

While word perplexity can indicate how well an NLG model performs and the F-K grade level can suggest the readability of generated texts, there is also a need to automatically evaluate how coherent the generated texts are with the original posts, because off-topic replies are unlikely to be supportive and can be confusing for students. Traditionally, the evaluation of generated texts' semantic coherence relies on manual work. However, with the advancements of NLP, recent studies have suggested that using word embeddings to compute cosine similarity can well capture generated texts' coherence with conversational contexts (Clark et al. 2019; Fang et al. 2016; Vakulenko et al. 2018).

Word embedding is a technique that represents words with vectors so that textual words have numeric representations processable by machine learning algorithms. Furthermore, word embedding retains word meaning and semantic context, so similar words have similar values in their vector representations. In the field of NLP, a famous example of word embedding is King − Man + Woman = Queen, which can be interpreted as King is to Man as Queen is to Woman. In the example, if we compute the word embedding vector of King − Man + Woman, the resulting vector should be at least approximately equal to the word vector of Queen. Such a property suggests that word embeddings can capture the meanings of and relationships between words (Mikolov et al. 2013b). This study used the word embedding from bidirectional encoder representations from transformers (BERT) to convert words into vectors. BERT was selected because its word embedding innovatively broke the limitations of previous work (Devlin et al. 2018). Traditional word embeddings such as Word2Vec (Mikolov et al. 2013a) cannot differentiate words with multiple meanings. For example, for the word bank, previous word embeddings cannot distinguish bank in "a bank to withdraw money" from bank in "a river bank for a walk". To ensure the evaluation of semantic coherence considers the contextual meanings of words, we chose BERT's word embedding in this study. Although BERT also uses the transformer architecture, the word vectors extracted from it would not favor other transformer-based models. The first reason is that we apply BERT's word embedding to the texts directly, a process that is model agnostic. Second, previous studies have found that traditional machine learning models can achieve significant performance gains by using features extracted from BERT's word embedding when dealing with text data for classification (Alimova and Tutubalina 2020; Roitero et al. 2020).

Cosine similarity is a measure of similarity between two vectors and is defined as:

$$ similarity=\frac{A\cdotp B}{\left\Vert A\right\Vert \left\Vert B\right\Vert } $$

, where similarity is the cosine of the angle between vectors A and B. The value of cosine similarity ranges from −1 to 1. A pair of texts with a similarity of 1 suggests the pair is identical, and a similarity of −1 means the pair is completely opposite. In the study of Vakulenko et al. (2018), the researchers found that a cosine similarity around 0.7 best reflects texts' strong coherence within a conversation. In our research, using the Python package bert-as-service, we computed the cosine similarity between original posts and texts generated by the RNN and GPT-2 models to obtain an approximate coherence metric for the generated texts.
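A hedged sketch of this coherence metric is shown below: a post and a generated reply are encoded with bert-as-service and compared with cosine similarity. It assumes a bert-serving server has already been started separately, and the example texts are made up.

```python
# Encode a post and a reply with bert-as-service, then compute their cosine similarity.
import numpy as np
from bert_serving.client import BertClient

bc = BertClient()  # assumes a running bert-serving server with a BERT checkpoint loaded
post_vec, reply_vec = bc.encode([
    "How do I join a group for the course project?",
    "You can post your idea here and invite other learners to join your group.",
])
similarity = post_vec @ reply_vec / (np.linalg.norm(post_vec) * np.linalg.norm(reply_vec))
print(similarity)  # values in the mid-upper range suggest the reply stays on topic
```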

Multidimensional Support from Texts by NLG Models

After evaluating the models with quantitative measures, we further conducted a qualitative evaluation of the texts generated by the outperforming model. In the qualitative analysis, the authors manually examined the informational support, emotional support, and community support in original replies and generated replies given the corresponding posts. The two coders communicated actively and reached consensus on all the results. We rated the three types of support in original replies and generated replies on a scale of none (0), weak (1), moderate (2), and strong (3). A type of support coded as none meant the coders found no such support in the text. Weak support indicated that a type of support could be vaguely identified in the text but was not its theme. Moderate support indicated that a type of support could be explicitly identified in the text but was not its theme. Strong support meant that a type of support could be explicitly identified in the text and served as its theme. Examples can be found in Table 2. After the qualitative coding, we conducted a MANOVA test to determine whether there was a significant difference between the types of support offered by original replies (human learners) and generated replies. Follow-up ANOVA tests were conducted to determine whether differences existed in each type of support between original and generated replies.

Table 2 Examples of ratings on support with a mixture of original replies and generated replies

Before the qualitative analysis, we first ran a power analysis with G*Power v3.1. The power analysis for the MANOVA test suggested a required sample size of 120 to achieve a power of 0.95 at an alpha level of 0.05 with a weak-to-medium effect size of 0.15. Therefore, 150 posts were randomly selected, with 300 replies in total (150 original replies and 150 generated replies).

Survey for NLG Model Validation

An experiment was then conducted to validate the findings from the previous quantitative and qualitative analyses. In the experiment, we recruited four upper-level graduate students with a background in educational technologies; two participants were female and two were male. Participants were asked to fill in a survey consisting of 10 post-reply pairs, answering (1) whether they thought a reply was generated by a human or a machine, (2) whether a reply provided informational, emotional, and community support, and (3) a 10-point Likert-scale rating of the quality of the reply in terms of grammar, readability, and coherence with the discussion context. Participants were allowed to select multiple support types for a reply if applicable or to choose none if no evident support could be detected. Definitions and examples of the three types of support were given to participants before the experiment.

Each survey drew from a bank of 15 post-reply pairs, in which each post had both a human-generated and a machine-generated reply. Five human-generated and five machine-generated pairs were randomly selected for each participant. The participants were informed that the survey contained both human- and machine-generated replies, without knowing the exact number. To check the NLG model's generalizability to a similar course context, we randomly assigned the four participants into two groups of two. One group used the bank of post-reply pairs from MOOC A, the course on which the NLG model was trained. The other group used a bank from another course on a similar creative thinking topic (MOOC B), whose data and context were new to the NLG model. Post-reply pairs from the two courses were extracted with a Python script, with only posts having at least one human-written reply being selected. Machine replies were then generated with the trained NLG model.

After participants submitted their responses, descriptive statistics were computed for further analysis. The analysis compared the ratios of correctly labeled replies, the detection of socio-emotional support, and the average ratings of content quality along two dimensions: (1) comparisons between machine-generated replies from the MOOC A group and the MOOC B group, to understand whether the trained NLG model would yield comparable performance in a similar but new context; and (2) comparisons between human-generated and machine-generated replies across all participants, to learn whether the trained NLG model can generate replies similar to those generated by humans.

Results

Results for NLG Models with RNN and GPT2

The model training and evaluation were conducted with an NVIDIA Tesla P100 GPU with 16 gigabytes of memory and Python 3. For the RNN model, we used the Python packages Tensorflow and Keras; for GPT-2, we used the Python packages PyTorch and Huggingface's Transformers. Training of each model was stopped when the validation loss began to increase while the training loss kept decreasing, or when the training loss stopped decreasing. The RNN model was stopped after fifteen epochs, and the GPT-2 model was stopped after eight epochs. An epoch of training means one full pass over the training data by the model, since training with deep learning usually requires that data be chunked into mini-batches to avoid out-of-memory issues. Table 3 shows the evaluation results of RNN, GPT-2, and the original replies in terms of word perplexity, average F-K grade level, and average cosine similarity with posts.
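For the Keras-based RNN, a stopping rule of this kind can be expressed with an early-stopping callback, as in the hedged sketch below; the stopping decision may equally have been made manually by inspecting loss curves, and the model and data objects are assumed to come from the earlier training setup rather than shown here.

```python
# Hedged sketch: stop training when the validation loss stops improving.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                               restore_best_weights=True)
model.fit(train_inputs, train_targets,                      # assumed training tensors
          validation_data=(valid_inputs, valid_targets),    # assumed validation tensors
          epochs=50, callbacks=[early_stop])                 # one epoch = one full pass over the training data
```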

Table 3 Evaluation results of RNN and GPT-2 compared with original replies

The word perplexity of 37.94 of GPT-2 suggests the GPT-2 model is much more confident in predicting the next word given a context, compared to the word perplexity of 151.41 of the RNN model.

Compared with the RNN model's average F-K grade level of 22.68 (SD = 22.01), the GPT-2 model has an average F-K grade level of 10.10 (SD = 13.40), which is much closer to that of the original replies (M = 8.73, SD = 9.20). The GPT-2 model's distribution of F-K grade levels also follows that of the original replies more closely than the RNN distribution does (see Fig. 4). Through the inspection of randomly selected examples at different F-K grade levels (see Table 4), we find that texts with grade level scores in the range from 3 to 15 show better readability, whereas results with grade levels higher than 20 or lower than 1 are usually repetitive or meaningless. As Fig. 4 shows, the F-K grade level scores of GPT-2 lie mainly in the range of 3 to 13, as do the scores of the original replies, while RNN produces far more extreme scores. In general, the readability scores of GPT-2 align better with those of the original replies and are concentrated in the observed favorable range. Therefore, we conclude that texts generated by the GPT-2 model have better readability than those of RNN.

Fig. 4 F-K grade level of texts generated by GPT-2, RNN, and original replies

Table 4 Examples of texts with different F-K grade levels

A similar evaluation was conducted for cosine similarity. After inspecting texts across various cosine similarities, we find that generated replies with cosine similarity in the mid-upper range are best aligned with the corresponding posts (see Table 5). In the results, the GPT-2 model has an average cosine similarity of 0.57 (SD = 0.22), with 90% of responses (n = 2493) falling in the range from 0.13 to 0.84, while the RNN model's mean cosine similarity is 0.52 (SD = 0.17), with 90% of responses (n = 2493) in the range from 0.17 to 0.75. The average cosine similarity of the original replies is 0.58 (SD = 0.18), with 90% of responses (n = 2493) falling in the range from 0.24 to 0.81. For responses with a cosine similarity greater than or equal to 0.95, there are 23 cases generated by GPT-2, none by the RNN model, and 3 among the original replies. Although GPT-2 tends to generate more responses with little variation from the original posts (cosine similarity ≥ 0.95), such cases account for less than 1% of the sample (N = 2770). Therefore, the GPT-2 model is considered capable of generating more coherent replies than the RNN model, since the average cosine similarity of GPT-2 is closer to that of the original replies and the majority of GPT-2's cosine similarities fall in the empirically favorable range. Figure 5 shows the distributions of the cosine similarity of GPT-2, RNN, and the original replies.

Table 5 Examples of texts with different cosine similarities
Fig. 5 Cosine similarity of texts generated by GPT-2, RNN, and original replies

Given that GPT-2 outperformed RNN from various perspectives, we chose GPT-2 over RNN for further analysis to compare the types of support it offers with those offered by human learners.

Results for Multidimensional Support from GPT-2

In addition to conducting the MANOVA and ANOVA tests, we also calculated the percentage of supportive replies. A reply is deemed supportive if it has a rating of moderate or strong support. Replies rated as providing no or weak support are treated as non-supportive because we consider such replies to provide only minimal support that might not be helpful for learners. Table 6 shows the descriptive statistics on the percentage of different support types, as well as the results of the statistical tests. In terms of the percentage of supportive replies, original and generated replies provide similar levels of emotional and community support. In contrast, the original replies have substantially more instances of informational support. The MANOVA test found a significant multivariate effect in supporting MOOC learners between original replies and generated replies, Λ = 0.95, F(3, 296) = 4.85, p < 0.01. However, the Wilks' Λ of 0.95 suggests only 5% of the variance can be explained by whether humans or machines generated a reply. The follow-up ANOVA tests did not find significant differences in emotional and community support between original and generated replies. However, there was a significant difference in informational support between original replies and generated replies, η2 = 0.04, F(1, 298) = 11.725, p < 0.001. Still, the small effect size (η2) indicates that the effect of a reply's generation source on the informational support provided is trivial in practice.

Table 6 Statistical report on types of support between original and generated replies

Results for NLG Model Validation Surveys

Table 7 shows the descriptive statistics of each participant's human-machine labeling, social support detection, and average content quality rating. For example, participant 1 (P1) from Group A evaluated ten post-reply pairs from MOOC A. Among the ten pairs, six were labeled as "machine", while the remaining four were thought to be created by humans. Of the six replies marked as machine-generated, two were actually generated by the GPT-2 model, a 33.33% (2/6 × 100%) rate of correct labels. Of the four replies marked as human-generated, only one was actually written by a human, a 25% (1/4 × 100%) rate of correct labels. In terms of social support, P1 thought four out of five (80%) machine-generated replies showed informational support, two out of five (40%) showed emotional support, and one out of five (20%) demonstrated community support. Regarding content quality, the average rating by P1 was 7 for the machine-generated replies (n = 5) and 6 for the human-generated replies (n = 5).

Table 7 Descriptive statistics on the survey results to compare responses of GPT-2 model and Humans

To understand the generalizability of the trained GPT-2 model, we descriptively compared the rate of correct labels, the social support detection rates, and the ratings of content quality for machine-generated texts between Group A and Group B (see Fig. 6(1)). The results showed similar ratios of correct labels for machine-generated texts between Group A (M = .4523, SD = .1684) and Group B (M = .4857, SD = .1212). Meanwhile, machine-generated texts received a higher ratio of informational support in Group A (M = .7000, SD = .1414) than in Group B (M = .6000, SD = 0). In contrast, the ratios of emotional and community support for machine-generated texts in Group B (Memotional = .5000, SDemotional = .1414; Mcommunity = .5000, SDcommunity = .1414) were higher than those in Group A (Memotional = .2000, SDemotional = .2828; Mcommunity = .2000, SDcommunity = 0). A similar pattern appeared for the average ratings of content quality in Group A (M = 4.90, SD = 2.97) and Group B (M = 5.10, SD = 1.84). It was interesting to see that the levels of emotional and community support detected in Group B were greater than those in Group A, even though the GPT-2 model was trained within the context of MOOC A. The findings suggest that the trained GPT-2 model's performance can generalize to a similar but new context.

Fig. 6 Comparisons of participants' performance between human- and machine-generated texts and between MOOC A and MOOC B

To find out whether the trained GPT-2 model can generate replies similar to those generated by humans, we descriptively examined the differences in the rate of correct labels, the social support detection rates, and the ratings of content quality between texts generated by the machine and by humans (see Fig. 6(2)). Participants were almost equally likely to correctly label a human-generated reply (M = .4958, SD = .2065) and a machine-generated one (M = .4691, SD = .1213). Human-generated texts showed a higher ratio of community support (M = .4500, SD = .1000) than machine-generated texts (M = .3500, SD = .1915). However, higher ratios of informational (Mmachine = .6500, SDmachine = .1000 vs. Mhuman = .4500, SDhuman = .1000) and emotional support (Mmachine = .3500, SDmachine = .2517 vs. Mhuman = .2500, SDhuman = .1915) were found in machine-generated texts. Lastly, the average content quality rated by participants was higher for human-generated texts (M = 5.80, SD = .82) than for machine-generated texts (M = 5.00, SD = 2.02).

Discussion

A discussion forum is an essential place for MOOC learners, where learning transactions happen as students provide support to each other (Almatrafi et al. 2018; Wang et al. 2015; Xing et al. 2019). However, students' communication in MOOC discussion forums has been found to be scarce (Chiu and Hew 2018). This scarcity of interaction was also found in this study, with each post having only 0.6 replies on average. Furthermore, due to the size of MOOCs and the vast number of posts generated every day, it is difficult for MOOC instructors to manually provide timely support for each learner (Almatrafi et al. 2018). This research aims to provide students with automatic support in MOOCs; more specifically, to support students socio-emotionally and efficiently by providing machine-generated replies to their posts in MOOC discussion forums. The goal of this study was to examine the possibility of generating texts similar to those generated by humans with deep-learning-based NLG models. In this section, we discuss the results presented in the previous section.

Previous work on using NLG to provide learning support has relied heavily on manual engineering (Demetriadis et al. 2018; Tegos et al. 2019), an approach that requires advanced knowledge of algorithms and software engineering and is therefore difficult for general researchers and educators. More importantly, NLG models built only with manual engineering can limit the diversity of generated texts and topics. NLG with deep learning therefore shows potential for providing learners with automatic support, in that deep learning does not require many formal specifications from human operators to achieve desired results. This study examined the use of the state-of-the-art deep learning models GPT-2 and RNN to support students with text generation. Our results showed that NLG with deep learning can provide learners with genuine responses without extensive feature engineering. Specifically, the results showed that the GPT-2 model could generate more coherent and readable texts than the RNN model. In our observation, GPT-2 can generate natural, contextual texts that are difficult to distinguish from human-written texts (see Table 2). The distributions of readability and similarity for GPT-2 align better with those of the original replies than RNN's do (see Figs. 4 and 5). The main range of F-K grade levels produced by GPT-2 (approximately 3 to 13) in this study resonates with Kincaid et al. (1975), who found the U.S. Navy's training manuals had F-K grade levels ranging from 5.5 to 16.3. While the F-K grade level has been widely used to evaluate content readability in different domains and has been shown to be an effective metric (Feng et al. 2010; Lipari et al. 2019; Xu et al. 2016), there are concerns about it due to the weights of word and sentence lengths in its calculation (McNamara et al. 2014). The advent of the internet has shaped online language towards shorter forms with abbreviations and symbols (Tagliamonte and Denis 2008). The F-K grade level, in this case, might not always accurately reflect content readability. For example, a sequence of repetitive symbols such as "!" will have a low F-K grade level (see Table 4), yet the meaning of such symbols is subject to diverse interpretations and is not highly readable. Future studies on the automatic evaluation of readability might consider improved metrics such as Coh-Metrix (McNamara et al. 2014), which is less sensitive to the lengths of words and sentences and considers the relationships between ideas in the text, making it a potentially more robust readability indicator than the F-K grade level. For cosine similarity, 62% of responses generated by GPT-2 fall in the range from 0.5 to 0.8, which is closer to the percentage for the original replies (66%) than to that of RNN (58%). In a study by Vakulenko et al. (2018), the researchers found that a cosine similarity around 0.7 best reflects texts' strong coherence within a conversation. Therefore, the main range of GPT-2's cosine similarity indicates functional coherence and suggests the model can generate texts that effectively address students' posts.

Social support theory has been studied extensively in online learning settings, and the literature suggests that informational, emotional, and community support are important to students' learning (Hew 2016; Tomkin and Charlevoix 2014; Xing et al. 2018). However, few works have examined the design and development of NLG models to support learning from the perspective of social support theory, a gap this study aims to fill. We therefore evaluated the content generated by human learners and by the GPT-2 model in terms of informational, emotional, and community support. A MANOVA and follow-up ANOVA tests were then run to identify significant differences. Descriptive results show that human-generated texts tend to provide more informational, emotional, and community support. However, the lack of significance suggests the GPT-2 model can provide an extent of emotional and community support similar to that of human learners. While human learners' replies are significantly more informative, the small effect size suggests that this difference from machine-generated replies is trivial in practice.

To confirm the findings from the qualitative analysis, we then tested the GPT-2 model with real users through surveys. From the descriptive statistics, participants were almost equally likely to correctly label a human-generated reply and a machine-generated one. In the meantime, participants tended to detect more diversified support in machine-generated texts than in those generated by humans (see Fig. 6(2)). The results, to some extent, support our findings from the qualitative analysis that the GPT-2 model can generate texts similar to those generated by humans. Participants' similar rates of correct labels for machine-generated (46.91%) and human-generated (49.58%) texts suggest that it was challenging for participants to tell who had generated the texts. The finding that the trained GPT-2 model could generate texts that were hard to distinguish from human-written texts aligns with previous studies (Adelani et al. 2020; Zhang et al. 2020). In Adelani and colleagues' study, GPT-2 was trained to generate product reviews for online commercial websites, and participants who were asked to differentiate between real and generated reviews rated the two with similar fluency scores. Besides the ability to generate texts that imitate learners well, we also found the GPT-2 model showed potential generalizability, yielding comparable performance in a similar but new context. The comparison between the two groups of participants showed close performance for machine-generated texts. In fact, participants found more social support in texts generated for the MOOC different from the training context (MOOC B) (see Fig. 6(1)).

The trained NLG model can be applied to MOOC contexts in various forms. For example, a web service can be programmed to provide application programming interfaces (APIs) for text generation in response to students' posts. A demo of such a service can be viewed at https://youtu.be/CyKeyb7_XaA. In the demo, a Python script was written to scrape students' posts from MOOC A, and replies were generated with the trained GPT-2 model. The results in the demo suggest that the GPT-2 model can respond well to content not seen during training. For example, for a post in 2019 that greeted everyone in the discussion forum, the GPT-2 model generated the welcoming text "sorry for the late reply. I am so excited to be in this course with you all!" and continued, in the manner of a teacher, to give suggestions on how to connect with people in discussion forums. For another post, in 2018, where the author complained about the assessment in the course, the GPT-2 model generated texts with empathy and encouraged the author to provide suggestions for improvement.

The demo was kept simple to demonstrate the use of deep-learning-based NLG models in MOOC contexts. In a production environment, more complex systems can be built to streamline the support process of NLG models. For example, webhooks of MOOC platforms can be used to subscribe to new post creation, and replies can be generated for posts that remain unanswered after a given amount of time. A webhook is a mechanism that can be configured to trigger pre-programmed functions under specified conditions. Generated replies can also be automatically evaluated for quality control based on the F-K grade level and cosine similarity.
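As an illustration of the web service idea, a minimal API for generating a reply to a post is sketched below with Flask and a GPT-2 checkpoint; the endpoint name, generation settings, and checkpoint are illustrative assumptions rather than the setup used in our demo.

```python
# Hedged sketch of a reply-generation API: POST a student's post, receive a generated reply.
from flask import Flask, request, jsonify
from transformers import GPT2LMHeadModel, GPT2Tokenizer

app = Flask(__name__)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")  # a fine-tuned checkpoint would be loaded here
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.eval()

@app.route("/generate", methods=["POST"])
def generate_reply():
    post_text = request.json["post"]
    input_ids = tokenizer.encode(post_text, return_tensors="pt")
    output_ids = model.generate(input_ids,
                                max_length=input_ids.shape[1] + 60,
                                do_sample=True, top_p=0.9,
                                pad_token_id=tokenizer.eos_token_id)
    reply = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    return jsonify({"reply": reply})

if __name__ == "__main__":
    app.run()
```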

This study presented exploratory work on using NLG with deep learning models to support MOOC learners. Future work may build on this study's methodology or extend its research goals. For example, will the types of support offered by human replies and generated replies differ in other contexts, such as discussion forums for online learning or MOOCs with technical topics? Or, instead of providing support randomly or for posts without replies, will students benefit more in terms of informational, emotional, and community support if their posts are first automatically classified as needing urgent attention? Future studies may also explore what forms of automatic textual support students would prefer to receive; for example, textual support can be generated directly on discussion forums or delivered via a private web page.

Limitations

The study demonstrated the potential of using NLG with deep learning to provide students with automatic socio-emotional support. The deep learning model's capability to generate high-quality texts and its potential generalizability to new contexts demonstrate the possibility of achieving this goal. However, there are a number of limitations, three of which are noted here.

The first limitation is that the current study has limited data from real-case applications. Although we conducted a small-scale experiment to examine whether the NLG model can generate texts similar to those generated by humans, more data is needed to offer sufficient statistical power for inference. Furthermore, collecting students' outcome data with the NLG model applied within a real MOOC environment might help us better understand the effects of machine-generated texts, which was not available in this study. Second, the data used in this study come from a cognitively less challenging environment than courses on topics such as engineering. Students mainly used the discussion forum to form groups and share ideas, and only a few posts expressed negative emotions or sought help. Therefore, the evaluation of informational and emotional support might not apply to more challenging MOOCs. Lastly, although it is helpful to evaluate generated texts with the F-K grade level and cosine similarity, these metrics cannot guarantee high quality. Generated texts may have desirable readability and coherence metrics while still being unreadable or out of context. The unpredictability of low-quality generated texts might yield negative experiences for MOOC learners.

Conclusion

By examining the texts generated by the deep learning models RNN and GPT-2, this study conducted preliminary work on providing social support with automatically generated texts. The results suggested that the GPT-2 model greatly outperformed the RNN model in terms of word perplexity, readability, and coherence with contexts. Evaluation of the informational, emotional, and community support provided by human replies and GPT-2-generated replies indicated that the texts generated by GPT-2 could provide an extent of emotional and community support similar to that of humans, while significantly less informational support was detected in machine-generated replies. Further survey analysis suggested that although differences exist between texts generated by the machine and by humans in terms of ease of differentiation, social support, and content quality, such differences were small and aligned with our prior results. The findings imply that it is possible to provide MOOC learners with automatic social support on a large scale. Ultimately, the goal is to support students by providing machine-generated replies to their posts in MOOC discussion forums so that there can be more interaction, which might in turn lead to more engagement and lower dropout rates as students are socio-emotionally supported in a timely manner. The use of NLG with deep learning models also enriches the literature on providing learners with automatic support and extends the literature on supporting learning with NLG and deep learning.