
1 Introduction

Online discussion forums have gained substantial traction over the past decade and are now a significant avenue of knowledge sharing on the Internet. They attract users with diverse interests and backgrounds: some platforms (e.g., Stack Overflow, MathOverflow) target specific technical subjects, while others (e.g., Quora, Reddit) cover a wide range of topics from politics to entertainment.

More recently, discussion forums have become a significant component of online education, enabling students in online courses to learn socially as a supplement to their studying of the course content individually [2]; social interactions between learners have been seen to improve learning outcomes [4]. In particular, massive open online courses (MOOCs) often have tens of thousands of learners within single sessions, making the social interactions via these forums critical to scaling up instruction [3]. In addition to serving as a versatile complement to self-regulated learning [24], research has shown that learner participation on forums can be predictive of learning outcomes [26].

In this paper, we ask: How can we model the activity of individual learners in MOOC discussion forums? Such a model, designed correctly, presents several opportunities to optimize the learning process, including personalized news feeds to help learners sort through forum content efficiently, and analytics on factors driving participation.

1.1 Prior Work on Discussion Forums

Generic Online Discussion Sites. There is vast literature on analyzing user interactions in online social networks (e.g., on Facebook, Google+, and Twitter). Researchers have developed methods for tasks including link prediction [10, 17], tweet cascade analysis [7, 23], post topic analysis [21], and latent network structure estimation [14, 15]. These methods are not directly applicable to modeling MOOC discussion forums since MOOCs do not support an inherent social structure; learners cannot become “friends” or “follow” one another.

Generic online discussion forums (e.g., Stack Overflow, Quora) have also generated substantial research. Researchers have developed methods for tasks including question-answer pair extraction [5], topic dynamics analysis [27], post structure analysis [25], and user grouping [22]. While these types of forums also lack explicit social structure, MOOC discussion forums exhibit several unique characteristics that need to be accounted for. First, topics in MOOC discussion forums are mostly centered around course content, assignments, and course logistics [3], making them far more structured than generic forums; thus, topic modeling can be used to organize threads and predict future activity. Second, there are no sub-forums in MOOCs: learners all post in the same venue even though their interests in the course vary. Modeling individual interest levels on each topic can thus assist learners in navigating through posts.

MOOC Forums. A few studies on MOOC discussion forums have emerged recently. The works in [19, 20] extracted forum structure and post sentiment information by combining unsupervised topic models with sets of expert-specified course keywords. In this work, our objective is to model learners’ forum behavior, which requires analyzing not only the content of posts but also individual learner interests and temporal dynamics of the posts.

In terms of learner modeling, the work in [8] employed Bayesian nonnegative matrix factorization to group learners into communities according to their posting behavior. This work relies on topic labels of each discussion post, though, which are either not available or not reliable in most MOOC forums. The work in [2] inferred learners’ topic-specific seeking and disseminating tendencies on forums to quantify the efficiency of social learning networks. However, this work relies on separate models for learners and topics, whereas we propose a unified model. The work in [9] couples social network analysis and association rule mining for thread recommendation; while their approach considers social interactions among learners, they ignore the content and timing of posts.

As for modeling temporal dynamics, the work in [3] proposed a method that classifies threads into different categories (e.g., small-talk, course-specific) and ranks thread relevance for learners over time. This model falls short of making recommendations, though, since it does not consider learners individually. The work in [28] employed matrix factorization for thread recommendation and studied the effect of window size, i.e., recommending only threads with posts in a recent time window. However, this model uses temporal information only in post-processing, which limits the insights it offers. The work in [16] focuses on learner thread viewing rather than posting behavior, which is different from our study of social interactions since learners view threads independently.

The model proposed in [18] is perhaps most similar to ours, as it uses point processes to analyze discussion forum posts and associates different timescales with different types of posts to reflect recurring user behavior. With the task of predicting which Reddit sub-forum a user will post in next, the authors base their point process model on self-excitations, as such behavior is mostly driven by a user’s own posting history. Our task, on the contrary, is to recommend threads to learners taking a particular online course: here, excitations induced by other learners (e.g., explicit replies) can significantly affect a learner’s posting behavior. As a result, the model we develop incorporates mutual excitation. Moreover, [18] labels each post based on the Reddit sub-forum it belongs to; no such sub-forums exist in MOOCs.

1.2 Our Model and Contributions

In this paper, we propose and experimentally validate a probabilistic model for learners posting on MOOC discussion forums. Our main contributions are as follows.

First, through point processes, our model captures several important factors that influence a learner’s decision to post. In particular, it models the probability that a learner makes a post in a thread at a particular point in time based on four key factors: (i) the interest level of the learner on the topic of the thread, (ii) the timescale of the thread topic (which corresponds to how fast the excitation induced by new posts on the topic decays over time), (iii) the timing of the previous posts in the thread, and (iv) the nature of the previous posts regarding this learner (e.g., whether they explicitly reply to the learner). Through evaluation on three real-world datasets (the largest having more than 6,000 learners making more than 40,000 posts in more than 5,000 threads), we show that our model significantly outperforms several baselines in terms of thread recommendation, thus showing promise of being able to direct learners to threads they are interested in.

Second, we derive a Gibbs sampling parameter inference algorithm for our model. While existing work has relied on thread labels to identify forum topics, such metadata is usually not available for MOOC forum threads. As a result, we jointly analyze the post timestamp information and the text of the thread by coupling the point process model with a topic model, enabling us to learn the topics and other latent variables through a single procedure.

Third, we demonstrate several types of analytics that our model parameters can provide, using our datasets as examples. These include: (i) identifying the timescales (measured as half-lives) of different topics, from which we find that course logistics-related topics have the longest-lasting excitations, (ii) showing that learners are much (20–30 times) more likely to post again in threads they have already posted in, and (iii) showing that learners receiving explicit replies in threads are much (300–500 times) more likely to post again in these threads to respond to these replies.

2 Point Processes Forum Model

An online course discussion forum is generally comprised of a series of threads, with each thread containing a sequence of posts and comments on posts. Each post/comment contains a body of text, written by a particular learner at a particular point in time. A thread can further be associated with a topic, based on analysis of the text written in the thread. See our online technical report [11] for an example of a thread in a MOOC consisting of eight posts and comments and more intuitive explanations of the model setup. Moving forward, the terminology “posting in a thread” will refer to a learner writing either a post or a comment.

We postulate that a learner’s decision to post in a thread at a certain point in time is driven by four main factors: (i) the learner’s interest in the thread’s topic, (ii) the timescale of the thread’s topic, (iii) the number and timing of previous posts in the thread, and (iv) the learner’s prior activity in the thread (e.g., whether there are posts that explicitly reply to the learner). The first factor is consistent with the fact that MOOC forums generally have no sub-forums: in the presence of diverse threads, learners are most likely to post in those covering topics they are interested in. The second factor reflects the observation that different topics exhibit different patterns of temporal dynamics. The third factor captures the common options for thread-ranking that online forums provide to users, e.g., by popularity or recency; learners are more likely to visit those at the top of these rankings. The fourth factor captures the common setup of notifications in discussion forums: learners are typically subscribed to threads automatically once they post in them, and notified of any new posts (especially those that explicitly reply to them) in these threads. To capture these dynamics, we model learners’ posts in threads as events in temporal point processes [6], which will be described next.

Point Processes. A point process, which generalizes the Poisson process, is characterized by a rate function \(\lambda (t)\) that models the probability that an event occurs in an infinitesimal time window \(\mathrm {d}t\) [6]. Formally, the rate function at time t is given by

$$\begin{aligned} \lambda (t) = \textstyle \lim _{\mathrm {d}t \rightarrow 0} \frac{\mathbb {P}\left( \text {event in } [t, t+\mathrm {d}t)\right) }{\mathrm {d}t} = \textstyle \lim _{\mathrm {d}t \rightarrow 0} \frac{\mathbb {E}\left[ N(t+\mathrm {d}t) - N(t)\right] }{\mathrm {d}t}, \end{aligned}$$
(1)

where N(t) denotes the number of events up to time t [6]. Assuming the time period of interest is [0, T), the likelihood of a series of events at times \(t_1, \ldots , t_N < T\) is given by:

$$\begin{aligned} \mathcal {L}(\{{t_i}\}_{i=1}^N) = \left( \textstyle \prod _{i=1}^N \lambda (t_i)\right) e^{-\int _0^T \lambda (\tau ) \mathrm {d} \tau }. \end{aligned}$$
(2)

In this paper, we are interested in rate functions that are affected by excitations of past events (e.g., forum posts in the same thread). Thus, we resort to Hawkes processes [18], which characterize the rate function at time t given a series of past events at \(t_1, \ldots , t_{N'} < t\) as

$$\begin{aligned} \lambda (t) = \mu + a \textstyle \sum _{i=1}^{N'} \kappa (t-t_i), \end{aligned}$$

where \(\mu \ge 0\) denotes the constant background rate, \(a \ge 0\) denotes the amount of excitation each event induces, i.e., the increase in the rate function after an event, and \(\kappa (\cdot ): \mathbb {R}_+ \rightarrow [0,1]\) denotes a non-increasing decay kernel that controls the decay in the excitation of past events over time. In this paper, we use the standard exponential decay kernel \(\kappa (t) = e^{-\gamma t}\), where \(\gamma \) denotes the decay rate. Through our model, different decay rates can be associated with different topics [18]; as we will see, this model choice enables us to categorize posts into groups (e.g., course content-related, small talk, or course logistics) based on their timescales, which leads to better model analytics.
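For concreteness, the following is a minimal sketch (our own illustration, not code from the paper) of a Hawkes rate function with the exponential decay kernel and of the corresponding likelihood in (2); the parameter values in the toy example are arbitrary placeholders.

```python
import numpy as np

def hawkes_rate(t, event_times, mu, a, gamma):
    """Rate lambda(t) = mu + a * sum_i exp(-gamma * (t - t_i)) over past events t_i < t."""
    past = np.array([ti for ti in event_times if ti < t])
    return mu + a * np.exp(-gamma * (t - past)).sum()

def hawkes_log_likelihood(event_times, T, mu, a, gamma):
    """Log of (2): sum_i log lambda(t_i) minus the integral of lambda over [0, T);
    the exponential kernel makes the integral available in closed form."""
    times = np.sort(np.array(event_times))
    log_rates = sum(np.log(hawkes_rate(ti, times, mu, a, gamma)) for ti in times)
    integral = mu * T + (a / gamma) * np.sum(1.0 - np.exp(-gamma * (T - times)))
    return log_rates - integral

# Toy example with arbitrary parameter values (time measured in hours).
times = [1.0, 1.5, 4.0]
print(hawkes_rate(5.0, times, mu=0.1, a=0.8, gamma=0.5))
print(hawkes_log_likelihood(times, T=10.0, mu=0.1, a=0.8, gamma=0.5))
```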

Rate Function for New Posts. Let U, K, and R denote the number of learners, topics, and threads in a discussion forum, indexed by u, k, and r, respectively. We assume that each thread r functions independently, and that each learner’s activities in each thread and on each topic are independent. Further, let \(z_r\) denote the topic of thread r, and let \(P_r\) denote the total number of posts in the thread, indexed by p; for each post p, we use \(u_p^r\) and \(t_p^r\) to denote the learner index and time of the post, and we use \(p^r_i(u)\) to denote the \(i^\text {th}\) post of learner u in thread r. Note that posts in a thread are indexed in chronological order, i.e., \(p < p'\) if and only if \(t_p^r < t_{p'}^r\). Finally, let \(\gamma _k \ge 0\) denote the decay rate of each topic and let \(a_{u,k}\) denote the interest level of learner u on topic k. We model the rate function that characterizes learner u posting in thread r (on topic \(z_r = k\)) at time t given all previous posts in the thread (i.e., posts with \(t_p^r < t\)) as

$$\begin{aligned} \lambda _{u,k}^r (t) = \left\{ \begin{array}{ll} a_{u,k} \sum \nolimits _p e^{-\gamma _k (t - t_p^r)} &{} \text {if} \;\; t< t_{p_1^r (u)}^r \\ a_{u,k} \sum \nolimits _{p: p < p_1^r (u)} e^{-\gamma _k (t - t_p^r)} \\ \, + \alpha \, a_{u,k} \sum \nolimits _{p: p \ge p_1^r (u), u \notin d_p^r} e^{-\gamma _k (t - t_p^r)} \\ \, + \beta \alpha \, a_{u,k} \sum \nolimits _{p: u \in d_p^r} e^{-\gamma _k (t - t_p^r)} &{} \text {if} \;\; t \ge t_{p_1^r (u)}^r. \end{array} \right. \end{aligned}$$
(3)

In our model, \(a_{u,k}\) characterizes the base level of excitation that learner u receives from posts in threads on topic k, which captures the different interest levels of learners on different topics. The exponential decay kernel models a topic-specific decay in excitation of rate \(\gamma _k\) from the time of the post.

Before \(t_{p_1^r(u)}^r\) (the timestamp of the first post learner u makes in thread r), learner u’s rate is given solely by the number and recency of posts in r (\(t_{p_1^r (u)}^r = \infty \) if the learner never posts in this thread), while all posts occurring after \(t_{p_1^r(u)}^r\) induce additional excitation characterized by the scalar variable \(\alpha \). This model choice captures the common setup in MOOC forums that learners are automatically subscribed to threads after they post in them. Therefore, we postulate that \(\alpha > 1\), since new post notifications that come with thread subscriptions tend to increase a learner’s chance of viewing these new posts, in turn increasing their likelihood of posting again in these threads. The observation of users posting immediately after receiving notifications is sometimes referred to as the “bursty” nature of posts on social media [7].

We further separate posts made after \(t_{p_1^r(u)}^r\) by whether or not they constitute explicit replies to learner u. A post \(p'\) is considered to be an explicit reply to a post p in the same thread r if \(t^r_{p'} > t^r_p\) and one of the following conditions is met: (i) \(p'\) makes direct reference (e.g., through name or the @ symbol) to the learner who made post p, or (ii) \(p'\) is the first comment under p. \(d_p^r\) in (3) denotes the set of explicit recipients of p, i.e., if p is an explicit reply to learner u, then \(u \in d_p^r\), while if p is not an explicit reply to any learners then \(d_p^r = \emptyset \). This setup captures the common case of learners being notified of posts that explicitly reply to them in a thread. The scalar \(\beta \) characterizes the additional excitation these replies induce; we postulate that \(\beta > 1\), i.e., the personal nature of explicit replies to learners’ posts tends to further increase the likelihood of them posting again in the thread (e.g., to address these explicit replies).
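To make (3) concrete, the following sketch (ours, not the authors' implementation) evaluates the piecewise rate under the assumption that each post is stored as a (time, learner, explicit-recipient-set) tuple; a_uk, alpha, beta, and gamma_k stand for the model parameters \(a_{u,k}\), \(\alpha \), \(\beta \), and \(\gamma _k\).

```python
import numpy as np

def post_rate(t, u, posts, a_uk, alpha, beta, gamma_k):
    """Evaluate the piecewise rate lambda_{u,k}^r(t) in (3) for learner u in a
    thread on topic k, given the thread's posts as (t_p, u_p, d_p) tuples."""
    # Time of learner u's first post in the thread (infinity if u has not posted yet).
    own_times = [t_p for t_p, u_p, _ in posts if u_p == u and t_p < t]
    t_first = min(own_times) if own_times else np.inf

    rate = 0.0
    for t_p, u_p, d_p in posts:
        if t_p >= t:
            continue                          # only past posts excite the rate
        if t_p < t_first:
            weight = a_uk                     # before u's first post: base excitation
        elif u in d_p:
            weight = beta * alpha * a_uk      # post explicitly replies to u
        else:
            weight = alpha * a_uk             # u is subscribed to the thread
        rate += weight * np.exp(-gamma_k * (t - t_p))
    return rate
```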

Rate Function for Initial Posts. We must also model the process of generating the initial posts in threads. We characterize the rate function of these posts as time-invariant:

$$\begin{aligned} \lambda _{u,k}^r (t) = \mu _{u,k}, \end{aligned}$$
(4)

where \(\mu _{u,k}\) denotes the background posting rate of learner u on topic k. Separating the initial posts in threads from future posts in this way enables us to model learners’ knowledge seeking (i.e., starting threads) and knowledge disseminating (i.e., posting responses in threads) behavior [2], through the background (\(\mu _{u,k}\)) and excitation levels (\(a_{u,k}\)), respectively.

Post Text Modeling. Finally, we must also model the text of each thread. Given the topic \(z_r = k\) of thread r, we model \(\mathcal {W}_r\)—the bag-of-words representation of the text in r across all posts—as being generated from the standard latent Dirichlet allocation (LDA) model [1], with topic-word distributions parameterized by \({\phi }_k\). Details on the LDA model and the posterior inference step for \({\phi }_k\) via collapsed Gibbs sampling in our parameter inference algorithm are omitted for simplicity of exposition.
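For intuition only, and under the simplifying assumption that a point estimate of \({\phi }_k\) is available (the collapsed sampler integrates \({\phi }_k\) out), the thread-text likelihood \(P(\mathcal {W}_r \mid z_r = k)\) used in the Gibbs updates below reduces to a multinomial score of the thread's bag of words:

```python
import numpy as np

def thread_text_log_likelihood(word_counts, phi_k):
    """Approximate log P(W_r | z_r = k): the thread's bag of words scored
    against the topic-word probability vector phi_k."""
    word_counts = np.asarray(word_counts, dtype=float)
    return float(np.sum(word_counts * np.log(np.asarray(phi_k) + 1e-12)))
```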

3 Parameter Inference

We now derive the parameter inference algorithm for our model. We perform inference using Gibbs sampling, i.e., iteratively sampling from the posterior distributions of each latent variable, conditioned on the other latent variables. The detailed steps are as follows:

  1. Sample \(z_r\). To sample from the posterior distribution of the topic of each thread, \(z_r\), we put a uniform prior over each topic and arrive at the posterior

    $$\begin{aligned} P(z_r = k\! \mid \!\ldots ) \propto&\; P(\mathcal {W}_r\! \mid \!z_r) \textstyle \prod _{k'} P(\{t_1^{r'}\}_{r': z_{r'} = k', u^r_1 = u^{r'}_1}\! \mid \!\mu _{u_1^r,k'}) \\&\quad \cdot \textstyle \prod _u P(\{t_p^r\}_{p: u_p^r = u}\! \mid \!a_{u,k}, \alpha , \beta , \gamma _k), \end{aligned}$$

    where \(\ldots \) denotes all variables except \(z_r\). \(P(\mathcal {W}_r\!\! \mid \!\!z_r)\) denotes the likelihood of observing the text of thread r given its topic. \(P(\{t_1^{r'}\}_{r': z_{r'} = k', u^r_1 = u^{r'}_1}\!\! \mid \!\mu _{u_1^r,k'})\) denotes the likelihood of observing the sequence of initial thread posts on topic \(k'\) made by the learner who also made the initial post in thread r; this is given by substituting (4) into (2) as

    $$\begin{aligned}&P(\{t_1^{r'}\}_{r': z_{r'} = k', u^r_1 = u^{r'}_1}\! \mid \!\mu _{u_1^r,k'}) = \mu _{u_1^r,k'}^{\sum _{r'} \mathbf {1}_{u_1^r = u^{r'}_1, z_{r'} = k'}} e^{-\mu _{u_1^r,k'}T} \propto \mu _{u_1^r,k'}, \end{aligned}$$
    (5)

    where \(\mathbf {1}_x\) denotes the indicator function that takes the value 1 when condition x holds and 0 otherwise. \(P(\{t_p^r\}_{p: u_p^r = u}\! \mid \!a_{u,k}, \alpha , \beta , \gamma _k)\) denotes the likelihood of observing the sequence of posts made by learner u in thread r, given by

    $$\begin{aligned}&P(\{t_p^r\}_{p: u_p^r = u}\! \mid \!a_{u,k}, \alpha , \beta , \gamma _k) = \left( \textstyle \prod _{p: u^r_p = u} \lambda _{u,z_r}^r(t^r_p)\right) \left( e^{-\int _0^T \lambda _{u,z_r}^r (t) \mathrm {d}t} \right) , \end{aligned}$$
    (6)

    where the rate function \(\lambda _{u,k}^r (t)\) for learner u in thread r (with topic k) is given by (3).

  2. Sample \(\gamma _k\). There is no conjugate prior distribution for the excitation decay rate variable \(\gamma _k\). Therefore, we resort to a pre-defined set of decay rates \(\gamma _k \in \{\gamma _s\}_{s=1}^S\). We put a uniform prior on \(\gamma _k\) over values in this set, and arrive at the posterior given by

    $$\begin{aligned}&P(\gamma _k = \gamma _s\! \mid \!\ldots ) \propto \textstyle \prod _{r: z_r = k} \textstyle \prod _u P(\{t_p^r\}_{p: u_p^r = u}\! \mid \!a_{u,k}, \alpha , \beta , \gamma _s). \end{aligned}$$
  3. Sample \(\mu _{u,k}\). The conjugate prior of the background posting rate variable \(\mu _{u,k}\) is the Gamma distribution. Therefore, we put a prior on \(\mu _{u,k}\) as \(\mu _{u,k} \sim \text{ Gam }(\alpha _\mu ,\beta _\mu )\) and arrive at the posterior distribution

    $$P(\mu _{u,k}\! \mid \!\ldots ) \propto \text{ Gam }(\alpha _\mu ',\beta _\mu ')$$

    where

    $$\begin{aligned} \alpha _\mu ' = \alpha _\mu + \textstyle \sum _r \mathbf {1}_{u_1^r = u, z_r = k}, \qquad \beta _\mu ' = \beta _\mu + T. \end{aligned}$$
  4. Sample \(a_{u,k}\), \(\alpha \), and \(\beta \). The latent variables \(\alpha \) and \(\beta \) have no conjugate priors. As a result, we introduce an auxiliary latent variable [14, 23] \(e_p^r\) for each post p, where \(e_{p'}^r = p\) means that post p is the “parent” of post \(p'\) in thread r, i.e., post \(p'\) was caused by the excitation that the previous post p induced. We first sample the parent variable for each post \(p'\) according to

    $$\begin{aligned} P(e_{p'}^r = p) \propto a^r(p,p') e^{- \gamma _{z_r} (t_{p'}^r - t_p^r)}, \end{aligned}$$

    where \(a^r(p,p') \in \{a_{u_{p'}^r,z_r}, \alpha a_{u_{p'}^r,z_r}, \beta \alpha a_{u_{p'}^r,z_r} \}\) depending on the relationship between posts p and \(p'\) from our model, i.e., whether \(p'\) is the first post of \(u_{p'}\) in the thread, and if not, whether p is an explicit reply to \(u_{p'}\). In general, the set of possible parents of p is all prior posts \(1, \ldots , p-1\) in r, but in practice, we make use of the structure of each thread to narrow down the set of possible parents for some posts.

    With these parent variables, we can write the overall likelihood of the observed posts as

    $$\begin{aligned}&\mathcal {L} = \textstyle \prod _r \mathcal {L}(\{t_p^r\}_{p=1}^{P_r}) = \textstyle \prod _r \textstyle \prod _u \mathcal {L}(\{t_p^r\}_{p: u_p^r=u}), \end{aligned}$$

    where \(\mathcal {L}(\{t_p^r\}_{p: u_p^r=u})\) denotes the likelihood of the series of posts learner u makes in thread r. We can then expand the likelihood using the parent variables as

    $$\begin{aligned}&\mathcal {L}(\{t_p^r\}_{u_p^r=u}) = \textstyle \prod _{p: p < p_1^r(u)} e^{-\frac{a_{u,z_r}}{\gamma _{z_r}}(1-e^{-\gamma _{z_r}(T - t_p^r)})} \\&\quad \left( \textstyle \prod _{p': u_{p'}^r = u, e_{p'}^r = p} a_{u,z_r} e^{-\gamma _{z_r}(t_{p'}^r - t_p^r)} \right) \textstyle \prod _{p: p \ge p_1^r (u), u \notin d_p^r} e^{-\frac{\alpha a_{u,z_r}}{\gamma _{z_r}}(1-e^{-\gamma _{z_r}(T - t_p^r)})} \\&\quad \left( \textstyle \prod _{p': u_{p'}^r = u, e_{p'}^r = p} \alpha a_{u,z_r} e^{-\gamma _{z_r}(t_{p'}^r - t_p^r)} \right) \textstyle \prod _{p: u \in d_p^r} e^{-\frac{\beta \alpha a_{u,z_r}}{\gamma _{z_r}}(1-e^{-\gamma _{z_r}(T - t_p^r)})} \\&\quad \quad \quad \quad \quad \quad \cdot \left( \textstyle \prod _{p': u_{p'}^r = u, e_{p'}^r = p} \beta \alpha a_{u,z_r} e^{-\gamma _{z_r}(t_{p'}^r - t_p^r)} \right) . \end{aligned}$$

    We now see that Gamma distributions are conjugate priors for \(a_{u,k}\), \(\alpha \), and \(\beta \). Specifically, if \(a_{u,k} \sim \text{ Gam }(\alpha _a,\beta _a)\), its posterior is given by \(P(a_{u,k} | \ldots ) \sim \text{ Gam }(\alpha _a',\beta _a')\) where

    $$\begin{aligned}&\alpha _a' = \alpha _a + \textstyle \sum _{r:z_r = k} \textstyle \sum _p \mathbf {1}_{u_{p}^r = u}, \\&\beta _a' = \beta _a + \textstyle \sum _{r:z_r = k} \Big (\textstyle \sum _{p: p < p_1^r(u)} \frac{1}{\gamma _k} (1-e^{-\gamma _k (T - t_p^r)}) \\&\quad + \textstyle \sum _{p:p \ge p_1^r(u), u \notin d_p^r} \frac{\alpha }{\gamma _k} (1-e^{-\gamma _k (T - t_p^r)}) + \textstyle \sum _{p: u \in d_p^r} \frac{\beta \alpha }{\gamma _k} (1-e^{-\gamma _k (T - t_p^r)}) \Big ). \end{aligned}$$

    Similarly, if \(\alpha \sim \text{ Gam }(\alpha _\alpha ,\beta _\alpha )\), the posterior is \(P(\alpha | \ldots ) \sim \text{ Gam }(\alpha _\alpha ',\beta _\alpha ')\) where

    $$\begin{aligned}&\alpha _\alpha ' = \alpha _\alpha + \textstyle \sum _r \textstyle \sum _p \textstyle \sum _{p'} \mathbf {1}_{e_{p'}^r = p, p \ge p_1^r (u_{p'}^r)}, \\&\beta _\alpha ' = \beta _\alpha + \textstyle \sum _r \textstyle \sum _u \Big ( \textstyle \sum _{p:p \ge p_1^r(u), u \notin d_p^r} \frac{a_{u,z_r}}{\gamma _{z_r}} (1-e^{-\gamma _{z_r} (T - t_p^r)}) \\ {}&\quad + \textstyle \sum _{p: u \in d_p^r} \frac{\beta a_{u,z_r}}{\gamma _{z_r}} (1-e^{-\gamma _{z_r} (T - t_p^r)}) \Big ). \end{aligned}$$

    Finally, if \(\beta \sim \text{ Gam }(\alpha _\beta ,\beta _\beta )\), the posterior is \(P(\beta | \ldots ) \sim \text{ Gam }(\alpha _\beta ',\beta _\beta ')\) where

    $$\begin{aligned}&\alpha _\beta ' = \alpha _\beta + \textstyle \sum _r \textstyle \sum _p \textstyle \sum _{p'} \mathbf {1}_{e_{p'}^r = p, u_{p'}^r \in d_p^r}, \\&\beta _\beta ' = \beta _\beta + \textstyle \sum _r \textstyle \sum _u \textstyle \sum _{p: u \in d_p^r} \frac{\alpha a_{u,z_r}}{\gamma _{z_r}} (1-e^{-\gamma _{z_r} (T - t_p^r)}). \end{aligned}$$

We iterate the sampling steps 1–4 above after randomly initializing the latent variables according to their prior distributions. After a burn-in period, we take samples from the posterior distribution of each variable over multiple iterations, and use the average of these samples as its estimate.
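The sketch below (our own schematic, not the authors' code, with the \(z_r\), \(\gamma _k\), \(\mu _{u,k}\), \(\alpha \), and \(\beta \) updates omitted) illustrates the two parts of step 4: sampling a post's parent and the conjugate Gamma update for \(a_{u,k}\). Threads are assumed to be stored as lists of (time, learner, explicit-recipient-set) tuples, z maps thread indices to topics, and a is a dictionary keyed by (learner, topic).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_parent(idx, posts, k, a, alpha, beta, gamma_k):
    """Sample the parent e_{p'} of the (non-initial) post p' = posts[idx]: each
    earlier post is chosen with probability proportional to the excitation it
    induces on the poster of p' at time t_{p'}, per (3)."""
    t_pp, u_pp, _ = posts[idx]
    t_first = min(t for t, u, _ in posts if u == u_pp)   # u_pp's first post time
    weights = []
    for t_p, _, d_p in posts[:idx]:
        if u_pp in d_p:
            w = beta * alpha * a[(u_pp, k)]               # explicit reply to u_pp
        elif t_p >= t_first:
            w = alpha * a[(u_pp, k)]                      # after u_pp's first post
        else:
            w = a[(u_pp, k)]                              # before u_pp's first post
        weights.append(w * np.exp(-gamma_k * (t_pp - t_p)))
    weights = np.asarray(weights)
    return rng.choice(idx, p=weights / weights.sum())

def resample_a(u, k, threads, z, alpha, beta, gammas, T, alpha_a=1e-4, beta_a=1.0):
    """Conjugate Gamma update for the interest level a_{u,k} (the alpha_a' and
    beta_a' expressions above), followed by a draw from the posterior."""
    shape, rate = alpha_a, beta_a
    for r, posts in threads.items():
        if z[r] != k:
            continue
        own = [t_p for t_p, u_p, _ in posts if u_p == u]
        t_first = min(own) if own else np.inf
        shape += len(own)                      # posts by u in threads on topic k
        for t_p, _, d_p in posts:              # compensator terms from (3)
            c = beta * alpha if u in d_p else (alpha if t_p >= t_first else 1.0)
            rate += (c / gammas[k]) * (1.0 - np.exp(-gammas[k] * (T - t_p)))
    return rng.gamma(shape, 1.0 / rate)
```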

4 Experiments

In this section, we experimentally validate our proposed model using three real-world MOOC discussion forum datasets. In particular, we first show that our model obtains substantial gains in thread recommendation performance over several baselines. Subsequently, we demonstrate the analytics on forum content and learner behavior that our model offers.

4.1 Datasets

We obtained three discussion forum datasets from 2012 offerings of MOOCs on Coursera: Machine Learning (ml), Algorithms, Part I (algo), and English Composition I (comp). The number of threads, posts and learners appearing in the forums, and the duration (the number of weeks with non-zero discussion forum activity) of the courses are given in Table 1.

Table 1. Basic statistics on the datasets.

Prior to experimentation, we perform a series of pre-processing steps. First, we prepare the text for topic modeling by (i) removing non-ASCII characters, URL links, punctuation, and words that contain digits, (ii) converting nouns and verbs to base forms, (iii) removing stopwords, and (iv) removing words that appear fewer than 10 times or in more than 10% of threads. Second, we extract the following information for each post: (i) the ID of the learner who made the post (\(u_p^r\)), (ii) the timestamp of the post (\(t_p^r\)), and (iii) the set of learners it explicitly replies to as defined in the model (\(d_p^r\)). For posts made anonymously, we do not include rates for them (\(\lambda _{u,k}^r(t)\)) when computing the likelihood of a thread, but we do include them as sources of excitation for non-anonymous learners in the thread.
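One possible implementation of these pre-processing steps (a sketch using NLTK; the paper does not name its tooling, so the library choice is our assumption, while the count and document-frequency thresholds match those stated above) is:

```python
import re
from collections import Counter

from nltk.corpus import stopwords           # requires nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer     # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def tokenize(text):
    """Steps (i)-(iii): strip URLs and non-ascii characters, drop punctuation and
    digit-containing words, reduce verbs/nouns to base forms, remove stopwords."""
    text = re.sub(r'http\S+', ' ', text)
    text = text.encode('ascii', 'ignore').decode()
    tokens = re.split(r'\W+', text.lower())
    tokens = [w for w in tokens if w and not re.search(r'\d', w)]
    tokens = [lemmatizer.lemmatize(lemmatizer.lemmatize(w, 'v'), 'n') for w in tokens]
    return [w for w in tokens if w not in stop_words]

def build_vocabulary(thread_texts, min_count=10, max_doc_frac=0.10):
    """Step (iv): keep words appearing at least min_count times and in at most
    max_doc_frac of the threads."""
    tokenized = [tokenize(t) for t in thread_texts]
    counts = Counter(w for toks in tokenized for w in toks)
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    n_threads = len(tokenized)
    return {w for w in counts
            if counts[w] >= min_count and doc_freq[w] <= max_doc_frac * n_threads}
```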

4.2 Thread Recommendation

Experimental Setup. We now test the performance of our model on personalized thread recommendation. We run three different experiments, splitting the dataset based on the time of each post. The training set includes only threads initiated during the time interval \([0, T_1)\), i.e., \(\{r: t_1^r \in [0, T_1)\}\), and only posts on those threads made before \(T_1\), i.e., \(\{p: t_p^r < T_1\}\). The test set contains posts made in the time interval \([T_1, T_2)\), i.e., \(\{p: t_p^r \in [T_1, T_2)\}\), but excludes new threads initiated during the test interval.

In the first experiment, we hold the length of the testing interval fixed to 1 day, i.e., \(\varDelta T = T_2 - T_1 = 1\,\text {day}\), and vary the length of the training interval as \(T_1 \in \{1\,\text {week}, \ldots , W-1\,\text {weeks} \}\), where W denotes the number of weeks that the discussion forum stays active. We set W to 10, 8, and 8 for ml, comp, and algo, respectively, to ensure the number of posts in the testing set is large enough. These numbers are less than those in Table 1 since learners drop out during the course, which leads to decreasing forum activity. In the second experiment, we hold the length of the training interval fixed at \(W-1\) weeks and vary the length of the testing interval as \(\varDelta T \in \{ 1\,\text {day}, \ldots , 7\, \text {days}\}\). In the first two experiments, we fix \(K = 5\), while in the third experiment, we fix the length of the training and testing intervals to 7 weeks and 1 week, respectively, and vary the number of latent topics as \(K \in \{2, 3, \ldots , 10, 12, 15, 20\}\).

For training, we set the values of the hyperparameters to \(\alpha _a = \alpha _\mu = 10^{-4}\), and \(\beta _a = \beta _\mu = \alpha _\alpha = \beta _\alpha = \alpha _\beta = \beta _\beta = 1\). We set the pre-defined decay rates \(\{\gamma _s\}_{s=1}^S\) to correspond to half-lives (i.e., the time for the excitation of a post to decay to half of its original value) ranging from minutes to weeks. We run the inference algorithm for a total of 2,000 iterations, with 1,000 of these being burn-in iterations for good mixing.
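Since a post's excitation decays as \(e^{-\gamma t}\), a half-life h corresponds to the decay rate \(\gamma = \ln 2 / h\). A candidate grid spanning minutes to weeks might be constructed as follows (the specific values are our illustrative choice, not taken from the paper):

```python
import numpy as np

# Candidate half-lives in hours: 10 minutes up to two weeks (illustrative grid).
half_lives = np.array([10 / 60, 1, 6, 24, 3 * 24, 7 * 24, 14 * 24])
decay_rates = np.log(2) / half_lives    # gamma_s = ln(2) / half-life
```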

Baselines. We compare the performance of our point process model (PPS) against four baselines: (i) Popularity (PPL), which ranks threads from most to least popular based on the total number of posts in each thread during the training time interval; (ii) Recency (REC), which ranks threads from newest to oldest based on the timestamp of their most recent post; (iii) Social influence (SOC), a variant of our PPS model that replaces learner topic interest levels with learner social influences (the “Hwk” baseline in [7]); and (iv) Adaptive matrix factorization (AMF), our implementation of the matrix factorization-based algorithm proposed in [28]. See our online technical report [11] for a fuller explanation of the AMF baseline and a detailed, head-to-head comparison under the same experimental setting as in [28].

To rank threads in our model for each learner, we calculate the probability that learner u will post in thread r during the testing time interval as

$$\begin{aligned} P(u\mathrm {~posts~in~}r)&= \textstyle \sum _k P(u\mathrm {~posts~in~}r\! \mid \!z_r=k) \,P(z_r=k) \\&= \textstyle \sum _k \Big (1- e^{-\int _{T_1}^{T_2}\lambda _{u,k}^r(t)\mathrm {d}t} \Big )\, P(z_r=k). \end{aligned}$$

The rate function \(\lambda _{u,k}^r(t)\) is given by (3). \(P(z_r=k)\) is given by

$$\begin{aligned} P(z_r=k)&\propto P(z_r=k\! \mid \!u_1^r) \, P(\mathcal {W}_r\! \mid \!z_r=k) \textstyle \prod _u P(\{t_p^r\}_{p: u_p^r = u, t_p^r < T_1}\! \mid \!z_r=k), \end{aligned}$$

where the likelihoods of the initial post and other posts are given by (5) and (6), respectively, and the thread text likelihood \(P(\mathcal {W}_r\!\! \mid \!\!z_r=k)\) is given by the standard LDA model. The threads are then ranked from highest to lowest posting probability.
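A sketch of this ranking computation (our illustration; the topic posterior weights \(P(z_r = k)\) are assumed to be computed as above and passed in, and posts use the same (time, learner, recipient-set) representation as in the earlier sketches) is:

```python
import numpy as np

def posting_probability(u, posts, topic_posterior, a, alpha, beta, gammas, T1, T2):
    """P(u posts in thread r during [T1, T2)) = sum_k (1 - exp(-Int lambda)) P(z_r = k),
    where the integral of (3) over [T1, T2) has a closed form for each training post."""
    own = [t_p for t_p, u_p, _ in posts if u_p == u]
    t_first = min(own) if own else np.inf
    prob = 0.0
    for k, p_k in topic_posterior.items():
        integral = 0.0
        for t_p, _, d_p in posts:                      # training posts, t_p < T1
            w = a[(u, k)] * (beta * alpha if u in d_p
                             else alpha if t_p >= t_first else 1.0)
            # Closed-form integral of w * exp(-gamma_k (t - t_p)) over [T1, T2).
            integral += (w / gammas[k]) * (np.exp(-gammas[k] * (T1 - t_p))
                                           - np.exp(-gammas[k] * (T2 - t_p)))
        prob += (1.0 - np.exp(-integral)) * p_k
    return prob

# Threads are ranked for each learner from highest to lowest posting_probability.
```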

Evaluation Metric. We evaluate recommendation performance using the standard mean average precision for top-N recommendation (MAP@N) metric. This metric is defined by taking the mean (over all learners who posted during the testing time interval) of the average precision

$$\begin{aligned} AP_u \text {@}N = \textstyle \sum _{n=1}^{N} \frac{P_u \text {@} n \cdot \mathbf {1}_{\text {u posted in thread}\,r_u(n)}}{\min \{|\mathcal {R}_u|,N\}}, \end{aligned}$$

where \(\mathcal {R}_u\) denotes the set of threads learner u posted in during the testing time interval \([T_1, T_2)\), \(r_u(n)\) denotes the \(n^\text {th}\) thread recommended to the learner, \(P_u \text {@} n\) denotes the precision at n, i.e., the fraction of threads among the top n recommendations that the learner actually posted in, and \(\mathbf {1}\) denotes the indicator function. We use \(N = 5\) in the first two experiments, and vary \(N \in \{ 3, 5, 10\}\) in the third experiment.
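Computationally, the metric amounts to the following sketch, where recommended is the ranked thread list for a learner and relevant is the set \(\mathcal {R}_u\) of threads they posted in during the test interval (hypothetical helper functions, not the paper's code):

```python
def average_precision_at_n(recommended, relevant, n=5):
    """AP_u@N: at each position i with a hit, add precision@i; normalize by min(|R_u|, N)."""
    hits, score = 0, 0.0
    for i, thread in enumerate(recommended[:n], start=1):
        if thread in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), n) if relevant else 0.0

def map_at_n(per_learner, n=5):
    """MAP@N: mean AP@N over learners who posted during the test interval.
    per_learner: list of (recommended, relevant) pairs."""
    aps = [average_precision_at_n(rec, rel, n) for rec, rel in per_learner if rel]
    return sum(aps) / len(aps) if aps else 0.0
```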

Fig. 1. Plot of recommendation performance over different lengths of the training time window \(T_1\) on all datasets. Our model significantly outperforms every baseline.

Fig. 2. Recommendation performance of the algorithms for varying testing window length \(\varDelta T\) on the algo dataset. The point process-based algorithms have the highest performance and are more robust to \(\varDelta T\).

Results and Discussion. Figure 1 plots the recommendation performance of our model and the baselines over different lengths of the training time window \(T_1\) for each dataset. Overall, we see that our model significantly outperforms the baselines in each case, achieving 15%–400% improvement over the strongest baseline. The fact that PPS outperforms the SOC baseline confirms our hypothesis that in MOOC forums, learner topic preference is a stronger driver of posting behavior than social influence, consistent with the fact that most forums do not have an explicit social network (e.g., of friends or followers). The fact that PPS outperforms the AMF baseline emphasizes the benefit of the temporal element of point processes in capturing the dynamics in thread activities over time, compared to the (mostly) static matrix factorization-based algorithms. Note also that as the amount of training data increases in the first several weeks, the recommendation performance tends to increase for the point process-based algorithms while decreasing for PPL and REC. The observed fluctuations can be explained by the decreasing numbers of learners in the test sets as courses progress, since learners tend to drop out before the end (see also Fig. 4).

Figure 2 plots the recommendation performance over different lengths of the testing time window \(\varDelta T\) for the algo dataset. As in Fig. 1, our model significantly outperforms every baseline. We also see that recommendation performance tends to decrease as the length of the testing time window increases; while the performance of the point process-based algorithms decays only slightly, the performance of the PPL and AMF baselines decreases significantly (by around 50%). This observation suggests that our model excels at modeling long-term learner posting behavior.

Finally, Fig. 3 plots the recommendation performance of the PPS model over different numbers of topics K for the ml dataset, for different choices of N, \(T_1\) and \(\varDelta T\). In each case, the performance rises slightly up to \(K \approx 5\) and then drops for larger values (when overfitting occurs). Overall, the performance is relatively robust to K, for \(K \le 10\).

Fig. 3. Plot of recommendation performance of our model over the number of topics K on the ml dataset. The best performance is obtained at \(K \approx 5\), though performance is stable for \(K \le 10\).

4.3 Model Analytics

Beyond thread recommendation, we also explore a few types of analytics that our trained model parameters can provide. For this experiment, we set \(K = 10\) in order to achieve finer granularity in the topics; we found that this leads to more useful analytics.

Topic Timescales and Thread Categories. Table 2 shows the estimated half-lives (given by \(\ln 2 / \gamma _k\)) and most representative words for five selected topics in the ml dataset that are associated with at least 100 threads. Figure 4 plots the total number of posts made on these topics each week during the course.

Table 2. Estimated half-lives and highest constituent words (obtained by sorting the estimated topic-word distribution parameter vectors \(\phi _k\)) for selected topics in the ml dataset with at least 100 threads. Different types of topics (course content-related, small-talk, or course logistics) exhibit different half-lives.

We observe topics with half-lives ranging from hours to weeks. We can use these timescales to categorize threads: course content-related topics (Topics 1 and 2) mostly have short half-lives of hours, small-talk topics (Topics 3 and 4) stay active for longer with half-lives of around one day, and course logistics topics (Topic 5) have much longer half-lives of around one week. Activities in threads on course content-related topics develop and decay rapidly, since they are most likely spurred by specific course materials or assignments. For example, posts on Topic 1 are about implementing gradient descent, which is covered in the second and third weeks of the course, and posts on Topic 2 are about neural networks, which are covered in the fourth and fifth weeks. Small-talk discussions are extremely common at the beginning and the end of the course, while course logistics discussions (e.g., concerning technical issues) are less frequent but steady in volume throughout the course.

Fig. 4. Plot of the total number of posts on each topic week-by-week in the ml dataset. The week-to-week activity levels vary significantly across topics.

Table 3. Estimated levels of additional excitation brought by new activity notifications and explicit replies.

Excitation from Notifications. Table 3 shows the estimated additional excitation induced by new activity notifications (\(\widehat{\alpha }\)) and explicit replies (\(\widehat{\beta }\)). In each course, we see that notifications increase the likelihood of participation significantly; for example, in ml, a learner’s likelihood of posting after an explicit reply is 473 times higher than without any notification. Notice also that \(\widehat{\beta }\) is lowest while \(\widehat{\alpha }\) is highest in comp. This observation is consistent with the fact that in humanities courses like comp the discussions in each thread will tend to be longer [2], leading to more new activity notifications, while in engineering courses like ml and algo we would expect learners to more directly answer each other’s questions, leading to more explicit replies.

5 Conclusions and Future Work

In this paper, we proposed a point process-based probabilistic model for MOOC discussion forum posts, and demonstrated its performance in thread recommendation and analytics using real-world datasets. Possible avenues of future work include (i) jointly analyzing discussion forum data and time-varying learner grades [12, 13] to better quantify the “flow of knowledge” between learners, (ii) incorporating up-votes and down-votes on the posts into the model, and (iii) leveraging the course syllabus to better model the emergence of new threads.