
1 Introduction

Over the last decades, marketing has made a significant shift from being traditionally product/brand-based to being customer-centric and data-driven, through the intensive use of analytical models and tools. One important aspect of applying analytics in marketing is predicting customer profitability over time based on customer purchasing history and a certain profitability measure, such as customer lifetime value (CLV) or recency, frequency, and monetary (RFM) values. With regard to modelling techniques, there are mainly two categories of models [1]: probabilistic models and machine learning models. A fundamental question to ask is whether a customer’s profitability is predictable, and which models are best suited to a given prediction problem [2]. The primary aim of this research is to provide a case study for such a prediction.

In this paper, a UK-based online retailer is examined for customer profitability prediction. A real transactional data set collected from the retailer is used for the analysis. The RFM values of each customer are employed as a profitability measure, and an associated monthly RFM time series for every customer can be created accordingly from their historical purchasing records. Using k-means clustering, at any given time point all the customers are segmented into three groups based on their RFM values: high, medium, and low profitability groups. The prediction problem concerned here is how a customer’s profitability-group membership evolves over time. For comparison purposes, twelve different models of three types, covering both probabilistic and machine learning models in open-loop and closed-loop modes, are utilized for prediction: regression, multilayer perceptron (MLP), and Naïve Bayesian models. The RFM-based measure was chosen for its simplicity and easy interpretability in practice, and the models selected are classic, simple, and widely used in business for marketing purposes.

A comparative analysis with the given data set and models has demonstrated good predictability of the chosen customer profitability measure for the business under consideration. It also shows how the specific context of the business can help to interpret the modelling outcomes.

The remainder of this paper is organized as follows. Section 2 gives a brief discussion of the relevant work. Section 3 describes in detail the methodology adopted in this work, including the creation of RFM-based time series, customer grouping, and model selection. Detailed experimental settings and results are provided in Sect. 4, and a discussion of the outcomes is given in Sect. 5. Finally, concluding remarks are given in Sect. 6 along with suggested further work.

2 Related Works

In recent years, predicting customer profitability over time has been an active, yet very challenging, research topic. In general, such a prediction mainly involves three interrelated factors:

  • The nature of the business under consideration;

  • Which measure(s) to use to indicate a customer’s profitability; and

  • Which models to employ to best fit the modelling requirements.

The nature of the business under consideration is directly linked to what measures could be adopted; for example, an online business in the retail industry and a marketing consultancy in the fashion industry may use completely different sets of measures. In turn, depending on the measures adopted, a static or a dynamic model could be applied for modelling purposes.

In [3], an RFM score-based time series was created using k-means clustering analysis and used to measure and describe a customer’s profitability for an online retailer. Furthermore, multilayer feedforward neural network models were trained to identify the dynamics of how customer profitability evolved over time.

Interestingly, in [4], RFM was employed to calculate customer loyalty and the Apriori algorithm was used to determine association rules for product bundles. In addition, the work in [5] suggested convolutional neural network structures for predicting the CLV of individual players of video games, and in [6], recurrent neural networks were proposed for customer behavior prediction based on the client loyalty number and RFM values.

Other measures have also been considered, such as the Pareto/NBD (negative binomial distribution) model [7].

In summary, the main work in this area appears to be subject/domain-specific, with no unified approach. CLV and RFM are the most popular measures adopted to reflect customer profitability/loyalty. The most diverse aspect of the relevant research is the modelling approach, and a range of models have been proposed, from very classic regression models to deep learning paradigms.

This research presents a case study for customer profitability prediction in which multiple models are used with a simple yet practically easy-to-implement profitability measure.

3 Methodology

This section describes in detail the main approaches, models, and procedures adopted in this research.

3.1 Recency, Frequency, and Monetary Model

The RFM model [8] has received much attention and has been widely used in customer relationship management (CRM) and direct marketing due to its simplicity and effectiveness for evaluating a customer’s profitability.

Given a set of transactional records of a business over a certain period of time, Recency indicates how recently a customer made a purchase with the business; Frequency shows how often a customer has purchased; and Monetary indicates the total (or average) amount a customer has spent. Therefore, each customer of a business can be characterized by a set of RFM values, and furthermore all customers can be grouped into meaningful segments based on their RFM values so that different marketing strategies can be applied to different customer groups accordingly.

Note that a time series of RFM values can be generated for each customer if they are calculated at consecutive time points, such as at the end of each calendar month over a period of time.
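As an illustration, the monthly RFM snapshots described above can be sketched as follows. This is a minimal, hypothetical example, not the paper's implementation: the record layout and the helper name `rfm_at` are invented, and Recency is measured in whole calendar months.

```python
from datetime import date

# Hypothetical transaction records: (customer_id, purchase_date, amount).
transactions = [
    ("C1", date(2011, 1, 5), 120.0),
    ("C1", date(2011, 1, 20), 80.0),
    ("C2", date(2010, 12, 15), 40.0),
]

def rfm_at(customer_id, as_of, records):
    """Compute (Recency in months, Frequency, Monetary) up to `as_of`."""
    past = [(d, a) for cid, d, a in records if cid == customer_id and d <= as_of]
    if not past:
        return None
    last = max(d for d, _ in past)
    recency = (as_of.year - last.year) * 12 + (as_of.month - last.month)
    frequency = len(past)
    monetary = sum(a for _, a in past)
    return (recency, frequency, monetary)

# End-of-month snapshots yield one RFM time series per customer.
print(rfm_at("C1", date(2011, 1, 31), transactions))  # (0, 2, 200.0)
print(rfm_at("C2", date(2011, 1, 31), transactions))  # (1, 1, 40.0)
```

Calling `rfm_at` at the end of each consecutive month produces the monthly RFM time series used in the rest of the paper.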

3.2 k-Means Clustering

k-means clustering is one of the most popular algorithms in data mining for grouping samples into a certain number of groups (clusters) based on the Euclidean distance measure. Assume \( V_{1} ,V_{2} , \ldots ,V_{n} \) are a set of vectors, where, for instance, each vector represents a customer’s RFM values, and these vectors are to be assigned to k clusters \( S_{1} ,S_{2} , \cdots ,S_{k} \). Then the objective function of k-means clustering is expressed as

$$ f\left( {\mu_{1} ,\mu_{2} , \cdots ,\mu_{k} } \right) = \sum\nolimits_{i = 1}^{k} {\sum\nolimits_{{V_{j} \in S_{i} }} {\left\| {V_{j} - \mu_{i} } \right\|^{2} } } $$
(1)

where \( \mu_{i} \) represents the centroid of cluster \( S_{i} \). The k-means clustering algorithm in the form of pseudocode is shown in Table 1.

Table 1. The k-means clustering algorithm.

In this paper, a group of customers are segmented into three segments using the k-means clustering based on their RFM values: low, medium, or high profitability groups.
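The pseudocode of Table 1 is not reproduced here; a minimal pure-Python sketch of the standard Lloyd-style k-means iteration (an assumed formulation, not necessarily the authors' exact implementation) minimising the objective in Eq. (1) might look as follows. The toy RFM vectors are invented for illustration.

```python
import random

def dist2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # initial centroids picked from the data
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[min(range(k), key=lambda j: dist2(v, centroids[j]))].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        new = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # converged: assignments can no longer change
            break
        centroids = new
    return centroids, clusters

# Toy normalised RFM vectors; in the paper k = 3 for low/medium/high groups.
rfm = [(0.1, 0.9, 0.8), (0.2, 0.8, 0.9), (0.9, 0.1, 0.1), (0.8, 0.2, 0.2)]
centroids, clusters = kmeans(rfm, 2)
```

In practice a library routine with multiple random restarts would be preferable, since a single initialisation can converge to a poor local minimum of Eq. (1).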

3.3 Open-Loop Model and Closed-Loop Model for Time Series Prediction

Time series prediction can, in general, be formalized by open-loop and closed-loop models. Given a time series \( \{ \theta \left( t \right)|t = 1, 2, \cdots ,n\} \), a prediction based on an open-loop model is expressed as

$$ \mathop \theta \limits^{ \wedge } \left( t \right) = f\left( {\theta \left( {t - 1} \right),\theta \left( {t - 2} \right), \cdots ,\theta \left( {t - n} \right)} \right) $$
(2)

where \( f\left( \cdot \right) \) denotes a mapping, and \( \mathop \theta \limits^{ \wedge } \left( t \right) \) represents the predicted value of variable \( \theta \left( t \right) \) at time t using the prior \( n \) observed values of the variable at time points \( t - 1, t - 2, \cdots , t - n \).

A closed-loop model can be expressed as

$$ \mathop \theta \limits^{ \wedge } \left( t \right) = f\left( {\mathop \theta \limits^{ \wedge } \left( {t - 1} \right),\mathop \theta \limits^{ \wedge } \left( {t - 2} \right), \cdots ,\mathop \theta \limits^{ \wedge } \left( {t - n} \right)} \right) $$
(3)

which uses the previous \( n \) predicted values \( \mathop \theta \limits^{ \wedge } \left( {t - 1} \right),\mathop \theta \limits^{ \wedge } \left( {t - 2} \right), \cdots ,\mathop \theta \limits^{ \wedge } \left( {t - n} \right) \) to predict the value of variable \( \theta \left( t \right) \) at time t.
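The difference between the two modes can be sketched generically as below. The helper names and the trivial persistence mapping (next value equals last value) are illustrative only, standing in for any fitted model \( f \).

```python
def open_loop(f, series, n):
    """Predict series[t] from the n observed values series[t-n:t] (Eq. (2))."""
    return [f(series[t - n:t]) for t in range(n, len(series))]

def closed_loop(f, init, steps):
    """Seed with n observations, then feed predictions back (Eq. (3))."""
    window = list(init)
    out = []
    for _ in range(steps):
        y = f(window)          # predict the next value
        out.append(y)
        window = window[1:] + [y]  # slide the window using the prediction itself
    return out

# Toy persistence mapping: the prediction is simply the most recent value.
persist = lambda w: w[-1]
obs = [1, 2, 3, 4]
print(open_loop(persist, obs, 1))    # [1, 2, 3]
print(closed_loop(persist, [4], 3))  # [4, 4, 4]
```

The open-loop model always conditions on fresh observations, whereas the closed-loop model runs autonomously after the initial seed, which is why it suits longer-horizon prediction at the cost of accumulating its own errors.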

3.4 Model Selection

The mapping \( f\left( \cdot \right) \) in an open-loop or a closed-loop model (Eqs. (2) and (3)) can take different forms. In this paper, three models are considered for comparison purposes: linear regression, multilayer perceptron (MLP), and Naïve Bayesian models.

Linear regression is perhaps the simplest model to consider. Using this model for prediction, Eqs. (2) and (3) can be rewritten, respectively, as

$$ \mathop \theta \limits^{ \wedge } \left( t \right) = w_{0} + \sum\nolimits_{i = 1}^{n} {w_{i} \theta \left( {t - i} \right)} $$
(4)

and

$$ \mathop \theta \limits^{ \wedge } \left( t \right) = w_{0} + \sum\nolimits_{i = 1}^{n} {w_{i} \mathop \theta \limits^{ \wedge } \left( {t - i} \right)} $$
(5)

where \( \{ w_{i} |i = 0, 1, \ldots ,n\} \) are regression coefficients.
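As an illustration of fitting Eq. (4) in the simplest case \( n = 1 \), the ordinary least-squares estimates of \( w_0 \) and \( w_1 \) can be computed in closed form. The function name and toy series are invented for this sketch and are not from the paper.

```python
def fit_ar1(series):
    """Least-squares fit of Eq. (4) with n = 1: theta_hat(t) = w0 + w1*theta(t-1)."""
    x = series[:-1]   # predictors: theta(t-1)
    y = series[1:]    # targets:    theta(t)
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    w1 = sxy / sxx            # slope
    w0 = my - w1 * mx         # intercept
    return w0, w1

# A series generated by theta(t) = 2*theta(t-1) + 1 is recovered exactly.
series = [1, 3, 7, 15, 31]
w0, w1 = fit_ar1(series)
print(round(w0, 6), round(w1, 6))  # 1.0 2.0
```

For \( n = 2 \) the same idea extends to a two-predictor least-squares problem, which is what the three-term regression models in Sect. 4 amount to.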

A multi-layer perceptron can be thought of as a regression model on a set of derived inputs obtained via layered, successive non-linear transformations. In this paper, an MLP with a single hidden layer and a linear transformation for the output nodes is used, which can be expressed as

$$ h_{j} \left( t \right) = \frac{1}{{1 + e^{{ - (w_{0j} + \mathop \sum \nolimits_{i = 1}^{n} w_{ij} \theta \left( {t - i} \right))}} }},j = 1,2 \cdots ,m $$
(6)
$$ \mathop \theta \limits^{ \wedge }_{l} \left( t \right) = w_{0l} + \sum\nolimits_{j = 1}^{m} {w_{jl} h_{j} \left( t \right),l = 1,2 \ldots ,k} $$
(7)

where \( w_{ij} \) and \( w_{jl} \) are the connection weights between the \( i^{th} \) input node and the \( j^{th} \) hidden node, and between the \( j^{th} \) hidden node and the \( l^{th} \) output node, respectively; \( w_{0j} \) and \( w_{0l} \) denote the biases of the \( j^{th} \) hidden node and the \( l^{th} \) output node, respectively; and \( h_{j} \left( t \right) \) and \( \widehat{\theta }_{l} \left( t \right) \) denote the outputs of the \( j^{th} \) hidden node and the \( l^{th} \) output node, respectively. For the closed-loop model, the inputs \( \left\{ {\theta \left( {t - i} \right)} \right\} \) are substituted by \( \left\{ {\mathop \theta \limits^{ \wedge } \left( {t - i} \right)} \right\} \).
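A direct transcription of the forward pass in Eqs. (6) and (7), sigmoid hidden units followed by linear output units, might look as follows. The weights and toy dimensions are illustrative and are not those used in the experiments.

```python
import math

def mlp_forward(inputs, W_hid, b_hid, W_out, b_out):
    """Forward pass of Eqs. (6)-(7): sigmoid hidden layer, linear outputs."""
    # Eq. (6): h_j = sigmoid(b_j + sum_i w_ij * x_i)
    hidden = [1.0 / (1.0 + math.exp(-(b + sum(w * x for w, x in zip(ws, inputs)))))
              for ws, b in zip(W_hid, b_hid)]
    # Eq. (7): y_l = b_l + sum_j w_jl * h_j  (linear output nodes)
    return [b + sum(w * h for w, h in zip(ws, hidden))
            for ws, b in zip(W_out, b_out)]

# Toy network: 2 inputs, 2 hidden nodes, 1 output. Zero hidden weights give
# sigmoid(0) = 0.5 at every hidden node, so the output is b_out + sum(w * 0.5).
y = mlp_forward([0.3, 0.7],
                W_hid=[[0.0, 0.0], [0.0, 0.0]], b_hid=[0.0, 0.0],
                W_out=[[1.0, 1.0]], b_out=[0.5])
print(y)  # [1.5]
```

In the experiments of Sect. 4 the same structure is used with three inputs, ten hidden nodes, and three outputs; training the weights (e.g. by backpropagation) is omitted from this sketch.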

3.5 Naïve Bayesian Model

A Naïve Bayesian model is based on Bayes’ theorem as shown below

$$ p\left( {A |B} \right) = \frac{{p\left( {A,B} \right)}}{p\left( B \right)} = \frac{{p\left( {B |A} \right)p\left( A \right)}}{p\left( B \right)} $$
(8)

where \( p\left( \cdot \right) \) and \( p\left( { \cdot |\cdot } \right) \) represent a probability and a conditional probability, respectively. Applying the Naïve Bayesian model, Eqs. (2) and (3) can be rewritten and simplified as

$$ \begin{aligned} p\left( {\mathop \theta \limits^{ \wedge } \left( t \right) |\theta \left( {t - 1} \right), \ldots ,\theta \left( {t - n} \right)} \right) = \frac{{p\left( {\mathop \theta \limits^{ \wedge } \left( t \right),\theta \left( {t - 1} \right), \ldots ,\theta \left( {t - n} \right)} \right)}}{{p\left( {\theta \left( {t - 1} \right), \ldots ,\theta \left( {t - n} \right)} \right)}} \hfill \\ \quad \propto p\left( {\mathop \theta \limits^{ \wedge } \left( t \right),\theta \left( {t - 1} \right), \ldots ,\theta \left( {t - n} \right)} \right) \approx p\left( {\mathop \theta \limits^{ \wedge } \left( t \right)} \right)\prod\nolimits_{i = 1}^{n} {p\left( {\theta \left( {t - i} \right)} \right)} \hfill \\ \end{aligned} $$
(9)
$$ \begin{aligned} p\left( {\mathop \theta \limits^{ \wedge } \left( t \right) |\mathop \theta \limits^{ \wedge } \left( {t - 1} \right), \ldots ,\mathop \theta \limits^{ \wedge } \left( {t - n} \right)} \right) = \frac{{p\left( {\mathop \theta \limits^{ \wedge } \left( t \right),\mathop \theta \limits^{ \wedge } \left( {t - 1} \right), \ldots ,\mathop \theta \limits^{ \wedge } \left( {t - n} \right)} \right) }}{{p\left( {\mathop \theta \limits^{ \wedge } \left( {t - 1} \right), \ldots ,\mathop \theta \limits^{ \wedge } \left( {t - n} \right)} \right)}} \hfill \\ \quad \propto p\left( {\mathop \theta \limits^{ \wedge } \left( t \right),\mathop \theta \limits^{ \wedge } \left( {t - 1} \right), \ldots ,\mathop \theta \limits^{ \wedge } \left( {t - n} \right)} \right) \approx p\left( {\mathop \theta \limits^{ \wedge } \left( t \right)} \right)\prod\nolimits_{i = 1}^{n} {p\left( {\mathop \theta \limits^{ \wedge } \left( {t - i} \right)} \right)} \hfill \\ \end{aligned} $$
(10)
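Eq. (9) amounts to scoring each candidate class by its prior multiplied by the product of per-feature likelihoods, under the naive independence assumption. A minimal categorical Naïve Bayes sketch along those lines follows; the toy data and helper name are hypothetical.

```python
from collections import Counter

def nb_fit(X, y):
    """Categorical Naive Bayes: class priors times per-feature likelihoods."""
    classes = Counter(y)            # class -> count (priors)
    n = len(y)
    like = {}                       # (class, feature_index, value) -> count
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            like[(c, i, v)] = like.get((c, i, v), 0) + 1

    def predict(xs):
        def score(c):
            p = classes[c] / n      # prior p(class)
            for i, v in enumerate(xs):
                # Features treated as independent, as in Eq. (9).
                p *= like.get((c, i, v), 0) / classes[c]
            return p
        return max(classes, key=score)

    return predict

# Toy data: previous-month group (a single feature) predicts the current group.
X = [("high",), ("high",), ("low",), ("low",), ("high",)]
y = ["high", "high", "low", "low", "low"]
predict = nb_fit(X, y)
print(predict(("high",)))  # prints: high
```

A production version would add Laplace smoothing so that unseen feature values do not zero out an entire class score.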

4 Case Study

4.1 Data Set and Data Pre-processing

A UK-based online retailer is considered in this study [3, 9]. A data set was collected from the retailer which contains all the transactions occurring from December 2010 to November 2011. The data set has 11 variables, as described in Table 2. Note that the data set can be found at: https://archive.ics.uci.edu/ml/datasets/online+retail.

Table 2. Variables in the dataset.

It is worth mentioning that, over the years, the business has functioned as both a wholesaler and a retailer, and has maintained a stable and healthy customer base.

Appropriate pre-processing was carried out to address quality issues in the data set: outliers and extreme values were removed. The resulting target data set contains 751 valid customers from the UK only.

4.2 Settings for Modelling

To start the analysis, a time series of RFM values for each customer was first calculated at the end of each calendar month successively from December 2010 to November 2011, and therefore each RFM time series consists of 12 data points.

Further, at each time point of the monthly RFM time series, the customers were grouped using k-means clustering into three profitability groups, as shown in Fig. 1, where Recency is in months and Monetary is in pounds Sterling, and the symbols ‘*’, ‘+’, and ‘o’ indicate the high, medium, and low profitability groups, respectively. The sub-graphs in Fig. 1 are arranged sequentially by month in ascending order. As such, each customer belongs to a certain profitability group at each time point of the time series. Before conducting the clustering, the RFM values were normalised using range normalisation.
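Range normalisation rescales each variable into [0, 1] via (x − min)/(max − min), so that Recency, Frequency, and Monetary contribute comparably to the Euclidean distances used by k-means. A one-line sketch (the function name is illustrative) is:

```python
def range_normalise(values):
    """Scale each value into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    # Guard against a constant column, where max == min.
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

print(range_normalise([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

In practice this would be applied per variable (per RFM column) at each monthly time point before clustering.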

Fig. 1. Customers segmented into three profitability groups: high (*), medium (+), or low (o). Calculations were made at the end of each calendar month from Dec. 2010 to Nov. 2011.

Next, the three types of predictive models discussed in the previous section were applied to predict each customer’s profitability group using open-loop and closed-loop models. The three profitability groups were encoded as three orthogonal unit vectors \( \left[ {1,0,0} \right] \), \( \left[ {0,1,0} \right] \) and \( \left[ {0,0,1} \right] \), and these vectors were used as the desired outputs of all the models during training to represent three mutually exclusive classes. Both the open-loop and closed-loop linear regression models had two or three terms (a bias plus one or two lagged inputs). The topology of the MLP models was set to three input nodes, ten hidden nodes, and three output nodes. The initial connection weights and biases were generated randomly.

All the models were trained and tested 10 times; each time, 70% of the samples in the data set were randomly selected for training and the remaining 30% for testing. The data from December 2010 and January 2011 were used as the initial inputs for the closed-loop models. Note that, regardless of which predictive model is used, the training procedures for the open-loop and closed-loop models are the same; however, when applying a trained closed-loop model, the first n observations are used as the initial inputs, and the predicted values are then fed back sequentially to the model as inputs to generate further predictions in an autonomous manner.
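The repeated random 70/30 evaluation protocol can be sketched generically as below. Here `fit` and `evaluate` are placeholders standing in for any of the trained models and their accuracy computation, not the paper's actual code.

```python
import random

def repeated_holdout(samples, fit, evaluate, runs=10, train_frac=0.7, seed=0):
    """Average test score over repeated random train/test splits."""
    rng = random.Random(seed)
    accs = []
    for _ in range(runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)                 # fresh random split each run
        cut = int(train_frac * len(shuffled))
        model = fit(shuffled[:cut])           # train on 70%
        accs.append(evaluate(model, shuffled[cut:]))  # test on 30%
    return sum(accs) / runs

# Sanity check with a dummy "model" and an evaluator that always scores 1.0.
acc = repeated_holdout(list(range(10)), fit=lambda tr: None,
                       evaluate=lambda m, te: 1.0)
print(acc)  # 1.0
```

Averaging over repeated random splits reduces the variance that a single 70/30 split would introduce into the reported accuracies.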

4.3 Experimental Results

With the given settings, the relevant experiments were conducted accordingly to examine how well a customer’s membership in terms of profitability groups can be predicted over time. The average prediction accuracies generated by different models are given in Tables 3 and 4.

Table 3. Average prediction accuracy using observations at one previous time point.
Table 4. Average prediction accuracy using observations at two previous time points.

5 Discussion

From the experimental results obtained, it is evident that the RFM time series under consideration was well predictable and that a customer’s profitability group was stable.

Under all the experimental conditions, the prediction models using observations at one previous time point performed well and had performance similar to those using observations at two previous time points. This indicates that the probability of a customer transitioning from one profitability group to another between any two consecutive time points was low.

An examination of the transition probabilities of the customers from one profitability group to another over time has revealed that, on average, the transition probability was not more than 6%. A summary of the average transition probabilities is given in Table 5, where the element \( TP_{ij} \), \( i, j = 1,2,3 \), in the \( 3 \times 3 \) matrix indicates the average transition probability from the \( i^{th} \) group to the \( j^{th} \) group if \( i \ne j \), and the average percentage of customers who remained in the \( i^{th} \) group if \( i = j \).

Table 5. Average customer transition probability over time.
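A matrix like that of Table 5 can be estimated from per-customer sequences of monthly group labels by counting consecutive-month moves and normalising each row. The sketch below uses invented toy label paths, not the paper's data.

```python
def transition_matrix(label_paths, groups=("high", "medium", "low")):
    """Estimate P(group j at t | group i at t-1) from per-customer label paths."""
    idx = {g: i for i, g in enumerate(groups)}
    counts = [[0] * len(groups) for _ in groups]
    for path in label_paths:
        # Count each consecutive-month pair (group at t-1, group at t).
        for a, b in zip(path, path[1:]):
            counts[idx[a]][idx[b]] += 1
    # Normalise each row to probabilities; empty rows stay at zero.
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]

# Two toy customers observed over three months.
paths = [["high", "high", "medium"], ["low", "low", "low"]]
tp = transition_matrix(paths)
print(tp[0])  # [0.5, 0.5, 0.0]
print(tp[2])  # [0.0, 0.0, 1.0]
```

Large diagonal entries in such a matrix correspond directly to the group stability reported above.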

Since the business has also been operating as a wholesaler, the prediction results are quite interpretable and understandable: the profitability of a customer in month \( t \) depended mainly on the profitability of the customer in month \( t - 1 \), so it is not necessary to use more past time points in the prediction.

In addition, the MLP and the Naïve Bayesian models were slightly more stable than the regression models.

The open-loop prediction models achieved 84% accuracy and are useful for short-term prediction. The closed-loop prediction models achieved an accuracy of 79% and could be applied for long-term prediction.

6 Conclusions and Future Work

In this study, a comparative analysis has been conducted on dynamically predicting customer profitability based on monthly RFM time series using multiple models. The study shows good predictability of the time series under consideration, and the context of the business of interest has helped to interpret the prediction results.

Further work includes:

  • Using real transactional data collected over a longer period of time, such as two or three years, to examine the predictability of the RFM time series;

  • Investigating how prediction accuracy might be affected by the frequency at which the RFM values are calculated from a given set of transactional data; and

  • Using other possible profitability measures to conduct comparative research.