1 Introduction

Traditional learning to rank (LTR) requires labelled data to permit the learning of a ranker: that is, a training dataset with relevance assessments for every query-document pair is required. The acquisition of such labelled datasets presents a number of drawbacks: they are expensive to construct [5, 25], there may be ethical issues in privacy-sensitive tasks like email search [37], and they cannot capture changes in users' preferences [19].

The reliance on users' implicit feedback such as clicks is an attractive alternative to the construction of editorially labelled datasets, as this data does not present the aforementioned limitations [15]. However, this does not come without its own drawbacks and challenges. User implicit feedback cannot be directly treated as (pure) relevance labels because it presents a number of biases, and part of this implicit user signal may actually be noise. For example, in web search, users often examine the search engine result page (SERP) from top to bottom. Thus, higher ranked documents have a higher probability of being examined, attracting more clicks (position bias), which in turn may lead to these results being inferred as relevant even when they are not [7, 18, 24]. Other types of biases may affect this implicit feedback, including selection and presentation bias [2, 16, 40]. In addition, clicks on SERP items may be due to noise, e.g., sometimes users may click for unexpected reasons (e.g., clickbait and serendipity), and these noisy clicks may hurt the learnt ranker. Hence, in order to leverage the benefits of implicit feedback, LTR algorithms have to be robust to these biases and to noise. There are two main categories of approaches to learning a ranker from implicit feedback [14]:

(1) Offline LTR: Methods in this category learn a ranker using historical click-through log data collected from a production system (the logging ranker). A representative method in this category is Counterfactual Learning to Rank (CLTR) [18], where the user's observation probability (known as propensity) is used to construct an unbiased estimator, which serves as the objective function to train the ranker.

(2) Online LTR (OLTR): Methods in this category interactively optimize a ranker given the current user's interactions. A representative method in this category is Dueling Bandit Gradient Descent (DBGD) [39], where multiple rankers are used to produce an interleaved (Footnote 1) results list to display to the user and collect clicks. This signal is used to unbiasedly indicate which of the rankers that participated in the interleaving are better (online evaluation) and to trigger an update of the ranker in production.

The aims of counterfactual and online evaluation are similar: both attempt to unbiasedly evaluate the effectiveness of a ranker and can thus provide LTR algorithms with reliable updating information.

In this paper, we introduce Counterfactual Online Learning to Rank (COLTR), the first online LTR algorithm that combines the key aspects of both CLTR and OLTR approaches to obtain an effective ranker that can learn online from user feedback. COLTR uses the DBGD framework from OLTR to interactively update the ranker used in production, but it uses the counterfactual evaluation mechanism of CLTR in place of online evaluation. The main challenge we address is that counterfactual evaluation cannot be directly used in online learning settings because the propensity model is unknown. This is resolved by mirroring solutions developed for learning from bandit feedback (specifically the Self-Normalized Estimator [34]) within the considered ranking task, which provides a position-unbiased evaluation of rankers. Our empirical results show that COLTR significantly improves over the traditional DBGD baseline algorithm. In addition, because COLTR does not require interleaving or multileaving, which is the most computationally expensive part of online evaluation [28], COLTR is more efficient than DBGD. We also find that COLTR performance is on par with the current state-of-the-art OLTR method [22] under noisy click settings, while presenting a number of avenues for further improvement.

2 Related Work

The goal of counterfactual learning to rank (CLTR) is to learn a ranker from historical user interaction logs obtained with the ranker used in production. An advantage of this approach is that candidate rankers are trained and evaluated offline, i.e., before being deployed in production, thus avoiding exposing users to rankers of lesser quality than the one currently in production. However, unlike traditional supervised LTR methods [20], user interaction data provides only partial feedback, which cannot be directly treated as absolute relevance labels [14, 16]. This is because clicks may not have been observed on some results because of position or selection bias, and clicks may instead have been observed because of noise or errors. As a result, much of the prior work has focused on removing these biases and noise.

Due to position bias, users are more likely to click on top-ranked search results than those at the bottom of the SERP [2, 16, 18]: in CLTR this probability is referred to as propensity. Joachims et al. [18] developed an LTR method, unbiased with respect to position, that relies on clicks, using a modified SVMRank approach that optimizes the empirical risk computed with the Inverse Propensity Scoring (IPS) estimator. IPS is an unbiased estimator that indicates the effectiveness of a ranker given propensities (the probabilities that the user will examine each document) and click data [18]. However, this approach requires a propensity model to compute the IPS score. To estimate this model, randomization experiments are usually required when collecting the interaction data, and the propensity model is estimated in an offline setting [37, 38].
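As an illustration, the following minimal Python sketch is in the spirit of the IPS estimator of [18]: each clicked document contributes its rank, up-weighted by the inverse of its examination propensity. The function name and inputs are ours, not from the original work, and the propensity values are assumed to be given.

```python
import numpy as np

def ips_risk(ranks, clicked, propensities):
    """Sketch of an IPS-style estimate: clicked documents contribute their rank,
    weighted by the inverse of their examination propensity to correct, in
    expectation, for position bias. All arguments are aligned 1-d arrays."""
    ranks = np.asarray(ranks, dtype=float)
    clicked = np.asarray(clicked, dtype=bool)
    propensities = np.asarray(propensities, dtype=float)
    # Only clicked documents are observed as (biased) relevance signals.
    return np.sum(ranks[clicked] / propensities[clicked])

# Example: a click at rank 3, where examination is unlikely, is strongly up-weighted.
print(ips_risk(ranks=[1, 2, 3], clicked=[0, 0, 1], propensities=[1.0, 0.5, 0.25]))
```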

Aside from position bias, selection bias is also important, and it dominates other ranking tasks such as recommendation and ad placement. Selection bias refers to the fact that users can only interact with items presented to them. Typically, in ad placement systems, the assumption is made that users examine the displayed ad with certainty if only one item is shown: thus there is no position bias. However, users can only click on the displayed item, so clicks are heavily biased due to selection. User interactions with this kind of system are referred to as bandit feedback [17, 33, 34]. The Counterfactual Risk Minimization (CRM) learning principle [33] is used to remove the bias from bandit feedback. Instead of a deterministic ranker, this group of methods assumes that the system maintains a probability distribution over the candidate items, which is used to sample the items shown to users. Importance sampling [3] is commonly used to remove selection bias.
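As a minimal sketch of the underlying idea (not tied to any specific system), importance sampling reweights rewards logged under one policy by the ratio of the target and logging probabilities of the actions actually taken; the rewards and probabilities below are illustrative values.

```python
import numpy as np

def importance_sampled_value(rewards, p_logging, p_target):
    """Estimate the expected reward of a target policy from data logged by a
    different policy, by reweighting each sample with the probability ratio
    of the two policies for the action that was actually shown."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(p_target, dtype=float) / np.asarray(p_logging, dtype=float)
    return np.mean(rewards * weights)

# Items shown often by the logging policy are down-weighted for the target policy.
print(importance_sampled_value(rewards=[1, 0, 1],
                               p_logging=[0.5, 0.3, 0.2],
                               p_target=[0.2, 0.3, 0.5]))
```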

Online Learning to Rank aims to optimize the production ranker interactively by exploiting user clicks [10, 22, 23, 29]. Unlike CLTR, OLTR algorithms do not require a propensity model to handle position or selection bias. Instead, they assume that relevant documents are more likely to receive clicks than non-relevant documents and exploit clicks to identify the gradient's direction.

Dueling Bandit Gradient Descent (DBGD) based algorithms [39] are commonly used in OLTR. The traditional DBGD uses online evaluation to unbiasedly compare two or more rankers given a user interaction [12, 29]. Subsequent methods developed more reliable or more efficient online evaluation methods, including Probabilistic Interleaving (PIGD), which has been proven to be unbiased [12]. The Probabilistic Multileaving extension (PMGD) [28] compares multiple rankers at each interaction, resulting in the best DBGD-based algorithm, which reaches better convergence with fewer training impressions [23]. However, this method suffers from a high computational cost because it requires sampling ranking assignments to infer outcomes. Further variations that reuse historical interaction data to accelerate learning in DBGD have also been investigated [10].

The current state-of-the-art OLTR algorithm is Pairwise Differentiable Gradient Descent (PDGD) [22], which does not require sampling candidate rankers to create interleaved results lists for online evaluation. Instead, PDGD creates a probability distribution over the document set and constructs the result list by sampling documents from this distribution. Gradients are then estimated from pairwise document preferences inferred from user clicks. This algorithm provides much better performance than traditional DBGD-based methods in terms of final convergence and online user experience.

3 Counterfactual Online Learning to Rank

3.1 Counterfactual Evaluation for Online Learning to Rank

The proposed COLTR method uses counterfactual evaluation to estimate the effectiveness of candidate rankers based on the click data collected by the logging ranker. This is unlike DBGD and other OLTR methods that use interleaving. In the counterfactual learning to rank setting, the IPS estimator is used to eliminate position bias [18], providing an unbiased estimation. However, the IPS estimator requires that the propensities of result documents are known. The propensity of a document is the probability that the user will examine the document. In offline LTR settings, propensities are estimated using offline click-through data, via a randomization experiment [38]. Offline click-through data is not available in the online setting we consider, and thus the use of IPS in such an online setting becomes a challenge. To overcome this, we adapt the counterfactual estimator used in batch learning from logged bandit feedback [32, 34]. This type of counterfactual learning treats rankers as policies and samples documents from a probability distribution to create the result list. This allows us to use importance sampling to fix the distribution mismatch between policies and to use Monte Carlo approximation to estimate the risk function \(\mathcal {R}(f_{\theta ^{'}})\):

$$\begin{aligned} \mathcal {R}(f_{\theta ^{'}}) = \frac{1}{k}\sum _{i=1}^k \delta _{i} \frac{p(d_{i}|f_{\theta ^{'}},D)}{p(d_{i}|f_{\theta },D)} \end{aligned}$$
(1)

where k is the number of documents in the result list, \(\theta \) are the feature weights of the logging ranker, \(\theta ^{'}\) are the new ranker's feature weights, which need to be estimated, and \(\delta \) is the reward function. Following the Counterfactual Risk Minimization (CRM) learning principle [33], we set:

$$\begin{aligned} \delta _{i} = {\left\{ \begin{array}{ll} 0, &{} \text{if the user clicked on or did not examine } d_{i}\\ 1, &{} \text{if the user examined but did not click } d_{i} \end{array}\right. } \end{aligned}$$
(2)

In counterfactual learning to rank, the user examination is modelled as propensity. In learning from logged bandit feedback, only the examined documents are considered. In the online setting, however, it is unclear how to determine which documents the user has examined (e.g. a user may have considered a snippet, but did not click on it). We make the assumption that users always examine documents from top to bottom, and thus consider the documents ranked above the one that was clicked last as having been examined. With this in place, the reward function described in Eq. 2 can be used to assign rewards to documents in the result list.
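A minimal sketch of this labelling rule, under the stated top-to-bottom examination assumption, is given below; the function and variable names are ours and are not part of the paper's algorithms.

```python
def reward_labels(clicks):
    """Assign delta_i per Eq. 2: documents ranked at or above the last click are
    treated as examined; examined-but-not-clicked documents get reward 1,
    clicked (or unexamined) documents get reward 0.
    `clicks` is a list of 0/1 click indicators in rank order."""
    if 1 not in clicks:
        return [0] * len(clicks)          # no click: no document is known to be examined
    last_click = max(i for i, c in enumerate(clicks) if c == 1)
    return [1 if i <= last_click and c == 0 else 0 for i, c in enumerate(clicks)]

# Example: clicks at ranks 1 and 3 -> only rank 2 is examined but not clicked.
print(reward_labels([1, 0, 1, 0, 0]))  # [0, 1, 0, 0, 0]
```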

Unlike traditional DBGD-based OLTR which ranks documents according to the scores assigned by the ranking function (i.e., deterministically), COLTR creates the result list to be provided to the user for gathering feedback by sampling documents from a known probability distribution. That is, document \(d_{i}\) is drawn from a distribution \(p(d_{i}|f_{\theta },D)\) computed by the logging ranker \(\theta \). We use softmax to convert document scores into a probability distribution:

$$\begin{aligned} p(d_{i}|f_{\theta },D) = \frac{e^{\frac{f_{\theta }(d_{i})}{\tau }}}{\sum _{d\in D} e^{\frac{f_{\theta }(d)}{\tau }}} \end{aligned}$$
(3)

where \(\tau \) is the temperature parameter, which is commonly used in reinforcement learning to control the sharpness of the probability distribution [31]. For high values of \(\tau \) (\(\tau \rightarrow \infty \)), the distribution becomes uniform. For low values (\(\tau \rightarrow 0\)), the probability of the document with the highest score tends to 1. After a document has been picked, the probability distribution is renormalized over the remaining documents to avoid sampling duplicates. This kind of probabilistic ranker has been used in previous work [4, 14, 22].
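For illustration, the following Python sketch implements Eq. 3 and the sampling-without-replacement procedure just described; the function names are ours.

```python
import numpy as np

def softmax(scores, tau=0.1):
    """Temperature-controlled softmax of Eq. 3; smaller tau gives a sharper distribution."""
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()                         # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_result_list(scores, k=10, tau=0.1, rng=np.random.default_rng()):
    """Sample k distinct documents: after each pick, the distribution is
    renormalised over the remaining documents to avoid duplicates."""
    remaining = list(range(len(scores)))
    result = []
    for _ in range(min(k, len(remaining))):
        p = softmax([scores[i] for i in remaining], tau)
        choice = rng.choice(len(remaining), p=p)
        result.append(remaining.pop(choice))
    return result

# Example: sample a 3-document result list from 4 candidate scores.
print(sample_result_list([2.0, 1.5, 0.3, -0.2], k=3))
```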

While the risk estimator in Eq. 1 has been proven to be unbiased, it suffers from the propensity overfitting problem [34]: the learning algorithm may learn a ranker that assigns small probability values to all the documents \(d_i\) in the result list, as this can minimize the risk function. To address this problem, we use the self-normalized risk estimator \(\mathcal {R}^{SN}(f_{\theta ^{'}}) \) (similar to [34]):

$$\begin{aligned} \mathcal {R}^{SN}(f_{\theta ^{'}}) =\frac{\mathcal {R}(f_{\theta ^{'}})}{\mathcal {S}(f_{\theta ^{'}})} \end{aligned}$$
(4)

where:

$$\begin{aligned} \mathcal {S}(f_{\theta ^{'}}) =\frac{1}{k} \sum _{i=1}^k \frac{p(d_{i}|f_{\theta ^{'}},D)}{p(d_{i}|f_{\theta },D)} \end{aligned}$$
(5)

Intuitively, if propensity overfitting does occur, \(\mathcal {S}(f_{\theta ^{'}})\) will be small, giving a penalty to \(\mathcal {R}^{SN}(f_{\theta ^{'}})\).

Following the CRM principle, the aim of the learning algorithm is to find the ranker feature weights that minimize the self-normalized risk estimator together with its empirical standard deviation; formally:

$$\begin{aligned} \theta ^{CRM} = argmin \left( \mathcal {R}^{SN}(f_{\theta ^{'}}) + \lambda \sqrt{\dfrac{Var(\mathcal {R}^{SN}(f_{\theta ^{'}}))}{k}} \right) \end{aligned}$$
(6)

\(Var(\mathcal {R}^{SN}(f_{\theta ^{'}}))\) is the empirical variance of \(\mathcal {R}^{SN}(f_{\theta ^{'}})\), which we compute using an approximate variance estimation [27]; \(\lambda \), set to 1, controls the impact of the empirical variance:

$$\begin{aligned} Var(\mathcal {R}^{SN}(f_{\theta ^{'}})) = \frac{\sum _{i=1}^k \big (\delta _{i} - \mathcal {R}^{SN}(f_{\theta ^{'}})\big )^2 \Big (\frac{p(d_{i}|f_{\theta ^{'}},D)}{p(d_{i}|f_{\theta },D)}\Big )^2}{\Big (\sum _{i=1}^k \frac{p(d_{i}|f_{\theta ^{'}},D)}{p(d_{i}|f_{\theta },D)}\Big )^2} \end{aligned}$$
(7)
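For concreteness, a minimal Python sketch of this counterfactual evaluation (Eqs. 1 and 4-7) follows. It assumes the per-document sampling probabilities under the logging and candidate rankers are given; the function name is ours.

```python
import numpy as np

def crm_objective(delta, p_new, p_log, lam=1.0):
    """Self-normalised counterfactual risk plus its variance penalty (Eqs. 1, 4-7).
    delta: rewards from Eq. 2; p_new / p_log: sampling probabilities of each
    displayed document under the candidate and the logging ranker."""
    delta = np.asarray(delta, dtype=float)
    w = np.asarray(p_new, dtype=float) / np.asarray(p_log, dtype=float)
    k = len(delta)
    risk = np.mean(delta * w)                                        # Eq. 1
    s = np.mean(w)                                                   # Eq. 5
    risk_sn = risk / s                                               # Eq. 4
    var = np.sum((delta - risk_sn) ** 2 * w ** 2) / np.sum(w) ** 2   # Eq. 7
    return risk_sn + lam * np.sqrt(var / k)                          # objective of Eq. 6

# Lower values indicate a candidate ranker estimated to be better than the logging ranker.
print(crm_objective(delta=[0, 1, 0], p_new=[0.4, 0.3, 0.1], p_log=[0.3, 0.3, 0.2]))
```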
[Algorithm 1: the COLTR updating process]
[Algorithm 2: counterfactual evaluation of the candidate rankers]

3.2 Learning a Ranker with COLTR

The previous section described the counterfactual evaluation that can be used in an online learning to rank setting. Next, we introduce the COLTR algorithm that can leverage the counterfactual evaluation to update the current ranker weights \(\theta _t\). COLTR uses the DBGD framework to optimize the current production ranker, but it does not rely on interleaving or multileaving comparisons.

Algorithm 1 describes the COLTR updating process: similar to DBGD, it requires the initial ranker weights \(\theta _1\), the learning rate \(\alpha \), which controls the update speed, and the step size \(\eta \), which controls the gradient size. At each timestamp t, i.e., at each round of user interactions (line 2), the search engine receives a query \(q_t\) issued by a user (line 3). Then the candidate document set \(D_t\) is generated given \(q_t\) (line 4), and the result list \(L_t\) is created by sampling documents \(d_i\) without replacement from the probability distribution computed by Eq. 3 (line 5). The result list is then presented to the user and clicks are observed. The reward label vector \(\delta _{t}\) is generated according to Eq. 2 (line 6, Footnote 2). Next, an empty candidate ranker pool C is created (line 7) and candidate rankers are generated and added to the pool (lines 8–12). Counterfactual evaluation is used to compute the risk associated with each ranker, as described in Algorithm 2. The rankers with a risk lower than the logging ranker are said to win and are placed in the set W (line 13). Finally, the current ranker weights are updated by adding the mean of the winners' unit vectors (line 14), modulated by the learning rate \(\alpha \).
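The following compact Python sketch of this update loop is written from the prose above rather than from the original pseudocode. The helpers `sample_result_list`, `reward_labels` and `crm_objective` are the illustrative functions sketched in Sect. 3.1; `doc_probs` is an assumed helper returning the sampling probability of each displayed document under a given ranker (a possible implementation, using the independence approximation discussed below, is sketched after the next paragraph), and `get_clicks` stands for displaying the list and collecting user clicks.

```python
import numpy as np

def coltr_update(theta, doc_features, get_clicks,
                 n_candidates=499, alpha=0.1, eta=1.0, tau=0.1,
                 rng=np.random.default_rng()):
    """One COLTR interaction round (sketch of Algorithm 1)."""
    scores = doc_features @ theta                        # score the candidate documents
    result_list = sample_result_list(scores, tau=tau)    # line 5: sample the SERP via Eq. 3
    clicks = get_clicks(result_list)                     # display the SERP, observe clicks
    delta = reward_labels(clicks)                        # line 6: rewards per Eq. 2
    p_log = doc_probs(theta, doc_features, result_list, tau)
    base = crm_objective(delta, p_log, p_log)            # risk of the logging ranker
    winners = []
    for _ in range(n_candidates):                        # lines 8-12: candidate rankers
        u = rng.normal(size=len(theta))
        u /= np.linalg.norm(u)                           # unit vector on the sphere
        cand = theta + eta * u
        p_new = doc_probs(cand, doc_features, result_list, tau)
        if crm_objective(delta, p_new, p_log) < base:    # counterfactual evaluation (Algorithm 2)
            winners.append(u)
    if winners:                                          # line 14: step towards the mean winning direction
        theta = theta + alpha * np.mean(winners, axis=0)
    return theta
```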

The method COLTR uses for computing gradients is similar to that of DBGD with Multileaving (PMGD) [29]. However, COLTR is more efficient. In fact, it does not need to generate an interleaved or multileaved result list for exploring user preferences. When the length of the result list is large, the computational cost of multileaving becomes considerable. In addition, using online evaluation to infer outcomes is very expensive, especially for probabilistic multileaving evaluation [28]: this type of evaluation requires sampling a large number of ranking assignments to decide which rankers are the winners, a computationally expensive operation. In contrast, the time complexity of counterfactual evaluation increases linearly with the number of candidate rankers (the for loop in Algorithm 2, line 3, Footnote 3). To compute the probabilities of sampling documents for the logging and new rankers (Algorithm 2, lines 6 and 7), the document scores in Eq. 3 need to be renormalized after each document is selected, which incurs additional computational cost. For efficiency reasons, we approximate these probabilities by assuming independence, so that we can compute the probabilities only once (Footnote 4). As a result, COLTR can efficiently compare a large number of candidate rankers at each interaction.
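Under this independence assumption, a possible `doc_probs` helper (our naming, used in the sketch of Algorithm 1 above) computes a single softmax over the whole candidate set, per Eq. 3, without any per-position renormalisation.

```python
import numpy as np

def doc_probs(theta, doc_features, result_list, tau=0.1):
    """Approximate sampling probabilities of the displayed documents under a
    ranker: one softmax over the whole candidate set (Eq. 3), assuming
    independence between ranks instead of renormalising after each draw."""
    scores = doc_features @ theta
    p = softmax(scores, tau)              # softmax() as sketched for Eq. 3
    return p[np.asarray(result_list)]
```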

4 Empirical Evaluation

Datasets. We used four publicly available web search LTR datasets to evaluate COLTR. Each dataset contains query-document pair features and (graded) relevance labels. All feature values are normalised using MinMax at the query level. The datasets are split into training, validation and test sets using the splits provided with each dataset. The smallest datasets in our experiments are MQ2007 (1,700 queries) and MQ2008 (800 queries) [25], which are a subset of LETOR 4.0. They rely on the Gov2 collection and the query set from the TREC Million Query Track [1]. Query-document pairs are represented by 46 features and labelled on a 3-grade relevance scale (from 0, not relevant, to 2, very relevant). In addition to these datasets, we use the larger MSLR-WEB10K [25] and Yahoo! Learning to Rank Challenge datasets [5]. Data for these datasets comes from commercial search engines (Bing and Yahoo, respectively), and relevance labels are assigned on a five-point scale (0 to 4). MSLR-WEB10K contains 10,000 queries, with 125 retrieved documents per query on average, represented by 136 features; Yahoo! is the largest dataset we consider, with 29,921 queries and 709,877 documents, represented using 700 features.

Simulating User Behaviour. Following previous OLTR work [9, 11, 22, 23, 29, 41], we use the cascade click model (CCM) [6, 8] to generate user clicks. This click model assumes users examine documents in the result list from top to bottom and decide to click with a probability \(p(click=1|R)\), where R is the relevance grade of the examined document. After a document is clicked, the user may stop examining the remainder of the list with probability \(p(stop=1|R)\). In line with previous work, we study three different user behaviours and the corresponding click models. The perfect model simulates the user who clicks on every relevant document in the result list and never clicks on non-relevant documents. The navigational model simulates the user looking for a single highly relevant document, who is thus unlikely to continue after finding the first relevant one. The informational model represents the user who searches for topical information and exhibits much noisier click behaviour. We use the settings used by previous work for instantiating these click behaviours, e.g., see Table 1 in [23]. In our experiments, the issuing of queries is simulated by uniformly sampling from the training dataset (line 3 in Algorithm 1). Then a result list is generated in answer to the query and displayed to the user. Finally, the user's clicks on the displayed results are simulated using CCM.
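A minimal sketch of this click simulation is given below; the click and stop probability tables in the example are placeholders for illustration only, not the exact settings of Table 1 in [23].

```python
import numpy as np

def simulate_ccm_clicks(relevance_grades, p_click, p_stop, rng=np.random.default_rng()):
    """Cascade click model sketch: scan the result list top to bottom, click with
    probability p_click[grade]; after a click, stop with probability p_stop[grade]."""
    clicks = []
    for grade in relevance_grades:
        click = rng.random() < p_click[grade]
        clicks.append(int(click))
        if click and rng.random() < p_stop[grade]:
            break
    clicks += [0] * (len(relevance_grades) - len(clicks))   # unexamined tail gets no clicks
    return clicks

# Illustrative 3-grade user who clicks more often on more relevant documents.
print(simulate_ccm_clicks([2, 0, 1, 0],
                          p_click={0: 0.1, 1: 0.5, 2: 0.9},
                          p_stop={0: 0.0, 1: 0.3, 2: 0.5}))
```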

Baselines. Three baselines are considered for comparison with COLTR. The traditional DBGD with probabilistic interleaving (PIGD) [12] is used as a representative OLTR method – note that COLTR also uses DBGD, but with counterfactual evaluation in place of the interleaving method. For PIGD, only one candidate ranker is sampled at each interaction; sampling occurs by randomly varying feature weights on the unit sphere with step size \(\eta = 1\), and the current ranker is updated with learning rate \(\alpha = 0.01\). The Probabilistic Multileaving Gradient Descent method (PMGD) [23] is also used in our experiments, as it is the DBGD-based method that has been reported to achieve the highest performance so far for this class of approaches [21]. For this baseline, we use the same parameter settings reported in previous work [22], where the number of candidates was set to \(n = 49\), the step size to \(\eta = 1\) and the learning rate to \(\alpha = 0.01\). The third baseline we consider is Pairwise Differentiable Gradient Descent (PDGD) [22], the current state-of-the-art OLTR method. We set PDGD's parameters according to Oosterhuis et al. [22], with learning rate \(\alpha = 0.1\) and zero initialization. For COLTR, we use \(\eta = 1\). We use a learning rate decay for \(\alpha \): \(\alpha \) starts at 0.1 and decreases according to \(\alpha = \alpha * 0.99966\) after each update. We set the temperature parameter \(\tau = 0.1\) when sampling documents and test different numbers of candidate rankers from \(n=1\) to \(n=999\). For all experiments, we display only \(k = 10\) documents to the user, and all methods are used to optimize a linear ranker. Note that we do not directly compare against counterfactual LTR approaches like that of Joachims et al. [18] because we consider an online setup (while counterfactual LTR requires a large dataset of previous interactions and the estimation of propensities, which is unfeasible in an online setting).

Fig. 1. Offline performance on the MQ2007 dataset with the informational click model.

Evaluation Measures. The effectiveness of the considered OLTR methods is measured with respect to both offline and online performance. For offline performance, we average the nDCG@10 scores of the production ranker over the queries in the held-out test set. This measure indicates the effectiveness of the learned ranker. The offline performance of each method is measured over 10,000 impressions, and the final offline performance is also recorded. Online performance is computed as the nDCG@10 score produced by the result list displayed to the user during the training phase [13]. This measure indicates the quality of the user experience during training. A discount factor \(\gamma \) is used to ensure that long-term impressions have less impact, i.e. \(\sum _{t=1}^{T} nDCG(L_t) \cdot \gamma ^{t-1}\), where T is the number of impressions. Following previous work [21,22,23], we choose \(\gamma = 0.9995\) so that impressions after the horizon of 10,000 have less than a \(1 \%\) impact. We repeated each experiment 125 times, spread over different training folds. The evaluation results are averaged and statistically significant differences between system pairs are computed using a two-tailed t-test.
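For illustration, the cumulative discounted online performance can be computed as in the following sketch; the function name is ours.

```python
def discounted_online_performance(ndcg_per_impression, gamma=0.9995):
    """Cumulative online performance: nDCG@10 of each displayed list,
    discounted so that later impressions contribute less."""
    return sum(ndcg * gamma ** t for t, ndcg in enumerate(ndcg_per_impression))

# Example with three impressions.
print(discounted_online_performance([0.40, 0.50, 0.55]))
```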

Fig. 2. Offline performance under three different click models.

5 Results Analysis

5.1 Offline Performance: Final Ranker Convergence

We first investigate how the number of candidate rankers impacts offline performance. Figure 1(a) displays the offline nDCG of COLTR and the baselines under the informational click setting when a different number of candidate rankers is used by COLTR (recall that PIGD uses two rankers and PMGD uses 49 rankers). Consider COLTR with one candidate ranker in addition to the production ranker (\(n=1\)) and PIGD: both consider a single alternative ranker to the one in production. From the figure, it is clear that PIGD achieves better offline performance than COLTR. However, when more candidate rankers are considered, e.g., when n is increased to 49, the offline performance of COLTR becomes significantly higher than that of PIGD. Furthermore, COLTR is also better than PMGD when the same number of candidate rankers is considered. Moreover, COLTR can efficiently compare a large number of candidate rankers at each interaction (impression), and thus can be run with a larger set of candidate rankers. We find that increasing the number of candidate rankers helps boost the offline performance of COLTR and achieve higher final convergence. When \(n=499\), COLTR reaches significantly better (\(p<0.01\)) offline performance than PDGD, the current state-of-the-art OLTR method. However, beyond \(n=499\) there are only minor improvements in offline performance, achieved at a higher computational cost; thus, in the remaining experiments, we consider only \(n=499\).

We also consider long-term convergence. Figure 1(b) displays the results for COLTR (with \(n=499\)) and the baselines after 100,000 impressions. Because a learning rate decay is used in COLTR, the learning rate becomes insignificant after 30,000 impressions. To prevent this from happening, we stop the learning rate decay once \(\alpha <0.01\) and keep \(\alpha =0.01\) constant for the remaining impressions. The figure shows that, contrary to the results in Fig. 1(a), PMGD can reach much higher performance than PIGD when enough impressions are considered – this finding is consistent with previously reported observations [22]. Nevertheless, both COLTR and PDGD are still significantly better than PIGD and PMGD, and have similar convergence: their offline performance is less affected by the long-term impressions.

Figure 2 displays the offline performance across datasets of varying size (small: MQ2007, and large: MSLR-WEB10K) under three different click models and for 10,000 impressions. The results show that PDGD and COLTR outperform PIGD and PMGD for all click models. We also find that, overall, COLTR and the current state-of-the-art online LTR approach, PDGD, have very similar learning curves across all click models and datasets, except for the perfect click model on the MSLR-WEB10K dataset, for which COLTR is severely outperformed by PDGD. Note that the trends observed in Fig. 2 are also found for the majority of the remaining datasets. For space reasons, we omit these results from the paper, but we make them available as an online appendix at http://ielab.io/COLTR. Table 1 reports the final convergence performance for all datasets and click models (including statistical significance analysis), displaying similar trends across the considered datasets.

Table 1. Offline nDCG performance obtained under different click models. Significant gains and losses of COLTR over PIGD, PMGD and PDGD are marked by \(^{\vartriangle }\), \(^{\triangledown }\) (\(p<0.05\)) and \(^{\blacktriangle }\), \(^{\blacktriangledown }\) (\(p<0.01\)) respectively.

5.2 Online Performance: User Experience

Along with the performance obtained by rankers once training is over, the user experience during training should also be considered. Table 2 reports the online performance of all methods, for all datasets and click models. The state-of-the-art PDGD has the best online performance across all conditions. COLTR outperforms PIGD and PMGD when considering the perfect click model. For the other click models, COLTR is better than PIGD but provides lower cumulative online performance than PMGD, even though it achieves better offline performance. We posit that this is because PMGD uses a deterministic ranking function to create the result list the user observes, and via multileaving it guarantees that the multileaved result list is not worse than that of the worst candidate ranker. COLTR instead uses a probabilistic ranking function, and if the document sampling distribution is too close to uniform, the result list may incorrectly contain many non-relevant documents, resulting in poor online performance. A near-uniform sampling distribution arises because noisy clicks cause some candidate rankers to randomly win the counterfactual evaluation, slowing down gradient convergence and producing an "elastic effect", where the weight vectors move forward in one interaction and backwards in the next. This causes the margins between the documents' scores assigned by the ranking function to become too small, so the softmax function does not generate a "deterministic" distribution. This also explains why the online performance is much better when clicks are perfect: the gradient directions corresponding to the winning candidates are likely similar, leading the current ranker to move quickly through large gradient updates (no elastic effect).

Table 2. Online cumulative nDCG performance under different click models. Significant gains and losses of COLTR over PIGD, PMGD and PDGD are marked by \(^{\vartriangle }\), \(^{\triangledown }\) (\(p<0.05\)) and \(^{\blacktriangle }\), \(^{\blacktriangledown }\) (\(p<0.01\)) respectively.

6 Conclusion

In this paper, we have presented a novel online learning to rank algorithm that combines the key aspects of counterfactual learning and OLTR. Our method, counterfactual online learning to rank (COLTR), replaces online evaluation, the most computationally expensive step in traditional DBGD-style OLTR methods, with counterfactual evaluation. COLTR does not derive a gradient function to optimise an objective directly; instead, it samples different rankers, akin to online evaluation practice. As a result, COLTR can evaluate a large number of candidate rankers at a much lower computational expense.

Our empirical results, based on publicly available web search LTR datasets, also show that COLTR can significantly outperform DBGD-style OLTR methods across different datasets and click models in terms of offline performance. We also find that COLTR achieves offline performance on par with the state-of-the-art OLTR model, PDGD, across all datasets under noisy click settings. This means COLTR can provide a robust and effective ranker to be deployed into production once trained online. However, because the document sampling distribution employed by COLTR to select among candidate documents can become close to uniform, COLTR has worse online performance than PMGD and PDGD.

Future work will investigate the difference between the gradients provided by PDGD and COLTR, as they both use a probabilistic ranker to create the result list. This analysis could provide further insight into why the online performance of COLTR is limited. Other improvements could also be implemented for COLTR. First, instead of stochastically learning at each interaction, historical user interaction data could be used to perform batch learning, which may provide even more reliable gradients under noisy clicks. Note that this extension is possible, and methodologically simple, for COLTR but not for PDGD. Second, the use of the exploration variance reduction method [35, 36] could be investigated to reduce the gradient exploration space: this may solve the near-uniform sampling distribution problem.