1 Introduction

A typical supervised learning scenario comprises the computation of a function that maps from inputs (samples) to outputs (labels), where it is assumed that there exists an oracle who provides the correct label (also known as the ground truth or gold standard) for each sample in the training set [1]. However, in many real-world applications, the gold standard is not available because the process to acquire it is expensive, unfeasible, or time-consuming, or because the label corresponds to a subjective assessment [2]. Instead of the ground truth, it is possible to access several labels provided by multiple annotators or sources. This information can be acquired using web sources, crowdsourcing platforms, or the opinion of multiple experts. For instance, social networks (e.g., Twitter, Facebook) can be used to obtain information about a specific problem such as product rating or sentiment analysis [3]. Likewise, in problems where the gold standard is not available, we can use platforms like Amazon Mechanical Turk (AMT), LabelMe, or Crowdflower (Footnote 1). These platforms offer a cost-effective and efficient way to obtain labeled data [4]. On the other hand, in computer-aided diagnosis problems, we can obtain subjective assessments provided by different experts [5, 6]. Nevertheless, the information collected from these multiple sources can be subjective, noisy, or even misleading [6]. Trivial solutions to deal with multiple-labeler scenarios include (i) considering the output from one of the labelers as the gold standard, and (ii) assuming the majority voting (or the average, in the case of regression) of the annotations as an estimate of the ground truth. However, these approaches are not suitable because they assume homogeneity in the performance of the annotators [2].

On the other hand, learning from crowds is a particular area of supervised learning that deals with different machine learning paradigms in the presence of multiple annotators, including classification, sequence labeling, and regression. Among the methodologies developed in this area, we can identify two main groups. The first group, named label aggregation, focuses only on estimating the gold standard, which is then used to train a supervised learning scheme. The second group comprises works focused on training supervised learning models directly from the labels of multiple sources. Regarding the classification paradigm, we recognize the approach proposed in [6], which estimates the annotator expertise (in terms of sensitivity and specificity) through a maximum-likelihood-based approach from repeated responses (labels). This model jointly estimates the gold standard and the classifier parameters using a logistic-regression-based framework. Similarly, the authors in [7] propose an extension of the work in [6] that introduces a Gaussian process model as the classification scheme. On the other hand, with respect to real-valued labels (i.e., regression), the authors in [1] propose a Gaussian process model to deal with multiple annotators, where the performance of the labelers is coded by including a per-annotator variance in the likelihood function (GPR-MAH). However, they assume that the labeler performance is homogeneous across the input space, which is a weak assumption, as was demonstrated in [8]. This assumption was relaxed by the work in [9], which codes the performance using a Gaussian process model that estimates the annotators' expertise as a non-linear function of the gold standard and the input space.

In this work, we present a regression approach based on Gaussian processes, where the expertise of the labelers is non-homogeneous across the input space (GPR-MANH). Our approach follows the idea of GPR-MAH in that we use a Gaussian process to model the regression function and assign a per-annotator variance to capture the performance of the labelers. However, unlike GPR-MAH, our methodology relaxes the assumption that the performance of each annotator is homogeneous across the input space by considering that the input space can be represented by a specific number of clusters, where each annotator exhibits a different performance. We empirically show, using simulated annotators, that our methodology can be used to learn regression models from noisy data provided by multiple sources, outperforming state-of-the-art techniques. The remainder of this paper is organized as follows. Section 2 describes the background of our approach. Sections 3 and 4 present the experiments and discuss the results obtained. Finally, Sect. 5 outlines the conclusions and future work.

2 Probabilistic Formulation

The primary goal in a regression scenario is to estimate a function \(f\negthinspace :\negthinspace \mathscr {X}\negthinspace \rightarrow \negthinspace \mathscr {Z}\) using a training set \(\{{\mathbf {x}}_n,z_n\}^N_{n=1},\) where \({\mathbf {x}}_n\!\,\in \!\,\mathscr {X}\subseteq \mathbb {R}^P\) is a \(P-\)dimensional input feature vector corresponding to the \(n-\)th instance with output \(z_n\!\,\in \!\,\mathscr {Z}\subseteq \mathbb {R}\). In a typical regression configuration, each sample \({\mathbf {x}}_n\) is assigned a single output \(z_n\), i.e., the ground truth. However, in many real-world regression problems, instead of the ground truth we have multiple labels provided by R sources with different levels of expertise [1]. Moreover, we assume that each annotator labels \(N_r\le N\) observations. Accordingly, it is possible to build a data set for the annotator \(r\!\,\in \!\,\left\{ 1,2,\dots , R \right\} \), \(\mathscr {D}_r\!=\!\left\{ {\mathbf {X}}_r, {\mathbf {y}}_r\right\} \), where \({\mathbf {X}}_r\in \mathbb {R}^{N_r\times P}\) and \({\mathbf {y}}_r\!\,\in \!\,\mathscr {Y}_r\subseteq \mathbb {R}^{N_r}\) are the input feature matrix and the labels given by the r-th annotator, respectively. Besides, \({\mathbf {X}}_r\) holds row vectors \({\mathbf {x}}_n^r\), and \({\mathbf {y}}_r\) is composed of elements \(y_n^r\), where \(y_n^r\) is the label given by the r-th annotator to sample \({\mathbf {x}}^r_n\). Now, given the data set from multiple annotators \(\mathscr {D} \!=\!\left\{ {\mathbf {X}}=\cup _{r=1}^R {\mathbf {X}}_r, {\mathbf {Y}} = \left\{ {\mathbf {y}}_1, \dots , {\mathbf {y}}_R\right\} \right\} \), our goals are: first, to estimate the unknown gold standard \({\mathbf {z}} \!=\!\left[ z_1, \dots , z_N\right] \) for the instances in the training set; second, to compute the performance of the labelers as a function of the ground truth and the input space; and third, to build a regression model based on Gaussian processes that generalizes well on unseen data.
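For implementation purposes, the multi-annotator data set \(\mathscr {D}\) can conveniently be stored as a single label array plus a boolean mask recording which annotator labeled which sample. The record format, function name, and array layout below are illustrative choices of ours, not part of the original formulation:

```python
import numpy as np

def build_dataset(N, annotations):
    """Assemble the multi-annotator data set from per-annotator records.

    annotations[r] is a dict {sample_index: label} holding the N_r labels
    given by annotator r (a hypothetical input format).

    Returns Y (N, R) with the labels (0 where missing) and a boolean
    mask (N, R) with mask[n, r] = True iff annotator r labeled sample n.
    """
    R = len(annotations)
    Y = np.zeros((N, R))
    mask = np.zeros((N, R), dtype=bool)
    for r, labels in enumerate(annotations):
        for n, y in labels.items():
            Y[n, r] = y
            mask[n, r] = True
    return Y, mask
```

The mask makes it straightforward to evaluate sums over \(r \sim n\) (the annotators who labeled the n-th observation) later on.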

Concerning this, we follow the model for the labels proposed in [1], \(y_n^r = z_n + \mathscr {N}(0, \sigma _r^2),\) where the parameter \(\sigma _r^2\) (related to the performance of the r-th annotator) is considered homogeneous across the input space. However, as we established previously, the principal aim of our work is to model the annotator expertise under the assumption that it is non-homogeneous across the input space. To do so, we assume that the input space \(\mathscr {X}\) can be represented using K clusters based on Euclidean distances in the input space, where each annotator exhibits a particular performance. Accordingly, the proposed model for the labels \(y_n^r\) follows \(y_n^r = z_n + \mathscr {N}(0,(\sigma _k^r)^2),\) where \((\sigma _k^r)^2 \!\,\in \!\,\mathbb {R}^{+}\) is the variance for the r-th labeler in the cluster \(k\in \left\{ 1,2,\dots , K\right\} \). Assuming independence between annotators, and the fact that each annotator labels \({\mathbf {x}}_n\) independently, the likelihood is given as follows

$$\begin{aligned} p\left( {\mathbf {Y}}|{\mathbf {z}}\right)&= \prod _{k}\prod _{n\sim k}\prod _{r\sim n}\mathscr {N}\left( y_n^r|z_n,\left( \sigma _k^r\right) ^2\right) = c\mathscr {N}\left( \hat{{\mathbf {y}}}|{\mathbf {z}},\hat{\varvec{\varSigma }}\right) , \end{aligned}$$
(1)

where \(c\!\,\in \!\,\mathbb {R}\) is independent of \({\mathbf {z}}\), the diagonal matrix \(\hat{\varvec{\varSigma }}\!\,\in \!\,\mathbb {R}^{N \times N}\) has elements \(\hat{\sigma }_{nk}^2\), and the vector \(\hat{{\mathbf {y}}}\!\,\in \!\,\mathbb {R}^N\) has entries \(\hat{y}_{nk}\). Also, \(\hat{\sigma }_{nk}^{2} = (\sum _{r\sim n}1/{\left( \sigma _k^r\right) ^2})^{-1}\) and \(\hat{y}_{nk} = \hat{\sigma }_{nk}^2\sum _{r\sim n}{y_n^r}/{\left( \sigma _k^r\right) ^2}.\) The notation \(r \sim n\) means "take into account only the labelers who annotated the n-th observation", and \(n \sim k\) indicates that sample n belongs to the k-th cluster. Assuming a Gaussian process prior for \({\mathbf {z}}\) given as \(p({\mathbf {z}}) = \mathscr {N}({\mathbf {z}}|\mathbf {0}, {\mathbf {K}})\), with kernel matrix \({\mathbf {K}}\) computed using a particular kernel function \(k:\mathbb {R}^P\times \mathbb {R}^P \rightarrow \mathbb {R}\), the posterior over the latent variable \({\mathbf {z}}\) is computed as \(p({\mathbf {z}}|{\mathbf {Y}}, {\mathbf {X}}) \!=\!\mathscr {N}({\mathbf {z}}|{\mathbf {m}}, {\mathbf {V}})\), where \({\mathbf {m}} = ({\mathbf {K}}^{-1} + \hat{\varvec{\varSigma }}^{-1})^{-1}\hat{\varvec{\varSigma }}^{-1}\hat{{\mathbf {y}}}\) and \({\mathbf {V}} = ({\mathbf {K}}^{-1} + \hat{\varvec{\varSigma }}^{-1})^{-1}\). In turn, it can be shown that the posterior over a new observation \(f({\mathbf {x}}_*)\) follows

$$\begin{aligned} p(f({\mathbf {x}}_*)|{\mathbf {Y}}) = \mathscr {N}(f({\mathbf {x}}_*)|\bar{f}({\mathbf {x}}_*), k({\mathbf {x}}_*, {\mathbf {x}}_*')), \end{aligned}$$
(2)

where \(\bar{f}({\mathbf {x}}_*) \!=\!k({\mathbf {x}}_*, {\mathbf {X}})({\mathbf {K}} + \hat{\varvec{\varSigma }})^{-1}\hat{{\mathbf {y}}}\) and \(k({\mathbf {x}}_*, {\mathbf {x}}_*') \!=\!k({\mathbf {x}}_*, {\mathbf {x}}'_*) \!-\! k({\mathbf {x}}_*, {\mathbf {X}})({\mathbf {K}} + \hat{\varvec{\varSigma }})^{-1}k({\mathbf {X}}, {\mathbf {x}}_*').\) The free parameters of the model (the hyper-parameters of the kernel function and the variances associated with the annotators in each region) are estimated by minimizing the negative log of the evidence, which is given as

$$\begin{aligned} -\log p({\mathbf {Y}})&= {\frac{1}{2}\log |{\mathbf {K}} + \hat{\varvec{\varSigma }}|} + {\frac{1}{2}\hat{{\mathbf {y}}}^{\top }({\mathbf {K}} + \hat{\varvec{\varSigma }})^{-1}\hat{{\mathbf {y}}}} - {\frac{1}{2}\log |\hat{\varvec{\varSigma }}|}\\&+{\frac{1}{2}\sum _{k}\sum _{n\sim k}\sum _{r\sim n}\frac{(y_n^r)^2}{\left( \sigma _k^r\right) ^2}} -\frac{1}{2}\sum _{k}\sum _{n\sim k}\frac{\hat{y}_{nk}^2}{\hat{\sigma }_{nk}^2} -{\sum _k \sum _{n\sim k} \sum _{r\sim n} \log \frac{1}{\sigma _k^r}}+\frac{\zeta }{2}\log 2\pi , \end{aligned}$$

where \(\zeta = \sum _{r=1}^{R}N_r\). To summarize, we propose a regression scheme with multiple annotators based on Gaussian processes, where the performance of the annotators is coded by including a per-annotator variance in the likelihood function. Unlike GPR-MAH, we assume that the input space is represented by K regions, where each annotator exhibits a particular performance, represented by a variance \((\sigma _k^r)^2\).
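The inference described above reduces to two steps: fusing the annotations into \(\hat{{\mathbf {y}}}\) and the diagonal of \(\hat{\varvec{\varSigma }}\), and then standard Gaussian process prediction with heteroscedastic noise. The following NumPy sketch assumes a squared exponential kernel with fixed hyper-parameters, labels in an (N, R) array with a boolean availability mask, and known per-cluster variances; all names and defaults are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def rbf(A, B, ell=0.1, sf2=1.0):
    """Squared exponential kernel k(a, b) = sf2 * exp(-||a - b||^2 / (2 ell^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def fuse(Y, mask, sigma2, cluster):
    """Collapse annotations into the fused labels and variances:
    sigma_hat_nk^2 = (sum_{r~n} 1/(sigma_k^r)^2)^(-1),
    y_hat_nk = sigma_hat_nk^2 * sum_{r~n} y_n^r / (sigma_k^r)^2.

    Y, mask : (N, R) labels and availability mask; sigma2 : (K, R)
    per-cluster, per-annotator variances; cluster : (N,) cluster index.
    """
    N = Y.shape[0]
    y_hat, s2_hat = np.empty(N), np.empty(N)
    for n in range(N):
        r = mask[n]                          # r ~ n: annotators of sample n
        prec = 1.0 / sigma2[cluster[n], r]   # precisions 1/(sigma_k^r)^2
        s2_hat[n] = 1.0 / prec.sum()
        y_hat[n] = s2_hat[n] * (prec * Y[n, r]).sum()
    return y_hat, s2_hat

def gp_predict(X, y_hat, s2_hat, Xs, ell=0.1, sf2=1.0):
    """Predictive mean f_bar(x*) = k(x*, X)(K + Sigma_hat)^(-1) y_hat
    and the corresponding predictive variances at test inputs Xs."""
    C = rbf(X, X, ell, sf2) + np.diag(s2_hat)     # K + Sigma_hat
    Ks = rbf(Xs, X, ell, sf2)
    mean = Ks @ np.linalg.solve(C, y_hat)
    var = np.diag(rbf(Xs, Xs, ell, sf2) - Ks @ np.linalg.solve(C, Ks.T))
    return mean, var
```

Note that \(\hat{y}_{nk}\) is simply the precision-weighted average of the available annotations, so reliable annotators (small \((\sigma _k^r)^2\)) in a given cluster dominate the fused label.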

3 Experimental Set-Up

Testing Datasets. To test our GPR-MANH, we use three regression datasets from the well-known UCI repository (Footnote 2): Auto MPG (Auto), Concrete Compressive Strength (Concrete), and Boston Housing Data (Housing). These datasets were chosen based on state-of-the-art works [1, 9].

Simulated Annotations. The datasets from the UCI repository are mainly focused on supervised learning without multiple sources. Thus, we establish two methods for simulating multiple annotators: (i) Homogeneous Gaussian noise [1], which samples a random number \(\varepsilon _n^r\!\,\in \!\,\mathbb {R}\) from a Gaussian distribution with zero mean and variance \(\tau _r^2 \!\,\in \!\,\mathbb {R}^+\); the annotations are then simulated as \(y_n^r = z_n + \varepsilon _n^r\). Accordingly, \(\tau _r^2\) codes the performance of the annotators: the higher its value, the lower the expertise level of the r-th labeler. (ii) Non-homogeneous Gaussian noise [8]. This simulation approach comprises the following steps. First, we split the data into L clusters using the k-means algorithm. Next, the annotations given by the r-th annotator for samples in the l-th cluster follow \(y_{nl}^r = z_n + \mathscr {N}\left( 0, \lambda _{lr}^2\right) \), where \(\lambda _{lr}^2\!\,\in \!\,\mathbb {R}^+\) codes the labeler expertise in the region l. Hence, we simulate labelers whose expertise varies depending on the input space.
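Both simulation schemes can be sketched as follows, assuming the ground truth z is available and, for the non-homogeneous case, that per-sample cluster assignments (e.g., from k-means) are given; the function names, array shapes, and use of NumPy's `Generator` are illustrative choices:

```python
import numpy as np

def simulate_homogeneous(z, tau2, rng):
    """Homogeneous Gaussian noise: one variance tau2[r] per annotator.

    z : (N,) ground truth; tau2 : (R,) per-annotator variances.
    Returns the simulated annotations Y of shape (N, R)."""
    N, R = len(z), len(tau2)
    return z[:, None] + rng.standard_normal((N, R)) * np.sqrt(tau2)[None, :]

def simulate_non_homogeneous(z, cluster, lam2, rng):
    """Non-homogeneous Gaussian noise: the variance lam2[l, r] depends on
    the cluster l of each sample (cluster assignments are assumed given).

    cluster : (N,) cluster index per sample; lam2 : (L, R) variances."""
    N, R = len(z), lam2.shape[1]
    std = np.sqrt(lam2[cluster])           # (N, R) per-sample std devs
    return z[:, None] + rng.standard_normal((N, R)) * std
```

In both cases, a zero variance produces an annotator who reproduces the gold standard exactly, which is useful as a sanity check.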

Validation Approaches and Learning Assessments. To validate the performance of our approach, we take into account the following state-of-the-art models. (i) Gaussian-process-based regression with majority voting (GPR-Av), where a typical regression model is trained using the average of the annotations as the gold standard. The kernel hyperparameters of this Gaussian process are estimated by optimizing the marginal likelihood [11]. (ii) Learning from Multiple Observers with Unknown Expertise (LMO), which uses a Gaussian process to code the expertise of the labelers as a function of the gold standard and the input samples. The parameter estimation is carried out using a maximum a posteriori (MAP) approach [9]. (iii) Learning from Multiple Annotators with Gaussian Processes (GPR-MA), where a per-annotator variance is included in the likelihood function to capture the information from multiple annotators. The hyperparameters of the kernel function and the variances of each annotator are estimated by minimizing the negative log of the evidence.

Furthermore, the validation is carried out by estimating the regression performance in terms of the mean squared error (note that we have access to the gold standard). A cross-validation scheme with 30 repetitions is carried out (70% of the samples for training and 30% for testing).
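The repeated hold-out validation above can be sketched as follows; the function name, the seeding, and the generic `predict_fn` interface (standing in for any of the compared models) are our own assumptions:

```python
import numpy as np

def cross_validate(X, z, predict_fn, reps=30, train_frac=0.7, seed=0):
    """Repeated hold-out validation: `reps` random 70/30 splits, scored by
    mean squared error against the held-out gold standard z.

    predict_fn(X_tr, z_tr, X_te) -> predictions for X_te; it stands in
    for any trained regression model."""
    rng = np.random.default_rng(seed)
    N = len(X)
    n_tr = int(round(train_frac * N))
    errors = []
    for _ in range(reps):
        idx = rng.permutation(N)
        tr, te = idx[:n_tr], idx[n_tr:]
        pred = predict_fn(X[tr], z[tr], X[te])
        errors.append(np.mean((pred - z[te]) ** 2))
    return np.mean(errors), np.std(errors)
```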

Fig. 1. Results for the first experiment. In (a) we show the ground truth and the synthetic annotations, which are generated using the "Non-homogeneous Gaussian noise" simulation method. In (b), (c), (d), (e), and (f) we show the regression results for GPR-GOLD, GPR-Av, LMO, GPR-MA, and GPR-MANH, respectively. Shaded areas represent the variance of the predictions.

4 Results and Discussions

First, we perform a controlled experiment to verify the capability of our approach for dealing with a regression setting in the context of multiple sources. For this first experiment, the training samples \({\mathbf {X}}\) are generated by randomly selecting 60 points in the interval [0, 1], and the ground truth is computed as \(z_n = \sin (2\pi x_n)\sin (6\pi x_n) \). The test instances comprise 600 equally spaced samples from the interval [0, 1]. We simulate three labelers with different levels of expertise using the simulation methods described in Sect. 3. For the "Homogeneous Gaussian noise" we use \(\varvec{\tau } = \left( 0.25,\; 0.5, \; 0.75\right) \). On the other hand, for the "Non-homogeneous Gaussian noise", we split the input space into three regions and use the following parameters:

$$\begin{aligned} \varvec{\varLambda } = \begin{pmatrix} 0 &{} 0.65 &{} 1.0\\ 0.25 &{} 0 &{} 0.75\\ 0.1 &{} 0.75 &{} 0 \end{pmatrix}. \end{aligned}$$

Here, the matrix \(\varvec{\varLambda }\) is formed by elements \(\lambda _{lr}^2\), which indicate the variance for the r-th annotator in the cluster l. For testing our approach, we use a clustering algorithm based on affinity propagation [12] to obtain a proper representation of the input space \(\mathscr {X}\). Similarly, for the Gaussian process model, the kernel is fixed as a squared exponential function [11]. Figure 1 shows a visual comparison between the performance of our GPR-MANH and the methods considered for validation (GPR-Av, LMO, GPR-MA) for the case where the data from multiple annotators are generated using "Non-homogeneous Gaussian noise". Remarkably, our approach can perform regression in scenarios where the gold standard is not available and the expertise of the annotators is not homogeneous across the input space. In fact, the uncertainty of the predictions of our approach is remarkably lower when compared with the validation methodologies, which can be explained by the fact that GPR-MANH achieves a better codification of the annotator expertise.
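The setup of this controlled experiment can be reproduced with a short script; the random seed and the use of three equal-width regions of [0, 1] as the clusters (rather than affinity propagation) are simplifying assumptions of this sketch:

```python
import numpy as np

# Synthetic benchmark from the first experiment: 60 random training inputs
# in [0, 1], ground truth z_n = sin(2*pi*x) * sin(6*pi*x), 600 test points.
rng = np.random.default_rng(42)          # seed is an arbitrary choice
x_train = rng.uniform(0.0, 1.0, size=60)
z_train = np.sin(2 * np.pi * x_train) * np.sin(6 * np.pi * x_train)
x_test = np.linspace(0.0, 1.0, 600)

# Per-cluster, per-annotator variances (the matrix Lambda above):
# rows are the three input-space regions, columns the three annotators.
Lam = np.array([[0.00, 0.65, 1.00],
                [0.25, 0.00, 0.75],
                [0.10, 0.75, 0.00]])

# Assign each training point to one of three equal-width regions of [0, 1]
# (a simplification; the paper obtains regions via clustering).
cluster = np.minimum((x_train * 3).astype(int), 2)

# Simulate the three annotators under non-homogeneous Gaussian noise.
noise = rng.standard_normal((60, 3)) * np.sqrt(Lam[cluster])
Y = z_train[:, None] + noise
```

Note that each annotator is perfect in exactly one region (the zero entries of \(\varvec{\varLambda }\)), so no single annotator is reliable over the whole input space.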

Table 1. UCI repository regression results. Bold: the method with the highest performance, excluding the upper-bound (target) model GPR-GOLD

Now, we carry out regression experiments using three datasets from the UCI repository, where we simulate three annotators with different levels of expertise using the simulation parameters described above. Table 1 reports the mean and the standard deviation of the predicted root mean squared error (RMSE). The method with the highest performance is highlighted in bold, excluding the upper bound (GPR-GOLD), which is a Gaussian process regressor trained with the true labels. As seen, the multiple-annotator regression methods considered in this work (GPR-MA and GPR-MANH) outperform the average baseline (GPR-Av) in most cases, which is not surprising, since this baseline does not consider differences between the expertise of the labelers. Furthermore, we empirically demonstrate that our approach is not affected when the performance of the annotators is not homogeneous across the input space. In fact, our GPR-MANH outperforms all the validation models considered in this work under both methods used for generating the synthetic annotations (homogeneous and non-homogeneous Gaussian noise). This can be explained by the fact that GPR-MANH is based on the assumption that the input space can be represented by a defined number of partitions, where each annotator exhibits a particular performance in each cluster. We highlight that the promising results of our approach are achieved based only on the responses from multiple annotators, without considering any prior information.

5 Conclusion

In this paper, we presented a probabilistic framework based on Gaussian processes, termed GPR-MANH, to deal with regression problems in the presence of multiple annotators. Our approach relaxes the assumption that the performance of each annotator is homogeneous across the input space. GPR-MANH assumes that the input space can be divided into K regions, where each annotator exhibits a particular level of expertise, coded by a variance \((\sigma _k^r)^2\). The annotations are then modeled as a version of the gold standard corrupted by additive, non-homogeneous Gaussian noise with zero mean and variance \((\sigma _k^r)^2\). Furthermore, we tested our approach using datasets from the UCI repository, simulating the annotations from multiple annotators following two different models (see Sect. 3). The results show that the proposed method can be used to solve regression problems in the context of multiple labelers with different levels of expertise. In fact, in most cases, our approach achieves better results when compared to different state-of-the-art techniques [1, 9].

Finally, note that GPR-MANH loosens the assumption that the performance of the annotators depends only on the ground truth labels. As future work, this could be taken a step further by modeling the performance of the annotators as a function of both the gold standard and the input samples through a heteroscedastic Gaussian process approach. Also, our method assumes independence between the opinions of the annotators; although it is reasonable to consider that the labelers make their decisions independently, these opinions are not truly independent, since there may be correlations between the expert views. Accordingly, we expect to relax this assumption by using a probabilistic framework that allows coding the inter-annotator dependencies.