1 Introduction

Quantification [also known as “supervised prevalence estimation” (Barranquero et al. 2013), or “class prior estimation” (du Plessis et al. 2017)] is the task of estimating, given a set \(\sigma \) of unlabelled items and a set of classes \({\mathcal {C}}=\{c_{1}, \ldots , c_{|{\mathcal {C}}|}\}\), the relative frequency (or “prevalence”) \(p(c_{i})\) of each class \(c_{i}\in {\mathcal {C}}\), i.e., the fraction of items in \(\sigma \) that belong to \(c_{i}\). When each item belongs to exactly one class, since \(0\le p(c_{i})\le 1\) and \(\sum _{c_{i}\in {\mathcal {C}}}p(c_{i})=1\), p is a distribution of the items in \(\sigma \) across the classes in \({\mathcal {C}}\) (the true distribution), and quantification thus amounts to estimating p (i.e., to computing a predicted distribution \({\hat{p}}\)).

Quantification is important in many disciplines (such as market research, political science, the social sciences, and epidemiology) that usually deal with aggregate (as opposed to individual) data. In these contexts, classifying individual unlabelled instances is usually not a primary goal, while estimating the prevalence of the classes of interest in the data is. For instance, when classifying the tweets about a certain entity (e.g., a political candidate) as displaying either a Positive or a Negative stance towards the entity, we are usually not much interested in the class of a specific tweet: instead, we usually want to know the fraction of these tweets that belong to each class (Gao and Sebastiani 2016).

Quantification may in principle be solved via classification, i.e., by classifying each item in \(\sigma \) and counting, for all \(c_{i}\in {\mathcal {C}}\), how many such items have been labelled with \(c_{i}\). However, it has been shown in a multitude of works (see e.g., Barranquero et al. 2015; Bella et al. 2010; Esuli and Sebastiani 2015; Forman 2008; Gao and Sebastiani 2016; Hopkins and King 2010) that this “classify and count” (CC) method yields suboptimal quantification accuracy. Simply put, the reason for this suboptimality is that most classifiers are optimized for classification accuracy, and not for quantification accuracy. These two notions do not coincide, since the former is, by and large, inversely proportional to the sum \((FP_{i}+FN_{i})\) of the false positives and the false negatives for \(c_{i}\) in the contingency table, while the latter is, by and large, inversely proportional to the absolute difference \(|FP_{i}-FN_{i}|\) between the two. As a result, quantification has come to be no longer considered a mere byproduct of classification, and has evolved as a task of its own, devoted to designing methods and algorithms that deliver better prevalence estimates than CC (see González et al. 2017 for a survey of methods and results).

While the scientific community working on quantification has devoted a lot of attention to devising new and more accurate quantification methods, it has not devoted much attention to discussing how quantification accuracy should be measured, i.e., what properties an evaluation measure for quantification (EMQ) should enjoy, and which EMQs should be adopted as a result. In experimental computer science, the properties of the evaluation measure one uses are fundamental in order to ensure a correct comparison among systems, i.e., to ensure that this comparison rewards the systems that deliver the most desirable results; these properties formalize what “desirable” actually means. In the quantification literature, new EMQs have sometimes been introduced without arguing why they are supposedly better than existing ones. As a result, there is no consensus (and, what is worse, no debate) in the field as to which EMQ (if any) is the best. Different authors use different EMQs without properly justifying their choice, and the consequence is that different results, even when obtained on the same dataset, are not comparable. Even worse, it may be the case that an improvement over a baseline, sanctioned by an “inappropriate” EMQ and obtained by a newly proposed method, corresponds to no real improvement when measured according to an “appropriate” EMQ.

This paper attempts to shed some light on the issue of which evaluation measure(s) should be used for quantification. In order to do so, we (a) lay down a number of interesting properties that an EMQ may or may not enjoy, (b) discuss whether (or when) each of these properties is desirable, (c) survey the EMQs that have been used so far, and (d) discuss whether they enjoy the above properties or not. As a result of this investigation, some of the EMQs that have been used in the literature turn out to be severely unfit, while others emerge as closer to “what the quantification community actually needs”. However, a significant result is that no existing measure satisfies all the properties identified as desirable, thus indicating that more research is needed in order to identify (or synthesize) a truly adequate EMQ.

This paper follows in the tradition of the so-called “axiomatic” approach to “evaluating evaluation” in information retrieval (see e.g., Amigó et al. 2011; Busin and Mizzaro 2013; Ferrante et al. 2015, 2018; Moffat 2013; Sebastiani 2015), which is based on describing (and often: arguing in favour of) a number of properties (that most of this literature calls—perhaps improperly—“axioms”) that an evaluation measure for the task being considered should intuitively satisfy. The benefit of this approach is that it shifts the discussion from the evaluation measures to their properties, which amounts to shifting the discussion from a complex construction to its building blocks: once the scientific community has agreed on a set of properties (the building blocks), whether a given measure (the construction) is satisfactory or not follows as a consequence.

The paper is structured as follows. In Sect. 2 we set the stage and define the scope of our investigation. In Sect. 3 we formally discuss properties that may or may not characterize an EMQ, and argue whether and when it is desirable that an EMQ enjoys them. In Sect. 4 we turn to examining the actual measures that have been proposed or used in the quantification literature, and discuss whether they comply with the properties introduced in Sect. 3. Section 5 critically reexamines the results of Sect. 4, while Sect. 6 concludes, discussing aspects that the present work still leaves open and avenues for further research.

2 Evaluating single-label quantification

Let us fix some notation. Symbols \(\sigma \), \(\sigma '\), \(\sigma ''\), ... will each denote a sample, i.e., a nonempty set of unlabelled items, while symbols \({\mathcal {C}}\), \({\mathcal {C}}'\), \({\mathcal {C}}''\), ... will each denote a nonempty set of classes (or codeframe) across which the unlabelled items in a sample are distributed. Symbols c, \(c_{1}\), \(c_{2}\), ... will each denote an individual class. Given a class \(c_{i}\), we will denote by \(\sigma _{i}\) the set of items in \(\sigma \) that belong to \(c_{i}\); we will also denote by \(|\sigma |\), \(|\sigma '|\), \(|\sigma ''|\), ... the number of items contained in samples \(\sigma \), \(\sigma '\), \(\sigma ''\), .... Symbols p, \(p'\), \(p''\), ... will each denote a true distribution of the unlabelled items (either on the same sample \(\sigma \) or on different samples) across a codeframe \({\mathcal {C}}\), while symbols \({\hat{p}}\), \({\hat{p}}'\), \({\hat{p}}''\), ... will each denote a predicted distribution (or estimator), i.e., the result of estimating a true distribution; symbol \({\mathcal {P}}\) will denote the (infinite) set of all distributions on \({\mathcal {C}}\). Finally, symbols D, \(D'\), \(D''\), ... will each denote an EMQ, while symbols \(\pi \), \(\pi '\), \(\pi ''\), ... will denote properties that an EMQ may enjoy or not.

Similarly to classification, there are different quantification problems of applicative interest, based (a) on how many classes codeframe \({\mathcal {C}}\) contains, and (b) on how many of the classes in \({\mathcal {C}}\) can be legitimately attributed to the same item. We characterize quantification problems as follows:

  1. Single-label quantification (SLQ) is defined as quantification when each item belongs to exactly one of the classes in \({\mathcal {C}}=\{c_{1}, \ldots , c_{|{\mathcal {C}}|}\}\).

  2. Multi-label quantification (MLQ) is defined as quantification when the same item may belong to any number of classes (zero, one, or several) in \({\mathcal {C}}=\{c_{1}, \ldots , c_{|{\mathcal {C}}|}\}\).

  3. Binary quantification (BQ) may alternatively be defined

     (a) as SLQ with \(|{\mathcal {C}}|=2\) (in this case \({\mathcal {C}}=\{c_{1},c_{2}\}\) and each item must belong to either \(c_{1}\) or \(c_{2}\)), or

     (b) as MLQ with \(|{\mathcal {C}}|=1\) (in this case \({\mathcal {C}}=\{c\}\) and each item either belongs or does not belong to c).

Since BQ is a special case of SLQ (see bullet 3a above), any evaluation measure for SLQ is also an evaluation measure for BQ. Likewise, any evaluation measure for BQ is also an evaluation measure for MLQ, since evaluating a multi-label quantifier (i.e., a software artifact that estimates class prevalences) can be done by evaluating \(|{\mathcal {C}}|\) binary quantifiers, one for each \(c_{i}\in {\mathcal {C}}\). As a consequence, in this paper we focus on the evaluation of SLQ, knowing that all the solutions we discuss for SLQ also apply to BQ and MLQ.

As already discussed, given a sample \(\sigma \) of items (single-)labelled according to \({\mathcal {C}}=\{c_{1}, \ldots , c_{|{\mathcal {C}}|}\}\), quantification has to do with determining, for each \(c_{i}\in {\mathcal {C}}\), the fraction \(|\sigma _{i}|/|\sigma |\) of items in \(\sigma \) that are labelled by \(c_{i}\). These \(|{\mathcal {C}}|\) fractions actually form a distribution p of the items in \(\sigma \) across the classes in \({\mathcal {C}}\); quantification may thus be seen as generating a predicted distribution \({\hat{p}}(c)\) over \({\mathcal {C}}\) that approximates a true distribution p(c) over \({\mathcal {C}}\). Evaluating quantification thus means measuring how well \({\hat{p}}(c)\) fits p(c). We will thus be concerned with discussing the properties that a function that attempts to measure this goodness-of-fit should enjoy; we hereafter use the notation \(D(p,{\hat{p}})\) to indicate such a function.

In this paper we assume that the EMQs we are concerned with are measures of quantification error, and not of quantification accuracy. The reason for this is that most, if not all, of the EMQs that have been used so far are indeed measures of error, so it would be slightly unnatural to discuss our properties with reference to quantification accuracy. Since any measure of accuracy can be turned into a measure of error (typically: by taking its negation), this is an inessential factor anyway.

3 Properties for SLQ error measures

3.1 Seven desirable properties

In this section we examine a number of specific properties that, as we argue, an EMQ should enjoy. The spirit of our discussion will be essentially normative, i.e., we will argue whether an EMQ should or should not enjoy a given property, and whether this should hold regardless of the intended application. This is different, e.g., from the spirit of Amigó et al. (2011) (a work on the properties of evaluation measures for document filtering), which has a descriptive intent, i.e., describes a number of properties that such evaluation measures may or may not enjoy but does not necessarily argue that all measures should satisfy them.

The first four properties for EMQs that we discuss concern both mathematical “well-formedness” and ease of interpretation.

Property 1

Identity of indiscernibles (IoI) For each codeframe \({\mathcal {C}}\), true distribution p, and predicted distribution \({\hat{p}}\), it holds that \(D(p,{\hat{p}})=0\) if and only if \({\hat{p}}=p\). \(\square \)

Property 2

Non-negativity (NN) For each codeframe \({\mathcal {C}}\), true distribution p, and predicted distribution \({\hat{p}}\), it holds that \(D(p,{\hat{p}})\ge 0\). \(\square \)

Imposing that an EMQ enjoys IoI and NN is reasonable, since together they fix the score (zero) of the perfect estimator (defined as the estimator \({\hat{p}}\) such that \({\hat{p}}=p\)) and stipulate that any other (non-perfect) estimator must obtain a strictly higher score; both prescriptions fit our understanding of D as a measure of error. In mathematics, a function of two probability distributions that enjoys IoI and NN (two properties that, together, are often called Positive Definiteness) is called a divergence (a.k.a. “contrast function”).

Property 3

Strict monotonicity (MON) For each codeframe \({\mathcal {C}}\) and true distribution p, if there are predicted distributions \({\hat{p}}',{\hat{p}}''\) and classes \(c_{1},c_{2}\in {\mathcal {C}}\) such that \({\hat{p}}'\) and \({\hat{p}}''\) only differ for the fact that \({\hat{p}}''(c_{1})<{\hat{p}}'(c_{1})\le p(c_{1})\) and \({\hat{p}}''(c_{2})>{\hat{p}}'(c_{2})\ge p(c_{2})\), with \(|{\hat{p}}''(c_{1})-{\hat{p}}'(c_{1})| = |{\hat{p}}''(c_{2})-{\hat{p}}'(c_{2})|\), then it holds that \(D(p,{\hat{p}}')<D(p,{\hat{p}}'')\). \(\square \)

If D satisfies MON, this means that, all other things being equal, a higher prediction error on a class \(c_{1}\) (obviously matched by a higher prediction error, of opposite sign, on another class \(c_{2}\)) implies a higher quantification error as measured by D.

Property 4

Maximum (MAX) There is a real value \(\beta >0\) such that, for each codeframe \({\mathcal {C}}\) and for each true distribution p, (1) there is a predicted distribution \({\hat{p}}^{*}\) such that \(D(p,{\hat{p}}^{*})=\beta \), and (2) for no predicted distribution \({\hat{p}}\) it holds that \(D(p,{\hat{p}})>\beta \). \(\square \)

An estimator \({\hat{p}}^{*}\) that is the worst possible estimator of p for D (i.e., \({\hat{p}}^{*}=\arg \max _{{\hat{p}}\in {\mathcal {P}}}D(p,{\hat{p}})\)) will be called the perverse estimator of p for D. If D satisfies MAX and \({\hat{p}}^{*}\) is the perverse estimator of p for D, then \(D(p,{\hat{p}}^{*})=\beta \). Without loss of generality, in the rest of this paper we will assume \(\beta =1\); this assumption is unproblematic since any interval \([0, \beta ]\) can be rescaled to the [0, 1] interval.

Altogether, these first four properties state (among other things) that the range of an EMQ that satisfies them is independent of the problem setting (i.e., of \({\mathcal {C}}\), of its cardinality \(|{\mathcal {C}}|\), and of the true distribution p). This is important since, in order to be able to easily judge whether a given value of D means high or low quantification error, not only do we need to know what values D ranges on, but we also need to know that these values are always the same. In other words, should this range depend on \({\mathcal {C}}\), or on its cardinality, or on the true distribution p, we would not be able to easily interpret the meaning of a given value of D.

An additional, possibly even more important reason for requiring this range to be independent of the problem setting is that, in order to test a given quantification method, the EMQ usually needs to be evaluated on a set of n test samples \(\sigma _{1}, \ldots , \sigma _{n}\) (each characterized by its own true distribution), and a measure of central tendency (typically: the average or the median) across the n resulting EMQ values then needs to be computed (see Sect. 5.3 for more on this). If, for these n samples, the EMQ ranges on n different intervals, this measure of central tendency will return unreliable results, since the results obtained on the samples characterized by the wider such intervals will exert a higher influence on the resulting value.
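As a minimal illustration of this protocol (a sketch of ours, not taken from the quantification literature; we assume that prevalences are stored as numpy arrays and that `emq` is any of the error measures discussed in Sect. 4):

```python
import numpy as np

def evaluate_on_samples(emq, true_dists, pred_dists, central=np.mean):
    # Score each of the n test samples with the chosen EMQ, then reduce
    # the n scores with a measure of central tendency (np.mean here;
    # np.median is the other common choice).
    scores = [emq(p, p_hat) for p, p_hat in zip(true_dists, pred_dists)]
    return central(scores)
```

If the EMQ ranges on a different interval for each sample (i.e., it violates MAX), the samples with the wider intervals dominate the resulting mean, which is exactly the problem described above.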

The fifth property we discuss deals with the relative impact of underestimation and overestimation.

Property 5

Impartiality (IMP) For any codeframe \({\mathcal {C}}=\{c_{1},\ldots , c_{|{\mathcal {C}}|}\}\), true distribution p, predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\), classes \(c_{1},c_{2}\in {\mathcal {C}}\), and constant \(a\ge 0\) such that \({\hat{p}}'\) and \({\hat{p}}''\) only differ for the fact that \({\hat{p}}'(c_{1})=p(c_{1})+a\), \({\hat{p}}'(c_{2})=p(c_{2})-a\), \({\hat{p}}''(c_{1})=p(c_{1})-a\), \({\hat{p}}''(c_{2})=p(c_{2})+a\), it holds that \(D(p,{\hat{p}}')=D(p,{\hat{p}}'')\). \(\square \)

In a nutshell, for an EMQ D that enjoys IMP, underestimating a true prevalence p(c) by an amount a or overestimating it by the same amount a are equally serious mistakes. For instance, assume that \({\mathcal {C}}=\{c_{1},c_{2}\}\), \(p(c_{1})=0.10\), \(p(c_{2})=0.90\), and let \({\hat{p}}'\) and \({\hat{p}}''\) be two predicted distributions such that \({\hat{p}}'(c_{1})=0.05\), \({\hat{p}}'(c_{2})=0.95\), \({\hat{p}}''(c_{1})=0.15\), and \({\hat{p}}''(c_{2})=0.85\). If an EMQ D satisfies IMP then \(D(p,{\hat{p}}')=D(p,{\hat{p}}'')\).

We contend that IMP is indeed a desirable property of any EMQ, since underestimation and overestimation should be equally penalized, unless there is a specific reason for not doing so. If, in a given application, we want to state that the two mistakes bring about different costs, we should be able to explicitly state these costs as parameters of the adopted measure. However, in the absence of any such explicit statement, the two errors should be considered equally serious.

A further reason for insisting that an EMQ satisfies IMP is that the parameters of a quantifier trained via supervised learning, if optimized on a measure D that penalizes (say) the underestimation of p(c) less than it penalizes its overestimation, will be such that the quantifier will systematically tend to underestimate p(c). Depending on the type of parameters, this may be the result of optimization carried out either implicitly (i.e., via supervised learners that use D as the loss to minimize—see e.g., Esuli and Sebastiani 2015) or explicitly (i.e., via k-fold cross validation).

So far we have discussed properties that, as we claim, should be enjoyed by any EMQ. This is not the case for the next (and last) two properties, since they exclude each other (i.e., an EMQ cannot enjoy both). We will claim that in some application contexts the former is desirable, while in other application contexts the latter is.

Property 6

Relativity (REL) For any codeframe \({\mathcal {C}}\), constant \(a>0\), true distributions \(p'\) and \(p''\) that only differ for the fact that, for classes \(c_{1}\) and \(c_{2}\), \(p'(c_{1})<p''(c_{1})\) and \(p''(c_{2})<p'(c_{2})\) (with \(p''(c_{1})<p''(c_{2})\)), if a predicted distribution \({\hat{p}}'\) that estimates \(p'\) is such that \({\hat{p}}'(c_{1})=p'(c_{1})\pm a\) and a predicted distribution \({\hat{p}}''\) that estimates \(p''\) is such that \({\hat{p}}''(c_{1})=p''(c_{1})\pm a\), and \({\hat{p}}'(c)={\hat{p}}''(c)\) for all \(c\not \in \{c_{1},c_{2}\}\), then it holds that \(D(p',{\hat{p}}')>D(p'',{\hat{p}}'')\). \(\square \)

In order to understand this fairly complex formulation, let us consider a concrete example.

Example 1

Assume that \({\mathcal {C}}=\{c_{1},c_{2},c_{3},c_{4}\}\), and that \(p',p'',{\hat{p}}',{\hat{p}}''\) are described by the following table:

                 \(c_{1}\)  \(c_{2}\)  \(c_{3}\)  \(c_{4}\)
\(p'\)           0.15       0.35       0.40       0.10
\({\hat{p}}'\)   0.10       0.55       0.30       0.05
\(p''\)          0.20       0.30       0.40       0.10
\({\hat{p}}''\)  0.15       0.50       0.30       0.05

This scenario is characterized by the fact that, of the only two classes (\(c_{1}\) and \(c_{2}\)) that have different prevalences in \(p'\) and \(p''\), the one with the smaller true prevalence (\(c_{1}\)) in both \(p'\) and \(p''\) is underestimated by the same amount (0.05) by both \({\hat{p}}'\) and \({\hat{p}}''\). In this case, if D satisfies REL, it penalizes \({\hat{p}}'\) more than it penalizes \({\hat{p}}''\), since \(p'(c_{1})<p''(c_{1})\). \(\square \)

The rationale of REL is that an EMQ that satisfies it sanctions that an error of absolute magnitude a is more serious when the true class prevalence is smaller. REL may be a desirable property in some applications of quantification. Consider, as an example, the case in which the prevalence p(c) of pathology c (say, Tuberculosis) as a cause of death in a population has to be estimated, for epidemiological purposes, from verbal descriptions of the symptoms that the deceased exhibited before dying (King and Ying 2008). In this case, REL should arguably be a property of the EMQ; in fact, predicting \({\hat{p}}'(c)=0.0101\) when \(p'(c)=0.0001\) is a much more serious mistake than predicting \({\hat{p}}''(c)=0.1100\) when \(p''(c)=0.1000\), since in the former case a very rare cause of death is overestimated by two orders of magnitude (e.g., the presence of an epidemic might mistakenly be inferred), while the same is not true in the latter case.

However, in other applications of quantification REL may be undesirable. To see this, consider an example in which we want to predict the prevalence \(p({\textsf {NoShow}})\) of the NoShow class among the passengers booked on a flight with actual capacity X (so that the airline can “overbook” additional \({\hat{p}}({\textsf {NoShow}})\cdot X\) seats). In this application, relativity should arguably not be a property of the evaluation measure, since predicting \({\hat{p}}({\textsf {NoShow}})=0.05\) when \(p({\textsf {NoShow}})=0.10\) or predicting \({\hat{p}}({\textsf {NoShow}})=0.15\) when \(p({\textsf {NoShow}})=0.20\) brings about the same cost to the airline (i.e., that \(0.05\cdot X\) seats will remain empty). Applications such as this demand that the EMQ satisfies instead the following property.

Property 7

Absoluteness (ABS) For any codeframe \({\mathcal {C}}\), constant \(a>0\), true distributions \(p'\) and \(p''\) that only differ for the fact that, for classes \(c_{1}\) and \(c_{2}\), \(p'(c_{1})<p''(c_{1})\) and \(p''(c_{2})<p'(c_{2})\) (with \(p''(c_{1})<p''(c_{2})\)), if a predicted distribution \({\hat{p}}'\) that estimates \(p'\) is such that \({\hat{p}}'(c_{1})=p'(c_{1})\pm a\) and a predicted distribution \({\hat{p}}''\) that estimates \(p''\) is such that \({\hat{p}}''(c_{1})=p''(c_{1})\pm a\), and \({\hat{p}}'(c)={\hat{p}}''(c)\) for all \(c\not \in \{c_{1},c_{2}\}\), then it holds that \(D(p',{\hat{p}}')=D(p'',{\hat{p}}'')\). \(\square \)

The formulation of ABS only differs from the formulation of REL for its conclusion: while REL stipulates that \(D(p',{\hat{p}}')\) must be higher than \(D(p'',{\hat{p}}'')\), ABS states that the two must be equal. The rationale of ABS is to guarantee that an error of the same magnitude has the same impact on D regardless of the true prevalence of the class. ABS and REL are thus mutually exclusive.

Note that ABS and REL, while mutually exclusive, are not jointly exhaustive, i.e., they do not cover the entire spectrum of possibilities (see Sect. 4.6 for an example EMQ that enjoys neither). For instance, an EMQ might consider an error more serious when the true class prevalence is larger, in which case it would satisfy neither REL nor ABS. As the two examples above show, there are applications that positively demand REL to hold and others that positively demand ABS. As a result, we will not claim that an EMQ must (or must not) enjoy REL or ABS; we simply think it is important to ascertain whether a given EMQ satisfies REL, ABS, or neither, since depending on this the EMQ may or may not be adequate for the application one is tackling.

3.2 Reformulating MON, IMP, REL, ABS

The formulations of four of the properties presented above (namely, MON, IMP, REL, ABS) might seem baroque, i.e., not as tight as they could be. In this section we define simplified versions of them, and show that if an EMQ satisfies a further property (IND), which we are going to define next, then each of MON, IMP, REL, ABS is equivalent to its simplified counterpart. Since, as it turns out, all the measures that we discuss in this paper satisfy IND, this will substantially simplify the task of checking whether our measures satisfy MON, IMP, REL, ABS.

Assume a codeframe \({\mathcal {C}}=\{c_{1}, \ldots , c_{n}\}\) partitioned into \({\mathcal {C}}_{1}=\{c_{1}, \ldots , c_{k}\}\) and \({\mathcal {C}}_{2}=\{c_{k+1}, \ldots , c_{n}\}\), and a true distribution p on \({\mathcal {C}}\) such that \(\sum _{c\in {\mathcal {C}}_{1}}p(c)=a\) for some constant \(0<a\le 1\). We define the projection of p on \({\mathcal {C}}_{1}\) as the distribution \(p_{{\mathcal {C}}_{1}}\) on \({\mathcal {C}}_{1}\) such that \(p_{{\mathcal {C}}_{1}}(c)=\frac{p(c)}{a}\) for all \(c\in {\mathcal {C}}_{1}\).

Example 2

Assume that \({\mathcal {C}}=\{c_{1},c_{2},c_{3},c_{4}\}\), that \({\mathcal {C}}_{1}=\{c_{1},c_{2},c_{3}\}\), and that p is as in the 1st row of the following table. The projection of p on \({\mathcal {C}}_{1}\) is then described in the 2nd row of the same table.

                            \(c_{1}\)  \(c_{2}\)  \(c_{3}\)  \(c_{4}\)
p                           0.32       0.00       0.48       0.20
\(p_{{\mathcal {C}}_{1}}\)  0.40       0.00       0.60

Essentially, the projection on \({\mathcal {C}}_{1}\subset {\mathcal {C}}\) of a distribution p defined on \({\mathcal {C}}\) is a distribution defined on \({\mathcal {C}}_{1}\) such that the ratios between prevalences of classes that belong to \({\mathcal {C}}_{1}\) are the same in \({\mathcal {C}}\) and \({\mathcal {C}}_{1}\).
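As an illustration (a sketch of ours; the helper name `project` is hypothetical), the projection can be computed by rescaling the selected prevalences so that they sum to 1:

```python
import numpy as np

def project(p, indices):
    # Projection of p on the sub-codeframe identified by `indices`:
    # prevalence ratios within the subset are preserved, and the
    # result sums to 1. The selected prevalences must not all be 0.
    sub = p[indices]
    return sub / sub.sum()

p = np.array([0.32, 0.00, 0.48, 0.20])
print(project(p, [0, 1, 2]))  # [0.4 0.  0.6], as in Example 2
```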

We are now ready to describe Property 8.

Property 8

Independence (IND) For any codeframes \({\mathcal {C}}=\{c_{1}, \ldots , c_{n}\}\), \({\mathcal {C}}_{1}=\{c_{1}, \ldots , c_{k}\}\) and \({\mathcal {C}}_{2}=\{c_{k+1}, \ldots , c_{n}\}\), for any true distribution p on \({\mathcal {C}}\) and predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\) on \({\mathcal {C}}\) such that \({\hat{p}}'(c)={\hat{p}}''(c)\) for all \(c\in {\mathcal {C}}_{2}\), it holds that \(D(p,{\hat{p}}')\le D(p,{\hat{p}}'')\) if and only if \(D(p_{{\mathcal {C}}_{1}},{\hat{p}}'_{{\mathcal {C}}_{1}})\le D(p_{{\mathcal {C}}_{1}},{\hat{p}}''_{{\mathcal {C}}_{1}})\). \(\square \)

If D satisfies property IND, this essentially means that, when two predicted distributions estimate the prevalences of all classes \(\{c_{k+1}, \ldots , c_{n}\}\) identically, their relative merit according to D is independent of these classes, and can thus be established by focusing only on the remaining classes \(\{c_{1}, \ldots , c_{k}\}\).

We can now attempt to simplify the formulation of the MON, IMP, REL, ABS properties. For this discussion we will take MON as an example, since similar considerations also apply to the other three properties.

What we would like from a monotonicity property is to stipulate that any even small increase in quantification error must generate an increase in the value of \(D(p,{\hat{p}})\). However, the notion of an “increase in quantification error” is non-trivial. To see this, note that characterizing an increase in classification error is simple, since the units of classification (the unlabelled items) are independent of each other: in a single-label context, to generate an increase in classification error one just needs to switch the predicted label of a single test item from correct to incorrect, and the other items are not affected. In a quantification context, instead, increasing the difference between \(p(c_{i})\) and \({\hat{p}}(c_{i})\) for some \(c_{i}\) does not necessarily increase quantification error, since the estimation(s) of some other class(es) in \({\mathcal {C}}/\{c_{i}\}\) is/are affected too, in many possible ways; in some cases the quantification error across the entire codeframe \({\mathcal {C}}\) unequivocally increases, while in some other cases it is not clear whether this happens or not, as the following example shows.

Example 3

Assume that \({\mathcal {C}}=\{c_{1},c_{2},c_{3},c_{4}\}\), and assume the following true distribution p and predicted distributions \({\hat{p}}',{\hat{p}}'',{\hat{p}}'''\):

                  \(c_{1}\)  \(c_{2}\)  \(c_{3}\)  \(c_{4}\)
p                 0.20       0.30       0.25       0.25
\({\hat{p}}'\)    0.25       0.15       0.30       0.30
\({\hat{p}}''\)   0.35       0.15       0.25       0.25
\({\hat{p}}'''\)  0.35       0.05       0.30       0.30

In switching from \({\hat{p}}'\) to \({\hat{p}}''\) the quantification error on \(c_{1}\) increases, but the quantification error on \(c_{3}\) and \(c_{4}\) decreases, so that it is not clear whether we should consider the quantification error on \({\mathcal {C}}\) to increase or decrease. Conversely, in switching from \({\hat{p}}'\) to \({\hat{p}}'''\) the quantification errors on \(c_{1}\) and on the rest of the codeframe as a whole both increase. \(\square \)

Example 3 shows that the increase in the quantification error on a single class says nothing about how the quantification error on the entire codeframe varies. As a result, in MON we cannot stipulate (as we would have liked) that, in switching from one predicted distribution to another, D should increase with the increase in the estimation error on a single class \(c_{1}\). The only thing we can do is to impose a monotonicity condition on how D behaves in a specific case, i.e., when the increase in the estimation error on a class \(c_{1}\) is exactly matched by an estimation error (of identical magnitude but opposite sign) on another class \(c_{2}\) (which is what MON does) while the estimation errors on all the other classes do not change.

The two predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\) mentioned in MON are such that \({\hat{p}}'(c_{1})+{\hat{p}}'(c_{2})={\hat{p}}''(c_{1})+{\hat{p}}''(c_{2})=a\) for some constant \(0<a\le 1\), while both \(\sum _{c\in {\mathcal {C}}/\{c_{1},c_{2}\}}{\hat{p}}'(c)\) and \(\sum _{c\in {\mathcal {C}}/\{c_{1},c_{2}\}}{\hat{p}}''(c)\) are equal to \((1-a)\). This means that, assuming that D satisfies IND, we can reformulate MON in a way that disregards classes other than \(\{c_{1},c_{2}\}\) and considers instead the projection of p on \(\{c_{1},c_{2}\}\). In other words, if D satisfies IND we can reformulate MON in a way that tackles the problem in a binary quantification context (instead of the more general single-label quantification context). The fact that, in a binary context, \(p(c_{2})=(1-p(c_{1}))\) for any (true or predicted) distribution p, means that MON can be reformulated by simply referring to just one of the two classes, i.e.,

Property 9

Binary strict monotonicity (B-MON) For any codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\) and true distribution p, if predicted distributions \({\hat{p}}',{\hat{p}}''\) are such that \({\hat{p}}''(c_{1})<{\hat{p}}'(c_{1})\le p(c_{1})\), then it holds that \(D(p,{\hat{p}}')<D(p,{\hat{p}}'')\). \(\square \)

As a result of what we have said in this section, B-MON is, for any EMQ D that satisfies IND, equivalent to MON. It is also much more compact since, among other things, it makes reference to a single class only. Considerations analogous to the ones above can be made for IMP, REL, ABS. We reformulate them too as below.

Property 10

Binary impartiality (B-IMP) For any codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\), true distribution p, predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\), and constant \(a\ge 0\) such that \({\hat{p}}'(c_{1})=p(c_{1})+a\) and \({\hat{p}}''(c_{1})=p(c_{1})-a\), it holds that \(D(p,{\hat{p}}')=D(p,{\hat{p}}'')\). \(\square \)

Property 11

Binary relativity (B-REL) For any codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\), constant \(a>0\), true distributions \(p'\) and \(p''\) such that \(p'(c_{1})<p''(c_{1})\) and \(p''(c_{1})<p''(c_{2})\), if a predicted distribution \({\hat{p}}'\) that estimates \(p'\) is such that \({\hat{p}}'(c_{1})=p'(c_{1})\pm a\) and a predicted distribution \({\hat{p}}''\) that estimates \(p''\) is such that \({\hat{p}}''(c_{1})=p''(c_{1})\pm a\), then it holds that \(D(p',{\hat{p}}')>D(p'',{\hat{p}}'')\). \(\square \)

Property 12

Binary absoluteness (B-ABS) For any codeframe \({\mathcal {C}}=\{c_{1},c_{2}\}\), constant \(a>0\), true distributions \(p'\) and \(p''\) such that \(p'(c_{1})<p''(c_{1})\) and \(p''(c_{1})<p''(c_{2})\), if a predicted distribution \({\hat{p}}'\) that estimates \(p'\) is such that \({\hat{p}}'(c_{1})=p'(c_{1})\pm a\) and a predicted distribution \({\hat{p}}''\) that estimates \(p''\) is such that \({\hat{p}}''(c_{1})=p''(c_{1})\pm a\), then it holds that \(D(p',{\hat{p}}')=D(p'',{\hat{p}}'')\). \(\square \)

In the next sections, instead of trying to prove that an EMQ verifies Properties 3–7, we will equivalently (1) try to prove that it verifies IND, and, if successful, (2) try to prove that it verifies Properties 9–12; the reason is, of course, the much higher simplicity and compactness of the formulations of Properties 9–12 with respect to those of Properties 3–7.

4 Evaluation measures for single-label quantification

In this section we turn to the functions that have been proposed and used for evaluating quantification, and discuss whether they comply or not with the properties that we have discussed in Sect. 3. In many cases these functions were originally proposed for evaluating the binary case; since the extension to SLQ is usually straightforward, for each EMQ we indicate its original proponent or user (on this see also Table 2) and disregard whether it was originally used just for BQ or for the full-blown SLQ.

We will discuss 9 measures proposed as EMQs in the literature, and for each of them we will be interested in whether they satisfy Properties 1 to 8 or not. Giving \(9\times 8=72\) proofs in detail would make the paper excessively long and boring: as a result, only some of these proofs will be given in detail, while for others we will only hint at how they can be easily obtained via the same lines of reasoning used in other cases. In several cases, given a measure D and a property \(\pi \), one can simply show that D does not enjoy \(\pi \) via a counterexample. Since the same scenario can serve as a counterexample for showing that \(\pi \) is not enjoyed by several measures, we formulate each such scenario in the form of a table that shows which measures the scenario rules out. In the "Appendix" we include a table each for properties MAX (“Appendix 2.1” section), IMP (“Appendix 2.2” section), REL (“Appendix 2.3” section), ABS (“Appendix 2.4” section); in this section, when discussing a property in the context of a specific measure that does not enjoy it, we will simply refer the reader to the appropriate table.

A 2D plot (for the case of binary quantification) of the 9 measures we will discuss is displayed in Fig. 1; Fig. 2 displays the same plots in 3D. These plots allow one to appreciate graphically whether a measure enjoys a certain property or not. For instance, looking at the 2D plots, a measure that enjoys both IoI and NN (i.e., a divergence) is such that the \(y=x\) diagonal is the locus of the darkest points; a measure that enjoys MON is such that, when moving away in a vertical direction (i.e., up or down) from the \(y=x\) diagonal, points get lighter; a measure that enjoys IMP is such that, when moving away in a vertical direction from the \(y=x\) diagonal, moving up or down by the same amount returns points of the same colour; a measure that enjoys ABS is such that, when moving away in a vertical direction from the \(y=x\) diagonal in a given sense (e.g., down), the difference in colour does not depend on which point of the diagonal we are moving away from; etc.

4.1 Absolute error

The simplest EMQ is Absolute Error (\({{\,\mathrm{AE}\,}}\)), which corresponds to the average (across the classes in \({\mathcal {C}}\)) absolute difference between the predicted class prevalence and the true class prevalence; i.e.,

$$\begin{aligned} {{\,\mathrm{AE}\,}}(p,{\hat{p}})=\frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}|{\hat{p}}(c)-p(c)| \end{aligned}$$
(1)
Fig. 1 2D plots (for a binary quantification task) for the nine EMQs of Tables 1 and 2; \(p(c_{1})\) and \(p(c_{2})\) are represented as x and \((1-x)\), respectively, while \({\hat{p}}(c_{1})\) and \({\hat{p}}(c_{2})\) are represented as y and \((1-y)\). Darker areas represent values closer to 0 (i.e., smaller error) while lighter areas represent values more distant from 0 (i.e., higher error)

It is easy to prove that \({{\,\mathrm{AE}\,}}\) enjoys IoI, NN, MON, IMP, ABS, IND. While some of these proofs are trivial, we report them in detail (in “Appendix 1” section) in order to show how the same arguments can be used to prove the same for many of the EMQs to be discussed later in this section.

Instead, as shown in “Appendix 2.1” section, \({{\,\mathrm{AE}\,}}\) does not enjoy MAX, because its range depends on the true distribution p. More specifically, \({{\,\mathrm{AE}\,}}\) ranges between 0 (best) and

$$\begin{aligned} z_{{{\,\mathrm{AE}\,}}}=\displaystyle \frac{2\left( 1-\displaystyle \min _{c\in {\mathcal {C}}}p(c)\right) }{|{\mathcal {C}}|} \end{aligned}$$
(2)

(worst), i.e., its range depends also on the cardinality of \({\mathcal {C}}\). In fact, it is easy to verify that, given a true distribution p on \({\mathcal {C}}\), the perverse estimator of p is the one such that (a) \({\hat{p}}(c^{*})=1\) for class \(c^{*}=\arg \min _{c\in {\mathcal {C}}}p(c)\), and (b) \({\hat{p}}(c)=0\) for all \(c\in {\mathcal {C}}/\{c^{*}\}\). In this case, the total error derives (1) from overestimating \(p(c^{*})\), which brings about an error of \((1-p(c^{*}))\), and (2) from underestimating p(c) for all \(c\in {\mathcal {C}}/\{c^{*}\}\), which collectively brings about an additional error of \((1-p(c^{*}))\). \({{\,\mathrm{AE}\,}}\) is obtained by dividing this \(2(1-p(c^{*}))\) quantity by \(|{\mathcal {C}}|\).
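As an illustration (a sketch of ours; function names are not from the literature), the following code implements Eq. 1 and checks Eq. 2 against the perverse estimator, using the true distribution \(p'\) of Example 1:

```python
import numpy as np

def absolute_error(p, p_hat):
    # AE (Eq. 1): mean, over the classes, of |p_hat(c) - p(c)|.
    return np.mean(np.abs(p_hat - p))

def ae_upper_bound(p):
    # z_AE (Eq. 2): the value AE takes on the perverse estimator.
    return 2 * (1 - p.min()) / len(p)

p = np.array([0.15, 0.35, 0.40, 0.10])
perverse = np.zeros_like(p)
perverse[p.argmin()] = 1.0            # all mass on the least prevalent class
print(absolute_error(p, perverse))    # 0.45
print(ae_upper_bound(p))              # 0.45
```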

Concerning REL, just note that since \({{\,\mathrm{AE}\,}}\) satisfies ABS, it cannot (as observed in Sect. 3) satisfy REL. (That \({{\,\mathrm{AE}\,}}\) does not enjoy REL is also shown via a counterexample in “Appendix 2.3” section.)

The properties that \({{\,\mathrm{AE}\,}}\) enjoys (and those it does not enjoy) are conveniently summarized in Table 1, along with the same for all the measures discussed in the rest of this paper.

In the literature, \({{\,\mathrm{AE}\,}}\) also goes by the name of Variational Distance (Csiszár and Shields 2004, §4; Lin 1991; Zhang and Zhou 2010) or Percentage Discrepancy (Esuli and Sebastiani 2010; Baccianella et al. 2013). Also, if viewed as a generic function of dissimilarity between vectors (and not just probability distributions), \({{\,\mathrm{AE}\,}}\) is nothing else than the well-known “city-block distance” normalized by the number of classes. Some recent papers (Beijbom et al. 2015; González et al. 2017) that tackle quantification in the context of ecological modelling discuss or use, as an EMQ, Bray–Curtis dissimilarity (BCD), a measure popular in ecology for measuring the dissimilarity of two samples. However, when used to measure the dissimilarity of two probability distributions, BCD defaults to \({{\,\mathrm{AE}\,}}\); as a result we will not analyse BCD any further.

Note that \({{\,\mathrm{AE}\,}}\) often goes by the name of Mean Absolute Error; for simplicity, for this and the other measures we discuss in the rest of this paper we will omit the qualification “Mean”, since every measure averages across the class-specific values in its own way.

As an EMQ, \({{\,\mathrm{AE}\,}}\) was used for the first time by Saerens et al. (2002), and in many other papers ever since. For \({{\,\mathrm{AE}\,}}\) and for all the other EMQs discussed in this paper, Table 2 lists the papers where the measure has been proposed and those which have subsequently used it for evaluation purposes.

4.2 Normalized absolute error

Following what we have said in Sect. 4.1, a normalized version of \({{\,\mathrm{AE}\,}}\) that always ranges between 0 (best) and 1 (worst) can be obtained as

$$\begin{aligned} \begin{aligned} {{\,\mathrm{NAE}\,}}(p,{\hat{p}})&= \ \dfrac{{{\,\mathrm{AE}\,}}(p,{\hat{p}})}{z_{{{\,\mathrm{AE}\,}}}} = \ \frac{\sum _{c\in {\mathcal {C}}}|{\hat{p}}(c)-p(c)|}{2\left( 1-\displaystyle \min\nolimits _{c\in {\mathcal {C}}}p(c)\right) } \end{aligned} \end{aligned}$$
(3)

where \(z_{{{\,\mathrm{AE}\,}}}\) is as in Eq. 2. It is easy to verify that \({{\,\mathrm{NAE}\,}}\) enjoys IoI, NN, MON, IMP, IND. \({{\,\mathrm{NAE}\,}}\) also enjoys (by construction) MAX.
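A corresponding sketch of ours for Eq. 3 (same assumptions as the \({{\,\mathrm{AE}\,}}\) sketch of Sect. 4.1):

```python
import numpy as np

def normalized_absolute_error(p, p_hat):
    # NAE (Eq. 3): AE rescaled by its maximum z_AE, so that the measure
    # ranges in [0, 1] for every codeframe and every true distribution;
    # the perverse estimator always scores exactly 1.
    return np.abs(p_hat - p).sum() / (2 * (1 - p.min()))
```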

Fig. 2 3D plots (for a binary quantification task) for the nine EMQs of Tables 1 and 2; \(p(c_{1})\) and \(p(c_{2})\) are represented as x and \((1-x)\), respectively, while \({\hat{p}}(c_{1})\) and \({\hat{p}}(c_{2})\) are represented as y and \((1-y)\); error is represented as z (higher values of z represent higher error)

Given that \({{\,\mathrm{NAE}\,}}\) is just a normalized version of \({{\,\mathrm{AE}\,}}\), and given that \({{\,\mathrm{AE}\,}}\) enjoys ABS, one might expect that \({{\,\mathrm{NAE}\,}}\) enjoys ABS too. Surprisingly enough, this is not the case, as shown in the counterexample of “Appendix 2.4” section. The reason for this is that, for the two distributions \(p'\) and \(p''\) (and their respective predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\)) mentioned in the formulation of Property 7 (ABS), and exemplified in the counterexample of “Appendix 2.4” section, the numerator of Eq. 3 is the same but the denominator (i.e., the normalizing constant) is different, which means that the value of \({{\,\mathrm{NAE}\,}}\) is also different. \({{\,\mathrm{NAE}\,}}\) does not enjoy REL either, as also shown in “Appendix 2.3” section.

\({{\,\mathrm{NAE}\,}}\) was discussed for the first time by Esuli and Sebastiani (2014). With a similar intent, in a binary quantification context Barranquero et al. (2015) proposed Normalized Absolute Score (\({{\,\mathrm{NAS}\,}}\)). \({{\,\mathrm{NAS}\,}}\) is an accuracy (and not an error) measure; when viewed as an error measure, it is defined as

$$\begin{aligned} \begin{aligned} {{\,\mathrm{NAS}\,}}(p,{\hat{p}})= \frac{|p(c)-{\hat{p}}(c)|}{\max \{p(c),(1-p(c))\}} \end{aligned} \end{aligned}$$
(4)

where c is any class in \({\mathcal {C}}=\{c_{1},c_{2}\}\). We will not discuss \({{\,\mathrm{NAS}\,}}\) in detail since (a) it is only defined for the binary case, and (b) it is easy to show that in this case it coincides with \({{\,\mathrm{NAE}\,}}\).

4.3 Relative absolute error

Relative Absolute Error (\({{\,\mathrm{RAE}\,}}\)) relativises the value \(|{\hat{p}}(c)-p(c)|\) in Eq. 1 to the true class prevalence, i.e.,

$$\begin{aligned} {{\,\mathrm{RAE}\,}}(p,{\hat{p}})=\frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}\displaystyle \frac{|{\hat{p}}(c)-p(c)|}{p(c)} \end{aligned}$$
(5)

\({{\,\mathrm{RAE}\,}}\) may be undefined in some cases, due to the presence of zero denominators. To solve this problem, in computing \({{\,\mathrm{RAE}\,}}\) we can smooth both p(c) and \({\hat{p}}(c)\) via additive smoothing, i.e., we take

$$\begin{aligned} p_{s}(c)=\frac{\epsilon +p(c)}{\epsilon |{\mathcal {C}}|+\displaystyle \sum \nolimits _{c\in {\mathcal {C}}}p(c)} \end{aligned}$$
(6)

where \(p_{s}(c)\) denotes the smoothed version of p(c) and the denominator is just a normalizing factor (same for the \({\hat{p}}_{s}(c)\)’s); the quantity \(\epsilon =\frac{1}{2 |\sigma |}\) is often used (and will always be used in the rest of this paper) as a smoothing factor. The smoothed versions of p(c) and \({\hat{p}}(c)\) are then used in place of their original non-smoothed versions in Eq. 5; as a result, \({{\,\mathrm{RAE}\,}}\) is always defined.
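A sketch of ours of the smoothing step (Eq. 6) and of smoothed \({{\,\mathrm{RAE}\,}}\) (Eq. 5); `n_items` stands for \(|\sigma |\):

```python
import numpy as np

def smooth(p, n_items):
    # Additive smoothing (Eq. 6) with eps = 1 / (2 * |sigma|); the
    # denominator renormalizes the smoothed values so they sum to 1.
    eps = 1.0 / (2 * n_items)
    return (p + eps) / (eps * len(p) + p.sum())

def relative_absolute_error(p, p_hat, n_items):
    # RAE (Eq. 5), computed on smoothed prevalences so that a class
    # with p(c) = 0 cannot produce a zero denominator.
    p_s, p_hat_s = smooth(p, n_items), smooth(p_hat, n_items)
    return np.mean(np.abs(p_hat_s - p_s) / p_s)
```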

Using arguments analogous to the ones used for \({{\,\mathrm{AE}\,}}\) in “Appendix 1” section, it is immediate to show that \({{\,\mathrm{RAE}\,}}\) enjoys IoI, NN, MON, IMP, IND. It also enjoys REL by construction, which means that it does not enjoy ABS. Analogously to \({{\,\mathrm{AE}\,}}\), \({{\,\mathrm{RAE}\,}}\) does not enjoy MAX, as shown via the counterexample in “Appendix 2.1” section.

It is easy to show that \({{\,\mathrm{RAE}\,}}\) ranges between 0 (best) and

$$\begin{aligned} z_{{{\,\mathrm{RAE}\,}}}=\displaystyle \frac{|{\mathcal {C}}|-1+\displaystyle \frac{1- \displaystyle \min\nolimits_{c\in {\mathcal {C}}}p(c)}{\displaystyle \min\nolimits_{c\in {\mathcal {C}}}p(c)}}{|{\mathcal {C}}|} \end{aligned}$$
(7)

(worst), i.e., its range depends also on the cardinality of \({\mathcal {C}}\). In fact, similarly to the case of \({{\,\mathrm{AE}\,}}\), it is easy to verify that, given a true distribution p on \({\mathcal {C}}\), the perverse estimator of p is obtained when (a) \({\hat{p}}(c^{*})=1\) for the class \(c^{*}=\arg \min _{c\in {\mathcal {C}}}p(c)\), and (b) \({\hat{p}}(c)=0\) for all \(c\in {\mathcal {C}}/\{c^{*}\}\). In this case, the total relative absolute error derives (1) from overestimating \(p(c^{*})\), which brings about an error of \(\frac{1-p(c^{*})}{p(c^{*})}\), and (2) from underestimating p(c) for all \(c\in {\mathcal {C}}/\{c^{*}\}\), which brings about an additional error of 1 for each class in \({\mathcal {C}}/\{c^{*}\}\). The value of \({{\,\mathrm{RAE}\,}}\) is then obtained by dividing the resulting \(\left( |{\mathcal {C}}|-1+\displaystyle \frac{1- p(c^{*})}{p(c^{*})}\right) \) by \(|{\mathcal {C}}|\).

As an EMQ, \({{\,\mathrm{RAE}\,}}\) was used for the first time by González-Castro et al. (2010), and by several other papers after it.

4.4 Normalized relative absolute error

Following what we have said in Sect. 4.3, a normalized version of \({{\,\mathrm{RAE}\,}}\) that always ranges between 0 (best) and 1 (worst) can thus be obtained as

$$\begin{aligned} {{\,\mathrm{NRAE}\,}}(p,{\hat{p}})= \frac{{{\,\mathrm{RAE}\,}}(p,{\hat{p}})}{z_{{{\,\mathrm{RAE}\,}}}}= \frac{\displaystyle \sum _{c\in {\mathcal {C}}}\displaystyle \frac{|{\hat{p}}(c) - p(c)|}{p(c)}}{|{\mathcal {C}}|-1+\displaystyle \frac{1- \displaystyle \min\nolimits _{c\in {\mathcal {C}}}p(c)}{\displaystyle \min\nolimits_{c\in {\mathcal {C}}}p(c)}} \end{aligned}$$
(8)

where \(z_{{{\,\mathrm{RAE}\,}}}\) is as in Eq. 7. Since the various denominators of Eq. 8 may be undefined, the smoothed values of Eq. 6 must be used in Eq. 8 too.

It is straightforward to verify that \({{\,\mathrm{NRAE}\,}}\), which was first proposed by Esuli and Sebastiani (2014), enjoys IoI, NN, MON, IMP, IND, and also enjoys (by construction) MAX.

Somewhat similarly to what we said in Sect. 4.2 about \({{\,\mathrm{NAE}\,}}\) and ABS, given that \({{\,\mathrm{NRAE}\,}}\) is just a normalized version of \({{\,\mathrm{RAE}\,}}\), and given that \({{\,\mathrm{RAE}\,}}\) enjoys REL, one might expect that \({{\,\mathrm{NRAE}\,}}\) enjoys REL too. Again, this is not the case, as shown in the counterexample of “Appendix 2.3” section. The reason for this is that, for the two distributions \(p'\) and \(p''\) (and their respective predicted distributions \({\hat{p}}'\) and \({\hat{p}}''\)) mentioned in the formulation of Property 6 (REL), and exemplified in the counterexample of “Appendix 2.3” section, while \({{\,\mathrm{RAE}\,}}\) (the numerator of Eq. 8) does enjoy REL, the normalizing constant (the denominator of Eq. 8) invalidates it, since it is different for \(p'\) and \(p''\). \({{\,\mathrm{NRAE}\,}}\) does not enjoy ABS either, as also shown in “Appendix 2.4” section.

4.5 Squared error

Another measure that has been used in the quantification literature is Squared Error (\({{\,\mathrm{SE}\,}}\)), defined as

$$\begin{aligned} \begin{aligned} {{\,\mathrm{SE}\,}}(p,{\hat{p}}) = \frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}} (p(c)-{\hat{p}}(c))^{2} \end{aligned} \end{aligned}$$
(9)

When viewed as a generic function of dissimilarity between vectors (and not just probability distributions), \({{\,\mathrm{SE}\,}}\) is the square of the well-known \(L^{2}\)-distance, normalized by the number of classes. As an EMQ, \({{\,\mathrm{SE}\,}}\) was used for the first time by Bella et al. (2010).

The mathematical form of \({{\,\mathrm{SE}\,}}\) is very similar to that of \({{\,\mathrm{AE}\,}}\), and it can be trivially shown that \({{\,\mathrm{SE}\,}}\) enjoys exactly the properties that \({{\,\mathrm{AE}\,}}\) enjoys, and fails to enjoy exactly the properties that \({{\,\mathrm{AE}\,}}\) does not enjoy. In particular, \({{\,\mathrm{SE}\,}}\) does not enjoy MAX, since \({{\,\mathrm{SE}\,}}\) ranges between 0 (best) and

$$\begin{aligned} z_{{{\,\mathrm{SE}\,}}}=\displaystyle \frac{(1-p(c^{*}))^{2}+\sum _{c\in {\mathcal {C}}/\{c^{*}\}}p(c)^{2}}{|{\mathcal {C}}|} \end{aligned}$$
(10)

(worst), where \(c^{*}=\arg \min _{c\in {\mathcal {C}}}p(c)\); i.e., the range of \({{\,\mathrm{SE}\,}}\) depends on p and \(|{\mathcal {C}}|\). In fact, similarly to the case of \({{\,\mathrm{AE}\,}}\), it is easy to verify that the perverse estimator of a true distribution p is the one such that (a) \({\hat{p}}(c^{*})=1\) and (b) \({\hat{p}}(c)=0\) for all \(c\in {\mathcal {C}}/\{c^{*}\}\). In this case, the squared error derives (1) from overestimating \(p(c^{*})\), which brings about an error of \(\frac{(1-p(c^{*}))^{2}}{|{\mathcal {C}}|}\), and (2) from underestimating p(c) for all \(c\in {\mathcal {C}}/\{c^{*}\}\), which brings about an additional error of \(\frac{p(c)^{2}}{|{\mathcal {C}}|}\) for each class in \({\mathcal {C}}/\{c^{*}\}\). We could thus define a normalized version of \({{\,\mathrm{SE}\,}}\) as

$$\begin{aligned} {{\,\mathrm{NSE}\,}}(p,{\hat{p}})= \frac{{{\,\mathrm{SE}\,}}(p,{\hat{p}})}{z_{{{\,\mathrm{SE}\,}}}}= \frac{\sum _{c\in {\mathcal {C}}} (p(c)-{\hat{p}}(c))^{2}}{(1-p(c^{*}))^{2}+\sum _{c\in {\mathcal {C}}/\{c^{*}\}}p(c)^{2}} \end{aligned}$$
(11)

which would, quite obviously, enjoy and not enjoy exactly the same properties that \({{\,\mathrm{NAE}\,}}\) enjoys and does not enjoy.

\({{\,\mathrm{SE}\,}}\) is structurally similar to \({{\,\mathrm{AE}\,}}\) but (as can also be appreciated from Fig. 1) is less sensitive than it, i.e., it is always the case that \({{\,\mathrm{SE}\,}}(p,{\hat{p}})\le {{\,\mathrm{AE}\,}}(p,{\hat{p}})\); this is because \(|p(c)-{\hat{p}}(c)|\le 1\), which implies \((p(c)-{\hat{p}}(c))^{2}\le |p(c)-{\hat{p}}(c)|\).
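A sketch of ours for Eq. 9, with a check of the \({{\,\mathrm{SE}\,}}(p,{\hat{p}})\le {{\,\mathrm{AE}\,}}(p,{\hat{p}})\) relation on the distributions p and \({\hat{p}}'\) of Example 3:

```python
import numpy as np

def squared_error(p, p_hat):
    # SE (Eq. 9): mean, over the classes, of (p(c) - p_hat(c))^2.
    return np.mean((p - p_hat) ** 2)

p     = np.array([0.20, 0.30, 0.25, 0.25])
p_hat = np.array([0.25, 0.15, 0.30, 0.30])
# Each |p(c) - p_hat(c)| is at most 1, so squaring can only shrink it.
assert squared_error(p, p_hat) <= np.mean(np.abs(p - p_hat))
```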

In the binary quantification literature, other proxies of \({{\,\mathrm{SE}\,}}\) have been used; one example is Normalized Squared Score (Barranquero et al. 2015), defined as \({{\,\mathrm{NSS}\,}}(p(c),{\hat{p}}(c))\equiv 1-(\frac{p(c)-{\hat{p}}(c)}{\max \{p(c),(1-p(c))\}})^{2}\), where c is any class in \({\mathcal {C}}=\{c_{1},c_{2}\}\). Similarly to the \({{\,\mathrm{NAS}\,}}\) measure discussed at the end of Sect. 4.2, \({{\,\mathrm{NSS}\,}}\) is an accuracy (and not an error) measure; when viewed as an error measure, it would be defined as

$$\begin{aligned} \begin{aligned} {{\,\mathrm{NSS}\,}}(p,{\hat{p}})= \left( \frac{p(c)-{\hat{p}}(c)}{\max \{p(c),(1-p(c))\}}\right) ^{2} \end{aligned} \end{aligned}$$
(12)

where c is any class in \({\mathcal {C}}=\{c_{1},c_{2}\}\). We will not discuss \({{\,\mathrm{NSS}\,}}\) in detail since (a) it is only defined for the binary case, and (b) it is easy to show that in this case it coincides with \({{\,\mathrm{NSE}\,}}\).

4.6 Discordance ratio

Levin and Roitman (2017) introduce an EMQ that they call Concordance Ratio (CR). \({{\,\mathrm{CR}\,}}\) is a measure of accuracy, and not a measure of error; for better consistency with the rest of this paper, instead of \({{\,\mathrm{CR}\,}}\) we consider what might be called Discordance Ratio, i.e., its complement \({{\,\mathrm{DR}\,}}=(1-{{\,\mathrm{CR}\,}})\), defined as

$$\begin{aligned} \begin{aligned} {{\,\mathrm{DR}\,}}(p,{\hat{p}})=&\ 1 - {{\,\mathrm{CR}\,}}\\ =&\ 1-\frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}\dfrac{\min \{(p(c),{\hat{p}}(c)\}}{\max \{(p(c),{\hat{p}}(c)\}} \\ =&\ \frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}\dfrac{\max \{(p(c),{\hat{p}}(c)\}-\min \{(p(c),{\hat{p}}(c)\}}{\max \{(p(c),{\hat{p}}(c)\}} \\ =&\ \frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}\dfrac{|(p(c)-{\hat{p}}(c)|}{\max \{(p(c),{\hat{p}}(c)\}} \end{aligned} \end{aligned}$$
(13)

\({{\,\mathrm{DR}\,}}\) is undefined when, for a given class c, both p(c) and \({\hat{p}}(c)\) are zero; the smoothed values of Eq. 6 must thus be used within Eq. 13 in order to avoid this problem.

It is easy to verify, along the lines sketched in “Appendix 1” section, that \({{\,\mathrm{DR}\,}}\) enjoys IoI, NN, MON, IND. \({{\,\mathrm{DR}\,}}\) also enjoys REL; this can be seen from the fact that, for the same amount a of misprediction, \(\sum _{c\in {\mathcal {C}}}\frac{\min \{p(c),{\hat{p}}(c)\}}{\max \{p(c),{\hat{p}}(c)\}}\) is smaller (hence \({{\,\mathrm{DR}\,}}(p,{\hat{p}})\) is larger) when the true prevalence of the class \(c_{1}\) mentioned in the formulation of Property 6 (REL) is smaller. Instead, \({{\,\mathrm{DR}\,}}\) enjoys neither MAX, nor IMP, nor ABS, as shown in “Appendices 2.1, 2.2 and 2.4” sections, respectively.
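A sketch of ours for Eq. 13, reusing the `smooth` helper from the \({{\,\mathrm{RAE}\,}}\) sketch of Sect. 4.3:

```python
import numpy as np

def discordance_ratio(p, p_hat, n_items):
    # DR (Eq. 13), computed on smoothed prevalences (`smooth` as defined
    # in the Sect. 4.3 sketch) so that max{p(c), p_hat(c)} is always > 0.
    p_s, p_hat_s = smooth(p, n_items), smooth(p_hat, n_items)
    return np.mean(np.abs(p_s - p_hat_s) / np.maximum(p_s, p_hat_s))
```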

4.7 Kullback–Leibler divergence

An EMQ that has become somewhat standard in the evaluation of single-label (and, a fortiori, binary) quantification is Kullback–Leibler Divergence (\({{\,\mathrm{KLD}\,}}\)—also called Information Divergence, or Relative Entropy) (Csiszár and Shields 2004), defined as

$$\begin{aligned} {{\,\mathrm{KLD}\,}}(p,{\hat{p}}) = \sum _{c\in {\mathcal {C}}} p(c)\log \frac{p(c)}{{\hat{p}}(c)} \end{aligned}$$
(14)

As an EMQ, \({{\,\mathrm{KLD}\,}}\) was used for the first time (under the name Normalized Cross-Entropy) by Forman (2005). It should also be noted that \({{\,\mathrm{KLD}\,}}\) has been adopted as the official evaluation measure of the only quantification-related shared task that has been organized so far, Subtask D “Tweet Quantification on a 2-point Scale” of SemEval-2016 and SemEval-2017 “Task 4: Sentiment Analysis in Twitter” (Nakov et al. 2016, 2017).

\({{\,\mathrm{KLD}\,}}\) may be undefined in some cases. While the case in which \(p(c)=0\) is not problematic (since continuity arguments indicate that \(0 \log \frac{0}{a}\) should be taken to be 0 for any \(a\ge 0\)), the case in which \({\hat{p}}(c)=0\) and \(p(c)>0\) is indeed problematic, since \(a\log \frac{a}{0}\) is undefined for \(a>0\). To solve this problem, we smooth values in the same way as already described in Sect. 4.3; as a result, \({{\,\mathrm{KLD}\,}}\) is always defined.

The fact that \({{\,\mathrm{KLD}\,}}\) enjoys IoI and NN (i.e., the fact that \({{\,\mathrm{KLD}\,}}\) is indeed a divergence) is not self-evident (since \(p(c)\log \frac{p(c)}{{\hat{p}}(c)}\) is negative whenever \(p(c)<{\hat{p}}(c)\)), and is known as Gibbs’ inequality. A formal proof of it can be found in several information theory textbooks (see e.g., MacKay 2003, p. 44).

Indeed, \({{\,\mathrm{KLD}\,}}\) is a well-known member of the class of f-divergences (Ali and Silvey 1966; Csiszár and Shields 2004, §4), a class of functions that measure the difference between two probability distributions, and that all enjoy IoI and NN.

The fact that \({{\,\mathrm{KLD}\,}}\) enjoys MON is also not self-evident, essentially for the same reasons for which it is not self-evident that it enjoys IoI and NN. The proof that \({{\,\mathrm{KLD}\,}}\) enjoys MON is given in “Appendix 3” section, where we use the fact that \({{\,\mathrm{KLD}\,}}\) enjoys IND (something which can be easily shown via the arguments used in “Appendix 1” section) and thus limit ourselves to proving that it enjoys B-MON.

The fact that \({{\,\mathrm{KLD}\,}}\) enjoys neither MAX, nor IMP, nor REL, nor ABS is shown in “Appendices 2.1, 2.2, 2.3, 2.4” sections, respectively. Concerning MAX we note that, in theory, the upper bound of \({{\,\mathrm{KLD}\,}}\) is not finite, since Eq. 14 has predicted probabilities, and not true probabilities, in the denominator. That is, by making a predicted probability \({\hat{p}}(c)\) infinitely small we can make \({{\,\mathrm{KLD}\,}}\) infinitely large. However, since we use smoothed values, the fact that both p and \({\hat{p}}\) are lower-bounded by \(\epsilon \), and not by 0, has the consequence that \({{\,\mathrm{KLD}\,}}\) has a finite upper bound. The perverse estimator for \({{\,\mathrm{KLD}\,}}\) is the one such that (a) \({\hat{p}}(c^{*})=1\) and (b) \({\hat{p}}(c)=0\) for all \(c\in {\mathcal {C}}/\{c^{*}\}\). The value of \({{\,\mathrm{KLD}\,}}\) on this estimator is

$$\begin{aligned} z_{{{\,\mathrm{KLD}\,}}}(p,{\hat{p}})=p_{s}(c^{*})\log \frac{p_{s}(c^{*})}{1-(|{\mathcal {C}}|-1)\cdot \epsilon }+\sum _{c\in {\mathcal {C}}/\{c^{*}\}}p_{s}(c)\log \frac{p_{s}(c)}{\epsilon } \end{aligned}$$
(15)

which shows that the range of \({{\,\mathrm{KLD}\,}}\) depends on p, the cardinality of \({\mathcal {C}}\), and even on the value of \(\epsilon \). This is a further proof that \({{\,\mathrm{KLD}\,}}\) does not enjoy MAX.
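A sketch of ours for Eq. 14, again reusing the `smooth` helper of Sect. 4.3; since smoothing lower-bounds both distributions by \(\epsilon \), the \(0\log 0\) corner case cannot arise and the result is always finite:

```python
import numpy as np

def kld(p, p_hat, n_items):
    # KLD (Eq. 14) on smoothed prevalences (`smooth` as in Sect. 4.3):
    # p_s is never 0, and p_hat_s >= eps keeps the value finite.
    p_s, p_hat_s = smooth(p, n_items), smooth(p_hat, n_items)
    return np.sum(p_s * np.log(p_s / p_hat_s))
```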

4.8 Normalized Kullback–Leibler divergence

Given what we have said in Sect. 4.7, one might define a normalized version of \({{\,\mathrm{KLD}\,}}\) (i.e., one that also enjoys MAX) as \(\frac{{{\,\mathrm{KLD}\,}}(p,{\hat{p}})}{z_{{{\,\mathrm{KLD}\,}}}(p,{\hat{p}})}\), where \(z_{{{\,\mathrm{KLD}\,}}}(p,{\hat{p}})\) is as in Eq. 15. Esuli and Sebastiani (2014) follow instead a different route, and define a normalized version of \({{\,\mathrm{KLD}\,}}\) by applying to it a logistic function, i.e.,

$$\begin{aligned} \begin{aligned} {{\,\mathrm{NKLD}\,}}(p,{\hat{p}})&= 2\frac{e^{{{\,\mathrm{KLD}\,}}(p,{\hat{p}})}}{e^{{{\,\mathrm{KLD}\,}}(p,{\hat{p}})}+1}-1 \\ \end{aligned} \end{aligned}$$
(16)

Like other previously discussed measures, \({{\,\mathrm{NKLD}\,}}\) may be undefined in some cases; therefore, in computing \({{\,\mathrm{NKLD}\,}}\) we also need to use the smoothed values of Eq. 6 in place of the original p(c)’s and \({\hat{p}}(c)\)’s.
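In the running sketch (again, our illustration), \({{\,\mathrm{NKLD}\,}}\) is then a one-line transformation of \({{\,\mathrm{KLD}\,}}\):

```python
def nkld(p, p_hat, eps=1e-6):
    # Logistic normalization of KLD (Eq. 16); since KLD >= 0,
    # the result lies in [0, 1). Equivalent to tanh(KLD / 2).
    x = kld(p, p_hat, eps)
    return 2.0 * np.exp(x) / (np.exp(x) + 1.0) - 1.0
```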

\({{\,\mathrm{NKLD}\,}}\) enjoys some of our properties of interest for the simple reason that \({{\,\mathrm{KLD}\,}}\) enjoys them; it is easy to verify that this is the case for IoI and NN. \({{\,\mathrm{NKLD}\,}}\) also enjoys MON and IND; this follows from the fact that \({{\,\mathrm{NKLD}\,}}(d,d')<{{\,\mathrm{NKLD}\,}}(d,d'')\) if and only if \({{\,\mathrm{KLD}\,}}(d,d')<{{\,\mathrm{KLD}\,}}(d,d'')\) (since the logistic function is a monotonic transformation) and from the fact that \({{\,\mathrm{KLD}\,}}\) enjoys MON and IND, respectively. Concerning MAX, \({{\,\mathrm{NKLD}\,}}\) enjoys it by construction, because when a predicted prevalence \({\hat{p}}(c)\) tends to 0, \({{\,\mathrm{KLD}\,}}\) tends to \(+\,\infty \), and \({{\,\mathrm{NKLD}\,}}\) thus tends to 1.

The fact that \({{\,\mathrm{NKLD}\,}}\) enjoys neither IMP, nor REL, nor ABS, is shown in “Appendices 2.2, 2.3 and 2.4” section, respectively.

4.9 Pearson divergence

The last EMQ we discuss is the Pearson Divergence (\({{\,\mathrm{PD}\,}}\)—see du Plessis and Sugiyama 2012), also called the \(\chi ^{2}\) Divergence (Liese and Vajda 2006), and defined as

$$\begin{aligned} \begin{aligned} {{\,\mathrm{PD}\,}}(p,{\hat{p}}) =&\frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}} \frac{(p(c)-{\hat{p}}(c))^{2}}{{\hat{p}}(c)} \end{aligned} \end{aligned}$$
(17)

As an EMQ, \({{\,\mathrm{PD}\,}}\) was first used by Ceron et al. (2016). \({{\,\mathrm{PD}\,}}\) is undefined when, for a given class c, \({\hat{p}}(c)\) is zero; the smoothed values of Eq. 6 must thus be used within Eq. 17 in order to avoid this problem.
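Continuing the hypothetical sketch above (and reusing its smooth function), Eq. 17 translates directly into code:

```python
def pd(p, p_hat, eps=1e-6):
    # Pearson divergence (Eq. 17) computed on smoothed values.
    p_s, p_hat_s = smooth(p, eps), smooth(p_hat, eps)
    return float(np.mean((p_s - p_hat_s) ** 2 / p_hat_s))
```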

The arguments already used for \({{\,\mathrm{AE}\,}}\) in “Appendix 1” section can be easily used to show that \({{\,\mathrm{PD}\,}}\) enjoys IoI, NN, and IND. That \({{\,\mathrm{PD}\,}}\) enjoys MON is instead not self-evident; the proof that it indeed does is reported in “Appendix 3” section.

That \({{\,\mathrm{PD}\,}}\) enjoys neither MAX, nor IMP, nor REL, nor ABS, is shown in “Appendices 2.1, 2.2, 2.3, 2.4” section, respectively. The fact that \({{\,\mathrm{PD}\,}}\) does not enjoy MAX can also be shown with the arguments used to show the same for \({{\,\mathrm{KLD}\,}}\); that is, when a predicted probability \({\hat{p}}(c)\) is very small, \({{\,\mathrm{PD}\,}}\) becomes very large. Thanks to the fact that we use smoothed values, though, \({\hat{p}}\) is lower-bounded by \(\epsilon \), and \({{\,\mathrm{PD}\,}}\) thus has a finite upper bound. As for other EMQs we have already discussed, the perverse estimator for \({{\,\mathrm{PD}\,}}\) is the one that attributes 1 to the probability of class \(c^{*}=\arg \min _{c\in {\mathcal {C}}}p(c)\) and 0 to the other classes, and its value is thus

$$\begin{aligned} z_{{{\,\mathrm{PD}\,}}}(p,{\hat{p}})=\frac{1}{|{\mathcal {C}}|}\left( \frac{(p_{s}(c^{*})-(1-(|{\mathcal {C}}|-1)\cdot \epsilon ))^{2}}{1-(|{\mathcal {C}}|-1)\cdot \epsilon }+\sum _{c\in {\mathcal {C}}/\{c^{*}\}}\frac{(p_{s}(c)-\epsilon )^{2}}{\epsilon }\right) \end{aligned}$$
(18)

which shows that the range of \({{\,\mathrm{PD}\,}}\) depends on p, the cardinality of \({\mathcal {C}}\), and the value of \(\epsilon \). This suffices to show that \({{\,\mathrm{PD}\,}}\) does not enjoy MAX.
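As with \(z_{{{\,\mathrm{KLD}\,}}}\) above, a quick check in the same hypothetical sketch makes the dependence on \(\epsilon \) visible:

```python
def z_pd(p, eps=1e-6):
    # Score of the perverse estimator for PD (Eq. 18).
    p = np.asarray(p, dtype=float)
    p_hat = np.zeros_like(p)
    p_hat[np.argmin(p)] = 1.0
    return pd(p, p_hat, eps)

for eps in (1e-4, 1e-6, 1e-8):
    print(eps, z_pd([0.1, 0.3, 0.6], eps))
# grows roughly as 1/eps
```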

5 Discussion

The properties that the EMQs of Sect. 4 enjoy and do not enjoy are summarized in Table 1. Table 2 instead lists the papers where the various EMQs were first proposed and the papers where they have subsequently been used for evaluation purposes.

Table 1 Properties of the EMQs discussed in this paper
Table 2 Research works about quantification where the EMQs discussed in this paper have been first proposed (\(\bigstar \)) and later used (\(\checkmark \))

5.1 Are all our properties equally important?

An examination of Table 1 allows us to make a number of general considerations. The first is that some of our properties (namely: IoI, NN, MON, IND) are unproblematic, since all the EMQs proposed so far satisfy them, while other properties (namely: MAX, IMP, REL, ABS) are failed by several EMQs, including ones (e.g., \({{\,\mathrm{AE}\,}}\), \({{\,\mathrm{KLD}\,}}\)) that are almost standard in the quantification literature. The second, related observation is that, if we agree that the eight properties we have discussed are desirable, a number of EMQs that have been proposed in the quantification literature emerge as severely inadequate, since they fail several of these properties; this is true even if we discount the fact that, as we have already observed, REL and ABS are mutually exclusive. The case of \({{\,\mathrm{KLD}\,}}\) (which fails MAX, IMP, REL, and ABS) is of special significance, since \({{\,\mathrm{KLD}\,}}\) has almost become a standard in the evaluation of single-label (and binary) quantification (from Table 2, \({{\,\mathrm{KLD}\,}}\) emerges as the 2nd most frequently used EMQ, after \({{\,\mathrm{AE}\,}}\)).

However, an even more compelling fact that emerges from Table 1 is that no EMQ among those proposed so far satisfies (even discounting the mutual exclusivity of REL and ABS) all the proposed properties. This suggests that more research is needed in order to identify, or synthesize, an EMQ more satisfactory than all the existing ones.

At the same time, in the absence of a truly satisfactory EMQ, we think it is important to analyse whether all of our properties are equally important, or whether some of them are less important than others and can thus be “sacrificed”. Judging from Table 1, the key stumbling block seems to be the MAX property, since all the EMQs that satisfy MAX (namely: \({{\,\mathrm{NAE}\,}}\), \({{\,\mathrm{NRAE}\,}}\), \({{\,\mathrm{NKLD}\,}}\)) satisfy neither REL nor ABS. This is undesirable since, as argued at the end of Sect. 3.1, some applications of quantification require REL, while others require ABS (and we can think of no application that requires neither). Among the EMQs that satisfy ABS (and not REL), \({{\,\mathrm{AE}\,}}\) and \({{\,\mathrm{SE}\,}}\) satisfy all other properties but MAX, while among the ones that satisfy REL (and not ABS), \({{\,\mathrm{RAE}\,}}\) likewise satisfies all other properties but MAX.

In other words, if we stick to the available EMQs, if we want ABS or REL we must renounce MAX, while if we want MAX we must renounce both ABS and REL. How relatively desirable are these three properties? We recall from Sect. 3.1 that

  1.

    the argument in favour of REL is that it reflects the needs of applications in which an estimation error of a given absolute magnitude should be considered more serious if it affects a rarer class;

  2.

    the argument in favour of ABS is that it reflects the needs of applications in which an estimation error of a given absolute magnitude should be considered to have the same impact independently of the true prevalence of the affected class;

  3.

    the main (although not the only) argument in favour of MAX is that, if an EMQ does not satisfy it, the n samples on which we may want to compare our quantification algorithms will each have a different weight on the final result.

The relative importance of these three arguments is probably a matter of opinion. However, it is our impression that Arguments 1 and 2 are more compelling than Argument 3, since 1 and 2 are really about how an evaluation measure reflects the needs of the application for which one performs a given task (quantification, in our case); if the corresponding properties are not satisfied, one may argue that the quantification accuracy (or error) being measured is only loosely related to what the user really wants.

Argument 3, while important, “only” implies that, if MAX is not satisfied, (1) results obtained on codeframes of different cardinality will not be comparable, and (2) results obtained on samples characterized by different true distributions will not be comparable; while undesirable, this does not affect the experimental comparison among different quantification systems, since each of them is affected by these disparities in the same way.

So, if we accept the idea of “sacrificing” MAX in order to retain REL or ABS, Table 1 indicates that our measures of choice should be

  • \({{\,\mathrm{AE}\,}}\) (or \({{\,\mathrm{SE}\,}}\), which is structurally similar), for those applications in which an estimation error of a given absolute magnitude has the same impact independently of the true prevalence of the affected class; and

  • \({{\,\mathrm{RAE}\,}}\), for those applications in which an estimation error of a given absolute magnitude should be considered more serious when the true prevalence of the affected class is lower.

5.2 Properties that escape formalization

While all of the above discussion on the properties of EMQs has been unashamedly formal, we should also remember that choosing one evaluation measure over another should also be guided by practical considerations, i.e., by properties of the measure that are not necessarily amenable to formalization. One such property is understandability, i.e., how simple and intuitive the mathematical form of an evaluation measure is. While such simplicity might not be a primary concern for the researcher, or the mathematician, it might be for the practitioner. For instance, a company that wants to sell a text analytics product to a customer might need to run experiments on the customer’s own data and explain the results to the customer; since customers might not be mathematically savvy, the fact that the chosen measure is easily understandable to people with a minimal mathematical background is important. On this account, measures such as \({{\,\mathrm{AE}\,}}\) and \({{\,\mathrm{RAE}\,}}\) certainly win over measures such as \({{\,\mathrm{KLD}\,}}\) and \({{\,\mathrm{NKLD}\,}}\), which the average customer would find hardly intelligible.

Another property that is difficult to formalize is robustness to outliers. Many EMQs take the form of an average \(D(p,{\hat{p}})=\frac{1}{|{\mathcal {C}}|}\sum _{c\in {\mathcal {C}}}f(p(c),{\hat{p}}(c))\) across the classes in the codeframe. If \(D(p,{\hat{p}})\) is not “robust to outliers”, an extreme value \(f(p(c'),{\hat{p}}(c'))\) that may occur for some \(c'\in {\mathcal {C}}\) dominates all the other values \(f(p(c),{\hat{p}}(c))\) for \(c\in {\mathcal {C}}/\{c'\}\), giving rise to a high value of \(D(p,{\hat{p}})\) that is essentially due to \(c'\) only. As the name implies, “robustness to outliers” is usually considered a desirable property; however, in some contexts it might also be viewed as undesirable (e.g., we might want to avoid quantification methods that generate blatant mistakes, so we might want a measure that penalizes the presence of even one of them). Aside from the fact that its desirability is questionable, it should also be mentioned that “robustness to outliers” comes in degrees. E.g., absolute error is more robust to outliers than squared error, but squared error is more robust to outliers than “cubic error”, etc.; and all of them are vastly more robust to outliers than \({{\,\mathrm{KLD}\,}}\) and \({{\,\mathrm{NKLD}\,}}\). Which among these enforces the “right” level of robustness to outliers? This shows that robustness to outliers, independently of its desirability, cannot be framed as a binary property (i.e., one that a measure either enjoys or not), and thus escapes the type of analysis that we have carried out in this paper.
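The following toy comparison (a hypothetical sketch reusing the kld function above, and assuming that \({{\,\mathrm{AE}\,}}\) is the mean across classes of \(|p(c)-{\hat{p}}(c)|\), as defined earlier in the paper) shows how a single badly mis-predicted rare class barely moves \({{\,\mathrm{AE}\,}}\) but dominates \({{\,\mathrm{KLD}\,}}\):

```python
def ae(p, p_hat):
    # Absolute error: mean over classes of |p(c) - p_hat(c)|.
    return float(np.mean(np.abs(np.asarray(p) - np.asarray(p_hat))))

p     = [0.02, 0.49, 0.49]
p_hat = [0.00, 0.50, 0.50]   # the rare class is predicted absent
print(ae(p, p_hat))   # ~0.013: barely affected
print(kld(p, p_hat))  # ~0.18: dominated by the rare-class term
```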

Another property which is difficult to formalize has to do with the set of values which an EMQ ranges on when evaluating realistic quantification systems (i.e., systems that exhibit a quantification accuracy equal or superior to, say, that of a trivial “classify and count” approach using SVMs). For these systems, the actual values that an EMQ takes should occupy a fairly small subinterval of its entire range. The question is: how small? One particularly problematic EMQ, in this respect, is \({{\,\mathrm{KLD}\,}}\). While its range is \([0,z_{{{\,\mathrm{KLD}\,}}}]\), where \(z_{{{\,\mathrm{KLD}\,}}}\) is as in Eq. 15, realistic quantification systems generate very small \({{\,\mathrm{KLD}\,}}\) values, so small that they are sometimes difficult to make sense of. One result is that two genuine quantifiers that are being compared experimentally may easily obtain results several orders of magnitude apart; such differences in performance are difficult to grasp. We should add that, if one wants to average \({{\,\mathrm{KLD}\,}}\) results across a set of samples (on this see also Sect. 5.3), the average is completely dominated by the value with the highest order of magnitude, and the others have little or no impact. Unfortunately, switching from \({{\,\mathrm{KLD}\,}}\) to \({{\,\mathrm{NKLD}\,}}\) does not help much in this respect since, for realistic quantification systems, \({{\,\mathrm{NKLD}\,}}(p,{\hat{p}})\approx \frac{1}{2}{{\,\mathrm{KLD}\,}}(p,{\hat{p}})\). The reason is that \({{\,\mathrm{NKLD}\,}}\) is obtained by applying a sigmoidal function (namely, the logistic function) to \({{\,\mathrm{KLD}\,}}\), and the tangent to this sigmoid at \(x=0\) is \(y=\frac{1}{2}x\); since the values of \({{\,\mathrm{KLD}\,}}\) for realistic quantifiers are (as we have observed above) very close to 0, for these values the \({{\,\mathrm{NKLD}\,}}(p,{\hat{p}})\) curve is well approximated by \(y=\frac{1}{2}{{\,\mathrm{KLD}\,}}(p,{\hat{p}})\). As an EMQ, \({{\,\mathrm{NKLD}\,}}\) thus de facto inherits most of the problems of \({{\,\mathrm{KLD}\,}}\).
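A quick numeric check (in the same hypothetical sketch) of this approximation:

```python
for x in (1e-1, 1e-2, 1e-3):          # x plays the role of a KLD value
    nkld_x = 2 * np.exp(x) / (np.exp(x) + 1) - 1
    print(x, nkld_x / x)              # ratio tends to 0.5 as x -> 0
```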

All of the above shows that choosing a good EMQ (and the same may well be true for tasks other than quantification) should also be based, aside from the formal properties that the EMQ enjoys, on criteria that either resist or completely escape formalization, such as understandability and ease of use.

5.3 Evaluating quantification across multiple samples

On a different note, we also need to stress a key difference between measures of classification accuracy and measures of quantification accuracy (or error). The objects of classification are individual unlabelled items, and all measures of classification accuracy (e.g., \(F_{1}\)) are defined with respect to a test set of such objects. The objects of quantification, instead, are samples, and all the measures of quantification accuracy we have discussed in this paper are defined on a single such sample (i.e., they measure how well the true distribution of the classes across this individual sample is approximated by the predicted distribution of the classes across the same sample). Since every evaluation is worthless if carried out on a single object, it is clear that quantification systems need to be evaluated on sets of samples. This means that every measure we have discussed needs first to be computed on each sample, after which its global score across the test set (i.e., the set of samples on which testing is carried out) is computed. This global score may be computed via any measure of central tendency, e.g., via an average, a median, or some other measure (for instance, if \({{\,\mathrm{NAE}\,}}\) is used, we might in turn use Average \({{\,\mathrm{NAE}\,}}\) or Median \({{\,\mathrm{NAE}\,}}\), where averages and medians are computed across a set of samples). We do not take any specific stand for or against computing global scores via any specific measure of central tendency, since each of them may serve different but legitimate purposes. Note that a weighted average (in which the weight of a sample is inversely proportional to the score that the perverse estimator would obtain on the sample) might be appropriate for measures that do not satisfy MAX.
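A minimal sketch of these aggregation policies (hypothetical code; the function name and its “weighted” option are our own illustration, not an established API):

```python
def global_score(sample_scores, how="mean", perverse_scores=None):
    # Combine per-sample EMQ scores into one test-set score.
    s = np.asarray(sample_scores, dtype=float)
    if how == "mean":
        return float(np.mean(s))
    if how == "median":
        return float(np.median(s))
    if how == "weighted":
        # Weight each sample inversely to the score the perverse
        # estimator would obtain on it (useful for measures that
        # do not satisfy MAX, whose ranges vary across samples).
        w = 1.0 / np.asarray(perverse_scores, dtype=float)
        return float(np.sum(w * s) / np.sum(w))
    raise ValueError(f"unknown aggregation: {how}")
```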

6 Conclusions

We have presented a study that “evaluates evaluation”, in the tradition of the so-called “axiomatic” approach to the study of evaluation measures for information retrieval and related tasks. Our effort has targeted quantification, an important task at the crossroads of information retrieval, data mining, and machine learning, and has consisted of analysing previously proposed evaluation measures for quantification using the toolbox of the above-mentioned “axiomatic” approach. The work closest in spirit to the present one is our past work on the analysis of evaluation measures for classification (Sebastiani 2015). However, quantification poses more difficult problems than classification, since evaluation measures for quantification are inherently nonlinear (i.e., quantification error cannot be expressed as a linear function of the labelling error made on individual items). This is unlike classification, for which linear measures (e.g., standard accuracy, or K—see Sebastiani 2015) are possible.

We have proposed eight properties that, as we have argued, are desirable for measures that attempt to evaluate quantification (two of these properties are actually mutually exclusive, each being desirable in a different class of applications of quantification). Our analysis has revealed that, unfortunately, no existing evaluation measure for quantification satisfies all six of the remaining properties. While this points to the fact that more research is needed to identify, or synthesize, a truly adequate such measure, it also means that, for the time being, we have to evaluate the relative desirability of the properties that the existing measures do not satisfy. We have argued that some such properties are more important than others, and that as a result two measures (“Absolute Error” and “Relative Absolute Error”) stand out as the most satisfactory ones (interestingly enough, they are also the most time-honoured ones, and the mathematically simplest ones).

As we have argued, RAE is more adequate for application contexts (e.g., quantifying the Tubercolosis class, as discussed in Sect. 3.1) in which an estimation error of a given absolute magnitude should be considered more serious if it affects a rare class, while AE is more adequate for those applications (e.g., quantifying the NoShow class, as discussed in Sect. 3.1) in which an estimation error of a given absolute magnitude has the same impact independently of the true prevalence of the affected class. Future work should also address the problem of how to best characterize these two classes of applications. The number and the percentage of items in a sample \(\sigma \) that belong to a class c seem to be essentially one and the same thing, but some applications (e.g., the NoShow example) are inherently interested in numbers, while other applications (e.g., the Tubercolosis example) seem more interested in percentages. When is it that a certain application belongs to the former (or to the latter) class, and why?

Aside from the design and use of an appropriate evaluation measure, there are further aspects concerning the evaluation of quantification that this work does not tackle. One of them is how to devise an evaluation protocol that strikes a balance between the two contrasting goals of (a) testing quantifiers on samples that exhibit naturally occurring class prevalences [this is the approach adopted in works such as Gao and Sebastiani (2016), Nakov et al. (2016)], and (b) testing quantifiers also on samples that exhibit class prevalences (very) different from the naturally occurring ones [this is the approach adopted in works such as Forman (2008), Esuli et al. (2018)]. The realistic nature of the samples is the primary concern of the former approach, while testing quantifiers for robustness to different amounts of “prior probability shift” (i.e., difference between the prevalences in the training set and in the unlabelled set) is the primary concern of the latter. We are working on an attempt to combine the strengths of both approaches, and hope to report results in the near future.