Optimal Rates for Nonparametric F-Score Binary Classification via Post-Processing

Published in Mathematical Methods of Statistics

Abstract

This work studies the problem of binary classification with the F-score as the performance measure. We propose a post-processing algorithm for this problem which fits a threshold for any score-based classifier to yield a high F-score. The post-processing step involves only unlabeled data and can be performed in logarithmic time. We derive a general finite-sample post-processing bound for the proposed procedure and show that the procedure is minimax rate optimal when the underlying distribution satisfies classical nonparametric assumptions. This result improves upon previously known rates for F-score classification and bridges the gap between the standard classification risk and the F-score. Finally, we discuss the generalization of this approach to set-valued classification.
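
To make the post-processing step concrete, here is a minimal sketch (an illustration under stated assumptions, not the paper's exact estimator). It relies on the fixed-point characterization \(b^{2}\,\mathbb{P}(Y=1)\,\theta^{*}=\mathbb{E}(\eta(\mathbf{X})-\theta^{*})_{+}\) of the \(F_{b}\)-optimal threshold for the regression function \(\eta\), replaces the expectation and \(\mathbb{P}(Y=1)\) by averages of the base classifier's scores over the unlabeled sample, and locates the root by bisection, which takes only logarithmically many steps. The function name, the tolerance, and the assumption that scores lie in \([0,1]\) are illustrative choices.

```python
import numpy as np

def fit_fscore_threshold(scores_unlabeled, b=1.0, tol=1e-6):
    """Illustrative threshold post-processing for F_b-score classification.

    `scores_unlabeled` holds the base classifier's scores eta_hat(X) on an
    unlabeled sample (values assumed to lie in [0, 1]).  Since
    E[eta(X)] = P(Y = 1), the class-1 probability is estimated by the mean
    score, so no labels are needed.
    """
    s = np.asarray(scores_unlabeled, dtype=float)
    p_hat = s.mean()  # plug-in estimate of P(Y = 1)

    # Empirical version of G(t) = b^2 * P(Y = 1) * t - E[(eta(X) - t)_+].
    # G is non-decreasing (strictly increasing when p_hat > 0), and the
    # F_b-optimal threshold is its root.
    def G(t):
        return b ** 2 * p_hat * t - np.mean(np.maximum(s - t, 0.0))

    lo, hi = 0.0, 1.0  # G(0) <= 0 <= G(1) when scores lie in [0, 1]
    while hi - lo > tol:  # bisection: O(log(1/tol)) iterations
        mid = 0.5 * (lo + hi)
        if G(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Usage: predict label 1 for a new point x whenever eta_hat(x) >= theta_hat,
# where theta_hat = fit_fscore_threshold(eta_hat(X_unlabeled), b=1.0).
```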

Notes

  1. The (weighted) harmonic mean of any real number and zero is defined as zero.

  2. It is assumed that \(\mathbf{X}_{n+1},\ldots,\mathbf{X}_{n+N}\) are independent of \((\mathbf{X}_{1},Y_{1}),\ldots,(\mathbf{X}_{n},Y_{n}),(\mathbf{X},Y)\).

  3. For this discussion let us assume that \(n\) is even.

  4. We only assumed that \(\mathbb{P}(Y=1)>0\), which is hardly an assumption in this context.

  5. Again, it is assumed that the (weighted) harmonic mean of any real number and zero is zero.

  6. Indeed, note that \(R(\theta)=H_{1}(\theta)+H_{2}(\theta)\), where \(H_{1}(\theta)=b^{2}\theta\) is strictly increasing and \(H_{2}(\theta)=-\sum_{k=1}^{K}\mathbb{E}(\eta_{k}(\mathbf{X})-\theta)_{+}\) is non-decreasing.
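
To spell out the monotonicity behind Note 6 (using only the decomposition stated there, and writing \(\mathbf{X}\) for the argument of \(\eta_{k}\) in line with the notation of Note 2): for any \(0\le\theta_{1}<\theta_{2}\), the map \(t\mapsto(\eta_{k}(\mathbf{X})-t)_{+}\) is non-increasing, so

\[R(\theta_{2})-R(\theta_{1})=b^{2}(\theta_{2}-\theta_{1})+\sum_{k=1}^{K}\mathbb{E}\bigl[(\eta_{k}(\mathbf{X})-\theta_{1})_{+}-(\eta_{k}(\mathbf{X})-\theta_{2})_{+}\bigr]\geq b^{2}(\theta_{2}-\theta_{1})>0.\]

Hence \(R\) is strictly increasing and has at most one root, which bisection brackets to accuracy \(\varepsilon\) in \(O(\log(1/\varepsilon))\) iterations, consistent with the logarithmic-time claim in the abstract.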


Author information

Corresponding author

Correspondence to Evgenii Chzhen.

About this article

Cite this article

Chzhen, E. Optimal Rates for Nonparametric F-Score Binary Classification via Post-Processing. Math. Meth. Stat. 29, 87–105 (2020). https://doi.org/10.3103/S1066530720020027
