Abstract
A discrimination rule is chosen from a possibly infinite collection of discrimination rules by minimizing the observed error on a test sample. For example, the collection could include all k-nearest neighbor rules (for all k), all linear discriminators, and all kernel-based rules (for all possible choices of the smoothing parameter). We place no restrictions on the collection.
We study how close the probability of error of the selected rule is to the (unknown) minimal probability of error over the entire collection. If both the training sample and the test sample have n observations, the expected value of the difference is shown to be \(O\left( {\sqrt {\log (n)/n} } \right)\) for many reasonable collections, such as the one mentioned above. General inequalities governing this error are given; they are combinatorial in nature, i.e., they are valid for all possible distributions of the data and for most practical collections of rules.
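To see where a rate of this form comes from in the simplest setting, consider a finite collection of \(N\) rules with true error probabilities \(L(1),\dots,L(N)\) and test-sample error estimates \(\hat L_n(1),\dots,\hat L_n(N)\). Hoeffding's inequality together with a union bound gives (a standard sketch for the finite case, not a result quoted from the paper):
\[
P\Bigl\{ \max_{1 \le j \le N} \bigl| \hat L_n(j) - L(j) \bigr| > \epsilon \Bigr\} \le 2 N e^{-2 n \epsilon^2},
\]
and since the selected index \(\hat\jmath\) minimizes \(\hat L_n\),
\[
L(\hat\jmath) - \min_{1 \le j \le N} L(j) \le 2 \max_{1 \le j \le N} \bigl| \hat L_n(j) - L(j) \bigr|,
\]
so the expected difference is \(O\bigl(\sqrt{\log (N)/n}\bigr)\); when \(N\) grows polynomially with \(n\), this is \(O\bigl(\sqrt{\log (n)/n}\bigr)\).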
The theory is based in part on the work of Vapnik and Chervonenkis regarding minimization of the empirical risk. For all proofs, technical details, and additional examples, we refer to Devroye (1986).
As a by-product, we establish that for some nonparametric rules, the probability of error of the selected rule converges to the Bayes probability of error at the optimal rate achievable within the given collection of nonparametric rules, and this without requiring knowledge of the optimal rate of convergence to the Bayes probability of error.
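The selection scheme itself is easy to sketch. The following Python fragment (an illustration only; the paper prescribes no implementation, and the synthetic data, function names, and the odd-k grid are our own choices) picks, among k-nearest-neighbor rules, the k with the smallest observed error on a holdout test sample:

```python
import random

def knn_classify(train, x, k):
    """Majority vote among the k training points nearest to x (1-D data)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > k else 0

def holdout_select(train, test, ks):
    """Return the k whose k-NN rule minimizes the observed test-sample error."""
    def test_error(k):
        errors = sum(knn_classify(train, x, k) != y for x, y in test)
        return errors / len(test)
    return min(ks, key=test_error)

random.seed(0)
# Synthetic 1-D two-class data: class 1 tends to have larger x values.
data = [(random.gauss(mu, 1.0), label)
        for label, mu in ((0, 0.0), (1, 2.0))
        for _ in range(100)]
random.shuffle(data)
train, test = data[:100], data[100:]  # n training and n test observations
best_k = holdout_select(train, test, ks=range(1, 50, 2))
```

By construction, the rule indexed by `best_k` has test-sample error no larger than that of any other k in the grid; the abstract's bound controls how far its true error probability can stray from the best in the collection.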
Research of the author was sponsored by NSERC Grant A3456 and by FCAR Grant EQ-16T8
References
O. Bashkirov, E.M. Braverman, and I.E. Muchnik, “Potential function algorithms for pattern recognition learning machines,” Automation and Remote Control, vol. 25, pp. 692–695, 1964.
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees, Wadsworth International, Belmont, CA., 1984.
T. Cacoullos, “Estimation of a multivariate density,” Annals of the Institute of Statistical Mathematics, vol. 18, pp. 179–190, 1965.
R.G. Casey and G. Nagy, “Decision tree design using a probabilistic model,” IEEE Transactions on Information Theory, vol. IT-30, pp. 93–99, 1984.
T.M. Cover, “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE Transactions on Electronic Computers, vol. EC-14, pp. 326–334, 1965.
T.M. Cover, “Learning in pattern recognition,” in Methodologies of Pattern Recognition, ed. S. Watanabe, pp. 111–132, Academic Press, New York, N.Y., 1969.
T.M. Cover and T.J. Wagner, “Topics in statistical pattern recognition,” Communication and Cybernetics, vol. 10, pp. 15–46, 1975.
L. Devroye, “A universal k-nearest neighbor procedure in discrimination,” in Proceedings of the 1978 IEEE Computer Society Conference on Pattern Recognition and Image Processing, pp. 142–147, 1978.
L. Devroye and T.J. Wagner, “Distribution-free performance bounds for potential function rules,” IEEE Transactions on Information Theory, vol. IT-25, pp. 601–604, 1979.
L. Devroye and T.J. Wagner, “Distribution-free performance bounds with the resubstitution error estimate,” IEEE Transactions on Information Theory, vol. IT-25, pp. 208–210, 1979.
L. Devroye and T.J. Wagner, “Distribution-free inequalities for the deleted and holdout error estimates,” IEEE Transactions on Information Theory, vol. IT-25, pp. 202–207, 1979.
L. Devroye and T.J. Wagner, “Distribution-free consistency results in nonparametric discrimination and regression function estimation,” Annals of Statistics, vol. 8, pp. 231–239, 1980.
L. Devroye, “Bounds for the uniform deviation of empirical measures,” Journal of Multivariate Analysis, vol. 12, pp. 72–79, 1982.
L. Devroye and L. Györfi, Nonparametric Density Estimation: The L1 View, John Wiley, New York, 1985.
L. Devroye, “Automatic pattern recognition: a study of the probability of error,” Technical Report, School of Computer Science, McGill University, 1986.
R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley, New York, N.Y., 1973.
B. Efron, “Bootstrap methods: another look at the jackknife,” Annals of Statistics, vol. 7, pp. 1–26, 1979.
B. Efron, “Estimating the error rate of a prediction rule: improvement on cross validation,” Journal of the American Statistical Association, vol. 78, pp. 316–331, 1983.
L. Feinholz, “Estimation of the performance of partitioning algorithms in pattern classification,” M.Sc. Thesis, Department of Mathematics, McGill University, Montreal, 1979.
E. Fix and J.L. Hodges, “Discriminatory analysis, nonparametric discrimination: consistency properties,” Report 21–49-004, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
E. Fix and J.L. Hodges, “Discriminatory analysis: small sample performance,” Report 21–49-004, USAF School of Aviation Medicine, Randolph Field, Texas, 1952.
N. Glick, “Sample-based classification procedures derived from density estimators,” Journal of the American Statistical Association, vol. 67, pp. 116–122, 1972.
N. Glick, “Sample-based classification procedures related to empiric distributions,” IEEE Transactions on Information Theory, vol. IT-22, pp. 454–461, 1976.
N. Glick, “Additive estimators for probabilities of correct classification,” Pattern Recognition, vol. 10, pp. 211–222, 1978.
W. Greblicki, A. Krzyzak, and M. Pawlak, “Distribution-free pointwise consistency of kernel regression estimate,” Annals of Statistics, vol. 12, pp. 1570–1575, 1984.
L.N. Kanal, “Pattern in pattern recognition,” IEEE Transactions on Information Theory, vol. IT-20, pp. 697–722, 1974.
Y.K. Lin and K.S. Fu, “Automatic classification of cervical cells using a binary tree classifier,” Pattern Recognition, vol. 16, pp. 69–80, 1983.
A.L. Lunts and V.L. Brailovsky, “Evaluation of attributes obtained in statistical decision rules,” Engineering Cybernetics, vol. 5, pp. 98–109, 1967.
P. Massart, “Vitesse de convergence dans le théorème de la limite centrale pour le processus empirique,” Ph.D. Dissertation, Université de Paris-Sud, Orsay, France, 1983.
W. Meisel, “Potential functions in mathematical pattern recognition,” IEEE Transactions on Computers, vol. C-18, pp. 911–918, 1969.
R.A. Olshen, “Comments on a paper by C.J. Stone,” Annals of Statistics, vol. 5, pp. 632–633, 1977.
E. Parzen, “On the estimation of a probability density function and the mode,” Annals of Mathematical Statistics, vol. 33, pp. 1065–1076, 1962.
H.J. Payne and W.S. Meisel, “An algorithm for constructing optimal binary decision trees,” IEEE Transactions on Computers, vol. C-26, pp. 905–916, 1977.
M. Rosenblatt, “Remark on some nonparametric estimates of a density function,” Annals of Mathematical Statistics, vol. 27, pp. 832–837, 1956.
G. Sebestyen, Decision Making Processes in Pattern Recognition, Macmillan, New York, N.Y., 1962.
I.K. Sethi and B. Chatterjee, “Efficient decision tree design for discrete variable pattern recognition problems,” Pattern Recognition, vol. 9, pp. 197–206, 1977.
I.K. Sethi and G.P.R. Sarvarayudu, “Hierarchical classifier design using mutual information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-4, pp. 441–445, 1982.
C. Spiegelman and J. Sacks, “Consistent window estimation in nonparametric regression,” Annals of Statistics, vol. 8, pp. 240–246, 1980.
C.J. Stone, “Consistent nonparametric regression,” Annals of Statistics, vol. 5, pp. 595–645, 1977.
M. Stone, “Cross-validatory choice and assessment of statistical predictions,” Journal of the Royal Statistical Society, Series B, vol. 36, pp. 111–147, 1974.
G.T. Toussaint, “Bibliography on estimation of misclassification,” IEEE Transactions on Information Theory, vol. IT-20, pp. 474–479, 1974.
J. Van Ryzin, “Bayes risk consistency of classification procedures using density estimation,” Sankhya Series A, vol. 28, pp. 161–170, 1966.
V.N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, 1982.
V.N. Vapnik and A. Ya. Chervonenkis, “Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal solution from empirical data,” Automation and Remote Control, vol. 32, pp. 207–217, 1971.
V.N. Vapnik and A. Ya. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” Theory of Probability and its Applications, vol. 16, pp. 264–280, 1971.
V.N. Vapnik and A. Ya. Chervonenkis, “Ordered risk minimization. I,” Automation and Remote Control, vol. 35, pp. 1226–1235, 1974.
V.N. Vapnik and A. Ya. Chervonenkis, “Ordered risk minimization. II,” Automation and Remote Control, vol. 35, pp. 1403–1412, 1974.
V.N. Vapnik and A. Ya. Chervonenkis, Theory of Pattern Recognition, Nauka, Moscow, 1974.
V. N. Vapnik and A. Ya. Chervonenkis, “Necessary and sufficient conditions for the uniform convergence of means to their expectations,” Theory of Probability and its Applications, vol. 26, pp. 532–553, 1981.
Copyright information
© 1987 Springer-Verlag Berlin Heidelberg
Cite this paper
Devroye, L. (1987). Automatic Selection of a Discrimination Rule Based upon Minimization of the Empirical Risk. In: Devijver, P.A., Kittler, J. (eds) Pattern Recognition Theory and Applications. NATO ASI Series, vol 30. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-83069-3_3
Print ISBN: 978-3-642-83071-6
Online ISBN: 978-3-642-83069-3