Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11100)

Abstract

This paper reviews the checkered history of predictive distributions in statistics and discusses two developments, one from the recent literature and the other new. The first development is bringing predictive distributions into machine learning, whose early development was so deeply influenced by two remarkable groups at the Institute of Automation and Remote Control. As a result, predictive distributions become more robust, and their validity ceases to depend on Bayesian or narrow parametric assumptions. The second development is combining predictive distributions with kernel methods, which originated in one of those groups, the one that included Emmanuel Braverman. As a result, predictive distributions become more flexible and, therefore, their predictive efficiency improves significantly for realistic non-linear data sets.


References

  1. Burnaev, E., Vovk, V.: Efficiency of conformalized ridge regression. In: JMLR: Workshop and Conference Proceedings, COLT 2014, vol. 35, pp. 605–622 (2014)

  2. Burnaev, E.V., Nazarov, I.N.: Conformalized kernel ridge regression. Technical report arXiv:1609.05959 [stat.ML], arXiv.org e-Print archive, September 2016. Conference version: Proceedings of the Fifteenth International Conference on Machine Learning and Applications (ICMLA 2016), pp. 45–52

  3. Chatterjee, S., Hadi, A.S.: Sensitivity Analysis in Linear Regression. Wiley, New York (1988)

  4. Cox, D.R.: Some problems connected with statistical inference. Ann. Math. Stat. 29, 357–372 (1958)

  5. Dawid, A.P.: Statistical theory: the prequential approach (with discussion). J. Royal Stat. Soc. A 147, 278–292 (1984)

  6. Dawid, A.P., Vovk, V.: Prequential probability: principles and properties. Bernoulli 5, 125–162 (1999)

  7. Efron, B.: R. A. Fisher in the 21st century. Stat. Sci. 13, 95–122 (1998)

  8. Gneiting, T., Katzfuss, M.: Probabilistic forecasting. Ann. Rev. Stat. Appl. 1, 125–151 (2014)

  9. Goldberg, P.W., Williams, C.K.I., Bishop, C.M.: Regression with input-dependent noise: a Gaussian process treatment. In: Jordan, M.I., Kearns, M.J., Solla, S.A. (eds.) Advances in Neural Information Processing Systems 10, pp. 493–499. MIT Press, Cambridge (1998)

  10. Henderson, H.V., Searle, S.R.: On deriving the inverse of a sum of matrices. SIAM Rev. 23, 53–60 (1981)

  11. Knight, F.H.: Risk, Uncertainty, and Profit. Houghton Mifflin, Boston (1921)

  12. Le, Q.V., Smola, A.J., Canu, S.: Heteroscedastic Gaussian process regression. In: Dechter, R., Richardson, T. (eds.) Proceedings of the Twenty-Second International Conference on Machine Learning, pp. 461–468. ACM, New York (2005)

  13. McCullagh, P., Vovk, V., Nouretdinov, I., Devetyarov, D., Gammerman, A.: Conditional prediction intervals for linear regression. In: Proceedings of the Eighth International Conference on Machine Learning and Applications (ICMLA 2009), pp. 131–138 (2009). http://www.stat.uchicago.edu/~pmcc/reports/predict.pdf

  14. Montgomery, D.C., Peck, E.A., Vining, G.G.: Introduction to Linear Regression Analysis, 5th edn. Wiley, Hoboken (2012)

  15. Platt, J.C.: Probabilities for SV machines. In: Smola, A.J., Bartlett, P.L., Schölkopf, B., Schuurmans, D. (eds.) Advances in Large Margin Classifiers, pp. 61–74. MIT Press (2000)

  16. Schweder, T., Hjort, N.L.: Confidence, Likelihood, Probability: Statistical Inference with Confidence Distributions. Cambridge University Press, Cambridge (2016)

  17. Shen, J., Liu, R., Xie, M.: Prediction with confidence: a general framework for predictive inference. J. Stat. Plann. Infer. 195, 126–140 (2018)

  18. Shiryaev, A.N.: Probability, 3rd edn. (in Russian). Moscow (2004)

  19. Snelson, E., Ghahramani, Z.: Variable noise and dimensionality reduction for sparse Gaussian processes. In: Dechter, R., Richardson, T. (eds.) Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI 2006), pp. 461–468. AUAI Press, Arlington (2006)

  20. Steinwart, I.: On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res. 2, 67–93 (2001)

  21. Thomas-Agnan, C.: Computing a family of reproducing kernels for statistical applications. Numer. Algorithms 13, 21–32 (1996)

  22. Vovk, V.: Universally consistent predictive distributions. Technical report arXiv:1708.01902 [cs.LG], arXiv.org e-Print archive, August 2017

  23. Vovk, V., Gammerman, A., Shafer, G.: Algorithmic Learning in a Random World. Springer, New York (2005)

  24. Vovk, V., Nouretdinov, I., Gammerman, A.: On-line predictive linear regression. Ann. Stat. 37, 1566–1590 (2009)

  25. Vovk, V., Papadopoulos, H., Gammerman, A. (eds.): Measures of Complexity: Festschrift for Alexey Chervonenkis. Springer, Heidelberg (2015)

  26. Vovk, V., Petej, I.: Venn-Abers predictors. In: Zhang, N.L., Tian, J. (eds.) Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 829–838. AUAI Press, Corvallis (2014)

  27. Vovk, V., Shen, J., Manokhin, V., Xie, M.: Nonparametric predictive distributions based on conformal prediction. In: Proceedings of Machine Learning Research, COPA 2017, vol. 60, pp. 82–102 (2017)

  28. Wasserman, L.: Frasian inference. Stat. Sci. 26, 322–325 (2011)

  29. Zadrozny, B., Elkan, C.: Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Brodley, C.E., Danyluk, A.P. (eds.) Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 609–616. Morgan Kaufmann, San Francisco (2001)

Acknowledgements

This work has been supported by the EU Horizon 2020 Research and Innovation programme (in the framework of the ExCAPE project under grant agreement 671555) and AstraZeneca (in the framework of the project “Machine Learning for Chemical Synthesis”).

Author information

Corresponding author

Correspondence to Vladimir Vovk.

A Properties of the Hat Matrix

In the kernelized setting of this paper the hat matrix is defined as \(H=(K+aI)^{-1}K\), where K is a symmetric positive semidefinite matrix whose size is denoted \(n\times n\) in this appendix (cf. (10); in our current abstract setting we drop the bars over H and K and write n in place of \(n+1\)). We will prove, or give references for, various properties of the hat matrix used in the main part of the paper.

Numerous useful properties of the hat matrix can be found in the literature (see, e.g., [3]). However, the usual definition of the hat matrix is different from ours, since it is not kernelized; therefore, we start by reducing our kernelized definition to the standard one. Since K is symmetric positive semidefinite, it can be represented in the form \(K=XX'\) for some matrix X, whose size will be denoted \(n\times p\) (in fact, a matrix is symmetric positive semidefinite if and only if it can be represented as the Gram matrix of n vectors; this easily follows from the fact that a symmetric positive semidefinite K can be diagonalized: \(K=Q'\varLambda Q\), where Q and \(\varLambda \) are \(n\times n\) matrices, \(\varLambda \) is diagonal with nonnegative entries, and \(Q'Q=I\)). Now we can transform the hat matrix as

$$ H = (K+aI)^{-1}K = (XX'+aI)^{-1}XX' = X(X'X+aI)^{-1}X' $$

(the last equality can be checked by multiplying both sides by \((XX'+aI)\) on the left). If we now extend X by adding \(\sqrt{a}I_p\) on top of it (where \(I_p=I\) is the \(p\times p\) unit matrix),

$$\begin{aligned} \tilde{X} := \begin{pmatrix} \sqrt{a}I_p \\ X \end{pmatrix}, \end{aligned}$$
(30)

and set

$$\begin{aligned} \tilde{H} := \tilde{X}(\tilde{X}'\tilde{X})^{-1}\tilde{X}' = \tilde{X}(X'X+aI)^{-1}\tilde{X}', \end{aligned}$$
(31)

we will obtain a \((p+n)\times (p+n)\) matrix containing H in its lower right \(n\times n\) corner. To find HY for a vector \(Y\in \mathbb {R}^n\), we can extend Y to \(\tilde{Y}\in \mathbb {R}^{p+n}\) by adding p zeros at the beginning of Y and then discard the first p elements in \(\tilde{H} \tilde{Y}\). Notice that \(\tilde{H}\) is the usual definition of the hat matrix associated with the data matrix \(\tilde{X}\) (cf. [3, (1.4a)]).
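
This reduction is easy to check numerically. The following sketch in Python with NumPy (the sizes n and p and the ridge parameter a are illustrative assumptions, not values from the paper) verifies the identity \((XX'+aI)^{-1}XX' = X(X'X+aI)^{-1}X'\), checks that \(\tilde{H}\) contains H in its lower-right \(n\times n\) corner, and recovers HY by padding Y with p zeros as described above.

```python
# Minimal numerical sanity check (toy data; not code from the paper).
import numpy as np

rng = np.random.default_rng(0)
n, p, a = 8, 3, 0.5
X = rng.standard_normal((n, p))
K = X @ X.T                                              # symmetric positive semidefinite kernel matrix

H1 = np.linalg.solve(K + a * np.eye(n), K)               # (K + aI)^{-1} K
H2 = X @ np.linalg.solve(X.T @ X + a * np.eye(p), X.T)   # X (X'X + aI)^{-1} X'
assert np.allclose(H1, H2)                               # the push-through identity

X_tilde = np.vstack([np.sqrt(a) * np.eye(p), X])                     # cf. (30)
H_tilde = X_tilde @ np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T)  # cf. (31)
assert np.allclose(H_tilde[p:, p:], H1)                  # lower-right n x n corner of H_tilde is H

# HY via the extended matrix: pad Y with p zeros, then drop the first p entries.
Y = rng.standard_normal(n)
Y_tilde = np.concatenate([np.zeros(p), Y])
assert np.allclose((H_tilde @ Y_tilde)[p:], H1 @ Y)
```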

When discussing (11), we used the fact that the diagonal elements of H are in [0, 1). It is well-known that the diagonal elements of the usual hat matrix, such as \(\tilde{H}\), are in [0, 1] (see, e.g., [3, Property 2.5(a)]). Therefore, the diagonal elements of H are also in [0, 1]. Let us check that \(h_i\) are in fact in the semi-open interval [0, 1) directly, without using the representation in terms of \(\tilde{H}\). Representing \(K=Q'\varLambda Q\) as above, where \(\varLambda \) is diagonal with nonnegative entries and \(Q'Q=I\), we have

$$\begin{aligned} H = (K+aI)^{-1} K&= (Q'\varLambda Q+aI)^{-1} Q'\varLambda Q = (Q'(\varLambda +aI)Q)^{-1} Q'\varLambda Q \nonumber \\&= Q^{-1}(\varLambda +aI)^{-1}(Q')^{-1} Q'\varLambda Q = Q'(\varLambda +aI)^{-1}\varLambda Q. \end{aligned}$$
(32)

The matrix \((\varLambda +aI)^{-1}\varLambda \) is diagonal with the diagonal entries in the semi-open interval [0, 1). Since \(Q'Q=I\), the columns of Q are vectors of length 1. By (32), each diagonal element of H is then of the form \(\sum _{i=1}^n \lambda _i q_i^2\), where all \(\lambda _i\in [0,1)\) and \(\sum _{i=1}^n q_i^2 = 1\). We can see that each diagonal element of H is in [0, 1).
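
The eigenvalue argument behind (32) can be illustrated with a small numerical check (again on assumed toy data): each eigenvalue \(\lambda \) of K is mapped to \(\lambda /(\lambda +a)\in [0,1)\), and the diagonal of H stays in [0, 1) up to rounding error.

```python
# Illustrative check of the diagonal of H = (K + aI)^{-1} K (toy data, not from the paper).
import numpy as np

rng = np.random.default_rng(1)
n, a = 10, 0.1
X = rng.standard_normal((n, 4))
K = X @ X.T                                       # positive semidefinite, rank at most 4
H = np.linalg.solve(K + a * np.eye(n), K)         # H = (K + aI)^{-1} K

lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)   # clip tiny negative round-off in zero eigenvalues
assert np.all(lam / (lam + a) < 1)                # each eigenvalue is mapped into [0, 1)
assert np.all(np.diag(H) > -1e-10) and np.all(np.diag(H) < 1)
```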

The equality (11) itself was used only for motivation, so we do not prove it; for a proof in the non-kernelized case, see, e.g., [14, (4.11) and Appendix C.7].

Proof of Lemma 2

In our proof of \(B_i>0\) we will assume \(a>0\), as usual. We will apply the results discussed so far in this appendix to the matrix \(\bar{H}\) in place of H and to \(n+1\) in place of n.

Our goal is to check the strict inequality

$$\begin{aligned} \sqrt{1-\bar{h}_{n+1}} + \frac{\bar{h}_{i,n+1}}{\sqrt{1-\bar{h}_{i}}} > 0; \end{aligned}$$
(33)

remember that both \(\bar{h}_{n+1}\) and \(\bar{h}_i\) are numbers in the semi-open interval [0, 1). The inequality (33) can be rewritten as

$$\begin{aligned} \bar{h}_{i,n+1} > -\sqrt{(1-\bar{h}_{n+1})(1-\bar{h}_{i})} \end{aligned}$$
(34)

and in the weakened form

$$\begin{aligned} \bar{h}_{i,n+1} \ge -\sqrt{(1-\bar{h}_{n+1})(1-\bar{h}_{i})} \end{aligned}$$
(35)

follows from [3, Property 2.6(b)] (which can be applied to \(\tilde{H}\)).

Instead of the original hat matrix \(\bar{H}\) we will consider the extended matrix (31), where \(\tilde{X}\) is defined by (30) with \(\bar{X}\) in place of X. The elements of \(\tilde{H}\) will be denoted \(\tilde{h}\) with suitable indices, which will run from \(-p+1\) to \(n+1\), in order to have the familiar indices for the submatrix \(\bar{H}\). We will assume that equality holds in (34) and arrive at a contradiction. Equality will still hold in (34) if we replace \(\bar{h}\) by \(\tilde{h}\), since \(\tilde{H}\) contains \(\bar{H}\). Consider auxiliary “random residuals” \(E:=(I-\tilde{H})\epsilon \), where \(\epsilon \) is a standard Gaussian random vector in \(\mathbb {R}^{p+n+1}\); there are \(p+n+1\) random residuals \(E_{-p+1},\ldots ,E_{n+1}\). Since the correlation between the random residuals \(E_i\) and \(E_{n+1}\) is

$$ {{\mathrm{corr}}}(E_i,E_{n+1}) = \frac{-\tilde{h}_{i,n+1}}{\sqrt{(1-\tilde{h}_{n+1})(1-\tilde{h}_{i})}} $$

(this easily follows from \(I-\tilde{H}\) being a projection matrix and is given in, e.g., [3, p. 11]), and a correlation can never exceed 1, (35) is indeed true. Since equality holds in (34) (with \(\tilde{h}\) in place of \(\bar{h}\)), \(E_i\) and \(E_{n+1}\) are perfectly correlated. Remember that neither row number i nor row number \(n+1\) of the matrix \(I-\tilde{H}\) is zero (the corresponding diagonal elements of \(\tilde{H}\) coincide with those of \(\bar{H}\), which are in the semi-open interval [0, 1)), and so neither \(E_i\) nor \(E_{n+1}\) is identically zero. Since \(E_i\) and \(E_{n+1}\) are perfectly correlated, row number i of the matrix \(I-\tilde{H}\) is equal to a positive scalar c times its row number \(n+1\). The projection matrix \(I-\tilde{H}\) then projects \(\mathbb {R}^{p+n+1}\) onto a subspace of the hyperplane in \(\mathbb {R}^{p+n+1}\) consisting of the points whose coordinate number i is c times their coordinate number \(n+1\). The orthogonal complement of this subspace, i.e., the range of \(\tilde{H}\), will contain the vector \((0,\ldots ,0,-1,0,\ldots ,0,c)\) (\(-1\) being its coordinate number i and c its coordinate number \(n+1\)). Therefore, this vector will be in the range of \(\tilde{X}\) (cf. (31)), i.e., a linear combination of the columns of the extended matrix (30) (with \(\bar{X}\) in place of X). This is impossible: because of the first p rows \(\sqrt{a}I_p\) of the extended matrix, a linear combination of its columns has its first p coordinates equal to zero only if all coefficients are zero, in which case the combination itself is zero, whereas the vector in question is not.
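
The displayed correlation formula and the weak inequality (35) can also be checked numerically. The sketch below (assumed toy data, not from the paper) uses the fact that \(I-\tilde{H}\) is a symmetric projection, so the covariance matrix of E is \(I-\tilde{H}\) itself; since a correlation never exceeds 1, (35) follows.

```python
# Numerical illustration of the residual-correlation formula and of (35) (toy data).
import numpy as np

rng = np.random.default_rng(2)
n, p, a = 7, 3, 0.3
X_bar = rng.standard_normal((n + 1, p))
X_tilde = np.vstack([np.sqrt(a) * np.eye(p), X_bar])                 # cf. (30)
H_tilde = X_tilde @ np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T)  # cf. (31)

P = np.eye(p + n + 1) - H_tilde          # I - H_tilde
Cov = P @ P.T                            # covariance matrix of E = P eps, eps ~ N(0, I)
assert np.allclose(Cov, P)               # P is a symmetric projection, so Cov(E) = I - H_tilde

d = np.sqrt(np.diag(Cov))                # standard deviations sqrt(1 - h_ii)
corr = Cov[:-1, -1] / (d[:-1] * d[-1])   # corr(E_i, E_{n+1}) for i < n+1
formula = -H_tilde[:-1, -1] / np.sqrt((1 - np.diag(H_tilde)[:-1]) * (1 - H_tilde[-1, -1]))
assert np.allclose(corr, formula)        # matches the displayed formula
assert np.all(corr <= 1 + 1e-12)         # a correlation never exceeds 1, which gives (35)
```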

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Vovk, V., Nouretdinov, I., Manokhin, V., Gammerman, A. (2018). Conformal Predictive Distributions with Kernels. In: Rozonoer, L., Mirkin, B., Muchnik, I. (eds.) Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. Lecture Notes in Computer Science, vol. 11100. Springer, Cham. https://doi.org/10.1007/978-3-319-99492-5_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99491-8

  • Online ISBN: 978-3-319-99492-5
