Adaptive Estimation

Wellner, Jon A.

doi:10.1007/978-1-4614-5544-8_5


Part of the book series: Selected Works in Probability and Statistics (SWPS, volume 13)


Abstract

I discuss four papers of Peter Bickel and coauthors: Bickel (1982), Bickel and Klaassen (1986), Bickel and Ritov (1987), and Ritov and Bickel (1990).


5.1 Introduction to Four Papers on Semiparametric and Nonparametric Estimation

5.1.1 Introduction: Setting the Stage


The four papers by Peter Bickel (and co-authors Chris Klaassen and Ya’acov Ritov) to be discussed here all deal with various aspects of estimation in semiparametric and nonparametric models. All four papers were published in the period 1982–1990, a time when semiparametric theory was in rapid development. Thus it might be useful to briefly review some of the key developments in statistical theory prior to 1982, the year in which Peter Bickel’s Wald lectures (given in 1980) appeared, in order to give some relevant background information. Because I was personally involved in some of these developments in the early 1980s, my account will necessarily be rather subjective and incomplete. I apologize in advance for oversights and a possibly incomplete version of the history.

A key spur for the development of theory for semiparametric models was the clear recognition by Neyman and Scott (1948) that maximum likelihood estimators are often inconsistent in the presence of an unbounded (with sample size) number of nuisance parameters. The simplest of these examples is as follows: suppose that

$$({X}_{i},{Y}_{i}) \sim {N}_{2}\left(({\mu}_{i},{\mu}_{i}),\ {\sigma}^{2}I\right),\quad i = 1,\ldots,n \tag{5.1}$$

are independent where ${\mu }_{i} \in \mathbb{R}$ for i = 1, …, n and σ² > 0. Then the maximum likelihood estimator of σ² is

$$\hat{\sigma}_{n}^{2} = (4n)^{-1}\sum_{i=1}^{n}(X_{i} - Y_{i})^{2} \rightarrow_{p} \frac{\sigma^{2}}{2}.$$
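The inconsistency can be seen in a small simulation. The following is a minimal sketch, not part of the original discussion: the function name `simulate_neyman_scott`, the uniform draw used to generate the nuisance means, and the sample size are all assumptions made for illustration. It compares the maximum likelihood estimator above with the rescaled estimator $(2n)^{-1}\sum_{i}(X_i - Y_i)^2$, which is consistent because $E(X_i - Y_i)^2 = 2\sigma^2$.

```python
import random


def simulate_neyman_scott(n, sigma, seed=0):
    """Draw n pairs from the functional model (5.1); return two estimates of sigma^2."""
    rng = random.Random(seed)
    sum_sq_diff = 0.0
    for _ in range(n):
        # Nuisance mean for pair i. Any values would do; the uniform draw
        # is an assumption of this sketch, not part of the model.
        mu_i = rng.uniform(-10.0, 10.0)
        x = rng.gauss(mu_i, sigma)
        y = rng.gauss(mu_i, sigma)
        sum_sq_diff += (x - y) ** 2
    mle = sum_sq_diff / (4 * n)        # maximum likelihood estimator from the text
    corrected = sum_sq_diff / (2 * n)  # rescaled using E(X - Y)^2 = 2 * sigma^2
    return mle, corrected


# With sigma = 2 the target is sigma^2 = 4, while the MLE settles near sigma^2 / 2 = 2.
mle, corrected = simulate_neyman_scott(n=200_000, sigma=2.0)
print(f"MLE of sigma^2:       {mle:.3f}")
print(f"corrected estimator:  {corrected:.3f}")
```

Note that the nuisance means cancel in the differences $X_i - Y_i$, so the behavior of both estimators does not depend on how the $\mu_i$ are chosen.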

This is an example of what has come to be known as a “functional model”. The corresponding “structural model” (or mixture or latent variable model) is: $(X_i, Y_i)$ are i.i.d. with density $p_{\sigma,G}$, where

$$p_{\sigma,G}(x,y) = \int \frac{1}{\sigma}\phi\left(\frac{x - \mu}{\sigma}\right)\frac{1}{\sigma}\phi\left(\frac{y - \mu}{\sigma}\right)dG(\mu)$$

where ϕ is the standard normal density, σ > 0, and G is a (mixing) distribution on $\mathbb{R}$. Equivalently,

$$\left (\begin{array}{c} X\\ Y \end{array} \right ) = \left (\begin{array}{c} Z\\ Z \end{array} \right )+\sigma \left (\begin{array}{c} \delta \\ \epsilon \end{array} \right )$$

where Z ∼ G is independent of (δ, ε) ∼ N₂(0, I), and only (X, Y) is observed. Here the nuisance parameters {μ_i, i = 1, …, n} of the functional model (5.1) have been replaced by the (nuisance) mixing distribution G. Kiefer and Wolfowitz (1956) studied general semiparametric models of this “structural” or mixture type, $\{p_{\theta,G} :\ \theta \in \Theta \subset \mathbb{R}^{d},\ G\ \text{a probability distribution}\}$, and established consistency of maximum likelihood estimators $(\hat{\theta}_{n},\hat{G}_{n})$ of (θ, G). (Further investigation of the properties of maximum likelihood estimators in structural models (or semiparametric mixture models) was pursued by Aad van der Vaart in the mid 1990s; I will return to this later.)

At nearly the same time as the work by Kiefer and Wolfowitz (1956), Stein (1956) studied efficient testing and estimation in problems with many nuisance parameters (or even nuisance functions) of a somewhat different type. In particular, Stein considered the one-sample symmetric location model

$$\mathcal{P}_{1} = \{p_{\theta,f}(x) = f(x - \theta) :\ \theta \in \mathbb{R},\ f\ \text{symmetric about}\ 0,\ I_{f} < \infty\}$$

and the two-sample (paired) shift model

$$\mathcal{P}_{2} = \{p_{\mu,\nu,f}(x,y) = f(x - \mu)f(y - \nu) :\ \mu,\nu \in \mathbb{R},\ I_{f} < \infty\};$$

here $I_f \equiv \int (f'/f)^2 f\,dx$. Stein (1956) studied testing and estimation in models ${\mathcal{P}}_{1}$ and ${\mathcal{P}}_{2}$, and established necessary conditions for “adaptive estimation”: for example, conditions under which the information bounds for estimation of θ in the model ${\mathcal{P}}_{1}$ are the same as the information bounds for estimation of θ in the sub-model in which f is known. Roughly speaking, these are both cases in which the efficient score and influence functions are orthogonal to the “nuisance tangent space” in $L_2^0(P)$; i.e. orthogonal to all possible score functions for regular parametric submodels for the infinite-dimensional part of the model. Models of this type, and in particular the symmetric location model ${\mathcal{P}}_{1}$, remained a focus of research during the period 1956–1982.
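To make the quantity $I_f$ concrete, it can be approximated numerically for specific symmetric densities. The sketch below is illustrative only: the helper `fisher_information_location`, the midpoint rule, the integration limits, and the choice of the standard normal and standard logistic densities are assumptions of this example and do not come from Stein (1956). The known values are $I_f = 1$ for the standard normal and $I_f = 1/3$ for the standard logistic.

```python
import math


def fisher_information_location(score, density, lo=-40.0, hi=40.0, steps=200_000):
    """Approximate I_f = integral of (f'/f)^2 * f over [lo, hi] by the midpoint rule.

    `score` is the function f'/f and `density` is f; both are supplied by the caller.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += score(x) ** 2 * density(x)
    return total * h


def normal_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_score(x):
    return -x  # f'/f for the standard normal density


def logistic_density(x):
    e = math.exp(-abs(x))  # the density is symmetric; abs() avoids overflow
    return e / (1.0 + e) ** 2


def logistic_score(x):
    return -math.tanh(x / 2.0)  # f'/f for the standard logistic density


i_normal = fisher_information_location(normal_score, normal_density)
i_logistic = fisher_information_location(logistic_score, logistic_density)
print(f"I_f (standard normal):   {i_normal:.6f}")
print(f"I_f (standard logistic): {i_logistic:.6f}")
```

In the sub-model with f known, $1/I_f$ is the information bound for estimating θ; adaptation in ${\mathcal{P}}_{1}$ means that the same bound applies when f is unknown.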

Over the period 1956–1982, considerable effort was devoted to finding sufficient conditions for the construction of “adaptive estimators” and “adaptive tests” in the context of the model ${\mathcal{P}}_{1}$. Hájek (1962) gave conditions for the construction of adaptive tests in the model ${\mathcal{P}}_{1}$; van Eeden (1970) gave a construction for the sub-model of ${\mathcal{P}}_{1}$ consisting of log-concave densities (for which the score function for location is monotone non-decreasing); Beran (1974) constructed efficient estimators based on ranks; and Stone (1975) gave a construction of efficient estimators based on an “estimated” one-step approach.

This, modulo a key paper by Efron (1977) on asymptotic efficiency of Cox’s partial likelihood estimators, was roughly the state of affairs of semiparametric theory in 1980–1982. Of course this is an oversimplification: much progress had been underway from a more nonparametric perspective from several quarters: the group around Lucien Le Cam in Berkeley, including P. W. Millar and R. Beran, the Russian school including I. Ibragimov and R. Has’minskii in (now) St. Petersburg and Y. A. Koshevnik and B. Levit in Moscow, and J. Pfanzagl in Cologne. Over the decade from 1982 to 1993 these two directions would merge and be understood as a whole piece of cloth, but that was not yet the case in 1980–1982, the period when Peter Bickel gave his Wald Lectures (and prepared them for publication).

5.1.2 Paper 1

The first of these four papers, On Adaptive Estimation, represents the culmination and summary of the first period of research on the phenomena of adaptive estimation uncovered by Stein (1956): it gives a masterful exposition of the state of “adaptive estimation” in the early 1980s, and new constructions of efficient estimators in several models satisfying Stein’s necessary conditions for “adaptive estimation”. Bickel (1982) begins with an explanation of “adaptive estimation”, with focus on the “i.i.d. case”, and introduces four key examples to be treated: (1) the one-sample symmetric location model ${\mathcal{P}}_{1}$ introduced above; (2) linear regression with symmetric errors; (3) linear regression with a constant and arbitrary errors, a model closely related to the two-sample shift model ${\mathcal{P}}_{2}$ introduced above; and (4) location and variance-covariance parameters of elliptic distributions. The paper then moves to an explanation of Stein’s necessary condition and a presentation of a (new) set of sufficient conditions for adaptive estimation involving $L_{2}(P_{\theta_{m},G})$-consistent estimation of the efficient influence function (“Condition H”). Bickel shows that the sufficient conditions are satisfied in Examples (1)–(4), and hence that adaptive estimators exist in each of these problems. It was also conjectured that Condition H is necessary for adaptation. Necessary and sufficient conditions only slightly stronger than “Condition H” were established by Schick (1986) and Klaassen (1987); see also Bickel et al. (1993, 1998), Sect. 7.8.

According to the ISI Web of Science, as of 20 June 2011, this paper has received 228 citations, and thus is the most cited of the four papers reviewed here. It inspired the search for necessary and sufficient conditions for adaptive estimation (including the papers by Schick (1986) and Klaassen (1987) mentioned above). It also implicitly raised the issue of understanding efficient estimation in semiparametric models more generally. This was the focus of my joint work with Janet Begun, W. J. (Jack) Hall, and Wei-Min Huang at the University of Rochester during the period 1979–1983, resulting in Begun et al. (1983), which I will refer to in the rest of this discussion as BHHW.

5.1.3 Paper 2

Neyman and Scott (1948) had focused on inconsistency of maximum likelihood estimators in functional models, and Kiefer and Wolfowitz (1956) showed that inconsistency of likelihood-based procedures was not a difficulty for the corresponding structural (or mixture) models. Bickel and Klaassen (1986) initiated the exploration of efficiency issues in connection with functional models, with a primary focus on functional models connected with the symmetric location model ${\mathcal{P}}_{1}$. In particular, this paper examined the functional model with $X_i \sim N(\theta, \sigma_i^2)$ independent with $\sigma_{i}^{2} \in \mathbb{R}^{+}$, $\theta \in \mathbb{R}$, for 1 ≤ i ≤ n. The corresponding structural model is the normal scale mixture model with shift parameter θ, and hence is a subset of ${\mathcal{P}}_{1}$. In fact, it is a very rich subset with nuisance parameter tangent spaces (for “typical” points in the model) agreeing with that of the model ${\mathcal{P}}_{1}$. The main result of the paper is a theorem giving precise conditions under which a modified version of the estimator of Stone (1975) is asymptotically efficient, again in a precise sense defined in the paper.

This paper inspired further work on efficiency issues in functional models: see e.g. Pfanzagl (1993) and Strasser (1996). According to the ISI Web of Science (20 June 2011), it has been cited 15 times. These types of models remain popular (in September 2011, MathSciNet gives 414 hits for “functional model” and 480 hits for “structural model”), but many problems remain.

Between 1982 and the publication of this paper in 1986, the paper Begun et al. (1983) appeared. In June 1983 Peter Bickel and I had given a series of lectures at Johns Hopkins University on semiparametric theory as it stood at that time, and had started writing a book on the subject together with Klaassen and Ritov, Bickel et al. (1993, 1998), which was optimistically announced in the references for this paper as “BKRW (1987)”.

5.1.4 Paper 3

This paper, Bickel and Ritov (1987), treats efficiency of estimation in the structural (or mixture model) version of the errors-in-variables model dating back at least to Neyman and Scott (1948) and Reiersol (1950), and perhaps earlier. As noted by the authors: “Estimates of β in the general Gaussian error model, with $\Sigma_0$ diagonal, have been proposed by a variety of authors including Neyman and Scott (1948) and Rubin (1956). In the arbitrary independent error model, Wolfowitz in a series of papers ending in 1957, Kiefer, Wolfowitz, and Spiegelman (1979) by a variety of methods gave estimates, which are consistent and in Spiegelman’s case $n^{1/2}$-consistent and asymptotically. Little seems to be known about the efficiency of these procedures other than that in the restricted Gaussian model …”. This model is among the first semiparametric mixture models involving a nontrivial projection in the calculation of the efficient score function to receive a thorough analysis and constructions of asymptotically efficient estimators. The authors gave an explicit construction of estimators achieving the information bound in a very detailed analysis requiring 17 pages of careful argument.

The type of construction used by the authors involves kernel smoothing estimators of the nonparametric part of the model, and hence brings in choices of smoothing kernels and smoothing parameters ($\epsilon_n$, $c_n$ and $\nu_n$ in the authors’ notation, with $n c_n^2 \nu_n^6 \to \infty$). This same approach was used by van der Vaart (1988) to construct efficient estimators in a whole class of structural models of this same type; van der Vaart’s construction involved the choice of seven different smoothing parameters. On the other hand, Pfanzagl (1990a), pages 47 and 48 (see also Pfanzagl 1990b), pointed out that the resulting estimators are rather artificial in some sense, and advocated maximum likelihood or other procedures requiring no (or at least fewer) smoothing parameter choices. This approach was pursued in van der Vaart (1996). Forty years after Kiefer and Wolfowitz established consistency of maximum likelihood procedures, van der Vaart proved efficiency of maximum likelihood in several particular structural models (under moment conditions which are sufficient but very likely not necessary), including the errors-in-variables model treated in the paper under review. The proofs in van der Vaart (1996) proceed via careful use of empirical process theory. Furthermore, Murphy and van der Vaart (1996) succeeded in extending the maximum likelihood estimators to confidence sets via profile likelihood considerations.

This paper has 35 citations in the ISI Web of Science as of 20 June 2011, but it inspired considerable further work on efficiency bounds and especially on alternative methods for construction of efficient estimators.

5.1.5 Paper 4

In the period 1988–1991 several key questions on the “boundary” between nonparametric and semiparametric estimation came under close examination by van der Vaart, Bickel and Ritov, and Donoho and Liu. The lower bound theory under development for publication in BKRW (1993) relied upon Hellinger differentiability of real-valued functionals. (The lower bound theory based on pathwise Hellinger differentiability was put in a very nice form by van der Vaart (1991).)

But the possibility of a gap between the conditions for differentiability and sufficient conditions to attain the bounds became a nagging question. In Ritov and Bickel (1990), Peter and Ya’acov analyzed the situation in complete detail for the real-valued functional $\nu(P) = \int p^{2}(x)\,dx$ defined for the collection $\mathcal{P}$ of distributions P on [0, 1] with a density p with respect to Lebesgue measure. This functional turns out to be Hellinger differentiable at all such densities p with an information lower bound given by

$$I_{\nu}^{-1} = 4\,\mathrm{Var}(p(X)) = 4\int (p(x) - \nu(P))^{2}\,p(x)\,dx.$$
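As a worked illustration of this bound (the two densities here are chosen for convenience and are not taken from the paper), consider first $p(x) = 2x$ on [0, 1]. Then

$$\nu(P) = \int_{0}^{1} (2x)^{2}\,dx = \frac{4}{3},\qquad E\,p(X)^{2} = \int_{0}^{1} (2x)^{2}\,2x\,dx = 2,\qquad I_{\nu}^{-1} = 4\left(2 - \frac{16}{9}\right) = \frac{8}{9}.$$

For the uniform density $p \equiv 1$ on [0, 1], by contrast, $\nu(P) = 1$ and $p(X)$ is constant, so $\mathrm{Var}(p(X)) = 0$ and the bound degenerates to zero.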

However, Theorem 1 of Ritov and Bickel (1990) shows that there exist distributions $P \in \mathcal{P}$ such that every sequence of estimators of ν(P) converges to ν(P) more slowly than $n^{-\alpha}$ for every α > 0. It had earlier been shown by Ibragimov and Hasminskii (1979) that the $\sqrt{n}$ convergence rate could be achieved for densities satisfying a Hölder condition of order at least 1/2, and in a companion paper to the one under discussion, Bickel and Ritov (1988), Peter and Ya’acov showed that this continued to hold for densities p satisfying a Hölder condition of order at least 1/4.

These results have been extended to obtain rates of convergence in the “non-regular” or nonparametric domain: see Birgé and Massart (1993, 1995) and Laurent and Massart (2000). More recently the techniques of analysis have been extended still further by Tchetgen et al. (2008) and Robins et al. (2009). As of 20 June 2011, this paper has been cited 45 times (ISI Web of Science).

5.1.6 Summary and Further Problems

The four papers reviewed here represent only a small fraction of Peter Bickel’s work on the theory of semiparametric models, but they illustrate his superb judgement in choosing problems that both advance the general theory of semiparametric models and have relevance for applications. They also showcase his wonderful ability to see his way through the technicalities of problems to solutions that are of theoretical importance and that point the way forward to further understanding. Paper 1 was clearly important in the development of general theory for the adaptive case beyond the location and shift models ${\mathcal{P}}_{1}$ and ${\mathcal{P}}_{2}$. Paper 2 initiated efficiency theory for estimation in functional models quite generally. Paper 3 played an important role in illustrating how semiparametric theory could be applied to the structural (or mixing) form of the classical errors-in-variables model, hence yielding one of the first substantial models to be discussed in detail in the “non-adaptive case” in which calculation of the efficient score and efficient influence function requires a non-trivial projection.

As noted by Kosorok (2009) semiparametric models continue to be of great interest because of their “… genuine scientific utility … combined with the breadth and depth of the many theoretical questions that remain to be answered”.

Figure 5.1 gives an update of Fig. 2.1 of Wellner et al. (2006). The trend is clearly increasing!

References

Begun JM, Hall WJ, Huang W-M, Wellner JA (1983) Information and asymptotic efficiency in parametric–nonparametric models. Ann Stat 11(2):432–452
Beran R (1974) Asymptotically efficient adaptive rank estimates in location models. Ann Stat 2:63–74
Bickel PJ (1982) On adaptive estimation. Ann Stat 10(3):647–671
Bickel PJ, Klaassen CAJ (1986) Empirical Bayes estimation in functional and structural models, and uniformly adaptive estimation of location. Adv Appl Math 7(1):55–69
Bickel PJ, Ritov Y (1987) Efficient estimation in the errors in variables model. Ann Stat 15(2):513–540
Bickel PJ, Ritov Y (1988) Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā Ser A 50(3):381–393
Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA (1993) Efficient and adaptive estimation for semiparametric models. Johns Hopkins series in the mathematical sciences. Johns Hopkins University Press, Baltimore
Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA (1998) Efficient and adaptive estimation for semiparametric models. Springer, New York. Reprint of the 1993 original
Birgé L, Massart P (1993) Rates of convergence for minimum contrast estimators. Probab Theory Relat Fields 97(1–2):113–150
Birgé L, Massart P (1995) Estimation of integral functionals of a density. Ann Stat 23(1):11–29
Efron B (1977) The efficiency of Cox’s likelihood function for censored data. J Am Stat Assoc 72(359):557–565
Hájek J (1962) Asymptotically most powerful rank-order tests. Ann Math Stat 33:1124–1147
Ibragimov IA, Khasminskii RZ (1981) Statistical estimation: asymptotic theory. Springer Verlag, New York (Russian ed. 1979)
Kiefer J, Wolfowitz J (1956) Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann Math Stat 27:887–906
Klaassen CAJ (1987) Consistent estimation of the influence function of locally asymptotically linear estimators. Ann Stat 15(4):1548–1562
Kosorok MR (2009) What’s so special about semiparametric methods? Sankhyā 71(2, Ser A):331–353
Laurent B, Massart P (2000) Adaptive estimation of a quadratic functional by model selection. Ann Stat 28(5):1302–1338
Murphy SA, van der Vaart AW (1996) Likelihood inference in the errors-in-variables model. J Multivar Anal 59(1):81–108
Neyman J, Scott EL (1948) Consistent estimates based on partially consistent observations. Econometrica 16:1–32
Pfanzagl J (1990a) Estimation in semiparametric models: some recent developments. Lecture notes in statistics, vol 63. Springer, New York
Pfanzagl J (1990b) Large deviation probabilities for certain nonparametric maximum likelihood estimators. Ann Stat 18(4):1868–1877
Pfanzagl J (1993) Incidental versus random nuisance parameters. Ann Stat 21(4):1663–1691
Reiersol O (1950) Identifiability of a linear relation between variables which are subject to error. Econometrica 18:375–389
Ritov Y, Bickel PJ (1990) Achieving information bounds in non and semiparametric models. Ann Stat 18(2):925–938
Robins J, Tchetgen Tchetgen E, Li L, van der Vaart A (2009) Semiparametric minimax rates. Electron J Stat 3:1305–1321
Rubin H (1956) Uniform convergence of random functions with applications to statistics. Ann Math Stat 27:200–203
Schick A (1986) On asymptotically efficient estimation in semiparametric models. Ann Stat 14(3):1139–1151
Stein C (1956) Efficient nonparametric testing and estimation. In: Proceedings of the third Berkeley symposium on mathematical statistics and probability, 1954–1955, vol I. University of California Press, Berkeley/Los Angeles, pp 187–195
Stone CJ (1975) Adaptive maximum likelihood estimators of a location parameter. Ann Stat 3:267–284
Strasser H (1996) Asymptotic efficiency of estimates for models with incidental nuisance parameters. Ann Stat 24(2):879–901
Tchetgen E, Li L, Robins J, van der Vaart A (2008) Minimax estimation of the integral of a power of a density. Stat Probab Lett 78(18):3307–3311
van der Vaart AW (1988) Estimating a real parameter in a class of semiparametric models. Ann Stat 16(4):1450–1474
van der Vaart A (1991) On differentiable functionals. Ann Stat 19(1):178–204
van der Vaart A (1996) Efficient maximum likelihood estimation in semiparametric mixture models. Ann Stat 24(2):862–878
van Eeden C (1970) Efficiency-robust estimation of location. Ann Math Stat 41:172–181
Wellner JA, Klaassen CAJ, Ritov Y (2006) Semiparametric models: a review of progress since BKRW (1993). In: Frontiers in statistics. Imperial College Press, London, pp 25–44


Author information

Authors and Affiliations

Department of Statistics, University of Washington, Seattle, WA, USA
Jon A. Wellner


Corresponding author

Correspondence to Jon A. Wellner.

Editor information

Editors and Affiliations

Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey, USA
Jianqing Fan
Department of Statistics, Hebrew University of Jerusalem, Jerusalem, Israel
Ya’acov Ritov
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA
C. F. Jeff Wu


Cite this chapter

Wellner, J.A. (2014). Adaptive Estimation. In: Fan, J., Ritov, Y., Wu, C.F.J. (eds) Selected Works of Peter J. Bickel. Selected Works in Probability and Statistics, vol 13. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-5544-8_5


DOI: https://doi.org/10.1007/978-1-4614-5544-8_5
Published: 08 October 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-5543-1
Online ISBN: 978-1-4614-5544-8
eBook Packages: Mathematics and Statistics (R0)
