5.1 Introduction to Four Papers on Semiparametric and Nonparametric Estimation

5.1.1 Introduction: Setting the Stage

I discuss four papers of Peter Bickel and coauthors: Bickel (1982), Bickel and Klaassen (1986), Bickel and Ritov (1987), and Ritov and Bickel (1990).

The four papers by Peter Bickel (and co-authors Chris Klaassen and Ya’acov Ritov) to be discussed here all deal with various aspects of estimation in semiparametric and nonparametric models. All four papers were published in the period 1982–1990, a time when semiparametric theory was in rapid development. Thus it might be useful to briefly review some of the key developments in statistical theory prior to 1982, the year in which Peter Bickel’s Wald lectures (given in 1980) appeared, in order to give some relevant background information. Because I was personally involved in some of these developments in the early 1980s, my account will necessarily be rather subjective and incomplete. I apologize in advance for oversights and a possibly incomplete version of the history.

A key spur for the development of theory for semiparametric models was the clear recognition by Neyman and Scott (1948) that maximum likelihood estimators are often inconsistent in the presence of an unbounded (with sample size) number of nuisance parameters. The simplest of these examples is as follows: suppose that

$$({X}_{i},{Y}_{i}) \sim {N}_{2}(({\mu}_{i},{\mu}_{i}),{\sigma}^{2}I),\quad i = 1,\ldots,n \qquad (5.1)$$

are independent where \({\mu }_{i} \in \mathbb{R}\) for \(i = 1,\ldots,n\) and \({\sigma}^{2} > 0\). Then the maximum likelihood estimator of \({\sigma}^{2}\) is

$$\hat{{\sigma }}_{n}^{2} = {(4n)}^{-1}{ \sum \nolimits }_{i=1}^{n}{({X}_{ i} - {Y }_{i})}^{2} {\rightarrow }_{ p}\frac{{\sigma }^{2}} {2}.$$

This is an example of what has come to be known as a “functional model”. The corresponding “structural model” (or mixture or latent variable model) is: \(({X}_{i},{Y}_{i})\) are i.i.d. with density \({p}_{\sigma, G}\), where

$${p}_{\sigma, G}(x,y) = \int \frac{1} {\sigma }\phi \left (\frac{x - \mu } {\sigma } \right ) \frac{1} {\sigma }\phi \left (\frac{y - \mu } {\sigma } \right )dG(\mu )$$

where ϕ is the standard normal density, σ > 0, and G is a (mixing) distribution on \(\mathbb{R}\). Equivalently,

$$\left (\begin{array}{c} X\\ Y \end{array} \right ) = \left (\begin{array}{c} Z\\ Z \end{array} \right )+\sigma \left (\begin{array}{c} \delta \\ \epsilon \end{array} \right )$$

where Z ∼ G is independent of \((\delta, \epsilon) \sim {N}_{2}(0, I)\), and only (X, Y) is observed. Here the nuisance parameters \(\{{\mu}_{i},\ i = 1,\ldots,n\}\) of the functional model (5.1) have been replaced by the (nuisance) mixing distribution G. Kiefer and Wolfowitz (1956) studied general semiparametric models of this “structural” or mixture type, \(\{{p}_{\theta, G} :\ \theta \in \Theta \subset {\mathbb{R}}^{d},\ G\ \mbox{ a probability distribution}\}\), and established consistency of maximum likelihood estimators \((\hat{{\theta }}_{n},\hat{{G}}_{n})\) of (θ, G). (Further investigation of the properties of maximum likelihood estimators in structural models (or semiparametric mixture models) was pursued by Aad van der Vaart in the mid 1990s; I will return to this later.)
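The contrast between the two formulations can be checked by a small simulation. The sketch below is illustrative only: the function names, the uniform draw of the nuisance means, and the choice of an Exponential(1) mixing distribution G are all assumptions made here, not taken from the papers under discussion. It computes the maximum likelihood estimator of the functional model (5.1), and, for data drawn from the structural representation, the simple moment estimator \({(2n)}^{-1}\sum {({X}_{i} - {Y }_{i})}^{2}\), which uses only the differences X − Y and so does not involve G.

```python
import random


def mle_functional(n, sigma, rng):
    """MLE of sigma^2 in the functional model (5.1): (4n)^{-1} sum (X_i - Y_i)^2."""
    total = 0.0
    for _ in range(n):
        mu_i = rng.uniform(-10.0, 10.0)  # hypothetical nuisance means (assumption)
        x = rng.gauss(mu_i, sigma)
        y = rng.gauss(mu_i, sigma)
        total += (x - y) ** 2
    return total / (4 * n)


def moment_estimator_structural(n, sigma, rng):
    """Draw (X, Y) = (Z, Z) + sigma * (delta, eps) with Z ~ G and return
    (2n)^{-1} sum (X_i - Y_i)^2; the latent Z cancels in X - Y."""
    total = 0.0
    for _ in range(n):
        z = rng.expovariate(1.0)  # hypothetical choice of G: Exponential(1)
        x = z + sigma * rng.gauss(0.0, 1.0)
        y = z + sigma * rng.gauss(0.0, 1.0)
        total += (x - y) ** 2
    return total / (2 * n)


rng = random.Random(2011)
sigma = 1.5
mle = mle_functional(100_000, sigma, rng)
mom = moment_estimator_structural(100_000, sigma, rng)
print(f"sigma^2 = {sigma**2:.3f}, MLE = {mle:.3f}, moment estimator = {mom:.3f}")
```

With a large sample the first estimate should settle near \({\sigma }^{2}/2\) rather than \({\sigma }^{2}\), in line with the limit displayed above, while the second should settle near \({\sigma }^{2}\).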

At nearly the same time as the work by Kiefer and Wolfowitz (1956), Stein (1956) studied efficient testing and estimation in problems with many nuisance parameters (or even nuisance functions) of a somewhat different type. In particular, Stein considered the one-sample symmetric location model

$$\begin{array}{rcl}{ \mathcal{P}}_{1} =\{ {p}_{\theta, f}(x) = f(x - \theta ) :\ \ \theta \in \mathbb{R},\ \ f\ \ \mbox{ symmetric about }\ 0,\ \ {I}_{f} < \infty \}& & \\ \end{array}$$

and the two-sample (paired) shift model

$$\begin{array}{rcl}{ \mathcal{P}}_{2} =\{ {p}_{\mu, \nu, f}(x,y) = f(x - \mu )f(y - \nu ) :\ \ \mu, \nu \in \mathbb{R},\ \ {I}_{f} < \infty \};& & \\ \end{array}$$

here \({I}_{f} \equiv \int {(f^{\prime}/f)}^{2}f\,dx\). Stein (1956) studied testing and estimation in models \({\mathcal{P}}_{1}\) and \({\mathcal{P}}_{2}\), and established necessary conditions for “adaptive estimation”: for example, conditions under which the information bounds for estimation of θ in the model \({\mathcal{P}}_{1}\) are the same as the information bounds for estimation of θ in the sub-model in which f is known. Roughly speaking, these are both cases in which the efficient score and influence functions are orthogonal to the “nuisance tangent space” in \({L}_{2}^{0}(P)\); i.e. orthogonal to all possible score functions for regular parametric submodels for the infinite-dimensional part of the model. Models of this type, and in particular the symmetric location model \({\mathcal{P}}_{1}\), remained a focus of research during the period 1956–1982.
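To make the quantity \({I}_{f}\) concrete, the following sketch approximates it numerically for two symmetric densities, the standard normal and the standard logistic. The helper name, the integration range, and the step sizes are assumptions chosen for illustration; the score f′ ∕ f is obtained by a central difference of log f and the integral by the midpoint rule.

```python
import math


def fisher_information_location(log_f, lo=-30.0, hi=30.0, steps=60_000, eps=1e-5):
    """Approximate I_f = integral of (f'/f)^2 f dx by the midpoint rule,
    with the score f'/f computed as a central difference of log f."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        score = (log_f(x + eps) - log_f(x - eps)) / (2 * eps)
        total += score * score * math.exp(log_f(x)) * h
    return total


def log_normal(x):
    """Log density of the standard normal distribution."""
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)


def log_logistic(x):
    """Log density of the standard logistic distribution."""
    a = abs(x)  # the density is symmetric; this form avoids overflow in exp
    return -a - 2.0 * math.log1p(math.exp(-a))


i_normal = fisher_information_location(log_normal)
i_logistic = fisher_information_location(log_logistic)
print(i_normal, i_logistic)
```

The closed-form values are 1 for the standard normal and 1 ∕ 3 for the standard logistic, so the bound \(1/{I}_{f}\) for estimating θ with f known differs across members of \({\mathcal{P}}_{1}\); adaptation means that this same bound applies when f is unknown.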

Over the period 1956–1982, considerable effort was devoted to finding sufficient conditions for the construction of “adaptive estimators” and “adaptive tests” in the context of the model \({\mathcal{P}}_{1}\). Hájek (1962) gave conditions for the construction of adaptive tests in the model \({\mathcal{P}}_{1}\); van Eeden (1970) gave a construction for the sub-model of \({\mathcal{P}}_{1}\) consisting of log-concave densities (for which the score function for location is monotone non-decreasing); Beran (1974) constructed efficient estimators based on ranks; and Stone (1975) gave a construction of efficient estimators based on an “estimated” one-step approach.

This, modulo a key paper by Efron (1977) on asymptotic efficiency of Cox’s partial likelihood estimators, was roughly the state of affairs of semiparametric theory in 1980–1982. Of course this is an oversimplification: much progress had been underway from a more nonparametric perspective from several quarters: the group around Lucien Le Cam in Berkeley, including P. W. Millar and R. Beran, the Russian school including I. Ibragimov and R. Has’minskii in (now) St. Petersburg and Y. A. Koshevnik and B. Levit in Moscow, and J. Pfanzagl in Cologne. Over the decade from 1982 to 1993 these two directions would merge and be understood as a whole piece of cloth, but that was not yet the case in 1980–1982, the period when Peter Bickel gave his Wald Lectures (and prepared them for publication).

5.1.2 Paper 1

The first of these four papers, On Adaptive Estimation, represents the culmination and summary of the first period of research on the phenomena of adaptive estimation uncovered by Stein (1956): it gives a masterful exposition of the state of “adaptive estimation” in the early 1980s, and new constructions of efficient estimators in several models satisfying Stein’s necessary conditions for “adaptive estimation” in the sense of Stein (1956). Bickel (1982) begins in Sect. 5.1.2 with an explanation of “adaptive estimation”, with focus on the “i.i.d. case”, and introduces four key examples to be treated: (1) the one-sample symmetric location model \({\mathcal{P}}_{1}\) introduced above; (2) linear regression with symmetric errors; (3) linear regression with a constant and arbitrary errors, a model closely related to the two-sample shift model \({\mathcal{P}}_{2}\) introduced above; and (4) location and variance-covariance parameters of elliptic distributions. The paper then moves to an explanation of Stein’s necessary condition and a presentation of a (new) set of sufficient conditions for adaptive estimation involving \({L}_{2}({P}_{{\theta }_{m},G})\)-consistent estimation of the efficient influence function (“Condition H”). Bickel shows that the sufficient conditions are satisfied in Examples (1)–(4), and hence that adaptive estimators exist in each of these problems. It was also conjectured that Condition H is necessary for adaptation. Necessary and sufficient conditions only slightly stronger than “Condition H” were established by Schick (1986) and Klaassen (1987); see also Bickel et al. (1993, 1998), Sect. 7.8.

According to the ISI Web of Science, as of 20 June 2011, this paper has received 228 citations, and thus is the most cited of the four papers reviewed here. It inspired the search for necessary and sufficient conditions for adaptive estimation (including the papers by Schick (1986) and Klaassen (1987) mentioned above). It also implicitly raised the issue of understanding efficient estimation in semiparametric models more generally. This was the focus of my joint work with Janet Begun, W. J. (Jack) Hall, and Wei-Min Huang at the University of Rochester during the period 1979–1983, resulting in Begun et al. (1983), which I will refer to in the rest of this discussion as BHHW.

5.1.3 Paper 2

Neyman and Scott (1948) had focused on inconsistency of maximum likelihood estimators in functional models, and Kiefer and Wolfowitz (1956) showed that inconsistency of likelihood-based procedures was not a difficulty for the corresponding structural (or mixture) models. Bickel and Klaassen (1986) initiated the exploration of efficiency issues in connection with functional models, with a primary focus on functional models connected with the symmetric location model \({\mathcal{P}}_{1}\). In particular, this paper examined the functional model with \({X}_{i} \sim N(\theta, {\sigma }_{i}^{2})\) independent with \({\sigma }_{i}^{2} \in {\mathbb{R}}^{+},\theta \in \mathbb{R}\), for 1 ≤ i ≤ n. The corresponding structural model is the normal scale mixture model with shift parameter θ, and hence is a subset of \({\mathcal{P}}_{1}\). In fact, it is a very rich subset, with nuisance parameter tangent spaces (for “typical” points in the model) agreeing with those of the model \({\mathcal{P}}_{1}\). The main result of the paper is a theorem giving precise conditions under which a modified version of the estimator of Stone (1975) is asymptotically efficient, again in a precise sense defined in the paper.
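For concreteness, the structural model corresponding to this functional model can be written out explicitly. The notation below is an assumption introduced here for illustration, patterned on the mixture density of Sect. 5.1.1: G denotes a mixing distribution for the scale σ on (0, ∞), and ϕ is the standard normal density.

$${p}_{\theta, G}(x) = \int \nolimits_{0}^{\infty } \frac{1} {\sigma }\phi \left (\frac{x - \theta } {\sigma } \right )dG(\sigma ),\qquad {p}_{\theta, G}(\theta + t) = {p}_{\theta, G}(\theta - t)\ \ \mbox{ for all }\ t \in \mathbb{R}.$$

Each normal component is symmetric about θ, so the mixture is as well; this is the symmetry required of members of \({\mathcal{P}}_{1}\).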

This paper inspired further work on efficiency issues in functional models: see e.g. Pfanzagl (1993) and Strasser (1996). According to the ISI Web of Science (20 June 2011), it has been cited 15 times. These types of models remain popular (in September 2011, MathSciNet gives 414 hits for “functional model” and 480 hits for “structural model”), but many problems remain.

Between 1982 and the publication of this paper in 1986, the paper Begun et al. (1983) appeared. In June 1983 Peter Bickel and I had given a series of lectures at Johns Hopkins University on semiparametric theory as it stood at that time, and had started writing a book on the subject together with Klaassen and Ritov, Bickel et al. (1993, 1998), which was optimistically announced in the references for this paper as “BKRW (1987)”.

5.1.4 Paper 3

This paper, Bickel and Ritov (1987), treats efficiency of estimation in the structural (or mixture model) version of the errors-in-variables model dating back at least to Neyman and Scott (1948) and Reiersol (1950), and perhaps earlier. As noted by the authors: “Estimates of β in the general Gaussian error model, with \({\Sigma }_{0}\) diagonal, have been proposed by a variety of authors including Neyman and Scott (1948) and Rubin (1956). In the arbitrary independent error model, Wolfowitz in a series of papers ending in 1957, Kiefer, Wolfowitz, and Spiegelman (1979) by a variety of methods gave estimates, which are consistent and in Spiegelman’s case \({n}^{1/2}\)-consistent and asymptotically. Little seems to be known about the efficiency of these procedures other than that in the restricted Gaussian model ”. This model is among the first semiparametric mixture models involving a nontrivial projection in the calculation of the efficient score function to receive a thorough analysis and constructions of asymptotically efficient estimators. The authors gave an explicit construction of estimators achieving the information bound in a very detailed analysis requiring 17 pages of careful argument.

The type of construction used by the authors involves kernel smoothing estimators of the nonparametric part of the model, and hence brings in choices of smoothing kernels and smoothing parameters (\({\epsilon }_{n}\), \({c}_{n}\) and \({\nu }_{n}\) in the authors’ notation, with \(n{c}_{n}^{2}{\nu }_{n}^{6} \rightarrow \infty \)). This same approach was used by van der Vaart (1988) to construct efficient estimators in a whole class of structural models of this same type; van der Vaart’s construction involved the choice of seven different smoothing parameters. On the other hand, Pfanzagl (1990a, pp. 47–48; see also Pfanzagl 1990b) pointed out that the resulting estimators are rather artificial in some sense, and advocated maximum likelihood or other procedures requiring no (or at least fewer) smoothing parameter choices. This approach was pursued in van der Vaart (1996). Forty years after Kiefer and Wolfowitz established consistency of maximum likelihood procedures, van der Vaart proved efficiency of maximum likelihood in several particular structural models (under moment conditions which are sufficient but very likely not necessary), including the errors-in-variables model treated in the paper under review. The proofs in van der Vaart (1996) proceed via careful use of empirical process theory. Furthermore, Murphy and van der Vaart (1996) succeeded in extending the maximum likelihood estimators to confidence sets via profile likelihood considerations.

This paper has 35 citations in the ISI Web of Science as of 20 June 2011, but it inspired considerable further work on efficiency bounds and especially on alternative methods for construction of efficient estimators.

5.1.5 Paper 4

In the period 1988–1991 several key questions on the “boundary” between nonparametric and semiparametric estimation came under close examination by van der Vaart, Bickel and Ritov, and Donoho and Liu. The lower bound theory under development for publication in BKRW (1993) relied upon Hellinger differentiability of real-valued functionals. (The lower bound theory based on pathwise Hellinger differentiability was put in a very nice form by van der Vaart (1991).)

But the possibility of a gap between the conditions for differentiability and sufficient conditions to attain the bounds became a nagging question. In Ritov and Bickel (1990), Peter and Ya’acov analyzed the situation in complete detail for the real-valued functional \(\nu (P) = \int {p}^{2}(x)\,dx\) defined for the collection \(\mathcal{P}\) of distributions P on [0, 1] with a density p with respect to Lebesgue measure. This functional turns out to be Hellinger differentiable at all such densities p with an information lower bound given by

$${I}_{\nu }^{-1} = 4\,\mathrm{Var}(p(X)) = 4\int {(p(x) - \nu (P))}^{2}p(x)\,dx.$$

However, Theorem 1 of Ritov and Bickel (1990) shows that there exist distributions \(P \in \mathcal{P}\) such that every sequence of estimators of \(\nu (P)\) converges to \(\nu (P)\) more slowly than \({n}^{-\alpha }\) for every α > 0. It had earlier been shown by Ibragimov and Hasminskii (1979) that the \(\sqrt{n}\) convergence rate could be achieved for densities satisfying a Hölder condition of order at least 1 ∕ 2, and in a companion paper to the one under discussion, Bickel and Ritov (1988), Peter and Ya’acov showed that this continued to hold for densities p satisfying a Hölder condition of order at least 1 ∕ 4.
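As a worked check of the information bound \(4\,\mathrm{Var}(p(X))\) for the functional \(\nu (P) = \int {p}^{2}(x)\,dx\), the sketch below evaluates both quantities by numerical integration for two smooth densities on [0, 1]. The example densities (p(x) = 2x and the uniform density) and the helper names are assumptions chosen purely for illustration; they are not examples from Ritov and Bickel (1990), whose hard cases are far less regular.

```python
def integrate(g, lo=0.0, hi=1.0, steps=10_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(g(lo + (k + 0.5) * h) for k in range(steps)) * h


def nu_and_bound(p):
    """Return nu(P) = int p^2 dx and the bound 4 * Var(p(X))
    for a density p on [0, 1]."""
    nu = integrate(lambda x: p(x) ** 2)
    bound = 4.0 * integrate(lambda x: (p(x) - nu) ** 2 * p(x))
    return nu, bound


# Hypothetical example densities on [0, 1] (assumptions for illustration).
nu_lin, bound_lin = nu_and_bound(lambda x: 2.0 * x)  # p(x) = 2x
nu_unif, bound_unif = nu_and_bound(lambda x: 1.0)    # uniform density

print(nu_lin, bound_lin)
print(nu_unif, bound_unif)
```

For p(x) = 2x the closed-form values are ν(P) = 4 ∕ 3 and a bound of 8 ∕ 9; for the uniform density p(X) is constant, so ν(P) = 1 and the bound degenerates to 0.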

These results have been extended to obtain rates of convergence in the “non-regular” or nonparametric domain: see Birgé and Massart (1993, 1995) and Laurent and Massart (2000). More recently the techniques of analysis have been extended still further by Tchetgen et al. (2008) and Robins et al. (2009). As of 20 June 2011, this paper has been cited 45 times (ISI Web of Science).

5.1.6 Summary and Further Problems

The four papers reviewed here represent only a small fraction of Peter Bickel’s work on the theory of semiparametric models, but they illustrate his superb judgement in choosing problems that both advance the general theory of semiparametric models and have relevance for applications. They also showcase his wonderful ability to see his way through the technicalities of problems to solutions that are of theoretical importance and that point the way forward to further understanding. Paper 1 was clearly important in the development of general theory for the adaptive case beyond the location and shift models \({\mathcal{P}}_{1}\) and \({\mathcal{P}}_{2}\). Paper 2 initiated efficiency theory for estimation in functional models quite generally. Paper 3 played an important role in illustrating how semiparametric theory could be applied to the structural (or mixing) form of the classical errors-in-variables model, hence yielding one of the first substantial models to be discussed in detail in the “non-adaptive case”, in which calculation of the efficient score and efficient influence function requires a non-trivial projection.

As noted by Kosorok (2009), semiparametric models continue to be of great interest because of their “genuine scientific utility combined with the breadth and depth of the many theoretical questions that remain to be answered”.

Figure 5.1 gives an update of Fig. 2.1 of Wellner et al. (2006). The trend is clearly increasing!

Fig. 5.1 Numbers of papers with “semiparametric” in title, keywords, or abstract, by year, 1984–2010. Red = MathSciNet; Green = Current Index of Statistics (CIS); Blue = ISI Web of Science