Nonparametric conditional density estimation in a deep learning framework for short-term forecasting

Huberman, David B.; Reich, Brian J.; Bondell, Howard D.

doi:10.1007/s10651-021-00499-z

Nonparametric conditional density estimation in a deep learning framework for short-term forecasting

Published: 20 May 2021

Volume 29, pages 677–704, (2022)
Cite this article

Environmental and Ecological Statistics Aims and scope Submit manuscript

592 Accesses
5 Citations
Explore all metrics

A Correction to this article was published on 26 August 2022

This article has been updated

Abstract

Short-term forecasting is an important tool in understanding environmental processes. In this paper, we incorporate machine learning algorithms into a conditional distribution estimator for the purposes of forecasting tropical cyclone intensity. Many machine learning techniques give a single-point prediction of the conditional distribution of the target variable, which does not give a full accounting of the prediction variability. Conditional distribution estimation can provide extra insight on predicted response behavior, which could influence decision-making and policy. We propose a technique that simultaneously estimates the entire conditional distribution and flexibly allows for machine learning techniques to be incorporated. A smooth model is fit over both the target variable and covariates, and a logistic transformation is applied on the model output layer to produce an expression of the conditional density function. We provide two examples of machine learning models that can be used, polynomial regression and deep learning models. To achieve computational efficiency, we propose a case–control sampling approximation to the conditional distribution. A simulation study for four different data distributions highlights the effectiveness of our method compared to other machine learning-based conditional distribution estimation techniques. We then demonstrate the utility of our approach for forecasting purposes using tropical cyclone data from the Atlantic Seaboard. This paper gives a proof of concept for the promise of our method, further computational developments can fully unlock its insights in more complex forecasting and other applications.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Realistic tropical cyclone wind and pressure fields can be reconstructed from sparse data using deep learning

Article Open access 03 January 2024

Estimation of global tropical cyclone wind speed probabilities using the STORM dataset

Article Open access 10 November 2020

Markovian approach to the frequency of tropical cyclones and subsequent development of univariate prediction model

Article 04 January 2022

Change history

26 August 2022
A Correction to this paper has been published: https://doi.org/10.1007/s10651-022-00543-6

References

Biswas MK, Carson L, Newman K, Bernardet L, Kalina E, Grell E, Frimel J (2017) Community hwrf users guide v3. 9a
Chung Y, Dunson DB (2009) Nonparametric bayes conditional distribution modeling with variable selection. J Am Stat Assoc 104(488):1646–1660
Article PubMed PubMed Central Google Scholar
Cloud KA, Reich BJ, Rozoff CM, Alessandrini S, Lewis WE, Delle Monache L (2019) A feed forward neural network based on model output statistics for short-term hurricane intensity prediction. Weather and Forecasting 34(4):985–997
Article Google Scholar
Dalmasso N, Pospisil T, Lee AB, Izbicki R, Freeman PE, Malz AI (2020) Conditional density estimation tools in python and r with applications to photometric redshifts and likelihood-free cosmological inference. Astronomy and Computing 30:100362
Article Google Scholar
Dunson DB, Pillai N, Park J-H (2007) Bayesian density regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(2):163–183
Article Google Scholar
Efromovich S (2010) Orthogonal series density estimation. Wiley Interdisciplinary Reviews: Computational Statistics 2(4):467–476
Article Google Scholar
Escobar MD, West M (1995) Bayesian density estimation and inference using mixtures. J Am Stat Assoc 90(430):577–588
Article Google Scholar
Fahey MT, Thane CW, Bramwell GD, Coward WA (2007) Conditional gaussian mixture modelling for dietary pattern analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society) 170(1):149–166
Article Google Scholar
Fan J-Q, Peng L, Yao Q-W, Zhang W-Y (2009) Approximating conditional density functions using dimension reduction. Acta Mathematicae Applicatae Sinica, English Series 25(3):445–456
Article Google Scholar
Fithian W, Hastie T (2013) Finite-sample equivalence in statistical models for presence-only data. The annals of applied statistics 7(4):1917–1939
Article PubMed Google Scholar
Fithian W, Hastie T (2014) Local case-control sampling: Efficient subsampling in imbalanced data sets. Ann Stat 42(5):1693
Article PubMed PubMed Central Google Scholar
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1
Article PubMed PubMed Central Google Scholar
Friedman J, Hastie T, Tibshirani R (2009) glmnet: Lasso and elastic-net regularized generalized linear models. R package version 1(4)
Geweke J, Keane M (2007) Smoothly mixing regressions. Journal of Econometrics 138(1):252–290
Article Google Scholar
Gilardi N, Bengio S, Kanevski M (2002) Conditional gaussian mixture models for environmental risk mapping. In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 777–786. IEEE
Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 102(477):359–378
Article CAS Google Scholar
Hall P, Wolff RC, Yao Q (1999) Methods for estimating a conditional distribution function. J Am Stat Assoc 94(445):154–163
Article Google Scholar
Hall P, Racine J, Li Q (2004) Cross-validation and the estimation of conditional probability densities. J Am Stat Assoc 99(468):1015–1026
Article Google Scholar
Hall P, Yao Q et al (2005) Approximating conditional distribution functions using dimension reduction. Ann Stat 33(3):1404–1421
Article Google Scholar
Hannah LA, Blei DM, Powell WB (2011) Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research 12(6)
Hersbach H (2000) Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting 15(5):559–570
Article Google Scholar
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. pp. 1026–1034
Hornik K, Stinchcombe M, White H et al (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366
Article Google Scholar
Hothorn T, Zeileis A (2017) Transformation forests. arXiv preprint arXiv:1701.02110
Hyndman RJ, Bashtannyk DM, Grunwald GK (1996) Estimating and visualizing conditional densities. J Comput Graph Stat 5(4):315–336
Google Scholar
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Izbicki R, Lee AB (2016) Nonparametric conditional density estimation in a high-dimensional regression setting. J Comput Graph Stat 25(4):1297–1316
Article Google Scholar
Izbicki R, Lee AB et al (2017) Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics 11(2):2800–2831
Article Google Scholar
Jara A, Hanson TE (2011) A class of mixtures of dependent tail-free processes. Biometrika 98(3):553–566
Article CAS PubMed PubMed Central Google Scholar
Jarner MF, Diggle P, Chetwynd AG (2002) Estimation of spatial variation in risk using matched case-control data. Biometrical Journal: Journal of Mathematical Methods in Biosciences 44(8):936–945
Article Google Scholar
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Krüger F, Lerch S, Thorarinsdottir TL, Gneiting T (2016) Predictive inference based on markov chain monte carlo output. arXiv preprint arXiv:1608.06802
Kundu S, Dunson DB (2014) Bayes variable selection in semiparametric linear models. J Am Stat Assoc 109(505):437–447
Article CAS PubMed PubMed Central Google Scholar
Lenk PJ (1988) The logistic normal distribution for bayesian, nonparametric, predictive densities. J Am Stat Assoc 83(402):509–516
Article Google Scholar
Li R, Bondell HD, Reich BJ (2019) Deep distribution regression. arXiv:1903.06023
Matheson JE, Winkler RL (1976) Scoring rules for continuous probability distributions. Manage Sci 22(10):1087–1096
Article Google Scholar
Meinshausen N (2006) Quantile regression forests. Journal of Machine Learning Research 7(Jun), 983–999
Müller P, Erkanli A, West M (1996) Bayesian curve fitting using multivariate normal mixtures. Biometrika 83(1):67–79
Article Google Scholar
Park J-H, Dunson DB (2010) Bayesian generalized product partition model. Statistica Sinica, 1203–1226
Payne RD, Guha N, Ding Y, Mallick BK (2020) A conditional density estimation partition model using logistic gaussian processes. Biometrika 107(1):173–190
Article CAS PubMed Google Scholar
Peng F, Jacobs RA, Tanner MA (1996) Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. J Am Stat Assoc 91(435):953–960
Article Google Scholar
Pospisil T, Lee AB (2018) Rfcde: Random forests for conditional density estimation. arXiv preprint arXiv:1804.05753
Rojas AL, Genovese CR, Miller CJ, Nichol R, Wasserman L (2005) Conditional density estimation using finite mixture models with an application to astrophysics
Rosenblatt M (1969) Conditional probability density and regression estimators. Multivariate analysis II 25:31
Google Scholar
Shahbaba B, Neal R (2009) Nonlinear models using dirichlet process mixtures. Journal of Machine Learning Research 10(8)
Song X, Yang K, Pavel M (2004) Density boosting for gaussian mixtures. In International Conference on Neural Information Processing, pp. 508–515. Springer
Taddy MA, Kottas A (2010) A bayesian nonparametric approach to inference for quantile regression. Journal of Business & Economic Statistics 28(3):357–369
Article Google Scholar
Tokdar ST, Kadane JB et al (2012) Simultaneous linear quantile regression: a semiparametric bayesian approach. Bayesian Anal 7(1):51–72
Google Scholar
Tokdar ST, Zhu YM, Ghosh JK et al (2010) Bayesian density regression with logistic gaussian process and subspace projection. Bayesian Anal 5(2):319–344
Article Google Scholar
Trippa L, Müller P, Johnson W (2011) The multivariate beta process and an extension of the polya tree model. Biometrika 98(1):17–34
Article PubMed PubMed Central Google Scholar
Tung NT, Huang JZ, Khan I, Li MJ, Williams G et al. (2014) Extensions to quantile regression forests for very high-dimensional data
Weierstrass K (1885) Über die analytische darstellbarkeit sogenannter willkürlicher functionen einer reellen veränderlichen. Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften zu Berlin 2:633–639
Google Scholar
Wood SA, Jiang W, Tanner M (2002) Bayesian mixture of splines for spatially adaptive nonparametric regression. Biometrika 89(3):513–528
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Statistics, North Carolina State University, Campus Box 8203, Raleigh, NC, 27695, USA
David B. Huberman & Brian J. Reich
School of Mathematics and Statistics, University of Melbourne, Peter Hall Building, Parkville, VIC, 3122, Australia
Howard D. Bondell

Authors

David B. Huberman
View author publications
You can also search for this author in PubMed Google Scholar
Brian J. Reich
View author publications
You can also search for this author in PubMed Google Scholar
Howard D. Bondell
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to David B. Huberman.

Additional information

Handling Editor: Luiz Duczmal.

We’d like to thank Dr. Kevin Gunn for lending equipment and technical assistance to obtain some of the results for the tropical cyclone intensity forecasting application.

Appendices

Appendix A: Case–control sampling justification

We exploit the relationship between an inhomogeneous Poisson process (IPP) model and our problem to re-frame case and control samples into presence and background samples from presence-only datasets. Presence-only data is generally used with species distribution modeling where surveyors record all species presences within a pre-specified region along with randomly sampled background data across the region. Fithian (2013) lay out how to model presence-only data in a two-dimensional spatial domain using an IPP model. For our method, the domain for cases (and controls) is [0, 1] as previously noted. The IPP requires an intensity function $\lambda $ to be specified, which represents the likelihood of a case being present at any location in the given domain. The average intensity function for response value $\textit{z}$ over the domain is given as

$$\begin{aligned} \Lambda =\int _{0}^{1}\lambda (\textit{z})d\textit{z}. \end{aligned}$$

(A1)

An IPP model with intensity function $\lambda $ gives the probability distributions for both the total number of cases as well as the locations of those cases. Conditional on the number of cases (governed by a Poisson distribution with mean $\Lambda $), the locations of the cases are independently and identically distributed as

$$\begin{aligned} \Pr (\textit{Z}=\textit{z})=\frac{\lambda (\textit{z})}{\Lambda }. \end{aligned}$$

(A2)

If we define $\lambda (\textit{z})=e^{q(\textit{z},{\varvec{X}})}$, then the IPP model becomes

$$\begin{aligned} \Pr (\textit{Z}=\textit{z})=\frac{e^{q(\textit{z},{\varvec{X}})}}{\int _0^1 e^{q(\textit{z},{\varvec{X}})} d\textit{z}} . \end{aligned}$$

(A3)

We recognize this distribution form in the continuous logistic transformation formula given in 2. The cases are independently and identically distributed, so the log likelihood objective function is

$$\begin{aligned} \ell (\varvec{\theta }|\textit{Z},{\varvec{X}})=\sum \limits _{i=1}^{n} \bigg [ q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })-\log \bigg ( \int _0^1 q(\textit{z},{\varvec{X}}_i;\varvec{\theta })d\textit{z}\bigg ) \bigg ] \end{aligned}$$

(A4)

where $i=1,\ldots ,n$ and $\varvec{\theta }$ is the parameter vector for the selected $q$ model, as before. Fithian (2013) discuss how to evaluate the integral in the denominator by approximating it using a finite set of control (background) samples. Let $k=1,\ldots ,K$ index the control samples, so that $\textit{z}_{ik}$ denotes the $k$th control value for observation $i$. Then, the log likelihood objective function with an approximated integral becomes

$$\begin{aligned} \ell (\varvec{\theta }|\textit{Z},{\varvec{X}})=\sum \limits _{i=1}^{n}\bigg [ q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })-\log \bigg ( e^{q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })}+ \sum _{k=1}^{K} e^{q(\textit{z}_{ik},{\varvec{X}}_i;\varvec{\theta })} \bigg ) \bigg ]. \end{aligned}$$

(A5)

We recognize this as the log likelihood objective function in Sect. 8. Fithian and Hastie do not approximate the integral using a $q$ function with a case plus a set of controls, instead only using the set of controls. However, the case is only providing more information about the integral than the controls would on their own so we do not expect this to produce any inconsistencies or biases. They state that the control points can be uniformly sampled from the domain, which implies that selecting as few as $K=1$ controls still can provide valid inference about the target distribution function. Fithian and Hastie also state that the control points can be chosen through weighted sampling, referring to using quadrature weights. The additional ridge penalty component in 8 could have been added to the IPP likelihood as well.

Note that for presence-only data, Fithian and Hastie detail possible sampling bias issues that can arise due to imperfect detection of presences during data collection. However, we do not think detectability issues are relevant to our usage of this framework. We make the assumption that the detectability parameter equals 1 in our context, so we do not have to worry about sampling bias in this respect.

Appendix B: Computation

We use a variety of techniques to determine the optimal set of parameter estimates for the chosen model.

1.1 B.1: Polynomial regression model

For the polynomial regression model, we can manipulate the data in order to evaluate the parameters using penalized logistic regression. Recall the polynomial regression model in Sect. 3. We can express the polynomial model in 3 in matrix form as $q(\textit{z},{\varvec{X}})={\varvec{X}}(\textit{z},B)\varvec{\xi }$, where ${\varvec{X}}(\textit{z},B)$ is the full covariate matrix as a function of the transformed response $\textit{z}$ and the highest polynomial power $B$ and $\varvec{\xi }=[\varvec{\xi }_{10},\ldots ,\varvec{\xi }_{{B}p} ]$ is the associated parameter vector. Then, we can rewrite the objective function in Sect. 8 using the polynomial regression from Sect. 3 as

$$\begin{aligned} \ell (\varvec{\theta }|\textit{Z}_i,{\varvec{X}}_i)=&\sum \limits _{i=1}^{n}\bigg \{ -\log \frac{e^{{\varvec{X}}_i(\textit{z}_i,B)\varvec{\xi }}}{e^{{\varvec{X}}_i(\textit{z}_i,B)\varvec{\xi }}+e^{{\varvec{X}}_i(\textit{z}_{i1}^*,B)\varvec{\xi }}} \bigg \}+\omega ||\varvec{\theta }||^2 \end{aligned}$$

(B1)

$$\begin{aligned} =&\sum \limits _{i=1}^{n}\bigg \{ -\log \frac{1}{1+e^{[{\varvec{X}}_i(\textit{z}_{i1}^*,B)-{\varvec{X}}_i(\textit{z}_i,B)]\varvec{\xi }}} \bigg \}+\omega ||\varvec{\theta }||^2. \end{aligned}$$

(B2)

We recognize the form of this objective function for a logistic regression where every binary outcome equals 1. If we add a small amount of dummy observations where the covariate and outcome values are all 0, we can evaluate our parameter vector using penalized logistic regression. For each run of the polynomial regression case-control method, we used two dummy observations. The GLMNet package in R evaluates these parameters extremely fast, making this method practical and convenient (Friedman et al. 2009).

We ran preliminary simulation runs on a smaller number of datasets for a variety of ridge penalties to determine the appropriate penalty values to use for each scenario of the simulation and the application. For Model 1, 2, and 4 the ridge penalties used were 0.025, 0.05, and 0.05, respectively. For Model 3, the ridge penalty used for 200 observations was 0.025, whereas the ridge penalty used for the other two sample size settings was 0.01.

For the application, we considered a variety of ridge penalties to determine which penalty minimized the mean 5-fold CRPS. For lag 3 and lag 6, a ridge penalty of 0.000001 and 0.0005 were chosen, respectively.

1.2 B.2: Deep learning model

For the deep learning case–control approximation, we cannot manipulate the data like in Sect. B2 and instead rely on gradient descent to evaluate the parameters.

Deep learning models can be difficult to train with a basic gradient descent algorithm. We extend or alter the basic gradient descent algorithm in a multitude of ways designed to improve parameter estimation. Mini-batch gradient descent is used with batch sizes of 50 and an initial step size of 1. For the case study, the batch sizes are approximate as the number of observations is not divisible by 50. The models are run for 600 gradient descent steps for the simulation and 300 gradient descent steps for the case study, respectively. Adaptive moment estimation (ADAM) is implemented along with a step decay that halves the initial step size every 100 steps for the simulation models and 50 steps for the case study models (Kingma and Ba 2014). We apply the batch normalization algorithm described in Ioffe and Szegedy (2015). Model weights are initiated using the He initialization scheme, while the batch normalization shift and scale parameters are initialized to 0 and 1, respectively (He et al. 2015).

The gradient descent algorithm requires the calculation of a gradient vector, which contains the first derivative value of the objective function with respect to each parameter. We obtain these values using back-propagation. The individual chain rule components for the gradient vector calculations for a single observation and case/control value are given in Sect. B3. The gradient vector can be calculated by summing the appropriate terms across observations and cases/controls.

$$\begin{aligned} \frac{\partial \ell }{\partial q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })}=&\frac{1}{e^{q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })}+e^{q(\textit{z}^{*}_{ik};{\varvec{X}}_i,\varvec{\theta })}} \nonumber \\ \frac{\partial \ell }{\partial q(\textit{z}^{*}_{ik};{\varvec{X}}_i,\varvec{\theta })}=&\frac{e^{q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })}}{e^{q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })}+e^{q(\textit{z}^{*}_{ik};{\varvec{X}}_i,\varvec{\theta })}}\nonumber \\ \frac{\partial q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })}{\partial \delta _{t}}=&\frac{\partial q(\textit{z}^{*}_{ik};{\varvec{X}}_i,\varvec{\theta })}{\partial \delta _{t}}=f_A(\textit{H}_{t}) \nonumber \\ \frac{\partial q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })}{\partial f_A(\textit{H}_{t})}=&\frac{\partial q(\textit{z}^{*}_{ik};{\varvec{X}}_i,\varvec{\theta })}{\partial f_A(\textit{H}_{t})}=\delta _{t}\nonumber \\ \frac{\partial f_A(\textit{H}_{t})}{\partial \textit{H}_{t}}=&{\left\{ \begin{array}{ll} 1 &{} \text {if } \textit{H}_{t}>0\\ \alpha \exp {\textit{H}_{t}} &{} \text {if } \textit{H}_{t}\le 0 \end{array}\right. } \nonumber \\ \frac{\partial \textit{H}_{t}}{\partial f_A(\textit{I}_{r})}=&\gamma _{tr}\nonumber \\ \frac{\partial \textit{H}_{t}}{\partial \gamma _{0t}}=&1 \nonumber \\ \frac{\partial \textit{H}_{t}}{\partial \gamma _{tr}}=&f_A(\textit{I}_{r}) \nonumber \\ \frac{\partial f_A(\textit{I}_{r})}{\partial \textit{I}_{r}} =&{\left\{ \begin{array}{ll} 1 &{} \text {if } \textit{I}_{r}>0\\ \alpha \exp {\textit{I}_{r}} &{} \text {if } \textit{I}_{r}\le 0 \end{array}\right. }\nonumber \\ \frac{\partial \textit{I}_{r}}{\beta _{0r}}=&1 \nonumber \\ \frac{\partial \textit{I}_{r}}{\beta _{1r}}=&\textit{z}_i\nonumber \\ \frac{\partial \textit{I}_{r}}{\beta _{jr}}=&X_{ji}. \end{aligned}$$

(B3)

The components for the batch normalization parameters can be obtained following the steps in (Ioffe and Szegedy 2015).

We ran preliminary simulation runs on a smaller number of datasets for a variety of ridge penalties to determine the appropriate penalty values to use for each scenario of the simulation and the application. For simulation Model 1, we used 0.025, 0.002, and 0.0025 as the penalties for 200, 1000, and 4000 observations respectively. For simulation Model 2, we used 0.015, 0.0025, and 0.001 as the penalties for 200, 1000, and 4000 observations respectively. For simulation Model 3, we used 0.0075, 0.001, and 0.001 as the penalties for 200, 1000, and 4000 observations respectively. For simulation Model 4, we used 0.01, 0.001, and 0.0005 as the penalties for 200, 1000, and 4000 observations respectively.

For the application, we considered a variety of ridge penalties and initial learning rates for the deep learning IPP approximation method. For both lags, the tuning settings that optimized the mean 5-fold CRPS were a ridge penalty of 0.000001 and an initial learning rate of 1.

Appendix C: Predicted distribution plots for individual fold

The predicted distribution plots for a range of covariate values using the full tropical cyclone dataset is given in Sect. 5, however the results in Table 2 were produced by obtaining separate models for each testing dataset fold. Here, we present an example of the predicted distribution plots for the third fold for both methods and both lags. The predicted distributions for the individual folds can vary somewhat compared to the overall predicted response distributions fitted using the full tropical cyclone dataset, as seen below for the deep learning IPP approximation. The polynomial MCC approximation predicted conditional distributions for the individual folds were generally similar to the overall predicted conditional distributions in Fig. 3.

Table 3 The 5-fold mean CRPS values for various numbers of grid points for the QRF, DDR, polynomial MCC approximation ($K=1$), and deep learning IPP approximation ($K=20$) conditional distribution estimation methods for predicting the lag 3 maximum 10-meter wind speed

Full size table

Table 4 The 5-fold mean CRPS values for various numbers of evenly spaced grid points for the QRF, DDR, polynomial MCC approximation ($K=1$), and deep learning IPP approximation ($K=20$) conditional distribution estimation methods for predicting the lag 6 maximum 10-m wind speed

Full size table

Appendix D: CRPS control sensitivity analysis

The simulation study results in Sect. 4 indicate that the number of $K$ controls is influential for deep learning approximation model performance when there are only 200 observations. The deep learning IPP approximation model with $K=1$ control was outperformed by QRF and DDR in three of the four data models, whereas the deep learning IPP approximation model with $K=10$ controls outperformed QRF and DDR in three of the four data models. We further explore the effect of the number of $K$ for these four data models and 200 observations in this section.

Figure 6 gives the average CRPS divergence for 200 observations under each data model for $K\in \{1,2,3,4,5,6,7,8,9,10,15,20\}$ controls. There is a notable decrease in CRPS divergence when increasing from $K=1$ to $K=2$ controls for all four data models. The decrease in CRPS divergence for $K>2$ is also present, however the improvements diminish as $K$ is increased.

These results suggest that it is important to use multiple controls when utilizing the deep learning IPP approximation for a small dataset. The CRPS divergence scores either improve or stay similar as more controls are used, however the benefit of increasing the number of controls beyond $K=2$ are increasingly marginal. The results in Table 1 point to the computational efficiency of the deep learning approximation method as a significant consideration, although not quite as significant for only 200 observations. Still, this suggests that the optimal selection of $K$ controls would factor in both accuracy and computation time rather than choosing the highest $K$ value possible.

Appendix E: CRPS grid points sensitivity analysis

The methods used for the case study were evaluated using the CRPS performance metric with 1000 evenly spaced grid points, consistent with the specifications used in Li et al. (2019). To investigate whether the case study results are impacted by the number of grid points, we conducted a sensitivity analysis. The CRPS scores for each case study method are calculated using 250, 500, 750, 1000, 1250, 1500, 1750, and 2000 evenly spaced grid points and the results are compared.

Tables 3 and 4 give the CRPS scores and accompanying standard errors for each method using the various amounts of evenly spaced grid points for the lag 3 and lag 6 datasets, respectively. The results across all methods and datasets are extremely insensitive to the number of evenly spaced grid points. The average CRPS scores for different number of grid points are nearly identical, indicating that the takeaways of the case study were not influenced by our choice to use 1000 evenly spaced grid points.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Huberman, D.B., Reich, B.J. & Bondell, H.D. Nonparametric conditional density estimation in a deep learning framework for short-term forecasting. Environ Ecol Stat 29, 677–704 (2022). https://doi.org/10.1007/s10651-021-00499-z

Download citation

Received: 20 August 2020
Revised: 25 February 2021
Accepted: 01 April 2021
Published: 20 May 2021
Issue Date: December 2022
DOI: https://doi.org/10.1007/s10651-021-00499-z

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Nonparametric conditional density estimation in a deep learning framework for short-term forecasting

Abstract

Access this article

Similar content being viewed by others

Realistic tropical cyclone wind and pressure fields can be reconstructed from sparse data using deep learning

Estimation of global tropical cyclone wind speed probabilities using the STORM dataset

Markovian approach to the frequency of tropical cyclones and subsequent development of univariate prediction model

Change history

26 August 2022

References