1 Introduction

Computational models are a widely used alternative for solving problems that cannot be addressed by deterministic analytical approaches [1]. This has led to the development of a fairly small number of density functions that describe how values are distributed over the sample spaces of a large number of real phenomena. However, preparing and statistically analyzing the data to take advantage of these distributions requires significant effort, and it does not produce good results when the system under analysis depends on random variables that do not follow known probability density functions, leading us to look for unconventional alternatives that can reproduce the stochasticity of real systems [2].

Estimating random variable distributions has played an important role in several recent studies, such as analyses of monthly rainfall and water flow to determine drought indicators [3], crime prediction based on Twitter messages [4], and studies of trends in marine duck populations along the Atlantic coast of the United States [5].

For more than 40 years, nonparametric goodness-of-fit techniques, such as the Kolmogorov–Smirnov and chi-squared tests, have been the most widely used methods for assessing distributions, because they do not depend on the explicit form of the distribution or its parameter values, as parametric techniques do [6, 7]. However, these tests only indicate how well the data fit known distributions, and they are also sensitive to common errors in interpreting the p-value [8].

Since its introduction in 1956, kernel density estimation (KDE) has become one of the most widely used nonparametric density estimation methods [9, 10]. Over time, various authors have extensively modified the original technique to reduce its sensitivity to the choice of kernel function and bandwidth [11]. The most recent methods suggest using maximum likelihood algorithms with maximum entropy [12] and histogram trend filters [13]. These nonparametric techniques have proved useful for analyzing phenomena that do not conform to known distributions, such as wind speeds [14], crime prediction using social network data [4], and smart-sensor-based electricity readings [15].

Once a given random variable’s distribution has been established, the subsequent problem is to generate numerical values that follow that distribution. The most common conventional random variable generation techniques are the inverse-transform, acceptance–rejection, and convolution methods. However, these all have issues in terms of calculation speed, the computational resources required, and the effort needed to prepare and statistically analyze the data [1]. Some researchers have presented universal methods of generating random variables by means of generalized acceptance–rejection algorithms [16], multilayer neural networks [17], and random variable transformations that generate continuous distribution families [18], as well as specialized algorithms for producing particular distributions such as the geometric distribution [19].

This paper presents a new nonparametric approach that combines KDE with radial basis function based neural networks (RBFBNNs) to generate random variable values regardless of their probability distributions and of whether the variables are discrete or continuous, thus reducing the dependence on goodness-of-fit tests, both for data that follow a known distribution and for data that are distributed atypically.

2 Kernel Density Estimation

KDE is a common nonparametric technique for estimating the probability density functions of random variables. Given a set of n independent observations \( X_{1} ,X_{2} , \ldots ,X_{n} \) drawn from the same probability density function f, the model estimates the probability density function (PDF) \( f_{n} \) associated with these observations as follows:

$$ f_{n}(x) = \frac{1}{nh} \sum_{j=1}^{n} K\left( \frac{x - X_{j}}{h} \right). $$
(1)

Here, \( K \) is the kernel function, which weights the result by the proximity of \( x \) to the sampled points, and \( h \) is the bandwidth, which defines the size of the kernel function’s weighting window.

KDE’s accuracy depends on the kernel function \( K \) and bandwidth \( h \) used. The choice of \( K \) does not significantly affect the model’s statistical efficiency, but it does impact the calculation speed for large data sets [10]. In contrast, the bandwidth \( h \) is a sensitive parameter that governs the model’s overall behavior, so selecting an optimal value for it is essential when estimating the PDF [10, 11].
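As a concrete illustration of Eq. (1), the following minimal Python sketch evaluates the estimator with NumPy, using the Epanechnikov kernel adopted later in Sect. 4. The function names and the example bandwidth are our own illustrative choices, not part of the original implementation.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: K(u) = 0.75 * (1 - u^2) for |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kde(x, samples, h):
    """Evaluate the KDE of Eq. (1) at points x, given observations and bandwidth h."""
    # Pairwise scaled distances between evaluation points and observations.
    u = (x[:, None] - samples[None, :]) / h
    return epanechnikov(u).sum(axis=1) / (len(samples) * h)

# Example: estimate the density of 600 exponential samples on a grid.
rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=600)
grid = np.linspace(0.0, 6.0, 200)
pdf_hat = kde(grid, samples, h=0.3)  # h = 0.3 is an arbitrary illustrative bandwidth
```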

3 Generalized Regression Neural Networks and Probabilistic Neural Networks (RBFBNNs)

GRNNs and PNNs are RBFBNNs introduced by Donald Specht between 1990 and 1991. In the case of PNNs, Specht demonstrated that the Bayes–Parzen classifier can be split into many simple processes and hence implemented as a multilayer neural network [20]; he also showed that GRNNs can be applied to any regression problem in which an assumption of linearity is not justified [21]. In general, the structure of GRNNs and PNNs can be summarized as shown in Fig. 1:

Fig. 1. Neural network structures for (a) GRNN and (b) PNN. (Source: adapted from [20, 21])

In both cases, the input layer distributes the input to the neurons of the next, or pattern, layer. When the network receives a pattern \( x \), each pattern-layer neuron \( x_{ij} \) calculates its output, which is then passed to the summation units; these determine their output according to defined weights or relations. Finally, the result is used to estimate the classification (in the PNN case) or the continuous value (in the GRNN case) [20, 21].

One important characteristic of this type of network is that no iterative training is required; instead, the parameters are saved and used directly to make predictions. This makes it a computationally lightweight algorithm, which is significant when handling large amounts of data [22].
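To make this point concrete, here is a minimal, illustrative GRNN sketch in the spirit of Specht [21]: “training” amounts to storing the patterns, and prediction is a Gaussian-weighted average of the stored targets. The class name GRNNSketch and its sigma parameterization are our own assumptions, not the NeuPy implementation.

```python
import numpy as np

class GRNNSketch:
    """Minimal GRNN: stores training patterns; predicts via a Gaussian-weighted average."""

    def __init__(self, sigma=0.1):
        self.sigma = sigma  # smoothing parameter of the radial basis units

    def train(self, x, y):
        # No iterative optimization: the stored patterns themselves are the parameters.
        self.x, self.y = np.asarray(x, float), np.asarray(y, float)

    def predict(self, x_new):
        x_new = np.asarray(x_new, float)
        # Squared distances between each query point and each stored pattern.
        d2 = (x_new[:, None] - self.x[None, :]) ** 2
        w = np.exp(-d2 / (2.0 * self.sigma ** 2))
        # Weighted average of stored targets (assumes queries lie near the training range).
        return (w * self.y).sum(axis=1) / w.sum(axis=1)
```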

4 Combining KDE and RBFBNNs

Figure 2 gives an overview of the proposed method. It starts by estimating the shape of the sample data’s PDF using KDE. Here we used the Epanechnikov weighting function and established an appropriate bandwidth for each data set using a local search procedure, starting from the reference bandwidth value proposed by Silverman [10]. Once the PDF’s shape has been estimated, we compute the CDF’s shape by numerically approximating the area under the PDF curve with a trapezoidal Riemann sum. While the analytical CDF must reach a maximum value of exactly one, for the estimated CDF the closer its maximum value is to one, the better the estimate; a sketch of this step follows.
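As a minimal illustration, assuming the grid of evaluation points and the KDE values pdf_hat from the sketch in Sect. 2, the cumulative trapezoidal sum can be computed as follows (the function name cdf_from_pdf is our own):

```python
import numpy as np

def cdf_from_pdf(grid, pdf_hat):
    """Cumulative trapezoidal approximation of the CDF from sampled PDF values."""
    # Area of each trapezoid between consecutive grid points.
    areas = 0.5 * (pdf_hat[1:] + pdf_hat[:-1]) * np.diff(grid)
    return np.concatenate(([0.0], np.cumsum(areas)))

cdf_hat = cdf_from_pdf(grid, pdf_hat)
# Sanity check: the final value of the estimated CDF should be close to one.
print(cdf_hat[-1])
```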

Fig. 2. Overview of the proposed method. (Source: the authors)

The computed points from the CDF estimation are used to train a GRNN (for continuous variables) or a PNN (for discrete variables). This produces a model that enables random values to be generated according to the same distribution as the sample data via an inverse-transform procedure in which the RBFBNN replaces the analytic CDF.
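A hedged sketch of this generation step, reusing the illustrative GRNNSketch class from Sect. 3 and the grid and cdf_hat arrays from the earlier sketches (with NeuPy, algorithms.GRNN would play the same role): the network is trained on (CDF value, x) pairs, i.e., on the inverse of the estimated CDF, so that uniform random inputs yield values following the target distribution.

```python
import numpy as np

# Learn the inverse CDF: map estimated CDF values back to the corresponding x values.
inv_cdf = GRNNSketch(sigma=0.05)  # sigma chosen by hand for illustration
inv_cdf.train(cdf_hat, grid)

# Inverse-transform sampling: uniform inputs become values with the target distribution.
rng = np.random.default_rng(1)
u = rng.uniform(0.0, 1.0, size=10_000)
generated = inv_cdf.predict(u)
```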

5 Evaluation

For this study, a computer with an Intel Core i7 2.60 GHz processor and 8 GB of RAM was used. All the calculations were carried out using Python 3.2.6.

The GRNNs and PNNs, implemented using the NeuPy library, were trained on 70% of the input data set, with the remaining 30% reserved for the subsequent validation step. For this evaluation, we used the learning curve algorithm from the scikit-learn library. This method evaluates the neural network’s accuracy by varying the size of the training data set and performing repeated cross-validation, preserving the 70/30 split for each training subset and using the R2 metric.
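This evaluation can be sketched with scikit-learn’s learning_curve as below. Because wrapping a NeuPy network in scikit-learn’s estimator API is boilerplate we omit here, a kernel ridge regressor with an RBF kernel stands in for the GRNN; the gamma value and the grid/cdf_hat arrays are illustrative assumptions carried over from the earlier sketches.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import ShuffleSplit, learning_curve

X = cdf_hat.reshape(-1, 1)  # inputs: estimated CDF values
y = grid                    # targets: the corresponding x values

# Repeated 70/30 splits, scored with R^2, over growing training-set sizes.
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    KernelRidge(kernel="rbf", gamma=50.0),  # stand-in for the GRNN
    X, y, cv=cv, scoring="r2",
    train_sizes=np.linspace(0.1, 1.0, 8),
)
print(sizes, val_scores.mean(axis=1))
```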

To evaluate the total error of our model (Eq. (2)), we used three probability distributions with known CDFs, so that their inverse transforms could be computed analytically for a given set of uniform random values; this enabled us to calculate the errors between the analytic values obtained from the inverse transformations and the values generated by the RBFBNNs. The mixture-of-normal-distributions data set was used to illustrate the applicability of the proposed model to generating random variates from unknown distributions.

$$ \text{Total error} = \text{KDE error} + \text{Riemann sum error} + \text{RBFBNN error} $$
(2)
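For the known distributions, the comparison underlying Eq. (2) and the metrics reported in Sect. 6 can be computed along the following lines. This sketch uses the exponential case with an assumed scale of 1.0, reusing the u values and the hypothetical inv_cdf model from the earlier sketches.

```python
from scipy import stats
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Analytic inverse transform for an exponential distribution (scale=1.0 assumed).
analytic = stats.expon.ppf(u, scale=1.0)
generated = inv_cdf.predict(u)

print("MSE:", mean_squared_error(analytic, generated))
print("MAE:", mean_absolute_error(analytic, generated))
print("R2:", r2_score(analytic, generated))
print("Explained variance:", explained_variance_score(analytic, generated))
```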

5.1 Data Sets Used

To evaluate the proposed model, we used four different data sets of 600 samples each, generated using Python’s SciPy library. The first three data sets used the Poisson, triangular, and exponential distributions, while the fourth was a mixture of three different normal distributions contributing 200 samples each. Table 1 lists the parameters used for each distribution, given according to SciPy’s nomenclature.

Table 1. Details of the data sets used.
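The data sets can be regenerated along the following lines with SciPy; note that the parameter values shown are illustrative placeholders, not the actual values from Table 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

poisson_data = stats.poisson.rvs(mu=5, size=600, random_state=rng)      # placeholder mu
triangular_data = stats.triang.rvs(c=0.5, loc=0, scale=10, size=600,
                                   random_state=rng)                    # placeholder c, loc, scale
exponential_data = stats.expon.rvs(loc=0, scale=1.0, size=600,
                                   random_state=rng)                    # placeholder scale

# Mixture of three normals, 200 samples each; means and scales are placeholders.
mixture_data = np.concatenate([
    stats.norm.rvs(loc=m, scale=s, size=200, random_state=rng)
    for m, s in [(-5, 1.0), (0, 0.8), (5, 1.2)]
])
```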

6 Results and Discussion

6.1 PDFs and CDFs Estimations

Figure 3 shows histograms of the 600 samples from each probability distribution, together with the estimated PDFs and corresponding CDFs.

Fig. 3. KDE of the PDFs and CDFs for the (a) Poisson, (b) triangular, (c) exponential, and (d) mixed normal distribution data sets.

The bandwidths computed with the local search procedure, shown in Table 2, allow a good fit for the PDF and CDF, especially for the Poisson and mixture-of-normals distributions, where the smoothed curves closely follow the histogram representations. In the case of the triangular and exponential distributions, the KDE exhibits less smooth shapes than the corresponding histograms, which is evidence of the sensitivity of the estimated PDF’s shape to the histogram’s class width.

Table 2. Overall errors for the proposed method on each of the known distributions.

Meanwhile, the learning curves shown in Fig. 4 for each data set reflect the impact of varying the training data on the RBFBNN’s R2. They allow us to establish that, for the triangular and Poisson distributions, a training set of approximately 150 samples was sufficient for a good fit, while for the exponential distribution the R2 continued to increase considerably beyond 300 samples. This means that distributions containing extreme values with a low probability of occurrence require a larger number of samples for good CDF estimation.

Fig. 4. Learning curves and estimated PDFs for the (a) Poisson, (b) triangular, (c) exponential, and (d) mixture of normal distributions data sets.

Figure 4 also shows the histograms and estimated PDFs for 10,000 new random values generated by each of the RBFBNN models, suggesting graphically that the original and generated data follow the same distributions.

6.2 Precision of the Method

We determined the method’s overall precision by considering the mean squared error (MSE), mean absolute error (MAE), R2, and explained variance, calculated by comparing the analytic results for each of the known probability distributions with the KDE-based CDFs and the values generated by the RBFBNNs, as discussed in Sect. 5.

The results shown in Table 2 indicate that the accuracy was generally good for the Poisson and triangular distributions, but the correlation and explained variance were notably lower for the exponential distribution. This is probably because the exponential distribution includes extreme values that are unlikely to appear in the training data and therefore do not feature in the estimated PDFs and CDFs, weakening the GRNN’s and PNN’s ability to generate accurate results.

In the case of the mixture-of-normal-distributions data set, where the analytical CDF is unknown, we only analyzed the fit between the KDE of the original data set and the histogram of 10,000 new random values generated using the proposed method, finding a good fit, as shown in Fig. 4.

7 Conclusions

In this paper, we have proposed a nonparametric model for generating random variable values that has considerable advantages in terms of reducing the amount of preparatory and statistical analysis work required to represent the stochasticity of real phenomena in computer simulations. The integration of KDE and RBFBNNs enables us to replicate both known and unknown random variate distributions without the need for goodness-of-fit test procedures, improving the model’s applicability when the data do not conform to known probability density functions, as shown above for the mixture of normal distributions.

One of the main weaknesses of our model is the strict need to establish a suitable KDE bandwidth in order to prevent errors from propagating to the neural network training process and thus guarantee that the CDF curves are well adjusted; the local search procedure used to establish an adequate bandwidth value carries an associated computational cost. This weakness is especially evident for distributions in which unlikely values will generally not appear in the sample data, and thus will not be reflected in the KDE or the RBFBNN’s predictions, meaning that a greater amount of data is needed for training.