1 Introduction

Yellow rust (YR) is one of the most important epidemic diseases of wheat and can cause significant yield losses at a global scale [5, 9]. In 2002, over 6.7 million hm² of wheat in China was infected by YR, resulting in a production loss of around 10 billion kg [4]. Predicting YR effectively at an early stage is of great importance, since it provides critical information that allows agricultural plant protection departments to issue timely spray recommendations. A series of studies around the world have forecast YR over long time horizons based on meteorological and agronomic data. Hu et al. built a BP neural network to predict YR in Hanzhong city, Shaanxi Province, and the forecasts were highly consistent with the actual situation [3]. Chen et al. predicted YR severity at a seasonal time step in both Maerkang county and Tianshui city using discriminant analysis, with rewind accuracy and cross-validation accuracy greater than 78% [6]. Coakley et al. developed an improved method to predict YR [27]. Wang et al. (2012) developed a stable neural network for predicting YR [31].

To date, few attempts have been made to forecast YR at a regional scale with a short time step (7 days). Instead, efforts have focused on forecasting seasonal YR severity using spore count data and meteorological observations. These models can achieve high accuracy at a local site. However, in most of the regions studied, Puccinia striiformis can survive through winter, and it is difficult to apply these models in regions where spore count data are not available. Considering that YR is a multi-cycle disease distributed over large areas of the world, it is necessary to develop a multi-temporal YR forecasting model at a large spatial scale. Such forecasting models are currently lacking.

Several critical weather factors associated with the occurrence of YR on winter wheat have been reported, namely temperature (T), humidity (H), precipitation (P) and sunshine (S) [30]. Relating YR occurrence to these meteorological factors is an important step in building a YR forecasting model. A Bayesian network is a probabilistic graphical model based on probability and statistics theory. Its characteristics include a rigorous reasoning process, clear semantic expression and the ability to learn from data, which make it an efficient method for uncertainty reasoning and data analysis [15]. It has been widely used in many fields since the 1980s. In this study, Gansu province, a typical wheat planting region in China that suffers from YR, was selected as the study area. Based on continuous YR field survey data and the corresponding meteorological data from 2010 to 2012, the potential of the Bayesian network for disease forecasting was examined, and a YR forecasting model was developed to facilitate disease management at a regional scale.

2 Materials and Methods

2.1 Yellow Rust Survey Data

The YR survey data were collected by the Gansu Provincial Protection Station. From 2010 to 2012, a weekly field survey was conducted across the southern part of Gansu province (Fig. 1). The climate of the study region is characterized by high humidity and rainfall, and YR occurs almost every year. The survey data include the initial date of disease occurrence and the infected area. A total of 45, 18 and 47 sites were surveyed in 2010, 2011 and 2012, respectively. The distribution of the survey sites is shown in Fig. 1. The survey ran from the beginning of March to the end of July in each year. For model calibration and validation, the survey data of each year were randomly split into 60% and 40% subsets, respectively.

Fig. 1. Distribution of YR survey sites and meteorological stations in the study area

2.2 Meteorological Data

In this study, following Cooke [9], four meteorological factors were chosen as input variables: average temperature, average humidity, precipitation and sunshine duration. Daily data for these factors from a total of 54 weather stations around the study area were acquired from the Chinese Meteorological Data Sharing Service System. In each year, the data cover the period from one week before YR occurrence in spring (based on the survey data) to the mature stage. The meteorological data were processed in three steps: removal of abnormal values, weekly averaging of each factor, and interpolation of each factor to a resolution of 30 m × 30 m. Because some meteorological factors are strongly related to altitude, DEM (Digital Elevation Model) data were used to adjust the spatial maps of these factors by interpolating the fitted residuals across the region [11, 14]. To choose the interpolation method, the normality of the distribution of each factor was examined with the Kolmogorov-Smirnov test. Factors with a P-value > 0.05 were interpolated by kriging; otherwise, inverse distance weighting was used.
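Two of the processing steps above, weekly averaging and inverse distance weighted (IDW) interpolation, can be sketched in a few lines. This is a minimal illustration only: the function names `weekly_means` and `idw` are hypothetical, and the distance exponent of 2 is an assumption that the text does not specify. The removal of abnormal values, the elevation-based residual adjustment and the kriging branch are not covered.

```python
import math


def weekly_means(daily, days_per_week=7):
    """Average a daily series into consecutive 7-day means.

    A trailing partial week is dropped.
    """
    n_weeks = len(daily) // days_per_week
    return [
        sum(daily[k * days_per_week:(k + 1) * days_per_week]) / days_per_week
        for k in range(n_weeks)
    ]


def idw(stations, values, target, power=2.0):
    """Inverse distance weighted estimate at `target` from station values.

    stations: list of (x, y) station coordinates
    values:   one observed value per station
    target:   (x, y) location of the grid cell to estimate
    power:    distance exponent (2.0 is an assumed, commonly used default)
    """
    numerator = 0.0
    denominator = 0.0
    for (sx, sy), value in zip(stations, values):
        distance = math.hypot(target[0] - sx, target[1] - sy)
        if distance == 0.0:
            # The grid cell coincides with a station: return its value.
            return float(value)
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator
```

In a full workflow, `idw` would be evaluated once per grid cell and per week for each factor that fails the normality test.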

2.3 Yellow Rust Forecast Based on Bayesian Network

2.3.1 The Bayesian Network Theory

Suppose there is a finite set X = {X1, X2, …, Xn} of discrete random variables, where each variable Xi takes values from a finite set denoted by Val(Xi). We use capital letters such as X to denote variables and lower-case letters such as x to denote specific values taken by those variables. A Bayesian network for X is a pair B = <G, Θ>. The first component, G, is a directed acyclic graph whose vertices correspond to the random variables X1, X2, …, Xn, and whose edges represent direct dependencies between the variables. The second component, Θ, is the set of parameters that quantify the network, i.e., the conditional probability of each variable given its parents in G.

As an example, let X = {X1, X2, …, Xn, C}, where the variables X1, X2, …, Xn are the attributes and C is the class variable. The graph structure of this example is shown in Fig. 2. Given an observed instance D = {x1, x2, …, xn} of the attributes and the set of class values C, Bayes' theorem gives the most likely class as the one that maximizes the posterior probability [28]:

Fig. 2. An example of a Bayesian network

$$ c(D) = \mathop {\arg \hbox{max} }\limits_{c \in C} p(c|D) = \mathop {\arg \hbox{max} }\limits_{c \in C} \frac{p(D|c)p(c)}{p(D)} $$
(1)

where p(D) does not depend on c and can be treated as a constant, so formula (1) can be written as:

$$ c(D) = \mathop {\arg \hbox{max} }\limits_{c \in C} p(D|c)p(c) $$
(2)

Based on the multiplication rule of probability, p(D|c) can be expressed as:

$$ \begin{aligned} p(D|c) &= p(x_{1} |c)\,p(x_{2} |x_{1} ,c) \cdots p(x_{n} |x_{1} ,x_{2} , \cdots ,x_{n - 1} ,c) \\ &= \prod\limits_{i = 1}^{n} {p(x_{i} |x_{1} ,x_{2} , \cdots ,x_{i - 1} ,c)} \end{aligned} $$
(3)

For each xi, suppose there is a set π(xi) ⊆ {x1, …, xi−1} such that xi is conditionally independent of the other variables in {x1, …, xi−1} given π(xi) and c. Then formula (2) takes the form of formula (4), which is the classification formula of the Bayesian network.

$$ c(x) = \mathop {\arg \hbox{max} }\limits_{c \in C} p(c)\prod\limits_{i = 1}^{n} {p(x_{i} |\pi (x_{i} ),c)} $$
(4)
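A minimal sketch of how formula (4) could be evaluated is given below. The function name `classify`, the dictionary layout of the conditional probability tables and all probability values are illustrative assumptions, not taken from this study. Logarithms are summed instead of multiplying probabilities, which leaves the arg max unchanged and avoids numerical underflow.

```python
import math


def classify(prior, cpts, parents, x):
    """Return the class maximizing p(c) * prod_i p(x_i | pi(x_i), c).

    prior:   {class: p(class)}
    cpts:    {attribute: {(class, parent_values, value): probability}}
    parents: {attribute: tuple of parent attribute names, class excluded}
    x:       {attribute: observed value}
    """
    best_class, best_score = None, -math.inf
    for c, p_c in prior.items():
        score = math.log(p_c)
        for attribute, value in x.items():
            parent_values = tuple(x[p] for p in parents[attribute])
            score += math.log(cpts[attribute][(c, parent_values, value)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy network with made-up numbers: "humid" has "rain" as an extra parent.
prior = {"infected": 0.3, "healthy": 0.7}
parents = {"rain": (), "humid": ("rain",)}
cpts = {
    "rain": {
        ("infected", (), "wet"): 0.8, ("infected", (), "dry"): 0.2,
        ("healthy", (), "wet"): 0.3, ("healthy", (), "dry"): 0.7,
    },
    "humid": {
        ("infected", ("wet",), "high"): 0.9, ("infected", ("wet",), "low"): 0.1,
        ("infected", ("dry",), "high"): 0.4, ("infected", ("dry",), "low"): 0.6,
        ("healthy", ("wet",), "high"): 0.6, ("healthy", ("wet",), "low"): 0.4,
        ("healthy", ("dry",), "high"): 0.2, ("healthy", ("dry",), "low"): 0.8,
    },
}

wet_case = classify(prior, cpts, parents, {"rain": "wet", "humid": "high"})
dry_case = classify(prior, cpts, parents, {"rain": "dry", "humid": "low"})
```

For the wet and humid instance the unnormalized scores are 0.3 × 0.8 × 0.9 = 0.216 for "infected" and 0.7 × 0.3 × 0.6 = 0.126 for "healthy", so "infected" is selected.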

2.3.2 Development of Bayesian Network

In this study, a Bayesian network model was developed to forecast YR using not only the four meteorological factors mentioned above but also the growth period, because the growth period has a significant impact on the probability of disease occurrence. In addition, the physical relationships between precipitation and humidity, and between precipitation and sunshine duration, were taken into account. The resulting structure of the Bayesian network is illustrated in Fig. 3.

Fig. 3. Bayesian network structure

In this Bayesian network, W represents the status of YR occurrence, which is a binary variable (w1 = healthy, w0 = YR infected), and G is the growth stage (1 = reviving stage, 2 = jointing stage, 3 = heading stage, 4 = milk stage). T, P, H and S denote average temperature, precipitation, average humidity and sunshine duration, respectively. Each of them is divided into six grades following de Vallavieille-Pope and Cooke, among others [9, 18]. The value range of each grade for all weather factors is given in Table 1.

Table 1. Grade of meteorological factor

As the YR field surveys were conducted on a weekly basis, the meteorological data were also processed per week. To account for a possible lag effect, the independent variables were prepared starting one week before the initial YR field survey date. The conditional probabilities were calculated with the Laplace estimate to avoid zero probabilities for combinations that do not occur in the sample. The estimates are given in Eqs. (5)–(7) [16].

$$ p(w) = \frac{\sum\limits_{i = 1}^{n} \delta (w_{i} ,w) + 1}{n + n_{w}} $$
(5)
$$ p(a_{j} |w,b) = \frac{\sum\limits_{i = 1}^{n} \delta (a_{ij} ,a_{j} )\delta (w_{i} ,w)\delta (b_{i} ,b) + 1}{\sum\limits_{i = 1}^{n} \delta (w_{i} ,w)\delta (b_{i} ,b) + n_{j}} $$
(6)
$$ p(a_{j} |w) = \frac{\sum\limits_{i = 1}^{n} \delta (a_{ij} ,a_{j} )\delta (w_{i} ,w) + 1}{\sum\limits_{i = 1}^{n} \delta (w_{i} ,w) + n_{j}} $$
(7)

where n is the number of samples, nw is the number of classes, nj is the number of values that the jth independent variable can take, wi is the actual class value of the ith sample, aj is a value of the jth independent variable, and aij is the value of the jth independent variable in the ith sample. In Eq. (6), b is a value of the additional parent node of the jth variable and bi is the value of that parent node in the ith sample. δ(wi, w) is an indicator function that equals 1 when wi = w and 0 otherwise.
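The Laplace estimates in Eqs. (5)–(7) amount to counting matching samples and adding one to the numerator and the number of possible values to the denominator. The sketch below illustrates this on five made-up samples; the function names, the grade labels (`h4`, `p3`, and so on) and the data are hypothetical and serve only to show the arithmetic.

```python
def laplace_prior(classes, w, n_classes):
    """Eq. (5): (number of samples in class w + 1) / (n + n_w)."""
    count_w = sum(1 for wi in classes if wi == w)
    return (count_w + 1) / (len(classes) + n_classes)


def laplace_conditional(values, classes, a, w, n_values, parent=None, b=None):
    """Eq. (7) when `parent` is None, Eq. (6) when a parent column is given.

    values:   value of the attribute in each sample (a_ij)
    classes:  class of each sample (w_i)
    a, w:     attribute value and class being queried
    n_values: number of values the attribute can take (n_j)
    parent:   optional values of the extra parent node in each sample (b_i)
    b:        queried value of the extra parent node
    """
    if parent is None:
        # No extra parent: every sample passes the parent filter.
        parent_ok = [True] * len(values)
    else:
        parent_ok = [bi == b for bi in parent]
    selected = [wi == w and ok for wi, ok in zip(classes, parent_ok)]
    numerator = sum(1 for ai, sel in zip(values, selected) if sel and ai == a) + 1
    denominator = sum(selected) + n_values
    return numerator / denominator


# Five illustrative samples (not survey data); humidity has six grades.
classes = ["infected", "infected", "infected", "healthy", "healthy"]
humidity = ["h4", "h5", "h4", "h2", "h2"]
rain = ["p3", "p3", "p2", "p1", "p1"]

p_infected = laplace_prior(classes, "infected", n_classes=2)          # (3 + 1) / (5 + 2)
p_h4 = laplace_conditional(humidity, classes, "h4", "infected", 6)    # (2 + 1) / (3 + 6)
p_h6 = laplace_conditional(humidity, classes, "h6", "infected", 6)    # (0 + 1) / (3 + 6)
p_h4_p3 = laplace_conditional(humidity, classes, "h4", "infected", 6,
                              parent=rain, b="p3")                     # (1 + 1) / (2 + 6)
```

The grade `h6` never occurs among the infected samples, yet its estimate is 1/9 rather than zero, which is the purpose of the correction.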

The most probable YR occurrence status is obtained by maximizing the posterior probability:

$$ w(x) = \mathop {\arg \hbox{max} }\limits_{w \in \{ w_{1} ,w_{0} \} } p(w)\prod\limits_{i = 1}^{5} {p(x_{i} |\pi (x_{i} ),w)} $$
(8)
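As a worked instance of Eq. (8), the product can be written out factor by factor. The expansion below assumes the structure implied by Sect. 2.3.2 and the captions of Figs. 4 and 5, namely that W is a parent of all five attribute nodes and that P is an additional parent of H and S. The lower-case symbols g, t, r, h and s for the observed values of G, T, P, H and S are notation introduced here for illustration.

$$ w(x) = \mathop {\arg \hbox{max} }\limits_{w \in \{ w_{1} ,w_{0} \} } p(w)\,p(g|w)\,p(t|w)\,p(r|w)\,p(h|r,w)\,p(s|r,w) $$

Under this reading, the factors p(h|r, w) and p(s|r, w) would be estimated with Eq. (6) and the remaining conditional factors with Eq. (7).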

2.3.3 Evaluation of Disease Forecast Model

Based on the posterior probability generated by the forecasting model, a threshold is applied to convert the forecast probability into a disease occurrence status. A sample is marked as healthy when its probability is smaller than the threshold; otherwise, it is classified as YR infected. To obtain the optimal threshold, we calculated the model accuracy under thresholds varying from 0 to 1 with a step of 0.05 and selected the threshold at which the highest accuracy was reached. We also compared the performance of the Bayesian network with that of two classic methods, a BP neural network and Fisher linear discriminant analysis (FLDA).
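The threshold search described above can be sketched as a simple sweep. The function names are hypothetical, the probabilities and labels are made up for illustration, and the tie-breaking rule (keeping the lowest of several equally accurate thresholds) is an assumption, since the text does not say how ties are resolved.

```python
def accuracy(probs, labels, threshold):
    """Share of samples whose thresholded forecast matches the survey label.

    labels use 1 for YR infected and 0 for healthy; a sample is forecast as
    infected when its probability is not smaller than the threshold.
    """
    hits = sum(
        1 for p, y in zip(probs, labels)
        if (1 if p >= threshold else 0) == y
    )
    return hits / len(labels)


def best_threshold(probs, labels, step=0.05):
    """Sweep thresholds from 0 to 1 and return the most accurate one.

    Ties are resolved in favour of the lowest threshold (an assumption).
    """
    n_steps = round(1 / step)
    # Rounding removes floating point noise such as 0.15000000000000002.
    candidates = [round(k * step, 10) for k in range(n_steps + 1)]
    return max(candidates, key=lambda t: accuracy(probs, labels, t))


# Illustrative forecast probabilities and survey labels (not study data).
probs = [0.10, 0.20, 0.42, 0.55, 0.60, 0.90]
labels = [0, 0, 0, 1, 1, 1]
chosen = best_threshold(probs, labels)
```

With these toy values every threshold from 0.45 to 0.55 separates the two groups perfectly, and the sweep returns the lowest of them.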

3 Results and Discussion

In the Bayesian network, the conditional probability distribution of each node was calculated with formulas (5)–(7) (Figs. 4 and 5). In Fig. 4, for infected sites, the conditional probability of humidity in grades h4 and h5 increases with increasing precipitation, whereas the conditional probability distribution of sunshine duration is relatively uniform. In Fig. 5a, for infected survey sites, the conditional probabilities of T, H, P and S show similar trends, each approximating a Gaussian distribution. In Fig. 5b, the conditional probability of the growth stage rises as the season progresses. This result agrees with the findings of Cooke [9].

Fig. 4. Conditional probability distribution of the attribute nodes with more than one parent node. a. The conditional probability of humidity under different rainfall grades when YR occurred, b. The conditional probability of sunshine duration under different rainfall grades when YR occurred.

Fig. 5. Conditional probability distribution of the nodes with a single parent. a. The conditional probability of the meteorological factors, b. The conditional probability of the growth period

In this paper, we developed a Bayesian network for YR forecasting with four weather factors and one phenological variable to model the probability of YR infection one week in advance. The output of the Bayesian network model is a posterior probability. The forecast probability of YR occurrence was compared with the number of actually infected sites according to the survey data (Fig. 6). Both the number of infected sites and the forecast YR probability showed an increasing trend over time (from reviving stage to milk stage). Figure 7 shows the spatial distribution of both the forecasts and the ground truth. YR first appeared in the southeast of the study area at an early stage (reviving stage). Another YR occurrence was then observed in the central part of the study area in early April. After a period of spread, most surveyed sites across the study area were identified as infected by the middle of June. This spatial trend was well captured by the developed Bayesian network (Fig. 7).

Fig. 6. Forecast probability and infected sites: a. Trend of the forecast probability, b. Trend of the number of actually infected sites

Fig. 7. Forecasting of YR and ground-truth distribution from 2010 to 2012. a. The distribution of prediction probability on March 1, 2010. b. The distribution of prediction probability on April 12, 2010. c. The distribution of prediction probability on May 17, 2010. d. The distribution of prediction probability on June 14, 2010.

Through the threshold optimization described in Sect. 2.3.3, a probability of 0.4 was used to convert the forecast probability into a binary disease occurrence result. Table 2 summarizes the forecast results of the Bayesian network, the BP neural network and FLDA. The results suggest that the Bayesian network and FLDA produced more accurate forecasts than the BP neural network. Between the two, the Bayesian network outperformed FLDA at both the heading stage and the milk stage, which are important time points for prevention.

Table 2. Accuracy indices of three tested methods

4 Conclusions

In this paper, a Bayesian network was successfully used to develop a forecast model of YR occurrence probability over a large area. The performance of the model was evaluated against weekly survey data collected during the key growth stages of wheat from 2010 to 2012. The results confirmed that the forecasts reflect the spatio-temporal development and distribution pattern of YR. Furthermore, the superior performance of the Bayesian network compared with the BP neural network and FLDA demonstrates that the Bayesian network has great potential for forecasting crop diseases at a regional scale.