Keywords

1 Introduction

Rice is one of the most important grain crops in the world; more than half of the planet's population regards rice as its staple food [1]. The cultivation of new rice varieties is therefore very important for human beings. Cross breeding is still the most popular and effective rice breeding method today [2], and parental selection is the key to successful cross breeding. Through long-term practice, scientists have identified the key principles for selecting parental varieties in crop cross breeding: the selected variety must have the needed target traits; the parental varieties must have no serious defects; at least one parent should be a variety proven good locally; the selected parent should carry a dominant marker trait; and the selected varieties should be able to produce fertile progeny, avoiding hybrid sterility. These five principles are the theoretical basis of parental variety selection in cross breeding.

There has been little research on decision support for parental variety selection in rice cross breeding. Lifu Jiang used a feed-forward neural network (FFNN) and an orthogonal genetic algorithm to forecast the yield and traits of rice hybrids [3]. Dingchun Yan used a knowledge model constructed from natural resource conditions to select suitable planting locations [4]. Using statistical methods, Yuliang Qi found a significant positive correlation between hybrid F1 yield and effective spikes per plant (and grains per panicle), and put forward principles for choosing between subspecies [5]. Since the beginning of this century, molecular breeding has gradually become the main breeding method. Marker-assisted selection (MAS) locates target genes by means of molecular markers in donor and receptor material, and predicts the breeding values of parental varieties by combining phenotype and marker information [6]. Genome-wide selection (GWS) was developed on the basis of MAS; with the ridge regression best linear unbiased prediction method (RR-BLUP), its accuracy is 18%–43% higher than that of MAS [7]. In actual hybrid rice breeding, parent selection should not depend on genomics completely, because that leads to artificial exaggeration of target traits and distorts the judgment of genetic stability. The trend in recent years is to combine genomics and phenomics, mine patterns in traits and genes, and find varieties with higher combining ability and heritability [8], so as to provide an accurate basis for rice breeding. This study collects rice parental varieties and related phenotypic and economic trait data, mines the data with data exploration and machine learning techniques, determines how biological traits affect economic traits, and identifies suitable methods for rapidly screening parental varieties, thereby providing an operational tool for parental variety matching.
The core problem of the study is how to find the relationships between rice phenotypic traits and economic traits, discover useful patterns and knowledge, and provide a decision support tool for rice breeding.

2 Data Process and Analysis

2.1 Data Collecting and Processing

The data used in the study comes from the rice genetic resources characteristic evaluation database of the National Agriculture Science data sharing center, the outstanding rice germplasm repository, the crop variety examination and approval database, the rice germplasm database, the rice bred varieties and pedigree database, the main crop seed vigor monitoring database, the main crop variety regional experiment database, and approved rice data from the national and provincial rice data centers. Considering the differing data quality, missing features, the analysis subject and the data integration requirements, 19 features were selected during preprocessing: variety, effective panicle, plant height, spike length, total grain number, filled grain number, seed setting rate, thousand-kernel weight, grain length, length-width ratio, brown rice percentage, milled rice percentage, head rice rate, chalky rice percentage, chalkiness degree, gel consistency, amylose content, whole growth period and actual yield; the data were then filtered by quality. Finally, 147 male parent records, 133 female parent records and 134 new variety records were selected.

2.2 Visual Data Exploring

Visual data exploration presents data to decision makers as interactive graphics to enhance data browsing and analysis [9], helping them understand datasets deeply and form hypotheses; it is in fact the basis for further data mining and analytics [10].

Because of the limits of human vision, data exploration usually handles only univariate, bivariate and trivariate data. Univariate analysis is mostly used to observe the distribution of the samples, while multivariate analysis is used to discover interactions and dependencies. Data of more than three dimensions is hard to visualize with conventional charts; it must be reduced to lower dimensions or visualized by other means. Common univariate exploration statistics include counts, percentages, variance, standard deviation, mean, median, skewness and kurtosis, visualized with histograms, box plots, pie charts, curves and line charts. Common multivariate methods include the Z test, t test, chi-square test, covariance and regression, visualized with scatter plots, stacked histograms and combinations of several charts.

The system developed for this project generates interactive histograms and ridge regressions for any two features, to explore possible correlations among the features.

Ridge regression is a biased-estimation regression method for collinear data; it is essentially an improved least-squares estimation. The objective function of ridge regression is:

$$ \mathop {\hbox{min} }\limits_{\omega } \left\| {X\omega - y} \right\|_{2}^{2} + \alpha \left\| \omega \right\|_{2}^{2} $$

Ridge regression obtains higher numerical stability at the cost of unbiasedness, and thereby higher computational accuracy; the algorithm is of particular practical value for collinear problems and data with substantial error.
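As an illustrative sketch (with synthetic stand-in data, not the study's breeding records), a bivariate ridge fit of the kind the system performs can be written with scikit-learn's Ridge estimator, whose alpha parameter is the α in the objective above:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(100.0, 10.0, size=(50, 1))      # hypothetical trait, e.g. plant height (cm)
y = 0.8 * x[:, 0] + rng.normal(0.0, 5.0, 50)   # a second, correlated trait

# alpha is the regularization strength in min_w ||Xw - y||^2 + alpha ||w||^2
model = Ridge(alpha=1.0).fit(x, y)
print(model.coef_[0], model.intercept_)        # fitted slope and intercept
```

Plotting x against y together with the fitted line gives exactly the kind of pairwise view shown in Fig. 1.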

The system observes the distribution of every feature by drawing its histogram, and finds possible correlations between any two features through ridge regression. As shown in Fig. 1, the two features there have an obvious linear correlation.

Fig. 1. Binary analysis of rice breeding data

2.3 Mining for Rice Data

The data mining algorithms applied to the rice data in this study are the CART (Classification and Regression Tree) algorithm and spectral biclustering.

2.3.1 Decision Tree

A decision tree, also called a judgment tree, is a flow-chart-like tree structure in which each internal node has two or more branches [11]. There are numerous decision tree implementations, such as ID3, C4.5, CART, SLIQ and SPRINT. The CART algorithm is used in this project. CART uses binary recursive partitioning: it always divides the sample set into two subsets, and the criterion determining each split is the Gini coefficient (also called Gini impurity): \( gini(T) = 1 - \sum {p_{j}^{2} } \), where pj is the probability that a sample belongs to class j. All samples belong to a single class when gini(T) = 0; when all C classes appear in a node with equal probability, gini(T) reaches its maximum value (C − 1)/C. Once the Gini coefficient of every candidate split is calculated, the Gini information gain can be computed and the best split chosen. In this study, the feature data are fed into CART, and common sense, rules and knowledge for rice breeding variety selection and judgment can be discovered.
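A minimal sketch of the Gini impurity and of a depth-limited CART fit, using scikit-learn's DecisionTreeClassifier on toy data (the features and the class rule below are hypothetical, not the rice dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini impurity gini(T) = 1 - sum_j p_j^2 of a node's class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))   # pure node -> 0.0
print(gini([0, 0, 1, 1]))   # two equiprobable classes -> (C - 1)/C = 0.5

# Fit a binary-recursive CART tree on toy trait data.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))                # 3 hypothetical phenotype features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical class rule
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
```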

2.3.2 Spectral BiCluster

Spectral biclustering builds on spectral clustering, a clustering method based on eigenvalue decomposition that is closely related to graph partitioning. Spectral clustering can handle clusters of arbitrary shape in the sample space and tends to converge to a near-global optimum.

Figure 2 shows the difference between K-means and spectral clustering. K-means first selects cluster centers randomly, assigns each sample point to the nearest center, then takes the mean of the assigned points as the new center, repeating until the centers converge. Spectral clustering first calculates the similarity matrix W with the Gaussian similarity function \( W_{ij} = e^{{ - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{{2\sigma^{2} }}}} \), then calculates the Laplacian matrix \( {\text{L}}\, (L\text{ = }D\text{ - }W ) \), where D is the degree matrix, a diagonal matrix whose entries are the row sums of W. The algorithm extracts the eigenvectors of the k smallest eigenvalues of L and stacks them into an n × k matrix, each row of which is a vector in k-dimensional space. It then clusters these rows with K-means; the cluster to which each row belongs is the cluster of the corresponding node in the graph, i.e. of the original data point.
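The four steps above can be sketched directly in code; the toy data below (two well-separated groups generated with scikit-learn's make_blobs) only illustrates the mechanics, not the rice data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: two well-separated groups, standing in for samples.
X, y_true = make_blobs(n_samples=100, centers=[[-3.0, 0.0], [3.0, 0.0]],
                       cluster_std=0.5, random_state=0)

# 1. Gaussian similarity matrix W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sigma = 1.0
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-d2 / (2.0 * sigma ** 2))

# 2. Laplacian L = D - W, where D is the diagonal degree matrix
#    holding the row sums of W.
D = np.diag(W.sum(axis=1))
L = D - W

# 3. Eigenvectors of the k smallest eigenvalues, stacked into an n x k matrix.
k = 2
_, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
U = eigvecs[:, :k]

# 4. K-means on the rows of U assigns each original point to a cluster.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```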

Fig. 2. Comparison between spectral clustering and K-means clustering (Source: Scikit-Learn official site)

Traditional clustering only clusters a data matrix in one direction (rows or columns), i.e. in a global pattern, so K-means finds only the global information of the dataset and drops much local partition information [12]. A biclustering algorithm clusters both the rows and the columns of the matrix, so it captures not only global information but also effective local structure in high-dimensional data [13,14,15,16,17].

Spectral biclustering is the integration of spectral clustering and biclustering. The algorithm assumes the input data matrix has a hidden checkerboard structure: the rows and the columns of the matrix can each be divided into clusters, such that the entries in the Cartesian product of any row cluster and column cluster are roughly constant. For instance, with a 2 × 3 checkerboard, each row of the matrix intersects three biclusters and each column intersects two. The algorithm re-partitions the rows and columns into subclusters so that the block-constant matrix corresponding to the partition approximates the original matrix as well as possible.
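A sketch of this checkerboard model using scikit-learn's SpectralBiclustering on synthetic checkerboard data (the shapes and cluster counts below are illustrative, not the study's parameters):

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard

# Synthetic matrix with a hidden 4 x 3 checkerboard structure.
data, rows, cols = make_checkerboard(
    shape=(60, 30), n_clusters=(4, 3), noise=1.0, shuffle=True, random_state=0
)

model = SpectralBiclustering(n_clusters=(4, 3), method="log", random_state=0)
model.fit(data)

# Every row (sample) and column (feature) is assigned to a subcluster;
# a bicluster is the Cartesian product of one row cluster and one column cluster.
print(np.unique(model.row_labels_))       # 4 row clusters
print(np.unique(model.column_labels_))    # 3 column clusters
row_idx, col_idx = model.get_indices(0)   # members of the first bicluster
```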

The spectral biclustering algorithm is used in this study to analyze rice breeding data and to find similar parent varieties that share subsets of excellent traits.

2.4 Training Strategy of the Algorithm

Because the samples collected in this study are limited, ten-fold cross-validation is adopted to obtain better training results. In each training run, the data is randomly divided into 10 even parts; each part is held out in turn while the other 9 parts are used for training, and the held-out part is used to calculate the error rate. The process is repeated 10 times, each time with a different held-out set, and a comprehensive error rate is finally calculated. Experience has shown that 10 folds is the best choice for obtaining an accurate error rate [18].
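The ten-fold procedure can be sketched with scikit-learn's cross_val_score; the toy regression data below is a stand-in for the breeding samples:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(140, 5))                   # ~140 samples, 5 hypothetical traits
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 140)   # hypothetical target trait

# Randomly split into 10 even parts; each part is held out in turn.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeRegressor(max_depth=3), X, y, cv=cv)
print(scores.mean())    # comprehensive score averaged over the 10 held-out folds
```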

3 Result

For convenient data processing and visualization, a system was developed in Python with the Django framework to support online analysis and mining of rice breeding data.

3.1 Result of CART

The maximum number of layers of the generated decision tree must be input before running the CART algorithm. Figures 3, 4 and 5 show the visual results under different layer parameters. Too high or too low a layer parameter can result in overfitting or underfitting, so users should adjust the parameter and inspect the result to get the best output. The system's default value is 5, which gives reasonable output in most situations.
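Assuming a scikit-learn-style estimator, the over- and underfitting effect of the layer (depth) parameter can be checked by cross-validating trees of several depths (toy data, illustrative only):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))                       # 4 hypothetical traits
y = X[:, 0] - X[:, 1] + rng.normal(0.0, 0.3, 120)   # hypothetical target

for depth in (3, 5, 10):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)      # held-out R^2 per fold
    print(depth, round(scores.mean(), 3))
```

Comparing the held-out scores across depths shows where the tree starts to underfit (too shallow) or overfit (too deep).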

Fig. 3. The 3-layer decision tree generated by the CART algorithm on the rice breeding data

Fig. 4. The 5-layer decision tree generated by the CART algorithm on the rice breeding data

Fig. 5. The 10-layer decision tree generated by the CART algorithm on the rice breeding data

X[N] (N = 1, 2, …, 18) denotes the features of the data: X[1] effective panicle, X[2] plant height, X[3] ear length, X[4] total grain number, X[5] filled grain number, X[6] seed setting rate, X[7] thousand-kernel weight, X[8] grain length, X[9] length-width ratio, X[10] brown rice percentage, X[11] head rice rate, X[12] whole head rice percentage, X[13] chalky rice rate, X[14] chalkiness degree, X[15] gel consistency, X[16] amylose content, X[17] whole growth period, X[18] actual yield.

For example, the following rules or common sense can be recognized from the decision tree shown in Fig. 4: there are 100 samples with seed setting rate > 20.6 and 4.85 ≤ thousand-seed weight ≤ 7.9, and 10 samples with seed setting rate ≤ 20.6, thousand-seed weight ≤ 8.25, chalky rice rate > 70 and polished rice rate ≤ 62.4%. Other samples are scattered across different rules, but the corresponding sample counts are small. Changing the number of layers of the decision tree may produce different results; for example, with the maximum layer set to 3, the result changes as follows: 104 samples with seed setting rate ≤ 20.6 and thousand-seed weight ≤ 8.25, and 18 samples with seed setting rate ≤ 20.6 and thousand-seed weight > 8.25. Each path from the root to a leaf of the tree therefore corresponds to a rule or a piece of common sense, i.e. a piece of knowledge or a pattern; effective knowledge and patterns can provide valuable references for variety breeding decisions.
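Rules of this root-to-leaf form can also be read off a fitted tree programmatically; a sketch with scikit-learn's export_text on toy data (the two feature names and thresholds are hypothetical echoes of the rules above, not the study's model):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 100.0, size=(80, 2))   # hypothetical: seed setting rate, kernel weight
y = ((X[:, 0] > 20.6) & (X[:, 1] <= 80.0)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["seed_rate", "kernel_weight"])
print(rules)   # each root-to-leaf path prints as one if-then rule
```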

3.2 Result of Spectral Bicluster

The numbers of row and column clusters must be input before running the spectral biclustering algorithm (Fig. 6).

Fig. 6. Visual output of spectral biclustering analysis. The number above the shape is the feature number; the number to its left is the sample number.

Tables 1, 2 and 3 show 3 of the biclusters generated by the algorithm with parameters rows = 8 and columns = 6. 48 biclusters were generated in this run, but some of them are meaningless (e.g. some contain only one column); different biclusters can be obtained by changing the row and column parameters. The value of the biclusters depends on the breeders' evaluation, but the algorithm provides a tool for trying different parameters and retraining the data at any time, until the breeder obtains satisfactory results.

Table 1. Analysis output by spectral BiCluster – data in BiCluster (1)
Table 2. Analysis output by spectral BiCluster – data in BiCluster (2)
Table 3. Analysis output by spectral BiCluster – data in BiCluster (3)

The data in these biclusters show that the samples within each bicluster are generally close to each other (in terms of inter-sample distance), and that the features extracted in each bicluster follow certain patterns. The results of the spectral biclustering algorithm are therefore groups of similar samples whose features follow certain patterns.

Applied to rice breeding, the algorithm can help breeders rapidly find similar samples and phenotype features in a large set of samples, and thus make better decisions.

4 Discussion

The study explores rice breeding data with univariate and bivariate analysis methods, helping breeders discover the influence of biological traits on economic traits and revealing hidden patterns, knowledge and rules in the data; through the spectral biclustering algorithm, it discovers groups of similar samples whose features follow certain patterns. All these techniques can improve the accuracy and efficiency of rice breeding.

The essence of mining rice breeding data is to reveal the patterns and knowledge hidden in the data, discover how phenotypic traits affect economic traits, and provide references for rice variety selection and breeding decisions.

5 Conclusion

Using data mining to analyze rice breeding data can help breeders discover the underlying patterns and knowledge hidden in the data, and these patterns and knowledge can help breeders make more accurate decisions rapidly. The study shows the method to be feasible, and the method is likely to become a major trend in the crop breeding area in the future.

However, from the perspective of current disciplinary development, this study needs further investigation. Follow-up studies should focus on the following two aspects:

(1) The data dimensions should be further enriched and the data integrity improved. The data used in this study did not include characteristics of the rice planting environment, such as altitude and planting area, yet these factors tend to have an important effect on the economic traits of rice. In addition, the amount of data available for this study needs to be expanded, and the quality of the data and the richness of its features improved, to ensure the effectiveness of the patterns and knowledge extracted from the data.

(2) Further research should be conducted in bioinformatics. With the development of life science and bioinformatics, phenomics has also been developing rapidly. Phenomics mainly studies the patterns of phenotypic traits, such as the physical and chemical properties of organisms, as they vary with mutation and environment [19]; it studies the phenotype system at the genome scale, hoping to explain the unknown functions of the genome [20]. Simple analysis of crop phenotype data at best reveals the relations between crop biological traits and economic traits; with genome data added in, the decisive relationships between specific genetic segments and specific crop traits can be excavated, making it possible to change a gene precisely so that the offspring carries good traits.

Therefore, future intelligent breeding decision technology must closely combine data analysis with bioinformatics. Concretely, this will be reflected in the collection of more detailed, rich, continuous and voluminous phenotypic data, combined with comparative mining of crop genomes, to provide methods and tools for accurate directional breeding.