Keywords

1 Introduction

Rice is one of the most important grain crops in the world; more than half of the planet's population regards rice as its staple food [1]. The cultivation of new rice varieties is therefore very important for human beings. Cross breeding is still the most popular and effective rice breeding method today [2], and parental selection is the key to successful cross breeding. Through long-term practice, scientists have identified the key principles for selecting parental varieties in crop cross breeding: the selected variety must have the needed target traits; the parental varieties must have no serious defects; at least one parent should be a variety proven good locally; the selected parent should carry a dominant marker trait; and the selected varieties should be able to produce fertile progeny, avoiding hybrid sterility. These five principles are the theoretical basis of parental variety selection in cross breeding.

There has been little research on decision support for parental variety selection in rice cross breeding. Lifu Jiang used a feed-forward neural network (FFNN) and an orthogonal genetic algorithm to forecast the yield and traits of rice hybrids [3]. Dingchun Yan used a knowledge model constructed from natural resource conditions to select suitable planting locations [4]. Using statistical methods, Yuliang Qi found a significant positive correlation between hybrid F1 yield and effective spikes per plant (and grains per panicle), and put forward principles for choosing between subspecies [5]. Since the beginning of this century, molecular breeding has gradually become the main breeding method. Marker-assisted selection (MAS) locates target genes by means of molecular markers in donor and receptor material, and predicts the breeding values of parental varieties by combining phenotype and marker information [6]. Genome-wide selection (GWS) was developed on the basis of MAS; with the ridge regression best linear unbiased prediction method (RR-BLUP), its accuracy is 18%–43% higher than that of MAS [7]. In actual hybrid rice breeding, parent selection should not depend on genomics completely, because that leads to artificial exaggeration of target traits and distorts the judgment of genetic stability. The trend in recent years is to combine genomics and phenomics, mine patterns in traits and genes, and find varieties with higher combining ability and heritability [8], so as to provide an accurate basis for rice breeding. This study collects rice parental varieties and related phenotypic and economic trait data, mines the data with data exploration and machine learning techniques, determines how biological traits affect economic traits, and identifies suitable methods for rapidly screening parental varieties, thereby providing an operational tool for parental variety matching.
The core problem of the study is how to find the relationships between rice phenotypic traits and economic traits, discover useful patterns and knowledge, and provide a decision support tool for rice breeding.

2 Data Process and Analysis

2.1 Data Collecting and Processing

The data used in the study comes from the rice genetic resources characteristic evaluation database of the National Agriculture Science data sharing center, the outstanding rice germplasm repository, the crop variety examination and approval database, the rice germplasm database, the rice bred varieties and pedigree database, the main crop seed vigor monitoring database, the main crop variety regional experiment database, and approved rice data from the national and provincial rice data centers. Considering the differing data quality, missing features, the analysis subject and the data integration requirements, 19 features were selected during preprocessing: variety, effective panicle, plant height, spike length, total grain number, filled grain number, seed setting rate, thousand-kernel weight, grain length, length-width ratio, brown rice percentage, milled rice percentage, head rice rate, chalky rice percentage, chalkiness degree, gel consistency, amylose content, whole growth period and actual yield; the data were then filtered by quality. Finally, 147 male parent records, 133 female parent records and 134 new variety records were selected.

2.2 Visual Data Exploring

Visual data exploration presents data to decision makers as interactive graphics to enhance data browsing and analysis [9], helping them understand datasets deeply and form hypotheses; it is in fact the basis for further data mining and analytics [10].

Because of the limits of human vision, data exploration usually handles only univariate, bivariate and trivariate data. Univariate analysis is mostly used to observe the distribution of the samples, while multivariate analysis is used to discover interactions and dependencies. Data of more than three dimensions is hard to visualize with conventional charts; it must be reduced to lower dimensions or visualized by other means. Common univariate exploration statistics include counts, percentages, variance, standard deviation, mean, median, skewness and kurtosis, visualized with histograms, box plots, pie charts, curves and line charts. Common multivariate methods include the Z test, t test, chi-square test, covariance and regression, visualized with scatter plots, stacked histograms and combinations of several charts.

The system developed for this project generates interactive histograms and ridge regressions for any two features, to explore possible correlations among the features.

Ridge regression is a biased-estimation regression method for collinear data; it is essentially an improved least-squares estimation. The objective function of ridge regression is:

$$ \mathop {\hbox{min} }\limits_{\omega } \left\| {X\omega - y} \right\|_{2}^{2} + \alpha \left\| \omega \right\|_{2}^{2} $$

Ridge regression obtains higher numerical stability at the cost of unbiasedness, and thereby higher computational accuracy; the algorithm is of particular practical value for collinear problems and data with substantial error.
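As an illustrative sketch (with synthetic stand-in data, not the study's breeding records), a bivariate ridge fit of the kind the system performs can be written with scikit-learn's Ridge estimator, whose alpha parameter is the α in the objective above:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(100.0, 10.0, size=(50, 1))      # hypothetical trait, e.g. plant height (cm)
y = 0.8 * x[:, 0] + rng.normal(0.0, 5.0, 50)   # a second, correlated trait

# alpha is the regularization strength in min_w ||Xw - y||^2 + alpha ||w||^2
model = Ridge(alpha=1.0).fit(x, y)
print(model.coef_[0], model.intercept_)        # fitted slope and intercept
```

Plotting x against y together with the fitted line gives exactly the kind of pairwise view shown in Fig. 1.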

The system observes the distribution of every feature by drawing its histogram, and finds possible correlations between any two features through ridge regression. As shown in Fig. 1, the two features there have an obvious linear correlation.

Fig. 1. Binary analysis of rice breeding data

2.3 Mining for Rice Data

The data mining algorithms applied to the rice data in this study are the CART (Classification and Regression Tree) algorithm and spectral biclustering.

2.3.1 Decision Tree

A decision tree, also called a judgment tree, is a flow-chart-like tree structure in which each internal node has two or more branches [11]. There are numerous decision tree implementations, such as ID3, C4.5, CART, SLIQ and SPRINT. The CART algorithm is used in this project. CART uses binary recursive partitioning: it always divides the sample set into two subsets, and the criterion determining each split is the Gini coefficient (also called Gini impurity): \( gini(T) = 1 - \sum {p_{j}^{2} } \), where pj is the probability that a sample belongs to class j. All samples belong to a single class when gini(T) = 0; when all C classes appear in a node with equal probability, gini(T) reaches its maximum value (C − 1)/C. Once the Gini coefficient of every candidate split is calculated, the Gini information gain can be computed and the best split chosen. In this study, the feature data are fed into CART, and common sense, rules and knowledge for rice breeding variety selection and judgment can be discovered.
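A minimal sketch of the Gini impurity and of a depth-limited CART fit, using scikit-learn's DecisionTreeClassifier on toy data (the features and the class rule below are hypothetical, not the rice dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini impurity gini(T) = 1 - sum_j p_j^2 of a node's class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))   # pure node -> 0.0
print(gini([0, 0, 1, 1]))   # two equiprobable classes -> (C - 1)/C = 0.5

# Fit a binary-recursive CART tree on toy trait data.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))                # 3 hypothetical phenotype features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical class rule
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
```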

2.3.2 Spectral BiCluster

Spectral biclustering builds on spectral clustering, a clustering method based on eigenvalue decomposition that is closely related to graph partitioning. Spectral clustering can handle clusters of arbitrary shape in the sample space and tends to converge to a near-global optimum.

Figure 2 shows the difference between K-means and spectral clustering. K-means first selects cluster centers randomly, assigns each sample point to the nearest center, then takes the mean of the assigned points as the new center, repeating until the centers converge. Spectral clustering first calculates the similarity matrix W with the Gaussian similarity function \( W_{ij} = e^{{ - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{{2\sigma^{2} }}}} \), then calculates the Laplacian matrix \( {\text{L}}\, (L\text{ = }D\text{ - }W ) \), where D is the degree matrix, a diagonal matrix whose entries are the row sums of W. The algorithm extracts the eigenvectors of the k smallest eigenvalues of L and stacks them into an n × k matrix, each row of which is a vector in k-dimensional space. It then clusters these rows with K-means; the cluster to which each row belongs is the cluster of the corresponding node in the graph, i.e. of the original data point.
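The four steps above can be sketched directly in code; the toy data below (two well-separated groups generated with scikit-learn's make_blobs) only illustrates the mechanics, not the rice data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: two well-separated groups, standing in for samples.
X, y_true = make_blobs(n_samples=100, centers=[[-3.0, 0.0], [3.0, 0.0]],
                       cluster_std=0.5, random_state=0)

# 1. Gaussian similarity matrix W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sigma = 1.0
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-d2 / (2.0 * sigma ** 2))

# 2. Laplacian L = D - W, where D is the diagonal degree matrix
#    holding the row sums of W.
D = np.diag(W.sum(axis=1))
L = D - W

# 3. Eigenvectors of the k smallest eigenvalues, stacked into an n x k matrix.
k = 2
_, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
U = eigvecs[:, :k]

# 4. K-means on the rows of U assigns each original point to a cluster.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```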

Fig. 2. Comparison between spectral clustering and K-means clustering (Source: Scikit-Learn official site)

Traditional clustering only clusters a data matrix in one direction (rows or columns), i.e. in a global pattern, so K-means finds only the global information of the dataset and drops much local partition information [12]. A biclustering algorithm clusters both the rows and the columns of the matrix, so it captures not only global information but also effective local structure in high-dimensional data [13,14,15,16,17].

Spectral biclustering is the integration of spectral clustering and biclustering. The algorithm assumes the input data matrix has a hidden checkerboard structure: the rows and the columns of the matrix can each be divided into clusters, such that the entries in the Cartesian product of any row cluster and column cluster are roughly constant. For instance, with a 2 × 3 checkerboard, each row of the matrix intersects three biclusters and each column intersects two. The algorithm re-partitions the rows and columns into subclusters so that the block-constant matrix corresponding to the partition approximates the original matrix as well as possible.
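A sketch of this checkerboard model using scikit-learn's SpectralBiclustering on synthetic checkerboard data (the shapes and cluster counts below are illustrative, not the study's parameters):

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard

# Synthetic matrix with a hidden 4 x 3 checkerboard structure.
data, rows, cols = make_checkerboard(
    shape=(60, 30), n_clusters=(4, 3), noise=1.0, shuffle=True, random_state=0
)

model = SpectralBiclustering(n_clusters=(4, 3), method="log", random_state=0)
model.fit(data)

# Every row (sample) and column (feature) is assigned to a subcluster;
# a bicluster is the Cartesian product of one row cluster and one column cluster.
print(np.unique(model.row_labels_))       # 4 row clusters
print(np.unique(model.column_labels_))    # 3 column clusters
row_idx, col_idx = model.get_indices(0)   # members of the first bicluster
```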

The spectral biclustering algorithm is used in this study to analyze rice breeding data and to find similar parent varieties that share subsets of excellent traits.

2.4 Training Strategy of the Algorithm

Because the samples collected in this study are limited, ten-fold cross-validation is adopted to obtain better training results. In each training run, the data is randomly divided into 10 even parts; each part is held out in turn while the other 9 parts are used for training, and the held-out part is used to calculate the error rate. The process is repeated 10 times, each time with a different held-out set, and a comprehensive error rate is finally calculated. Experience has shown that 10 folds is the best choice for obtaining an accurate error rate [18].
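The ten-fold procedure can be sketched with scikit-learn's cross_val_score; the toy regression data below is a stand-in for the breeding samples:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(140, 5))                   # ~140 samples, 5 hypothetical traits
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 140)   # hypothetical target trait

# Randomly split into 10 even parts; each part is held out in turn.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeRegressor(max_depth=3), X, y, cv=cv)
print(scores.mean())    # comprehensive score averaged over the 10 held-out folds
```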

3 Result

For convenient data processing and visualization, a system was developed in Python with the Django framework to support online analysis and mining of rice breeding data.

3.1 Result of CART

The maximum number of layers of the generated decision tree must be input before running the CART algorithm. Figures 3, 4 and 5 show the visual results under different layer parameters. Too high or too low a layer parameter can result in overfitting or underfitting, so users should adjust the parameter and inspect the result to get the best output. The system's default value is 5, which gives reasonable output in most situations.
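Assuming a scikit-learn-style estimator, the over- and underfitting effect of the layer (depth) parameter can be checked by cross-validating trees of several depths (toy data, illustrative only):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))                       # 4 hypothetical traits
y = X[:, 0] - X[:, 1] + rng.normal(0.0, 0.3, 120)   # hypothetical target

for depth in (3, 5, 10):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)      # held-out R^2 per fold
    print(depth, round(scores.mean(), 3))
```

Comparing the held-out scores across depths shows where the tree starts to underfit (too shallow) or overfit (too deep).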

Fig. 3. The 3-layer decision tree generated by the CART algorithm on the rice breeding data

Fig. 4. The 5-layer decision tree generated by the CART algorithm on the rice breeding data

Fig. 5. The 10-layer decision tree generated by the CART algorithm on the rice breeding data

X[N] (N = 1, 2, …, 18) denotes the features of the data: X[1] effective panicle, X[2] plant height, X[3] ear length, X[4] total grain number, X[5] filled grain number, X[6] seed setting rate, X[7] thousand-kernel weight, X[8] grain length, X[9] length-width ratio, X[10] brown rice percentage, X[11] head rice rate, X[12] whole head rice percentage, X[13] chalky rice rate, X[14] chalkiness degree, X[15] gel consistency, X[16] amylose content, X[17] whole growth period, X[18] actual yield.

For example, the following rules or common sense can be recognized from the decision tree shown in Fig. 4: there are 100 samples with seed setting rate > 20.6 and 4.85 ≤ thousand-seed weight ≤ 7.9, and 10 samples with seed setting rate ≤ 20.6, thousand-seed weight ≤ 8.25, chalky rice rate > 70 and polished rice rate ≤ 62.4%. Other samples are scattered across different rules, but the corresponding sample counts are small. Changing the number of layers of the decision tree may produce different results; for example, with the maximum layer set to 3, the result changes as follows: 104 samples with seed setting rate ≤ 20.6 and thousand-seed weight ≤ 8.25, and 18 samples with seed setting rate ≤ 20.6 and thousand-seed weight > 8.25. Each path from the root to a leaf of the tree therefore corresponds to a rule or a piece of common sense, i.e. a piece of knowledge or a pattern; effective knowledge and patterns can provide valuable references for variety breeding decisions.
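Rules of this root-to-leaf form can also be read off a fitted tree programmatically; a sketch with scikit-learn's export_text on toy data (the two feature names and thresholds are hypothetical echoes of the rules above, not the study's model):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 100.0, size=(80, 2))   # hypothetical: seed setting rate, kernel weight
y = ((X[:, 0] > 20.6) & (X[:, 1] <= 80.0)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["seed_rate", "kernel_weight"])
print(rules)   # each root-to-leaf path prints as one if-then rule
```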

3.2 Result of Spectral Bicluster

The numbers of row and column clusters must be input before running the spectral biclustering algorithm (Fig. 6).

Fig. 6. Visual output of spectral biclustering analysis. The number above the shape is the feature number; the number to its left is the sample number.

Tables 1, 2 and 3 show 3 of the biclusters generated by the algorithm with parameters rows = 8 and columns = 6. 48 biclusters were generated in this run, but some of them are meaningless (e.g. some contain only one column); different biclusters can be obtained by changing the row and column parameters. The value of the biclusters depends on the breeders' evaluation, but the algorithm provides a tool for trying different parameters and retraining the data at any time, until the breeder obtains satisfactory results.

Table 1. Analysis output by spectral BiCluster – data in BiCluster (1)
Table 2. Analysis output by spectral BiCluster – data in BiCluster (2)
Table 3. Analysis output by spectral BiCluster – data in BiCluster (3)

The data in these biclusters show that the samples within each bicluster are generally close to each other (in terms of inter-sample distance), and that the features extracted in each bicluster follow certain patterns. The results of the spectral biclustering algorithm are therefore groups of similar samples whose features follow certain patterns.

Applied to rice breeding, the algorithm can help breeders rapidly find similar samples and phenotype features in a large set of samples, and thus make better decisions.

4 Discussion

The study explores rice breeding data with univariate and bivariate analysis methods, helping breeders discover the influence of biological traits on economic traits and revealing hidden patterns, knowledge and rules in the data; through the spectral biclustering algorithm, it discovers groups of similar samples whose features follow certain patterns. All these techniques can improve the accuracy and efficiency of rice breeding.

The essence of mining rice breeding data is to reveal the patterns and knowledge hidden in the data, discover how phenotypic traits affect economic traits, and provide references for rice variety selection and breeding decisions.

5 Conclusion

Using data mining to analyze rice breeding data can help breeders discover the underlying patterns and knowledge hidden in the data, and these patterns and knowledge can help breeders make more accurate decisions rapidly. The study shows the method to be feasible, and the method is likely to become a major trend in the crop breeding area in the future.

However, from the perspective of current disciplinary development, this study needs further investigation. Follow-up studies should focus on the following two aspects:

(1) The data dimensions should be further enriched and the data integrity improved. The data used in this study did not include characteristics of the rice planting environment, such as altitude and planting area, yet these factors tend to have an important effect on the economic traits of rice. In addition, the amount of data available for this study needs to be expanded, and the quality of the data and the richness of its features improved, to ensure the effectiveness of the patterns and knowledge extracted from the data.

(2) Further research should be conducted in bioinformatics. With the development of life science and bioinformatics, phenomics has also been developing rapidly. Phenomics mainly studies the patterns of phenotypic traits, such as the physical and chemical properties of organisms, as they vary with mutation and environment [19]; it studies the phenotype system at the genome scale, hoping to explain the unknown functions of the genome [20]. Simple analysis of crop phenotype data at best reveals the relations between crop biological traits and economic traits; with genome data added in, the decisive relationships between specific genetic segments and specific crop traits can be excavated, making it possible to change a gene precisely so that the offspring carries good traits.

Therefore, future intelligent breeding decision technology must closely combine data analysis with bioinformatics. Concretely, this will be reflected in the collection of more detailed, rich, continuous and voluminous phenotypic data, combined with comparative mining of crop genomes, to provide methods and tools for accurate directional breeding.