Abstract
Prediction of allergenic pollen concentration is one of the most important goals of aerobiology. Past studies have used a broad range of modeling techniques; however, their results cannot be directly compared because they rely on different datasets, validation methods, and evaluation metrics. The main aim of this study was to compare nine statistical modeling techniques using the same dataset. An additional goal was to assess the importance of predictors for the best model. Aerobiological data for Corylus, Alnus, and Betula pollen counts were obtained from nine cities in Poland and covered between five and 16 years of measurements. Meteorological data from the AGRI4CAST project were used as predictor variables. The results of 243 final models (3 taxa \(\times\) 9 cities \(\times\) 9 techniques) were validated using repeated k-fold cross-validation and compared using relative and absolute performance statistics. Afterward, the variable importance of predictors in the best models was calculated and compared. Simple models performed poorly, whereas regression trees and rule-based models proved to be the most accurate for all of the taxa. Cumulative growing degree days proved to be the single most important predictor variable in the random forest models of Corylus, Alnus, and Betula. Finally, the study suggested potential improvements in aerobiological modeling, such as the application of robust cross-validation techniques and the use of gridded variables.
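To make the validation idea concrete, the following is a minimal sketch of repeated k-fold cross-validation used to compare two models on a single predictor, cumulative growing degree days. It is not the authors' implementation: the data are synthetic, the base temperature of 5 °C, the fold and repeat counts, and all function names are assumptions chosen for illustration, and only two simple models (a mean baseline and a one-variable least-squares fit) stand in for the nine techniques compared in the study.

```python
import random
from statistics import mean


def cumulative_gdd(daily_mean_temps, base_temp=5.0):
    """Running sum of growing degree days (base_temp is an assumed threshold)."""
    total, out = 0.0, []
    for t in daily_mean_temps:
        total += max(0.0, t - base_temp)
        out.append(total)
    return out


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Ordinary least squares with a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def rmse(pred, obs):
    """Root mean squared error, an absolute performance statistic."""
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5


def repeated_kfold_rmse(fit, xs, ys, k=5, repeats=3, seed=0):
    """Average held-out RMSE over `repeats` independent shuffles into k folds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(xs)))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k disjoint index sets
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            preds = [model(xs[i]) for i in fold]
            scores.append(rmse(preds, [ys[i] for i in fold]))
    return mean(scores)


# Synthetic illustration only: a warming spring and a noisy response to GDD.
noise = random.Random(42)
temps = [2.0 + 0.3 * day for day in range(60)]
gdd = cumulative_gdd(temps)
pollen = [0.5 * g + noise.gauss(0.0, 1.0) for g in gdd]

baseline_rmse = repeated_kfold_rmse(fit_mean, gdd, pollen)
linear_rmse = repeated_kfold_rmse(fit_linear, gdd, pollen)
print(f"mean baseline RMSE: {baseline_rmse:.2f}")
print(f"linear model RMSE:  {linear_rmse:.2f}")
```

Using the same seed for every model means each one is scored on identical folds, which is what makes the error estimates comparable across techniques. Note that randomly shuffled folds ignore the temporal ordering of daily pollen series, so a real analysis may need a blocked or year-wise scheme instead.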
Acknowledgements
This study was carried out within the framework of Project no. NN305 321936, financed by the Ministry of Science and Higher Education in Poland.
Cite this article
Nowosad, J., Stach, A., Kasprzyk, I. et al. Statistical techniques for modeling of Corylus, Alnus, and Betula pollen concentration in the air. Aerobiologia 34, 301–313 (2018). https://doi.org/10.1007/s10453-018-9514-x
DOI: https://doi.org/10.1007/s10453-018-9514-x