Abstract
Understanding the structure, basic properties, and relationships in a dataset is a fundamental prerequisite for an appropriate statistical analysis. Here, we highlight the major principles of data exploration and provide a roadmap, organized around key questions, for a systematic and reproducible analysis. Using an example dataset of throughfall measurements, we demonstrate how several techniques can be applied and evaluated to address common tasks of data analysis: understanding the structure of the dataset, detecting temporal and spatial dependence among observations, identifying outliers and influential observations, checking normality and homogeneity of model residuals, and exploring the relationships among variables. Finally, we discuss when data transformations can, and when they cannot, be used to overcome restrictions in the application of a statistical method. The electronic supplement provides the sample dataset and fully documented computer code, intended as a guideline for conducting an exploratory data analysis in the statistical software environment R.
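To make a few of these tasks concrete, here is a minimal sketch. It is not the chapter's supplementary R code; it uses Python's standard library purely for illustration, and the data values, the function names `iqr_outliers` and `lag1_autocorrelation`, and the choice of Tukey's boxplot rule and a lag-1 autocorrelation are all assumptions made for this example rather than methods confirmed by the abstract.

```python
import statistics

# Hypothetical toy values (NOT taken from the chapter's throughfall dataset),
# imagined as repeated measurements recorded in time order.
values = [4.1, 4.3, 4.0, 4.6, 4.4, 4.8, 4.7, 5.1, 12.9, 5.0]


def iqr_outliers(xs, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's boxplot rule)."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in xs if x < low or x > high]


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation, a crude first check for temporal dependence."""
    mean = statistics.fmean(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


# Task 1: basic structure of the data.
summary = {
    "n": len(values),
    "mean": statistics.fmean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}

# Task 2: candidate outliers (to be inspected, not automatically deleted).
outliers = iqr_outliers(values)

# Task 3: dependence between consecutive observations.
r1 = lag1_autocorrelation(values)

print(summary)
print("candidate outliers:", outliers)
print("lag-1 autocorrelation:", round(r1, 3))
```

A flagged value is only a candidate for closer inspection; whether it is a measurement error or a genuine extreme has to be judged from the sampling context.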
Notes
- 1. See the DataONE primer on data management for an overview on how to effectively manage data: https://www.dataone.org/sites/all/documents/DataONE_BP_Primer_020212.pdf
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Zwanzig, M., Schlicht, R., Frischbier, N., Berger, U. (2020). Primary Steps in Analyzing Data: Tasks and Tools for a Systematic Data Exploration. In: Levia, D.F., Carlyle-Moses, D.E., Iida, S., Michalzik, B., Nanko, K., Tischer, A. (eds) Forest-Water Interactions. Ecological Studies, vol 240. Springer, Cham. https://doi.org/10.1007/978-3-030-26086-6_7
DOI: https://doi.org/10.1007/978-3-030-26086-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26085-9
Online ISBN: 978-3-030-26086-6
eBook Packages: Biomedical and Life Sciences (R0)