From Data Mining with Machine Learning to Inference in Diverse and Highly Complex Data: Some Shared Experiences, Intellectual Reasoning and Analysis Steps for the Real World of Science Applications

Machine Learning for Ecology and Sustainable Natural Resource Management

Abstract

Statistics is an ever-evolving field. So-called “traditional” statistical analysis is actually a blend of methods and approaches that evolved over several hundred years (without computers); it is usually centered on frequentist mindsets and central theorems, as summarized by Zar (2010). Nowadays, statistical analysis is easily done with a computer and the internet, which brings forward new approaches to analysis and inference. Traditional (frequentist) statistics, also linked with some Bayesian approaches, are widely used and conveniently entrenched in the ‘western world’, but they are difficult to change and to update. Entire university departments, agencies and contractors make a living from them, which slows progress in statistics and academic data analysis and leaves little momentum for moving away from a somewhat outdated ‘science model’. For many contractors and governments, traditional statistics is part of an income culture, a lifestyle and a resulting belief system. However, the methods employed in traditional statistical analyses are well published and are known to violate many of the statistical assumptions they require for valid inference. Often, this topic becomes the central question in decision-making, even in globally debated case studies. Court cases and entire sectors, including the pharmaceutical industry and the environmental impact assessment industry, hinge on an experimental approach that uses ‘hypothesis’ concepts for legal approval. Here I review and contrast some basic statistical tests and the steps involved in drawing meaningful inference, followed by a discussion of a more robust and modern approach for generic but complex data situations, in line with Breiman’s (2001a, b) machine learning. Lastly, I expand on such approaches for global progress in natural resource conservation management.

The problems posed for biologists by the real world seldom, if ever, have an exact, correct statistical solution. The assumptions of nearly all techniques are violated to a greater or lesser extent by real data.

McArdle (1988)

“In a Time of Universal Deceit, Telling the Truth Is a Revolutionary Act.”

Attributed to George Orwell

With this method the dangers of parental affection for a favorite theory can be circumvented.

T. Chamberlin (1890)

Notes

  1. Changes in the current science concept, as run by the western world, are known to take approximately 20 years or longer (e.g. Conner 2005). While this is far too long for relevant and effective conservation management, it is often caused by generic inertia, institutional setup, and tenure and funding concepts. In reality, most changes in science are achieved only when earlier researchers retire and new minds and new generations enter the scene. Arguably, many of the ‘new/fresh’ minds were trained by earlier minds, and thus little truly changes. Examples of this pattern can be found in most western and science nations, e.g. in Academies of Science, many established journals and Nobel Prize committees.

References

  • Anderson DR, Burnham KP (2002) Avoiding pitfalls when using information-theoretic methods. J Wildl Manag 66:912–918

  • Anderson DR, Burnham KP, Thompson WL (2000) Null hypothesis testing: problems, prevalence, and an alternative. J Wildl Manag 64:912–923

  • Araujo M, New B (2007) Ensemble forecasting of species distributions. Trends Ecol Evol 22:42–47

  • Arnold TW (2010) Uninformative parameters and model selection using Akaike’s information criterion. J Wildl Manag 74:1175–1178

  • Bandura A (2007) Impeding ecological sustainability through selective moral disengagement. Int J Innov Sustain Dev:8–35

  • Berkson J (1942) Tests of significance considered as evidence. J Am Stat Assoc 37:325–335

  • Betts MG, Ganio L, Huso M, Som N, Huettmann F, Bowman J, Wintle WA (2009) Comment on “methods to account for spatial autocorrelation in the analysis of species distributional data: a review”. Ecography 32:374–378

  • Breiman L (2001a) Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci 16:199–231

  • Breiman L (2001b) Random forests. Mach Learn 45:5–32

  • Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Boca Raton

  • Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach. Springer, New York

  • Carlson D (2011) A lesson in sharing. Nature 469:293

  • Carlson D (2013) Reading and thinking about international polar years: five recent books. Polar Res 32:1–7. https://doi.org/10.3402/polar.v32i0.20789

  • Chamberlin T (1890) The method of multiple working hypotheses. Reprinted 1965. Science 148:754–759

  • Chia KS (1997) “Significant-itis”—an obsession with the P-value. Scand J Work Environ Health 23:152–154

  • Concato J, Hartigan JA (2016) P values: from suggestion to superstition. J Investig Med 64:1166–1171

  • Conner CD (2005) A people’s history of science: miners, midwives and “low Mechanicks”. Nation Books, New York

  • Cushman S, Huettmann F (2010) Spatial complexity, informatics and wildlife conservation. Springer, Tokyo, p 448

  • Cutler DR, Edwards TC Jr, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007) Random forests for classification in ecology. Ecology 88:2783–2792

  • Czech B (2002) Shoveling fuel for a runaway train: errant economists, shameful spenders, and a plan to stop them all. University of California Press, Berkeley

  • Daly H, Farley J (2010) Ecological economics. Principles and applications, 2nd edn. Island Press, New York

  • De’ath G, Fabricius K (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81:3178–3192. https://doi.org/10.1890/0012-9658(2000)081[3178:CARTAP]2.0.CO;2

  • Dodds DG (2001) Philosophy and practice of wildlife management, 3rd edn. Krieger Publications, New York

  • Drew CA, Wiersma Y, Huettmann F (eds) (2011) Predictive species and habitat modeling in landscape ecology. Springer, New York

  • Elith J, Graham C, NCEAS working group (2006) Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29:129–151

  • Elder JF (2003) The generalization paradox of ensembles. J Comput Graph Stat 12:853–864

  • Fernandez-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve real world classification problems? J Mach Learn Res 15:3133–3181

  • Fidler F, Loftus GR (2009) Why figures with error bars should replace p values: some conceptual arguments and empirical demonstrations. J Psychol 217:27–37. https://faculty.washington.edu/gloftus/Downloads/Fidler.Loftus.pdf

  • Filliben JJ (1975) The probability plot correlation coefficient test for normality. Technometrics 17:111–117

  • Fox CH, Huettmann F, Harvey GKA, Morgan KH, Robinson J, Williams R, Paquet PC (2017) Predictions from machine learning ensembles: marine bird distribution and density on Canada’s Pacific coast. Mar Ecol Prog Ser 566:199–216

  • Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55:119–139

  • Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38:367–378

  • Fryxell JM, Caughley G (2014) Wildlife ecology, conservation, and management. Wiley Blackwell, Brisbane

  • Gardner MJ, Altman DG (1986) Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J 292:746–750

  • Gelman A (2008) Objections to Bayesian statistics. Bayesian Anal 3:445–450

  • Georgescu-Roegen N (1971) The entropy law and the economic process. Harvard University Press, Cambridge, MA

  • Gigerenzer G (2004) Mindless statistics. J Socio Econ 33:587–606

  • Greenland S (2012) Transparency and disclosure, neutrality and balance: shared values or just shared words? J Epidemiol Community Health 66:967–970

  • Greenland S, Senn SK, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG (2016) Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31:337–350

  • Guthery FS (2008) Statistical ritual versus knowledge accrual in wildlife science. J Wildl Manag 72:1872–1875

  • Guthery FS, Lusk JJ, Peterson MJ (2001) The fall of the null hypothesis: liabilities and opportunities. J Wildl Manag 65:379–384

  • Guthery FS, Brennan LA, Peterson MJ, Lusk LL (2005) Information theory in wildlife science: critique and viewpoint. J Wildl Manag 69:457–465

  • Han X, Huettmann F, Guo Y, Mi C, Wen L (2018) Conservation prioritization with machine learning predictions for the black-necked crane Grus nigricollis, a flagship species on the Tibetan plateau for 2070. Glob Environ Chang. https://doi.org/10.1007/s10113-018-1336-4

  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York

  • Hilborn R, Mangel M (1997) The ecological detective: confronting models with data. Princeton University Press, Princeton, p 330

  • Hobbs NT, Hooten M (2015) Bayesian models: a statistical primer for ecologists. Princeton University Press, Princeton

  • Hochachka W, Caruana R, Fink D, Munson A, Riedewald M, Sorokina D, Kelling S (2007) Data mining for discovery of pattern and process in ecological systems. J Wildl Manag 71:2427–2437

  • Huettmann F (2005) Databases and science-based management in the context of wildlife and habitat: towards a certified ISO standard for objective decision-making for the global community by using the internet. J Wildl Manag 69:466–472

  • Huettmann F (2007) Modern adaptive management: adding digital opportunities towards a sustainable world with new values. Forum Public Policy Clim Chang Sustain Dev 3:337–342

  • Huettmann F (2009) The global need for, and appreciation of, high-quality metadata in biodiversity work. In: Spehn E, Koerner C (eds) Data mining for global trends in mountain biodiversity. CRC Press, Taylor & Francis, pp 25–28

  • Huettmann F, Artukhin Y, Gilg O, Humphries G (2011) Predictions of 27 Arctic pelagic seabird distributions using public environmental variables, assessed with colony data: a first digital IPY and GBIF open access synthesis platform. Mar Biodivers 41:141–179. https://doi.org/10.1007/s12526-011-0083-2

  • Johnson CJ, Seip DR (2008) Relationship between resource selection, distribution, and abundance: a test with implications to theory and conservation. Popul Ecol 50:145–157

  • Kandel K, Huettmann F, Suwal MK, Regmi GR, Nijman V, Nekaris KAI, Lama ST, Thapa A, Sharma HP, Subedi TR (2015) Rapid multi-nation distribution assessment of a charismatic conservation species using open access ensemble model GIS predictions: red panda (Ailurus fulgens) in the Hindu-Kush Himalaya region. Biol Conserv 181:150–161

  • Lambdin C (2012) Significance tests as sorcery: science is empirical—significance tests are not. Theory Psychol 22:67–90

  • Loftus GR (1996) Psychology will be a much better science when we change the way we analyze data. Curr Dir Psychol Sci 5:161–171

  • Magness DR, Huettmann F, Morton JM (2008) Using random forests to provide predicted species distribution maps as a metric for ecological inventory & monitoring programs. In: Smolinski TG, Milanova MG, Hassanien A-E (eds) Applications of computational intelligence in biology: current trends and open problems. Studies in computational intelligence, vol 122. Springer-Verlag, Berlin/Heidelberg, pp 209–229

  • Magness DR, Morton JM, Huettmann F, Chapin FS III, McGuire AD (2011) A climate-change adaptation framework to reduce continental-scale vulnerability across conservation reserves. Ecosphere 2:art112. https://doi.org/10.1890/ES11-00200.1

  • Manly FJ, McDonald LL, Thomas DL, McDonald TL, Erickson WP (2002) Resource selection by animals: statistical design and analysis for field studies, 2nd edn. Kluwer Academic Publishers, Dordrecht

  • McArdle (1988) The structural relationship: regression in biology. Can J Zool 66:2329–2339

  • McCullough BC, Wilson B (1999) On the accuracy of statistical procedures in Microsoft Excel 97. Comput Stat Data Anal 31:27–37

  • McGarigal K, Cushman S, Stafford S (2000) Multivariate statistics for wildlife and ecology research. Springer, New York

  • Mi C, Huettmann F, Guo Y, Han X, Wen L (2017) Why to choose random forest to predict rare species distribution with few samples in large undersampled areas? Three Asian crane species models provide supporting evidence. PeerJ. https://doi.org/10.7717/peerj.2849

  • Mueller JP, Massaron L (2016) Machine learning for dummies. For Dummies Publisher, 435 p

  • Næss A (1989) Ecology, community and lifestyle: outline of an ecosophy (trans: Rothenberg D). Cambridge: Cambridge University Press

  • O’Connor R (2000) Why ecology lags behind biology. The Scientist 14:35–36

  • Oppel S, Strobl C, Huettmann F (2009) Alternative methods to quantify variable importance in ecology. Technical report number 65, Department of Statistics, University of Munich

  • Oppel S, Meirinho A, Ramírez I, Gardner B, O’Connell A, Miller PI, Louzao M (2012) Comparison of five modelling techniques to predict the spatial distribution and abundance of seabirds. Biol Conserv 156:94–104

  • Pearce J, Ferrier S (2000) Evaluating the predictive performance of habitat models developed using logistic regression. Ecol Model 133:225–245

  • Perneger TV (1998) What’s wrong with Bonferroni adjustments. Br Med J 316:1236–1238

  • Popper K (1945) The open society and its enemies. Princeton University Press, Princeton/Oxford

  • Quinn G, Keough M (2004) Experimental design and data analysis for biologists. Cambridge University Press, Cambridge

  • Razali N, Wah YB (2011) Power comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests. J Stat Model Anal 2:21–33

  • Reinhart A (2015) Statistics done wrong: the woefully complete guide. No Starch Press, San Francisco

  • Rexstad EA, Miller D, Flather C, Anderson A, Hupp J, Anderson JR (1988) Questionable multivariate statistical inference in wildlife and community studies. J Wildl Manag 52:794–798

  • Romesburg HC (1981) Wildlife science: gaining reliable knowledge. J Wildl Manag 45:293–313

  • Romesburg HC (1991) On improving natural resources and environmental sciences. J Wildl Manag 55:744–756

  • Rosales J (2008) Economic growth, climate change, biodiversity loss: distributive justice for the global north and south. Conserv Biol 22:1409–1417

  • Rothman KJ (1990) No adjustments are needed for multiple comparisons. Epidemiology 1:43–46

  • Salsburg DS (1985) The religion of statistics as practiced in medical journals. Am Stat 39:220–223

  • Salsburg D (2001) The lady tasting tea: how statistics revolutionized science in the twentieth century. W. H. Freeman and Company, New York

  • Savalei V, Dunn E (2015) Is the call to abandon p-values the red herring of the replicability crisis? Front Psychol. https://doi.org/10.3389/fpsyg.2015.00245

  • Sawitzki G (1994a) Testing numerical reliability of data analysis systems. Comput Statist Data Anal 18:269–286

  • Sawitzki G (1994b) Report on the reliability of data analysis systems. Comput Statist Data Anal (SSN) 18:289–301

  • Schapire RE (1990) The strength of weak learnability. Machine learning, vol 5. Kluwer Academic Publishers, Boston, MA, pp 197–227. https://doi.org/10.1007/bf00116037

  • Schapire RE (1992) The design and analysis of efficient learning algorithms. MIT Press, Cambridge, MA

  • Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictors. Mach Learn 37:297–336

  • Schmidt FL, Hunter JE (2014) Methods of meta-analysis: correcting error and bias in research findings, 3rd edn. Sage Publisher, Thousand Oaks

  • Silvy NJ (2012) The wildlife techniques manual: research & management, vol 2, 7th edn. The Johns Hopkins University Press

  • Stang A, Poole C, Kuss O (2010) The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol 25:225–230

  • Stanton-Geddes J, Gomes De Freitas C, de Sales Dambros C (2014) In defense of P values: comment on the statistical methods actually used by ecologists. Ecology 95:637–642

  • Stephens PA, Buskirk SW, Hayward GW, Martinez Del Rio C (2007) A call for statistical pluralism answered. J Appl Ecol 44:461–463. https://doi.org/10.1111/j.1365-2664.2007.01302.x

  • Strobl C, Boulesteix A-L, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8:25. https://doi.org/10.1186/1471-2105-8-25

  • Swihart R, Slade N (1985) Testing for independence in observations of animal movements. Ecology 66:1176–1184

  • Thompson B (2004) The “significance” crisis in psychology and education. J Socio Econ 33:607–613

  • Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York

  • Whittingham MJ, Stephens PA, Bradbury RB, Freckleton RP (2006) Why do we still use stepwise modelling in ecology and behaviour? J Anim Ecol 75:1182–1189

  • Yackulic CB, Chandler R, Zipkin EF, Royle JA, Nichols JD, Campbell Grant EH, Veran S (2012) Presence-only modeling using MAXENT: when can we trust the inferences? Meth Ecol Evol 4:236–243

  • Yoccoz NG (1991) Use, overuse, and misuse of significance tests in evolutionary biology and ecology. Bull Ecol Soc Am 72:106–111

  • Zar JH (2010) Biostatistical analysis, 5th edn. Prentice Hall, Upper Saddle River

  • Ziliak ST, McCloskey DN (2009) The cult of statistical significance. Section on statistical education, pp 2302–2319

  • Zuckerberg B, Huettmann F, Friar J (2011) Proper data management as a scientific foundation for reliable species distribution modeling. In: Drew CA, Wiersma Y, Huettmann F (eds) Predictive species and habitat modeling in landscape ecology. Springer, New York, pp 45–70

Corresponding author

Correspondence to Falk Huettmann.

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Huettmann, F. (2018). From Data Mining with Machine Learning to Inference in Diverse and Highly Complex Data: Some Shared Experiences, Intellectual Reasoning and Analysis Steps for the Real World of Science Applications. In: Humphries, G., Magness, D., Huettmann, F. (eds) Machine Learning for Ecology and Sustainable Natural Resource Management. Springer, Cham. https://doi.org/10.1007/978-3-319-96978-7_4
