
On the Validity of Machine Learning-based Next Generation Science Assessments: A Validity Inferential Network

Journal of Science Education and Technology

Abstract

This study provides a solid validity inferential network to guide the development, interpretation, and use of machine learning-based next-generation science assessments (NGSAs). Given that machine learning (ML) has been broadly implemented in the automatic scoring of constructed responses, essays, simulations, educational games, and interdisciplinary assessments to advance the collection of evidence about, and inferences from, student science learning, we contend that additional validity issues arise for science assessments because ML is involved. These emerging validity issues may not be addressed by prior validity frameworks developed for either non-science or non-ML assessments. We therefore examine the changes that ML brings to science assessments and identify seven critical validity issues of ML-based NGSAs: the potential risk of misrepresenting the construct of interest, potential confounders due to the additional variables involved, nonalignment between the interpretation and use of scores and the designed learning goals, nonalignment between the interpretation and use of scores and the actual quality of learning, nonalignment between machine scores and rubrics, limited generalizability of machine algorithmic models, and limited extrapolation ability of machine algorithmic models. Based on these seven validity issues, we propose a validity inferential network to address the cognitive, instructional, and inferential validity of ML-based NGSAs. To demonstrate the utility of this network, we present an exemplar ML-based next-generation science assessment that was developed using a seven-step ML framework, and we articulate how we used the validity inferential network to ensure accountable assessment design as well as valid interpretation and use of machine scores.
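
Two of the inferences named above, agreement between machine scores and human-applied rubrics and the generalizability of the algorithmic model, are typically examined empirically. The minimal sketch below (Python with scikit-learn, using hypothetical responses and rubric scores rather than the data or pipeline from this study) illustrates one common way such evidence is gathered: cross-validated machine scores are compared with human scores via a weighted Cohen's kappa.

```python
# Minimal sketch (hypothetical data, not the authors' pipeline): one common way to
# gather evidence about machine-human score agreement and model generalizability
# when constructed responses are scored automatically.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical student explanations and rubric levels (0-2) assigned by trained human raters.
responses = [
    "Odor molecules move randomly and spread from high to low concentration.",
    "The smell travels through the air to your nose.",
    "The smell just stays where it started.",
    "Perfume particles collide with air particles and diffuse across the room.",
    "You smell it because it is strong.",
    "Molecules of the odor diffuse until they are evenly spread out.",
    "Tiny bits of the perfume float over to you.",
    "The air carries the scent around the room.",
]
human_scores = [2, 1, 0, 2, 0, 2, 1, 1]

# A simple text-classification model standing in for whatever supervised algorithm is used.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))

# Cross-validated predictions score each response with a model that never saw it during
# training, which is one way to probe how well the algorithmic model generalizes.
machine_scores = cross_val_predict(model, responses, human_scores, cv=2)

# Quadratic-weighted Cohen's kappa is a widely reported machine-human agreement statistic;
# it credits near-misses more than distant disagreements on an ordinal rubric.
kappa = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"Quadratic-weighted kappa (machine vs. human): {kappa:.2f}")
```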


Fig. 1, Fig. 2, Fig. 3 (modified from Harris et al., 2019), Fig. 4, Fig. 5, Fig. 6


Notes

  1. This article does not argue that the case study is flawless; rather, we use it to demonstrate the usability of the validity inferential network.

References

  • AACR. (2020). Retrieved September 4, 2020, from https://apps.beyondmultiplechoice.org.

  • AERA, APA, NCME, JCSE, & PT. (1999). Standards for educational and psychological testing. American Educational Research Association.

  • Alozie, N., Haugabook Pennock, P., Madden, K., Zaidi, S., Harris, C. J., & Krajcik, J. S. (2018). Designing and developing NGSS-aligned formative assessment tasks to promote equity. Paper presented at the annual conference of the National Association for Research in Science Teaching, Atlanta, GA.

  • Anderson, C. W., et al. (2018). Designing educational systems to support enactment of the Next Generation Science Standards. Journal of Research in Science Teaching, 55(7), 1026–1052.

  • Beggrow, E. P., Ha, M., Nehm, R. H., Pearl, D., & Boone, W. J. (2014). Assessing scientific practices using machine-learning methods: How closely do they match clinical interview performance? Journal of Science Education and Technology, 23(1), 160–182.

  • Bennett, R. E. (2018). Educational assessment: What to watch in a rapidly changing world. Educational Measurement: Issues and Practice, 37(4), 7–15.

  • Bennett, R. E., Deane, P., & van Rijn, P. W. (2016). From cognitive-domain theory to assessment practice. Educational Psychologist, 51(1), 82–107.

  • Clauser, B. E., Kane, M. T., & Swanson, D. B. (2002). Validity issues for performance-based tests scored with computer-automated scoring systems. Applied Measurement in Education, 15(4), 413–432.


  • Cronbach, L. J. (1980). Validity on parole: How can we go straight? New directions for testing and measurement. Paper presented at the 1979 ETS Invitational Conference, San Francisco.

  • Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum.


  • Cronbach, L. J. (1989). Construct validation after thirty years. In R. E. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147–171). Urbana, IL: University of Illinois Press.


  • Erickson, B. J., Korfiatis, P., Akkus, Z., & Kline, T. L. (2017). Machine learning for medical imaging. Radiographics, 37(2), 505-515.


  • Forehand, M. (2010). Bloom’s taxonomy. Emerging perspectives on learning, teaching, and technology, 41(4), 47–56.

  • Furtak, E. M., Kang, H., Pellegrino, J., Harris, C., Krajcik, J., Morrison, D., & Nation, J. (2020). Emergent design heuristics for three-dimensional classroom assessments that promote equity. The Interdisciplinarity of the Learning Sciences.

  • Gane, B. D., Zaidi, S. Z., & Pellegrino, J. W. (2018). Measuring what matters: Using technology to assess multidimensional learning. European Journal of Education, 53(2), 176–187.


  • Gerard, L., Kidron, A., & Linn, M. (2019). Guiding collaborative revision of science explanations. International Journal of Computer-Supported Collaborative Learning, 14(3), 291–324.

  • Gerard, L. F., & Linn, M. C. (2016). Using automated scores of student essays to support teacher guidance in classroom inquiry. Journal of Science Teacher Education, 27(1), 111-129.


  • Ghali, R., Ouellet, S., & Frasson, C. (2016). LewiSpace: An exploratory study with a machine learning model in an educational game. Journal of Education and Training Studies, 4(1), 192–201.


  • Gobert, J. D., Baker, R. S., & Wixon, M. B. (2015). Operationalizing and detecting disengagement within online science microworlds. Educational Psychologist, 50(1), 43–57.


  • Ha, M., & Nehm, R. H. (2016). The impact of misspelled words on automated computer scoring: A case study of scientific explanations. Journal of Science Education and Technology, 25(3), 358–374.


  • Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice, 38(2), 53–67.

  • Jescovitch, L. N., Scott, E. E., Cerchiara, J. A., Merrill, J., Urban-Lurain, M., Doherty, J. H., & Haudek, K. C. (2020). Comparison of machine learning performance using analytic and holistic coding approaches across constructed response assessments aligned to a science learning progression. Journal of Science Education and Technology, 1–18.

  • Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535.

  • Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.


  • Large, J., Lines, J., & Bagnall, A. (2019). A probabilistic classifier ensemble weighting scheme based on cross-validated accuracy estimates. Data Mining and Knowledge Discovery, 33(6), 1674–1709.

  • Lee, H. S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., & Liu, O. L. (2019). Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Science Education, 103(3), 590–622.


  • Li, H., Gobert, J., Graesser, A., & Dickler, R. (2018). Advanced educational technology for science inquiry assessment. Policy Insights from the Behavioral and Brain Sciences, 5(2), 171–178.


  • Liaw, H., Yu, Y. R., Chou, C. C., & Chiu, M. H. (2020). Relationships between facial expressions, prior knowledge, and multiple representations: A case of conceptual change for kinematics instruction. Journal of Science Education and Technology, 1-12.

  • Liu, O. L., Rios, J. A., Heilman, M., Gerard, L., & Linn, M. C. (2016). Validation of automated scoring of science assessments. Journal of Research in Science Teaching, 53(2), 215–233.


  • Lottridge, S., Wood, S., & Shaw, D. (2018). The effectiveness of machine score-ability ratings in predicting automated scoring performance. Applied Measurement in Education, 31(3), 215–232.


  • Mao, L., Liu, O. L., Roohr, K., Belur, V., Mulholland, M., Lee, H.-S., & Pallant, A. (2018). Validation of automated scoring for a formative assessment that employs scientific argumentation. Educational Assessment, 23(2), 121–138.


  • Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.

  • Mislevy, R., & Haertel, G. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6–20.

  • Mislevy, R., & Riconscente, M. (2011). Evidence-centered assessment design. In Handbook of test development (pp. 75–104). Routledge.

  • Mislevy, R., Steinberg, L., & Almond, R. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives. Hillsdale, NJ: Lawrence Erlbaum Associates.

  • Nakamura, C. M., Murphy, S. K., Christel, M. G., Stevens, S. M., & Zollman, D. A. (2016). Automated analysis of short responses in an interactive synthetic tutoring system for introductory physics. Physical Review Physics Education Research, 12(1), 010122.


  • National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press.

  • National Research Council. (2014). Developing assessments for the next generation science standards. National Academies Press.

  • Nehm, R. H., Ha, M., & Mayfield, E. (2012). Transforming biology assessment with machine learning: Automated scoring of written evolutionary explanations. Journal of Science Education and Technology, 21(1), 183–196.


  • NGSA team. Next Generation Science Assessment. Retrieved October 9, 2020, from https://nextgenscienceassessment.org/about/team/

  • NGSS Lead States. (2013). Next generation science standards: For states, by states. National Academies Press.

  • Pellegrino, J. W., DiBello, L. V., & Goldman, S. R. (2016). A framework for conceptualizing and evaluating the validity of instructionally relevant assessments. Educational Psychologist, 51(1), 59–81.


  • Pellegrino, J. W., Wilson, M. R., Koenig, J. A., & Beatty, A. S. (2014). Developing assessments for the Next Generation Science Standards: ERIC.

  • Prevost, L. B., Smith, M. K., & Knight, J. K. (2016). Using student writing and lexical analysis to reveal student thinking about the role of stop codons in the central dogma. CBE—Life Sciences Education, 15(4), ar65.

  • Ruiz-Primo, M. A., Li, M., Wills, K., Giamellaro, M., Lan, M.-C., Mason, H., & Sands, D. (2012). Developing and evaluating instructionally sensitive assessments in science. Journal of Research in Science Teaching, 49(6), 691–712.


  • Shin, D., & Shim, J. (2020). A systematic review on data mining for mathematics and science education. International Journal of Science and Mathematics Education.

  • Urban-Lurain, M., Cooper, M. M., Haudek, K. C., Kaplan, J. J., Knight, J. K., Lemons, P. P., et al. (2015). Expanding a national network for automated analysis of constructed response assessments to reveal student thinking in STEM. Computers in Education Journal, 6, 65–81.


  • Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.

  • Wilson, J., Roscoe, R., & Ahmed, Y. (2017). Automated formative writing assessment using a levels of language framework. Assessing Writing, 34, 16–36.


  • Yoo, J., & Kim, J. (2014). Can online discussion participation predict group project performance? Investigating the roles of linguistic features and participation patterns. International Journal of Artificial Intelligence in Education, 24(1), 8–32.


  • Zhai, X. (2019). Applying machine learning in science assessment: Opportunity and challenges. A call for a special issue in the Journal of Science Education and Technology. https://doi.org/10.13140/RG.2.2.10914.07365 (Unpublished document).

  • Zhai, X., Haudek, K., Shi, L., Nehm, R., & Urban-Lurain, M. (2020b). From substitution to redefinition: A framework of machine learning-based science assessment. Journal of Research in Science Teaching, 57(9), 1430–1459. https://doi.org/10.1002/tea.21658

  • Zhai, X., Haudek, K., Stuhlsatz, M., & Wilson, C. (2020c). Evaluation of construct-irrelevant variance yielded by machine and human scoring of a science teacher PCK constructed response assessment. Studies in Educational Evaluation, 67, 1–12. https://doi.org/10.1016/j.stueduc.2020.100916

  • Zhai, X., Shi, L., & Nehm, R. (in press). A meta-analysis of machine learning-based science assessments: Factors impacting machine-human score agreements. Journal of Science Education and Technology. https://doi.org/10.1007/s10956-020-09875-z

  • Zhai, X., Yin, Y., Pellegrino, J. W., Haudek, K. C., & Shi, L. (2020a). Applying machine learning in science assessment: A systematic review. Studies in Science Education, 56(1), 111-151. https://doi.org/10.1080/03057267.2020.1735757.


  • Zhu, M., Lee, H.-S., Wang, T., Liu, O. L., Belur, V., & Pallant, A. (2017). Investigating the impact of automated feedback on students’ scientific argumentation. International Journal of Science Education, 39(12), 1648–1668.



Acknowledgements

We are grateful to the NGSA team members, particularly to Daniel Damelin, Peng He, Tingting Li, Jie Yang, and Nicholas Yoshida.

Funding

This material is based upon work supported by the Chan Zuckerberg Initiative (grant number: 194933; PIs are James Pellegrino, Christopher Harris, Joseph Krajcik, Daniel Damelin). The automatic analysis was partially supported by the National Science Foundation (DUE-1561159).

Author information


Corresponding author

Correspondence to Xiaoming Zhai.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Ethical Approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants included in the study.

Informed Consent

All authors agreed to publish the study in this journal.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhai, X., Krajcik, J. & Pellegrino, J.W. On the Validity of Machine Learning-based Next Generation Science Assessments: A Validity Inferential Network. J Sci Educ Technol 30, 298–312 (2021). https://doi.org/10.1007/s10956-020-09879-9

