A Model-Driven Approach for Systematic Reproducibility and Replicability of Data Science Projects

  • Conference paper
  • Published in: Advanced Information Systems Engineering (CAiSE 2022)

Abstract

In the last few years, there has been a significant increase in the number of tools and approaches for defining the pipelines that drive data science projects. These tools support not only pipeline definition but also the generation of the code needed to execute the project, making it easy to carry out such projects even for non-expert users. However, they still leave some challenges unaddressed, e.g. executing a pipeline defined with one tool in a different tool or environment (reproducibility and replicability), or validating and verifying models by identifying inconsistent operations (intentionality). To alleviate these problems, this paper presents a Model-Driven framework for defining data science pipelines independently of the particular execution platform and tools. The framework relies on separating the pipeline definition into two modelling layers: a conceptual layer, where the data scientist specifies all the data and model operations the pipeline carries out, and an operational layer, where the data engineer describes the details of the execution environment in which those operations will be implemented. Based on this abstract definition and layer separation, the approach enables the use of different tools, thus improving process replicability; the automation of process execution, enhancing process reproducibility; and the definition of model verification rules, providing intentionality restrictions.
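To make the two-layer separation concrete, here is a minimal sketch in plain Python. It is illustrative only: every name in it (Operation, ConceptualPipeline, OperationalBinding, validate) is a hypothetical stand-in for the EMF models and OCL invariants the framework actually uses, not its real API.

    from dataclasses import dataclass

    @dataclass
    class Operation:
        # Conceptual layer: what the data scientist declares.
        name: str          # e.g. "impute_missing", "train_model"
        input_type: str    # kind of data the operation consumes
        output_type: str   # kind of data the operation produces

    @dataclass
    class ConceptualPipeline:
        operations: list

        def validate(self):
            # Intentionality check: each operation must consume what the
            # previous one produces (a toy stand-in for the OCL rules).
            errors = []
            for prev, curr in zip(self.operations, self.operations[1:]):
                if prev.output_type != curr.input_type:
                    errors.append(f"'{curr.name}' expects {curr.input_type} "
                                  f"but '{prev.name}' yields {prev.output_type}")
            return errors

    @dataclass
    class OperationalBinding:
        # Operational layer: where/how the data engineer runs the pipeline.
        environment: str   # e.g. "docker", "vagrant"
        tool_for: dict     # operation name -> concrete tool

    pipeline = ConceptualPipeline([
        Operation("impute_missing", "raw_table", "clean_table"),
        Operation("train_model", "feature_matrix", "model"),  # inconsistent step
    ])
    print(pipeline.validate())

    binding = OperationalBinding("docker",
                                 {"impute_missing": "pandas",
                                  "train_model": "scikit-learn"})

Swapping the OperationalBinding for a different environment or tool set leaves the conceptual model untouched, which is what makes the same pipeline replicable across platforms; the validate check shows the kind of inconsistent operation the framework's verification rules are meant to reject.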

This work has been partially funded by (i) the Spanish government (LOCOSS project, PID2020-114615RB-I00) and (ii) the European Regional Development Fund (ERDF) and Junta de Extremadura (projects IB18034 and GR18112).


Notes

  1. https://www.omg.org/mda/
  2. https://www.eclipse.org/modeling/emf/
  3. https://www.omg.org/spec/OCL/2.4
  4. https://www.vagrantup.com/
  5. https://www.docker.com/
  6. https://rapidminer.com/
  7. https://www.knime.com/
  8. https://orangedatamining.com/
  9. https://bit.ly/3FXDbp5
  10. http://dmg.org/pmml/pmml-v4-2-1.html
  11. https://github.com/i3uex/education_drop
  12. https://www.python.org/
  13. https://dvc.org/
  14. https://www.knime.com/


Author information

Correspondence to Fran Melchor.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Melchor, F., Rodriguez-Echeverria, R., Conejero, J.M., Prieto, Á.E., Gutiérrez, J.D. (2022). A Model-Driven Approach for Systematic Reproducibility and Replicability of Data Science Projects. In: Franch, X., Poels, G., Gailly, F., Snoeck, M. (eds) Advanced Information Systems Engineering. CAiSE 2022. Lecture Notes in Computer Science, vol 13295. Springer, Cham. https://doi.org/10.1007/978-3-031-07472-1_9


  • DOI: https://doi.org/10.1007/978-3-031-07472-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07471-4

  • Online ISBN: 978-3-031-07472-1

  • eBook Packages: Computer Science, Computer Science (R0)
