A Model-Driven Approach for Systematic Reproducibility and Replicability of Data Science Projects

  • Conference paper
  • Published in: Advanced Information Systems Engineering (CAiSE 2022)

Abstract

In the last few years, there has been a significant increase in the number of tools and approaches for defining the pipelines that drive data science projects. These tools support not only pipeline definition but also the generation of the code needed to execute the project, making it easy to carry out such projects even for non-expert users. However, they still leave some challenges unaddressed, e.g. executing a pipeline defined with one tool in a different tool or environment (reproducibility and replicability), or validating and verifying models by identifying inconsistent operations (intentionality). To alleviate these problems, this paper presents a Model-Driven framework for defining data science pipelines independently of the particular execution platform and tools. The framework relies on separating the pipeline definition into two modelling layers: a conceptual layer, where the data scientist specifies all the data and model operations the pipeline carries out, and an operational layer, where the data engineer describes the details of the execution environment in which those operations will be implemented. Based on this abstract definition and layer separation, the approach enables the use of different tools, thus improving process replicability; the automation of process execution, enhancing process reproducibility; and the definition of model verification rules, providing intentionality restrictions.
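To make the two-layer separation concrete, here is a minimal sketch in plain Python. It is illustrative only: every name in it (Operation, ConceptualPipeline, OperationalBinding, validate) is a hypothetical stand-in for the EMF models and OCL invariants the framework actually uses, not its real API.

    from dataclasses import dataclass

    @dataclass
    class Operation:
        # Conceptual layer: what the data scientist declares.
        name: str          # e.g. "impute_missing", "train_model"
        input_type: str    # kind of data the operation consumes
        output_type: str   # kind of data the operation produces

    @dataclass
    class ConceptualPipeline:
        operations: list

        def validate(self):
            # Intentionality check: each operation must consume what the
            # previous one produces (a toy stand-in for the OCL rules).
            errors = []
            for prev, curr in zip(self.operations, self.operations[1:]):
                if prev.output_type != curr.input_type:
                    errors.append(f"'{curr.name}' expects {curr.input_type} "
                                  f"but '{prev.name}' yields {prev.output_type}")
            return errors

    @dataclass
    class OperationalBinding:
        # Operational layer: where/how the data engineer runs the pipeline.
        environment: str   # e.g. "docker", "vagrant"
        tool_for: dict     # operation name -> concrete tool

    pipeline = ConceptualPipeline([
        Operation("impute_missing", "raw_table", "clean_table"),
        Operation("train_model", "feature_matrix", "model"),  # inconsistent step
    ])
    print(pipeline.validate())

    binding = OperationalBinding("docker",
                                 {"impute_missing": "pandas",
                                  "train_model": "scikit-learn"})

Swapping the OperationalBinding for a different environment or tool set leaves the conceptual model untouched, which is what makes the same pipeline replicable across platforms; the validate check shows the kind of inconsistent operation the framework's verification rules are meant to reject.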

This work has been partially funded by (i) the Spanish government (LOCOSS project, PID2020-114615RB-I00) and (ii) the European Regional Development Fund (ERDF) and Junta de Extremadura (projects IB18034 and GR18112).


Notes

  1. https://www.omg.org/mda/
  2. https://www.eclipse.org/modeling/emf/
  3. https://www.omg.org/spec/OCL/2.4
  4. https://www.vagrantup.com/
  5. https://www.docker.com/
  6. https://rapidminer.com/
  7. https://www.knime.com/
  8. https://orangedatamining.com/
  9. https://bit.ly/3FXDbp5
  10. http://dmg.org/pmml/pmml-v4-2-1.html
  11. https://github.com/i3uex/education_drop
  12. https://www.python.org/
  13. https://dvc.org/
  14. https://www.knime.com/


Author information

Correspondence to Fran Melchor.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Melchor, F., Rodriguez-Echeverria, R., Conejero, J.M., Prieto, Á.E., Gutiérrez, J.D. (2022). A Model-Driven Approach for Systematic Reproducibility and Replicability of Data Science Projects. In: Franch, X., Poels, G., Gailly, F., Snoeck, M. (eds) Advanced Information Systems Engineering. CAiSE 2022. Lecture Notes in Computer Science, vol 13295. Springer, Cham. https://doi.org/10.1007/978-3-031-07472-1_9


  • DOI: https://doi.org/10.1007/978-3-031-07472-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07471-4

  • Online ISBN: 978-3-031-07472-1

  • eBook Packages: Computer Science, Computer Science (R0)
