Programmatic ETL

  • Conference paper
Business Intelligence and Big Data (eBISS 2017)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 324)

Abstract

Extract-Transform-Load (ETL) processes are used for extracting data, transforming it, and loading it into data warehouses (DWs). The dominating ETL tools use graphical user interfaces (GUIs) in which the developer “draws” the ETL flow by connecting steps/transformations with lines. This gives an easy overview, but can also be rather tedious and require much trivial work for simple tasks. We therefore challenge this approach and propose to do ETL programming by writing code. To make the programming easy, we present the Python-based framework pygrametl, which offers commonly used functionality for ETL development. By using the framework, the developer can efficiently create effective ETL solutions in which the full power of programming can be exploited. In this chapter, we present our work on pygrametl and related activities. Further, we consider some of the lessons learned during the development of pygrametl as an open source framework.
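
To give a flavour of what programmatic ETL with pygrametl looks like, the following minimal sketch loads a small star schema with a Product dimension and a Sales fact table. The database, file, table, and column names are invented for this example, and the calls (ConnectionWrapper, Dimension, FactTable, ensure, insert) follow pygrametl's documented API; consult the current documentation for the exact signatures.

    # Minimal sketch of code-based ETL with pygrametl. All names are hypothetical.
    import psycopg2
    import pygrametl
    from pygrametl.datasources import CSVSource
    from pygrametl.tables import Dimension, FactTable

    # Open the target DW and wrap the connection; the first ConnectionWrapper
    # created becomes the default connection used by the table objects below.
    conn = pygrametl.ConnectionWrapper(psycopg2.connect(dbname="dw", user="etl"))

    # One object per table in the (star) schema.
    productdim = Dimension(
        name="product",
        key="productid",                  # surrogate key
        attributes=["name", "category"],
        lookupatts=["name"])              # attributes used when looking up members

    salesfact = FactTable(
        name="sales",
        keyrefs=["productid", "dateid"],
        measures=["amount"])

    # The source file is assumed to have the columns name, category, dateid, amount.
    with open("sales.csv") as f:
        for row in CSVSource(f, delimiter=","):
            row["amount"] = float(row["amount"])       # a simple transformation
            row["productid"] = productdim.ensure(row)  # look up or insert member
            salesfact.insert(row)                      # uses productid, dateid, amount

    conn.commit()
    conn.close()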

Notes

  1. We assume that a dimension has a non-composite key.

  2. It is also possible to do the lookups and insertions from the leaves towards the root, but when going towards the leaves, the search can be stopped earlier if a part of the member is already present.

  3. We did not test PDI’s support for distributed execution.

  4. http://flexdanmark.dk. One of the authors (Ove Andersen) is employed by FlexDanmark.

  5. http://chrthomsen.github.io/pygrametl/.

  6. http://pypi.python.org/pypi.

  7. http://chrthomsen.github.io/pygrametl/doc/index.html.

  8. http://pandas.pydata.org/.

References

  1. Beyer, M.A., Thoo, E., Selvage, M.Y., Zaidi, E.: Gartner Magic Quadrant for Data Integration Tools (2017)

  2. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the OSDI, pp. 137–150 (2004). https://doi.org/10.1145/1327452.1327492

  3. Django. djangoproject.com/. Accessed 13 Oct 2017

  4. Grönniger, H., Krahn, H., Rumpe, B., Schindler, M., Völkel, S.: Text-based modeling. In: Proceedings of ATEM (2007)

  5. IBM InfoSphere DataStage. https://www.ibm.com/ms-en/marketplace/datastage. Accessed 13 Oct 2017

  6. Informatica. informatica.com. Accessed 13 Oct 2017

  7. Jensen, C.S., Pedersen, T.B., Thomsen, C.: Multidimensional Databases and Data Warehousing. Morgan and Claypool, San Rafael (2010). https://doi.org/10.2200/S00299ED1V01Y201009DTM009

  8. Kimball, R., Ross, M.: The Data Warehouse Toolkit, 2nd edn. Wiley, New York (2002)

  9. Liu, X., Thomsen, C., Pedersen, T.B.: ETLMR: a highly scalable dimensional ETL framework based on MapReduce. In: Cuzzocrea, A., Dayal, U. (eds.) DaWaK 2011. LNCS, vol. 6862, pp. 96–111. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23544-3_8

  10. Microsoft SQL Server Integration Services. https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services. Accessed 13 Oct 2017

  11. Oracle Data Integrator. http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html. Accessed 13 Oct 2017

  12. Pentaho Data Integration - Kettle. http://kettle.pentaho.org. Accessed 13 Oct 2017

  13. Petre, M.: Why looking isn’t always seeing: readership skills and graphical programming. Commun. ACM 38(6), 33–44 (1995). https://doi.org/10.1145/203241.203251

  14. PostgreSQL. postgresql.org. Accessed 13 Oct 2017

  15. Psycopg. http://initd.org/psycopg/. Accessed 13 Oct 2017

  16. Python. python.org. Accessed 13 Oct 2017

  17. Ruby on Rails. rubyonrails.org/. Accessed 13 Oct 2017

  18. SAP Data Services. https://www.sap.com/products/data-services.html. Accessed 13 Oct 2017

  19. Scriptella. scriptella.org. Accessed 13 Oct 2017

  20. Simitsis, A., Vassiliadis, P., Terrovitis, M., Skiadopoulos, S.: Graph-based modeling of ETL activities with multi-level transformations and updates. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2005. LNCS, vol. 3589, pp. 43–52. Springer, Heidelberg (2005). https://doi.org/10.1007/11546849_5

  21. Thomsen, C., Pedersen, T.B.: Building a web warehouse for accessibility data. In: Proceedings of DOLAP (2006). https://doi.org/10.1145/1183512.1183522

  22. Thomsen, C., Pedersen, T.B.: A survey of open source tools for business intelligence. IJDWM 5(3), 56–75 (2009). https://doi.org/10.4018/jdwm.2009070103

  23. Thomsen, C., Pedersen, T.B.: pygrametl: a powerful programming framework for extract-transform-load programmers. In: Proceedings of DOLAP, pp. 49–56 (2009). https://doi.org/10.1145/2064676.2064684

  24. Thomsen, C., Pedersen, T.B.: pygrametl: a powerful programming framework for extract-transform-load programmers. DBTR-25, Aalborg University (2009). www.cs.aau.dk/DBTR

  25. Thomsen, C., Pedersen, T.B.: Easy and effective parallel programmable ETL. In: Proceedings of DOLAP, pp. 37–44 (2011)

  26. Trujillo, J., Luján-Mora, S.: A UML based approach for modeling ETL processes in data warehouses. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 307–320. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39648-2_25

  27. Vaisman, A., Zimányi, E.: Data Warehouse Systems: Design and Implementation. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54655-6

  28. Vassiliadis, P.: A survey of extract-transform-load technology. IJDWM 5(3), 1–27 (2009). https://doi.org/10.4018/jdwm.2009070101

Author information

Corresponding author

Correspondence to Christian Thomsen.

Appendix A Data Warehouse Concepts

This appendix offers a very short introduction to concepts and terms used in the chapter. More details and explanations can be found in the literature [7, 8, 27]. In a data warehouse (DW), data from an organization’s different operational systems is stored in a way that supports analysis (rather than the daily operations which are supported by the operational systems). An Extract-Transform-Load (ETL) process extracts data from the source systems, transforms the data (to make it fit into the DW and to cleanse it), and loads it into the DW. Data is divided into facts and dimensions. Facts represent the subjects of the desired analyses (e.g., sales) and have numerical measures (e.g., sales amount). Dimensions provide context and describe facts (Product, Store, and Time are, for example, relevant dimensions for sales). Dimensions are thus used for selection of data and grouping of data in analyses. Dimensions have hierarchies with levels (a Time dimension can, for example, have the hierarchy Day → Month → Quarter → Year). Each of the levels can also have a number of attributes.

When using a relational database to represent a DW, one can choose between two approaches for the schema design. In a snowflake schema, each level in a dimension is represented by its own table, and the tables have foreign keys to the tables for the following (higher) levels. The dimension tables are thus normalized. In a star schema, there is only one table per dimension. This table thus represents all levels and is denormalized. In both star schemas and snowflake schemas, the facts are represented by a fact table which has a foreign key for each dimension and a column for each measure. In a star schema, the foreign keys reference the dimension tables, while in a snowflake schema they reference the tables for the lowest levels of the dimensions. The keys used in a dimension table should be integers that do not carry any special meaning; such keys are called surrogate keys.
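
To make the difference concrete, the hypothetical sketch below declares a denormalized Product dimension (star), a Time dimension snowflaked into Day, Month, and Year tables, and a fact table referencing the Product table and the lowest Time level. All table and column names are invented for the example, the declarations follow pygrametl's documented API (Dimension, SnowflakedDimension, FactTable), and a ConnectionWrapper to the target DW must exist before the tables are declared.

    import psycopg2
    import pygrametl
    from pygrametl.tables import Dimension, SnowflakedDimension, FactTable

    # Hypothetical target DW; the ConnectionWrapper becomes the default connection.
    conn = pygrametl.ConnectionWrapper(psycopg2.connect(dbname="dw", user="etl"))

    # Star schema: a single denormalized table holds all levels of Product.
    productdim = Dimension(
        name="product",
        key="productid",                         # surrogate key without meaning
        attributes=["name", "category", "brand"],
        lookupatts=["name"])

    # Snowflake schema: one table per level of the Time hierarchy; each table
    # holds a foreign key to the table for the next (higher) level.
    daydim = Dimension(
        name="day", key="dayid",
        attributes=["day", "monthid"], lookupatts=["day"])
    monthdim = Dimension(
        name="month", key="monthid",
        attributes=["month", "yearid"], lookupatts=["month"])
    yeardim = Dimension(
        name="year", key="yearid",
        attributes=["year"], lookupatts=["year"])
    timedim = SnowflakedDimension(
        [(daydim, monthdim), (monthdim, yeardim)])   # day -> month -> year

    # The fact table references the Product table and the lowest Time level.
    salesfact = FactTable(
        name="sales",
        keyrefs=["productid", "dayid"],
        measures=["amount"])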

Changes may happen in the represented world, and it must therefore be possible to represent changes in dimensions. A dimension in which changes are represented is called a slowly changing dimension (SCD). There are a number of different techniques for SCDs [8]; here we consider two of the most commonly used. For type 1 SCDs, changes are simply represented by overwriting old values in the dimension tables. If, for example, the size of a shop changes, the size attribute is updated. This can be problematic because old facts (e.g., facts about sales from the shop when it had the previous size) now refer to the updated dimension member, and history is thus not correctly represented. This problem is avoided with a type 2 SCD, where a new version of the dimension member is created when there is a change. In other words, for type 2 SCDs, changes result in new rows in the dimension tables. In a type 2 SCD, there are often attributes such as ValidFrom, ValidTo, MostCurrentVersion, and VersionNumber that provide information about the represented versions.
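
As a sketch of how a type 2 SCD can be handled in code, the hypothetical example below uses pygrametl's SlowlyChangingDimension for a Shop dimension. The scdensure method looks up the member and, if a non-lookup attribute such as the size has changed, adds a new version row and closes the previous one; the validfrom, validto, and version attributes are maintained by the class (validfrom typically defaults to the load date). All names are invented, and the parameter names follow pygrametl's documented API, so verify them against the current documentation before use.

    import psycopg2
    import pygrametl
    from pygrametl.tables import SlowlyChangingDimension

    # Hypothetical target DW; a ConnectionWrapper must exist before the declaration.
    conn = pygrametl.ConnectionWrapper(psycopg2.connect(dbname="dw", user="etl"))

    shopdim = SlowlyChangingDimension(
        name="shop",
        key="shopid",                      # surrogate key
        attributes=["name", "size", "validfrom", "validto", "version"],
        lookupatts=["name"],               # identifies a shop across its versions
        versionatt="version",              # version number, maintained automatically
        fromatt="validfrom",               # set when a version becomes valid
        toatt="validto")                   # set when a version is superseded

    # First load: inserts version 1 of the shop. A later load with size=600
    # inserts version 2 and closes version 1 by setting its validto.
    row = {"name": "Downtown Shop", "size": 450}
    row["shopid"] = shopdim.scdensure(row)

    conn.commit()
    conn.close()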

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Thomsen, C., Andersen, O., Jensen, S.K., Pedersen, T.B. (2018). Programmatic ETL. In: Zimányi, E. (eds) Business Intelligence and Big Data. eBISS 2017. Lecture Notes in Business Information Processing, vol 324. Springer, Cham. https://doi.org/10.1007/978-3-319-96655-7_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-96655-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-96654-0

  • Online ISBN: 978-3-319-96655-7

  • eBook Packages: Computer Science, Computer Science (R0)
