Programmatic ETL

  • Conference paper
Business Intelligence and Big Data (eBISS 2017)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 324)

Abstract

Extract-Transform-Load (ETL) processes are used for extracting data, transforming it, and loading it into data warehouses (DWs). The dominating ETL tools use graphical user interfaces (GUIs) in which the developer “draws” the ETL flow by connecting steps/transformations with lines. This gives an easy overview, but can also be rather tedious and require much trivial work for simple tasks. We therefore challenge this approach and propose to do ETL programming by writing code. To make the programming easy, we present the Python-based framework pygrametl, which offers commonly used functionality for ETL development. By using the framework, the developer can efficiently create effective ETL solutions in which the full power of programming can be exploited. In this chapter, we present our work on pygrametl and related activities. Further, we consider some of the lessons learned during the development of pygrametl as an open source framework.
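
To give a flavour of what programmatic ETL with pygrametl looks like, the following minimal sketch loads a small star schema with a Product dimension and a Sales fact table. The database, file, table, and column names are invented for this example, and the calls (ConnectionWrapper, Dimension, FactTable, ensure, insert) follow pygrametl's documented API; consult the current documentation for the exact signatures.

    # Minimal sketch of code-based ETL with pygrametl. All names are hypothetical.
    import psycopg2
    import pygrametl
    from pygrametl.datasources import CSVSource
    from pygrametl.tables import Dimension, FactTable

    # Open the target DW and wrap the connection; the first ConnectionWrapper
    # created becomes the default connection used by the table objects below.
    conn = pygrametl.ConnectionWrapper(psycopg2.connect(dbname="dw", user="etl"))

    # One object per table in the (star) schema.
    productdim = Dimension(
        name="product",
        key="productid",                  # surrogate key
        attributes=["name", "category"],
        lookupatts=["name"])              # attributes used when looking up members

    salesfact = FactTable(
        name="sales",
        keyrefs=["productid", "dateid"],
        measures=["amount"])

    # The source file is assumed to have the columns name, category, dateid, amount.
    with open("sales.csv") as f:
        for row in CSVSource(f, delimiter=","):
            row["amount"] = float(row["amount"])       # a simple transformation
            row["productid"] = productdim.ensure(row)  # look up or insert member
            salesfact.insert(row)                      # uses productid, dateid, amount

    conn.commit()
    conn.close()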

Notes

  1. We assume that a dimension has a non-composite key.

  2. It is also possible to do the lookups and insertions from the leaves towards the root, but when going towards the leaves, the search can be stopped earlier if a part of the member is already present.

  3. We did not test PDI’s support for distributed execution.

  4. http://flexdanmark.dk. One of the authors (Ove Andersen) is employed by FlexDanmark.

  5. http://chrthomsen.github.io/pygrametl/.

  6. http://pypi.python.org/pypi.

  7. http://chrthomsen.github.io/pygrametl/doc/index.html.

  8. http://pandas.pydata.org/.

References

  1. Beyer, M.A., Thoo, E., Selvage, M.Y., Zaidi, E.: Gartner Magic Quadrant for Data Integration Tools (2017)

  2. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the OSDI, pp. 137–150 (2004). https://doi.org/10.1145/1327452.1327492

  3. Django. djangoproject.com/. Accessed 13 Oct 2017

  4. Grönniger, H., Krahn, H., Rumpe, B., Schindler, M., Völkel, S.: Text-based modeling. In: Proceedings of ATEM (2007)

  5. IBM InfoSphere DataStage. https://www.ibm.com/ms-en/marketplace/datastage. Accessed 13 Oct 2017

  6. Informatica. informatica.com. Accessed 13 Oct 2017

  7. Jensen, C.S., Pedersen, T.B., Thomsen, C.: Multidimensional Databases and Data Warehousing. Morgan and Claypool, San Rafael (2010). https://doi.org/10.2200/S00299ED1V01Y201009DTM009

  8. Kimball, R., Ross, M.: The Data Warehouse Toolkit, 2nd edn. Wiley, New York (2002)

  9. Liu, X., Thomsen, C., Pedersen, T.B.: ETLMR: a highly scalable dimensional ETL framework based on MapReduce. In: Cuzzocrea, A., Dayal, U. (eds.) DaWaK 2011. LNCS, vol. 6862, pp. 96–111. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23544-3_8

  10. Microsoft SQL Server Integration Services. https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services. Accessed 13 Oct 2017

  11. Oracle Data Integrator. http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html. Accessed 13 Oct 2017

  12. Pentaho Data Integration - Kettle. http://kettle.pentaho.org. Accessed 13 Oct 2017

  13. Petre, M.: Why looking isn’t always seeing: readership skills and graphical programming. Commun. ACM 38(6), 33–44 (1995). https://doi.org/10.1145/203241.203251

  14. PostgreSQL. postgresql.org. Accessed 13 Oct 2017

  15. Psycopg. http://initd.org/psycopg/. Accessed 13 Oct 2017

  16. Python. python.org. Accessed 13 Oct 2017

  17. Ruby on Rails. rubyonrails.org/. Accessed 13 Oct 2017

  18. SAP Data Services. https://www.sap.com/products/data-services.html. Accessed 13 Oct 2017

  19. Scriptella. scriptella.org. Accessed 13 Oct 2017

  20. Simitsis, A., Vassiliadis, P., Terrovitis, M., Skiadopoulos, S.: Graph-based modeling of ETL activities with multi-level transformations and updates. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2005. LNCS, vol. 3589, pp. 43–52. Springer, Heidelberg (2005). https://doi.org/10.1007/11546849_5

  21. Thomsen, C., Pedersen, T.B.: Building a web warehouse for accessibility data. In: Proceedings of DOLAP (2006). https://doi.org/10.1145/1183512.1183522

  22. Thomsen, C., Pedersen, T.B.: A survey of open source tools for business intelligence. IJDWM 5(3), 56–75 (2009). https://doi.org/10.4018/jdwm.2009070103

  23. Thomsen, C., Pedersen, T.B.: pygrametl: a powerful programming framework for extract-transform-load programmers. In: Proceedings of DOLAP, pp. 49–56 (2009). https://doi.org/10.1145/2064676.2064684

  24. Thomsen, C., Pedersen, T.B.: pygrametl: a powerful programming framework for extract-transform-load programmers. DBTR-25, Aalborg University (2009). www.cs.aau.dk/DBTR

  25. Thomsen, C., Pedersen, T.B.: Easy and effective parallel programmable ETL. In: Proceedings of DOLAP, pp. 37–44 (2011)

  26. Trujillo, J., Luján-Mora, S.: A UML based approach for modeling ETL processes in data warehouses. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 307–320. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39648-2_25

  27. Vaisman, A., Zimányi, E.: Data Warehouse Systems: Design and Implementation. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54655-6

  28. Vassiliadis, P.: A survey of extract-transform-load technology. IJDWM 5(3), 1–27 (2009). https://doi.org/10.4018/jdwm.2009070101

Author information

Corresponding author

Correspondence to Christian Thomsen.

Appendix A Data Warehouse Concepts

This appendix offers a very short introduction to concepts and terms used in the chapter. More details and explanations can be found in the literature [7, 8, 27]. In a data warehouse (DW), data from an organization’s different operational systems is stored in a way that supports analysis (rather than the daily operations which are supported by the operational systems). An Extract-Transform-Load (ETL) process extracts data from the source systems, transforms the data (to make it fit into the DW and to cleanse it), and loads it into the DW. Data is divided into facts and dimensions. Facts represent the subjects of the desired analyses (e.g., sales) and have numerical measures (e.g., sales amount). Dimensions provide context and describe facts (Product, Store, and Time are, for example, relevant dimensions for sales). Dimensions are thus used for selection of data and grouping of data in analyses. Dimensions have hierarchies with levels (a Time dimension can, for example, have the hierarchy Day → Month → Quarter → Year). Each of the levels can also have a number of attributes.

When using a relational database to represent a DW, one can choose between two approaches for the schema design. In a snowflake schema, each level in a dimension is represented by its own table, and the tables have foreign keys to the tables for the following (higher) levels. The dimension tables are thus normalized. In a star schema, there is only one table per dimension. This table thus represents all levels and is denormalized. In both star schemas and snowflake schemas, the facts are represented by a fact table which has a foreign key for each dimension and a column for each measure. In a star schema, the foreign keys reference the dimension tables, while in a snowflake schema they reference the tables for the lowest levels of the dimensions. The keys used in a dimension table should be integers that do not carry any special meaning; such keys are called surrogate keys.
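
To make the difference concrete, the hypothetical sketch below declares a denormalized Product dimension (star), a Time dimension snowflaked into Day, Month, and Year tables, and a fact table referencing the Product table and the lowest Time level. All table and column names are invented for the example, the declarations follow pygrametl's documented API (Dimension, SnowflakedDimension, FactTable), and a ConnectionWrapper to the target DW must exist before the tables are declared.

    import psycopg2
    import pygrametl
    from pygrametl.tables import Dimension, SnowflakedDimension, FactTable

    # Hypothetical target DW; the ConnectionWrapper becomes the default connection.
    conn = pygrametl.ConnectionWrapper(psycopg2.connect(dbname="dw", user="etl"))

    # Star schema: a single denormalized table holds all levels of Product.
    productdim = Dimension(
        name="product",
        key="productid",                         # surrogate key without meaning
        attributes=["name", "category", "brand"],
        lookupatts=["name"])

    # Snowflake schema: one table per level of the Time hierarchy; each table
    # holds a foreign key to the table for the next (higher) level.
    daydim = Dimension(
        name="day", key="dayid",
        attributes=["day", "monthid"], lookupatts=["day"])
    monthdim = Dimension(
        name="month", key="monthid",
        attributes=["month", "yearid"], lookupatts=["month"])
    yeardim = Dimension(
        name="year", key="yearid",
        attributes=["year"], lookupatts=["year"])
    timedim = SnowflakedDimension(
        [(daydim, monthdim), (monthdim, yeardim)])   # day -> month -> year

    # The fact table references the Product table and the lowest Time level.
    salesfact = FactTable(
        name="sales",
        keyrefs=["productid", "dayid"],
        measures=["amount"])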

Changes may happen in the represented world, and it must therefore be possible to represent changes in dimensions. A dimension in which changes are represented is called a slowly changing dimension (SCD). There are a number of different techniques for SCDs [8]; here we consider two of the most commonly used. For type 1 SCDs, changes are simply represented by overwriting old values in the dimension tables. If, for example, the size of a shop changes, the size attribute is updated. This can be problematic because old facts (e.g., facts about sales from the shop when it had the previous size) now refer to the updated dimension member, and history is thus not correctly represented. This problem is avoided with a type 2 SCD, where a new version of the dimension member is created when there is a change. In other words, for type 2 SCDs, changes result in new rows in the dimension tables. In a type 2 SCD, there are often attributes such as ValidFrom, ValidTo, MostCurrentVersion, and VersionNumber that provide information about the represented versions.
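
As a sketch of how a type 2 SCD can be handled in code, the hypothetical example below uses pygrametl's SlowlyChangingDimension for a Shop dimension. The scdensure method looks up the member and, if a non-lookup attribute such as the size has changed, adds a new version row and closes the previous one; the validfrom, validto, and version attributes are maintained by the class (validfrom typically defaults to the load date). All names are invented, and the parameter names follow pygrametl's documented API, so verify them against the current documentation before use.

    import psycopg2
    import pygrametl
    from pygrametl.tables import SlowlyChangingDimension

    # Hypothetical target DW; a ConnectionWrapper must exist before the declaration.
    conn = pygrametl.ConnectionWrapper(psycopg2.connect(dbname="dw", user="etl"))

    shopdim = SlowlyChangingDimension(
        name="shop",
        key="shopid",                      # surrogate key
        attributes=["name", "size", "validfrom", "validto", "version"],
        lookupatts=["name"],               # identifies a shop across its versions
        versionatt="version",              # version number, maintained automatically
        fromatt="validfrom",               # set when a version becomes valid
        toatt="validto")                   # set when a version is superseded

    # First load: inserts version 1 of the shop. A later load with size=600
    # inserts version 2 and closes version 1 by setting its validto.
    row = {"name": "Downtown Shop", "size": 450}
    row["shopid"] = shopdim.scdensure(row)

    conn.commit()
    conn.close()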

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Thomsen, C., Andersen, O., Jensen, S.K., Pedersen, T.B. (2018). Programmatic ETL. In: Zimányi, E. (eds) Business Intelligence and Big Data. eBISS 2017. Lecture Notes in Business Information Processing, vol 324. Springer, Cham. https://doi.org/10.1007/978-3-319-96655-7_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-96655-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-96654-0

  • Online ISBN: 978-3-319-96655-7

  • eBook Packages: Computer Science, Computer Science (R0)
