Advanced Data Integration with Signifiers: Case Studies for Rail Automation

Wurl, Alexander; Falkner, Andreas; Haselböck, Alois; Mazak, Alexandra

doi:10.1007/978-3-319-94809-6_5

Alexander Wurl¹²,
Andreas Falkner¹²,
Alois Haselböck¹² &
…
Alexandra Mazak¹³

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 814))

Included in the following conference series:

International Conference on Data Management Technologies and Applications

523 Accesses
2 Citations

Abstract

In Rail Automation, planning future projects requires the integration of business-critical data from heterogeneous, often noisy data sources. Current integration approaches often neglect uncertainties and inconsistencies in the integration process and thus cannot guarantee the necessary data quality. To tackle these issues, we propose a semi-automated process for data import, where the user resolves ambiguous data classifications. The task of finding the correct data warehouse entry for a source value in a proprietary, often semi-structured format is supported by the notion of a signifier which is a natural extension of composite primary keys. In three different case studies we show that this approach (i) facilitates high-quality data integration while minimizing user interaction, (ii) leverages approximate name matching of railway station and entity names, (iii) contributes to extract features from contextual data for data cross-checks and thus supports the planning phases of railway projects.

This work is funded by the Austrian Research Promotion Agency (FFG) under grant 852658 (CODA).

Alexandra Mazak is affiliated with the Christian Doppler Laboratory on Model-Integrated Smart Production at TU Wien.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

References

Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of IJCAI-2003, 9–10 August 2003, Acapulco, Mexico, pp. 73–78 (2003)
Google Scholar
Langer, P., Wimmer, M., Gray, J., Kappel, G., Vallecillo, A.: Language-specific model versioning based on signifiers. J. Object Technol. 11, 4-1 (2012)
Google Scholar
Wurl, A., Falkner, A., Haselböck, A., Mazak, A.: Using signifiers for data integration in rail automation. In: Proceedings of the 6th International Conference on Data Science, Technology and Applications - Volume 1: DATA, INSTICC, pp. 172–179. SciTePress (2017)
Google Scholar
Runeson, P., Höst, M.: Guidelines for conducting and reporting case study research in software engineering. Empirical Softw. Eng. 14, 131 (2009)
Article Google Scholar
Naumann, F.: Data profiling revisited. ACM SIGMOD Rec. 42, 40–49 (2014)
Article Google Scholar
Salton, G., Harman, D.: Information Retrieval. Wiley, Chichester (2003)
Google Scholar
Wimmer, M., Langer, P.: A benchmark for model matching systems: the heterogeneous metamodel case. Softwaretechnik-Trends 33 (2013)
Article Google Scholar
Ji, S., Li, G., Li, C., Feng, J.: Efficient interactive fuzzy keyword search. In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, 20–24 April 2009, pp. 371–380 (2009)
Google Scholar
Zobel, J., Dart, P.W.: Phonetic string matching: lessons from information retrieval. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1996, 18–22 August 1996, Zurich, Switzerland, pp. 166–172 (1996). (Special Issue of the SIGIR Forum)
Google Scholar
Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. (CSUR) 41, 1 (2009)
Article Google Scholar
Leser, U., Naumann, F.: Informationsintegration - Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen. dpunkt.verlag (2007)
Google Scholar
Sharma, S., Jain, R.: Modeling ETL process for data warehouse: an exploratory study. In. In: Fourth International Conference on ACCT 2014, pp. 271–276. IEEE (2014)
Google Scholar
Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23, 3–13 (2000)
Google Scholar
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I.F., Ouzzani, M., Tang, N.: NADEEF: a commodity data cleaning system. In: Proceedings of the 2013 ACM SIGMOD, pp. 541–552. ACM (2013)
Google Scholar
Fan, W., Geerts, F.: Foundations of data quality management. Synth. Lect. Data Manage. 4, 1–217 (2012)
Article Google Scholar
Dasu, T., Johnson, T.: Exploratory data mining and data cleaning: an overview. In: Exploratory Data Mining and Data Cleaning, pp. 1–16 (2003)
Google Scholar
Hellerstein, J.M.: Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE) (2008)
Google Scholar
Liu, H., Kumar, T.A., Thomas, J.P.: Cleaning framework for big data-object identification and linkage. In: 2015 IEEE International Congress on Big Data, pp. 215–221. IEEE (2015)
Google Scholar
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the Ninth ACM SIGKDD, pp. 39–48. ACM (2003)
Google Scholar
Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. Proc. VLDB Endowment 5, 1483–1494 (2012)
Article Google Scholar
Müller, H., Freytag, J.C.: Problems, methods, and challenges in comprehensive data cleansing. Professoren des Inst. für Informatik (2005)
Google Scholar
Krishnan, S., Haas, D., Franklin, M.J., Wu, E.: Towards reliable interactive data cleaning: a user survey and recommendations. In: HILDA@ SIGMOD, p. 9 (2016)
Google Scholar
Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. Proc. VLDB Endowment 3, 173–184 (2010)
Article Google Scholar
Khayyat, Z., Ilyas, I.F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.A., Tang, N., Yin, S.: BigDansing: a system for big data cleansing. In: Proceedings of the 2015 ACM SIGMOD, pp. 1215–1230. ACM (2015)
Google Scholar
Volkovs, M., Chiang, F., Szlichta, J., Miller, R.J.: Continuous data cleaning. In: 2014 IEEE 30th ICDE 2014, pp. 244–255. IEEE (2014)
Google Scholar
Dai, W., Wardlaw, I., Cui, Y., Mehdi, K., Li, Y., Long, J.: Data profiling technology of data governance regarding big data: review and rethinking. Information Technology: New Generations. AISC, vol. 448, pp. 439–450. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32467-8_39
Chapter Google Scholar
Gill, R., Singh, J.: A review of contemporary data quality issues in data warehouse ETL environment. J. Today’s Ideas Tomorrow’s Technol. 2(2), 153–160 (2014)
Article Google Scholar
Gottesheim, W., Mitsch, S., Retschitzegger, W., Schwinger, W., Baumgartner, N.: SemGen—towards a semantic data generator for benchmarking duplicate detectors. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds.) DASFAA 2011. LNCS, vol. 6637, pp. 490–501. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20244-5_47
Chapter Google Scholar
Papadakis, G., Alexiou, G., Papastefanatos, G., Koutrika, G.: Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. Proc. VLDB Endowment 9, 312–323 (2015)
Article Google Scholar
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24, 1537–1555 (2012)
Article Google Scholar
Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: Learning to scale up record linkage. In: Sixth International Conference on Data Mining, ICDM 2006, pp. 87–96. IEEE (2006)
Google Scholar
Michelson, M., Knoblock, C.A.: Learning blocking schemes for record linkage. In: AAAI, pp. 440–445 (2006)
Google Scholar

Download references

Author information

Authors and Affiliations

Siemens AG Österreich, Corporate Technology, Vienna, Austria
Alexander Wurl, Andreas Falkner & Alois Haselböck
Business Informatics Group, TU Wien, Vienna, Austria
Alexandra Mazak

Authors

Alexander Wurl
View author publications
You can also search for this author in PubMed Google Scholar
Andreas Falkner
View author publications
You can also search for this author in PubMed Google Scholar
Alois Haselböck
View author publications
You can also search for this author in PubMed Google Scholar
Alexandra Mazak
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Alexander Wurl .

Editor information

Editors and Affiliations

INSTICC, Polytechnic Institute of Setúbal, Setúbal, Portugal
Joaquim Filipe
University of Coimbra, Coimbra, Portugal
Jorge Bernardino
RWTH Aachen University, Aachen, Germany
Christoph Quix

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wurl, A., Falkner, A., Haselböck, A., Mazak, A. (2018). Advanced Data Integration with Signifiers: Case Studies for Rail Automation. In: Filipe, J., Bernardino, J., Quix, C. (eds) Data Management Technologies and Applications. DATA 2017. Communications in Computer and Information Science, vol 814. Springer, Cham. https://doi.org/10.1007/978-3-319-94809-6_5

Download citation

DOI: https://doi.org/10.1007/978-3-319-94809-6_5
Published: 30 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94808-9
Online ISBN: 978-3-319-94809-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics