Abstract
In Rail Automation, planning future projects requires the integration of business-critical data from heterogeneous, often noisy data sources. Current integration approaches often neglect uncertainties and inconsistencies in the integration process and thus cannot guarantee the necessary data quality. To tackle these issues, we propose a semi-automated process for data import in which the user resolves ambiguous data classifications. The task of finding the correct data warehouse entry for a source value in a proprietary, often semi-structured format is supported by the notion of a signifier, which is a natural extension of composite primary keys. In three different case studies we show that this approach (i) facilitates high-quality data integration while minimizing user interaction, (ii) leverages approximate name matching of railway station and entity names, and (iii) helps to extract features from contextual data for data cross-checks, and thus supports the planning phases of railway projects.
This work is funded by the Austrian Research Promotion Agency (FFG) under grant 852658 (CODA).
Alexandra Mazak is affiliated with the Christian Doppler Laboratory on Model-Integrated Smart Production at TU Wien.
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Wurl, A., Falkner, A., Haselböck, A., Mazak, A. (2018). Advanced Data Integration with Signifiers: Case Studies for Rail Automation. In: Filipe, J., Bernardino, J., Quix, C. (eds) Data Management Technologies and Applications. DATA 2017. Communications in Computer and Information Science, vol 814. Springer, Cham. https://doi.org/10.1007/978-3-319-94809-6_5
DOI: https://doi.org/10.1007/978-3-319-94809-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94808-9
Online ISBN: 978-3-319-94809-6
eBook Packages: Computer Science, Computer Science (R0)