Automated Schema Quality Measurement in Large-Scale Information Systems

  • Conference paper
Data Quality and Trust in Big Data (QUAT 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11235)

Abstract

Assessing the quality of information system schemas is crucial because an unoptimized or erroneous schema design strongly affects the quality of the stored data; for example, it may lead to inconsistencies and anomalies at the data level. Even if the initial schema had an ideal design, changes during its life cycle can degrade schema quality and have to be addressed. Big Data environments pose two major challenges in particular: large schemas, for which manual verification of schema and data quality is very arduous, and the integration of heterogeneous schemas from different data models, whose quality cannot be compared directly. We therefore present a domain-independent approach for automatically measuring the quality of large and heterogeneous (logical) schemas. In contrast to existing approaches, we provide a fully automatable workflow that also enables regular reassessment. Our implementation supports measuring the quality dimensions correctness, completeness, pertinence, minimality, readability, and normalization.

Acknowledgments

The research reported in this paper was supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry for Digital and Economic Affairs, and the Province of Upper Austria within the framework of the COMET center SCCH.

Author information

Corresponding author

Correspondence to Lisa Ehrlinger.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Ehrlinger, L., Wöß, W. (2019). Automated Schema Quality Measurement in Large-Scale Information Systems. In: Hacid, H., Sheng, Q., Yoshida, T., Sarkheyli, A., Zhou, R. (eds) Data Quality and Trust in Big Data. QUAT 2018. Lecture Notes in Computer Science, vol 11235. Springer, Cham. https://doi.org/10.1007/978-3-030-19143-6_2

  • DOI: https://doi.org/10.1007/978-3-030-19143-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-19142-9

  • Online ISBN: 978-3-030-19143-6

  • eBook Packages: Computer Science, Computer Science (R0)
