Abstract
To study data dependencies over heterogeneous data in dataspaces, we define a general dependency form, namely comparable dependencies (CDs), which specify constraints on comparable attributes. CDs cover the semantics of a broad class of dependencies in databases, including functional dependencies (FDs), metric functional dependencies (MFDs), and matching dependencies (MDs). As we illustrate, comparable dependencies are useful in practical dataspace applications, such as semantic query optimization. Because data in dataspaces are heterogeneous, the first question, known as the validation problem, is to determine whether a dependency (almost) holds in a data instance. Unfortunately, as we prove, the validation problem with a given error or confidence guarantee is generally hard. In fact, the confidence validation problem is also NP-hard to approximate to within any constant factor. Nevertheless, we develop several approaches for efficient approximate computation, including greedy and randomized approaches with an approximation bound in terms of the maximum number of violations that an object may introduce. Finally, through an extensive experimental evaluation on real data, we verify the superiority of our methods.
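To make the validation problem concrete, the following is a minimal sketch and not the algorithm from the article. It assumes a simplified dependency of the form "objects that are similar on the left-hand attributes must be similar on the right-hand attributes", with a toy `distance` function and per-attribute thresholds. The names `distance`, `violations`, and `greedy_removal`, the sample rows, and the thresholds are all hypothetical. The greedy step is a generic maximum-degree heuristic that estimates how many objects must be dropped for the rest to satisfy the dependency. Treating the fraction of kept objects as the confidence is also an assumption here.

```python
from itertools import combinations


def distance(a, b):
    """Toy metric (assumption): absolute difference for numbers, 0/1 mismatch otherwise."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0 if a == b else 1


def violations(rows, lhs, rhs):
    """Return index pairs that are similar on every lhs attribute but not on every rhs attribute.

    lhs and rhs are lists of (attribute, threshold) pairs.
    """
    found = []
    for i, j in combinations(range(len(rows)), 2):
        similar_lhs = all(distance(rows[i][a], rows[j][a]) <= t for a, t in lhs)
        similar_rhs = all(distance(rows[i][a], rows[j][a]) <= t for a, t in rhs)
        if similar_lhs and not similar_rhs:
            found.append((i, j))
    return found


def greedy_removal(pairs):
    """Repeatedly drop the object involved in the most remaining violations."""
    remaining = set(pairs)
    removed = []
    while remaining:
        degree = {}
        for i, j in remaining:
            degree[i] = degree.get(i, 0) + 1
            degree[j] = degree.get(j, 0) + 1
        # Ties are broken by the lowest index so the result is deterministic.
        worst = max(sorted(degree), key=degree.get)
        removed.append(worst)
        remaining = {p for p in remaining if worst not in p}
    return removed


# Hypothetical data: zip codes within 5 of each other should share a city.
rows = [
    {"zip": 10001, "city": "New York"},
    {"zip": 10002, "city": "New York"},
    {"zip": 10001, "city": "Boston"},
    {"zip": 94105, "city": "San Francisco"},
]
lhs = [("zip", 5)]
rhs = [("city", 0)]

pairs = violations(rows, lhs, rhs)
removed = greedy_removal(pairs)
confidence = (len(rows) - len(removed)) / len(rows)

print(pairs)       # [(0, 2), (1, 2)]
print(removed)     # [2]
print(confidence)  # 0.75
```

In this sketch the pairwise check is quadratic in the number of objects, and the greedy choice yields only an estimate of the true minimum number of removals. That is consistent with the abstract's point that exact confidence validation is hard and that approximation is needed in practice.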
Cite this article
Song, S., Chen, L. & Yu, P.S. Comparable dependencies over heterogeneous data. The VLDB Journal 22, 253–274 (2013). https://doi.org/10.1007/s00778-012-0285-7
DOI: https://doi.org/10.1007/s00778-012-0285-7