
Data Mining in Large Databases Using Domain Generalization Graphs

Journal of Intelligent Information Systems

Abstract

Attribute-oriented generalization summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We introduce domain generalization graphs (DGGs) for controlling the generalization of a set of attributes and show how they are constructed. We then present serial and parallel versions of the Multi-Attribute Generalization algorithm for traversing the generalization state space described by joining the DGGs for multiple attributes. Based upon a generate-and-test approach, the algorithm generates all possible summaries consistent with the DGGs. Our experimental results show that significant speedups are possible by partitioning path combinations from the DGGs across multiple processors. We also rank the interestingness of the resulting summaries using measures based upon variance and relative entropy, and our experimental results show that these measures provide an effective basis for analyzing summary data generated from relational databases. Variance appears more useful because it tends to rank the less complex summaries (i.e., those with few attributes and/or tuples) as more interesting.
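The generate-and-test idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Multi-Attribute Generalization algorithm: each DGG is simplified to a single path of value mappings, and the sample rows, the hierarchies, and the names `generalize`, `summarize`, `variance`, and `all_summaries` are all hypothetical. The variance function assumes one plausible definition (the spread of tuple proportions around the uniform value 1/m); the article's exact measure may differ.

```python
from collections import Counter
from itertools import product

# Hypothetical concept hierarchies. Each path is a list of mappings, and
# each mapping lifts a value one step toward a more general concept.
CITY_PATH = [
    {"Regina": "Saskatchewan", "Saskatoon": "Saskatchewan", "Calgary": "Alberta"},
    {"Saskatchewan": "Canada", "Alberta": "Canada"},
]
AGE_PATH = [
    {23: "20-29", 27: "20-29", 35: "30-39"},
    {"20-29": "ANY", "30-39": "ANY"},
]

# Hypothetical relation with two attributes: (city, age).
ROWS = [("Regina", 23), ("Saskatoon", 27), ("Calgary", 35), ("Regina", 35)]


def generalize(value, path, level):
    """Apply the first `level` mappings of `path` to `value`."""
    for mapping in path[:level]:
        value = mapping[value]
    return value


def summarize(rows, paths, levels):
    """Generalize each attribute to its chosen level and count equal tuples."""
    return Counter(
        tuple(generalize(v, p, lvl) for v, p, lvl in zip(row, paths, levels))
        for row in rows
    )


def variance(summary):
    """Assumed measure: variance of tuple proportions around 1/m.

    Returns 0.0 for a summary with fewer than two tuples (a choice made
    for this sketch, not taken from the article).
    """
    m = len(summary)
    if m < 2:
        return 0.0
    total = sum(summary.values())
    q = 1.0 / m
    return sum((count / total - q) ** 2 for count in summary.values()) / (m - 1)


def all_summaries(rows, paths):
    """Generate one summary per combination of generalization levels."""
    level_ranges = [range(len(path) + 1) for path in paths]
    return {
        levels: summarize(rows, paths, levels)
        for levels in product(*level_ranges)
    }


if __name__ == "__main__":
    summaries = all_summaries(ROWS, [CITY_PATH, AGE_PATH])
    # List the summaries from highest to lowest variance score.
    for levels, summary in sorted(
        summaries.items(), key=lambda item: variance(item[1]), reverse=True
    ):
        print(levels, dict(summary), round(variance(summary), 4))
```

Because every level combination is independent of the others, the keys produced by `product` could be split across worker processes, which is the general shape of the partitioning the abstract attributes to the parallel version.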




Cite this article

Hilderman, R.J., Hamilton, H.J. & Cercone, N. Data Mining in Large Databases Using Domain Generalization Graphs. Journal of Intelligent Information Systems 13, 195–234 (1999). https://doi.org/10.1023/A:1008769516670


  • DOI: https://doi.org/10.1023/A:1008769516670
