Related Work and Concepts

Data Mining in Large Sets of Complex Data

Part of the book series: SpringerBriefs in Computer Science ((BRIEFSCOMPUTER))

Abstract

This chapter presents the main background knowledge relevant to the book. Sections 2.1 and 2.2 describe the areas of processing complex data and knowledge discovery in traditional databases. The task of clustering complex data is discussed in Sect. 2.3, while the task of labeling this kind of data is described in Sect. 2.4. Section 2.5 introduces the MapReduce framework, a promising tool for large-scale data analysis that has proven to offer valuable support for executing data mining algorithms in a parallel processing environment. Section 2.6 concludes the chapter.

Author information

Corresponding author

Correspondence to Robson L. F. Cordeiro.

Copyright information

© 2013 The Author(s)

About this chapter

Cite this chapter

Cordeiro, R.L.F., Faloutsos, C., Traina Júnior, C. (2013). Related Work and Concepts. In: Data Mining in Large Sets of Complex Data. SpringerBriefs in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-4890-6_2

  • DOI: https://doi.org/10.1007/978-1-4471-4890-6_2

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-4889-0

  • Online ISBN: 978-1-4471-4890-6

  • eBook Packages: Computer Science, Computer Science (R0)
