Abstract
This book was motivated by the increasing amount and complexity of the data collected by digital systems in several areas, which makes knowledge discovery an essential step in businesses’ strategic decisions. The mining techniques used in the process usually have high computational costs and force the analyst to make complex choices. The complexity stems from the diversity of tasks that may be used in the analysis and from the large number of alternatives for executing each task. The most common data mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. The large computational cost comes from the need to explore several alternative solutions, in different combinations, to obtain the desired information. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated with handling large amounts of these complex data increase considerably, making the traditional techniques impractical. Therefore, special data mining techniques for this kind of data need to be developed. We discussed new data mining techniques for large sets of complex data, especially for the clustering task, which is tightly associated with other mining tasks that are performed together. Specifically, this book described in detail three novel data mining algorithms well suited to analyzing large sets of complex data: the method Halite for correlation clustering [11, 13]; the method BoW for clustering Terabyte-scale datasets [14]; and the method QMAS for labeling and summarization [12].
Notes
- 1. www.yahoo.com
- 2. twitter.com
- 3. Table 7.1 summarizes one table found in [17]: it includes a selection of the most relevant desirable properties and the most closely related works from the original table. Table 7.1 also includes two novel desirable properties not found in [17]: linear or quasi-linear complexity and Terabyte-scale data analysis.
References
Achtert, E., Böhm, C., David, J., Kröger, P., Zimek, A.: Global correlation clustering based on the Hough transform. Stat. Anal. Data Min. 1, 111–127 (2008). doi:10.1002/sam.v1:3
Achtert, E., Böhm, C., Kriegel, H.P., Kröger, P., Zimek, A.: Robust, complete, and efficient correlation clustering. SDM, USA (2007)
Aggarwal, C.C., Yu, P.S.: Finding generalized projected clusters in high dimensional spaces. SIGMOD Rec. 29(2), 70–81 (2000). doi:10.1145/335191.335383
Aggarwal, C., Yu, P.: Redefining clustering for high-dimensional applications. IEEE TKDE 14(2), 210–225 (2002). doi:10.1109/69.991713
Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. SIGMOD Rec. 28(2), 61–72 (1999). doi:10.1145/304181.304188
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec. 27(2), 94–105 (1998). doi:10.1145/276305.276314
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. Data Min. Knowl. Discov. 11(1), 5–33 (2005). doi:10.1007/s10618-005-1396-1
Böhm, C., Kailing, K., Kriegel, H.P., Kröger, P.: Density connected clustering with local subspace preferences. In: ICDM ’04: Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 27–34. IEEE Computer Society, USA (2004).
Böhm, C., Kailing, K., Kröger, P., Zimek, A.: Computing clusters of correlation connected objects. In: SIGMOD, pp. 455–466. USA (2004). http://doi.acm.org/10.1145/1007568.1007620
Cheng, C.H., Fu, A.W., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: KDD, pp. 84–93. NY, USA (1999). http://doi.acm.org/10.1145/312129.312199
Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Finding clusters in subspaces of very large, multi-dimensional datasets. In: Li, F., Moro, M.M., Ghandeharizadeh, S., Haritsa, J.R., Weikum, G., Carey, M.J., Casati, F., Chang, E.Y., Manolescu, I., Mehrotra, S., Dayal, U., Tsotras, V.J. (eds.) ICDE, pp. 625–636. IEEE (2010).
Cordeiro, R.L.F., Guo, F., Haverkamp, D.S., Horne, J.H., Hughes, E.K., Kim, G., Traina, A.J.M., Traina Jr., C., Faloutsos, C.: QMAS: Querying, mining and summarization of multi-modal databases. In: Webb, G.I., Liu, B., Zhang, C., Gunopulos, D., Wu, X. (eds.) ICDM, pp. 785–790. IEEE Computer Society (2010).
Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Halite: Fast and scalable multi-resolution local-correlation clustering. IEEE Trans. Knowl. Data Eng. 99(PrePrints) (2011). doi:10.1109/TKDE.2011.176.
Cordeiro, R.L.F., Traina Jr., C., Traina, A.J.M., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with MapReduce. In: Apté, C., Ghosh, J., Smyth, P. (eds.) KDD, pp. 690–698. ACM (2011).
Friedman, J.H., Meulman, J.J.: Clustering objects on subsets of attributes (with discussion). J. Roy. Stat. Soc. Ser. B 66(4), 815–849 (2004). doi:a/bla/jorssb/v66y2004i4p815-849.html
Kriegel, H.P., Kröger, P., Renz, M., Wurst, S.: A generic framework for efficient subspace clustering of high-dimensional data. In: ICDM, pp. 250–257. Washington, USA (2005). http://dx.doi.org/10.1109/ICDM.2005.5
Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM TKDD 3(1), 1–58 (2009). doi:10.1145/1497577.1497578
Kröger, P., Kriegel, H.P., Kailing, K.: Density-connected subspace clustering for high-dimensional data. SDM, USA (2004)
Moise, G., Sander, J., Ester, M.: P3C: A robust projected clustering algorithm. In: ICDM, pp. 414–425. IEEE Computer Society (2006).
Moise, G., Sander, J., Ester, M.: Robust projected clustering. Knowl. Inf. Syst. 14(3), 273–298 (2008). doi:10.1007/s10115-007-0090-6
Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A Monte Carlo algorithm for fast projective clustering. In: SIGMOD, pp. 418–427. USA (2002). http://doi.acm.org/10.1145/564691.564739
Copyright information
© 2013 The Author(s)
About this chapter
Cite this chapter
Cordeiro, R.L., Faloutsos, C., Traina Júnior, C. (2013). Conclusion. In: Data Mining in Large Sets of Complex Data. SpringerBriefs in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-4890-6_7
DOI: https://doi.org/10.1007/978-1-4471-4890-6_7
Publisher Name: Springer, London
Print ISBN: 978-1-4471-4889-0
Online ISBN: 978-1-4471-4890-6
eBook Packages: Computer Science, Computer Science (R0)