Towards a Compact and Effective Representation for Datasets with Inhomogeneous Clusters

Zhao, Haimei; Chen, Zhuo; Tong, Qiuhui; Bo, Yuan

doi:10.1007/978-3-030-04212-7_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11304)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Due to the restriction of computing resources, it is often inconvenient to directly conduct analysis on massive datasets. Instead, a set of representatives can be extracted to approximate the spatial distribution of data objects. Standard data mining algorithms are then performed on these selected points only, which typically account for a small fraction of the original data, reducing the computational time significantly. In practice, the boundary points of data clusters can be regarded as a compact and effective representation of the original data, with great potential in clustering, outlier or anomaly detection and classification. As a result, given a complex dataset, how to reliably identify a set of effective boundary points creates a new challenge in data mining. In this paper, we present a boundary extraction technique similar to the method in SCUBI (Scalable Clustering Using Boundary Information). The key difference is that our technique exploits the clustering information in a feedback loop to further refine the boundary. Experimental results show that our technique is more robust and can produce more representative boundary points than SCUBI, especially on complex datasets with large inhomogeneity in terms of cluster density.
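As a rough illustration of what selecting boundary points as representatives can look like, the sketch below scores each point by how one-sided its k nearest neighbours are and keeps the highest-scoring fraction within each cluster. This is a generic heuristic chosen for illustration, not the paper's technique or SCUBI's; the function names `boundary_scores` and `select_boundary`, the parameters `k` and `frac`, and the per-cluster selection rule are all assumptions, and the clustering feedback loop described above is not reproduced here.

```python
import numpy as np

def boundary_scores(X, k=10):
    """Score each point by the norm of the mean unit vector to its k nearest neighbours.

    Interior points have neighbours on all sides, so the unit vectors roughly
    cancel and the score is small; points near a cluster edge have neighbours
    mostly on one side, so the score is large. (Illustrative heuristic only.)
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared distances; O(n^2) memory, acceptable for a small demo.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]
    vecs = X[nn] - X[:, None, :]  # shape (n, k, d)
    norms = np.linalg.norm(vecs, axis=2, keepdims=True)
    units = vecs / np.maximum(norms, 1e-12)
    return np.linalg.norm(units.mean(axis=1), axis=1)

def select_boundary(X, labels=None, k=10, frac=0.1):
    """Return sorted indices of the top `frac` highest-scoring points per cluster.

    `labels` is a hypothetical array of cluster assignments; when omitted,
    the whole dataset is treated as a single cluster.
    """
    X = np.asarray(X, dtype=float)
    if labels is None:
        labels = np.zeros(len(X), dtype=int)
    labels = np.asarray(labels)
    chosen = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        kk = min(k, len(members) - 1)
        if kk < 1:
            continue  # a singleton cluster has no neighbours to score against
        scores = boundary_scores(X[members], kk)
        m = max(1, int(round(frac * len(members))))
        chosen.append(members[np.argsort(scores)[-m:]])
    if not chosen:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(chosen))
```

Selecting a fixed fraction inside each cluster, rather than applying one global threshold, is one simple way cluster labels could keep sparse clusters from being under-represented; whether this matches the refinement used in the paper is not stated in the abstract.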

Author information

Authors and Affiliations

Intelligent Computing Lab, Division of Informatics, Graduate School at Shenzhen, Tsinghua University, Shenzhen, 518055, People’s Republic of China
Haimei Zhao, Zhuo Chen, Qiuhui Tong & Yuan Bo

Corresponding author

Correspondence to Yuan Bo.

Editor information

Editors and Affiliations

The Chinese Academy of Sciences, Beijing, China
Long Cheng
City University of Hong Kong, Kowloon, Hong Kong
Andrew Chi Sing Leung
Kobe University, Kobe, Japan
Seiichi Ozawa

About this paper

Cite this paper

Zhao, H., Chen, Z., Tong, Q., Bo, Y. (2018). Towards a Compact and Effective Representation for Datasets with Inhomogeneous Clusters. In: Cheng, L., Leung, A., Ozawa, S. (eds) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science, vol 11304. Springer, Cham. https://doi.org/10.1007/978-3-030-04212-7_14

DOI: https://doi.org/10.1007/978-3-030-04212-7_14
Published: 17 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04211-0
Online ISBN: 978-3-030-04212-7
eBook Packages: Computer Science, Computer Science (R0)
