Towards a Compact and Effective Representation for Datasets with Inhomogeneous Clusters

  • Conference paper
Neural Information Processing (ICONIP 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11304)

Abstract

Limited computing resources often make it impractical to analyze massive datasets directly. Instead, a set of representatives can be extracted to approximate the spatial distribution of the data objects. Standard data mining algorithms are then run on these selected points only, which typically account for a small fraction of the original data, reducing the computational time significantly. In practice, the boundary points of data clusters can be regarded as a compact and effective representation of the original data, with great potential in clustering, outlier or anomaly detection, and classification. As a result, reliably identifying a set of effective boundary points in a complex dataset poses a new challenge in data mining. In this paper, we present a boundary extraction technique similar to the method in SCUBI (Scalable Clustering Using Boundary Information). The key difference is that our technique exploits the clustering information in a feedback loop to further refine the boundary. Experimental results show that our technique is more robust and produces more representative boundary points than SCUBI, especially on complex datasets whose clusters differ widely in density.
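
To make the idea of boundary points as representatives concrete, here is a minimal Python sketch of one generic way to score boundary points. It is not the technique of this paper and not SCUBI: the abstract does not specify either algorithm, and the feedback loop is not reproduced. The sketch assumes a simple k-nearest-neighbour heuristic in which a point is likely to lie on a boundary when its neighbours sit mostly on one side of it. The function names `boundary_scores` and `extract_boundary` and the parameters `k` and `fraction` are hypothetical.

```python
import numpy as np


def boundary_scores(points, k=10):
    """Score each point by how one-sided its k nearest neighbours are.

    Illustrative heuristic only (an assumption, not the paper's method).
    Assumes the points are distinct, so each point is its own nearest match.
    """
    pts = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances (O(n^2) memory; acceptable for a sketch).
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Indices of the k nearest neighbours, skipping the point itself.
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]
    # Unit vectors pointing from each point to its neighbours.
    vecs = pts[nn] - pts[:, None, :]
    norms = np.linalg.norm(vecs, axis=2, keepdims=True)
    units = vecs / np.maximum(norms, 1e-12)
    # Interior points have neighbours all around, so the unit vectors tend
    # to cancel; boundary points have neighbours on one side, so the mean
    # vector stays long.
    return np.linalg.norm(units.mean(axis=1), axis=1)


def extract_boundary(points, k=10, fraction=0.2):
    """Return indices of the highest-scoring `fraction` of the points."""
    scores = boundary_scores(points, k)
    n_keep = max(1, int(round(fraction * len(scores))))
    return np.argsort(scores)[-n_keep:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(300, 2))
    idx = extract_boundary(data, k=10, fraction=0.1)
    print(len(idx), "candidate boundary points out of", len(data))
```

Because the score uses only the directions to the neighbours and not their distances, it does not depend on the local scale of a cluster. That is one simple way to cope with clusters of different density, though whether the paper takes a comparable approach is not stated in the abstract.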

Corresponding author

Correspondence to Yuan Bo.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhao, H., Chen, Z., Tong, Q., Bo, Y. (2018). Towards a Compact and Effective Representation for Datasets with Inhomogeneous Clusters. In: Cheng, L., Leung, A., Ozawa, S. (eds) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science, vol 11304. Springer, Cham. https://doi.org/10.1007/978-3-030-04212-7_14

  • DOI: https://doi.org/10.1007/978-3-030-04212-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04211-0

  • Online ISBN: 978-3-030-04212-7

  • eBook Packages: Computer Science, Computer Science (R0)
