k-Means Clustering with Outlier Detection, Mixed Variables and Missing Values

  • Conference paper
Exploratory Data Analysis in Empirical Research

Abstract

This paper addresses practical issues in k-means cluster analysis or segmentation with mixed types of variables and missing values. A more general k-means clustering procedure is developed that is suitable for use with very large datasets, such as arise in data mining and survey analysis. An exact assignment test guarantees that the algorithm will converge, and the detection of outliers allows the densest regions of the sample space to be mapped by tessellations of tightly-specified spherical clusters. A summary tree is obtained for the resulting k-cluster partition.
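
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions rather than the paper's procedure. It assumes missing values are coded as NaN and that distances are averaged over the variables observed for each case. It also assumes that a case farther than a user-chosen `threshold` from every cluster centre is left unassigned as an outlier. The function names, the `threshold` parameter and the toy data are hypothetical. Mixed variable types are not handled here (categorical variables would need to be encoded first), and the exact assignment test mentioned in the abstract is not implemented, so `max_iter` acts as a simple guard on the iteration.

```python
import numpy as np

def masked_sq_dist(X, centers):
    """Mean squared difference over the variables observed in each case.

    X: (n, p) array with np.nan marking missing values.
    centers: (k, p) array without missing values.
    Returns an (n, k) array of distances.
    """
    observed = ~np.isnan(X)
    X_filled = np.where(observed, X, 0.0)
    diff = X_filled[:, None, :] - centers[None, :, :]
    diff = diff * observed[:, None, :]          # ignore missing variables
    n_observed = np.maximum(observed.sum(axis=1), 1)
    return (diff ** 2).sum(axis=2) / n_observed[:, None]

def kmeans_with_outliers(X, init_centers, threshold, max_iter=100):
    """Lloyd-style k-means; cases beyond `threshold` get label -1.

    Assumption: this is a plain reassign-and-update loop, not the
    exact assignment test described in the abstract.
    """
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    labels = np.full(len(X), -2)                # -2 means "not yet assigned"
    for _ in range(max_iter):
        d = masked_sq_dist(X, centers)
        new_labels = d.argmin(axis=1)
        new_labels[d.min(axis=1) > threshold] = -1   # flag outliers
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) == 0:
                continue                        # keep the old centre
            obs = ~np.isnan(members)
            n_obs = obs.sum(axis=0)
            sums = np.where(obs, members, 0.0).sum(axis=0)
            # Mean of observed values; keep the old value if none observed.
            centers[j] = np.where(n_obs > 0, sums / np.maximum(n_obs, 1), centers[j])
    return labels, centers

# Toy data (hypothetical): two tight groups, two missing values, one remote case.
X = np.array([
    [0.0, 0.0], [0.2, np.nan], [0.1, 0.1],
    [5.0, 5.0], [5.1, 4.9], [np.nan, 5.2],
    [20.0, 20.0],
])
labels, centers = kmeans_with_outliers(X, [[0.0, 0.0], [5.0, 5.0]], threshold=1.0)
print(labels)
print(centers)
```

In this toy run the remote case receives the label -1 and does not contribute to either centre, which is the sense in which outlier removal keeps the remaining clusters tight.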

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wishart, D. (2003). k-Means Clustering with Outlier Detection, Mixed Variables and Missing Values. In: Schwaiger, M., Opitz, O. (eds) Exploratory Data Analysis in Empirical Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55721-7_23

  • DOI: https://doi.org/10.1007/978-3-642-55721-7_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44183-0

  • Online ISBN: 978-3-642-55721-7

  • eBook Packages: Springer Book Archive
