Supporting KDD Applications by the k-Nearest Neighbor Join

Böhm, Christian; Krebs, Florian

doi:10.1007/978-3-540-45227-0_50

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2736)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of similarity join are well known: the distance range join, where the user defines a distance threshold for the join, and the closest point query or k-distance join, which retrieves the k most similar pairs. In this paper, we propose an important third similarity join operation, called the k-nearest neighbor join, which combines each point of one point set with its k nearest neighbors in the other set. We find that many standard algorithms of Knowledge Discovery in Databases (KDD), such as k-means and k-medoid clustering, nearest neighbor classification, data cleansing, and postprocessing of sampling-based data mining, can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of their results. Our list of possible applications includes standard methods for all stages of the KDD process: preprocessing, data mining, and postprocessing. Thus, our method accelerates the complete KDD process.
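
The three join types named in the abstract differ only in how result pairs are selected, which a brute-force sketch can make concrete. The Python below is an illustration under stated assumptions, not the authors' algorithm: the function names (distance_range_join, k_distance_join, knn_join), the Euclidean metric, and the sample points are all hypothetical, and the nested-loop evaluation ignores the index-based techniques the paper is about.

```python
import heapq
import math
from itertools import product


def distance_range_join(R, S, eps):
    """All pairs (r, s) whose distance is at most the threshold eps."""
    return [(r, s) for r, s in product(R, S) if math.dist(r, s) <= eps]


def k_distance_join(R, S, k):
    """The k closest pairs over the whole cross product R x S."""
    return heapq.nsmallest(k, product(R, S), key=lambda pair: math.dist(*pair))


def knn_join(R, S, k):
    """For every point r in R, its k nearest neighbors in S."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}


# Hypothetical sample data: two small 2-d point sets.
R = [(0.0, 0.0), (10.0, 0.0)]
S = [(1.0, 0.0), (2.0, 0.0), (9.0, 0.0), (20.0, 0.0)]

range_pairs = distance_range_join(R, S, eps=1.5)  # result size depends on eps
closest_pairs = k_distance_join(R, S, k=3)        # exactly k pairs in total
neighbors = knn_join(R, S, k=2)                   # exactly k pairs per point of R
```

Note the asymmetry of the k-nn join: every point of R appears in the result exactly k times, whereas the k-distance join may leave points of R out entirely. With S taken as the current set of cluster centers and k = 1, knn_join returns the nearest center for each point, which is the assignment step of k-means.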

Author information

Authors and Affiliations

University for Health Informatics and Technology, Innrain 98, 6020, Innsbruck, Austria
Christian Böhm & Florian Krebs

Editor information

Editors and Affiliations

Gerstner Laboratory, Czech Technical University in Prague, Technická 2, 166 27, Prague 6, Czech Republic
Vladimír Mařík
Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria
Werner Retschitzegger
Faculty of Electrical Engineering, The Gerstner Laboratory, Czech Technical University in Prague, Technická 2, 166 27, Prague 6, Czech Republic
Olga Štěpánková

About this paper

Cite this paper

Böhm, C., Krebs, F. (2003). Supporting KDD Applications by the k-Nearest Neighbor Join. In: Mařík, V., Retschitzegger, W., Štěpánková, O. (eds) Database and Expert Systems Applications. DEXA 2003. Lecture Notes in Computer Science, vol 2736. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45227-0_50

DOI: https://doi.org/10.1007/978-3-540-45227-0_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40806-2
Online ISBN: 978-3-540-45227-0
eBook Packages: Springer Book Archive
