
Improving Classification by Removing or Relabeling Mislabeled Instances

  • Conference paper
In: Foundations of Intelligent Systems (ISMIS 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2366)


Abstract

It is common for a database to contain noisy data, and an important source of noise is mislabeled training instances. We present a new approach that improves classification accuracy in this situation by means of a preliminary filtering procedure. An example is considered suspect when, in its neighborhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the whole database. Such suspect examples in the training data can be removed or relabeled, and the filtered training set is then given as input to the learning algorithm. Our experiments on ten benchmarks from the UCI Machine Learning Repository, with 1-NN as the final algorithm, show that removal gives better results than relabeling. Removal maintains the generalization error rate when from 0 to 20% class noise is introduced, especially when the classes are well separable.
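To make the filtering criterion concrete, the sketch below illustrates it in Python under stated assumptions: a k-nearest-neighbor graph stands in for the geometrical graph used in the paper (such as a relative neighborhood graph), and a one-sided binomial test stands in for the paper's significance test. The function name filter_suspects and the parameters k and alpha are illustrative choices, not the authors' exact procedure.

    import numpy as np
    from scipy.stats import binomtest
    from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier

    def filter_suspects(X, y, k=5, alpha=0.05):
        """Return a boolean mask of training examples to keep (True = not suspect)."""
        n = len(y)
        # Assumption: k-NN graph as a stand-in for the paper's geometrical graph.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)          # column 0 is the point itself
        keep = np.ones(n, dtype=bool)
        for i in range(n):
            neighbors = idx[i, 1:]         # drop the point itself
            same = int(np.sum(y[neighbors] == y[i]))
            p_global = float(np.mean(y == y[i]))  # class proportion in the whole data
            # Suspect when the local same-class proportion is NOT significantly
            # greater than the global proportion (one-sided binomial test).
            if binomtest(same, k, p_global, alternative="greater").pvalue > alpha:
                keep[i] = False
        return keep

    # Usage: filter the training set, then fit the final 1-NN classifier.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] > 0).astype(int)
    flip = rng.choice(200, size=20, replace=False)
    y[flip] ^= 1                           # inject 10% class noise
    keep = filter_suspects(X, y)
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])

Relabeling would instead assign a suspect example the majority class of its neighborhood rather than discarding it; the experiments reported in the paper favor removal.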






Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lallich, S., Muhlenbach, F., Zighed, D.A. (2002). Improving Classification by Removing or Relabeling Mislabeled Instances. In: Hacid, MS., Raś, Z.W., Zighed, D.A., Kodratoff, Y. (eds) Foundations of Intelligent Systems. ISMIS 2002. Lecture Notes in Computer Science (LNAI), vol 2366. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48050-1_3


  • DOI: https://doi.org/10.1007/3-540-48050-1_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43785-7

  • Online ISBN: 978-3-540-48050-1

  • eBook Packages: Springer Book Archive
