Noisy Data Set Identification

  • Conference paper
Hybrid Artificial Intelligent Systems (HAIS 2013)

Abstract

Real data are often corrupted by noise, which can arise from errors in data collection, storage and processing. The presence of noise hampers the induction of Machine Learning models from data: their predictive or descriptive performance can be impaired, and training time can increase. Moreover, these models can become overly complex in order to accommodate such errors. Thus, the identification and reduction of noise in a data set may benefit the learning process. In this paper, we investigate the use of data complexity measures to identify the presence of noise in a data set. This identification can support the decision on whether noise reduction techniques need to be applied.
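As a rough illustration of the idea in the abstract, the sketch below computes one simple complexity-style score, the leave-one-out error rate of a 1-nearest-neighbour classifier, on a synthetic data set before and after label noise is injected. Everything here is an assumption made for illustration: the synthetic data, the 20% noise rate, the choice of score, and the helper names `loo_1nn_error` and `flip_labels`. It is not the set of measures or the experimental protocol used in the paper.

```python
import random

def loo_1nn_error(X, y):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier."""
    errors = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue
            # Squared Euclidean distance is enough to rank neighbours.
            d = sum((a - b) ** 2 for a, b in zip(xi, xj))
            if d < best_d:
                best_j, best_d = j, d
        if y[best_j] != y[i]:
            errors += 1
    return errors / len(X)

def flip_labels(y, rate, rng):
    """Return a copy of binary labels y with a fraction `rate` flipped."""
    noisy = list(y)
    n_flip = int(round(rate * len(y)))
    for i in rng.sample(range(len(y)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy

rng = random.Random(0)
# Two well-separated Gaussian blobs (synthetic, for illustration only).
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)] + \
    [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(100)]
y = [0] * 100 + [1] * 100

clean_score = loo_1nn_error(X, y)
noisy_score = loo_1nn_error(X, flip_labels(y, 0.2, rng))
print("clean:", clean_score, "noisy:", noisy_score)
```

On data like this the score is expected to rise once labels are flipped, which is the kind of signal a noise-identification procedure could use when deciding whether a noise reduction step is worth applying.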

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

García, L.P.F., de Carvalho, A.C.P.L.F., Lorena, A.C. (2013). Noisy Data Set Identification. In: Pan, JS., Polycarpou, M.M., Woźniak, M., de Carvalho, A.C.P.L.F., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2013. Lecture Notes in Computer Science, vol 8073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40846-5_63

  • DOI: https://doi.org/10.1007/978-3-642-40846-5_63

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40845-8

  • Online ISBN: 978-3-642-40846-5

  • eBook Packages: Computer Science, Computer Science (R0)