An Improved Data Cleaning Algorithm Based on SNM

  • Conference paper
Cloud Computing and Security (ICCCS 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9483)

Abstract

The basic sorted-neighborhood method (SNM) is a classic algorithm for detecting approximately duplicate records in data cleaning. Its drawbacks are that the size of the sliding window is hard to select and that attribute matching is performed too frequently, so detection efficiency is poor. This paper proposes an optimized algorithm based on SNM. It makes the size and speed of the sliding window variable, which avoids missing record comparisons and reduces unnecessary ones. It uses a cosine similarity algorithm in attribute matching to improve detection precision, and it introduces a Top-k effective weight filtering algorithm to reduce the number of attribute matches and improve detection efficiency. Experimental results show that the improved algorithm outperforms SNM in recall, precision, and execution time.
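
For orientation, the baseline that the abstract starts from can be sketched in a few lines of Python. This is a minimal illustration of basic SNM with a fixed window and word-level cosine similarity, not the authors' improved algorithm: the variable window and the Top-k weight filtering are not shown, because the abstract does not specify them. The names `cosine_similarity` and `snm_duplicates`, the whitespace tokenization, the window size of 3, the threshold of 0.8, and the sample records are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two strings using word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def snm_duplicates(records, key, window=3, threshold=0.8):
    """Basic SNM: sort records by a key, then compare each record only
    with the (window - 1) records that precede it in sorted order."""
    ordered = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(ordered):
        # Slide a fixed-size window over the sorted order.
        for j in ordered[max(0, pos - window + 1):pos]:
            if cosine_similarity(records[i], records[j]) >= threshold:
                pairs.append(tuple(sorted((i, j))))
    return pairs


# Hypothetical sample data; indices refer to positions in this list.
records = [
    "john smith 12 main street boston",
    "maria garcia 9 oak avenue denver",
    "john smith 12 main st boston",
    "wei chen 40 pine road seattle",
]
print(snm_duplicates(records, key=lambda r: r))
```

With a fixed window, a duplicate that sorts more than `window - 1` positions away from its match is never compared, while a large window forces many needless comparisons. That trade-off is the window-size problem the abstract describes.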

Author information

Corresponding author

Correspondence to Qiang Xie.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, M., Xie, Q., Ding, Q. (2015). An Improved Data Cleaning Algorithm Based on SNM. In: Huang, Z., Sun, X., Luo, J., Wang, J. (eds) Cloud Computing and Security. ICCCS 2015. Lecture Notes in Computer Science, vol 9483. Springer, Cham. https://doi.org/10.1007/978-3-319-27051-7_22

  • DOI: https://doi.org/10.1007/978-3-319-27051-7_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27050-0

  • Online ISBN: 978-3-319-27051-7

  • eBook Packages: Computer Science, Computer Science (R0)
