An Effective Content-Based Strategy Analysis for Large-Scale Deduplication Using a Multi-level Pattern-Matching Algorithm

  • Conference paper
Information and Communication Technology for Intelligent Systems

Part of the book series: Smart Innovation, Systems and Technologies ((SIST,volume 107))

Abstract

As the volume of digital data grows every day, managing storage devices to keep pace with it has become very difficult. Deduplication plays a crucial role in removing redundancy from large-scale cluster storage. Existing deduplication approaches based on overlapping algorithms are inefficient in many situations: they consume large amounts of memory and processing time. Real-time data is frequently incomplete, inconsistent, or lacking in certain behaviors or trends, and often contains significant errors. In the deduplication process, data pre-processing transforms raw data into a comprehensible format in which duplicate data is easy to analyze. For this reason, data deduplication clusters have been adopted in data storage systems for records and data backup. Most researchers in this field focus on data deduplication clusters as a way to reduce replicated data and so free server memory, and the pattern-matching deduplication clustering process is especially popular. This chapter discusses the overlapping algorithm and how the proposed multi-level pattern-matching algorithm (MLPMA) deduplicates large amounts of data with higher efficiency. The technique combines similarity with locality by applying a Bloom filter to the deduplication cluster, which exploits data redundancy to remove duplicates efficiently. As a result, the technique significantly improves both the data deduplication ratio and throughput. The evaluations show that the deduplication method has excellent performance.
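To make the Bloom filter idea in the abstract concrete, the following is a minimal sketch of fingerprint-based deduplication with a Bloom filter in front of an exact index. It is not the authors' MLPMA, which the abstract does not specify; the names `BloomFilter` and `deduplicate`, the filter size, the number of hash functions, the use of SHA-256 fingerprints, and the definition of the deduplication ratio as total bytes divided by stored bytes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter (hypothetical helper, sizes chosen arbitrarily)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely not seen"; True means "possibly seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def deduplicate(chunks):
    """Return (unique_chunks, dedup_ratio) for an iterable of byte chunks.

    The ratio is assumed to be total input bytes / bytes actually stored.
    """
    bloom = BloomFilter()
    store = {}  # fingerprint -> chunk; the exact index behind the filter
    total_bytes = 0
    for chunk in chunks:
        total_bytes += len(chunk)
        fingerprint = hashlib.sha256(chunk).digest()
        # Consult the exact index only when the filter reports a possible hit,
        # so chunks that are definitely new skip the more expensive lookup.
        if bloom.might_contain(fingerprint) and fingerprint in store:
            continue  # confirmed duplicate: do not store it again
        bloom.add(fingerprint)
        store[fingerprint] = chunk
    stored_bytes = sum(len(c) for c in store.values())
    ratio = total_bytes / stored_bytes if stored_bytes else 1.0
    return list(store.values()), ratio


if __name__ == "__main__":
    data = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
    unique, ratio = deduplicate(data)
    print(unique)          # [b'alpha', b'beta', b'gamma']
    print(f"{ratio:.2f}")  # 1.64 (23 input bytes / 14 stored bytes)
```

In this sketch the filter can report false positives but never false negatives, which is why a possible hit is still confirmed against the exact index before a chunk is discarded. In a real storage cluster the index would normally live on disk or on another node, and that is where skipping lookups for definitely-new chunks would be expected to pay off.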

Author information

Corresponding author

Correspondence to A. Sahaya Jenitha.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Sahaya Jenitha, A., Sinthu Janita Prakash, V. (2019). An Effective Content-Based Strategy Analysis for Large-Scale Deduplication Using a Multi-level Pattern-Matching Algorithm. In: Satapathy, S., Joshi, A. (eds) Information and Communication Technology for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 107. Springer, Singapore. https://doi.org/10.1007/978-981-13-1747-7_23
