An Effective Content-Based Strategy Analysis for Large-Scale Deduplication Using a Multi-level Pattern-Matching Algorithm

Sahaya Jenitha, A.; Sinthu Janita Prakash, V.

doi:10.1007/978-981-13-1747-7_23

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 107)

Abstract

As data volumes grow daily, managing storage devices to keep pace with the explosive growth of digital data has become very difficult. Deduplication plays a crucial role in removing redundancy from large-scale cluster storage. Existing deduplication approaches based on overlapping algorithms are inefficient in many situations: they consume a large amount of memory and processing time. Real-time data is frequently incomplete, inconsistent, or missing certain behaviors or trends, and it often contains significant errors. In the deduplication process, data pre-processing transforms raw data into a comprehensible format that is easy to analyze for duplicates. Data deduplication clusters have therefore been adopted in data storage systems for records and data backup. Most researchers in this field focus on data deduplication clusters, which reduce replicated data in order to improve server memory usage. The pattern-matching deduplication clustering process is especially popular. This chapter discusses the overlapping algorithm and how the proposed multi-level pattern-matching algorithm (MLPMA) performs deduplication on large amounts of data with higher efficiency. The technique combines similarity with locality by applying a Bloom filter to the deduplication cluster for efficient removal of duplicate data, which helps exploit data redundancy. As a result, the technique significantly improves the data deduplication ratio and throughput. The evaluations indicate that the deduplication method achieves excellent performance.
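The abstract does not give the details of MLPMA, so the following is only a minimal sketch of the general idea it mentions: using a Bloom filter as a cheap membership test in front of an exact chunk index during deduplication. It is not the authors' algorithm. The names `BloomFilter`, `deduplicate`, and `dedup_ratio`, the filter size, the number of hash positions, and the assumption that the input is already split into chunks are all hypothetical choices made for illustration. The deduplication ratio is taken here as logical bytes divided by stored bytes, which is one common definition and an assumption about what the abstract means.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; bit positions are sliced from one SHA-256 digest."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # num_hashes must be at most 8, since a 32-byte digest
        # yields eight 4-byte words.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item).digest()
        for i in range(self.num_hashes):
            word = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield word % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def deduplicate(chunks):
    """Store each distinct chunk once; return (store, recipe).

    store maps fingerprint -> chunk bytes (the exact index).
    recipe lists fingerprints in input order, enough to rebuild the input.
    """
    bloom = BloomFilter()
    store = {}
    recipe = []
    for chunk in chunks:
        fp = hashlib.sha256(chunk).digest()
        # A Bloom miss proves the chunk is new, so the exact index is
        # consulted only on a Bloom hit (which may be a false positive).
        is_duplicate = bloom.might_contain(fp) and fp in store
        if not is_duplicate:
            bloom.add(fp)
            store[fp] = chunk
        recipe.append(fp)
    return store, recipe


def dedup_ratio(chunks, store):
    """Logical bytes divided by stored bytes (assumed definition)."""
    logical = sum(len(c) for c in chunks)
    physical = sum(len(c) for c in store.values())
    return logical / physical if physical else 1.0


if __name__ == "__main__":
    data = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
    store, recipe = deduplicate(data)
    print(len(store), "unique chunks, ratio", round(dedup_ratio(data, store), 3))
```

In this sketch the Bloom filter never causes data loss: a false positive only triggers an extra lookup in the exact index, while a miss lets new chunks skip that lookup entirely, which is where the memory and throughput benefit would come from in a real system with an on-disk index.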

Author information

Authors and Affiliations

Department of Computer Science, Cauvery College for Women, Tiruchirappalli, 620018, India
A. Sahaya Jenitha & V. Sinthu Janita Prakash

Corresponding author

Correspondence to A. Sahaya Jenitha.

Editor information

Editors and Affiliations

School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, India
Suresh Chandra Satapathy
Sabar Institute of Technology, Gujarat Technological University, Ahmedabad, Gujarat, India
Amit Joshi

About this paper

Cite this paper

Sahaya Jenitha, A., Sinthu Janita Prakash, V. (2019). An Effective Content-Based Strategy Analysis for Large-Scale Deduplication Using a Multi-level Pattern-Matching Algorithm. In: Satapathy, S., Joshi, A. (eds) Information and Communication Technology for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 107. Springer, Singapore. https://doi.org/10.1007/978-981-13-1747-7_23

DOI: https://doi.org/10.1007/978-981-13-1747-7_23
Published: 15 December 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1746-0
Online ISBN: 978-981-13-1747-7
eBook Packages: Intelligent Technologies and Robotics (R0)
