Improving Restore Performance of Deduplication Systems by Leveraging the Chunk Sequence in Backup Stream

Yang, Ru; Deng, Yuhui; Hu, Cheng; Si, Lei

doi:10.1007/978-3-030-05051-1_26

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11334)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Traditional deduplication-based backup systems normally employ containers to reduce chunk fragmentation and thereby improve restore performance. However, the number of shared chunks belonging to a single backup grows as the number of backups increases, and those shared chunks are normally distributed across multiple containers. This increases chunk fragmentation and significantly degrades restore performance. To improve restore performance, some schemes optimize the replacement strategy of the restore cache, such as those using LRU and OPT. However, LRU is inefficient and OPT incurs additional computational overhead. By analyzing the backup and restore processes, we observe that the sequence of chunks in the backup stream is consistent with that in the restore stream. Based on this observation, this paper proposes OFL, an off-line optimal replacement strategy for the restore cache. OFL records the chunk sequence of the backup process and then uses this sequence to calculate, in advance, exact information about the chunks required by the restore process. Accurate prefetching then leverages this information to reduce the impact of chunk fragmentation. Real data sets are employed to evaluate OFL. The experimental results demonstrate that OFL improves restore performance by more than 8% compared with the traditional LRU and OPT.
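The abstract does not spell out OFL's internals, so the following is only a minimal sketch of the general idea it describes: when the chunk sequence is recorded at backup time, the restore cache knows every future access and can choose evictions accordingly. The sketch assumes a container-granular cache and uses Belady's farthest-next-use rule as a stand-in for the off-line policy. The function names, the toy chunk names, and the chunk-to-container mapping are all hypothetical and are not taken from the paper.

```python
from collections import OrderedDict, defaultdict, deque


def container_reads_lru(chunk_seq, chunk_to_container, cache_size):
    """Count container reads when the restore cache evicts by LRU."""
    cache = OrderedDict()  # container id -> None, ordered by recency
    reads = 0
    for chunk in chunk_seq:
        cid = chunk_to_container[chunk]
        if cid in cache:
            cache.move_to_end(cid)  # mark as most recently used
            continue
        reads += 1  # cache miss: the container must be read from disk
        if len(cache) >= cache_size:
            cache.popitem(last=False)  # evict the least recently used
        cache[cid] = None
    return reads


def container_reads_offline(chunk_seq, chunk_to_container, cache_size):
    """Count container reads when eviction uses the recorded sequence.

    Assumption: the full chunk sequence is known before the restore
    starts, so the cache can evict the container whose next use lies
    farthest ahead, or one that is never needed again.
    """
    container_seq = [chunk_to_container[c] for c in chunk_seq]
    future = defaultdict(deque)  # container id -> upcoming positions
    for pos, cid in enumerate(container_seq):
        future[cid].append(pos)

    def next_use(cid):
        return future[cid][0] if future[cid] else float("inf")

    cache = set()
    reads = 0
    for cid in container_seq:
        future[cid].popleft()  # the current access is no longer "future"
        if cid in cache:
            continue
        reads += 1
        if len(cache) >= cache_size:
            cache.remove(max(cache, key=next_use))
        cache.add(cid)
    return reads


# Toy restore stream: six chunks spread over three containers.
chunk_to_container = {"a1": 0, "b1": 1, "c1": 2, "a2": 0, "b2": 1, "c2": 2}
chunk_seq = ["a1", "b1", "c1", "a2", "b2", "c2"]

print(container_reads_lru(chunk_seq, chunk_to_container, 2),
      container_reads_offline(chunk_seq, chunk_to_container, 2))  # prints "6 4"
```

On this cyclic toy stream with room for two containers, LRU evicts each container just before it is needed again, while the sequence-aware policy keeps the container that will be reused soonest. The same recorded sequence could also drive prefetching, which this sketch does not model.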

Acknowledgments

This work is supported by the NSFC (No. 61572232), in part by the Science and Technology Planning Project of Guangzhou (No. 201802010028 and No. 201802010060), in part by the Science and Technology Planning Project of Nansha (No. 2017CX006), and in part by the Open Research Fund of Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences under Grant CARCH201705.

Author information

Authors and Affiliations

Department of Computer Science, Jinan University, Guangzhou, 510632, People’s Republic of China
Ru Yang, Yuhui Deng & Lei Si
State Key Laboratory of Computer Architecture, Institute of Computing, Chinese Academy of Sciences, Beijing, 100190, China
Yuhui Deng
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China
Cheng Hu

Corresponding author

Correspondence to Yuhui Deng.

Editor information

Editors and Affiliations

Rutgers University, Newark, NJ, USA
Jaideep Vaidya
Guangzhou University, Guangzhou, China
Jin Li

About this paper

Cite this paper

Yang, R., Deng, Y., Hu, C., Si, L. (2018). Improving Restore Performance of Deduplication Systems by Leveraging the Chunk Sequence in Backup Stream. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol 11334. Springer, Cham. https://doi.org/10.1007/978-3-030-05051-1_26

DOI: https://doi.org/10.1007/978-3-030-05051-1_26
Published: 07 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05050-4
Online ISBN: 978-3-030-05051-1
eBook Packages: Computer Science; Computer Science (R0)
