Abstract
Apart from high space efficiency, enterprise de-duplication backup also demands high performance, high scalability, and availability in large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead that results from constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches rely mainly on duplicate locality to avoid the disk bottleneck, and therefore degrade under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability. Chunkfarm performs de-duplication backup using the hash join algorithm, which converts the notoriously small, random disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, thereby achieving high write throughput that is insensitive to workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm allows a cluster of servers to perform de-duplication backup in parallel; it is therefore well suited to distributed implementation and applicable to large-scale distributed storage systems.
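To make the batching idea concrete, below is a minimal Python sketch of a Grace-style hash join applied to fingerprint lookup. It is not the authors' implementation; the names NUM_PARTITIONS, batch_dedup, and load_index_partition are illustrative assumptions, and the on-disk index is abstracted as a callable. Fingerprints gathered during a backup run are first partitioned by a hash prefix, so each index partition can then be read with one large sequential I/O and probed through an in-memory hash table.

import hashlib
from collections import defaultdict

NUM_PARTITIONS = 64  # assumed fan-out, sized so one index partition fits in RAM

def partition_id(fingerprint: bytes) -> int:
    # Route a fingerprint to a partition by its leading byte.
    return fingerprint[0] % NUM_PARTITIONS

def batch_dedup(new_fingerprints, load_index_partition):
    """Classify a batch of chunk fingerprints as duplicate or unique.

    load_index_partition(pid) stands in for one large sequential read of an
    on-disk index partition, returned as an in-memory set of fingerprints.
    """
    # Build phase: bucket the incoming fingerprints by partition.
    buckets = defaultdict(list)
    for fp in new_fingerprints:
        buckets[partition_id(fp)].append(fp)

    duplicates, uniques = [], []
    # Probe phase: one sequential scan per touched partition, no random seeks.
    for pid, fps in buckets.items():
        known = load_index_partition(pid)
        for fp in fps:
            (duplicates if fp in known else uniques).append(fp)
    return duplicates, uniques

if __name__ == "__main__":
    fps = [hashlib.sha1(b"chunk-%d" % i).digest() for i in range(6)]
    index = set(fps[:3])  # pretend the first three chunks are already stored
    dups, news = batch_dedup(
        fps, lambda pid: {fp for fp in index if partition_id(fp) == pid})
    print(len(dups), "duplicate,", len(news), "unique")  # 3 duplicate, 3 unique

Because the incoming batch and the index are partitioned by the same function, a fingerprint can only match inside its own partition, so the entire batch is resolved in one sequential pass over the index instead of one random seek per chunk.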
Additional information
Project supported by the National Basic Research Program (973) of China (No. 2004CB318201), the National High-Tech Research and Development Program (863) of China (No. 2008AA01A402), and the National Natural Science Foundation of China (Nos. 60703046 and 60873028).
Cite this article
Yang, T.M., Feng, D., Niu, Z.Y., et al. Scalable high performance de-duplication backup via hash join. J. Zhejiang Univ. - Sci. C 11, 315–327 (2010). https://doi.org/10.1631/jzus.C0910445