Abstract
Apart from high space efficiency, enterprise de-duplication backup also demands high performance, high scalability, and availability in large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead that results from constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches rely mainly on duplicate locality to avoid the disk bottleneck, and therefore degrade under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability. Chunkfarm performs de-duplication backup using the hash join algorithm, which converts the notoriously small, random disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, thereby achieving high write throughput that is insensitive to workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm allows a cluster of servers to perform de-duplication backup in parallel; it is therefore well suited to distributed implementation and applicable to large-scale distributed storage systems.
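To make the batching idea concrete, below is a minimal Python sketch of a Grace-style hash join applied to fingerprint lookup. It is not the authors' implementation; the names NUM_PARTITIONS, batch_dedup, and load_index_partition are illustrative assumptions, and the on-disk index is abstracted as a callable. Fingerprints gathered during a backup run are first partitioned by a hash prefix, so each index partition can then be read with one large sequential I/O and probed through an in-memory hash table.

import hashlib
from collections import defaultdict

NUM_PARTITIONS = 64  # assumed fan-out, sized so one index partition fits in RAM

def partition_id(fingerprint: bytes) -> int:
    # Route a fingerprint to a partition by its leading byte.
    return fingerprint[0] % NUM_PARTITIONS

def batch_dedup(new_fingerprints, load_index_partition):
    """Classify a batch of chunk fingerprints as duplicate or unique.

    load_index_partition(pid) stands in for one large sequential read of an
    on-disk index partition, returned as an in-memory set of fingerprints.
    """
    # Build phase: bucket the incoming fingerprints by partition.
    buckets = defaultdict(list)
    for fp in new_fingerprints:
        buckets[partition_id(fp)].append(fp)

    duplicates, uniques = [], []
    # Probe phase: one sequential scan per touched partition, no random seeks.
    for pid, fps in buckets.items():
        known = load_index_partition(pid)
        for fp in fps:
            (duplicates if fp in known else uniques).append(fp)
    return duplicates, uniques

if __name__ == "__main__":
    fps = [hashlib.sha1(b"chunk-%d" % i).digest() for i in range(6)]
    index = set(fps[:3])  # pretend the first three chunks are already stored
    dups, news = batch_dedup(
        fps, lambda pid: {fp for fp in index if partition_id(fp) == pid})
    print(len(dups), "duplicate,", len(news), "unique")  # 3 duplicate, 3 unique

Because the incoming batch and the index are partitioned by the same function, a fingerprint can only match inside its own partition, so the entire batch is resolved in one sequential pass over the index instead of one random seek per chunk.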
Additional information
Project supported by the National Basic Research Program (973) of China (No. 2004CB318201), the National High-Tech Research and Development Program (863) of China (No. 2008AA01A402), and the National Natural Science Foundation of China (Nos. 60703046 and 60873028).
Cite this article
Yang, T.M., Feng, D., Niu, Z.Y., et al. Scalable high performance de-duplication backup via hash join. J. Zhejiang Univ. - Sci. C 11, 315–327 (2010). https://doi.org/10.1631/jzus.C0910445