Abstract
The exponential growth of digital data in cloud storage systems is a critical issue, because large amounts of duplicate data place an extra load on these systems. Deduplication is an efficient technique that has gained attention in large-scale storage systems: it eliminates redundant data, improves storage utilization and reduces storage cost. This paper presents a broad systematic literature review of existing data deduplication techniques for cloud data storage, along with the existing taxonomies of those techniques. Furthermore, the paper investigates deduplication techniques for text and multimedia data, together with their corresponding taxonomies, because these data types pose different challenges for duplicate data detection. This work is useful for identifying deduplication techniques for text, image and video data. It also discusses existing challenges and significant research directions in deduplication for future researchers, and the article concludes with a summary of suggestions for future enhancements in deduplication.













Acknowledgements
This research was supported by the Department of Science and Technology, Government of India, under the WOS (Women Scientists Scheme) sponsored research project entitled “Distributed Data Deduplication Technique for efficient Cloud Based Storage System”, File No: SR/WOS-A/ET-119/2016.
Cite this article
Kaur, R., Chana, I. & Bhattacharya, J. Data deduplication techniques for efficient cloud storage management: a systematic review. J Supercomput 74, 2035–2085 (2018). https://doi.org/10.1007/s11227-017-2210-8
DOI: https://doi.org/10.1007/s11227-017-2210-8