A Study on Secure Data Deduplication

Ritwik Jadhav, Rohan Gandhi, Dhanraj Gade, Chaitanya Atre
Department of Information Technology, MITCOE, Pune, Maharashtra, India.

Abstract— Nowadays, there is a drastic increase in the demand for data storage, and with it the concept of cloud computing is on the rise. This enormous amount of data can be backed up on cloud storage, but doing so significantly increases storage cost and degrades performance. Traditional storage of data introduces redundancies, and the concept of data deduplication was developed to address this. Data deduplication is an effective solution for eliminating these redundancies: it uses hash values and index tables to detect and remove duplicate data.
To prevent unauthorized access to the data, encryption is used. By combining the data deduplication process with the elimination of the possible security issues, an effective performance increase and a reduction in storage cost can be achieved. In this paper, the concept of data deduplication is introduced, its classification into various subtypes is presented, and its algorithms are discussed.

Keywords— Data Deduplication, Storage, Hashing, Chunking, Redundant data.

I. Introduction

Nowadays, due to advancements in technology, the amount of digital data generated by applications is increasing at a fast rate.
As storage systems have limited capacity, storing such a huge amount of data has posed a challenge. Recent International Data Corporation (IDC) studies indicate that in the past five years the volume of data has increased by almost nine times, to 7 ZB per year, and even more explosive growth is expected in the next ten years [2]. Such massive growth in storage is controlled by the technique of data deduplication [1].
This process identifies duplicate contents at the chunk level using hash values and deduplicates them [1]. Data deduplication can be implemented at the file level or at the chunk level. Chunk-level deduplication is generally preferred over file-level deduplication: in chunk-level deduplication, the fingerprints of individual chunks are compared and the redundant ones are deduplicated, whereas in file-level deduplication the whole file is compared with other files by checking its metadata.
Chunking plays a significant role in determining the efficiency of a data deduplication algorithm, whose performance can be assessed by analyzing the size and number of chunks [1]. Data deduplication optimizes storage space and bandwidth, saving considerable storage capacity and the money spent on storing the data [1].
Hence, a greener environment can be obtained, as less space is required to house the data in primary and remote storage. As less storage is maintained, a faster return on investment can be obtained. This process also helps in conserving network bandwidth and improving network efficiency [1].
II. Different Approaches to Data Deduplication

Data deduplication – often called intelligent compression or single-instance storage – is a process that eliminates redundant copies of data and reduces storage overhead. Data deduplication techniques ensure that only one unique instance of the data is retained on the storage media. In the process of data deduplication, a file or any block of data is divided into multiple chunks, and a hash value of each chunk is calculated using hash techniques such as SHA-1 or MD5 [3].
Using these hash values, a stored chunk of data can be compared with an incoming data chunk; if a match is found, we can conclude that an identical data chunk already exists in the storage system, and the duplicate chunk is replaced with a reference to the existing one.

Fig 1. Diagram Illustrating the Data Deduplication Process

A. Steps of Data Deduplication

1) Creation of Chunks: The file is divided into chunks using one of the chunking methods – fixed-length chunking or variable-length chunking.
2) Hash Value Computation: The hash values of the chunks are computed using one of the various available hashing algorithms – SHA-1, MD5, SHA-2.
3) Deduplication Process: The hash values of the data chunks, called fingerprints, are stored in an index table. Duplicate data can be detected by consulting the index table: if the hash values match, the duplicate data is replaced with a reference to the original data chunk.

B. Hash Algorithms

Hash algorithms are central to the process of data deduplication. The hash values of data chunks, generally called "fingerprints", are used for eliminating the redundancies in the data. The most commonly used hash algorithm for the deduplication process is SHA-1. A minimal sketch of the three steps above follows.
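For illustration only, here is a minimal Python sketch of the chunk–fingerprint–index pipeline, assuming fixed 4 KB chunks for simplicity; `chunk_store` and `index_table` are hypothetical in-memory stand-ins for the storage system and its index.

```python
import hashlib

CHUNK_SIZE = 4 * 1024  # fixed 4 KB chunks, for simplicity

chunk_store = {}     # fingerprint -> chunk bytes (stand-in for the storage)
index_table = set()  # fingerprints of already-stored chunks

def deduplicate(data: bytes) -> list:
    """Return the file as a list of fingerprints (references into chunk_store)."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]        # step 1: create a chunk
        fp = hashlib.sha1(chunk).hexdigest()  # step 2: SHA-1 fingerprint
        if fp not in index_table:             # step 3: index-table lookup
            index_table.add(fp)
            chunk_store[fp] = chunk           # store only unique chunks
        refs.append(fp)                       # duplicates become references
    return refs

def reassemble(refs: list) -> bytes:
    """Rebuild the original file from its chunk references."""
    return b"".join(chunk_store[fp] for fp in refs)
```

Storing a file twice through `deduplicate` adds no new chunks the second time; only the reference list grows.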
C. SHA-1

SHA-1 is a cryptographic hash algorithm used in the deduplication process for computing the fingerprints of data chunks. SHA-1 is closely modelled after MD5. It produces a message digest of 160 bits by processing the input data in blocks of 512 bits each [4]. This 160-bit value computed for each data chunk is, in practice, unique, and is used for eliminating redundancies in the data.

D. Source-Based Deduplication

Source deduplication is the removal of redundancies from data before transmission to the target server. It offers a number of benefits, including reduced bandwidth and storage usage, although it can be slower than target-based deduplication for large amounts of data. Source deduplication works through client software that communicates with the server to compare new data chunks with previously stored ones. If the server has previously stored a data chunk, the software does not send that chunk and instead notes that a copy of that data chunk exists at the client.

E. Target-Based Deduplication

Target deduplication is the removal of redundancies from a backup transmission as it passes through an appliance sitting between the source and the target server. It reduces the amount of storage required but, unlike source deduplication, it does not reduce the amount of data that must be sent across a LAN or WAN during the backup. A sketch of the source-side exchange described above follows.
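To make the client–server exchange concrete, here is a hedged Python sketch of the source side; `server_has`, `send_chunk`, and `send_reference` are hypothetical callbacks standing in for the actual client–server protocol.

```python
import hashlib

def upload_with_source_dedup(chunks, server_has, send_chunk, send_reference):
    """Client-side source deduplication: ask the server about each
    fingerprint before sending, and transmit only the chunks it lacks."""
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if server_has(fp):          # one small round trip instead of the chunk
            send_reference(fp)      # tell the server to reuse its stored copy
        else:
            send_chunk(fp, chunk)   # only new data crosses the network
```

In target-based deduplication the same fingerprint comparison happens on the appliance, so every chunk still crosses the network first.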
F. Inline Data Deduplication

Inline deduplication refers to eliminating data redundancies as the data enters the system. This process cuts down the bulk of the data and makes the system efficient. The benefit of inline deduplication is that the calculation of hash values and the search for redundant data are done before the data is actually stored in the system.

G. Post-Process Deduplication

Post-process deduplication, also known as asynchronous deduplication, is the analysis and removal of redundant data after the data has been written to the storage system. Post-process deduplication has a number of benefits, including an efficient lookup process, as the calculation of hash values and the search for redundant data are done after the data files are stored on the storage system. The sketch below contrasts the two approaches.
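As a hedged illustration (not from the paper), the difference is where the fingerprint lookup sits relative to the write path; `index` and `store` are the same hypothetical in-memory stand-ins used earlier.

```python
import hashlib

index, store = set(), {}

def inline_write(chunk: bytes) -> str:
    """Inline: fingerprint and deduplicate on the write path,
    before anything reaches the storage system."""
    fp = hashlib.sha1(chunk).hexdigest()
    if fp not in index:
        index.add(fp)
        store[fp] = chunk
    return fp  # the caller keeps only the reference

def post_process(raw_chunks) -> dict:
    """Post-process: the data was written verbatim; a background
    pass later scans it and collapses the duplicates."""
    deduped = {}
    for chunk in raw_chunks:
        deduped.setdefault(hashlib.sha1(chunk).hexdigest(), chunk)
    return deduped
```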
H. File-Based Deduplication

The duplicate-removal algorithm can be applied to full files. Full-file duplicates can easily be eliminated by calculating a single checksum of the complete file data and comparing it against the existing checksums of already-stored files. This is simple and fast, but the extent of deduplication is limited, as the process does not address duplicate content found inside different files. A small sketch of this whole-file approach follows.
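For illustration, a minimal Python sketch of whole-file deduplication; `known` is a hypothetical checksum index.

```python
import hashlib

def file_checksum(path: str) -> str:
    """Single SHA-1 checksum over the complete file data."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

known = {}  # checksum -> path of the already-stored copy

def store_file(path: str) -> str:
    cs = file_checksum(path)
    if cs in known:
        return known[cs]   # the whole file is a duplicate: keep a reference
    known[cs] = path       # first copy: store it
    return path
```

Note that two files differing in a single byte get different checksums, so nothing inside them is shared; that is the limitation sub-file deduplication addresses.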
I. Sub-File-Based Deduplication

The sub-file-level deduplication technique breaks the file into smaller fixed-size or variable-size blocks, and then uses a standard hash-based algorithm to find identical blocks.

J. Fixed-Length and Variable-Length Deduplication

The fixed-length chunking method splits files into equally sized chunks, with chunk boundaries placed at fixed offsets such as 4, 8, or 16 kB. It uses a simple checksum-based approach to find duplicates. The process is highly constrained and offers limited advantages.

Fig 3. Fixed vs. Variable Length Chunks

In variable-length chunking, files are broken into multiple chunks of variable sizes, with boundaries based on the content rather than on fixed offsets [5]. This method is used as an alternative to fixed-length chunking. Because the chunk boundaries are based on the content present in the file, whenever any data is updated it is not necessary to re-chunk the entire file. With the data broken down based on content, the efficiency of the deduplication process increases, as it is easy to recognize and eliminate redundant data chunks [5]. The sketch below demonstrates why fixed-length chunking handles updates poorly.
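As a hedged toy example of the boundary-shift problem, the Python snippet below uses 8-byte fixed chunks; inserting a single byte at the front shifts every boundary, so none of the original fingerprints survive.

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    """Split at fixed offsets and fingerprint each chunk."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

original = b"aaaaaaaabbbbbbbbcccccccc"
shifted = b"X" + original          # a single byte inserted at the front

# Every boundary after the insertion point moves, so none of the new
# fingerprints match the old ones: fixed-length dedup finds nothing to reuse.
common = set(fixed_chunks(original)) & set(fixed_chunks(shifted))
print(len(common))  # 0
```

A content-defined chunker would cut at the same content patterns in both versions, so only the first chunk changes.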
III. Variable-Length Chunking Mechanisms

A. Rabin–Karp Fingerprint Algorithm

Rabin–Karp is a variable-length chunking algorithm that uses the concepts of hash functions and the rolling-hash technique. A hash function is essentially a function that maps one thing to a value; in particular, hashing can map data of arbitrary size to a value of fixed size. A rolling hash (also known as recursive hashing or a rolling checksum) is a hash function where the input is hashed in a window that moves through the input. A rolling hash allows an algorithm to calculate a hash value without having to rehash the entire string. For example, when searching for a word in a text, as the algorithm shifts one letter to the right, instead of recomputing the hash of text[1:3] from scratch as the window shifts from text[0:2], the algorithm can derive the new hash from the old hash with a constant-time update. The sketch below shows this search.
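As a hedged, self-contained illustration of the rolling update (not taken from the paper), here is a classic Rabin–Karp substring search in Python; the base `B` and modulus `M` are arbitrary illustrative choices.

```python
B, M = 256, 1_000_000_007  # base and modulus for the polynomial hash

def rabin_karp_search(text: str, pattern: str) -> int:
    """Return the index of `pattern` in `text`, or -1. The hash of each
    window is derived from the previous one in O(1) instead of rehashing."""
    n, m = len(text), len(pattern)
    if m > n:
        return -1
    high = pow(B, m - 1, M)               # weight of the outgoing character
    ph = wh = 0
    for i in range(m):                    # initial hashes of pattern and window
        ph = (ph * B + ord(pattern[i])) % M
        wh = (wh * B + ord(text[i])) % M
    for i in range(n - m + 1):
        if wh == ph and text[i:i + m] == pattern:  # verify on a hash match
            return i
        if i < n - m:                     # roll: drop text[i], add text[i+m]
            wh = ((wh - ord(text[i]) * high) * B + ord(text[i + m])) % M
    return -1

print(rabin_karp_search("data deduplication", "dedup"))  # 5
```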
To find the chunking pattern, a rolling hash is used. The idea is to define a data window on the data stream with a finite, predefined size, say N. Then:

1) Using a hashing technique, calculate a hash over the window.
2) Check whether the hash matches a predefined "fingerprint" pattern.
3) If the fingerprint matches the calculated rolling hash of the window, mark the Nth element as the chunk boundary.
4) If the fingerprint does not match, slide the window by one element, recalculate the rolling hash, and repeat until a match is found.

A sketch of this content-defined chunking procedure follows.
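Below is a minimal content-defined chunker in Python, assuming a polynomial rolling hash and a "low bits of the hash are zero" fingerprint test; the window size, base, modulus, and mask are illustrative choices, not values from the paper.

```python
B, MOD = 263, 1_000_003      # polynomial base and modulus (illustrative)
WINDOW = 16                  # window size "N"
MASK = 0x3F                  # boundary where the low 6 bits of the hash are 0,
                             # giving an expected chunk size around 64 bytes

def cdc_chunks(data: bytes):
    """Content-defined chunking: slide a window over the data, keep a
    rolling hash of it, and cut wherever the hash matches the fingerprint."""
    if len(data) < WINDOW:
        return [data] if data else []
    high = pow(B, WINDOW - 1, MOD)   # weight of the byte leaving the window
    h = 0
    for byte in data[:WINDOW]:       # hash of the first window
        h = (h * B + byte) % MOD
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if (h & MASK) == 0 and i - start >= WINDOW:
            chunks.append(data[start:i])   # window end marks the boundary
            start = i
        if i < len(data):            # roll: drop data[i-WINDOW], add data[i]
            h = ((h - data[i - WINDOW] * high) * B + data[i]) % MOD
    if start < len(data):
        chunks.append(data[start:])  # tail chunk
    return chunks
```

Because the boundaries depend only on the bytes inside the window, an insertion perturbs only the chunks near the edit, which is exactly the property exploited in Section V.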
B. Two Thresholds, Two Divisors (TTTD) Algorithm [6]

The TTTD algorithm was proposed by HP Laboratories at Palo Alto, California. It uses the same basic idea as the BSW algorithm. The TTTD algorithm uses four parameters: the maximum threshold, the minimum threshold, the main divisor, and the second divisor. The maximum threshold is used for eliminating very large chunks, while the minimum threshold is used for eliminating very small chunks; these two parameters control the variation in chunk sizes. The main divisor makes the chunk size close to the expected chunk size. The second divisor finds a breakpoint when the main divisor cannot find one; the breakpoints found by the second divisor are large and close to the maximum threshold [7]. A hedged sketch of the algorithm follows.
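The Python sketch below is an approximation under stated assumptions: the threshold and divisor defaults (460 / 2800 / 540 / 270) follow values commonly cited for TTTD and should be treated as illustrative, and `weak_hash` is a deliberately simple stand-in for a real rolling hash.

```python
def tttd_chunks(data: bytes, t_min=460, t_max=2800, d=540, d2=270):
    """Two Thresholds, Two Divisors: the main divisor D sets the expected
    chunk size, the second divisor D' (easier to satisfy) remembers a
    backup breakpoint, and T_min / T_max bound the chunk size."""
    def weak_hash(window: bytes) -> int:   # stand-in for a rolling hash
        return sum(window) * 31 % 1_000_003

    chunks, start, backup = [], 0, -1
    for i in range(len(data)):
        size = i - start + 1
        if size < t_min:                   # too small: no breakpoint yet
            continue
        h = weak_hash(data[max(start, i - 47):i + 1])
        if h % d2 == d2 - 1:
            backup = i                     # candidate from the second divisor
        if h % d == d - 1:
            chunks.append(data[start:i + 1])  # main divisor found a breakpoint
            start, backup = i + 1, -1
        elif size >= t_max:                # forced cut: prefer the backup point
            cut = backup if backup != -1 else i
            chunks.append(data[start:cut + 1])
            start, backup = cut + 1, -1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Any position satisfying the main divisor also satisfies the second one when D is a multiple of D', which is why the backup breakpoints are the ones left over near the maximum threshold.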
IV. Enhanced Method

As explained earlier, data deduplication is a technique which stores only the unique data chunks and removes the redundant ones. Due to this efficient storage process, data deduplication is preferred over traditional storage systems. A number of techniques are available for data deduplication, and this paper focuses on presenting the studied deduplication types. We have enhanced the existing method for data deduplication, which works by eliminating redundant data chunks as they are stored in the storage or backup. The figure below shows the flow diagram of the method.

Figure 4. Flowchart of the Deduplication Method

In this method, the data (a text file) which the client intends to upload to the storage is divided into chunks using a variable-length chunking algorithm, preferably Rabin–Karp fingerprinting.
A hash value of each created chunk is then computed using the SHA-1 algorithm. The reason for selecting SHA-1 over a hashing algorithm such as MD5 is that SHA-1 is efficient and more collision-resistant, as its message digest is 160 bits rather than 128. The unique chunks whose hash values have been computed are stored in the storage, and the hash values are stored in the index tables. When any new file is ready to upload, the hash values of its data chunks are compared with the values stored in the index tables. If a match is found, a reference to the original data chunk is stored and the value in the index table is updated; if no match is found, the data chunk is stored directly.

To enhance security, each data chunk is encrypted before it is stored in the backup. The need for security arises chiefly when the deduplication system is implemented in the cloud, where security issues can be observed as information is processed on the cloud platform. The user who uploads the file has no control over where the file is stored, so there is a possibility that the cloud service provider or a third-party application can handle and access the data. Hence the need for encryption: before storing the data, the text is encrypted and then stored in the backup or target storage. The algorithm used for encryption is the Advanced Encryption Standard (AES). Before downloading the file, all the data chunks are decrypted with the key provided, merged into a single file, and then downloaded. A sketch of this encrypt-before-store step follows.
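As a hedged sketch of the encrypt-before-store step: the paper specifies AES but not a mode, so AES-GCM via the third-party `cryptography` package is an assumption here. Deduplication is still performed on plaintext fingerprints, and only the unique chunks are encrypted and stored.

```python
# pip install cryptography
import os
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)   # held by the client, not the cloud
aesgcm = AESGCM(key)

index, store = set(), {}                    # fingerprint -> (nonce, ciphertext)

def store_chunk(chunk: bytes) -> str:
    """Fingerprint the plaintext chunk, then encrypt before it reaches
    the backup, so the provider never sees the data itself."""
    fp = hashlib.sha1(chunk).hexdigest()    # dedup works on plaintext hashes
    if fp not in index:
        nonce = os.urandom(12)              # fresh nonce per stored chunk
        store[fp] = (nonce, aesgcm.encrypt(nonce, chunk, None))
        index.add(fp)
    return fp

def fetch_chunk(fp: str) -> bytes:
    """Decrypt a stored chunk with the provided key, for download/merge."""
    nonce, ct = store[fp]
    return aesgcm.decrypt(nonce, ct, None)
```

Fingerprinting before encryption is what keeps deduplication possible here, since encrypting identical chunks with fresh nonces yields different ciphertexts.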
V. Performance Analysis

When comparing file-level chunking, fixed-length chunking, and variable-length chunking, it is observed that variable-length chunking provides good results. In variable-length chunking, files are broken down into chunks of variable sizes based on the content. The table below shows the deduplicated/undeduplicated size (in %) at different chunk sizes (8, 16, 32, 64 kB) for the different chunking algorithms, namely file-level, fixed-length, and variable-length (Rabin–Karp) chunking [5].

                8 kB   16 kB   32 kB   64 kB
File Level       46     47      49      50
Fixed Length     40     42      43      45
Rabin Karp       31     32      33      34

In fixed-length chunking, each chunk is of a particular fixed length. However, a problem arises when the file needs to be updated: inserting a line of text into the file shifts all the subsequent chunks, so no chunk following the insertion point is preserved. Hence, simple small updates radically change the chunks and deduplication suffers. In variable-length chunking, by contrast, chunks are content-based and are not of uniform length, so when an update takes place, only the chunk where a new line of text is inserted or deleted changes. The remaining chunks are unaltered, and this is the biggest benefit of variable-length chunking. Hence, the proposed method uses a variable-length chunking algorithm.
Metrics               File Level   Fixed Size   Variable Size
Deduplication Ratio   Low          Better       Good
Processing Time       Medium       Less         High

The table above compares the different deduplication algorithms on different performance metrics. The deduplication ratio indicates how many redundancies are removed, and variable-size deduplication is much better than the others on this metric. In terms of processing time, variable-size deduplication is the worst, owing to the expensive variable-length chunking [9]: as the chunks are of variable sizes, the time taken to process them is high. The short harness below shows how both metrics can be measured.
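As a hedged sketch (not from the paper), both metrics can be measured for any chunker, for example the `fixed_chunks` and `cdc_chunks` functions sketched earlier.

```python
import time

def measure(chunker, data: bytes):
    """Report the two metrics from the table: deduplication ratio
    (logical bytes / unique bytes) and processing time in seconds."""
    t0 = time.perf_counter()
    chunks = chunker(data)
    elapsed = time.perf_counter() - t0
    unique = {bytes(c) for c in chunks}            # distinct chunk contents
    ratio = len(data) / max(1, sum(len(c) for c in unique))
    return ratio, elapsed
```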
VI. Conclusion

This paper focuses on the study of the data deduplication process for storage systems. Data deduplication methods are used to achieve cost-effective storage and efficient use of network bandwidth. The central idea is to remove the redundancies present in the data items. It is one of the emerging concepts currently being implemented by cloud providers. The future scope of this work is to improve performance by increasing the speed of storing data and providing higher security.
References

[1] Subhanshi Singhal and Naresh Kumar, "A Survey on Data Deduplication," International Journal on Recent and Innovation Trends in Computing and Communication, May 2017.
[2] Bo Mao, Hong Jiang, Suzhen Wu, Lei Tian, "Leveraging Data Deduplication to Improve the Performance of Primary Storage Systems in the Cloud," IEEE Transactions on Computers, Vol. 65, No. 6, June 2016.
[3] Golthi Tharunn, Gowtham Kommineni, Sarpella Sasank Varma, Akash Singh Verma, "Data Deduplication in Cloud Storage," International Journal of Advanced Engineering and Global Technology, Vol. 03, Issue 08, August 2015.
[4] Chaitya B. Shah, Drashti R. Panchal, "Secured Hash Algorithm-1: Review Paper," International Journal for Advanced Research in Engineering and Technology, Vol. 02, Oct. 2014.
[5] A. Venish and K. Shiva Shankar, "Study of Chunking Algorithm in Data Deduplication."
[6] BingChun Chang, "A Running Time Improvement for the Two Thresholds Two Divisors Algorithm," MS thesis, SJSU ScholarWorks, 2009.
[7] J. Malhotra and J. Bakal, "A Survey and Comparative Study of Data Deduplication Techniques," International Conference on Pervasive Computing, Pune, 2015.
[8] Zuhair S. Al-Sagar, Mohammed S. Saleh, Aws Zuhair Sameen, "Optimizing Cloud Storage by Data Deduplication: A Study," International Research Journal of Engineering and Technology, Vol. 02, Issue 09, Dec. 2015.
[9] Daehee Kim, Sejun Song, Baek-Young Choi, "Data Deduplication for Data Optimization for Storage and Network Systems."