Study on Secure Data Deduplication
Ritwik Jadhav, Rohan Gandhi, Dhanraj Gade, Chaitanya Atre
Department of Information Technology, MITCOE
Pune, Maharashtra, India.
Abstract— Nowadays, there is a drastic increase in demand
for data storage. Due to increasing demand for data storage, the concept of
cloud computing is on the rise. This enormous amount of data can be backed-up
on the cloud storage but, the problem is that it significantly increases the
cost of the storage and its performance. Traditional storage process of data
introduces redundancies and hence concept of Data deduplication is developed.
Data deduplication is an effective solution for eliminating the redundancies.
It uses the concepts of hash values and index tables which help in removal of
duplicate data. To prevent unauthorized access of data, encryption is used.
With the use of Data deduplication process and eliminating the possible
security issues, an effective performance increase and reduction in cost of
storage can be observed. In this paper, the concept of Data deduplication is
introduced, then how it can be classified into various subtypes, its algorithms
Data Deduplication, Storage, Hashing, Chunking, Redundant data.
Nowadays, due to advancements in the Technology, the amount of
digital data generated by the applications is increasing at a faster rate. As
the storage systems have a limited capacity, storing such huge amount of data
has posed challenge. Recent International Data Corporation (IDC) studies
indicate that in past five years the volume of data has increased by almost
nine times to 7 ZB per year and a more explosive growth is expected in next ten
years 2. Such a massive growth in storage is controlled by the technique of
Data Deduplication 1. This process identifies duplicate contents at the chunk
level using hash values and deduplicates them 1.
Data Deduplication can be
implemented at File Level and at Chunk Level. Chunk level Deduplication is
always preferred over the file deduplication as in the process of chunk level
deduplication, the fingerprints of various chunks are compared and the
redundant ones are deduplicated. As against in the File Level Deduplication,
the whole file is compared with the other file by checking its metadata.
Chunking plays a significant role in determining the efficiency of the data
deduplication algorithm. The performance of the data deduplication algorithm
can be computed by analyzing the size and number of chunks 1.
Data deduplication saves a lot of storage space and money spent on
the storage of the data by optimizing the storage space and bandwidth costs
1. Hence, a greener environment can be obtained as fewer spaces are required
to house the data in primary and remote storage. As we are maintaining less
storage, fast return on investment can be obtained. This process also helps in
increasing the network bandwidth and improving the network efficiency 1.
II. Different approaches of Data Deduplication
Data deduplication – often called intelligent
compression or single-instance storage – is a process that eliminates redundant
copies of data and reduces storage overhead. Data deduplication techniques ensure that
only one unique instance of data is retained on storage media. In the process
of Data deduplication, we divide the file or any block of data into multiple
chunks and a hash value of these chunks is calculated by using hash techniques
like SHA-1 or MD5 3. Using these hash values, we can compare a chunk of data
with the incoming data chunk and if a match is found, then, we can conclude
that a similar data chunks exists in the storage system and hence, we need to
replace the duplicate data chunk with the reference of existing data chunk.
Fig 1. Diagram of Illustrating Data Deduplication
A. Steps of Data Deduplication 1
Creation of Chunks: The File is divided into chunks using any
of the chunking methods – Fixed length chunking or Variable length
chunking. Hash Value Computation: Depending upon the chunks formed, the
hash values of these chunks are computed by using any of the various
available hashing algorithms – SHA-1, MD5, SHA-2.Deduplication Process: Now, the hash values of the data chunks
which can be called as fingerprints are stored in an index table.
Duplicate entries of data can be checked by making use of the index
tables. If the hash values are found to be matching, duplicate data can be
replaced with the reference of the original data chunk.
B. Hash Algorithms
Hash Algorithms are mainly used in the process of Data
deduplication. Hash values of data chunks are computed which are generally
called as “Fingerprints” are used for eliminating the redundancies in the data.
Commonly used hash algorithm for the deduplication process is the SHA-1.
SHA-1 is a cryptographic security algorithm
used in deduplication process for computing the fingerprints of the data
chunks. SHA-1 is closely modelled after MD5. SHA-1 produces a message digest of
160 bits by dividing the input data in blocks of 512 bits each 4. This 160
bits value computed for each data chunk is unique and used for eliminating
redundancies in the data.
D. Source Based Deduplication
Source deduplication is the removal of redundancies from data before
transmission to the target server. Source deduplication offers a number of
benefits, including reduced bandwidth and storage usage. But, Source
deduplication can be slower than target based deduplication, considering large
amount of data. Source deduplication works through client software that
communicates with the server to compare new data chunks with previously stored
data chunks. If the server has previously stored a data chunk, the software
does not send that chunk and instead notes that there is a copy of that data
chunk at that client.
E. Target Based Deduplication
Target deduplication is the removal of redundancies from a backup
transmission as it passes through an appliance sitting between the source and
the target server. Target deduplication reduces the amount of storage required
but, unlike source deduplication, it does not reduce the amount of data that
must be sent across a LAN or WAN during the storage.
F. Inline Data Deduplication
Inline Deduplication refers to eliminating the
data redundancies as the data enter the system. This process cuts down on the
bulk of data and makes the system efficient. The benefit of the inline deduplication
process is that the calculation of hash values and search process for redundant
data is done before the data is actually stored in the system.
G. Post Process Deduplication
Post-processing deduplication, also known as
asynchronous de-duplication, is the analysis and removal of redundant data
after the data is written to storage system. Post-process deduplication has a
number of benefits which includes efficient lookup process as the calculation
hash values and search process for redundant data is done after the data files
get stored on the storage system.
H. File Based Deduplication
The duplicate removal algorithm can be applied on full file. Full
file level duplicates can easily be eliminated by calculating single checksum
of the complete file data and comparing it against existing checksums of the
already-stored files. It’s simple and fast, but the extent of deduplication is
less, as this process does not address the problem of duplicate content
found inside different files.
I. Sub-File Based Deduplication
The sub-file level deduplication technique
breaks the file into smaller fixed or variable size blocks, and then uses a
standard hash-based algorithm to find similar blocks.
J. Fixed Length and Variable Length Deduplication
Fixed Length chunking method splits files into
equally sized chunks. The chunk boundaries are based on offsets like 4, 8, 16
kB, etc. It uses a simple checksum based approach to find the duplicates. The
process is highly constrained and offers limited advantages.
Fig 3. Fixed v/s Variable Length Chunks
The process of variable length chunking states that the files can be
broken into multiple chunks of variable sizes by breaking them up based on the
content rather than on the fixed size of the files 5. This method is used as
an alternative to fixed length chunking. In variable length chunking,
boundaries of the chunks are based on the content present in the file. Hence,
whenever, any data gets updated, it is not necessary to alter the entire whole
file. With data broken down based on the content, efficiency of deduplication
process increases as it is easy to recognize and eliminate redundant data
III. Variable Length Chunking Mechanisms
A. Rabin Karp Fingerprint Algorithm
Rabin Karp is a variable length chunking algorithm that uses the
concept of hash functions and rolling hash technique. A hash function is
essentially a function that maps one thing to a value. In particular, hashing
can map data of arbitrary size to a value of fixed size. A rolling hash (also
known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input.
A rolling hash allows an algorithm to calculate a hash value without having the
rehash the entire string. For example, when searching for a word in a text, as
the algorithm shifts one letter to the right (as in the animation for
brute-force) instead of having to calculate the hash of the section of text
1:3 as it shifts from text 0:2, the algorithm can use a rolling hash to do
an operation on the hash to get the new hash from the old hash.
To find the chunking pattern we have to use a rolling hash. The idea
is to define a data window on the data stream, which has finite and predefined
size of say “N”. Now using a hashing technique calculate a hash on
the window. Check if the hash matches a predefined “fingerprint”.
If the “fingerprint” does match the calculated
rolling hash on the window, mark the Nth element as the chuck boundary.If the “fingerprint” doesn’t match the calculated
rolling hash on the window, slide the window by one element and
recalculate the rolling hash. Repeat this until you find a match.
B. Two Threshold Two Divisors Algorithm 6
algorithm was proposed by HP laboratory at Palo Alto, California. This
algorithm use same idea as the BSW algorithm does. The TTTD algorithm uses four
parameters, the maximum threshold, the minimum threshold, the main divisor, and
the second divisor. The maximum threshold parameter is used for eliminating
very large sized chunks while minimum threshold is used for eliminating very
small sized chunks. These two parameters are necessary to control the
variations in chunk sizes. The main divisor can be used to make the chunk size
close to our expected chunk size. The second divisor finds the breakpoint when
the main divisor cannot find it. The breakpoint found by second divisor are
large and close to maximum thresholds 7.
IV. Enhanced Method
As explained earlier,
Data deduplication is a technique which stores only unique data chunks in the storage
and removes the redundant data chunks. Hence, due to such efficient storage
process, technique of data deduplication is preferred over the traditional
storage systems. There are number of techniques available for data
deduplication. This paper focuses on presenting the studied data deduplication
types. We have enhanced the existing method for data deduplication which
depends on eliminating the redundant data chunks while it is stored in the
storage or backup. Figure shows the flow diagram of the method.
4. Flowchart of the Deduplication Method
method, the data (text file) which the client intends to upload to the storage,
is divided into chunks by using any of the variable length chunking algorithm,
preferably Rabin Karp Fingerprint. A hash value of the created chunks will be
computed. SHA-1 algorithm will be used for computing the hash value of the data
chunks. The reason for selecting SHA-1 over any other hashing algorithm like
MD5 is that SHA-1 is efficient and more secure as its message digest is 160
bits. These unique chunks whose hash values are computed will be stored in the
storage and hash values will be stored in the index tables. When any new file
is ready to upload, the hash values of this new data chunks will be compared
with the values stored in the index tables. If any match is found, a reference
of original data chunk will be stored and value in index table will be updated
and if any match is not found, the data chunk will be directly stored in the
Now for enhancing
the security, the data chunk will be encrypted before it is stored in the
backup. Basically, the need for security arises when deduplication system is
implemented in the cloud. Security issues can be observed when information is
processed on the cloud platform. The user who uploads the file does not have
the right over where the file is stored. Hence, there might be the possibility
that the cloud service provider or any third party application can handle and
access the data. Hence, the need for encryption arises. Before storing the
data, the text will be encrypted and then, it will be stored in the backup or
target storages. The algorithm which will be used for encryption is Advanced
Encryption Standard (AES). Before downloading the file, all the data chunks
will be decrypted with the key provided, merged into a single file and then, it
will be downloaded.
comparing File level chunking, fixed length chunking and variable length
chunking, it is observed that variable length chunking provides good results.
In variable length chunking, files are broken down into chunks of variable
sizes based on the content. Table below represents different
deduplicate/undeduplicate size (8, 16, 32, 64 Kb) in % with different chunking
algorithms namely File Level, Fixed Length and variable length chunking
In Fixed Length chunking, each chunk is of particular fixed length.
However, problem arises when the file needs to be updated. By inserting a line
of text in the file, the chunking algorithm shifts all the subsequent chunks,
also no chunk following the first chunk is preserved. Hence, simple small
updates radically change the chunks and deduplication becomes an issue.
in Variable Length chunking, chunks are content based and are not of uniform
length. Similarly, in variable chunking method, when updation process takes
place, only the chunk where any new line of text is inserted or deleted is
changed. Remaining chunks are unaltered and this is biggest benefit of variable
length chunking. Hence, the method proposed uses variable length chunking
shows the comparison of different deduplication algorithms with different
performance metrics. Deduplication ratio indicates that how many redundancies
are removed and variable sized deduplication is much better than others. In
terms of processing time, variable sized deduplication is worst owing to
expensive variable length chunking 9. As chunks are of variable sizes, time
taken to process is high.
This paper focuses on study of the Data Deduplication process for
storage systems. Data Deduplication methods are used to achieve cost effective
storage and effective network bandwidth. The actual concept lies in removing
the redundancies present in the data items. It is one of the emerging concepts
which is currently being implemented by cloud providers. The future scope in
this process would be to improve its performance by enhancing the speed of
storing data and providing higher security.
Singhal and Naresh Kumar, “A Survey on Data Deduplication” in International
Journal on Recent and Innovation Trends in computing and communication, May
Hong Jiang, Suzhen Wu, Lei Tian, “Leveraging Data Deduplication to improve
performance of Primary Storage Systems in the Cloud” in IEEE Transactions
on Computer, Vol 65, No. 6, June 2016.
Golthi Tharunn, Gowtham Kommineni, Sarpella Sasank
Varma, Akash Singh Verma, “Data Deduplication in Cloud Storage” in International
Journal of Advanced Engineering and Global Technology, Vol 03, Issue 08,
Shah, Drashti R. Panchal, “Secured Hash Algorithm-1 Review Paper” in
International Journal for Advanced Research in Engineering and Technology,
Vol 02, Oct 2014.
and K. Shiva Shankar, “Study of Chunking Algorithm in Data Deduplication”
Chang,” A Running Time Improvement for Two Threshold Two Divisors Algorithm”,
MS, SJSU Scholar Works, 2009.
Malhotra and J. Bakal, “A Survey and Comparative Study of Data Deduplication
Techniques”, in International Conference on Pervasive Computing, Pune
Zuhair S. Al-Sagar, Mohammed S. Saleh, Aws Zuhair
Sameen, “Optimizing Cloud Storage by Data Deduplication: A Study”, in International
Research Journal of Engineering and Technology, Vol 02, Issue 09, Dec
Daehee Kim, Sejun Song, Baek-Young Choi, “Data
Deduplication for Data optimization for Storage and Network Systems”.