Solving Data De-Duplication Issues on Cloud Using Hashing and MD5 Techniques
TABLE OF CONTENTS
CHAPTER NO. TITLE PAGE NO.
ABSTRACT V
LIST OF FIGURES VII
LIST OF TABLES IX
1 INTRODUCTION 01
1.1 OUTLINE OF THE PROJECT 01
1.2 PROBLEM STATEMENT 03
1.3 SCOPE OF THE PROJECT 03
2 LITERATURE REVIEW 04
2.1 STATE OF ART 04
2.2 INFERENCES FROM LITERATURE 07
3 SYSTEM ANALYSIS 08
3.1 EXISTING SYSTEM 08
3.1.1 Disadvantages 08
3.2 PROPOSED SYSTEM 08
3.2.1 Advantages 09
3.3 SOFTWARE REQUIREMENTS 10
3.4 .NET FRAMEWORK 10
3.4.1 Languages Supported 12
3.4.2 Objectives Of .Net Framework 15
3.4.3 Features Of .Net 16
3.4.4 Security 17
3.5 SQL SERVER 17
3.5.1 Data Storage 18
3.5.2 Form 20
3.6 MD5 AND HASH ALGORITHMS 21
3.6.1 Hash algorithm 21
3.6.2 Hash-based data deduplication 22
3.6.3 Finding duplicate records 23
3.6.4 Finding similar records 24
3.6.5 MD5 Algorithm 25
3.7 SPACE REDUCTION TECHNOLOGIES 26
4 SOFTWARE DEVELOPMENT METHODOLOGY 28
4.1 METHODOLOGIES 28
4.1.1 Optimizing Storage Capacity 28
4.2 UML DIAGRAM 30
4.2.1 Use Case Diagram 30
4.2.2 Class Diagram 31
4.2.3 Sequence Diagram 32
4.2.4 Collaboration Diagram 32
4.2.5 ER Diagram 33
4.2.6 Data Flow Diagram 33
4.3 ARCHITECTURE 34
4.4 MODULE DESCRIPTION 35
4.4.1 User Module 35
4.4.2 Server Start Up and Upload File 36
4.4.3 Secure De Duplication System 37
4.4.4 Download File 37
5 RESULTS AND DISCUSSION 38
5.1 RESULT 38
6 CONCLUSION AND FUTURE ENHANCEMENT 42
6.1 CONCLUSION 42
6.2 FUTURE ENHANCEMENT 42
REFERENCES 43
APPENDIX 44
A. SOURCE CODE 44
B. PUBLICATION WITH PLAGIARISM REPORT 47
LIST OF FIGURES
5.1 SPACE REDUCTION RATIO 39
5.2 SPACE REDUCTION PERCENTAGES 40
5.3 HOME PAGE 40
5.4 UPLOAD PAGE 41
5.5 USER REQUEST PAGE 41
5.6 KEY GENERATION 42
LIST OF TABLES
CHAPTER-1
INTRODUCTION
Data deduplication is a specialized data compression technique for eliminating duplicate copies of
repeating data in storage. It is used to improve storage utilization and can also be
applied to network data transfers to reduce the number of bytes that must be sent.
Instead of keeping multiple copies with the same content, deduplication eliminates
redundant data by keeping only one physical copy and referring the other redundant
copies to it. Deduplication can take place at either the file level or the block level:
file-level deduplication eliminates duplicate copies of the same file, while block-level
deduplication eliminates duplicate blocks of data that occur in non-identical files.
Although data deduplication brings many benefits, security and
privacy concerns arise because users' sensitive data are
susceptible to both insider and outsider attacks. Traditional encryption, while
providing data confidentiality, is incompatible with data deduplication. Specifically,
traditional encryption requires different users to encrypt their data with their own keys.
Thus, identical data copies of different users will lead to different ciphertexts, making
deduplication impossible. Convergent encryption has been proposed to enforce data
confidentiality while making deduplication feasible. It encrypts/decrypts a data copy
with a convergent key, which is obtained by computing the cryptographic hash value
of the content of the data copy.
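To make the idea concrete, the sketch below (written in C#; all class and method names are chosen only for illustration and are not taken from the project's source code) derives the convergent key as the MD5 digest of the file content and encrypts the file deterministically with that key, so identical files always yield identical ciphertexts and can be deduplicated.

using System;
using System.Security.Cryptography;

// Illustrative sketch of convergent encryption: the key is a hash of the
// content itself, so two users who encrypt the same file obtain the same
// ciphertext and the cloud can deduplicate it.
public static class ConvergentEncryption
{
    // Convergent key = MD5 digest of the data copy (16 bytes -> AES-128 key).
    public static byte[] DeriveKey(byte[] content)
    {
        using (var md5 = MD5.Create())
            return md5.ComputeHash(content);
    }

    // Deterministic AES-CBC with a fixed IV (illustration only): identical
    // plaintext and identical key always produce an identical ciphertext.
    public static byte[] Encrypt(byte[] content, byte[] convergentKey)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = convergentKey;
            aes.IV = new byte[16];   // all-zero IV keeps the output deterministic
            using (var enc = aes.CreateEncryptor())
                return enc.TransformFinalBlock(content, 0, content.Length);
        }
    }

    public static byte[] Decrypt(byte[] cipherText, byte[] convergentKey)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = convergentKey;
            aes.IV = new byte[16];
            using (var dec = aes.CreateDecryptor())
                return dec.TransformFinalBlock(cipherText, 0, cipherText.Length);
        }
    }
}

The user retains only the convergent key and uploads the ciphertext; because the key depends only on the content, any other owner of the same file can regenerate the key and decrypt the stored copy.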
After key generation and data encryption, users retain the keys and send the
ciphertext to the cloud. Since the encryption is deterministic and the key is derived
from the data content, identical data copies will generate the same convergent key
and hence the same ciphertext. To prevent unauthorized access, a secure proof of
ownership protocol is also needed to provide the proof that the user indeed owns
the same file when a duplicate is found. After the proof, subsequent users with the
same file will be provided a pointer from the server without needing to upload the
same file. A user can download the encrypted file with the pointer from the server,
which can only be decrypted by the corresponding data owners with their convergent
keys. Thus, convergent encryption allows the cloud to perform deduplication on the
ciphertexts, and the proof of ownership prevents unauthorized users from accessing
the file. However, previous deduplication systems cannot support differential
authorization duplicate check, which is important in many applications. In such an
authorized deduplication system, each user is issued a set of privileges during
system initialization. Each file uploaded to the cloud is also bound to a set of
privileges that specifies which kinds of users are allowed to perform the duplicate check
and access the files. Before submitting his duplicate check request for some file, the
user needs to take this file and his own privileges as inputs. The user is able to find
a duplicate for this file if and only if a copy of the file and a matching privilege are
stored in the cloud. For example, in a company, many different privileges will be
assigned to employees. To save cost and manage data efficiently, the data
will be moved to the storage cloud service provider (S-CSP) in the public cloud with specified
privileges, and the deduplication technique will be applied to store only one copy of
the same file. For privacy reasons, some files will be encrypted, and only employees
with the specified privileges are allowed to perform the duplicate check, thereby realizing
access control. Traditional deduplication systems based on convergent encryption,
although providing confidentiality to some extent, do not support duplicate checks
with differential privileges. In other words, no differential privileges have been
considered in deduplication based on convergent encryption. It seems
contradictory to realize both deduplication and differentially
authorized duplicate checks at the same time.
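As a rough sketch of this differential-authorization duplicate check (again with illustrative names, not the project's actual classes), the server can keep, for every stored file tag, the set of privileges the file was uploaded with, and report a duplicate only when the requesting user holds at least one matching privilege.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative duplicate-check table for authorized deduplication: a file is
// reported as a duplicate only if its tag is already stored AND the requesting
// user holds one of the privileges bound to that file.
public class DuplicateCheckServer
{
    // file tag (e.g. hex hash of the content) -> privileges bound to the file
    private readonly Dictionary<string, HashSet<string>> index =
        new Dictionary<string, HashSet<string>>();

    // Called when a file is stored, recording the privileges it is bound to.
    public void Register(string fileTag, IEnumerable<string> privileges)
    {
        HashSet<string> bound;
        if (!index.TryGetValue(fileTag, out bound))
            index[fileTag] = bound = new HashSet<string>();
        foreach (string p in privileges)
            bound.Add(p);
    }

    // A duplicate is reported only when the tag exists and the user's
    // privileges intersect the privilege set the file was uploaded with.
    public bool IsAuthorizedDuplicate(string fileTag, IEnumerable<string> userPrivileges)
    {
        HashSet<string> bound;
        return index.TryGetValue(fileTag, out bound) &&
               userPrivileges.Any(bound.Contains);
    }
}

A user whose privileges do not intersect the file's bound privilege set simply sees no duplicate and uploads the file as usual.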
The main goal is to enable deduplication and distributed storage of data
across multiple storage servers and thereby save storage space, since a large number
of users produce data every day. One critical challenge of cloud storage services is the
management of this ever-increasing volume of data. The existing system provides
no deduplication process, so duplication cannot be avoided at either the file level
or the block level. This paper makes the first attempt to formally address the
problem of authorized data deduplication.
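Since the project detects duplicates through MD5 fingerprints, a minimal file-level sketch is given below (the class and helper names are assumptions for illustration, not the project's code): the MD5 digest of the uploaded content serves as its fingerprint, and a file is physically stored only when its fingerprint has not been seen before.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Minimal file-level deduplication check: the MD5 digest of a file's content
// is used as its fingerprint; an upload is physically stored only when the
// fingerprint is not already present in the store.
public class FileLevelDeduplicator
{
    private readonly HashSet<string> storedFingerprints = new HashSet<string>();

    public static string Md5Fingerprint(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
    }

    // Returns true if the content is new and should be stored; false means
    // only a pointer to the existing copy is needed.
    public bool ShouldStore(string path)
    {
        return storedFingerprints.Add(Md5Fingerprint(path));
    }
}

Block-level deduplication follows the same pattern, with the digest computed per fixed-size or content-defined block instead of per file.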
CHAPTER-2
LITERATURE SURVEY
Authors Yang Tang, Patrick P. C. Lee, John C. S. Lui, and Radia Perlman in "Secure
Overlay Cloud Storage with Access Control and Assured Deletion" state that we can
now outsource data backups off-site to third-party cloud storage services so as to
reduce data management costs. However, we must provide security guarantees for
the outsourced data, which is now maintained by third parties. We design and
implement FADE, a secure overlay cloud storage system that achieves fine-grained,
policy-based access control and file assured deletion. It associates outsourced files
with file access policies, and assuredly deletes files to make them unrecoverable to
anyone upon revocations of file access policies. To achieve such security goals,
FADE is built upon a set of cryptographic key operations that are self-maintained by
a quorum of key managers that are independent of third-party clouds. In particular,
FADE acts as an overlay system that works seamlessly atop today’s cloud storage
services. We implement a proof-of-concept prototype of FADE atop Amazon S3,
one of today’s cloud storage services. We conduct extensive empirical studies, and
demonstrate that FADE provides security protection for outsourced data, while
introducing only minimal performance and monetary cost overhead. Our work
provides insights into how to incorporate value-added security features into today's
cloud storage services [1].
Authors Huiqi Xu, Shumin Guo, and Keke Chen in "Building Confidential and Efficient
Query Services in the Cloud with RASP Data Perturbation" state that, with the wide
deployment of public cloud computing infrastructures, using clouds to host data
query services has become an appealing solution for the advantages on scalability
and cost-saving. However, some data might be so sensitive that the data owner does
not want to move it to the cloud unless data confidentiality and query privacy are
guaranteed. On the other hand, a secured query service should still provide efficient
query processing and significantly reduce the in-house workload to fully realize the
benefits of cloud computing. We propose the random space perturbation (RASP)
data perturbation method to provide secure and efficient range query and kNN query
services for protected data in the cloud. The RASP data perturbation method
combines order preserving encryption, dimensionality expansion, random noise
injection, and random projection, to provide strong resilience to attacks on the
perturbed data and queries. It also preserves multidimensional ranges, which allows
existing indexing techniques to be applied to speed up range query processing. The
kNN-R algorithm is designed to work with the RASP range query algorithm to
process kNN queries. We have carefully analysed the attacks on data and
queries under a precisely defined threat model and realistic security assumptions.
Extensive experiments have been conducted to show the advantages of this
approach on efficiency and security [2].
Authors J. Li, X. Chen, M. Li, J. Li, P. Lee, and W. Lou in "Secure Deduplication with
Efficient and Reliable Convergent Key Management" state that data deduplication
is a technique for eliminating duplicate copies of data, and has been widely used in
cloud storage to reduce storage space and upload bandwidth. Promising as it is, an
arising challenge is to perform secure deduplication in cloud storage. Although
convergent encryption has been extensively adopted for secure deduplication, a
critical issue of making convergent encryption practical is to efficiently and reliably
manage a huge number of convergent keys. This paper makes the first attempt to
formally address the problem of achieving efficient and reliable key management in
secure deduplication. We first introduce a baseline approach in which each user
holds an independent master key for encrypting the convergent keys and
outsourcing them to the cloud. However, such a baseline key management scheme
generates an enormous number of keys with the increasing number of users and
requires users to dedicatedly protect the master keys. To this end, we propose De-key,
a new construction in which users do not need to manage any keys on their own but
instead securely distribute the convergent key shares across multiple servers.
Security analysis demonstrates that De-key is secure in terms of the definitions
specified in the proposed security model. As a proof of concept, we implement De-
key using the Ramp secret sharing scheme and demonstrate that De-key incurs
limited overhead in realistic environments [4].
Authors Chaoling Li, Yue Chen, and Yanzhou Zhou in "A Data Assured Deletion Scheme
in Cloud Storage" state that, in order to provide a practicable solution to data
confidentiality in cloud storage service, a data assured deletion scheme, which
achieves the fine-grained access control, hopping and sniffing attacks resistance,
data dynamics and de-duplication, is proposed. In our scheme, data blocks are
encrypted by a two-level encryption approach, in which the control keys are
generated from a key derivation tree, encrypted by an All-Or-Nothing algorithm and
then distributed into a DHT network after being partitioned by secret sharing. This
guarantees that only authorized users can recover the control keys and then decrypt
the outsourced data in an owner specified data lifetime. Besides confidentiality, data
dynamics and deduplication are also achieved separately by adjustment of key
derivation tree and convergent encryption [5].
Authors W. K. Ng, Y. Wen, and H. Zhu in "Private Data Deduplication Protocols in
Cloud Storage" state that, in this paper, a new notion which we call private data deduplication