A Secure and Dynamic Multi-Keyword Ranked Search Scheme Over Encrypted Cloud Data
ABSTRACT:
Due to the increasing popularity of cloud computing, more and more
data owners are motivated to outsource their data to cloud servers for
great convenience and reduced cost in data management. However,
sensitive data should be encrypted before outsourcing for privacy
requirements, which hinders data utilization such as keyword-based
document retrieval. In this paper, we present a secure multi-keyword
ranked search scheme over encrypted cloud data, which simultaneously
supports dynamic update operations such as deletion and insertion of
documents. Specifically, the vector space model and the widely used
TF×IDF model are combined in the index construction and query
generation. We construct a special tree-based index structure and
propose a Greedy Depth-first Search algorithm to provide efficient
multi-keyword ranked search. The secure kNN algorithm is utilized to
encrypt the index and query vectors, and meanwhile ensure accurate
relevance score calculation between encrypted index and query vectors.
In order to resist statistical attacks, phantom terms are added to the index
vector for blinding search results. Due to the use of our special tree-
INTRODUCTION:
Cloud computing has been considered a new model of enterprise IT
infrastructure, which can organize huge computing, storage and
application resources, and enable users to enjoy ubiquitous, convenient and
on-demand network access to a shared pool of configurable computing
resources with great efficiency and minimal economic overhead.
Attracted by these appealing features, both individuals and enterprises
are motivated to outsource their data to the cloud, instead of purchasing
software and hardware to manage the data themselves. Despite the
various advantages of cloud services, outsourcing sensitive information
(such as e-mails, personal health records, company finance data,
government documents, etc.) to remote servers brings privacy concerns.
The cloud service providers (CSPs) that keep the data for users may
access users' sensitive information without authorization. A general
approach to protecting data confidentiality is to encrypt the data before
outsourcing. However, this causes a huge cost in terms of data
usability. For example, the existing techniques on keyword-based
WORKING MODULE:
Searchable encryption schemes enable clients to store encrypted
data in the cloud and execute keyword search over the ciphertext domain.
Depending on the underlying cryptographic primitives, searchable encryption
schemes can be constructed using public-key cryptography or symmetric-key
cryptography. Song et al. proposed the first symmetric
searchable encryption (SSE) scheme, and the search time of their
scheme is linear in the size of the data collection. Goh proposed formal
security definitions for SSE and designed a scheme based on Bloom
filters. The search time of Goh's scheme is O(n), where n is the
cardinality of the document collection. Curtmola et al. proposed two
schemes which achieve the optimal search time. Their SSE-1 scheme is
secure against chosen-keyword attacks and SSE-2 is secure against
[...] does not consider the importance of the different keywords, and thus is
not accurate enough. In addition, the search efficiency of the scheme is
linear in the cardinality of the document collection. Sun et al. presented a
secure multi-keyword search scheme that supports similarity-based
ranking. The authors constructed a searchable index tree based on the vector
space model and adopted the cosine measure together with TF×IDF to
provide ranking results. Sun et al.'s search algorithm achieves
better-than-linear search efficiency but results in precision loss. Örencik et al.
proposed a secure multi-keyword search method which utilizes
locality-sensitive hash (LSH) functions to cluster similar documents. The
LSH algorithm is suitable for similarity search but cannot provide exact
ranking. Zhang et al. proposed a scheme to deal with secure
multi-keyword ranked search in a multi-owner model. In this scheme, different
data owners use different secret keys to encrypt their documents and
keywords, while authorized data users can query without knowing the keys
of these different data owners. The authors proposed an Additive Order
Preserving Function to retrieve the most relevant search results.
However, these works do not support dynamic operations. In practice, the
data owner may need to update the document collection after uploading
it to the cloud server. Thus, SE schemes are expected to
support the insertion and deletion of documents. There are also
several dynamic searchable encryption schemes. In the work of Song et
al., each document is considered as a sequence of fixed-length
The system model in this paper involves three different entities: data
owner, data user and cloud server, as illustrated.
DATA OWNER:
Has a collection of documents F = {f1, f2, ..., fn} that he wants to outsource
to the cloud server in encrypted form while still keeping the capability to
search on them for effective utilization. In our scheme, the data owner
first builds a secure searchable tree index I from the document collection
F, and then generates an encrypted document collection C for F.
Afterwards, the data owner outsources the encrypted collection C and
the secure index I to the cloud server, and securely distributes the key
information for trapdoor generation and document decryption to the
authorized data users. Besides, the data owner is responsible for
updating his documents stored on the cloud server. While
updating, the data owner generates the update information locally and
sends it to the server.
DATA USERS:
Are the ones authorized to access the documents of the data owner. With t query
keywords, the authorized user can generate a trapdoor TD according to
search control mechanisms to fetch k encrypted documents
from the cloud server. Then, the data user can decrypt the documents with
the shared secret key.
CLOUD SERVER:
Stores the encrypted document
collection C and the encrypted searchable tree index I for the data owner.
Upon receiving the trapdoor TD from the data user, the cloud server
executes the search over the index tree I, and finally returns the
corresponding collection of top-k ranked encrypted documents. Besides,
upon receiving the update information from the data owner, the server
needs to update the index I and document collection C according to the
received information. The cloud server in the proposed scheme is
considered honest-but-curious, an assumption adopted by many works
on secure cloud data search. Specifically, the cloud server honestly and
correctly executes instructions in the designated protocol. Meanwhile, it
is curious to infer and analyze received data, which helps it acquire
PROPOSED SCHEME:
To enable secure, efficient, accurate and dynamic multi-keyword ranked
search over outsourced encrypted cloud data under the above models,
our system has the following design goals.
DYNAMIC:
The proposed scheme is designed to provide not only multi-keyword
query and accurate result ranking, but also dynamic update on document
collections.
SEARCH EFFICIENCY:
The scheme aims to achieve sublinear search efficiency by exploring a
special tree-based index and an efficient search algorithm.
PRIVACY-PRESERVING:
The scheme is designed to prevent the cloud server from learning
additional information about the document collection, the index tree, and
the query. The specific privacy requirements are summarized as follows:
INDEX CONFIDENTIALITY AND QUERY CONFIDENTIALITY:
The underlying plaintext information, including keywords in the index
and query, TF values of keywords stored in the index, and IDF values of
query keywords, should be protected from the cloud server.
TRAPDOOR UNLINKABILITY:
The cloud server should not be able to determine whether two encrypted
queries (trapdoors) are generated from the same search request.
KEYWORD PRIVACY:
The cloud server should not be able to identify the specific keyword in a query, the index
or the document collection by analyzing statistical information such as term
frequency. Note that our proposed scheme is not designed to protect the
access pattern, i.e., the sequence of returned documents.
DESIGN GOALS:
In this section, we first describe the unencrypted dynamic multi-keyword
ranked search (UDMRS) scheme, which is constructed on the basis
of the vector space model and the KBB tree. Based on the UDMRS scheme,
two secure search schemes (BDMRS and EDMRS) are
constructed against two threat models, respectively.
[...] process. The following are some other notations, and the GDFS algorithm is
described in Algorithm
RScore(Du, Q): The function to calculate the relevance score for
query vector Q and index vector Du stored in node u, which is defined in
Formula (1).
kth score: The smallest relevance score in the current RList, which is
initialized as 0.
hchild: The child node of a tree node with the higher relevance score.
lchild: The child node of a tree node with the lower relevance score.
Since the largest possible relevance score of the documents rooted at
node u can be predicted, only a part of the nodes in the tree are accessed
during the search process. The example of the search process uses
the document collection F = {fi | i = 1, ..., 6}, cardinality of the
dictionary m = 4, and query vector Q = (0, 0.92, 0, 0.38).
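The pruning idea behind GDFS can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's algorithm listing: the six document vectors are made-up values, the names Node, build_tree and gdfs are hypothetical, and each internal node is assumed to store the element-wise maximum of its children's vectors, so that its score is an upper bound for every leaf below it.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    vec: List[float]               # index vector Du
    fid: Optional[int] = None      # document id for a leaf, None for an internal node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def rscore(d, q):
    """Relevance score as the inner product of index vector and query vector."""
    return sum(x * y for x, y in zip(d, q))

def build_tree(leaves):
    """Pair nodes bottom-up; an internal vector is the element-wise max of its children (assumption)."""
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            nxt.append(Node([max(x, y) for x, y in zip(a.vec, b.vec)], None, a, b))
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired node moves up unchanged
        level = nxt
    return level[0]

def gdfs(root, q, k):
    """Greedy depth-first search; returns (RList, number of visited nodes)."""
    rlist = []                     # (score, fid) pairs, kept in descending order
    visited = 0

    def kth_score():
        return rlist[-1][0] if len(rlist) == k else 0.0

    def visit(u):
        nonlocal visited
        visited += 1
        s = rscore(u.vec, q)
        if s <= kth_score():
            return                 # neither this node nor its subtree can improve RList
        if u.fid is not None:      # leaf: replace the current k-th result if RList is full
            if len(rlist) == k:
                rlist.pop()
            rlist.append((s, u.fid))
            rlist.sort(reverse=True)
            return
        hchild, lchild = sorted((u.left, u.right),
                                key=lambda c: rscore(c.vec, q), reverse=True)
        visit(hchild)              # descend into the more promising child first
        visit(lchild)

    visit(root)
    return rlist, visited

# Hypothetical TF vectors for f1..f6 over a dictionary of m = 4 keywords.
DOCS = [[0.0, 0.9, 0.0, 0.4], [0.7, 0.7, 0.0, 0.1], [1.0, 0.0, 0.0, 0.0],
        [0.6, 0.0, 0.8, 0.0], [0.0, 0.2, 0.9, 0.3], [0.5, 0.1, 0.5, 0.7]]
Q = [0.0, 0.92, 0.0, 0.38]

root = build_tree([Node(v, fid=i + 1) for i, v in enumerate(DOCS)])
results, visited = gdfs(root, Q, k=2)
print([fid for _, fid in results], visited)
```

With these made-up vectors the search returns f1 and f2 after visiting 7 of the 11 tree nodes; the subtrees holding f3 to f6 are skipped because the scores of their roots do not exceed the current kth score.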
BDMRS SCHEME:
Based on the UDMRS scheme, we construct the basic dynamic multi-keyword
ranked search (BDMRS) scheme by using the secure kNN
algorithm [38]. The BDMRS scheme is designed to achieve the goal of
privacy preservation in the known ciphertext model, and the four
algorithms included are described as follows:
SK ← Setup(): Initially, the data owner generates the secret key set SK,
including 1) a randomly generated m-bit vector S, where m is equal to the
cardinality of the dictionary, and 2) two (m × m) invertible matrices
M1 and M2. Namely, SK = {S, M1, M2}.
I ← GenIndex(F, SK): First, the unencrypted index tree T is built on
F by using T ← BuildIndexTree(F). Second, the data owner generates two random
vectors {Du', Du''} for the index vector Du in each node u, according to the
secret vector S. Specifically, if S[i] = 0, Du'[i] and Du''[i] will be set
equal to Du[i]; if S[i] = 1, Du'[i] and Du''[i] will be set as two random
values whose sum equals Du[i]. Finally, the encrypted index tree I is
built, where the node u stores two encrypted index vectors
Iu = {M1^T Du', M2^T Du''}.
TD ← GenTrapdoor(Wq, SK): With keyword set Wq, the
unencrypted query vector Q of length m is generated. If wi ∈ Wq,
Q[i] stores the normalized IDF value of wi; else Q[i] is set to 0.
Similarly, the query vector Q is split into two random vectors Q' and Q''.
The difference is that if S[i] = 0, Q'[i] and Q''[i] are set to two random
values whose sum equals Q[i]; else Q'[i] and Q''[i] are both set to Q[i]. Finally,
the algorithm returns the trapdoor TD = {M1^-1 Q', M2^-1 Q''}.
RelevanceScore ← SRScore(Iu, TD): With the trapdoor TD, the cloud
server computes the relevance score of node u in the index tree I to the
query. Note that the relevance score calculated from encrypted vectors is
equal to that from unencrypted vectors, as follows:
Iu · TD = (M1^T Du') · (M1^-1 Q') + (M2^T Du'') · (M2^-1 Q'')
        = Du' · Q' + Du'' · Q''
        = Du · Q = RScore(Du, Q).
SECURITY ANALYSIS.
We analyze the BDMRS scheme according to the three predefined
privacy requirements in the design goals:
INDEX CONFIDENTIALITY AND QUERY CONFIDENTIALITY:
In the proposed BDMRS scheme, Iu and TD are obfuscated vectors,
which means the cloud server cannot infer the original vectors Du and Q
without the secret key set SK. The secret keys M1 and M2 are Gaussian
random matrices. According to prior analysis of the secure kNN algorithm,
an attacker (the cloud server) mounting a ciphertext-only attack (COA)
cannot calculate the matrices from ciphertext alone. Thus, the BDMRS
scheme is resilient against COA, and index
confidentiality and query confidentiality are well protected.
QUERY UNLINKABILITY:
The trapdoor of a query vector is generated from a random splitting
operation, which means that the same search requests will be
transformed into different query trapdoors, and thus query
unlinkability is protected. However, the cloud server is able to link the same
search requests according to the same visited path and the same
relevance scores.
KEYWORD PRIVACY:
In this scheme, the confidentiality of the index and query is well
protected, so the original vectors are kept from the cloud server. The
search process merely involves inner product computation on
encrypted vectors, which leaks no information about any specific
keyword. Thus, keyword privacy is protected in the known ciphertext
model. But in the known background model, the cloud server is
supposed to have more knowledge, such as the term frequency statistics
of keywords. This statistical information can be visualized as TF
distribution histograms, which reveal how many documents there are for
every TF value of a specific keyword in the document collection. Then,
due to the specificity of the TF distribution histogram, such as the graph
slope and value range, the cloud server could conduct a TF statistical
attack to deduce or identify keywords. In the worst case, when there is only
one keyword in the query vector, i.e., the normalized IDF value for the
keyword is 1, the final relevance score distribution is exactly the
normalized TF distribution of this keyword, which is directly exposed to
the cloud server. Therefore, the BDMRS scheme cannot resist the TF statistical
attack in the known background model.
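The worst case described above is easy to reproduce. The sketch below works on plaintext vectors, which is sufficient because in BDMRS the score computed from encrypted vectors equals the plaintext score; the document vectors are made-up values and the names are hypothetical.

```python
from collections import Counter

def rscore(d, q):
    """Relevance score as the inner product of index vector and query vector."""
    return sum(x * y for x, y in zip(d, q))

# Hypothetical normalized TF vectors over a 3-keyword dictionary.
docs = [[0.5, 0.0, 0.3], [0.5, 0.8, 0.0], [0.2, 0.6, 0.0], [0.5, 0.0, 0.9]]

# Single-keyword query on keyword 0, so its normalized IDF value is 1.
q = [1.0, 0.0, 0.0]

scores = [rscore(d, q) for d in docs]      # what the server observes
tf_column = [d[0] for d in docs]           # the TF values of keyword 0

score_histogram = Counter(scores)
tf_histogram = Counter(tf_column)
print(score_histogram == tf_histogram)
```

The two histograms coincide, so a server that knows the TF distribution of candidate keywords from background data can match the observed score histogram against them.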
EDMRS SCHEME:
The security analysis above shows that the BDMRS scheme can protect
index confidentiality and query confidentiality in the known ciphertext
model. However, the cloud server is able to link the same search
requests by tracking the path of visited nodes. In addition, in the known
background model, it is possible for the cloud server to identify a
keyword, as the normalized TF distribution of the keyword can be
exactly obtained from the final calculated relevance scores. The primary
cause is that the relevance score calculated from Iu and TD is exactly
equal to that from Du and Q. A heuristic method to further improve
security is to break this exact equality. Thus, we introduce some
tunable randomness to disturb the relevance score calculation. In
addition, to suit different users' preferences for more accurate ranked
results or better protected keyword privacy, the randomness is made
adjustable. The enhanced EDMRS scheme is almost the same as the
BDMRS scheme except that:
SK ← Setup(): In this algorithm, we set the secret vector S as an (m + m')-bit
vector, and set M1 and M2 as (m + m') × (m + m') invertible matrices,
where m' is the number of phantom terms.
I ← GenIndex(F, SK): Before encrypting the index vector Du, we
extend the vector Du to be an (m + m')-dimensional vector. Each extended
element Du[m + j], j = 1, ..., m', is set as a random number εj.
TD ← GenTrapdoor(Wq, SK): The query vector Q is extended to be
an (m + m')-dimensional vector. Among the extended elements, m''
elements are randomly chosen to be set to 1, and the rest are set to 0.
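The effect of the phantom terms on the score can be sketched on plaintext vectors (the encryption step is unchanged apart from the larger dimension). The names extend_index, extend_query, m_prime, m_active and spread are hypothetical, and drawing each εj uniformly from a small interval is an assumption made only for illustration.

```python
import random

def rscore(d, q):
    """Relevance score as the inner product of index vector and query vector."""
    return sum(x * y for x, y in zip(d, q))

def extend_index(d, m_prime, rng, spread=0.05):
    """Append m' phantom entries, each a random number eps_j (uniform here, an assumption)."""
    return list(d) + [rng.uniform(-spread, spread) for _ in range(m_prime)]

def extend_query(q, m_prime, m_active, rng):
    """Append m' entries; a random subset of m_active positions is 1, the rest 0."""
    chosen = set(rng.sample(range(m_prime), m_active))
    return list(q) + [1.0 if j in chosen else 0.0 for j in range(m_prime)]

rng = random.Random(1)
m, m_prime, m_active = 4, 8, 3
d = [0.0, 0.9, 0.0, 0.4]          # made-up index vector
q = [0.0, 0.92, 0.0, 0.38]        # query vector

d_ext = extend_index(d, m_prime, rng)
q_ext = extend_query(q, m_prime, m_active, rng)

exact = rscore(d, q)
blinded = rscore(d_ext, q_ext)
# The blinded score is the exact score plus the eps values picked by the query.
noise = sum(e for e, bit in zip(d_ext[m:], q_ext[m:]) if bit == 1.0)
print(abs(blinded - (exact + noise)) < 1e-12)
```

Because each query selects a fresh random subset of phantom positions, repeating the same request perturbs every document's score differently, at the cost of ranking accuracy that grows with the spread of the εj values.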
SECURITY ANALYSIS.
The security of EDMRS scheme is also analyzed according to the three
predefined privacy requirements in the design goals:
INDEX CONFIDENTIALITY AND QUERY CONFIDENTIALITY:
Inherited from the BDMRS scheme, the EDMRS scheme can protect index
confidentiality and query confidentiality in the known background
model. Due to the utilization of phantom terms, the confidentiality is
further enhanced, as the transformation matrices are harder to figure out.
QUERY UNLINKABILITY:
By introducing the random value ε, the same search requests will
generate different query vectors and receive different relevance score
distributions. Thus, query unlinkability is better protected. However,
since the proposed scheme is not designed to protect the access pattern for
efficiency reasons, a motivated cloud server can analyze the similarity
of search results to judge whether the retrieved results come from the
same request. In the proposed EDMRS scheme, the data user can
control the level of unlinkability by adjusting
the value of εv. This is a trade-off between accuracy and privacy, which
is determined by the user.
KEYWORD PRIVACY:
The BDMRS scheme cannot resist the TF statistical attack in the known
background model, as the cloud server is able to deduce or identify
keywords by analyzing the TF distribution histogram.
[...] document identity i and updates the vector D of the other nodes in subtree
Ts, so as to generate the updated subtree Ts'. In particular, if the
deletion of the leaf node breaks the balance of the binary index tree, we
replace the deleted node with a fake node whose vector is padded with 0
and whose file identity is null. Then, the data owner encrypts the vectors stored
in the subtree Ts' with the key set SK to generate the encrypted subtree
Is', and sets the output ci to null. If updtype is equal to Ins, the data owner
generates a tree node u = ⟨GenID(), D, null, null, i⟩ for the document
fi, where D[j] = TF_{fi,wj} for j = 1, ..., m. Then, the data owner inserts
this new node into the subtree Ts as a leaf node and updates the vector
D of the other nodes in subtree Ts according to the Formula, so as to
generate the new subtree Ts'. Here, the data owner preferably
replaces the fake leaf nodes generated by the Del operation with newly
inserted nodes, instead of directly inserting new nodes. Next, the data
owner encrypts the vectors stored in subtree Ts' with the key set SK as
described to generate the encrypted subtree Is'.
Finally, the document fi is encrypted to ci.
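The plaintext side of these update operations can be sketched as follows. This is a simplified illustration, not the scheme's actual data structure: the tree is stored in a heap-style array with a power-of-two number of leaves, internal vectors are assumed to be the element-wise maximum of their children, insertion only reuses an existing fake leaf, and the encryption of the touched vectors is omitted. The class and method names are hypothetical.

```python
from typing import List, Optional

class KBBTree:
    """Array-backed balanced binary tree: node i has children 2i and 2i + 1 (an assumption)."""

    def __init__(self, vectors: List[List[float]]):
        self.n = len(vectors)                  # must be a power of two in this sketch
        self.m = len(vectors[0])
        self.vec: List[Optional[List[float]]] = [None] * (2 * self.n)
        self.fid: List[Optional[int]] = [None] * (2 * self.n)
        for i, v in enumerate(vectors):
            self.vec[self.n + i] = list(v)
            self.fid[self.n + i] = i + 1
        for i in range(self.n - 1, 0, -1):
            self._recompute(i)

    def _recompute(self, i: int) -> None:
        left, right = self.vec[2 * i], self.vec[2 * i + 1]
        self.vec[i] = [max(a, b) for a, b in zip(left, right)]

    def _refresh_ancestors(self, leaf: int) -> int:
        """Recompute every node on the path to the root; return how many were touched."""
        touched, i = 0, leaf // 2
        while i >= 1:
            self._recompute(i)
            touched += 1
            i //= 2
        return touched

    def delete(self, fid: int) -> int:
        """Del: turn the leaf into a fake node (zero vector, null identity) to keep balance."""
        leaf = self.fid.index(fid)
        self.vec[leaf] = [0.0] * self.m
        self.fid[leaf] = None
        return self._refresh_ancestors(leaf)

    def insert(self, fid: int, vector: List[float]) -> int:
        """Ins: reuse the first fake leaf left behind by an earlier deletion."""
        leaf = next(i for i in range(self.n, 2 * self.n) if self.fid[i] is None)
        self.vec[leaf] = list(vector)
        self.fid[leaf] = fid
        return self._refresh_ancestors(leaf)

tree = KBBTree([[0.1, 0.9], [0.8, 0.2], [0.3, 0.3], [0.5, 0.4]])
touched_del = tree.delete(2)              # f2 becomes a fake node
touched_ins = tree.insert(5, [0.7, 0.1])  # f5 takes over the fake leaf
print(touched_del, touched_ins, tree.vec[1])
```

Each operation touches only the nodes between the affected leaf and the root, which is log n nodes for n leaves; in the scheme these are the vectors the data owner re-encrypts and sends to the server as the new subtree.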
{I', C'} ← Update(I, C, updtype, Is', ci): In this algorithm, the cloud
server replaces the corresponding subtree Is (the encrypted form of Ts)
with Is', so as to generate a new index tree I'. If updtype is equal to Ins, the cloud
server inserts the encrypted document ci into C, obtaining a new
[...] ranked results. A larger rank privacy denotes higher security of the
scheme, as illustrated. In the proposed scheme, data users can
meet different requirements on search precision and privacy by
adjusting the standard deviation σ, which can be treated as a balance
parameter. We compare our schemes with a recent work proposed by
Sun et al., which achieves high search efficiency. Note that our BDMRS
scheme retrieves the search results through exact calculation on the
document vector and query vector. Thus, the top-k search precision of the
BDMRS scheme is 100%. But as a similarity-based multi-keyword
ranked search scheme, the basic scheme of Sun et al. suffers from precision loss
due to the clustering of sub-vectors during index construction. In each
precision test of the basic scheme, 5 keywords are
randomly chosen as input, and the precision of the returned top 100 results
is observed. The test is repeated 16 times, and the average precision is
91%.
EFFICIENCY:
INDEX TREE CONSTRUCTION:
The process of index tree construction for document collection F
includes two main steps: 1) building an unencrypted KBB tree based on
the document collection F, and 2) encrypting the index tree with the
splitting operation and two multiplications by an (m × m) matrix. The
index structure is constructed following a post-order traversal of the tree
[...] Sun's code, we divide 4000 keywords into 50 levels. Thus, each level
contains 80 keywords. According to that work, the higher the level at which
the query keywords reside, the higher the search efficiency is. In our experiment,
we choose ten keywords from the 1st level for the search efficiency
comparison. The results show that if the query keywords are chosen from the 1st level,
our scheme obtains almost the same efficiency as Sun et al.'s scheme when we start 4
threads, and that the search efficiency of our scheme increases considerably when
we increase the number of threads from 1 to 4. However, when we
continue to increase the number of threads, the search efficiency does not improve
remarkably. Our search algorithm can be executed in parallel to improve
the search efficiency, but all the started threads share one result list
RList in a mutually exclusive manner. When we start too many threads,
the threads spend a lot of time waiting to read and write the
RList. An intuitive method to handle this problem is to construct multiple
result lists. However, in our scheme, this does not help to improve the
search efficiency much, because we need to find k results for
each result
list, and the time complexity for retrieving each result list is O(θ m log n / l).
In this case, the multiple threads will not save much time, and selecting
k results from the multiple result lists will further increase the time
consumption. In Fig. 8, we show the time consumption when we
start multiple threads with multiple result lists. The experimental results
show that our scheme obtains better search efficiency when we start
multiple threads with only one result list.
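The single shared result list with mutually exclusive access can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the class SharedRList and the function worker are hypothetical names, the relevance scores are synthetic, and each thread scans a flat chunk of pre-scored documents instead of traversing its own part of the index tree.

```python
import heapq
import threading

class SharedRList:
    """Top-k result list shared by all search threads and guarded by one lock."""

    def __init__(self, k):
        self.k = k
        self._heap = []                     # min-heap of (score, fid)
        self._lock = threading.Lock()

    def kth_score(self):
        with self._lock:
            return self._heap[0][0] if len(self._heap) == self.k else 0.0

    def offer(self, score, fid):
        with self._lock:                    # re-check under the lock before writing
            if len(self._heap) < self.k:
                heapq.heappush(self._heap, (score, fid))
            elif score > self._heap[0][0]:
                heapq.heapreplace(self._heap, (score, fid))

    def results(self):
        with self._lock:
            return sorted(self._heap, reverse=True)

def worker(chunk, rlist):
    """Each thread reads the shared kth score to prune, then writes only if needed."""
    for fid, score in chunk:
        if score > rlist.kth_score():
            rlist.offer(score, fid)

# Synthetic, pairwise distinct relevance scores for 40 documents.
scores = [(fid, ((fid * 37) % 101) / 100) for fid in range(1, 41)]
k, n_threads = 5, 4

rlist = SharedRList(k)
chunks = [scores[i::n_threads] for i in range(n_threads)]
threads = [threading.Thread(target=worker, args=(c, rlist)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
top_k = rlist.results()
print(top_k)
```

Every read of the kth score and every write goes through the same lock, which is why adding threads beyond a certain point mostly adds waiting time, as observed in the experiments above.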
UPDATE EFFICIENCY:
In order to update a leaf node, the data owner needs to update log n
nodes. Since each node involves an encryption operation on its index vector,
which takes O(m^2) time, the time complexity of an update operation
is thus O(m^2 log n). We illustrate the time cost for the deletion of a
document. Fig. 9(a) shows that when the size of the dictionary is fixed, the
deletion of a document takes nearly logarithmic time in the size of the
document collection. Fig. 9(b) shows that the update time is
proportional to the size of the dictionary when the document collection is
fixed. In addition, the space complexity of each node is O(m). Thus, the
space complexity of the communication package for updating a document
is O(m log n).
to further reduce the time cost. The security of the scheme is protected
against two threat models by using the secure kNN algorithm.
Experimental results demonstrate the efficiency of our proposed scheme.
There are still many challenging problems in symmetric SE schemes. In
the proposed scheme, the data owner is responsible for generating
update information and sending it to the cloud server. Thus, the
data owner needs to store the unencrypted index tree and the information
that is necessary to recalculate the IDF values. Such an active data
owner may not be very suitable for the cloud computing model. It could
be a meaningful but difficult future work to design a dynamic
searchable encryption scheme whose update operation can be
completed by the cloud server only, while preserving the ability to
support multi-keyword ranked search. In addition, like most works
on searchable encryption, our scheme mainly considers the challenge
from the cloud server. Actually, there are many security challenges in a
multi-user scheme. First, all the users usually keep the same secret key
for trapdoor generation in a symmetric SE scheme. In this case,
revocation of a user is a big challenge. If a user needs to be revoked in
this scheme, we need to rebuild the index and distribute the new secret
keys to all the authorized users. Second, symmetric SE schemes
usually assume that all the data users are trustworthy. This is not practical,
and a dishonest data user will lead to many security problems. For
example, a dishonest data user may search the documents and distribute