Abstract—The recent rapid growth in Internet speeds and file storage requirements has made cloud storage an appealing option on both a personal and enterprise level. Despite the many benefits offered by cloud storage, many potential users with sensitive data refrain from fully utilizing this service due to valid concerns about information privacy. An established solution to this concern is to perform encryption on the user side with the key stored on a local machine, meaning the cloud will never see the user's plaintext data. However, by encrypting data on the user side, data processing capabilities (e.g., searching) are lost. In particular, the ability to search semantically is of interest to users with large datasets. In this paper, we present S3C, a system that provides a semantic search functionality over encrypted data in the cloud. S3C combines approaches from traditional keyword-based searchable encryption and semantic web searching. It offers a user-transparent experience that accepts a simple multi-phrase query and returns a list of documents ranked by semantic relevance to the query. Our proposed approach is space-efficient, which makes it suitable for large-scale datasets. Our minimal processing also allows the system to be run on thin clients such as smart-phones or tablets. We evaluate the performance of our system against various real-world datasets, and our results show that it produces accurate search results while maintaining minimal storage overhead (∼0.3% of the dataset size).

Index Terms—Cloud services, Searchable Encryption, Semantic Search.

I. INTRODUCTION

Cloud storage is an efficient and scalable solution for companies and individuals who want to store very large numbers of files without the burden of maintaining their own data center. Despite the advantages offered by these solutions, many potential clients abstain from using them due to valid concerns over the security of the files once they are on remote servers, and thus desire stronger cloud security [9]. Traditionally, cloud providers provide security by encrypting user documents on their own servers and storing the encryption key remotely, allowing internal attackers to access unauthorized data. One proven solution that addresses this concern is to perform the encryption locally on the user's machine before the data is transferred to the cloud [18]. Unfortunately, this limits the user's ability to interact with the data, most importantly limiting the ability to search over it. Although solutions for searchable encryption exist, they often do not consider the semantic meaning of the user's query, impose a large storage overhead, or do not rank documents based on their relevance to the query. This research is an attempt to solve these issues.

Our motivation in this research is an organization with increasingly large amounts of data with sensitive information, with system users who may not remember exact keywords in the documents they are looking for or may want to retrieve documents similar to what they are asking for. Users can potentially also require the ability to perform the search on their thin client devices. One example of such an organization is a hospital with encrypted patient records on the cloud and staff who would like to be able to find patients with similar diagnoses using a tablet. Another example is a police organization with officers who would like to search over encrypted police records on government clouds while on the move with their PDAs. An organization with this desire would need a system that provides security for their documents and a searching mechanism in which the user only has to enter a plaintext search query and receive search results ranked based on each document's relevance to the entered query. Therefore, users see the most relevant documents first and do not have to comb through a list of all relevant documents.

Other solutions to the problem of searchable encryption often fall short in three ways that make them unsuitable for the examples described above. First, they do not offer semantic searching, meaning the user would need to remember exact keywords in the documents they are searching for. This is inappropriate for a big data environment, as it is unlikely that the users would be able to remember exact keywords very well [2]. Second, solutions that do offer semantic searching either only make the system more forgiving of typos and words with similar spellings, or utilize large semantic networks that need to be stored locally, making them inappropriate for thin clients (e.g., [19]). Third, solutions that offer semantic searching often do not rank the related files by their relevance to the query (e.g., [12]), instead offering only a boolean search which returns a potentially huge pool of all related files.

Therefore, we can define our problem as needing to answer these four questions:
• How to semantically search a multi-phrase query over encrypted files stored in the cloud?
• How to rank results of a search based on semantic relevance to the user's query?
• How to do the search processing on the cloud without revealing data to the cloud?
• How to provide an approach that imposes the minimum storage and processing overhead?

In this paper, we introduce Secure Semantic Search over encrypted data in the Cloud (S3C), a scalable system that
Moataz et al. [15] used various stemming methods on terms
in the index and query to provide more general matching. Sun
et al. [19] presented a system which used an indexing method
over encrypted file metadata and data mining techniques to
capture semantics of queries. This approach, however, builds a semantic network using only the documents in the given set and considers only words that are likely to co-occur as semantically related, leaving out many possible synonyms or categorically related terms.
III. SYSTEM MODEL
The problem of multi-phrase searching can be formally
represented using the following elements:
• A Vocabulary of plaintext words V = {v1, v2, v3, . . . , vn} which constitutes a language (e.g., English)
• A Document (represented as a set of words) di = {di1, di2, di3, . . . , din} where dij ∈ V
• A multi-phrase Query q = {q1, q2, q3, . . . , qn} where qi ∈ V
• A Collection of documents C = {d1, d2, d3, . . . , dN}
• A list of Relevant Documents R(q) ⊆ C where R is a function for determining relevance based on a query
The aim of the search system is to find R(q) using q as a
guide for what elements of C it should contain.
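As a minimal sketch of this formal model, the collection C can be served by a plain inverted index, with R(q) taken to be the set of documents containing at least one query term. This is an illustrative outline, not the paper's implementation; all names are hypothetical:

```python
from collections import defaultdict

def build_inverted_index(collection):
    """Map each vocabulary word v to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, words in collection.items():
        for w in words:
            index[w].add(doc_id)
    return index

def relevant_documents(index, query):
    """R(q): ids of documents containing at least one term of q."""
    result = set()
    for term in query:
        result |= index.get(term, set())
    return result

# Example: C = {d1, d2}, q = {"cloud", "search"}
C = {"d1": {"cloud", "storage", "secure"}, "d2": {"semantic", "search"}}
I = build_inverted_index(C)
print(sorted(relevant_documents(I, {"cloud", "search"})))  # ['d1', 'd2']
```

Later sections refine this boolean notion of R(q) into a weighted, ranked one.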
To ensure the results of R(q) are as relevant as possible, we
consider adding semantics to the searching process. This adds
the following elements:
• A modification process M(q) which enriches q with semantic data.
• A modified query set Q = M(q) which contains additional terms and ideas related to q.
• A weighting system W(Q) to weight the terms in Q based on their closeness to the original query.
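The elements above can be sketched together as follows. This is an illustrative outline only: it assumes the splitting and weighting rules the paper gives later (the original query keeps weight 1, split terms get 1/n, and a term's synonyms share its weight divided by their count), uses a stubbed dictionary in place of the online thesaurus lookup, and uses SHA-256 as a stand-in for the paper's unspecified hash function H:

```python
import hashlib

def split_query(q):
    """Break the multi-phrase query into its individual words."""
    return q.split()

def modify_query(q, synonyms):
    """M(q) with W(Q): build the weighted, semantically expanded query set Q."""
    Q = {q: 1.0}                      # the original query, weight 1
    parts = split_query(q)
    for part in parts:
        Q[part] = 1.0 / len(parts)    # split terms, weight 1/n
    for term, w in list(Q.items()):
        syns = synonyms.get(term, [])
        for s in syns:
            Q[s] = w / len(syns)      # synonyms of a term, weight W(term)/m
    return Q

def make_trapdoor(Q):
    """Hash every member of Q to form the trapdoor sent to the cloud."""
    return {hashlib.sha256(t.encode()).hexdigest(): w for t, w in Q.items()}

# A toy thesaurus standing in for the online lookup.
thesaurus = {"illness": ["disease", "sickness"]}
Q = modify_query("patient illness", thesaurus)
print(Q)
# {'patient illness': 1.0, 'patient': 0.5, 'illness': 0.5,
#  'disease': 0.25, 'sickness': 0.25}
```

The hashed trapdoor carries only opaque tokens and their weights, which is what allows the cloud to match index entries without learning the plaintext query.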
Introducing semantic data to the search process allows the
system to return results that are more meaningfully related
to the original query. Weighting is utilized to ensure that
the original terms in a document contribute more to that document's ranking than a related term.
The introduction of encryption adds the following elements:
• A ciphertext version of the original Vocabulary V′ = {H(v1), H(v2), H(v3), . . . , H(vn)} where H is a hash function
• A Collection of encrypted documents C′ = {E(d1), E(d2), E(d3), . . . , E(dN)} where E is an encryption method
• A list of relevant documents R′(q) ⊆ C′
Formally, the challenge is to find the relevant list of elements in C′ while still using a plaintext multi-phrase query, and to have R′(q) be as similar to R(q) as possible.

IV. S3C ARCHITECTURE

S3C has three main components, namely the client application, the cloud processing server, and cloud storage. The lightweight client application is hosted on the user's machine, and is the only system in the architecture that is assumed to be trusted. Both cloud units are expected to be maintained by a third-party cloud provider and are thus considered "honest but curious". In our threat model, both cloud systems and the network channels between all machines should be considered open to both external and internal attacks. Figure 1 presents an overview of the three components and the processes associated with them in the system.

Fig. (1) Overview of S3C architecture and processes. Parts within the solid-line group indicate items or processes on the client side which are considered trusted. Parts in the dashed-line group indicate those in the cloud processing server. All components in the cloud are considered untrusted.

A. Client Application

The client application provides an interface for the user to perform a document upload or search over the data in the cloud. It is responsible for parsing and extracting keywords from plaintext documents and encrypting them before they are uploaded to the cloud.
When the user requests to search, S3C expands the query based on the system's semantic scheme and transforms the query into the secure query set (i.e., trapdoor) to be sent to the cloud. The user will then receive a ranked list of documents
and can select a file for the system to download and decrypt.

B. Cloud Processing Server

The cloud server is responsible for constructing and updating the inverted index and other related data structures based on the parsed and processed data sent from the client. The structures are created entirely out of hashed tokens to keep the server oblivious to the actual file content.
When the server detects that the client has requested to search, it will receive the trapdoor, perform the search over its index (see Section V), and give each related document a score. Once the highest ranking documents are determined, the server can request to retrieve them from the cloud storage and send them back to the client.

C. Cloud Storage

The cloud storage block is used to store the encrypted files that the user uploads. It will not see any representation of the user's query. The storage can potentially span multiple clouds, so long as the computing server knows where each document is stored and the index is updated accordingly.

V. OVERVIEW OF UPLOAD AND SEARCH PROCESSES

A. Overview

The main idea behind our approach is that if we can use a searching method that is agnostic towards the meaning of the terms in the documents or the query and only considers their occurrence and frequency, then we can perform the search over encrypted data. For that purpose, we should assure that we transform each occurrence of a distinct word in every document into the same token, and consequently apply the same transformation when that word appears in the search query. Doing this ensures that a match is still produced during the search process. Hashing is a good way to achieve this.
The Okapi BM25 algorithm [16], frequently used for standard text retrieval, is a term-frequency, inverse-document-frequency model that works using an inverted index. The algorithm does not need to consider the actual meaning of the terms in the document, and only needs to know in which documents they exist. This feature makes it very applicable to our use case.
S3C has two main functionalities: uploading documents from the client, and searching over documents in the cloud. We explain these functionalities in the rest of this section.

B. Upload Process

The goal of the upload process is to parse the desired document into indexable information and encrypt it before it is sent to the cloud. In general, a subset of terms from the document (termed keywords) is selected to represent the semantics of that file. In addition, the term frequency of the keywords within that document is gathered, then the terms are transformed individually into their hashed form and written to a temporary key file to be sent to the cloud along with the full encrypted text file.
Once the cloud processing server receives the encrypted document file and associated key file, it moves the encrypted document into storage. Then the terms and frequencies in the key file are added to the hashed index, which associates a hashed term with a list of documents it appeared in. The size of the uploaded document is also recorded within the index. Our system also supports batch uploading of many data files at once and processes them as a series of individual files with linear complexity.

C. Search Process

The search process consists of two main phases: query modification, and index searching and ranking. The query modification phase starts with the user entering a plaintext query into the client application. The query is then modified on the client side and sent to the cloud processing server, where index searching and ranking is performed. The process of query modification takes in the original query q and expands it into the modified query set Q. It involves three phases: query splitting, semantic expansion, and weighting.
The goal of splitting the query is to break q into smaller components. This is done because a multi-phrase string hashes to a different value than the sum or concatenation of the hash values of its parts, and once on the cloud, the terms must match the entries in the hashed index exactly. Once this phase is complete, Q will consist of q and its split parts.
In order to achieve semantic expansion, the system injects semantic data through the use of online ontological networks. The most naïve approach to this is to perform a synonym lookup for each member of Q (termed Qi) through an online thesaurus and add the results to Q. This assures that the search results will include documents containing terms synonymous with, but not exactly matching, the user's query.
However, this approach alone does not cover ideas that are semantically related to the user's query, but are not synonymous. To achieve this, our system pulls from more advanced ontological networks. For example, in this research we use the contents of Q to pull entries from Wikipedia and perform keyphrase extraction on them to get related terms and phrases (hereafter referred to as related terms). These related terms are then added to Q. The result of this is that the search can retrieve documents that contain concepts more abstractly related to the user's query (e.g., related diseases). In addition, the use of online resources relieves the client of the need to store semantic networks locally.
The goal of weighting is to ensure that the search results are more relevant to the user's original query than the synonyms and related terms. For example, a document that matches the entire original query should be weighted higher and considered more relevant than a document that only matches synonyms. To achieve this, we introduce the following weighting scheme with weights ranging from 0 to 1:
• The original query q is weighted as 1.
• Results of query splitting are weighted as 1/n where n is the number of terms derived from splitting.
• Synonyms or related terms of a term Qi are weighted as W(Qi)/m, where W(Qi) is the weight of Qi and m is the number of synonyms or related terms derived from Qi.
These weights are added to all members of Q to complete the modified query set.
Once the entirety of Q is built, its members are hashed to create the trapdoor Q′ which is sent to the cloud to perform
the index search and ranking. On the cloud processing server, the system goes through each member of Q′ and checks it against the hashed index to compile a list of files that could be considered related to the query. These related files are further ranked using our modification of the BM25 equation, described in the following equations:

r(d_i, Q') = \sum_{i=1}^{n} \mathrm{IDF}(Q'_i) \cdot \frac{f(Q'_i, d_i) \cdot (\alpha + 1)}{f(Q'_i, d_i) + \alpha \cdot \left(1 - \beta + \beta \cdot \frac{|d_i|}{\delta}\right)} \cdot W(Q'_i)    (1)

IDF in this equation refers to the inverse document frequency for the term, which can be defined as:

\mathrm{IDF}(Q'_i) = \log \frac{N - n(Q'_i) + 0.5}{n(Q'_i) + 0.5}    (2)

We define the terms in these equations as follows:
• Qi - an individual term in the original plaintext query
• Q′i - the hashed version of Qi in the hashed query set
• r(di, Q′) - the ranking score attributed to document di for hashed query set Q′
• f(Q′i, di) - the frequency of term Qi in document di
• N - the total number of documents in the collection C
• n(Q′i) - the total number of documents containing the query term Qi
• |di| - the length of document di in words
• δ - the average length of all documents in C
• W(Q′i) - the weight associated with term Qi
• α and β - constants (in this work we considered the values 1.2 and 0.75, respectively)

The cloud processing server computes this equation for all documents in the collection and returns the list to the client, sorted by score in descending order.

VI. SEARCH SCHEMES

S3C considers three main schemes for how to implement our approach. The primary differences among the proposed schemes are: how to select the subset of terms to represent the document, split the user search query, and perform ranking.

A. Naïve Scheme: Full Keyword Semantic Search (FKSS)

FKSS follows the naïve method of selecting terms as keywords. It simply goes through the document and collects and counts the frequency of each individual word that is not considered a stopword. This gives the hashed index the full scope of the document, as no meaningful text is left out, but bloats it with possibly unneeded terms.
FKSS also follows a naïve method of splitting the query, as it just divides it into singular words. This is all that is necessary, as the keyword selection for the hashed index only considers single words; thus splitting the query into larger groups of words would add no value.
Ranking for FKSS is performed with no modification from Equation (1).
Though FKSS follows a naïve approach, it can be useful for scenarios in which small documents are used or it is integral for the full text to be considered, for example, searching over encrypted media tags or social media updates. It is the least secure scheme, however, as it leaves the full scope of each document in the hashed index.

B. Space-Efficient, Fully Secure Scheme: Selected Keyphrase Semantic Search (SKSS)

SKSS creates a space-efficient index by running the document through a keyphrase extractor to obtain a constant number of the most important keywords and phrases within the document (in our implementation, we considered collecting 10 keyphrases). These phrases can be considered to convey general information on what the document is about. Because they can contain more than one word, they are broken down into their individual distinct words, so that the key file sent to the server contains hashed representations of both the full phrase and each word within it. The use of a constant number of terms per document keeps storage overhead small and increases security, as much of the document is not put in the hashed index.
In an effort to further increase security, term frequency is eliminated on the justification that each term is considered to be equally important to the meaning of the document, and thus can be considered equally frequent within the document.
To split the query, SKSS splits not only into individual words, but into all possible adjacent subsets. An example of this can be seen in Figure 2. While some of the phrases added to the set might be meaningless ("Failure Wireless Sensor", for example), others will carry meaning that will be important during the semantic lookup ("Sensor Networks", for example). Once the splitting is complete, synonyms and related terms are looked up for all of the resulting phrases in the query set.

"Failure in Wireless Sensor Networks", "Failure Wireless Sensor", "Wireless Sensor Networks", "Failure Wireless", "Wireless Sensor", "Sensor Networks", "Failure", "Wireless", "Sensor", "Networks"
Fig. (2) A sample of the query splitting done by SKSS.

When performing ranking, SKSS modifies Equation (1) to compensate for the lack of frequency data. Because the keyphrase extractor pulls a limited number of terms from the document, all extracted phrases are considered equally frequent. Thus, a 1 is put in place of f(Q′i, di).

C. Space-Efficient, Accuracy Driven Scheme: Keyphrase Search With Frequency (KSWF)

KSWF is a combination of the two previous schemes. The keyphrase extractor is still used to obtain keywords for the index, similar to SKSS, and the phrases are subsequently split into their individual words. After this, it makes a second pass through the document to collect the frequency information for each word and phrase, similar to FKSS, which is stored alongside the terms in the index.
The user query is split in the same manner as SKSS, with each adjacent subset added to the overall query set. Because the frequency data is now present for all of the terms and phrases, it uses the same ranking method as FKSS. This scheme was developed primarily to analyze the impact of
utilizing term frequency with a method like SKSS. Intuitively, adding term frequency should bring up more relevant search results, as there is more accurate data for the ranking.
The addition of frequency data to KSWF adds greater accuracy to the ranking function. For this reason, it is useful in scenarios in which the highest accuracy possible is desired while maintaining minimal storage overhead.

VII. SECURITY ANALYSIS

S3C provides a trustworthy architecture for storing confidential information securely in clouds while maintaining the ability to search over them. The only trusted component of the architecture is the user machine, which has access to all sensitive information such as the full plaintext documents and the document key files. Keeping the client machine trusted is a reasonable assumption in the real world, as it can be kept with minimal exposure to outside attackers.
Our threat model assumes that adversaries may intend to attack the communication streams between client and cloud processing server and between cloud processing server and cloud storage, as well as the cloud processing server and storage machines themselves. To explain what exactly the attacker could see or do, we will first introduce some definitions.
History: For a multi-phrase query q on a collection of documents C, a history Hq is defined as the tuple (C, q). In other words, this is a history of searches and interactions between client and cloud server.
View: The view is whatever the cloud can actually see during any given interaction between client and server. For our system, this includes the hashed index I over the collection C, the trapdoor of the search query terms (including its semantic expansion) Q′, the number and length of the files, and the collection of encrypted documents C′. Let V(Hq) be this view.
Trace: The trace is the precise information leaked about Hq. For S3C, this includes file identifiers associated with the search results of the trapdoor Q′. It is our goal to allow the attacker to infer as little information about Hq as possible.
The view and trace encompass all that the attacker would be able to see. For the sake of this analysis, we will assume that the chosen encryption and hashing methods are secure, and so C′ itself will not leak any information. I only shows a mapping of a single hashed term or phrase to a set of file identifiers with frequencies, meaning a distribution of hashes to files could be compiled, but minimal data could be gained from the construction. Similarly, Q′ only shows a listing of hashed search terms with weights. The addition of the weights could potentially enable the attacker to infer which terms in the trapdoor were part of the original query, but they would still only have a smaller set of hashed terms.
However, we must consider the small possibility that, if the attacker was able to know the hash function used on the client side, they could in theory build a dictionary of all words in the vocabulary V that the documents are comprised of, mapped to their hashed counterparts, and reconstruct I in plaintext. In this scenario, the attacker could put together the terms that the documents are comprised of, but since I carries no sense of term order, they could not reconstruct the entire file. The KSWF scheme adds additional security by only showing a small portion of the important terms and phrases from the document, meaning the attacker would only be able to ascertain how many times those specific terms and phrases were in the document. The SKSS scheme adds more security by removing those term frequencies.
An attacker monitoring the process during a search could see the resultant file identifiers that are associated with the given Q′. This would show an encrypted history as (C′, Q′). However, since the attacker would not be able to discern the query (without the use of the above dictionary), this data would be of little use.
Attackers could also potentially attempt to alter data in C′. These attacks, however, could be recognized, as the client would not be able to decrypt the altered files.

VIII. PERFORMANCE EVALUATION

A. Overview

To evaluate the performance of S3C and provide proof of concept, we tested it with the Request For Comments (RFC) dataset, a set of documents containing technical notes about the Internet from various engineering groups. The dataset has a total size of 357 MB and is made up of 6,942 text files. To evaluate our system under Big data scale datasets, we utilized a second dataset, the Common Crawl Corpus from AWS, a web crawl composed of over five billion web pages. We evaluated our system against the RFC dataset using three types of metrics: Performance, Overhead, and Relevance.

B. Metrics for Evaluation

1) Relevance: We define relevance as how closely the returned results meet user expectations. To evaluate the relevance of our schemes, we used the TREC-Style Average Precision (TSAP) method described by Mariappan et al. in [14]. This method is a modification of the precision-recall method commonly used for judging text retrieval systems. It is defined as follows:

\mathrm{Score} = \frac{\sum_{i=1}^{N} r_i}{N}    (3)

Where i is the rank of the document determined by the system and N is the cutoff number (10 in our case, hence the term TSAP@10). ri takes three different values:
• ri = 1/i if the document is highly relevant
• ri = 1/(2i) if the document is somewhat relevant
• ri = 0 if the document is irrelevant
This allows for systems to be given a comparative score against other schemes in a relatively fast manner.
2) Performance: We define performance as the time it takes to perform the search operation. The aspects of performance we measure are as follows:
• Time it takes to process the user query in seconds. This includes semantic query modification and hashing into the trapdoor.
• Time it takes to search over the index in the cloud in seconds. This includes retrieving the related files from the index and ranking them based on the query.
• Total time to perform the search in seconds. This encapsulates both of the steps above, plus any additional time taken with communication over the network.
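The TSAP@10 relevance metric described above can be sketched as a small scoring function. This is an illustrative outline under the stated r_i rules, with per-result judgments encoded as the hypothetical labels "high", "some", and "none":

```python
def tsap_at_n(judgments, n=10):
    """TSAP score: sum of r_i over the top-n results, divided by n.

    r_i = 1/i for a highly relevant result at rank i,
    r_i = 1/(2i) for a somewhat relevant result, 0 otherwise.
    """
    score = 0.0
    for i, judgment in enumerate(judgments[:n], start=1):
        if judgment == "high":
            score += 1.0 / i
        elif judgment == "some":
            score += 1.0 / (2 * i)
    return score / n

# Ranked results: first highly relevant, second somewhat, third irrelevant.
print(tsap_at_n(["high", "some", "none"], n=10))  # (1 + 0.25) / 10 = 0.125
```

Because r_i decays with rank, the metric rewards schemes that place the most relevant documents first, which is exactly the behavior the ranking function is meant to produce.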
3) Overhead: We define overhead to be the imposed cloud server storage space taken by the hashed index and the computing involved with it. The aspects of overhead we measure are as follows:
• Size of the inverted index, measured in the form of number of entries.
• Time it takes to construct the index in seconds. This operation reads the data files for the index and compiles them into a hash table. It is only performed on cloud server startup.

C. Benchmarks

We derived a set of benchmark queries based on the information presented in the dataset. For testing relevance, we looked at two categories of queries which a user may desire to search. In the first category we consider a user who already knows which document they are looking for, but may not remember where the document is located in their cloud or may not want to look through a large number of files to find it. Such queries are typically specific and only a small number of documents should directly pertain to them. The search system is expected to bring up these most desired documents first.
In the second category we consider a user who wants to find all of the documents related to an idea, such as the nurse attempting to find all patients with a similar disease or diagnosis in our motivation. Such queries would be broad with many possible related documents, and the search system should bring up the most relevant ones first.

• Category 1 - Specific:
  IBM Research Report (IRR)
  Licklider Transmission Protocol (LTP)
  Multicast Listener Discovery Protocol (MLDP)
• Category 2 - Broad:
  Internet Engineering (IE)
  Transmission Control Protocol (TCP)
  Cloud Computing (CC)
  Encryption (EN)
Fig. (3) Queries used for testing relevance. Queries in category 1 target a small set of specific, known documents within the collection, while queries in category 2 target a broad set of documents not necessarily known to the user.

To measure performance, we measured time for a small (single word) query and a mid-size (three word) query. In addition, to measure the effects of expanding the size of the search query, we measured times for queries that expanded from one word to four words, taking measurements at each single word increment. Due to the inherent variety in the performance results, we report the mean and 95% confidence interval of 50 rounds of running each experiment.
For our scalability tests, we measured search times and storage overhead for several three-word queries against increasingly large portions of the dataset. Specifically, we tested against datasets of sizes: 500 MB, 1 GB, 5 GB, 10 GB, 25 GB, and 50 GB.
As a baseline for performance testing, we implemented a standard non-secure (SNSS) version of the system, utilizing the same semantic processing but with no encryption or hashing. Due to their similarities in indexing, the SNSS and FKSS schemes can be seen as being grouped together, as they both consider the entirety of the document text. Similarly, the SKSS and KSWF schemes can be grouped together since they both consider a small subset of the document text.

Fig. (4) TSAP@10 score for the specified query for each system. Once the system has returned a ranked list of results, a score is computed based on a predetermined relevance each file has to the given query.

D. Evaluating Relevance

Figure 4 shows the TSAP scores of each of the four schemes searching with each of the benchmark queries.
For queries in category 1, the main desired results were ranked the highest for all schemes. Our space-efficient schemes (SKSS and KSWF), which might intuitively seem to suffer greatly in accuracy, suffer only a small amount when compared to the schemes that utilize the document's full text. For queries in category 2, the SKSS and KSWF schemes returned results that were just as relevant, and in some cases more relevant. Most interestingly, the KSWF scheme does not actually show much benefit from the addition of term frequency, meaning that when working with a small subset of the document's text, finding the frequency of those key phrases may be unnecessary, as it causes a longer indexing time, unless the highest possible accuracy is desired.

E. Evaluating Performance

In the experiments, we measured the performance of each scheme with a small (one-word) and mid-sized (three-word) query, gathering the total time it takes to perform the search. In addition, we measured the two main components of the total search time: the time taken for query modification and the time taken to perform the index search and ranking on the cloud.
Results can be seen in Figures 5, 6, and 7. All schemes can be seen to be reasonably similar in terms of total search time. The majority of search time across all models is taken up by the query processing phase, as our system needs to pull information from across the Internet in the form of synonyms and Wikipedia entry downloads. SKSS and KSWF both take
slightly longer to process longer queries due to the addition of the adjacent query subsets, which need to be looked up as well. Query processing time is thus linked to Internet speeds and the size of the Wikipedia entry for each of the query terms. The results indicate that under fast Internet speeds, the performance time of this system will naturally improve. While pulling information from the Internet does naturally increase search times, it was included intentionally to reduce the storage size needed on the local client, which would otherwise need to house an onboard ontological network.
Most important to note is the difference in index searching times. The space-efficient SKSS and KSWF schemes take a near-negligible amount of time to search over the index. This can be explained by the vastly decreased index size (as shown in the next subsection): as only key phrases are stored, the initial set of potentially relevant documents is significantly smaller and the ranking equation must be run a lower number of times.
Because the greatest amount of time is taken during query processing and index search time is very small for the space-efficient schemes, these two schemes can be scaled up to work on larger datasets without facing a huge growth in search time.

Fig. (5) Total search time in each scheme. This includes the time taken to process the query, communicate between client and server, and perform searching over the index. The results are averaged over 50 runs.

Fig. (6) Time to process the query. This includes query modification and hashing into the trapdoor. The results are averaged over 50 runs.

Fig. (7) Time it takes to perform the search on the hashed index on the cloud. This includes the time taken to find all files in the hashed index that contain any hashed terms in the query trapdoor and rank them with the scheme's respective functions. The results are averaged over 50 runs.

Fig. (8) Size of the inverted index for each system. An entry denotes a hashed keyword mapped to a set of file identifiers.

F. Evaluating Overhead

To demonstrate space-efficiency in our evaluation, we measured the overhead for each scheme in terms of how many entries were stored in the hashed index. These results can be seen in Figures 8 and 9. The two groups of schemes show a vast difference in this regard, due to the number of terms selected from each document. The linear growth per document of the index, guaranteed by the constant number of key phrases extracted, keeps the index small while maintaining the relevance of search results (as shown previously).
In addition, we measured the effect that the size of the inverted index had on the time it takes to construct the index from the utility files on the server. The differences are again vast, with construction times being almost negligible for SKSS and KSWF. It is worth noting that this operation needs only to be performed at startup of the cloud server, and that additions to the index at runtime operate at near constant
Fig. (9) Index Construction Time (s).
IEEE International Conference on Data Science and Data Intensive Systems, pages 38–45, Dec. 2015.
[2] Christian Bizer, Peter Boncz, Michael L. Brodie, and Orri Erling. The meaningful use of big data: Four perspectives – four challenges. ACM SIGMOD Record, 40(4):56–60, Jan. 2012.
3731