Abstract—The recent rapid growth in Internet speeds and file storage requirements has made cloud storage an appealing option on both a personal and enterprise level. Despite the many benefits offered by cloud storage, many potential users with sensitive data refrain from fully utilizing this service due to valid concerns about information privacy. An established solution to this concern is to perform encryption on the user side with the key stored on a local machine, meaning the cloud will never see the user's plaintext data. However, by encrypting data on the user side, data processing capabilities (e.g., searching) are lost. In particular, the ability to search semantically is of interest to users with large datasets. In this paper, we present S3C, a system that provides a semantic search functionality over encrypted data in the cloud. S3C combines approaches from traditional keyword-based searchable encryption and semantic web searching. It offers a user-transparent experience that accepts a simple multi-phrase query and returns a list of documents ranked by semantic relevance to the query. Our proposed approach is space-efficient, which makes it suitable for large-scale datasets. Our minimal processing also allows the system to be run on thin clients such as smart-phones or tablets. We evaluate the performance of our system against various real-world datasets, and our results show that it produces accurate search results while maintaining minimal storage overhead (∼0.3% of the dataset size).

Index Terms—Cloud services, Searchable Encryption, Semantic Search.

I. INTRODUCTION

Cloud storage is an efficient and scalable solution for companies and individuals who want to store very large numbers of files without the burden of maintaining their own data center. Despite the advantages offered by these solutions, many potential clients abstain from using them due to valid concerns over the security of the files once they are on remote servers, and thus desire stronger cloud security [9]. Traditionally, cloud providers provide security by encrypting user documents on their own servers and storing the encryption key remotely, allowing internal attackers to access unauthorized data. One proven solution that addresses this concern is to perform the encryption locally on the user's machine before the data is transferred to the cloud [18]. Unfortunately, this limits the user's ability to interact with the data, most importantly limiting the ability to search over it. Although solutions for searchable encryption exist, they often do not consider the semantic meaning of the user's query, impose a large storage overhead, or do not rank documents based on their relevance to the query. This research is an attempt to solve these issues.

Our motivation in this research is an organization with increasingly large amounts of data with sensitive information, with system users who may not remember exact keywords in the documents they are looking for or may want to retrieve documents similar to what they are asking for. Users can potentially also require the ability to perform the search on their thin client devices. One example of such an organization is a hospital with encrypted patient records on the cloud and staff who would like to be able to find patients with similar diagnoses using a tablet. Another example is a police organization with officers who would like to search over encrypted police records on government clouds while on the move with their PDAs. An organization with this desire would need a system that provides security for their documents and a searching mechanism in which the user only has to enter a plaintext search query and receive search results ranked based on each document's relevance to the entered query. Therefore, users see the most relevant documents first and do not have to comb through a list of all relevant documents.

Other solutions to the problem of searchable encryption often fall short in three ways that make them unsuitable for the examples described above. First, they do not offer semantic searching, meaning the user would need to remember exact keywords in the documents they are searching for. This is inappropriate for a big data environment, as it is unlikely that the users would be able to remember exact keywords very well [2]. Second, solutions that do offer semantic searching either only make the system more forgiving of typos and words with similar spellings, or utilize large semantic networks that need to be stored locally, making them inappropriate for thin clients (e.g., [19]). Third, solutions that offer semantic searching often do not rank the related files by their relevance to the query (e.g., [12]), instead offering only a boolean search which returns a potentially huge pool of all related files.

Therefore, we can define our problem as needing to answer these four questions:
• How to semantically search a multi-phrase query over encrypted files stored in the cloud?
• How to rank results of a search based on semantic relevance to the user's query?
• How to do the search processing on the cloud without revealing data to the cloud?
• How to provide an approach that imposes the minimum storage and processing overhead?

In this paper, we introduce Secure Semantic Search over encrypted data in the Cloud (S3C), a scalable system that
Moataz et al. [15] used various stemming methods on terms
in the index and query to provide more general matching. Sun
et al. [19] presented a system which used an indexing method
over encrypted file metadata and data mining techniques to
capture semantics of queries. This approach, however, builds a semantic network using only the documents in the given set and considers only words that are likely to co-occur as semantically related, leaving out many possible synonyms or categorically related terms.
III. SYSTEM MODEL
The problem of multi-phrase searching can be formally
represented using the following elements:
• A Vocabulary of plaintext words V = {v1, v2, v3, . . . , vn} which constitutes a language (e.g., English)
• A Document (represented as a set of words) di = {di1, di2, di3, . . . , din} where dij ∈ V
• A multi-phrase Query q = {q1, q2, q3, . . . , qn} where qi ∈ V
• A Collection of documents C = {d1, d2, d3, . . . , dN}
• A list of Relevant Documents R(q) ⊆ C where R is a function for determining relevance based on a query
The aim of the search system is to find R(q) using q as a
guide for what elements of C it should contain.
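As a minimal sketch of this formal model, the collection C can be served by a plain inverted index, with R(q) taken to be the set of documents containing at least one query term. This is an illustrative outline, not the paper's implementation; all names are hypothetical:

```python
from collections import defaultdict

def build_inverted_index(collection):
    """Map each vocabulary word v to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, words in collection.items():
        for w in words:
            index[w].add(doc_id)
    return index

def relevant_documents(index, query):
    """R(q): ids of documents containing at least one term of q."""
    result = set()
    for term in query:
        result |= index.get(term, set())
    return result

# Example: C = {d1, d2}, q = {"cloud", "search"}
C = {"d1": {"cloud", "storage", "secure"}, "d2": {"semantic", "search"}}
I = build_inverted_index(C)
print(sorted(relevant_documents(I, {"cloud", "search"})))  # ['d1', 'd2']
```

Later sections refine this boolean notion of R(q) into a weighted, ranked one.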
To ensure the results of R(q) are as relevant as possible, we
consider adding semantics to the searching process. This adds
the following elements:
• A modification process M(q) which enriches q with semantic data.
• A modified query set Q = M(q) which contains additional terms and ideas related to q.
• A weighting system W(Q) to weight the terms in Q based on their closeness to the original query.
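The elements above can be sketched together as follows. This is an illustrative outline only: it assumes the splitting and weighting rules the paper gives later (the original query keeps weight 1, split terms get 1/n, and a term's synonyms share its weight divided by their count), uses a stubbed dictionary in place of the online thesaurus lookup, and uses SHA-256 as a stand-in for the paper's unspecified hash function H:

```python
import hashlib

def split_query(q):
    """Break the multi-phrase query into its individual words."""
    return q.split()

def modify_query(q, synonyms):
    """M(q) with W(Q): build the weighted, semantically expanded query set Q."""
    Q = {q: 1.0}                      # the original query, weight 1
    parts = split_query(q)
    for part in parts:
        Q[part] = 1.0 / len(parts)    # split terms, weight 1/n
    for term, w in list(Q.items()):
        syns = synonyms.get(term, [])
        for s in syns:
            Q[s] = w / len(syns)      # synonyms of a term, weight W(term)/m
    return Q

def make_trapdoor(Q):
    """Hash every member of Q to form the trapdoor sent to the cloud."""
    return {hashlib.sha256(t.encode()).hexdigest(): w for t, w in Q.items()}

# A toy thesaurus standing in for the online lookup.
thesaurus = {"illness": ["disease", "sickness"]}
Q = modify_query("patient illness", thesaurus)
print(Q)
# {'patient illness': 1.0, 'patient': 0.5, 'illness': 0.5,
#  'disease': 0.25, 'sickness': 0.25}
```

The hashed trapdoor carries only opaque tokens and their weights, which is what allows the cloud to match index entries without learning the plaintext query.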
Introducing semantic data to the search process allows the
system to return results that are more meaningfully related
to the original query. Weighting is utilized to ensure that
the original terms in a document contribute more to that document's ranking than a related term.
The introduction of encryption adds the following elements:
• A ciphertext version of the original Vocabulary V′ = {H(v1), H(v2), H(v3), . . . , H(vn)} where H is a hash function
• A Collection of encrypted documents C′ = {E(d1), E(d2), E(d3), . . . , E(dN)} where E is an encryption method
• A list of relevant documents R′(q) ⊆ C′
Formally, the challenge is to find the relevant list of elements in C′ while still using a plaintext multi-phrase query, and to have R′(q) be as similar to R(q) as possible.

IV. S3C ARCHITECTURE

S3C has three main components, namely the client application, the cloud processing server, and cloud storage. The lightweight client application is hosted on the user's machine, and is the only system in the architecture that is assumed to be trusted. Both cloud units are expected to be maintained by a third-party cloud provider and are thus considered "honest but curious". In our threat model, both cloud systems and the network channels between all machines should be considered open to both external and internal attacks. Figure 1 presents an overview of the three components and the processes associated with them in the system.

Fig. (1) Overview of S3C architecture and processes. Parts within the solid-line group indicate items or processes on the client side which are considered trusted. Parts in the dashed-line group indicate those in the cloud processing server. All components in the cloud are considered untrusted.

A. Client Application

The client application provides an interface for the user to perform a document upload or search over the data in the cloud. It is responsible for parsing and extracting keywords from plaintext documents and encrypting them before they are uploaded to the cloud.
When the user requests to search, S3C expands the query based on the system's semantic scheme and transforms the query into the secure query set (i.e., trapdoor) to be sent to the cloud. The user will then receive a ranked list of documents
and can select a file for the system to download and decrypt.

B. Cloud Processing Server

The cloud server is responsible for constructing and updating the inverted index and other related data structures based on the parsed and processed data sent from the client. The structures are created entirely out of hashed tokens to keep the server oblivious to the actual file content.
When the server detects that the client has requested to search, it will receive the trapdoor, perform the search over its index (see Section V), and give each related document a score. Once the highest ranking documents are determined, the server can request to retrieve them from the cloud storage and send them back to the client.

C. Cloud Storage

The cloud storage block is used to store the encrypted files that the user uploads. It will not see any representation of the user's query. The storage can potentially span multiple clouds, so long as the computing server knows where each document is stored and the index is updated accordingly.

V. OVERVIEW OF UPLOAD AND SEARCH PROCESSES

A. Overview

The main idea behind our approach is that if we can use a searching method that is agnostic towards the meaning of the terms in the documents or the query and only considers their occurrence and frequency, then we can perform the search over encrypted data. For that purpose, we should assure that we transform each occurrence of a distinct word in every document into the same token, and consequently apply the same transformation when that word appears in the search query. Doing this ensures that a match is still produced during the search process. Hashing is a good way to achieve this.
The Okapi BM25 algorithm [16], frequently used for standard text retrieval, is a term-frequency, inverse-document-frequency model that works using an inverted index. The algorithm does not need to consider the actual meaning of the terms in the document, and only needs to know in which documents they exist. This feature makes it very applicable to our use case.
S3C has two main functionalities: uploading documents from the client, and searching over documents in the cloud. We explain these functionalities in the rest of this section.

B. Upload Process

The goal of the upload process is to parse the desired document into indexable information and encrypt it before it is sent to the cloud. In general, a subset of terms from the document (termed keywords) is selected to represent the semantics of that file. In addition, the term frequency of the keywords within that document is gathered, then the terms are transformed individually into their hashed form and written to a temporary key file to be sent to the cloud along with the full encrypted text file.
Once the cloud processing server receives the encrypted document file and associated key file, it moves the encrypted document into storage. Then the terms and frequencies in the key file are added to the hashed index, which associates a hashed term with a list of documents it appeared in. The size of the uploaded document is also recorded within the index. Our system also supports batch uploading of many data files at once and processes them as a series of individual files with linear complexity.

C. Search Process

The search process consists of two main phases: query modification, and index searching and ranking. The query modification phase starts with the user entering a plaintext query into the client application. The query is then modified on the client side and sent to the cloud processing server, where index searching and ranking is performed. The process of query modification takes in the original query q and expands it into the modified query set Q. It involves three phases: query splitting, semantic expansion, and weighting.
The goal of splitting the query is to break q into smaller components. This is done because a multi-phrase string hashes to a different value than the sum or concatenation of the hash values of its parts, and once on the cloud, the terms must match the entries in the hashed index exactly. Once this phase is complete, Q will consist of q and its split parts.
In order to achieve semantic expansion, the system injects semantic data through the use of online ontological networks. The most naïve approach to this is to perform a synonym lookup for each member of Q (termed Qi) through an online thesaurus and add the results to Q. This assures that the search results will include documents containing terms synonymous with, but not exactly matching, the user's query.
However, this approach alone does not cover ideas that are semantically related to the user's query, but are not synonymous. To achieve this, our system pulls from more advanced ontological networks. For example, in this research we use the contents of Q to pull entries from Wikipedia and perform keyphrase extraction on them to get related terms and phrases (hereafter referred to as related terms). These related terms are then added to Q. The result of this is that the search can retrieve documents that contain concepts more abstractly related to the user's query (e.g., related diseases). In addition, the use of online resources relieves the client of the need to store semantic networks locally.
The goal of weighting is to ensure that the search results are more relevant to the user's original query than the synonyms and related terms. For example, a document that matches the entire original query should be weighted higher and considered more relevant than a document that only matches synonyms. To achieve this, we introduce the following weighting scheme with weights ranging from 0 to 1:
• The original query q is weighted as 1.
• Results of query splitting are weighted as 1/n where n is the number of terms derived from splitting.
• Synonyms or related terms of a term Qi are weighted as W(Qi)/m, where W(Qi) is the weight of Qi and m is the number of synonyms or related terms derived from Qi.
These weights are added to all members of Q to complete the modified query set.
Once the entirety of Q is built, its members are hashed to create the trapdoor Q′ which is sent to the cloud to perform
the index search and ranking. On the cloud processing server, the system goes through each member of Q′ and checks it against the hashed index to compile a list of files that could be considered related to the query. These related files are further ranked using our modification of the BM25 equation, described in the following equations:

r(d_i, Q') = \sum_{i=1}^{n} \mathrm{IDF}(Q'_i) \cdot \frac{f(Q'_i, d_i) \cdot (\alpha + 1)}{f(Q'_i, d_i) + \alpha \cdot \left(1 - \beta + \beta \cdot \frac{|d_i|}{\delta}\right)} \cdot W(Q'_i)    (1)

IDF in this equation refers to the inverse document frequency for the term, which can be defined as:

\mathrm{IDF}(Q'_i) = \log \frac{N - n(Q'_i) + 0.5}{n(Q'_i) + 0.5}    (2)

We define the terms in these equations as follows:
• Qi - an individual term in the original plaintext query
• Q′i - the hashed version of Qi in the hashed query set
• r(di, Q′) - the ranking score attributed to document di for hashed query set Q′
• f(Q′i, di) - the frequency of term Qi in document di
• N - the total number of documents in the collection C
• n(Q′i) - the total number of documents containing the query term Qi
• |di| - the length of document di in words
• δ - the average length of all documents in C
• W(Q′i) - the weight associated with term Qi
• α and β - constants (in this work we considered the values 1.2 and 0.75, respectively)

The cloud processing server computes this equation for all documents in the collection and returns the list to the client, sorted by score in descending order.

VI. SEARCH SCHEMES

S3C considers three main schemes for how to implement our approach. The primary differences among the proposed schemes are: how to select the subset of terms to represent the document, split the user search query, and perform ranking.

A. Naïve Scheme: Full Keyword Semantic Search (FKSS)

FKSS follows the naïve method of selecting terms as keywords. It simply goes through the document and collects and counts the frequency of each individual word that is not considered a stopword. This gives the hashed index the full scope of the document, as no meaningful text is left out, but bloats it with possibly unneeded terms.
FKSS also follows a naïve method of splitting the query, as it just divides it into singular words. This is all that is necessary, as the keyword selection for the hashed index only considers single words; thus splitting the query into larger groups of words would add no value.
Ranking for FKSS is performed with no modification from Equation (1).
Though FKSS follows a naïve approach, it can be useful for scenarios in which small documents are used or it is integral for the full text to be considered, for example, searching over encrypted media tags or social media updates. It is the least secure scheme, however, as it leaves the full scope of each document in the hashed index.

B. Space-Efficient, Fully Secure Scheme: Selected Keyphrase Semantic Search (SKSS)

SKSS creates a space-efficient index by running the document through a keyphrase extractor to obtain a constant number of the most important keywords and phrases within the document (in our implementation, we considered collecting 10 keyphrases). These phrases can be considered to convey general information on what the document is about. Because they can contain more than one word, they are broken down into their individual distinct words, so that the key file sent to the server contains hashed representations of both the full phrase and each word within it. The use of a constant number of terms per document keeps storage overhead small and increases security, as much of the document is not put in the hashed index.
In an effort to further increase security, term frequency is eliminated on the justification that each term is considered to be equally important to the meaning of the document, and thus can be considered equally frequent within the document.
To split the query, SKSS splits not only into individual words, but into all possible adjacent subsets. An example of this can be seen in Figure 2. While some of the phrases added to the set might be meaningless ("Failure Wireless Sensor", for example), others will carry meaning that will be important during the semantic lookup ("Sensor Networks", for example). Once the splitting is complete, synonyms and related terms are looked up for all of the resulting phrases in the query set.

"Failure in Wireless Sensor Networks", "Failure Wireless Sensor", "Wireless Sensor Networks", "Failure Wireless", "Wireless Sensor", "Sensor Networks", "Failure", "Wireless", "Sensor", "Networks"
Fig. (2) A sample of the query splitting done by SKSS.

When performing ranking, SKSS modifies Equation (1) to compensate for the lack of frequency data. Because the keyphrase extractor pulls a limited number of terms from the document, all extracted phrases are considered equally frequent. Thus, a 1 is put in place of f(Q′i, di).

C. Space-Efficient, Accuracy Driven Scheme: Keyphrase Search With Frequency (KSWF)

KSWF is a combination of the two previous schemes. The keyphrase extractor is still used to obtain keywords for the index, similar to SKSS, and the phrases are subsequently split into their individual words. After this, it makes a second pass through the document to collect the frequency information for each word and phrase, similar to FKSS, which is stored alongside the terms in the index.
The user query is split in the same manner as SKSS, with each adjacent subset added to the overall query set. Because the frequency data is now present for all of the terms and phrases, it uses the same ranking method as FKSS. This scheme was developed primarily to analyze the impact of
utilizing term frequency with a method like SKSS. Intuitively, adding term frequency should bring up more relevant search results, as there is more accurate data for the ranking.
The addition of frequency data to KSWF adds greater accuracy to the ranking function. For this reason, it is useful in scenarios in which the highest accuracy possible is desired while maintaining minimal storage overhead.

VII. SECURITY ANALYSIS

S3C provides a trustworthy architecture for storing confidential information securely in clouds while maintaining the ability to search over them. The only trusted component of the architecture is the user machine, which has access to all sensitive information such as the full plaintext documents and the document key files. Keeping the client machine trusted is a reasonable assumption in the real world, as it can be kept with minimal exposure to outside attackers.
Our threat model assumes that adversaries may intend to attack the communication streams between client and cloud processing server and between cloud processing server and cloud storage, as well as the cloud processing server and storage machines themselves. To explain what exactly the attacker could see or do, we will first introduce some definitions.
History: For a multi-phrase query q on a collection of documents C, a history Hq is defined as the tuple (C, q). In other words, this is a history of searches and interactions between client and cloud server.
View: The view is whatever the cloud can actually see during any given interaction between client and server. For our system, this includes the hashed index I over the collection C, the trapdoor of the search query terms (including its semantic expansion) Q′, the number and length of the files, and the collection of encrypted documents C′. Let V(Hq) be this view.
Trace: The trace is the precise information leaked about Hq. For S3C, this includes file identifiers associated with the search results of the trapdoor Q′. It is our goal to allow the attacker to infer as little information about Hq as possible.
The view and trace encompass all that the attacker would be able to see. For the sake of this analysis, we will assume that the chosen encryption and hashing methods are secure, and so C′ itself will not leak any information. I only shows a mapping of a single hashed term or phrase to a set of file identifiers with frequencies, meaning a distribution of hashes to files could be compiled, but minimal data could be gained from the construction. Similarly, Q′ only shows a listing of hashed search terms with weights. The addition of the weights could potentially enable the attacker to infer which terms in the trapdoor were part of the original query, but they would still only have a smaller set of hashed terms.
However, we must consider the small possibility that, if the attacker was able to know the hash function used on the client side, they could in theory build a dictionary of all words in the vocabulary V that the documents are comprised of, mapped to their hashed counterparts, and reconstruct I in plaintext. In this scenario, the attacker could put together the terms that the documents are comprised of, but since I carries no sense of term order, they could not reconstruct the entire file. The KSWF scheme adds additional security by only showing a small portion of the important terms and phrases from the document, meaning the attacker would only be able to ascertain how many times those specific terms and phrases were in the document. The SKSS scheme adds more security by removing those term frequencies.
An attacker monitoring the process during a search could see the resultant file identifiers that are associated with the given Q′. This would show an encrypted history as (C′, Q′). However, since the attacker would not be able to discern the query (without the use of the above dictionary), this data would be of little use.
Attackers could also potentially attempt to alter data in C′. These attacks, however, could be recognized, as the client would not be able to decrypt the altered files.

VIII. PERFORMANCE EVALUATION

A. Overview

To evaluate the performance of S3C and provide proof of concept, we tested it with the Request For Comments (RFC) dataset, a set of documents containing technical notes about the Internet from various engineering groups. The dataset has a total size of 357 MB and is made up of 6,942 text files. To evaluate our system under Big data scale datasets, we utilized a second dataset, the Common Crawl Corpus from AWS, a web crawl composed of over five billion web pages. We evaluated our system against the RFC dataset using three types of metrics: Performance, Overhead, and Relevance.

B. Metrics for Evaluation

1) Relevance: We define relevance as how closely the returned results meet user expectations. To evaluate the relevance of our schemes, we used the TREC-Style Average Precision (TSAP) method described by Mariappan et al. in [14]. This method is a modification of the precision-recall method commonly used for judging text retrieval systems. It is defined as follows:

\mathrm{Score} = \frac{\sum_{i=1}^{N} r_i}{N}    (3)

Where i is the rank of the document determined by the system and N is the cutoff number (10 in our case, hence the term TSAP@10). ri takes three different values:
• ri = 1/i if the document is highly relevant
• ri = 1/(2i) if the document is somewhat relevant
• ri = 0 if the document is irrelevant
This allows for systems to be given a comparative score against other schemes in a relatively fast manner.
2) Performance: We define performance as the time it takes to perform the search operation. The aspects of performance we measure are as follows:
• Time it takes to process the user query in seconds. This includes semantic query modification and hashing into the trapdoor.
• Time it takes to search over the index in the cloud in seconds. This includes retrieving the related files from the index and ranking them based on the query.
• Total time to perform the search in seconds. This encapsulates both of the steps above, plus any additional time taken with communication over the network.
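The TSAP@10 relevance metric described above can be sketched as a small scoring function. This is an illustrative outline under the stated r_i rules, with per-result judgments encoded as the hypothetical labels "high", "some", and "none":

```python
def tsap_at_n(judgments, n=10):
    """TSAP score: sum of r_i over the top-n results, divided by n.

    r_i = 1/i for a highly relevant result at rank i,
    r_i = 1/(2i) for a somewhat relevant result, 0 otherwise.
    """
    score = 0.0
    for i, judgment in enumerate(judgments[:n], start=1):
        if judgment == "high":
            score += 1.0 / i
        elif judgment == "some":
            score += 1.0 / (2 * i)
    return score / n

# Ranked results: first highly relevant, second somewhat, third irrelevant.
print(tsap_at_n(["high", "some", "none"], n=10))  # (1 + 0.25) / 10 = 0.125
```

Because r_i decays with rank, the metric rewards schemes that place the most relevant documents first, which is exactly the behavior the ranking function is meant to produce.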
3) Overhead: We define overhead to be the imposed cloud server storage space taken by the hashed index and the computing involved with it. The aspects of overhead we measure are as follows:
• Size of the inverted index, measured in the form of number of entries.
• Time it takes to construct the index in seconds. This operation reads the data files for the index and compiles them into a hash table. It is only performed on cloud server startup.

C. Benchmarks

We derived a set of benchmark queries based on the information presented in the dataset. For testing relevance, we looked at two categories of queries which a user may desire to search. In the first category we consider a user who already knows which document they are looking for, but may not remember where the document is located in their cloud or may not want to look through a large number of files to find it. Such queries are typically specific and only a small number of documents should directly pertain to them. The search system is expected to bring up these most desired documents first.
In the second category we consider a user who wants to find all of the documents related to an idea, such as the nurse attempting to find all patients with a similar disease or diagnosis in our motivation. Such queries would be broad with many possible related documents, and the search system should bring up the most relevant ones first.

• Category 1 - Specific:
  IBM Research Report (IRR)
  Licklider Transmission Protocol (LTP)
  Multicast Listener Discovery Protocol (MLDP)
• Category 2 - Broad:
  Internet Engineering (IE)
  Transmission Control Protocol (TCP)
  Cloud Computing (CC)
  Encryption (EN)
Fig. (3) Queries used for testing relevance. Queries in category 1 target a small set of specific, known documents within the collection, while queries in category 2 target a broad set of documents not necessarily known to the user.

To measure performance, we measured time for a small (single word) query and a mid-size (three word) query. In addition, to measure the effects of expanding the size of the search query, we measured times for queries that expanded from one word to four words, taking measurements at each single word increment. Due to the inherent variety in the performance results, we report the mean and 95% confidence interval of 50 rounds of running each experiment.
For our scalability tests, we measured search times and storage overhead for several three-word queries against increasingly large portions of the dataset. Specifically, we tested against datasets of sizes: 500 MB, 1 GB, 5 GB, 10 GB, 25 GB, and 50 GB.
As a baseline for performance testing, we implemented a standard non-secure (SNSS) version of the system, utilizing the same semantic processing but with no encryption or hashing. Due to their similarities in indexing, the SNSS and FKSS schemes can be seen as being grouped together, as they both consider the entirety of the document text. Similarly, the SKSS and KSWF schemes can be grouped together since they both consider a small subset of the document text.

Fig. (4) TSAP@10 score for the specified query for each system. Once the system has returned a ranked list of results, a score is computed based on a predetermined relevance each file has to the given query.

D. Evaluating Relevance

Figure 4 shows the TSAP scores of each of the four schemes searching with each of the benchmark queries.
For queries in category 1, the main desired results were ranked the highest for all schemes. Our space-efficient schemes (SKSS and KSWF), which might intuitively seem to suffer greatly in accuracy, suffer only a small amount when compared to the schemes that utilize the document's full text. For queries in category 2, the SKSS and KSWF schemes returned results that were just as relevant, and in some cases more relevant. Most interestingly, the KSWF scheme does not actually show much benefit from the addition of term frequency, meaning that when working with a small subset of the document's text, finding the frequency of those key phrases may be unnecessary, as it causes a longer indexing time, unless the highest possible accuracy is desired.

E. Evaluating Performance

In the experiments, we measured the performance of each scheme with a small (one-word) and mid-sized (three-word) query, gathering the total time it takes to perform the search. In addition, we measured the two main components of the total search time: the time taken for query modification and the time taken to perform the index search and ranking on the cloud.
Results can be seen in Figures 5, 6, and 7. All schemes can be seen to be reasonably similar in terms of total search time. The majority of search time across all models is taken up by the query processing phase, as our system needs to pull information from across the Internet in the form of synonyms and Wikipedia entry downloads. SKSS and KSWF both take
slightly longer to process longer queries due to the addition of the adjacent query subsets, which need to be looked up as well. Query processing time is thus linked to Internet speeds and the size of the Wikipedia entry for each of the query terms. The results indicate that under fast Internet speeds, the performance time of this system will naturally improve. While pulling information from the Internet does naturally increase search times, it was included intentionally to reduce the storage size needed on the local client, which would otherwise need to house an onboard ontological network.
Most important to note is the difference in index searching times. The space-efficient SKSS and KSWF schemes take a near-negligible amount of time to search over the index. This can be explained by the vastly decreased index size (as shown in the next subsection): as only key phrases are stored, the initial set of potentially relevant documents is significantly smaller and the ranking equation must be run a lower number of times.
Because the greatest amount of time is taken during query processing and index search time is very small for the space-efficient schemes, these two schemes can be scaled up to work on larger datasets without facing a huge growth in search time.

Fig. (5) Total search time in each scheme. This includes the time taken to process the query, communicate between client and server, and perform searching over the index. The results are averaged over 50 runs.

Fig. (6) Time to process the query. This includes query modification and hashing into the trapdoor. The results are averaged over 50 runs.

Fig. (7) Time it takes to perform the search on the hashed index on the cloud. This includes the time taken to find all files in the hashed index that contain any hashed terms in the query trapdoor and rank them with the scheme's respective functions. The results are averaged over 50 runs.

Fig. (8) Size of the inverted index for each system. An entry denotes a hashed keyword mapped to a set of file identifiers.

F. Evaluating Overhead

To demonstrate space-efficiency in our evaluation, we measured the overhead for each scheme in terms of how many entries were stored in the hashed index. These results can be seen in Figures 8 and 9. The two groups of schemes show a vast difference in this regard, due to the number of terms selected from each document. The linear growth per document of the index, guaranteed by the constant number of key phrases extracted, keeps the index small while maintaining the relevance of search results (as shown previously).
In addition, we measured the effect that the size of the inverted index had on the time it takes to construct the index from the utility files on the server. The differences are again vast, with construction times being almost negligible for SKSS and KSWF. It is worth noting that this operation needs only to be performed at startup of the cloud server, and that additions to the index at runtime operate at near constant
Fig. (9) Index Construction Time (s).
IEEE International Conference on Data Science and Data Intensive Systems, pages 38–45, Dec. 2015.
[2] Christian Bizer, Peter Boncz, Michael L. Brodie, and Orri Erling. The meaningful use of big data: Four perspectives – four challenges. ACM SIGMOD Record, 40(4):56–60, Jan. 2012.
3731