
Grid's confidential outsourcing of string matching

2007

Proceedings of the 6th WSEAS Int. Conf. on Software Engineering, Parallel and Distributed Systems, Corfu Island, Greece, February 16-19, 2007

RACHEL AKIMANA, OLIVIER MARKOWITCH and YVES ROGGEMAN
Département d'Informatique, Université Libre de Bruxelles
Bd. du Triomphe – CP212, 1050 Bruxelles, BELGIUM

Abstract: In this paper we consider the confidentiality aspects of particular Grid applications such as genetic applications. The search for DNA similarities is one of the interesting areas of genetic biology. However, DNA sequence comparisons are computationally greedy and sensitive. We propose a model for searching DNA similarities in a public DNA database on the Grid. The model is related to the private approximate string matching problem, where neither the inputs nor the outputs of the comparisons are revealed. We analyze the performance of our proposed DNA disguising method by taking into account how the edit distances between the client's queries and their corresponding disguises are distributed along the DNA sequences.

Key-Words: Grid systems, Secure outsourcing, Secure approximate matching

1 Introduction

In its computer science sense, a Grid is a widely distributed system composed of the resources of many computing systems. Since many different, and possibly malicious, users are using these resources, the risk of eavesdropping on data and information that are stored or processed on Grid resources, or that travel on the Grid's network, cannot be disregarded. Moreover, data stored on Grid resources may be related to private individual information (e.g. medical, biological or genetic data). In this case, confidentiality issues and the protection of the users' privacy must be studied carefully. For example, we have to take into account the fact that data may be stored or processed on a remote and probably untrusted Grid node.
In this work, the word data is taken in a broad sense: it includes data resulting from simulations and experiments that are organized in databases on the Grid, as well as the executable code of jobs to be processed on the Grid. We will show that existing solutions for confidentiality in a Grid system, such as SSL, ensure the confidentiality of data during their transport phase but do not guarantee the confidentiality of sensitive computations during their execution.

We focus on the confidentiality aspects of genetic applications on the Grid; more precisely, on the search for DNA similarities among DNA sequences stored in Grid databases. DNA sequence comparisons are expensive computations, since one DNA sequence may contain thousands to millions of nucleotides. Such comparisons therefore need powerful computing resources, and a Grid seems to be an appropriate environment for them. A remote DNA sequence comparison may be a sensitive computation, in the sense that we may have to ensure that the DNA sequences are not subject to unauthorized tests whose outcome could have serious consequences [3] (such as jeopardizing an individual's insurability or employability).

On the basis of these security and computing power requirements, we propose a disguise model for searching DNA similarities in a public DNA database on the Grid in such a way that neither the inputs nor the outputs of the comparisons are revealed to the computing node. This work is related to the problem of Private Information Matching with a public database [7], where a client searches a public database for items similar to a given item, revealing neither the client's item nor the output of the comparison. Solutions to private information matching with private databases (PIM) are proposed in [7]. These solutions also apply to private information matching with a public database (PIMPD).
In this paper, we propose a specific solution to a PIMPD problem that takes into account the fact that the database is public. The model exploits the opportunities offered by the data replication service of Grid systems, such as the possibility of replicating a database on several servers. The fact that the database is public is profitable in the sense that the client can download as many DNA sequences as possible. This avoids the use of a third untrusted entity, as is done in PIM solutions. This work provides an example of a practical Grid-enabled application that also preserves the security of the application.

2 Confidentiality of transmitted data

When we deal with confidential information, we have to consider the fact that, among the entities involved in the transport of the corresponding data or in their processing, some may be authorized to read the data whereas others may not. The "transport phase" concerns messages transmitted between Grid entities or jobs that are migrating between Grid nodes. The SSL protocol is integrated, for confidentiality purposes, in the security layers of the two well-known Grid middlewares Globus and UNICORE. In the Legion middleware [2], it is up to the users to choose which mechanisms they consider secure enough for their security requirements (identification, login, delegation, confidentiality).

SSL is used to transmit messages between Grid entities in such a way that only those authorized to read a message can understand it. The protocol is initiated between two entities. Via a chain of delegation, the protocol may involve more than two entities that agree on peer-to-peer message protection. Consequently, messages may be (symmetrically or asymmetrically) encrypted for their recipients and may therefore be securely handled by intermediary nodes.
However, SSL fails to guarantee the confidentiality of sensitive computations with respect to the computing nodes. Indeed, such computations have to be deciphered (if they were encrypted) before they are read, interpreted and executed on the different Grid nodes. Otherwise, the execution of encrypted code may lead to results from which it is hard to deduce the results of the original code.

3 Secure outsourcing

Outsourcing is used by an entity that has to execute a task but does not have the appropriate hardware and/or software to carry out the execution. Another entity, which has the appropriate resources, executes the task and provides the results to the task's owner. Secure outsourcing refers to outsourcing in which security requirements, such as integrity, authentication and confidentiality, are involved. We assume that the nodes involved in a computation are honest and execute the tasks correctly. However, since the tasks and/or the corresponding outputs may refer to private data, the tasks' owners may want to prevent the executing nodes from learning the task-related information. Therefore, secure outsourcing with a confidentiality requirement may be such that the nodes involved in the computations never know either the task's inputs or its outputs.

There are at least two ways to hide computation details from the agents that participate in the computation: disguise and encryption. A disguise operation applies a functional or mathematical transformation to the objects to be disguised (for example the inputs of sensitive computations). Generally, the execution of the disguised problem leads to results from which it is possible to deduce, knowing some secret information about the disguise, the results of the original problem. Encryption can be seen as another kind of disguise method, based on the use of secret keys.
However, it seems harder and less efficient to recover the results of an original problem that has been encrypted before execution than when the problem has been disguised. Nevertheless, although disguise methods seem to be more convenient solutions for secure outsourcing, their disadvantage is that there seems to be no generic disguise method that fits all possible problems. Since Grid systems are dedicated to a broad range of applications, this disadvantage becomes, in this framework, a serious problem.

4 Outsourcing of strings matching

Among Grid applications, some may be sensitive from a security point of view (e.g. medical and genetic applications). Therefore, before they are outsourced to Grid nodes, they may need to be disguised. In this work, we are interested in outsourcing operations with confidentiality requirements. In this context, we focus on the disguise of string matching procedures that allow errors. This problem is also called the "approximate string matching problem" [7].

4.1 Related works

Much work has been done on secure outsourcing [3, 4, 7, 10, 11, 12]. Unfortunately, no general and generic solution seems to exist. The different secure outsourcing models that have been proposed are appropriate tools for specific situations. In [3], the authors propose a model for sequence comparison (in speech recognition, machine vision and molecular sequence comparison) involving three entities: the client, who needs the result of the comparison of two sequences, and two other agents that participate in the comparison without knowing the two sequences. In [4, 11], the problems of outsourcing and speeding up secret computations are addressed. Solutions to mathematical applications (matrix multiplications, quadratures, edge detections, solutions of differential equations, etc.)
are discussed. The problems of secure outsourcing and of speeding up fixed-base and variable-base exponentiation are also addressed in [12]. S. Hohenberger et al. [10] propose two models for securely outsourcing cryptographic computations: the outsourcing of modular exponentiation and outsource-secure encryption using one trusted program.

The approximate string matching problem deals with finding all substrings of a text T that match a given pattern with at most m differences, for some given integer m, the differences being inserted, deleted or replaced characters [13]. Approximate string matching is the realistic version of exact pattern matching, where we have a database $x = x_1, \ldots, x_n$ and a user who has an item $x_i$ and wants to verify whether this item is in the database. If, in addition to this exact pattern matching, the user wants his query to be kept confidential, we are dealing with "private information retrieval" [5, 6].

In the approximate matching framework, the user has an item $x_i$ and wants to find items in a database that are similar to $x_i$. The notion of distance can also be used, in the sense that the element most similar to a given item is the element at minimum distance from the item (compared to all the other elements in the database). Many algorithms that solve the approximate string matching problem are proposed in [13, 14, 15, 16]. These solutions are applicable to fingerprint matching, voice matching, DNA sequence matching, image template matching, etc.

The private information matching (PIM) problem is an approximate matching problem in which the confidentiality of the client's queries and of the database content has to be preserved. Solutions to this problem are proposed in [7] with the metrics $\sum_{i=0}^{n} (a_i - b_i)^2$ and $\sum_{i=0}^{n} |a_i - b_i|$, where a and b are the sequences being compared.
These solutions are also applicable to private information matching with a public database, called in [7] "Private Information Matching with a Public Database" (PIMPD). In this work, we propose a specific solution to PIMPD that we apply to DNA sequence comparisons.

4.2 DNA sequences comparisons

In this paper, as an example of a possible string matching framework, we consider the problem of private DNA sequence comparisons.

There are two types of nucleic acids: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). These molecules allow living beings to reproduce their complex equipment from one generation to the next. RNA is used as an intermediary in the flow of genetic information from DNA to proteins. DNA is a polymer. The monomer units of DNA are nucleotides, and the polymer is known as a polynucleotide. There are four different types of nucleotides found in DNA, differing only in the nitrogen base. The four nucleotides are given one-letter abbreviations as shorthand for the four bases: A for Adenine, G for Guanine, C for Cytosine and T for Thymine. DNA is normally a double-stranded macromolecule. Two polynucleotide chains are held together by a weak thermodynamic force. In the DNA helix, we have four different bonds, A−T, T−A, C−G and G−C (taking into account that one base is on the first polynucleotide chain and the other base is on the second chain). The ith character $\lambda_i$ of a sequence may be one of the four base bonds A−T, T−A, C−G and G−C.

Searching for specific sequences is a fundamental operation for problems such as looking for given features in DNA chains or determining how similar two genetic sequences are. For example, a database of over 3 million individual DNA profiles has been built up since 1995 in England and Wales [9]. One of the practical uses of such a database is the elucidation of crimes. Indeed, crimes are successfully solved when DNA is recovered from the crime scene and the DNA profiles are successfully loaded onto the DNA database. Since such databases consume large amounts of storage space, the Grid seems to be the right environment for their management.

We assume the existence, on the Grid, of a DNA database that is replicated on different storage servers. The DNA database is public. The DNA owners are either anonymous or their identities are stored on another, secured storage server. The management of such identities is out of the scope of this work. We consider a client who has a DNA sequence $q = (\lambda_1, \ldots, \lambda_n)$ that he has recovered either from a crime scene or from a given living being. He wants to know whether his sequence already exists in the database or whether there is a sequence similar to his own. However, he wants to reveal neither his sequence nor the result of the comparison. Indeed, since the comparison is done on a remote entity, if the DNA is revealed, a dishonest remote entity may run tests on that DNA. Such tests may reveal private information about the DNA owner (such as genetic diseases).

Before going any further, we define the notion of distance in the context of DNA sequences. Consider two DNA sequences $q = (\lambda_1, \ldots, \lambda_n)$ and $q' = (\lambda_1', \ldots, \lambda_n')$ over the finite alphabet $\Sigma = \{A{-}T, T{-}A, C{-}G, G{-}C\}$. The distance between the two sequences is the minimum cost of a sequence of operations that transforms q into $q'$. Such operations may be the deletion, insertion or substitution of characters [3]. If the different operations have different costs, or if the cost depends on the characters involved, we speak of the general edit distance. Otherwise, if all the operations cost 1, we speak of the simple edit distance [7].
Without loss of generality, we consider the simple edit distance between two DNA sequences of the same length, and we deal only with the substitution operation.

4.3 The DNA Private matching model

In this section, we propose a model for outsourcing DNA sequence comparisons on the Grid. We assume the existence of a public DNA database that is replicated on k replica servers. A client having a DNA sequence $q = (\lambda_1, \ldots, \lambda_n)$ queries the k servers to find in the database the element most similar to his item. This similarity search is done in such a way that neither the client's query nor the output of the comparisons is revealed to the replica servers. We assume that the servers do not collude against the client.

The client disguises his sequence $q = (\lambda_1, \ldots, \lambda_n)$ as the sequence $q' = (\lambda_1', \ldots, \lambda_n')$. This disguise operation is done by choosing random characters $\lambda_j$ in q and substituting them with $\lambda_j'$. The elements $\lambda_j'$ are taken from the DNA alphabet $\Sigma = \{A{-}T, T{-}A, C{-}G, G{-}C\}$. This means that at random positions i in q and $q'$ we have $\lambda_i = \lambda_i'$, whereas at other positions j we have $\lambda_j \neq \lambda_j'$. The number k of positions where $\lambda_j \neq \lambda_j'$ is the simple edit distance between q and $q'$.

We assume that the client interacts with k database replica servers to find the sequence most similar to his query. To the first server, the client sends the disguised sequence $q'$ and the distance k. This server selects in the database the elements $q'' = (\lambda_1'', \ldots, \lambda_n'')$ such that $d(q', q'') = \mathit{min}$, where $\mathit{min}$ is the smallest distance between $q'$ and the elements $q''$ of length n in the database. The client can reduce the number of such elements by giving the server, in addition to $q'$ and k, one of the elements $\lambda_j$ such that $\lambda_j \neq \lambda_j'$; then, among the elements such that $d(q', q'') = \mathit{min}$, the server selects the elements $q''$ such that $\lambda_j'' = \lambda_j$. Otherwise, if there is no element such that $\lambda_j'' = \lambda_j$, the server has to retrieve all the elements such that $d(q', q'') = \mathit{min}$.

The server chooses one of the elements $q''$ and searches the database for the elements $p''$ whose jth character equals $\lambda_j$ and such that $d(q'', p'') \leq k + \mathit{min}$. Among the elements $p''$ is the element most similar to q. Indeed, since $d(q, q') = k$ and $d(q', q'') = \mathit{min}$, we have $d(q, q'') \leq k + \mathit{min}$ by the properties of a distance. The server returns the elements $p''$ to the client.

The client proceeds in the same way with the other k − 1 replica servers. To each replica server, the client reveals a different character $\lambda_j$ among the k characters where $\lambda_j' \neq \lambda_j$. The client computes the intersection of all the sets of elements $p''$ returned by the k replica servers. This intersection contains the sequence that is most similar to the sequence q.

4.4 Distances distribution

Assumption on pair bases distribution. In order to evaluate the proposed model, and to simplify calculations, we make an assumption on the distribution of base bonds along a DNA sequence: we assume that each base bond occurs in a given 4-length subsequence with a probability of 0.25. Two different 4-length DNA subsequences differ by the positions of the four base bonds in each subsequence. This distribution constitutes the simplest and also the worst case with respect to the quality of the disguise. Indeed, we keep in mind that there are other distributions in which the occurrence of a given base bond in a DNA subsequence may be random. This might make the occurrence of a given subsequence more random and consequently improve the quality of a disguise. If our model is secure (from the point of view of the quality of the disguise) in the worst case, we can be sure that it will be more secure in other cases. It is important to note that this assumption about the base pair distribution is incompatible with certain distance values between two DNA subsequences.
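The consequence of this assumption for distances between 4-length subsequences can be checked by brute force. The sketch below reads the assumption as "every 4-length subsequence contains each of the four base bonds exactly once", which is our interpretation of the text; the string labels and helper names are illustrative only.

```python
from itertools import permutations

# Hypothetical encoding of the four base bonds.
SIGMA = ("A-T", "T-A", "C-G", "G-C")


def distance(u, v):
    """Substitution-only simple edit distance between equal-length blocks."""
    return sum(a != b for a, b in zip(u, v))


# Under the assumed reading, the admissible 4-length subsequences are
# exactly the orderings of the four base bonds.
blocks = list(permutations(SIGMA))

# Distances that actually occur between two different admissible blocks.
observed = {distance(u, v) for u in blocks for v in blocks if u != v}

print(len(blocks))       # 24
print(sorted(observed))  # [2, 3, 4]
```

Distance 1 never occurs between two such blocks, since changing a single position necessarily duplicates one base bond.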
If two 4-length DNA subsequences are at distance 1 from each other, at least one base bond occurs twice in one of the two subsequences. For example, the subsequences A−T, T−A, C−G, G−C and A−T, T−A, C−G, T−A are at distance 1 from each other, but the base bond T−A occurs twice in the second one (this is incompatible with the equiprobable base pair distribution). That is why, in order to stay within the assumption on the base bond distribution, we consider in the rest of this paper that the distance between two 4-length DNA subsequences is > 1 and that k > 1.

Equiprobable distance distribution. Here we assume that the distance between two DNA sequences is distributed equiprobably along the sequences. We are going to see that this distribution strengthens the security of our model while causing a loss in performance. Indeed, in this section we show that the quality of the disguise is good when the distance is equiprobably distributed. The disadvantage of this distribution is that the number of replica servers k depends directly on n (the DNA size). Since the DNA size is generally large, the number of servers rapidly becomes prohibitive.

Firstly, we show that revealing the distance k in the model we proposed does not affect the security of the disguised sequence q. Indeed, if two n-length sequences q and $q'$ are at distance k, then two corresponding 4-length subsequences from q and $q'$ are at distance $4k/n$, since the distance is distributed equiprobably. By corresponding subsequences, we mean subsequences that are at the same position in their respective DNA sequences. Thus, given a 4-length subsequence of $q'$, the number of corresponding 4-length subsequences from q that are at distance $4k/n$ is $\binom{4}{4k/n}$, with n dividing 4k and $4k/n > 1$.
Since an n-length DNA sequence contains $n/4$ 4-length DNA subsequences, and each such subsequence has $\binom{4}{4k/n}$ corresponding 4-length DNA subsequences at distance $4k/n$, given an n-length DNA sequence $q'$ (such as the one sent to the storage servers in the model), there are

$$S(n, k) = \frac{n}{4} \binom{4}{4k/n} = \frac{3!\, n^2}{4k \left(\frac{4k}{n} - 1\right)! \left(4 - \frac{4k}{n}\right)!}$$

4-length DNA subsequences from which we can construct different n-length sequences q that are at distance k from the given n-length sequence $q'$. More precisely, we can construct $\binom{S(n,k)}{0.25n}$ n-length DNA sequences q that are at distance k from $q'$. This means that the user's query q is disguised among $\binom{S(n,k)}{0.25n}$ elements that are at distance k from $q'$.

In the model, the server selects in the database the elements $q'' = (\lambda_1'', \ldots, \lambda_n'')$ such that $d(q', q'') = \mathit{min}$. The number of such elements is $\binom{S(n,\mathit{min})}{0.25n}$. We can get an idea of how this number is reduced when the client reveals one of the elements $\lambda_j$ such that $\lambda_j \neq \lambda_j'$ and the server selects the elements $q''$ such that $\lambda_j'' = \lambda_j$. Indeed, $\lambda_j''$ lies in a 4-length subsequence of $q''$ that is at distance $4\mathit{min}/n$ from the corresponding 4-length subsequence of $q'$. Normally there are $\binom{4}{4\mathit{min}/n}$ such elements. However, if $\lambda_j''$ is fixed, this number is reduced to $\binom{3}{4\mathit{min}/n}$. This means that there are $\binom{3}{(4\mathit{min}/n) - 1}$ elements that are not considered, since

$$\binom{4}{4\mathit{min}/n} = \binom{3}{4\mathit{min}/n} + \binom{3}{(4\mathit{min}/n) - 1}.$$

The number of elements $p''$ such that $d(q'', p'') \leq k + \mathit{min}$ is $\sum_{i=2}^{k+\mathit{min}} \binom{S(n,i)}{0.25n}$ when there is no element with $\lambda_j'' = \lambda_j$; otherwise, this number is reduced, since at each iteration there are $\binom{3}{(4i/n) - 1}$ 4-length subsequences that are not considered.

Given two different 4-length DNA subsequences, the distance between them varies from 2 to 4. In our model, this means that $4k/n$ varies from 2 to 4 and k varies from 0.5n to n. For example, with n = 200 and k distributed with a ratio of 2 in each 4-length DNA subsequence, we have k = 100. In the model, k is also the number of servers with which the user interacts. This number of servers depends on the DNA sequence size. Such a number of servers is prohibitive. We describe below a way of distributing the distance along the sequences that reduces this number of servers.

Distance distributed randomly along the DNA sequences. We now consider the case where the distance k is distributed randomly in the n-length DNA sequences q and $q'$. By random distribution, we mean that instead of distributing the distance k over all 4-length DNA subsequences, the client can distribute the distance k over randomly chosen 4-length subsequences. For example, the client can distribute the distance k over $k/2$ random 4-length subsequences with a ratio of 2. In this case, the distance k (and implicitly the number of servers) does not depend on the DNA size n.

We assume that the distance k is distributed over $k/2$ random 4-length DNA subsequences, each 4-length DNA subsequence of q in which the distance is distributed being at distance 2 from the corresponding 4-length DNA subsequence of $q'$. We analyze how the quality of the disguise and the performance of the model are affected by this distribution. If two corresponding 4-length DNA subsequences are at distance 2 from each other, they differ by the exchange of the positions of two base bonds; for example, the two subsequences A−T, C−G, T−A, G−C and A−T, C−G, G−C, T−A. In our model, given a 4-length DNA subsequence of $q'$ in which the distance is distributed, there are $\binom{4}{2} = 6$ 4-length subsequences of q that are at distance 2. Since the distance k is distributed over $k/2$ 4-length DNA subsequences, we have $\frac{k}{2} \cdot \binom{4}{2} = 3k$ such 4-length subsequences from which we can construct the sequences q that are at distance k from the sequence $q'$. Precisely, we can construct $\binom{3k}{k/2}$ n-length DNA sequences q that are at distance k from the DNA sequence $q'$.

We note that this number of n-length DNA sequences q at distance k from the DNA sequence $q'$ with a random distribution of the distance k is small in comparison with the number of n-length DNA sequences q we can construct with an equiprobable distribution of the same distance, $\binom{S(n,k)}{0.25n}$. This allows us to state that the quality of the disguise is better with an equiprobable distribution of the distance than with a random distribution. Nevertheless, we estimate that the quality of the disguise with a random distance distribution is still sufficient to hide the client's query q. For example, with the distance k = 6 distributed with a ratio of 2 over random 4-length subsequences, there are $\binom{18}{3} = 816$ n-length DNA sequences q that are at distance k from the n-length sequence $q'$.

This loss in disguise quality is compensated by a gain in performance, especially in the number of servers, which is reduced with a random distance distribution. Indeed, with the random distribution of the distance, the number of servers k is equal to the distance between the two n-length sequences, and the client is still able to adapt the distance k to the number of servers available. This is not possible when the distance k is equiprobably distributed along the DNA sequence, since the distance k then depends on the DNA size. With a large DNA sequence, the number of servers becomes prohibitive.
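The counts used in this section, and the random distribution of the distance with ratio 2, can be illustrated with a short sketch. It assumes that 4 divides n, that n divides 4k, that k is even, and that every 4-length block of the query holds four distinct base bonds (our reading of the base bond assumption). The function names and the string encoding of base bonds are hypothetical and not taken from the paper.

```python
import random
from math import comb

# Hypothetical encoding of the four base bonds.
SIGMA = ("A-T", "T-A", "C-G", "G-C")


def S(n, k):
    """S(n, k) = (n/4) * C(4, 4k/n), assuming 4 divides n and n divides 4k."""
    if n % 4 or (4 * k) % n:
        raise ValueError("need 4 | n and n | 4k")
    return (n // 4) * comb(4, (4 * k) // n)


def candidates_equiprobable(n, k):
    """C(S(n, k), 0.25 n): candidate queries under the equiprobable distribution."""
    return comb(S(n, k), n // 4)


def candidates_random(k):
    """C(3k, k/2): candidate queries when k is spread over k/2 random blocks."""
    if k % 2:
        raise ValueError("k must be even for a ratio of 2")
    return comb(3 * k, k // 2)


def distance(u, v):
    """Substitution-only simple edit distance between equal-length sequences."""
    return sum(a != b for a, b in zip(u, v))


def disguise_random(q, k, rng=random):
    """Exchange two base bonds in each of k/2 randomly chosen 4-length blocks.

    If every block of q holds four distinct base bonds, each exchange changes
    exactly two positions, so the result is at distance k from q.
    """
    if len(q) % 4 or k % 2 or k // 2 > len(q) // 4:
        raise ValueError("need 4 | len(q), k even and k/2 <= len(q)/4")
    q_prime = list(q)
    for b in rng.sample(range(len(q) // 4), k // 2):
        i, j = rng.sample(range(4 * b, 4 * b + 4), 2)
        q_prime[i], q_prime[j] = q_prime[j], q_prime[i]
    return q_prime


print(S(200, 100))           # 300
print(candidates_random(6))  # 816

q = list(SIGMA) * 5                    # toy query, n = 20
q_prime = disguise_random(q, 6)        # distance spread over 3 random blocks
print(distance(q, q_prime))            # 6
```

In this sketch the client can pick k according to the number of available replica servers, independently of the length of q, which is the property the random distribution is meant to provide.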
Considering together the quality of the disguise and the performance of the model, we see that it is advantageous for the client to distribute the distance k randomly when disguising his query.

References

[1] S. Naqvi, M. Riguidel, Threat model for Grid Security Services, LNCS, Springer, Vol 3470, Advances in Grid Computing - EGC 2005, pp.1048-1055.
[2] J. Chapin, A new Model of security for Metasystems, Future Generation Computer Systems, 15(5), pp.713-722, 1999.
[3] M.J. Atallah, J. Jiangtao, Secure Outsourcing of Sequence Comparisons, International Journal of Information Security, Vol 4, pp.277-287, 2005.
[4] M.J. Atallah, J.R. Rice, Secure Outsourcing of Scientific Computations, Technical Report 98-15, Purdue University.
[5] B. Chor, N. Gilboa, Computationally Private Information Retrieval, ACM Symposium on Theory of Computing, pp.304-313, 1997.
[6] B. Chor, O. Goldreich, E. Kushilevitz, M. Sudan, Private Information Retrieval, IEEE Symposium on Foundations of Computer Science, 1995.
[7] W. Du, M. Atallah, Protocols for Secure Remote Database Access with Approximate Matching, 7th ACM Conference of Computer and Communications Security, 25 pages, 2000.
[8] G. Navarro, A Guided Tour to Approximate String Matching, ACM Computing Surveys, 33(1), pp.31-88, 2001.
[9] The national DNA database, Parliamentary Office of Science and Technology, Postnote, Number 258, 2006.
[10] S. Hohenberger, A. Lysyanskaya, How To Securely Outsource Cryptographic Computations, Theory of Cryptography Conference, LNCS, Springer, Vol 3378, pp.264-282, 2005.
[11] T. Matsumoto, K. Kato, H. Imai, Speeding Up Secret Computations with Insecure Auxiliary Devices, Proceedings of Crypto, LNCS, Springer, Vol 403, pp.497-506, 1988.
[12] M. Dijk, D. Clarke, B. Gassend, G. Suh, S. Devadas, Speeding up Exponentiation using an Untrusted Computational Resource, Designs, Codes and Cryptography, 39(2), pp.253-273, 2006.
[13] S. Sahinalp, U. Vishkin, Efficient Approximate and Dynamic Matching of Patterns Using a Labeling Paradigm, IEEE Symposium on Foundations of Computer Science, pp.320-328, 1996.
[14] G. Landau, U. Vishkin, Efficient String Matching in the presence of errors, IEEE Symposium on Foundations of Computer Science, pp.126-136, 1985.
[15] G. Landau, U. Vishkin, Fast String Matching with k differences, Journal of Computer and System Sciences, 37(1), pp.63-78, 1988.
[16] G. Landau, U. Vishkin, Fast Parallel and Serial Approximate String Matching, Journal of Algorithms, 10(2), pp.157-169, 1989.