
Privacy Preserving Data Leak Detection (IJIFR-V4-E5-033)

Abstract

The number of leaked sensitive data records has grown dramatically in recent years, from 412 million in 2012 to 822 million in 2013 [1]. Security firms, research institutions, and government organizations are among those most affected by the data leakage problem. A data leak is the unauthorized transmission or exposure of secret data; unintentional or accidental leakage is also unauthorized. Human mistakes are one of the main causes: a person who mistakenly sends a confidential message to all of their e-mail contacts has caused a data leak. Data leakage is therefore defined here as the accidental or unintentional distribution of private or sensitive data to an unauthorized entity. Sensitive data in companies and organizations includes intellectual property (IP), financial information, patient information, private credit-card data, and other information depending on the business and the industry. This paper proposes a data-leak detection (DLD) solution. We design, implement, and evaluate a fingerprint technique that enhances data privacy throughout data-leak detection operations. It enables the data owner to securely delegate the content-scrutiny job to DLD providers without revealing the sensitive data.

www.ijifr.com, International Journal of Informative & Futuristic Research (IJIFR), ISSN: 2347-1697, Volume 4, Issue 5, January 2017, Continuous 41st Edition
Paper ID: IJIFR/V4/E5/033
Page No.: 6411-6417
Subject Area: Computer Engineering
Key Words: data leakage, fingerprint technique, content-scrutiny job
Author: Sarisha Kumari.S, M.Tech. Student, Department of Computer Engineering, Musaliar College of Engineering and Technology, Pathanamthitta, Kerala
This work is published under Attribution-NonCommercial-ShareAlike 4.0 International License. Copyright © IJIFR 2017.

I. INTRODUCTION

Because data-leak incidents are growing rapidly, steps must be taken against data loss, and data-leak detection is a necessary first step. Once it is known whether a leak has occurred, measures can be taken to contain it.
A data-leak detection solution can be outsourced and deployed in a semi-honest detection environment. A data-leak detection (DLD) provider is responsible for detecting whether any data leak has occurred. Rabin fingerprint and fuzzy fingerprint techniques are designed, implemented, and evaluated to enhance data privacy during data-leak detection operations. They enable the data owner to securely delegate the content-inspection task to the DLD provider without exposing the sensitive data. With our detection method, the DLD provider, who is modelled as an honest-but-curious adversary, can gain only limited knowledge about the sensitive data from either the released digests or the content being inspected. Using the proposed techniques, an Internet service provider (ISP) can perform detection on its customers' traffic securely and provide data-leak detection as an add-on service for its clients. Alternatively, individuals can mark their own sensitive data and ask the administrator of their local network to detect data leaks for them.

There are two goals:

Security Goal: Detect inadvertent data leaks, in which sensitive data is accidentally leaked by a legitimate user. The DLD provider detects this type of leak over supervised network channels.

Privacy Goal: Prevent the DLD provider from gaining knowledge of the sensitive data during the detection process.

The DLD provider is semi-honest: it follows the proposed protocol to carry out the operations but may attempt to gain knowledge about the data owner's sensitive data. The DLD provider is given digests of the sensitive data from the data owner and the content of the network traffic to be examined.
From the detection viewpoint, a straightforward method is for the DLD provider to raise an alert whenever a sensitive fingerprint matches a fingerprint computed from the traffic. On a match, however, the DLD provider learns the corresponding shingle, because it knows the content of the packet. Therefore, the central challenge is to prevent the DLD provider from learning the sensitive values even in data-leak scenarios.

II. PRIOR WORK

There have been several advances in understanding the privacy requirements of security applications. This paper identifies the privacy needs in an outsourced data-leak detection service and provides a systematic solution to enable privacy-preserving DLD services. Shingles with Rabin fingerprints were previously used to identify similar spam messages in a collaborative setting, as well as for collaborative worm containment, virus scanning, and fragment detection. In comparison, we tackle the unique data-leak detection problem in an outsourced setting where the DLD provider is not fully trusted. GoCloudDLP is slightly different: it allows its customers to outsource the detection to a fully honest DLD provider. Our fuzzy fingerprint method differs from these solutions and enables its adopter to provide data-leak detection as a service. With our approach, the customer or data owner does not need to fully trust the DLD provider.

A Bloom filter is a space-saving data structure for set membership tests, and it is used in network security from the network layer to the application layer. The fuzzy Bloom filter constructs a special Bloom filter that probabilistically sets the corresponding filter bits to 1. Although it is designed to support a resource-sufficient routing scheme, it is a potential privacy-preserving technique.
We do not invent a variant of the Bloom filter for our fuzzy fingerprint, and our fuzzification process is separate from the membership test. The advantage of this separation is flexibility: one can test whether a fingerprint is sensitive with or without fuzzification. Besides fingerprint-based detection, other approaches can also be applied to data-leak detection.

III. PROPOSED SYSTEM

The operations include:

PREPROCESS: Run by the data owner to prepare the digests of sensitive data.
RELEASE: The data owner sends the preprocessed data digests to the DLD provider for detection.
MONITOR and DETECT: The DLD provider collects the outgoing traffic of the organization, calculates digests of the traffic content, and identifies potential leaks.
REPORT: The DLD provider returns data-leak alerts to the data owner; the alerts may contain false positives along with true positives.
POSTPROCESS: The data owner pinpoints the true data-leak instances.

The protocol is based on strategically computing data similarity, specifically the quantitative similarity between the sensitive information and the observed network traffic. High similarity indicates a potential data leak.

Figure 1: Data leak detection model

3.1 Shingles and Rabin Fingerprints

The DLD provider receives digests of sensitive data from the data owner. The data owner uses the Rabin fingerprint and fuzzy fingerprint algorithms to generate hard-to-reverse (i.e., one-way) digests through polynomial modulus operations. The data (sensitive data or network traffic) is first split into small fragments, which preserve its local features. Each fragment has a fixed size and is called a shingle (q-gram). A fragment of 3 elements is a 3-gram. Example: the data "abcdef" is divided into four fixed-size, overlapping fragments {abc, bcd, cde, def}. Since each fragment contains three elements, these are 3-grams.
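
The sliding-window fragmentation described above can be sketched in a few lines of Python. This is only an illustration: the function name `shingles` and its default `q = 3` are assumptions chosen to mirror the "abcdef" example, not part of the paper's implementation.

```python
def shingles(data: str, q: int = 3) -> list[str]:
    """Return all overlapping q-grams (shingles) of `data`, left to right."""
    if q <= 0:
        raise ValueError("q must be positive")
    # A window of width q slides one element at a time; inputs shorter
    # than q produce no shingles at all.
    return [data[i:i + q] for i in range(len(data) - q + 1)]


print(shingles("abcdef", 3))  # ['abc', 'bcd', 'cde', 'def']
```

A string of n elements yields n - q + 1 shingles, which is why "abcdef" (n = 6, q = 3) produces four fragments.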
Each fragment is treated as a polynomial q(x) built from its ASCII (American Standard Code for Information Interchange) values, and a fingerprint of each fragment is calculated. Each coefficient of q(x) is one bit of the fragment. For the example above, the fragment abc is represented in polynomial form as q(x), with coefficients taken from the bits of its ASCII encoding. Rabin fingerprints are computed by performing polynomial modulus operations and can be implemented with XOR, shift, and table look-up operations. The shingle-and-fingerprint approach can tolerate sensitive data modification to some extent, e.g., inserted tags, a small amount of character substitution, and lightly reformatted data.

In fingerprinting, each shingle is treated as a polynomial q(x). Each coefficient of q(x), i.e., c_i, is one bit in the shingle. q(x) is reduced modulo a selected irreducible polynomial p(x). The Rabin fingerprint is

$f = q(x) \bmod p(x) = (c_1 x^{k-1} + c_2 x^{k-2} + \dots + c_{k-1} x + c_k) \bmod p(x)$    (1)

where p(x) is an irreducible polynomial.

3.2 Fuzzy Fingerprint Protocol

Rabin fingerprints are passed through certain logical operations to obtain fuzzy fingerprints.

1) PREPROCESS: This operation is run by the data owner on each piece of sensitive data.
a) The data owner chooses four public parameters (q, p(x), p_f, M). q is the length of a shingle. p(x) is an irreducible polynomial used in the Rabin fingerprint. Each fingerprint is p_f bits long. M is a p_f-bit mask: the positions of 1's and 0's in M indicate the bits to preserve and to randomize in the fuzzification, respectively.
b) The data owner computes S, which is the set of all Rabin fingerprints of the piece of sensitive data.
c) The data owner transforms each fingerprint f ∈ S into a fuzzy fingerprint f* with randomized bits (specified by the mask M).
The procedure is as follows: for each f ∈ S, the data owner generates a random p_f-bit binary string ḟ, masks out the bits that must not be randomized with ḟ' = (NOT M) AND ḟ (1's in M indicate positions of bits not to randomize), and fuzzifies f with f* = f XOR ḟ'. Combining the two steps gives

f* = ((NOT M) AND ḟ) XOR f    (2)

In the worked example, the value of the fingerprint polynomial evaluates to 90679; the value of each other fingerprint is calculated similarly. All fuzzy fingerprints are collected and form the output of this operation, the fuzzy fingerprint set S*.

2) RELEASE: This operation is run by the data owner. The fuzzy fingerprint set S* obtained by PREPROCESS is released to the DLD provider for use in the detection, along with the public parameters (q, p(x), p_f, M). The data owner keeps S for use in the subsequent POSTPROCESS operation.

3) MONITOR: This operation is run by the DLD provider. The DLD provider monitors the network traffic T from the data owner's organization. It identifies TCP flows and extracts the content of each TCP session as T̃.

4) DETECT: This operation is run by the DLD provider on each T̃ as follows.
a) The DLD provider first computes the set of Rabin fingerprints of the traffic content T̃ based on the public parameters. The set is denoted as T.
b) The DLD provider tests whether each fingerprint f' ∈ T is also in S* using the fuzzy equivalence test

E(f', f*) = NOT (M AND (f' XOR f*))    (3)

E(f', f*) is all 1's if and only if f' and f* agree on every bit position preserved by M.
c) The DLD provider aggregates the outputs from the preceding step and raises alerts based on a threshold.

5) REPORT: If DETECT on T̃ yields an alert, the DLD provider reports the set of detected candidate leak instances T̂ to the data owner.

6) POSTPROCESS: After receiving T̂, the data owner tests every f' ∈ T̂ to see whether it is in S.
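
To make the protocol steps concrete, the following Python sketch walks through Equations (1) to (3) on toy data. It is a minimal illustration under stated assumptions, not the paper's implementation: the parameter values (`Q = 3` bytes, the degree-8 polynomial `0x11B`, which is the well-known irreducible polynomial of the AES field, `PF = 8`, and the mask `M = 0b11110000`), the sample strings, and all function names are hypothetical, and the threshold-based alert aggregation of DETECT step c) is omitted.

```python
import secrets

# Hypothetical public parameters (q, p(x), p_f, M), chosen small for readability.
Q = 3            # shingle length in bytes
P = 0x11B        # p(x) = x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)
PF = 8           # fingerprint length in bits (degree of p(x))
M = 0b11110000   # mask: 1 = bit is preserved, 0 = bit is randomized
ALL_ONES = (1 << PF) - 1


def rabin_fingerprint(shingle: bytes, p: int = P) -> int:
    """Equation (1): read the shingle as a bit polynomial and reduce it mod p(x)."""
    degree = p.bit_length() - 1
    f = 0
    for byte in shingle:
        for i in range(7, -1, -1):
            f = (f << 1) | ((byte >> i) & 1)  # append the next coefficient
            if f >> degree:                   # degree reached deg(p): subtract p (XOR)
                f ^= p
    return f


def fingerprints(data: bytes, q: int = Q) -> set[int]:
    """Rabin fingerprints of every q-byte shingle of `data`."""
    return {rabin_fingerprint(data[i:i + q]) for i in range(len(data) - q + 1)}


def fuzzify(f: int, mask: int = M) -> int:
    """Equation (2): f* = ((NOT M) AND r) XOR f, with r a random PF-bit string."""
    r = secrets.randbits(PF)
    return ((~mask & ALL_ONES) & r) ^ f


def fuzzy_equal(f_traffic: int, f_star: int, mask: int = M) -> bool:
    """Equation (3): E = NOT (M AND (f' XOR f*)); a match means E is all 1's."""
    e = ~(mask & (f_traffic ^ f_star)) & ALL_ONES
    return e == ALL_ONES


# PREPROCESS and RELEASE (data owner): S stays private, S* is released.
S = fingerprints(b"card 4111-1111")
S_star = {fuzzify(f) for f in S}

# MONITOR and DETECT (DLD provider): fingerprint the traffic, test against S*.
T = fingerprints(b"GET /?q=card 4111-1111 HTTP/1.1")
candidates = {f for f in T if any(fuzzy_equal(f, fs) for fs in S_star)}

# POSTPROCESS (data owner): keep only candidates that are exact members of S.
confirmed = candidates & S
print(len(T), len(candidates), len(confirmed))
```

Because only the bits selected by M are compared, the provider's candidate set generally contains false positives; with fingerprints this short they are frequent, which is why realistic deployments would use a much longer p(x). The data owner removes them in POSTPROCESS by exact comparison against S.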
Similarly, leaked images can be detected by applying the Rabin and fuzzy fingerprint methods. The data owner can create fingerprints for an image or for part of a large image. Since even a small image comprises a large number of pixels, each image generates a number of sentences. The ASCII value corresponding to each pixel is calculated to form the sentences. Each sentence is grouped into fragments of three elements, and each fragment is converted into polynomial form q(x), as for document data. The Rabin fingerprint is generated by performing the modulus operation. The fuzzy fingerprint of the image is generated by performing the same logical operations as for document data:

f* = ((NOT M) AND ḟ) XOR f

Finally, the fingerprints are sent to the data leak detector, which is responsible for finding data leaks, if any. The data leak detector captures the data in the network traffic of that particular data owner and converts it into fingerprints. The comparison operation yields true positives along with a number of false positives. Word documents return very few false positives, but the output for image documents contains a number of false positives.

IV. EXPERIMENTAL RESULT

Data-leak detection is performed, and the data owner is informed of which data were leaked. Privacy of the data is preserved by using the fingerprint techniques.

VI. CONCLUSION

A data-leak detection solution has been proposed. We designed, implemented, and evaluated a fingerprint technique that enhances data privacy during data-leak detection operations. It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data.
The security and privacy goals are achieved throughout the detection operation. Only the fingerprints of the data are sent to the detector to perform the detection operation. After performing the comparison operation, the data leak detector finds the potential data leaks. Word documents as well as images can be checked with this method.

VII. REFERENCES

[1] X. Shu and D. Yao, "Data leak detection as a service," in Proc. 8th Int. Conf. Secur. Privacy Commun. Netw., 2012, pp. 222–240.
[2] Risk Based Security. (Feb. 2014). Data Breach Quick-View: An Executive's Guide to 2013 Data Breach Trends. [Online]. Available: https://www.riskbasedsecurity.com/reports/2013DataBreachQuickView.pdf, accessed Oct. 2014.
[3] Ponemon Institute. (May 2013). 2013 Cost of Data Breach Study: Global Analysis. [Online]. Available: https://www4.symantec.com/mktginfo/whitepaper/053013_GL_NA_WP_Ponemon-2013-Cost-of-a-Data-Breach-Report_daiNA_cta72382.pdf, accessed Oct. 2014.
[4] Identity Finder. Discover Sensitive Data Prevent Breaches DLP Data Loss Prevention. [Online]. Available: http://www.identityfinder.com/, accessed Oct. 2014.
[5] K. Borders and A. Prakash, "Quantifying information leaks in outbound web traffic," in Proc. 30th IEEE Symp. Secur. Privacy, May 2009, pp. 129–140.
[6] H. Yin, D. Song, M. Egele, C. Kruegel, and E. Kirda, "Panorama: Capturing system-wide information flow for malware detection and analysis," in Proc. 14th ACM Conf. Comput. Commun. Secur., 2007, pp. 116–127.
[7] K. Borders, E. V. Weele, B. Lau, and A. Prakash, "Protecting confidential data on personal computers with storage capsules," in Proc. 18th USENIX Secur. Symp., 2009, pp. 367–382.
[8] A. Nadkarni and W. Enck, "Preventing accidental data disclosure in modern operating systems," in Proc. 20th ACM Conf. Comput. Commun. Secur., 2013, pp. 1029–1042.
[9] A. Kapravelos, Y. Shoshitaishvili, M. Cova, C. Kruegel, and G. Vigna, "Revolver: An automated approach to the detection of evasive web-based malware," in Proc. 22nd USENIX Secur. Symp., 2013, pp. 637–652.
[10] X. Jiang, X. Wang, and D. Xu, "Stealthy malware detection and monitoring through VMM-based 'out-of-the-box' semantic view reconstruction," ACM Trans. Inf. Syst. Secur., vol. 13, no. 2, 2010, p. 12.
[11] Matt Curtin, Kent Information Services, Inc. (Mar. 1997).
[12] G. Karjoth and M. Schunter, "A privacy policy model for enterprises," in Proc. 15th IEEE Comput. Secur. Found. Workshop, Jun. 2002, pp. 271–281.
[13] J. Jung, A. Sheth, B. Greenstein, D. Wetherall, G. Maganis, and T. Kohno, "Privacy oracle: A system for finding application leaks with black box differential testing," in Proc. 15th ACM Conf. Comput. Commun. Secur., 2008, pp. 279–288.
[14] Y. Jang, S. P. Chung, B. D. Payne, and W. Lee, "Gyrus: A framework for user-intent monitoring of text-based networked applications," in Proc. 23rd USENIX Secur. Symp., 2014, pp. 79–93.
[15] K. Xu, D. Yao, Q. Ma, and A. Crowell, "Detecting infection onset with behavior-based policies," in Proc. 5th Int. Conf. Netw. Syst. Secur., Sep. 2011, pp. 57–64.
[16] M. O. Rabin, "Fingerprinting by random polynomials," Dept. Math., Hebrew Univ. Jerusalem, Jerusalem, Israel, Tech. Rep. TR-15-81, 1981.
[17] A. Shabtai et al., A Survey of Data Leakage Detection and Prevention Solutions, SpringerBriefs in Computer Science, DOI 10.1007/978-1-4614-2053-8_2, © The Author(s) 2012.
[18] W. Stallings, Network and Internetwork Security: Principles and Practice (Vol. 1). Englewood Cliffs: Prentice Hall, 1995.
[19] Sandip A. Kale and S. V. Kulkarni, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 1, Issue 9, November 2012.
[20] Chandni Bhatt and Richa Sharma, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014, 2556-2558.
[21] B. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, 13(7):422-426, 1970.
[22] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, "Beyond Bloom filters: From approximate membership checks to approximate state machines," to appear in Proc. of SIGCOMM, 2006.

TO CITE THIS PAPER

Kumari, S. S. (2017) :: "Privacy Preserving Data Leak Detection," International Journal of Informative & Futuristic Research (ISSN: 2347-1697), Vol. 4 No. (5), January 2017, pp. 6411-6417, Paper ID: IJIFR/V4/E5/033