
Securing Data Warehouses from Web-Based Intrusions

2012, Lecture Notes in Computer Science


Ricardo Jorge Santos 1, Jorge Bernardino 2, Marco Vieira 3 and Deolinda M. L. Rasteiro 4

1,3 CISUC – DEI – FCTUC – University of Coimbra – Coimbra, Portugal
2 CISUC – DEIS – ISEC – Polytechnic Institute of Coimbra – Coimbra, Portugal
4 DFM – ISEC – Polytechnic Institute of Coimbra – Coimbra, Portugal
1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected]

Abstract. Decision support for 24/7 enterprises requires 24/7 available Data Warehouses (DWs). In this context, web-based connections to DWs are used by business management applications demanding continuous availability. Given that DWs store highly sensitive business data, a web-based connection provides a door for outside attackers and thus creates a major security issue. Database Intrusion Detection Systems (DIDS) deal with intrusions in databases. However, given the distinct features of DW environments, most DIDS either generate too many false alarms or too low intrusion detection rates. This paper proposes a real-time DIDS explicitly tailored for web-accessible DWs, functioning at the SQL command level as an extension of the DataBase Management System (DBMS), using an SQL-like rule set and predefined checkups on well-defined DW features, which enable wide security coverage. We also propose a risk exposure method for ranking alerts which is much more effective than alert correlation techniques.

Keywords: Database security, Web security, Intrusion detection, Data warehouses.

1 Introduction

Many business models using web-based infrastructures require continuous access to decision support means such as Data Warehouses (DWs). To ensure this kind of access, DWs need to be available at any time, from any location, through communication infrastructures such as the Internet. This creates a major security issue, since it provides a means for accessing their databases from outside the enterprise.

An intrusion is a set of actions that attempt to violate the integrity, confidentiality or availability of a system [8]. Automatic detection of intrusion actions in databases is the main goal of Database Intrusion Detection Systems (DIDS). Since DWs are the core of enterprise sensitive data, quickly detecting and responding to intrusions is critical. However, most DIDS applied to DWs typically yield either too low true intrusion detection rates (i.e., many false negatives) or too high false alarm rates (i.e., many false positives) [7, 8, 9]. In the first case, many intrusions pass undetected; in the second case, the number of generated alerts is frequently so large that checking them wastes vast amounts of time and limited resources, or they are simply too many to be checked [7, 8]. This jeopardizes the credibility and feasibility of the DIDS [5, 7, 8]. Given the well-defined features intrinsic to DW environments, we argue that they require specifically tailored DIDS. To the best of our knowledge, no such DIDS has been proposed.

Although alert correlation techniques [7, 10] have been proposed to decrease false positive rates, we also argue they are not the best choice for alert management in DW environments. These techniques filter alerts to determine those which present a higher probability of referring to a true intrusion, given a predefined threshold. Using a threshold implies that some alerts are discarded and thus there is always the risk that a true intrusion may pass undetected. Given the value of DW data, this is not advisable.
In our approach, we do not correlate or filter alerts, but instead measure their risk exposure (probability versus impact) to the enterprise. Rather than filtering alerts, our approach ranks all alerts by their potential cost to the enterprise, so that the most critical intrusions are dealt with first, instead of wasting time checking alerts with low impact for the enterprise. The main achievements and contributions of our work are:

- Our DIDS is the first tailored for web-accessible DWs, analyzing each user command both a priori and a posteriori of its execution;
- It is also the first to use risk exposure to increase alert management efficiency;
- We use an easy-to-understand declarative SQL-like form for defining rules at a fine-grain level for intrusion detection (ID) and response. Their flexibility covers a very large spectrum of possibilities, enabling the detection of and response to a wide range of intrusions;
- The DIDS works as an extension of any DBMS, adding real-time ID and response management to the native database server;
- It can be easily implemented and used in any web-accessed DW, acting transparently at the application layer between the DW user applications and the database.

The remainder of this paper is structured as follows. In Section 2 we present our proposal, describing its architecture and each of its components, and explaining how intrusion detection and response are managed. Section 3 presents the experimental evaluation of our solution. Section 4 discusses related work on DIDS. Finally, in Section 5 we present our conclusions and future work.

2 Data Warehouse Database Intrusion Detection System

Figure 1 shows the typical user action flow in a web-accessible DW, while Figure 2 shows the DIDS architecture, working as an extension of the DBMS.

Fig. 1. Typical user action flow in a web-accessible Data Warehouse

Fig. 2. The conceptual architecture of the DIDS for DWs

The sequence of intrusion detection steps is labeled in Figure 2 and described as follows. A user requests an action through a Web Application Server, and the request arrives at the DBMS for execution (step 1). Before executing it, the Command Analyzer retrieves the command text, date/time, and user/IP identification (step 2), parses the command, splits it into the ID features, and passes all information to the Intrusion Detector (step 3). This component then gets the statistical model values for all features from the DW-IDE Database (step 4) and applies the ID algorithms (explained in subsection 2.2) to decide whether the command is a potential intrusion. The detector then passes all information (features and the respective intrusion detection results) to the Intrusion Response Manager (IRM) (step 5). Given each feature's verdict on whether the user's action is an intrusion, the IRM retrieves the probability and impact rules, evaluates the action's risk exposure and generates the resulting alert (step 6), stores the data concerning the alert and the feature(s) that generated it in the DW-IDE Database for future reference, takes the appropriate actions to deal with the potential intrusion through the DBMS, and notifies the DW Security Administrator (step 7). The IRM takes action by committing the command's execution in the DBMS (in case it has been considered a non-intrusion), or by suspending or killing its execution, or killing the user session, either automatically or on request of the DW Security Administrator (step 7).

If the user action is not considered an intrusion, the IRM simply updates the feature's statistics in the DW-IDE Database (step 7) without notifying the DW Security Administrator, and notifies the DBMS to normally execute the command against the DW Database(s) (step 8). After the user command has been computed (step 9), its response is analyzed by the Response Analyzer before being returned to the interface that requested it (step 10): the Response Analyzer extracts the response features and passes them to the Intrusion Detector (step 11), which repeats step 4 to detect possible intrusion actions for each response feature. The Intrusion Detector then passes the information to the IRM, which repeats steps 5, 6 and 7 as steps 13, 14 and 15. Finally, depending on whether the IRM accepts the response or considers it an intrusion, the computed results are respectively either sent back to the user interface that requested them or eliminated (step 16).
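To make the a priori / a posteriori flow described above more concrete, the following minimal, self-contained Python sketch mimics the interception of a user command. It is purely illustrative: the function names, the tiny feature subset and the trivial decision logic are our own assumptions, not the paper's implementation or any DBMS API.

# Illustrative skeleton of the DIDS interception flow (all names are ours).
def extract_features(sql_text):
    # Steps 2-3: a tiny subset of the command-level features of Table 2.
    text = sql_text.upper()
    return {"CommandLength": len(sql_text),
            "#Union": text.count("UNION"),
            "#Or": text.count(" OR ")}

def is_suspicious(features):
    # Step 4 placeholder: the real system runs per-feature statistical tests.
    return [name for name, value in features.items()
            if name == "#Union" and value > 0]

def respond(suspect_features):
    # Steps 5-7 placeholder: risk exposure ranking and response rules.
    return "TerminateCommand" if suspect_features else "Execute"

def handle_command(sql_text, execute):
    # A priori analysis (steps 1-7).
    if respond(is_suspicious(extract_features(sql_text))) != "Execute":
        return None                    # command blocked, administrator notified
    response = execute(sql_text)       # steps 8-9: normal execution
    # The a posteriori analysis of the response (steps 10-16) would follow here.
    return response

print(handle_command("SELECT name FROM t UNION SELECT pass FROM u",
                     execute=lambda q: "rows..."))   # blocked -> None
print(handle_command("SELECT SUM(amount) FROM sales",
                     execute=lambda q: "rows..."))   # executed -> rows...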
2.1 Risk Exposure Assessment

Given a user action, risk exposure is a function of both the probability of that action being an intrusion and the impact it may have, i.e., the potential magnitude of the cost to the enterprise related to the damage or disclosure of the sensitive data the action affects. Risk analysis consists of ranking the alerts by their computed risk exposure, according to a matrix similar to Table 1.

Table 1. The risk exposure matrix

                           Probability
Impact         Very Low    Low         High        Very High
Very High      High        High        Very High   Critical
High           Low         High        High        Very High
Low            Very Low    Low         High        High
Very Low       Very Low    Very Low    Low         High

To define which responses should be taken given the risk exposure matrix, the DW Security Administrator may define rules such as the following:

GIVEN RISK EXPOSURE AS Low|Medium|High|Critical
ON FEATURE {FeatureName1, FeatureName2, ...}, AllFeatures
TAKE ACTION {DoNothing, Alert, PauseCommand, TerminateCommand, KillSession}

The definition of the probability and impact rules that make up the risk exposure assessment depends on the chosen intrusion detection features and on the sensitive data assessment made by the DW Security Administrator, and is explained in the next subsections. All risk exposure, probability and impact rules are stored in the DW-IDE Database and used by the Intrusion Response Manager (IRM), as explained previously.
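As an illustration of how the matrix in Table 1 and the response rules above can be evaluated, the following Python sketch encodes the matrix as a lookup table and applies one hypothetical response rule. The dictionaries, the chosen actions and the function names are ours, not part of the proposed rule language.

# Minimal sketch of risk exposure lookup and response (names are ours).
RISK_MATRIX = {  # (impact, probability) -> risk exposure, from Table 1
    ("VeryHigh", "VeryLow"): "High",    ("VeryHigh", "Low"): "High",
    ("VeryHigh", "High"): "VeryHigh",   ("VeryHigh", "VeryHigh"): "Critical",
    ("High", "VeryLow"): "Low",         ("High", "Low"): "High",
    ("High", "High"): "High",           ("High", "VeryHigh"): "VeryHigh",
    ("Low", "VeryLow"): "VeryLow",      ("Low", "Low"): "Low",
    ("Low", "High"): "High",            ("Low", "VeryHigh"): "High",
    ("VeryLow", "VeryLow"): "VeryLow",  ("VeryLow", "Low"): "VeryLow",
    ("VeryLow", "High"): "Low",         ("VeryLow", "VeryHigh"): "High"}

# One hypothetical mapping in the spirit of
# "GIVEN RISK EXPOSURE AS ... ON FEATURE ... TAKE ACTION ...".
ACTIONS = {"VeryLow": "DoNothing", "Low": "Alert", "High": "Alert",
           "VeryHigh": "PauseCommand", "Critical": "KillSession"}

def take_action(impact, probability):
    return ACTIONS[RISK_MATRIX[(impact, probability)]]

print(take_action("VeryHigh", "VeryHigh"))   # -> KillSession
print(take_action("Low", "Low"))             # -> Alert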
2.2 Intrusion Detection and Response Management

Intrusion Detection. Given the distinctive assumptions for typical web-accessible DWs [4], Table 2 defines the relevant ID features from a usability perspective. As shown, several features group values per user/IPAddress, others refer to values per command of each user/IPAddress, and the remaining ones are grouped by each session of each user/IPAddress. This allows testing features at different grouping levels (per user / per user session / per SQL command) and thus widens the detection scope. Our approach adjusts a probabilistic distribution to each feature {F1, ..., F29} for each user, from observations (feature values) collected during an initial training stage. To obtain those observations, we suppose the existence of an "intrusion-free" database command log. By executing that log's user commands we extract the values, i.e., the observations for building each feature's statistical distribution. Statistical adjustment tests are then performed to obtain each population's distribution.

Table 2. Intrusion detection features

Features per User/IPAddress
F1        #ConsFailedLoginAttempts – Number of consecutive failed database login attempts by a UserID or from an IPAddress (accumulated or in a given timespan)
F2        #SimultSQLSessions – Number of active simultaneous database connections
F3        #UnauthorAccessAttempts – Number of consecutive user requests to execute an unauthorized action (e.g., a request to modify data when the database is read-only, or to query data to which the user does not have access privileges)

Features per User/IPAddress per Command
F4        CPUTime – CPU time spent by the DBMS to process the command
F5        ResponseSize – Size (in bytes) of the result of the command's execution
F6, F7    #ResponseLines, #ResponseColumns – Number of lines and columns in the result of the command's execution
F8, F9    #ProcessedRows, #ProcessedColumns – Number of rows and columns accessed for processing the command
F10       CommandLength – Number of characters in the command
F11       #GroupBy – Number of GROUP BY columns
F12       #Union – Number of UNION clauses
F13…F17   #Sum, #Max, #Min, #Avg, #Count – Number of SUM, MAX, MIN, AVG and COUNT functions
F18, F19  #And, #Or – Number of AND and OR operators in the command's WHERE clause(s)
F20       #LiteralValues – Number of literal values in the command's WHERE clause(s)

Features per User/IPAddress per Session
F21       #GroupBy – Number of GROUP BY columns in all SELECT statements, per session
F22       #Union – Number of UNION clauses in all SELECT statements, per session
F23…F27   #Sum, #Max, #Min, #Avg, #Count – Number of appearances of SUM, MAX, MIN, AVG and COUNT functions in all commands, per session
F28       TimeBetwCommands – Time period (in seconds) between execution of commands, per session
F29       #SimultaneousCommands – Number of commands simultaneously executing, per session

For each active user session, we gather each new value generated for each feature and build sample sets. To detect an intrusion, statistical tests are performed: given each feature's original population, a new sample set is built by joining that population with the user session's sample set for that feature, and new statistical adjustment tests are performed to fit a probability distribution to this combined data collection. We then test whether the new distribution matches the feature's original one (H0), using Chi-square, Kolmogorov-Smirnov or Shapiro-Wilk tests, all performed at a 5% significance level. Whenever a test decision for a certain feature rejects the equality of distributions (H0), we consider the user action a probable intrusion.

Defining risk probability and impact. To determine each feature's individual importance in the overall intrusion detection process (which is directly related to its risk probability), we attribute a weight to it. To compute this weight, we assume a priori that every feature has the same relevance; the weights are then incrementally self-calibrated using the respective True Positives TP (i.e., alerts generated by the feature that were confirmed as true intrusions) and False Positives FP (i.e., confirmed false alarms). For each feature Fi, its weight Wi is given by:

Wi = 0.5 + ((TPi - FPi) / (TPi + FPi)) / 2     (1)

where TPi and FPi are the total numbers of TP and FP, respectively, over all alerts generated by feature Fi. Every time an intrusion alert generated by a given feature Fi is checked, the feature has its TP or FP count updated depending on whether the alert refers to a true intrusion or a false alarm and, consequently, its weight Wi is accordingly increased or decreased. Thus, the self-calibration works smoothly, giving higher importance to the features that are more accurate.
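A minimal sketch of this self-calibration, directly implementing Eq. (1), is shown below. The class and method names are ours, and starting from a neutral weight of 0.5 before any confirmed alert is our assumption (Eq. (1) is undefined while TPi + FPi = 0).

# Per-feature weight self-calibration following Eq. (1); names are ours.
class FeatureWeight:
    def __init__(self, name):
        self.name = name
        self.tp = 0   # alerts from this feature confirmed as true intrusions
        self.fp = 0   # alerts from this feature confirmed as false alarms

    def record_checked_alert(self, was_true_intrusion):
        if was_true_intrusion:
            self.tp += 1
        else:
            self.fp += 1

    def weight(self):
        total = self.tp + self.fp
        if total == 0:
            return 0.5                                   # a priori: equal relevance
        return 0.5 + ((self.tp - self.fp) / total) / 2   # Eq. (1), always in [0, 1]

f_union = FeatureWeight("#Union")
for confirmed in (True, True, False, True):   # four hypothetical checked alerts
    f_union.record_checked_alert(confirmed)
print(f_union.weight())   # 0.5 + ((3 - 1) / 4) / 2 = 0.75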
To define the probability of each intrusion alert given the feature that generated it, our approach allows defining rules with the following syntax (values separated by | are alternatives to choose from, while clauses in brackets are optional):

DEFINE PROBABILITY AS None|VeryLow|Low|High|VeryHigh
ON FEATURE {FeatureName1, FeatureName2, ...}, AllFeatures
[WHERE {List of filtering conditions}]
[WHEN {List of time-based conditions}]

Using this rule syntax, the intrusion probability of each feature Fi is defined from its weight Wi as:

DEFINE PROBABILITY AS VeryLow ON FEATURE Fi WHERE Weight(Fi) <= 0.25
DEFINE PROBABILITY AS Low ON FEATURE Fi WHERE Weight(Fi) > 0.25 AND Weight(Fi) <= 0.50
DEFINE PROBABILITY AS High ON FEATURE Fi WHERE Weight(Fi) > 0.50 AND Weight(Fi) <= 0.75
DEFINE PROBABILITY AS VeryHigh ON FEATURE Fi WHERE Weight(Fi) > 0.75

The assessment of the impact caused by a user action is based on which, how much, and when sensitive data can be exposed or damaged by the user command, as well as on who the user is. It is managed by using the following rules:

DEFINE IMPACT AS VeryLow|Low|High|VeryHigh
ON FEATURE {FeatureName1, FeatureName2, ...}, AllFeatures
[WITH COLUMNS {Column1, Column2, ...}, AllColumns]
[WHERE {List of filtering conditions}]
[WHEN {List of time-based conditions}]
[JOINED WITH {Column1, Column2, ...}, AllColumns]

The clauses are used in a similar manner to those in the probability rules, plus the clause distinguishing the user command (ON COMMAND) and the clause defining the impact of two or more columns being processed or shown together (WITH COLUMNS). The WHERE clauses in the DIDS rules allow, as standard SQL WHERE clauses do, a wide range of definitions and, due to lack of space, are not detailed here. We just wish to make clear that the IRM algorithms can easily be adapted to cope with a wide range of rule possibilities, providing a very wide ID scope.
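To illustrate the per-feature statistical check described at the beginning of this subsection, the following Python sketch compares a feature's "intrusion-free" training observations with the values observed in the current session. It is our simplification: a two-sample Kolmogorov-Smirnov test at the 5% significance level replaces the full distribution-fitting procedure described above, and the sample data are synthetic.

# Simplified per-feature statistical check (our simplification of Sect. 2.2).
import numpy as np
from scipy import stats

SIGNIFICANCE = 0.05   # 5% level, as in the paper

def feature_flags_intrusion(baseline_values, session_values):
    # H0: the session values follow the feature's training distribution;
    # rejecting H0 marks the user action as a probable intrusion.
    result = stats.ks_2samp(baseline_values, session_values)
    return result.pvalue < SIGNIFICANCE

rng = np.random.default_rng(0)
baseline = rng.normal(loc=3.0, scale=1.0, size=500)     # e.g. #GroupBy values
usual_session = rng.normal(loc=3.0, scale=1.0, size=30)
odd_session = rng.normal(loc=9.0, scale=1.0, size=30)   # unusually large values

print(feature_flags_intrusion(baseline, usual_session))  # typically False
print(feature_flags_intrusion(baseline, odd_session))    # True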
3 Experimental Evaluation

We used the TPC-H benchmark [11] to build a 1 GB DW using the Oracle 11g DBMS on a Pentium 2.8 GHz machine with 2 GB of SDRAM (512 MB of which dedicated to the database server), in a scenario with ten open web connections to the DW, of which 2 belong to "intruders" and 8 to "true" DW users (non-intruders). For each "true" DW user's workload, a set of randomly chosen TPC-H benchmark queries was selected, i.e., each user has different queries to execute, as well as a distinct number of queries. In each workload, several queries were randomly picked to have their parameters randomly modified, in order to obtain a larger scope of diverse user actions. Each workload also included a random number of random queries (randomly picking a set of tables, columns, functions to execute, grouping and sorting columns, and literal restrictions for the columns included in the WHERE clauses). The TPC-H queries represent typical reporting behavior, while ad hoc queries were simulated by the random queries, in smaller number. To build the statistical models for each feature of each "true" user, we executed each user's workload 50 times. To build each "intruder" workload, we generated 200 random intrusion queries of several types: SQL injection tautologies; login/password guessing; inserting, changing or deleting a random number of rows; selecting a random number of columns and a random number of functions (MAX, SUM, etc.) from a random number of tables, with and without a random number of grouping columns, and with and without range value restrictions; SQL union queries with a random number of columns and tables; query flooding; and unauthorized actions (CREATE, DROP, etc.). These intrusion queries represent a wide variety of attacks.

The TPC-H benchmark holds approximately seven years of business data. We consider intrusion actions on the data from the most recent year to have very high impact, on the data from the two previous years to have high impact, on the data from the two years before that to have low impact, and on the remaining data to have very low impact.

Table 3 shows the ID results. Figure 3 shows that the TP rate is considerably high (89%) while the FP rate is relatively low (5%), with an absolute number of 48 false alarms in a total of 225 generated alerts. Observing Table 3, the absolute number of false negatives is also relatively low (23 in a total of 995 actions classified as non-intrusions). The approach's precision is considerable (79%) and its accuracy is high (94%). Observing Table 4, the most relevant alerts (very high and critical) represent approximately one third of all alerts; these should be the first to deserve attention from the security staff, instead of wasting time checking the remaining alerts (two thirds of the total), given that they present smaller impact. Finally, we measured an average overhead of 20% on user workload response time due to running the DIDS detection algorithms.

Table 3. Experimental results for the generated alerts (absolute values)

# True user actions   # Intruder actions   #TP    #FP   #TN    #FN
1020                  200                  177    48    972    23

Table 4. Number of generated alerts per risk exposure measure

Very Low   Low   High   Very High   Critical   Total Number of Alerts
36         50    59     49          31         225

Fig. 3. True positive and false positive rates, precision and accuracy
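The rates shown in Fig. 3 can be reproduced directly from the absolute values in Table 3. The short Python check below is our own illustration, using the standard definitions of these measures.

# Reproducing the Fig. 3 measures from the absolute values of Table 3.
tp, fp, tn, fn = 177, 48, 972, 23

tp_rate = tp / (tp + fn)                     # 177 / 200   = 0.885 -> 89% in Fig. 3
fp_rate = fp / (fp + tn)                     # 48 / 1020   = 0.047 -> 5%
precision = tp / (tp + fp)                   # 177 / 225   = 0.787 -> 79%
accuracy = (tp + tn) / (tp + fp + tn + fn)   # 1149 / 1220 = 0.942 -> 94%

print(f"TP rate {tp_rate:.1%}, FP rate {fp_rate:.1%}, "
      f"precision {precision:.1%}, accuracy {accuracy:.1%}")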
4 Related Work

In [1], database transactions are defined by directed graphs describing the SQL command types used for malicious data access detection. This approach cannot handle ad hoc queries and works at the coarse-grained transaction level as opposed to a fine-grained query level. A Role Based Access Control mechanism for DIDS is proposed in [3]. Data mining techniques, namely classification and clustering, are applied to SQL instructions stored in database audit files to deduce role profiles of normal user behavior. A limitation of this approach is that it cannot extract correlations among queries in transactions. Moreover, since this solution is role-based, it works at a coarser-grained level than the user-based profiling in our approach. Detecting attacks by comparison, summarizing SQL statements into compact access patterns named fingerprints, is the focus of [5]. Profiling the data accessed by users to try to determine their intent is the approach used in [6], based on statistical learning algorithms. The authors argue that analyzing what the user is looking for (i.e., which data), instead of how s/he is looking for it (i.e., which SQL expressions), is more efficient for anomaly detection. In our paper, we integrate both these views. Data correlation using data mining or machine learning techniques is used in [2, 7, 8, 10].

5 Conclusions and Future Work

We have pointed out issues involving ID in web-accessible DWs and proposed a specific DIDS for these environments. Our DIDS works transparently between user applications and the database server as an extension of the DBMS itself. The SQL-like rule base allows extending DBMS data access policies and covers an extremely wide range of intrusion attacks. Risk exposure assessment is used for ranking and prioritizing the generated intrusion alerts, presenting clear advantages when compared with correlation techniques. Experimental results show that our approach achieves high efficiency and accuracy for the tested setup. As future work, we intend to test our approach in real-world DWs, namely in cloud environments.

References

1. Fonseca, J., Vieira, M., and Madeira, H., "Online Detection of Malicious Data Access Using DBMS Auditing", ACM Symposium on Applied Computing (SAC), 2008.
2. Hu, Y., and Panda, B., "A Data Mining Approach for Database Intrusion Detection", ACM Symposium on Applied Computing (SAC), 2004.
3. Kamra, A., Terzi, E., and Bertino, E., "Detecting Anomalous Access Patterns in Relational Databases", Springer VLDB Journal, 17, 2008.
4. Kimball, R., and Ross, M., The Data Warehouse Toolkit, 2nd Ed., Wiley & Sons, Inc., 2002.
5. Lee, S. Y., Low, W. L., and Wong, P. Y., "Learning Fingerprints for a Database Intrusion Detection System", European Symposium on Research in Computer Security (ESORICS), 2002.
6. Mathew, S., Petropoulos, M., Ngo, H. Q., and Upadhyaya, S., "A Data-Centric Approach to Insider Attack Detection in Database Systems", Recent Advances in Intrusion Detection (RAID), 2010.
7. Pietraszek, T., "Using Adaptive Alert Classification to Reduce False Positives in Intrusion Detection", Recent Advances in Intrusion Detection (RAID), 2004.
8. Srivastava, A., Sural, S., and Majumdar, A. K., "Database Intrusion Detection using Weighted Sequence Mining", Journal of Computers, Vol. 1, No. 4, 2006.
9. Treinen, J. J., and Thurimella, R., "A Framework for the Application of Association Rule Mining in Large Intrusion Detection Infrastructures", Recent Advances in Intrusion Detection (RAID), 2006.
10. Valdes, A., and Skinner, K., "Probabilistic Alert Correlation", Recent Advances in Intrusion Detection (RAID), 2001.
11. Transaction Processing Council, TPC Decision Support Benchmark H, www.tpc.org/tpch.