
Resilient Identity Crime Detection

2012, IEEE Transactions on Knowledge and Data Engineering

Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity crime. The existing nondata mining detection system of business rules and scorecards, and known fraud matching have limitations. To address these limitations and combat identity crime in real time, this paper proposes a new multilayered detection system complemented with two additional layers: communal detection (CD) and spike detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper resistant to synthetic social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing legal behavior, and remove the redundant attributes. Experiments were carried out on CD and SD with several million real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud detection, the concept of resilience, together with adaptivity and quality data discussed in the paper, are general to the design, implementation, and evaluation of all detection systems.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, JANUARY 201X

Clifton Phua, Member, IEEE, Kate Smith-Miles, Senior Member, IEEE, Vincent Lee, and Ross Gayler

Index Terms: data mining-based fraud detection, security, data stream mining, anomaly detection.
• C. Phua is with the Data Mining Department, Institute for Infocomm Research (I2R), Singapore. E-mail: see https://sites.google.com/site/cliftonphua/
• K. Smith-Miles and V. Lee are with Monash University, and R. Gayler is with Veda Advantage.
Digital Object Identifier 10.1109/TKDE.2010.262

1 INTRODUCTION

Identity crime is defined as broadly as possible in this paper. At one extreme, synthetic identity fraud refers to the use of plausible but fictitious identities. These are effortless to create but more difficult to apply successfully. At the other extreme, real identity theft refers to illegal use of innocent people's complete identity details. These can be harder to obtain (although large volumes of some identity data are widely available) but easier to apply successfully. In reality, identity crime can be committed with a mix of both synthetic and real identity details.

Identity crime has become prominent because there is so much real identity data available on the Web, and confidential data accessible through unsecured mailboxes. It has also become easy for perpetrators to hide their true identities. This can happen in a myriad of insurance, credit, and telecommunications fraud, as well as other more serious crimes. In addition, identity crime is prevalent and costly in developed countries that do not have nationally registered identity numbers.

Data breaches which involve lost or stolen consumers' identity information can lead to other frauds such as tax return, home equity, and payment card fraud. Consumers can incur thousands of dollars in out-of-pocket expenses. US law requires offending organisations to notify consumers, so that consumers can mitigate the harm. As a result, these organisations incur economic damage, such as notification costs, fines, and lost business [24].

Credit applications are Internet or paper-based forms with written requests by potential customers for credit cards, mortgage loans, and personal loans.
Credit application fraud is a specific case of identity crime, involving synthetic identity fraud and real identity theft.

As in identity crime, credit application fraud has reached a critical mass of fraudsters who are highly experienced, organised, and sophisticated [10]. Their visible patterns can differ from each other and constantly change. They are persistent because the financial rewards are high and the risk and effort involved are minimal. Based on anecdotal observations of experienced credit application investigators, fraudsters can use software automation to manipulate particular values within an application and increase the frequency of successful values.

Duplicates (or matches) refer to applications which share common values. There are two types of duplicates: exact (or identical) duplicates have all the same values; near (or approximate) duplicates have some same values (or characters), some similar values with slightly altered spellings, or both. This paper argues that each successful credit application fraud pattern is represented by a sudden and sharp spike in duplicates within a short time, relative to the established baseline level.

Duplicates are hard to avoid from the fraudsters' point of view because duplicates increase their success rate. The synthetic identity fraudster has a low success rate, and is likely to reuse fictitious identities which have been successful before. The identity thief has limited time because innocent people can discover the fraud early and take action, and will quickly use the same real identities at different places.

It will be shown later in this paper that many fraudsters operate this way with these applications and that their characteristic pattern of behaviour can be detected by the methods reported. In short, the new methods are based on white-listing and detecting spikes of similar applications. White-listing uses real social relationships on a fixed set of attributes.
This reduces false positives by lowering some suspicion scores. Detecting spikes in duplicates, on a variable set of attributes, increases true positives by adjusting suspicion scores appropriately.

Throughout this paper, data mining is defined as the real-time search for patterns in a principled (or systematic) fashion. These patterns can be highly indicative of early symptoms in identity crime, especially synthetic identity fraud [22].

1.1 Main challenges for detection systems

Resilience is the ability to degrade gracefully under most real attacks. The basic question asked of all detection systems is whether they can achieve resilience. To do so, the detection system trades off a small degree of efficiency (degrades processing speed) for a much larger degree of effectiveness (improves security by detecting most real attacks). In fact, any form of security involves trade-offs [26]. The detection system needs "defence-in-depth" with multiple, sequential, and independent layers of defence [25] to cover different types of attacks. These layers are needed to reduce false negatives. In other words, any successful attack has to pass every layer of defence without being detected.

The two greatest challenges for the data mining-based layers of defence are adaptivity and use of quality data. These challenges need to be addressed in order to reduce false positives.

Adaptivity accounts for morphing fraud behaviour, as the attempt to observe fraud changes its behaviour. But what is not obvious, yet equally important, is the need to also account for changing legal (or legitimate) behaviour within a changing environment.
In the credit application domain, changing legal behaviour is exhibited by communal relationships (such as rising/falling numbers of siblings) and can be caused by external events (such as the introduction of organisational marketing campaigns). This means legal behaviour can be hard to distinguish from fraud behaviour, but it will be shown later in this paper that they are indeed distinguishable from each other. The detection system needs to exercise caution with applications which reflect communal relationships. It also needs to make allowance for certain external events.

Quality data is highly desirable for data mining, and data quality can be improved through the real-time removal of data errors (or noise). The detection system has to filter duplicates which have been re-entered due to human error or for other reasons. It also needs to ignore redundant attributes which have many missing values or other issues.

1.2 Existing identity crime detection systems

There are non-data mining layers of defence to protect against credit application fraud, each with its unique strengths and weaknesses.

The first existing defence is made up of business rules and scorecards. In Australia, one business rule is the hundred-point physical identity check test which requires the applicant to provide sufficient point-weighted identity documents face-to-face. They must add up to at least one hundred points, where a passport is worth seventy points. Another business rule is to contact (or investigate) the applicant over the telephone or Internet. The above two business rules are highly effective, but human resource intensive. To rely less on human resources, a common business rule is to match an application's identity number, address, or phone number against external databases. This is convenient, but the public telephone and address directories, semi-public voters' register, and credit history data can have data quality issues of accuracy, completeness, and timeliness.
In addition, scorecards for credit scoring can catch a small percentage of fraud which does not look creditworthy; but they also remove outlier applications which have a higher probability of being fraudulent.

The second existing defence is known fraud matching. Here, known frauds are complete applications which were confirmed to have the intent to defraud and are usually recorded periodically into a blacklist. Subsequently, the current applications are matched against the blacklist. This has the benefit and clarity of hindsight because patterns often repeat themselves. However, there are two main problems in using known frauds. First, they are untimely due to long time delays, in days or months, for fraud to reveal itself, and be reported and recorded. This provides a window of opportunity for fraudsters. Second, recording of frauds is highly manual. This means known frauds can be incorrect [11], expensive, difficult to obtain [21], [3], and have the potential of breaching privacy.

In the real-time credit application fraud detection domain, this paper argues against the use of classification (or supervised) algorithms which use class labels. In addition to the problems of using known frauds, these algorithms, such as logistic regression, neural networks, or Support Vector Machines (SVM), cannot achieve scalability or handle the extremely imbalanced classes [11] in credit application data streams. As fraud and legal behaviour change frequently, the classifiers will deteriorate rapidly and the supervised classification algorithms will need to be retrained on the new data. But the training time is too long for real-time credit application fraud detection because the new training data has too many derived numerical attributes (converted from the original, sparse string attributes) and too few known frauds.
This paper acknowledges that in another domain, real-time credit card transactional fraud detection, there are the same issues of scalability, extremely imbalanced classes, and changing behaviour. For example, FairIsaac, a company renowned for its predictive fraud analytics, has been successfully applying supervised classification algorithms, including neural networks and SVM.

1.3 New data mining-based layers of defence

The main objective of this research is to achieve resilience by adding two new, real-time, data mining-based layers to complement the two existing non-data mining layers discussed in the previous subsection. These new layers will improve detection of fraudulent applications because the detection system can detect more types of attacks, better account for changing legal behaviour, and remove the redundant attributes.

These new layers are not human resource intensive. They represent patterns in a score where the higher the score for an application, the higher the suspicion of fraud (or anomaly). In this way, only the highest scores require human intervention. These two new layers, communal and spike detection, do not use external databases, but only the credit application database per se. And crucially, these two layers are unsupervised algorithms which are not completely dependent on known frauds but use them only for evaluation.

The main contribution of this paper is the demonstration of resilience, with adaptivity and quality data in real-time data mining-based detection algorithms. The first new layer is Communal Detection (CD): the whitelist-oriented approach on a fixed set of attributes.
To complement and strengthen CD, the second new layer is Spike Detection (SD): the attribute-oriented approach on a variable-size set of attributes.

The second contribution is the significant extension of knowledge in credit application fraud detection because publications in this area are rare. In addition, this research uses the key ideas from other related domains to design the credit application fraud detection algorithms.

The last contribution is the recommendation of credit application fraud detection as one of the many solutions to identity crime. Being at the first stage of the credit life cycle, credit application fraud detection also prevents some credit transactional fraud.

Section 2 gives an overview of related work in credit application fraud detection and other domains. Section 3 presents the justifications and anatomy of the CD algorithm, followed by the SD algorithm. Before the analysis and interpretation of CD and SD results, Section 4 considers the legal and ethical responsibility of handling application data, and describes the data, evaluation measures, and experimental design. Section 5 concludes the paper.

2 BACKGROUND

Many individual data mining algorithms have been designed, implemented, and evaluated in fraud detection. Yet until now, to the best of the researchers' knowledge, resilience of data mining algorithms in a complete detection system has not been explicitly addressed.

Much work in credit application fraud detection remains proprietary and exact performance figures are unpublished, so there is no way to compare the CD and SD algorithms against the leading industry methods and techniques. For example, [14] has ID Score-Risk which gives a combined view of each credit application's characteristics and their similarity to other industry-provided or Web identity's characteristics.
In another example, [7] has Detect which provides four categories of policy rules to signal fraud, one of which is checking a new credit application against historical application data for consistency.

Case-Based Reasoning (CBR) is the only known prior publication in the screening of credit applications [29]. CBR analyses the hardest cases which have been misclassified by existing methods and techniques. Retrieval uses thresholded nearest neighbour matching. Diagnosis utilises multiple selection criteria (probabilistic curve, best match, negative selection, density selection, and default) and resolution strategies (sequential resolution-default, best guess, and combined confidence) to analyse the retrieved cases. CBR has twenty percent higher true positive and true negative rates than common algorithms on credit applications.

The CD and SD algorithms, which monitor the significant increase or decrease in amount of something important (Section 3), are similar in concept to credit transactional fraud detection and bio-terrorism detection. Peer Group Analysis [2] monitors inter-account behaviour over time. It compares the cumulative mean weekly amount between a target account and other similar accounts (peer group) at subsequent time points. The suspicion score is a t-statistic which determines the standardised distance from the centroid of the peer group. On credit card accounts, the time window to calculate a peer group is thirteen weeks, and the future time window is four weeks. Break Point Analysis [2] monitors intra-account behaviour over time. It detects rapid spending or sharp increases in weekly spending within a single account. Accounts are ranked by the t-test. The fixed-length moving transaction window contains twenty-four transactions: the first twenty for training and the next four for evaluation on credit card accounts.

Bayesian networks [33] uncover simulated anthrax attacks from real emergency department data.
[32] surveys algorithms for finding suspicious activity in time for disease outbreaks. [9] uses time series analysis to track early symptoms of synthetic anthrax outbreaks from daily sales of retail medication (throat, cough, and nasal) and some grocery items (facial tissues, orange juice, and soup). Control-chart-based statistics, exponential weighted moving averages, and generalised linear models were tested on the same bio-terrorism detection data and alert rate [15].

The SD algorithm, which specifies how much the current prediction is influenced by past observations (subsection 3.3), is related to Exponentially Weighted Moving Average (EWMA) in statistical process control research [23]. In particular, like EWMA, the SD algorithm performs linear forecasting on the smoothed time series, and their advantages include low implementation and computational complexity. In addition, the SD algorithm is similar to change point detection in bio-surveillance research, which maintains a cumulative sum (CUSUM) of positive deviations from the mean [13]. Like CUSUM, the SD algorithm raises an alert when the score/CUSUM exceeds a threshold, and both detect change points faster as they are sensitive to small shifts from the mean. Unlike CUSUM, the SD algorithm weighs and chooses string attributes, not numerical ones.

3 THE METHODS

This section is divided into four subsections to systematically explain the CD algorithm (first two subsections) and the SD algorithm (last two subsections). Each subsection commences with a clear discussion of its purposes.

3.1 Communal Detection (CD)

This subsection motivates the need for CD and its adaptive approach.
Suppose there were two credit card applications that provided the same postal address, home phone number, and date of birth, but one stated the applicant's name to be John Smith, and the other stated the applicant's name to be Joan Smith. These applications could be interpreted in three ways:
1) it is a fraudster attempting to obtain multiple credit cards using near duplicated data;
2) there are twins living in the same house who are both applying for a credit card;
3) it is the same person applying twice, and there is a typographical error of one character in the first name.

With the CD layer, any two similar applications could be easily interpreted as (1) because this paper's detection methods use the similarity of the current application to all prior applications (not just known frauds) as the suspicion score. However, for this particular scenario, CD would also recognize these two applications as either (2) or (3) by lowering the suspicion score due to the higher possibility that they are legitimate.

To account for legal behaviour and data errors, Communal Detection (CD) is the whitelist-oriented approach on a fixed set of attributes. The whitelist, a list of communal and self relationships between applications, is crucial because it reduces the scores of these legal behaviours and false positives. Communal relationships are near duplicates which reflect the social relationships from tight familial bonds to casual acquaintances: family members, housemates, colleagues, neighbours, or friends [17]. The family member relationship can be further broken down into more detailed relationships such as husband-wife, parent-child, brother-sister, male-female cousin (or both male, or both female), as well as uncle-niece (or uncle-nephew, auntie-niece, auntie-nephew). Self-relationships highlight the same applicant as a result of legitimate behaviour (for simplicity, self-relationships are regarded as communal relationships).
Broadly speaking, the whitelist is constructed by ranking link-types between applicants by volume. The larger the volume for a link-type, the higher the probability of a communal relationship. On when and how the whitelist is constructed, please refer to Section 3.2, Step 6 of the CD algorithm.

However, there are two problems with the whitelist. First, there can be focused attacks on the whitelist by fraudsters when they submit applications with synthetic communal relationships. Although it is difficult to make definitive statements that fraudsters will attempt this, it is also wrong to assume that this will not happen. The solution proposed in this paper is to make the contents of the whitelist less predictable. The values of some parameters (different from an application's identity value) are automatically changed such that the whitelist's link-types also change. In general, tampering is not limited to hardware, but can also refer to manipulating software such as code. For our domain, tamper-resistance refers to making it more difficult for fraudsters to manipulate or circumvent data mining by providing false data.

Second, the volume and ranks of the whitelist's real communal relationships change over time. To make the whitelist exercise caution with (or be more adaptive to) changing legal behaviour, the whitelist is continually being reconstructed.

3.2 CD algorithm design

This subsection explains how the CD algorithm works in real-time by giving scores when there are exact or similar matches between categorical data; and in terms of its nine inputs, three outputs, and six steps.

This research focuses on one rapid and continuous data stream [19] of applications. For clarity, let $G$ represent the overall stream which contains multiple and consecutive $\{\ldots, g_{x-2}, g_{x-1}, g_x, g_{x+1}, g_{x+2}, \ldots\}$ Mini-discrete streams.
• $g_x$: current Mini-discrete stream which contains multiple and consecutive $\{u_{x,1}, u_{x,2}, \ldots, u_{x,p}\}$ micro-discrete streams.
• $x$: fixed interval of the current month, fortnight, or week in the year.
• $p$: variable number of micro-discrete streams in a Mini-discrete stream.

Also, let $u_{x,y}$ represent the current micro-discrete stream which contains multiple and consecutive $\{v_{x,y,1}, v_{x,y,2}, \ldots, v_{x,y,q}\}$ applications. The current application's links are restricted to previous applications within a moving window, and this window can be larger than the number of applications within the current micro-discrete stream.
• $y$: fixed interval of the current day, hour, minute, or second.
• $q$: variable number of applications in a micro-discrete stream.

Here, it is necessary to describe a single and continuous stream of applications as being made up of separate chunks: a Mini-discrete stream is long-term (for example, a month of applications); while a micro-discrete stream is short-term (for example, a day of applications). They help to specify precisely when and how the detection system will automatically change its configurations. For example, the CD algorithm reconstructs its whitelist at the end of the month and resets its parameter values at the end of the day; the SD algorithm does attribute selection and updates CD attribute weights at the end of the month. Also, for example, long-term previous average score, long-term previous average links, and average density of each attribute are calculated from data in a Mini-discrete stream; short-term current average score and short-term current average links are calculated from data in a micro-discrete stream.

With this data stream perspective in mind, the CD algorithm matches the current application against a moving window of previous applications.
It accounts for attribute weights which reflect the degree of importance in attributes. The CD algorithm matches all links against the whitelist to find communal relationships and reduce their link score. It then calculates the current application's score using every link score and previous application score. At the end of the current micro-discrete data stream, the CD algorithm determines the State-of-Alert (SoA) and updates one random parameter's value such that it trades off effectiveness with efficiency, or vice versa. At the end of the current Mini-discrete data stream, it constructs the new whitelist.

TABLE 1
Overview of Communal Detection (CD) algorithm

Inputs:
• $v_i$ (current application)
• $W$ number of $v_j$ (moving window)
• $\Re_{x,link-type}$ (link-types in current whitelist)
• $T_{similarity}$ (string similarity threshold)
• $T_{attribute}$ (attribute threshold)
• $\eta$ (exact duplicate filter)
• $\alpha$ (exponential smoothing factor)
• $T_{input}$ (input size threshold)
• SoA (State-of-Alert)

Outputs:
• $S(v_i)$ (suspicion score)
• Same or new parameter value
• New whitelist

CD algorithm:
• Step 1: Multi-attribute link [match $v_i$ against $W$ number of $v_j$ to determine if a single attribute exceeds $T_{similarity}$; and create multi-attribute links if near duplicates' similarity exceeds $T_{attribute}$ or an exact duplicate's time difference exceeds $\eta$]
• Step 2: Single-link score [calculate single-link score by matching Step 1's multi-attribute links against $\Re_{x,link-type}$]
• Step 3: Single-link average previous score [calculate average previous scores from Step 1's linked previous applications]
• Step 4: Multiple-links score [calculate $S(v_i)$ based on weighted average (using $\alpha$) of Step 2's link scores and Step 3's average previous scores]
• Step 5: Parameter's value change [determine same or new parameter value through SoA (for example, by comparing input size against $T_{input}$) at end of $u_{x,y}$]
• Step 6: Whitelist change [determine new whitelist at end of $g_x$]

Table 1 shows the data input, six most influential parameters, and two adaptive parameters.
• $v_i$: unscored current application. $N$ is its number of attributes. $a_{i,k}$ is the value of the $k$th attribute in application $v_i$.
• $W$: moving (or sliding) window of previous applications. It determines the short time search space for the current application. CD utilises an application-based window (such as the previous ten thousand applications). $v_j$ is the scored previous application. $a_{j,k}$ is the value of the $k$th attribute in application $v_j$.
• $\Re_{x,link-type}$: a set of unique and sorted link-types (in descending order by number of links), in the link-type attribute of the current whitelist. $M$ is the number of link-types.
• $T_{similarity}$: string similarity threshold between two values.
• $T_{attribute}$: attribute threshold which requires a minimum number of matched attributes to link two applications.
• $\eta$: exact duplicate filter at the link level. It removes links of exact duplicates from the same organisation within minutes, which are likely to be data errors by customers or employees.
• $\alpha$: exponential smoothing factor. In CD, $\alpha$ gradually discounts the effect of average previous scores as the older scores become less relevant.
• $T_{input}$: input size threshold. When the environment evolves significantly over time, the input size threshold $T_{input}$ may have to be manually adjusted.
• SoA (State-of-Alert): condition of reduced, same, or heightened watchfulness for each parameter.

Table 1 also shows the three outputs.
• $S(v_i)$: CD suspicion score of current application.
• Same or new parameter values for each parameter.
• New whitelist.

While Table 1 gives an overview of the CD algorithm's six steps, the details in each step are presented below.

Step 1: Multi-attribute link

The first step of the CD algorithm matches every current application's value against a moving window of previous applications' values to find links.
$$e_k = \begin{cases} 1 & \text{if } \text{Jaro-Winkler}(a_{i,k}, a_{j,k}) \ge T_{similarity} \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

where $e_k$ is the single-attribute match between the current value and a previous value. The first case uses Jaro-Winkler(.) [30], is case sensitive, and can also be cross-matched between current value and previous values from another similar attribute. The second case is a non-match because values are not similar.

$$e_{i,j} = \begin{cases} e_1 e_2 \ldots e_N & \text{if } T_{attribute} \le \sum_{k=1}^{N} e_k \le N-1 \text{ or } \left[\sum_{k=1}^{N} e_k = N \text{ and } Time(a_{i,k}, a_{j,k}) \ge \eta\right] \\ \varepsilon & \text{otherwise} \end{cases} \tag{2}$$

where $e_{i,j}$ is the multi-attribute link (or binary string) between the current application and a previous application. $\varepsilon$ is the empty string. The first case uses Time(.) which is the time difference in minutes. The second case has no link (empty string) because it is not a near duplicate, or it is an exact duplicate within the time filter.

Step 2: Single-link communal detection

The second step of the CD algorithm accounts for attribute weights, and matches every current application's link against the whitelist to find communal relationships and reduce their link score.

$$S(e_{i,j}) = \begin{cases} \left[\sum_{k=1}^{N} (e_k \times w_k)\right] \times w_z & \text{if } e_{i,j} \in \Re_{x,link-type} \text{ and } e_{i,j} \ne \varepsilon \\ \sum_{k=1}^{N} (e_k \times w_k) & \text{if } e_{i,j} \notin \Re_{x,link-type} \text{ and } e_{i,j} \ne \varepsilon \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $S(e_{i,j})$ is the single-link score. The term "single-link score" is adopted over "multi-attribute link score" to focus on a single link between two applications, not on the matching of attributes between them. The first case uses $w_k$ which is the attribute weight with default values of $\frac{1}{N}$, and $w_z$ which is the weight of the $z$th link-type in the whitelist. The second case is the greylist (neither in the blacklist nor whitelist) link score.
The last case is when there is no multi-attribute link.

Step 3: Single-link average previous score
The third step of the CD algorithm is the calculation of every linked previous application’s score for inclusion into the current application’s score. The previous scores act as the established baseline level.

$$\beta_j = \begin{cases} \dfrac{S(v_j)}{E_O(v_j)} & \text{if } e_{i,j} \ne \varepsilon \text{ and } E_O(v_j) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $\beta_j$ is the single-link average previous score. As there will be no linked applications, the initial values of $\beta_j = 0$ since $S(v_j) = 0$ and $E_O(v_j) = 0$. $S(v_j)$ is the suspicion score of a previous application to which the current application links. $S(v_j)$ was computed the same way as $S(v_i)$, since a previous application was once a current application. $E_O(v_j)$ is the number of outlinks from the previous application. The first case gives the average score of each previous application. The last case is when there is no multi-attribute link.

Step 4: Multiple-links score
The fourth step of the CD algorithm is the calculation of every current application’s score using every link and previous application score.

$$S(v_i) = \sum_{v_j \in K(v_i)} \left[ S(e_{i,j}) + \beta_j \right] \tag{5}$$

where $S(v_i)$ is the CD suspicion score of the current application. $K(v_i)$ is the set of previous applications within the moving window to which the current application links. Therefore, a high score is the result of strong links between the current application and the previous applications (represented by $S(e_{i,j})$), the high scores from linked previous applications (represented by $\beta_j$), and a large number of linked previous applications (represented by $\sum_{v_j \in K(v_i)} [\cdot]$).

$$S(v_i) = \sum_{v_j \in K(v_i)} \left[ (1 - \alpha) \times S(e_{i,j}) + \alpha \times \beta_j \right] \tag{6}$$

where Equation (6) incorporates $\alpha$ [6] into Equation (5).
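Equations (2) to (6) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Java implementation: the function names are hypothetical, the single-attribute matches $e_k$ of Equation (1) are assumed to be computed elsewhere, and the whitelist is assumed to be a plain mapping from link-type to weight. The example reuses the link-types and weights of the paper's sample whitelist (Table 3) and scores application 6 of Table 2 against applications 1, 2, and 5, with no previous scores available.

```python
from typing import Dict, List, Sequence, Tuple

EMPTY = ""  # stands in for the empty string (no multi-attribute link)


def multi_attribute_link(matches: Sequence[int], t_attribute: int,
                         minutes_apart: float, eta: float) -> str:
    """Equation (2): binary link string, or EMPTY when there is no link."""
    n, total = len(matches), sum(matches)
    near_duplicate = t_attribute <= total <= n - 1
    # Exact duplicates are kept only if they are at least eta minutes apart.
    exact_outside_filter = total == n and minutes_apart >= eta
    if near_duplicate or exact_outside_filter:
        return "".join(str(e) for e in matches)
    return EMPTY


def single_link_score(link: str, attr_weights: Sequence[float],
                      whitelist: Dict[str, float]) -> float:
    """Equation (3): weighted match count, scaled down if whitelisted."""
    if link == EMPTY:
        return 0.0
    base = sum(int(e) * w for e, w in zip(link, attr_weights))
    return base * whitelist[link] if link in whitelist else base


def average_previous_score(link: str, prev_score: float, outlinks: int) -> float:
    """Equation (4): previous application's score averaged over its outlinks."""
    if link != EMPTY and outlinks > 0:
        return prev_score / outlinks
    return 0.0


def cd_suspicion_score(linked: List[Tuple[float, float]], alpha: float) -> float:
    """Equation (6): smoothed sum over all linked previous applications."""
    return sum((1 - alpha) * s_link + alpha * beta for s_link, beta in linked)


# Illustrative inputs taken from the paper's Tables 2 and 3.
whitelist = {"010101": 0.25, "011111": 0.5, "011110": 0.75, "001110": 1.0}
weights = [1 / 6] * 6                    # default attribute weights 1/N
links = ["010101", "010101", "001110"]   # e_{1,6}, e_{2,6}, e_{5,6}
linked = [(single_link_score(link, weights, whitelist),
           average_previous_score(link, 0.0, 0)) for link in links]
score = cd_suspicion_score(linked, alpha=0.8)
print(round(score, 3))
```

Under these assumptions the two links of the most frequent link-type (010101) each contribute a reduced single-link score of 0.125, while the least frequent whitelisted link-type (001110) keeps its full score of 0.5, and the smoothed total is 0.15.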
Step 5: Parameter’s value change
At the end of the current micro-discrete data stream, the adaptive CD algorithm determines the State-of-Alert (SoA) and updates one random parameter’s value such that there is a trade-off between effectiveness and efficiency, or vice versa. This increases the tamper-resistance in parameters.

$$\text{SoA} = \begin{cases} \text{low} & \text{if } q \ge T_{input} \text{ and } \Omega_{x-1} \ge \Omega_{x,y} \text{ and } \delta_{x-1} \ge \delta_{x,y} \\ \text{high} & \text{if } q < T_{input} \text{ and } \Omega_{x-1} < \Omega_{x,y} \text{ and } \delta_{x-1} < \delta_{x,y} \\ \text{medium} & \text{otherwise} \end{cases} \tag{7}$$

where SoA is the state-of-alert at the end of every micro-discrete data stream. $\Omega_{x-1}$ is the long-term previous average score and $\Omega_{x,y}$ is the short-term current average score. $\delta_{x-1}$ is the long-term previous average links and $\delta_{x,y}$ is the short-term current average links. Collectively, these are termed output suspiciousness.

The first case sets SoA to low when input size is high and output suspiciousness is low. The adaptive CD algorithm trades off one random parameter’s effectiveness (degrades communal relationship security) for efficiency (improves computation speed). For example, a smaller moving window, fewer link-types in the whitelist, or a larger attribute threshold decreases the algorithm’s effectiveness but increases its efficiency.

Conversely, the second case sets SoA to high when its conditions are the opposite of the first case. The adaptive CD algorithm will trade off one random parameter’s efficiency (degrades speed) for effectiveness (improves security).

The last case sets SoA to medium. The adaptive CD algorithm will not change any parameter’s value.

Step 6: Whitelist change
At the end of the current Mini-discrete data stream, the adaptive CD algorithm constructs the new whitelist on the current Mini-discrete stream’s links. This increases the tamper-resistance in the whitelist.
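Steps 5 and 6 can be illustrated with a short Python sketch. It is a hedged reading rather than the authors' code: the function and argument names are hypothetical, $q$ is read as the input size of the current stream, and the whitelist weights follow the ranking rule the paper gives for its sample whitelist (most frequent link-type first, weights rising in steps of $1/M$). Two further assumptions are made because the paper does not specify them: ties in frequency are broken by first occurrence, and when fewer than $M$ distinct link-types exist, the number actually found is used as the divisor.

```python
from collections import Counter
from typing import Dict, Iterable


def state_of_alert(q: int, t_input: int,
                   prev_avg_score: float, cur_avg_score: float,
                   prev_avg_links: float, cur_avg_links: float) -> str:
    """Equation (7): low, high, or medium state-of-alert."""
    if (q >= t_input and prev_avg_score >= cur_avg_score
            and prev_avg_links >= cur_avg_links):
        return "low"      # large input, output no more suspicious than before
    if (q < t_input and prev_avg_score < cur_avg_score
            and prev_avg_links < cur_avg_links):
        return "high"     # small input, output more suspicious than before
    return "medium"       # no parameter value is changed


def build_whitelist(links: Iterable[str], m: int) -> Dict[str, float]:
    """Keep the m most frequent link-types and assign rank-based weights."""
    counts = Counter(link for link in links if link)  # skip empty links
    # Stable sort: equal counts stay in first-seen order (an assumption).
    ranked = sorted(counts.items(), key=lambda item: -item[1])[:m]
    size = len(ranked)
    return {link: (rank + 1) / size for rank, (link, _) in enumerate(ranked)}


# The five multi-attribute links generated from the paper's Table 2.
training_links = ["011111", "010101", "010101", "011110", "001110"]
new_whitelist = build_whitelist(training_links, m=4)
print(new_whitelist)
```

With the five training links from Table 2 and $M = 4$, this reproduces the link-type order and weights of the paper's sample whitelist.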
i or j   Given name   Family name   Unit no.   Street name     Home phone no.   Date of birth
1        John         Smith         1          Circular road   91234567         1/1/1982
2        Joan         Smith         1          Circular road   91234567         1/1/1982
3        Jack         Jones         3          Square drive    93535353         3/2/1955
4        Ella         Jones         3          Square drive    93535353         6/8/1957
5        Riley        Lee           2          Circular road   91235678         5/3/1983
6        Liam         Smyth         2          Circular road   91235678         1/1/1982

TABLE 2 Sample of 6 credit applications with 6 attributes

Table 2 provides a sample of 6 credit applications with 6 attributes, to show how communal relationships are extracted from credit applications. The whitelist is constructed from multi-attribute links generated from Step 1 of the CD algorithm on the training data. In our simple illustration, the CD algorithm is assumed to have the following parameter settings: $T_{similarity} = 0.8$, $T_{attribute} = 3$, and $M = 4$. If Table 2 is used as training data, five multi-attribute links will be generated: $e_{1,2} = 011111$, $e_{1,6} = 010101$, $e_{2,6} = 010101$, $e_{3,4} = 011110$, and $e_{5,6} = 001110$. These multi-attribute links capture communal relationships: John and Joan are twins, Jack and Ella are married, Riley and Liam are housemates, John and Joan are neighbours with Riley and Liam; and John, Joan, and Liam share the same birthday.

z   Link-type   No.   Weight
1   010101      2     0.25
2   011111      1     0.5
3   011110      1     0.75
4   001110      1     1

TABLE 3 Sample whitelist

Table 3 shows the sample whitelist constructed from credit applications in Table 2. A whitelist contains three attributes: the link-type, which is a unique link determined from aggregated links from training data; the corresponding number of links of this type; and the link-type weight. There will be many link-types, so the quantity of link-types is pre-determined by selecting the most frequent ones to be in the whitelist. Specifically, the link-types in the whitelist are processed in the following manner.
The link-types are first sorted in descending order by number of links. For the highest ranked link-type, the link-type weight starts at $1/M$. Each subsequent link-type weight is then incrementally increased by $1/M$, until the lowest ranked link-type weight is one. In other words, a higher ranked link-type is given a smaller link-type weight and is most likely a communal relationship.

3.3 Spike Detection (SD)
This subsection contrasts SD with CD; and presents the need for SD, in order to improve resilience and adaptivity.

Before proceeding with a description of Spike Detection (SD), it is necessary to reinforce that CD finds real social relationships to reduce the suspicion score, and is tamper-resistant to synthetic social relationships. It is the whitelist-oriented approach on a fixed set of attributes. In contrast, SD finds spikes to increase the suspicion score, and is probe-resistant for attributes. Probe-resistance reduces the chances a fraudster will discover attributes used in the SD score calculation. It is the attribute-oriented approach on a variable-size set of attributes. A side note: SD cannot use a whitelist-oriented approach because it was not designed to create multi-attribute links on a fixed-size set of attributes.

CD has a fundamental weakness in its attribute threshold. Specifically, CD must match at least three values for our dataset. With fewer than three matched values, our whitelist does not contain real social relationships because some values, such as given name and unit number, are not unique identifiers. The fraudster can duplicate one or two important values which CD cannot detect.

SD complements CD. The redundant attributes are either too sparse where no patterns can be detected, or too dense where no denser values can be found. The redundant attributes are continually filtered, and only selected attributes in the form of not-too-sparse and not-too-dense attributes are used for the SD suspicion score.
In this way, the exposure of the detection system to probing of attributes is reduced because only one or two attributes are adaptively selected.

Suppose a bank runs a marketing campaign offering attractive benefits for its new ladies’ platinum credit card. This will cause a spike in the number of legitimate credit card applications by women, which can be erroneously interpreted by the system as a fraudster attack. To account for the changing legal behaviour caused by external events, SD strengthens CD by providing attribute weights which reflect the degree of importance in attributes. The attributes are adaptive for CD in the sense that its attribute weights are continually determined. This addresses external events such as the entry of new organisations and exit of existing ones, and marketing campaigns of organisations which do not contain any patterns and are likely to cause three natural changes in attribute weights. These changes are volume drift where the overall volume fluctuates, population drift where the volume of both fraud and legal classes fluctuates independent of each other, and concept drift which involves changing legal characteristics that can become similar to fraud characteristics. By tuning attribute weights, the detection system makes allowance for these external events.

In general, SD trades off effectiveness (degrades security because it has more false positives without filtering out communal relationships and some data errors) for efficiency (improves computation speed because it does not match against the whitelist, and can compute each attribute in parallel on multiple workstations).
In contrast, CD trades off efficiency (degrades computation speed) for effectiveness (improves security by accounting for communal relationships and more data errors).

3.4 SD algorithm design
This subsection explains how the SD algorithm works in real-time with the CD algorithm, and in terms of its six inputs, two outputs, and five steps.

From the data stream point-of-view, using a series of window steps, the SD algorithm matches the current application’s value against a moving window of previous applications’ values. It calculates the current value’s score by integrating all steps to find spikes. Then, it calculates the current application’s score using all values’ scores and attribute weights. Also, at the end of the current Mini-discrete data stream, the SD algorithm selects the attributes for the SD suspicion score, and updates the attribute weights for CD.

Inputs
  $v_i$ (current application)
  $W$ number of $v_j$ (moving window)
  $t$ (current step)
  $T_{similarity}$ (string similarity threshold)
  $\theta$ (time difference filter)
  $\alpha$ (exponential smoothing factor)
Outputs
  $S(v_i)$ (suspicion score)
  $w_k$ (attribute weight)
SD algorithm
  Step 1: Single-step scaled counts [match $v_i$ against $W$ number of $v_j$ to determine if a single value exceeds $T_{similarity}$ and its time difference exceeds $\theta$]
  Step 2: Single-value spike detection [calculate current value’s score based on weighted average (using $\alpha$) of $t$ Step 1’s scaled matches]
  Step 3: Multiple-values score [calculate $S(v_i)$ from Step 2’s value scores and Step 4’s $w_k$]
  Step 4: SD attributes selection [determine $w_k$ for SD at end of $g_x$]
  Step 5: CD attribute weights change [determine $w_k$ for CD at end of $g_x$]

TABLE 4 Overview of Spike Detection (SD) algorithm

Table 4 shows the data input and five parameters.

• $v_i$: unscored current application (previously introduced in subsection 3.1).
• $W$: In SD, it is a time-based window (such as the previous ten days).
• $t$: current step, also the number of steps in $W$.
• $T_{similarity}$: string similarity threshold between two values (previously described in subsection 3.1).
• $\theta$: time difference filter at the link level. It is a simplified version of the exact duplicate filter.
• $\alpha$: In SD, it gradually discounts the effect of previous steps of each value as the older steps become less relevant.

Table 4 also shows the two outputs.

• $S(v_i)$: SD suspicion score of current application.
• $w_k$: In SD, each attribute weight is automatically updated at the end of the current Mini-discrete data stream.

While Table 4 gives an overview of the SD algorithm’s five steps, the details in each step are presented below.

Step 1: Single-step scaled count
The first step of the SD algorithm matches every current value against a moving window of previous values in steps.

$$a_{i,j} = \begin{cases} 1 & \text{if } \text{Jaro-Winkler}(a_{i,k}, a_{j,k}) \ge T_{similarity} \text{ and } \text{Time}(a_{i,k}, a_{j,k}) \ge \theta \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $a_{i,j}$ is the single-attribute match between the current value and a previous value. The first case uses Jaro-Winkler(.) [30], which is case sensitive, and can also be cross-matched between the current value and previous values from another similar attribute, and Time(.) which is the time difference in minutes. The second case is a non-match because the values are not similar, or recur too quickly.

$$s_\tau(a_{i,k}) = \frac{\sum_{a_{j,k} \in L(a_{i,k})} a_{i,j}}{\kappa} \tag{9}$$

where $s_\tau(a_{i,k})$ represents the scaled matches in each step (the moving window is made up of many steps) to remove volume effects. $L(a_{i,k})$ is the set of previous values within each step which the current value matches, and $\kappa$ is the number of values in each step.

Step 2: Single-value spike detection
The second step of the SD algorithm is the calculation of every current value’s score by integrating all steps to find spikes. The previous steps act as the established baseline level.

$$S(a_{i,k}) = (1 - \alpha) \times s_t(a_{i,k}) + \alpha \times \frac{\sum_{\tau=1}^{t-1} s_\tau(a_{i,k})}{t - 1} \tag{10}$$

where $S(a_{i,k})$ is the current value score.
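As a hedged illustration of Equations (9) and (10), the following Python sketch computes the scaled count for one step and then the current value score from a list of per-step scaled counts. The function names and the sample numbers are hypothetical. Returning zero for an empty step, and using a zero baseline when only one step exists (so that $t - 1 = 0$), are assumptions of this sketch that the paper does not state.

```python
from typing import Sequence


def scaled_count(match_flags: Sequence[int], kappa: int) -> float:
    """Equation (9): matches within one step, scaled by the step's size."""
    if kappa <= 0:
        return 0.0  # assumption: an empty step contributes nothing
    return sum(match_flags) / kappa


def value_score(step_counts: Sequence[float], alpha: float) -> float:
    """Equation (10): smooth the current step against earlier steps.

    step_counts holds s_1 ... s_t in time order; the last entry is the
    current step s_t and the earlier entries form the baseline.
    """
    current = step_counts[-1]
    previous = step_counts[:-1]
    # Assumption: with no earlier steps, the baseline is taken as zero.
    baseline = sum(previous) / len(previous) if previous else 0.0
    return (1 - alpha) * current + alpha * baseline


# Hypothetical value: 1 match in 10 values for three steps, then 4 in 10.
steps = [scaled_count([1] + [0] * 9, 10) for _ in range(3)]
steps.append(scaled_count([1, 1, 1, 1] + [0] * 6, 10))
score_value = value_score(steps, alpha=0.8)
print(round(score_value, 3))
```

With $\alpha = 0.8$, the current step contributes $0.2 \times 0.4$ and the baseline contributes $0.8 \times 0.1$, giving a value score of 0.16 for this made-up input.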
Step 3: Multiple-values score
The third step of the SD algorithm is the calculation of every current application’s score using all values’ scores and attribute weights.

$$S(v_i) = \sum_{k=1}^{N} S(a_{i,k}) \times w_k \tag{11}$$

where $S(v_i)$ is the SD suspicion score of the current application.

Step 4: SD attributes selection
At the end of every current Mini-discrete data stream, the fourth step of the SD algorithm selects the attributes for the SD suspicion score. This also highlights the probe-reduction of selected attributes.

$$w_k = \begin{cases} 1 & \text{if } \dfrac{1}{2 \times N} \le \dfrac{\sum_{i=1}^{p \times q} S(a_{i,k})}{i \times \sum_{k=1}^{N} w_k} \le \dfrac{1}{N} + \sqrt{\dfrac{1}{N} \times \sum_{k=1}^{N} \left( w_k - \dfrac{1}{N} \right)^2} \\ 0 & \text{otherwise} \end{cases} \tag{12}$$

where $w_k$ is the SD attribute weight applied to the SD attributes in Equation (11). The first case is the average density of each attribute, or the sum of all value scores within a Mini-discrete stream for one attribute, relative to all other applications and attribute weights. In addition, the first case retains only the best attributes’ weights within the lowerbound (half of default weight) and upperbound (default weight plus one standard deviation), by setting redundant attributes’ weights to zero.

Step 5: CD attribute weights change
At the end of every current Mini-discrete data stream, the fifth step of the SD algorithm updates the attribute weights for CD.

$$w_k = \frac{\sum_{i=1}^{p \times q} S(a_{i,k})}{i \times \sum_{k=1}^{N} w_k} \tag{13}$$

where $w_k$ is the SD attribute weight applied to the CD attributes in Equation (3).

Standalone CD assumes all attributes are of equal importance. The resilient combination of CD-SD means that CD is provided attribute weights by SD, and these attribute weights reflect degree of importance in attributes. This is how CD and SD scores are combined to give a single score.

4 EXPERIMENTAL RESULTS

4.1 Identity data - Real Application DataSet (RADS)
Substantial identity crime can be found in private and commercial databases containing information collected about customers, employees, suppliers, and rule violators. The same situation occurs in public and government-regulated databases such as birth, death, patient and disease registries; taxpayers, residents’ address, bankruptcy, and criminals lists.

To reduce identity crime, the most important textual identity attributes such as personal name, Social Security Number (SSN), Date-of-Birth (DoB), and address must be used. The following publications support this argument: [16] ranks SSN as most important, followed by personal name, DoB and address. [17] assigns highest weights to permanent attributes (such as SSN and DoB), followed by stable attributes (such as last name and state), and transient (or ever changing) attributes (such as mobile phone number and email address). [27] states that DoB, gender, and postcode can uniquely identify more than eighty percent of the United States (US) population. [12], [20] regard name, gender, DoB, and address as the most important attributes. The most important identity attributes differ from database to database. They are least likely to be manipulated, and are easiest to collect and investigate. They also have the least missing values, least spelling and transcription errors, and have no encrypted values.

Extra precaution had to be taken in this project since this is the first time, to the best of the researchers’ knowledge, that so much real identity data has been released for original credit application fraud detection research. Issues of privacy, confidentiality, and ethics were of prime concern. This real dataset was chosen because, at experimentation time, it had the most recent fraud behaviour. Although this real dataset cannot be made available, there is a synthetic dataset of fifty thousand credit applications which is available at https://sites.google.com/site/cliftonphua/communal-fraud-scoring-data.zip.

The specific summaries and basic statistics of the real credit application data are discussed below. For purposes of confidentiality, the application volume and fraud percentage in Figure 1 have been deliberately removed. Also, the average fraud percentage (known fraud percentage in all applications) and specific attributes for application fraud detection cannot be revealed.

There are thirteen months (m1 to m13) with several million applications (VedaAdvantage, 2006). Each day (d1 to d31) has more than ten thousand applications. This historical data is unsampled, time-stamped to the milliseconds, and modelled as data streams. Figure 1(a) illustrates that the detection system has to handle a more rapid and continuous data stream on weekdays than weekends.

Fig. 1. Real Application DataSet (RADS): (a) Daily application volume for two months; (b) Fraud percentage across months

There are about thirty raw attributes such as personal names, addresses, telephone numbers, driver licence numbers (or SSN), DoB, and other identity attributes (but no link attribute). Only nineteen of the most important identity attributes (I to XIX) are selected. All numerical attributes are treated as string attributes. Some of these identifying attributes, including names, were encrypted to preserve privacy. For our identity crime detection data, its encrypted attributes are limited to exact matching because the particular encryption method was not made known to us. But in a real application, homomorphic encryption [18] or unencrypted attributes would be used to allow string similarity matching.
Two further problems are the many missing values in some attributes, and hash collisions in encrypted attributes (different original values encrypted into the same encrypted value), but it is beyond the scope of this paper to present any solution.

The class imbalance is extreme, with less than one percent of known frauds in all binary class-labeled (as “fraud” or “legal”) applications. Figure 1(b) shows that known frauds are significantly understated in the provided applications. The main reason for fewer known frauds is having only eight months (m7 to m14) of known frauds linked to thirteen months of applications. Six months (m1 to m6) of known frauds were not provided. As a result, m6 to m10 appear to have the highest fraud percentage, although this is not actually the case. Other reasons include some frauds which were unlabeled, having been inadvertently overlooked. Some known frauds are labeled once but not their duplicates, while some organisations do not contribute known frauds.

Having fewer known frauds means that algorithms will produce poorer results and that evaluation can be incorrect. To reduce this negative impact and improve scalability, the data has been rebalanced by retaining all known frauds but randomly undersampling unknown applications by ninety percent.

There are multiple sources, consisting of thirty-one organisations (s1 to s31) that provided the applications. The top five of these organisations (s1 to s5) can be considered large (with at least ten thousand applications per month), and more important than others, because they contribute more income to the credit bureau. Each organisation contributes its own number and type of attributes.

The data quality was enhanced through the cleaning of two obvious data errors. First, a few organisations’ applications, comprising slightly more than ten percent of all applications, were filtered. This was because some important unstructured attributes were encrypted into just one value.
Also, several “dummy” organisations’ applications, comprising less than two percent of all applications, were filtered. They were actually test values particularly common in some months.

After the above data pre-processing activities, the actual experimental data provided significantly improved results. This was observed using the parameter settings in CD and SD (subsection 4.3). These results have been omitted to focus on the results from CD and SD parameter settings and attributes.

In addition, which are the training and test datasets? The CD, SD, and classification algorithms use eight consecutive months (m6 to m13) out of thirteen months data (each month is also known as a Mini-discrete stream in this paper) where known frauds are not significantly understated. For creating the whitelist, selecting attributes, or setting attribute weights in the next month, the training set is the previous month’s data. For evaluation, the test set is the current month’s data. Both training and test datasets are separate from each other. For example, in CD, the initial whitelist is constructed from m5 training data, applied to m6 test data; and so on, until the final whitelist is constructed from m12 training data, and applied to m13 test data.

4.2 Evaluation measure

             Known frauds   Unknowns
Alerts       tp             fp
Non-alerts   fn             tn

TABLE 5 Confusion matrix

Table 5 shows four main result categories for binary-class data with a given decision threshold. Alerts (or alarms) refer to applications with scores which exceed the decision threshold, and which are subjected to responses such as human investigation or outright rejection. Non-alerts are applications with scores lower than the decision threshold. $tp$, $fp$, $fn$, and $tn$ are the number of true positives (or hits), false positives (or false alarms), false negatives (or misses), and true negatives (or normals), respectively.
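To make the four categories of Table 5 concrete, the sketch below counts them for a given decision threshold and derives precision, recall, and the F-measure using their standard definitions. The function names, scores, and labels are made up for illustration and are not drawn from the paper's data. Treating a score exactly equal to the threshold as a non-alert, and returning zero when a denominator is zero, are assumptions of this sketch.

```python
from typing import Sequence, Tuple


def confusion_counts(scores: Sequence[float], is_fraud: Sequence[bool],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for one decision threshold."""
    tp = fp = fn = tn = 0
    for score, fraud in zip(scores, is_fraud):
        alert = score > threshold  # alerts are scores exceeding the threshold
        if alert and fraud:
            tp += 1
        elif alert:
            fp += 1
        elif fraud:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (zero if undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores and labels, purely for illustration.
scores = [0.9, 0.7, 0.4, 0.2, 0.1]
labels = [True, False, True, False, False]
tp, fp, fn, tn = confusion_counts(scores, labels, threshold=0.3)
print(tp, fp, fn, tn, round(f_measure(tp, fp, fn), 2))
```

For this toy input, three applications are alerts (two known frauds and one unknown), so precision is 2/3, recall is 1, and the F-measure is 0.8. Repeating the calculation over a grid of thresholds would trace out an F-measure curve.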
Measure                                       Description
precision                                     $\frac{tp}{tp + fp}$
recall, sensitivity                           $\frac{tp}{tp + fn}$
(1 - specificity)                             $\frac{fp}{fp + tn}$
F-measure curve                               $\frac{2 \times precision \times recall}{precision + recall}$ plotted against X thresholds
Receiver Operating Characteristic (ROC) curve All scores are ranked in descending order; and sensitivity versus (1 - specificity) plotted against X thresholds

TABLE 6 Evaluation measures

Table 6 briefly summarises useful evaluation measures for scores. This paper uses the F-measure curve [31] and Receiver Operating Characteristic (ROC) curve [8] with eleven threshold values from zero to one to compare all experiments. The additional thresholds are needed because F-measure curves seldom dominate one another over all thresholds.

The F-measure curve is recommended over other useful measures for the following reasons. First, for confidentiality reasons, precision-recall curves are not used as they will reveal true positives, false positives, and false negatives. Second, in imbalanced class data, ROC curves, AUC, and accuracy understate the false positive percentage because they use true negatives [5]. Third, being a single-value measure, NTOP-k [4] does not evaluate results for more than one threshold.

The ROC curve can be used to complement the F-measure curve because the former allows the reader to directly interpret if CD and SD are really reducing false positives. Reducing false positives is very important because staff costs for manual investigation, and valuable customers lost due to credit application rejection, are very expensive.

In this paper, the scores have two unique characteristics. First, the CD score distribution is heavily skewed to the left, while the SD score distribution is more skewed to the right.
Most scores are zero as values are usually sparse. All the zero scores have been removed since they are not relevant to decision making. This will result in more realistic F-measures, although the number of applications in each F-measure will most likely be different. Second, some scores can exceed one since each application can be similar to many others. In contrast, classifier scores from naive Bayes, decision trees, logistic regression, or SVM exhibit a normal distribution and each score is between zero and one.

4.3 CD and SD’s experimental design
All experiments were performed on a dedicated server with two Xeon Quad Core processors (8 × 2.0 GHz CPUs) and 12 GB RAM, running on the Windows Server 2008 platform. Communal and spike detection algorithms, as well as evaluation measures, were coded in Java. The application data was stored in a MySQL database.

The plan here is to process all real applications from RADS with the most influential parameters and their values. These influential parameters are known to provide the best results based on the experience from hundreds of previous experiments. However, the best results are also dependent on setting the right value for each influential parameter in practice, as some parameters are sensitive to a change in their value.

There are seven experiments which focus on specific claims in this paper: (1) No-whitelist, (2) CD-baseline, (3) CD-adaptive, (4) SD-baseline, (5) SD-adaptive, (6) CD-SD-resilient, and (7) CD-SD-resilient-best.

The first three experiments address how much the CD algorithm reduces false positives. The no-whitelist experiment uses zero link-types ($M = 0$) to avoid using the whitelist.
The CD-baseline experiment has the following parameter values (based on hundreds of previous CD experiments):

• $W$ = set to what is convenient for experimentation (for reasons of confidentiality, the actual $W$ cannot be given)
• $M = 100$
• $T_{similarity} = 0.8$
• $T_{attribute} = 3$
• $\eta = 120$
• $\alpha = 0.8$

In other words, the CD-baseline uses a whitelist with the one hundred most frequent link-types, and sets the string similarity threshold, attribute threshold, exact duplicate filter, and the exponential smoothing factor for scores.

To validate the usefulness of the adaptive CD algorithm’s changing parameter values, the CD-adaptive experiment has three parameters ($W$, $M$, $T_{similarity}$) whose values can be changed according to the State-of-Alert (SoA).

The fourth and fifth experiments show if the SD algorithm increases power. The next experiment, SD-baseline, has the following parameter values (based on hundreds of previous SD experiments):

• $N = 19$
• $t = 10$
• $T_{similarity} = 0.8$
• $\theta = 60$
• $\alpha = 0.8$

In other words, the SD-baseline uses all nineteen attributes, a moving window made up of ten window steps, and sets the string similarity threshold, time difference filter, and the exponential smoothing factor for steps. The SD-adaptive experiment selects the two best attributes for its suspicion score.

The last two experiments highlight how well the CD-SD combination works. The CD-SD-resilient experiment is actually CD-baseline which uses attribute weights provided by SD-baseline. To empirically evaluate the detection system, the final experiment is the CD-SD-resilient-best experiment with the best parameter setting below (without the adaptive CD algorithm’s changing parameter values):

• $W$ = set to what is expected to be used in practice
• $T_{similarity} = 1$
• $T_{attribute} = 4$
• SD attribute weights

4.4 CD and SD’s results and discussion

Fig. 2. F-measure curves of CD and SD experiments

The CD F-measure curves skew to the left.
The CD-related F-measure curves start from 0.04 to 0.06 at threshold 0, and peak from 0.08 to 0.25 at thresholds 0.2 or 0.3. On the other hand, the SD F-measure curves skew to the right.

Fig. 3. ROC curves of CD and SD experiments

Without the whitelist, the results are inferior. From Figure 2 at threshold 0.2, the no-whitelist experiment (F-measure below 0.09) performs poorer than the CD-baseline experiment (F-measure above 0.1). From Figure 3, the no-whitelist experiment has about 10% more false positives than the CD-baseline experiment. This verifies the hypothesis that the whitelist is crucial because it reduces the scores of such legal behaviour and the resulting false positives; also, the larger the volume for a link-type, the higher the probability of a communal relationship.

From Figure 2 at threshold 0.2, the CD-adaptive experiment (F-measure above 0.16) has a significantly higher F-measure than the CD-baseline experiment. From Figure 3, the CD-adaptive experiment has about 5% fewer false positives in the early part of the ROC curve than the CD-baseline experiment. The interpretation is that the most useful parameters are the moving window and the number of link-types. More importantly, the adaptive CD algorithm finds the balance between effectiveness and efficiency to produce significantly better results than the CD-baseline experiment. This empirical evidence suggests that there is tamper-resistance in parameters and the whitelist as some parameters’ values and the whitelist’s link-types are changed in a principled fashion.

From Figure 2 at threshold 0.7, the SD-adaptive experiment (F-measure around 0.1) has a significantly higher F-measure than the SD-baseline experiment. Also, the SD-adaptive experiment has almost the same F-measure as the CD-baseline experiment but at different thresholds. Since most attributes are redundant, the adaptive SD algorithm only needs to select the two best attributes for calculation of the suspicion score. This means that the adaptive SD algorithm on the two best attributes produces better results than the SD algorithm on all attributes, as well as similar results to the basic CD algorithm on all attributes.

Across thresholds 0.2 to 0.5, the CD-SD-resilient-best experiment (F-measure above 0.23) has an F-measure which is more than twice the CD-baseline experiment’s. This is the most apparent outcome of all experiments: the CD algorithm, strengthened by the SD algorithm’s attribute weights, and with the right parameter setting, delivers superior results, despite an extremely imbalanced class (at least for the given dataset). In addition, results from the CD-SD-resilient experiment support the view that SD attribute weights strengthen the CD algorithm; and resilience (CD-SD-resilience) is shown to be better than adaptivity (CD-adaptive and SD-adaptive).

Fig. 4. F-measure curves of CD-SD-resilient-best parameters

Extending the CD-SD-resilient-best experiment, Figure 4 shows the results of doubling the most influential parameters’ values. $W$ and $T_{attribute}$ have significant increases in F-measure over most thresholds, and $M$ has a slight increase at thresholds 0.2 to 0.4.

Results on the data support the argument that successful credit application fraud patterns are characterised by sudden and sharp spikes in duplicates. However, this result is based on some assumptions and conditions shown by the research to be critical for effective detection. A larger moving window and attribute threshold, as well as exact matching and the whitelist, must be used. There must also be tamper-resistance in parameters and the whitelist.
It is also assumed that SD attribute weights are used both for SD attribute selection (probe-reduction of attributes) and for changing the CD attribute weights. However, the results may be slightly inaccurate because of the encryption of some attributes and the significantly understated number of known frauds. Also, the solutions could not account for the effects of the existing defences - business rules and scorecards, and known fraud matching - on the results.

4.5 Drilled-down results and discussion

The CD-SD-resilient-best experiment shows that the CD-SD combination method works best for all thirty-one organisations as a whole. The same method may not work well for every organisation. Figure 5 shows the detailed breakdown of the top-5 organisations’ (s1 to s5) results from the CD-SD-resilient-best experiment. Similar to the CD-SD-resilient-best experiment, the top-5 organisations’ F-measure curves are skewed to the left.

Fig. 5. F-measure curves of top-5 organisations

Across thresholds 0.2 to 0.5, two organisations, s1 (F-measure above 0.22) and s3 (F-measure above 0.19), have F-measures comparable to that of the CD-SD-resilient-best experiment (F-measure above 0.23). In contrast, for the same thresholds, three organisations, s4 (F-measure above 0.16), s2 (F-measure above 0.08), and s5 (F-measure above 0.05), have significantly lower F-measures. However, in the CD-baseline experiment, for thresholds 0.2 to 0.5, s5 performs better than s4. This implies that most methods or parameter settings can work well for only some organisations.

4.6 Classifier-comparison experimental design, results, and discussion

Are classification algorithms suitable for the real-time credit application fraud detection domain?
To answer the above question, four popular classification algorithms with default parameters in WEKA [31] were chosen for classifier experiments. The algorithms were Naive Bayes (NB), C4.5 Decision Tree (DT), Logistic Regression (LoR), and Support Vector Machines (SVM), the last using the current state-of-the-art libSVM. A well-known data stream classification algorithm, Very Fast Machine Learner (VFML), which is a Hoeffding decision tree, was also used with default parameters in MOA [1]. They were applied to the same training and test data used by the CD and SD algorithms, and there was an extra step to convert the string attributes to word vector ones. The following experiments assume that ground truth is available at training time (see Section 1.2 for a description of the problems in using known frauds).

Classification algorithms are not the most accurate and scalable for this domain. Figure 6 compares the five classifiers against the CD-SD-resilient-best experiment with F-measure across eleven thresholds. Across thresholds 0.2 to 0.5, the CD-SD-resilient-best experiment’s F-measure can be several times higher than those of the five classifiers: NB (F-measure above 0.08), LoR (F-measure above 0.05), VFML (F-measure above 0.04), SVM and DT (F-measure above 0.03). Also, results did not improve from training the five classifiers on labeled multi-attribute links, and applying the classifiers to multi-attribute links in the test data.

Fig. 6. F-measure curves of five classification algorithms

TABLE 7. Relative time of five classification algorithms

Experiment(s)          Relative time
CD-SD-resilient-best   1
NB                     1.25
DT                     5
VFML                   18
LoR                    60
SVM                    156

Table 7 measures the relative time of the five classifiers using the CD-SD-resilient-best experiment as baseline. Time refers to the total system time for the algorithm to complete. The CD-SD-resilient-best experiment is up to two orders of magnitude faster than the classifier experiments because it does not need to train on many word vector attributes with few known frauds.
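The extra conversion step mentioned in this section, turning string attributes into word vector attributes, helps explain the training cost in Table 7: every distinct token becomes its own attribute. The sketch below illustrates the idea with a binary bag-of-words encoding in plain Python. It is an assumption about the general technique, not the filter or settings the authors used (the source does not name them), and the attribute names, helper names, and sample records are all hypothetical.

```python
from typing import Dict, List


def tokenize(value: str) -> List[str]:
    """Split a string attribute value into lower-case whitespace tokens."""
    return value.lower().split()


def build_vocabulary(records: List[Dict[str, str]]) -> Dict[str, int]:
    """Map each distinct 'attribute=token' pair in the training data to a column index."""
    vocabulary: Dict[str, int] = {}
    for record in records:
        for attribute, value in record.items():
            for token in tokenize(value):
                feature = f"{attribute}={token}"
                if feature not in vocabulary:
                    vocabulary[feature] = len(vocabulary)
    return vocabulary


def to_word_vector(record: Dict[str, str], vocabulary: Dict[str, int]) -> List[int]:
    """Encode one record as a binary vector; tokens unseen in training are dropped."""
    vector = [0] * len(vocabulary)
    for attribute, value in record.items():
        for token in tokenize(value):
            index = vocabulary.get(f"{attribute}={token}")
            if index is not None:
                vector[index] = 1
    return vector


if __name__ == "__main__":
    # Hypothetical applications; real data would have many more attributes and values.
    training_records = [
        {"given_name": "john", "street": "12 high street"},
        {"given_name": "joan", "street": "12 low street"},
    ]
    vocabulary = build_vocabulary(training_records)
    print(len(vocabulary), "word vector attributes from 2 string attributes")
    for record in training_records:
        print(to_word_vector(record, vocabulary))
```

Even in this two-record toy case, two string attributes expand into several binary attributes; on millions of applications the vocabulary grows with the number of distinct names and address tokens, which is the cost the classifiers pay and the CD-SD approach avoids.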
5 CONCLUSION

The main focus of this paper is Resilient Identity Crime Detection; in other words, the real-time search for patterns in a multi-layered and principled fashion, to safeguard credit applications at the first stage of the credit life cycle. This paper describes an important domain that has many problems relevant to other data mining research. It has documented the development and evaluation of the data mining layers of defence for a real-time credit application fraud detection system. In doing so, this research produced three concepts (or “force multipliers”) which dramatically increase the detection system’s effectiveness (at the expense of some efficiency). These concepts are resilience (multi-layer defence), adaptivity (accounts for changing fraud and legal behaviour), and quality data (real-time removal of data errors). These concepts are fundamental to the design, implementation, and evaluation of all fraud detection, adversarial-related detection, and identity crime-related detection systems.

The implementation of the CD and SD algorithms is practical because these algorithms are designed for actual use to complement the existing detection system. Nevertheless, there are limitations. The first limitation is effectiveness, as scalability issues, an extremely imbalanced class, and time constraints dictated the use of rebalanced data in this paper. The counter-argument is that, in practice, the algorithms can search with a significantly larger moving window, number of link-types in the whitelist, and number of attributes. The second limitation is in demonstrating the notion of adaptivity.
Although CD and SD are updated after every period in the experiments, this is not a true evaluation, because the experiments were performed on historical data: the fraudsters do not get a chance to react and change their strategy in response to CD and SD, as would occur if the algorithms were deployed in real life.

ACKNOWLEDGMENTS

The authors are grateful to Dr. Warwick Graco and Mr. Kelvin Sim for their insightful comments. This research was supported by the Australian Research Council (ARC) under Linkage Grant Number LP0454077.

REFERENCES

[1] Bifet, A. and Kirkby, R. 2009. Massive Online Analysis, Technical Manual, University of Waikato.
[2] Bolton, R. and Hand, D. 2001. Unsupervised Profiling Methods for Fraud Detection, Proc. of CSCC01.
[3] Brockett, P., Derrig, R., Golden, L., Levine, A. and Alpert, M. 2002. Fraud Classification using Principal Component Analysis of RIDITs, The Journal of Risk and Insurance 69(3): pp. 341-371. DOI: 10.1111/1539-6975.00027.
[4] Caruana, R. and Niculescu-Mizil, A. 2004. Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria, Proc. of SIGKDD04. DOI: 10.1145/1014052.1014063.
[5] Christen, P. and Goiser, K. 2007. Quality and Complexity Measures for Data Linkage and Deduplication, in F. Guillet and H. Hamilton (eds), Quality Measures in Data Mining, Vol. 43, Springer, United States. DOI: 10.1007/978-3-540-44918-8.
[6] Cortes, C., Pregibon, D. and Volinsky, C. 2003. Computational methods for dynamic graphs, Journal of Computational and Graphical Statistics 12(4): pp. 950-970. DOI: 10.1198/1061860032742.
[7] Experian. 2008. Experian Detect: Application Fraud Prevention System. Whitepaper, http://www.experian.com/products/pdf/experian detect.pdf.
[8] Fawcett, T. 2006. An Introduction to ROC Analysis, Pattern Recognition Letters 27: pp. 861-874. DOI: 10.1016/j.patrec.2005.10.010.
[9] Goldenberg, A., Shmueli, G. and Caruana, R. 2002. Using Grocery Sales Data for the Detection of Bio-Terrorist Attacks, Statistical Medicine.
[10] Gordon, G., Rebovich, D., Choo, K. and Gordon, J. 2007. Identity Fraud Trends and Patterns: Building a Data-Based Foundation for Proactive Enforcement, Center for Identity Management and Information Protection, Utica College.
[11] Hand, D. 2006. Classifier Technology and the Illusion of Progress, Statistical Science 21(1): pp. 1-15. DOI: 10.1214/088342306000000060.
[12] Head, B. 2006. Biometrics Gets in the Picture, Information Age August-September: pp. 10-11.
[13] Hutwagner, L., Thompson, W., Seeman, G. and Treadwell, T. 2006. The Bioterrorism Preparedness and Response Early Aberration Reporting System (EARS), Journal of Urban Health 80: pp. 89-96. PMID: 12791783.
[14] IDAnalytics. 2008. ID Score-Risk: Gain Greater Visibility into Individual Identity Risk. Unpublished.
[15] Jackson, M., Baer, A., Painter, I. and Duchin, J. 2007. A Simulation Study Comparing Aberration Detection Algorithms for Syndromic Surveillance, BMC Medical Informatics and Decision Making 7(6). DOI: 10.1186/1472-6947-7-6.
[16] Jonas, J. 2006. Non-Obvious Relationship Awareness (NORA), Proc. of Identity Mashup.
[17] Jost, A. 2004. Identity Fraud Detection and Prevention. Unpublished.
[18] Kantarcioglu, M., Jiang, W. and Malin, B. 2008. A Privacy-Preserving Framework for Integrating Person-Specific Databases, Privacy in Statistical Databases, Lecture Notes in Computer Science, 5262/2008: pp. 298-314. DOI: 10.1007/978-3-540-87471-3_25.
[19] Kleinberg, J. 2005. Temporal Dynamics of On-Line Information Streams, in M. Garofalakis, J. Gehrke and R. Rastogi (eds), Data Stream Management: Processing High-Speed Data Streams, Springer, United States. ISBN: 978-3-540-28607-3.
[20] Kursun, O., Koufakou, A., Chen, B., Georgiopoulos, M., Reynolds, K. and Eaglin, R. 2006. A Dictionary-Based Approach to Fast and Accurate Name Matching in Large Law Enforcement Databases, Proc. of ISI06. DOI: 10.1007/11760146.
[21] Neville, J., Simsek, O., Jensen, D., Komoroske, J., Palmer, K. and Goldberg, H. 2005. Using Relational Knowledge Discovery to Prevent Securities Fraud, Proc. of SIGKDD05. DOI: 10.1145/1081870.1081922.
[22] Oscherwitz, T. 2005. Synthetic Identity Fraud: Unseen Identity Challenge, Bank Security News 3: p. 7.
[23] Roberts, S. 1959. Control-Charts-Tests based on Geometric Moving Averages, Technometrics 1: pp. 239-250.
[24] Romanosky, S., Sharp, R. and Acquisti, A. 2010. Data Breaches and Identity Theft: When is Mandatory Disclosure Optimal?, Proc. of WEIS10 Workshop, Harvard University.
[25] Schneier, B. 2003. Beyond Fear: Thinking Sensibly about Security in an Uncertain World, Copernicus, New York. ISBN-10: 0387026207.
[26] Schneier, B. 2008. Schneier on Security, Wiley, Indiana. ISBN-10: 0470395354.
[27] Sweeney, L. 2002. k-Anonymity: A Model for Protecting Privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5): pp. 557-570.
[28] VedaAdvantage. 2006. Zero-Interest Credit Cards Cause Record Growth In Card Applications. Unpublished.
[29] Wheeler, R. and Aitken, S. 2000. Multiple Algorithms for Fraud Detection, Knowledge-Based Systems 13(3): pp. 93-99. DOI: 10.1016/S0950-7051(00)00050-2.
[30] Winkler, W. 2006. Overview of Record Linkage and Current Research Directions, Technical Report RR 2006-2, U.S. Census Bureau.
[31] Witten, I. and Frank, E. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java, Morgan Kaufmann Publishers, San Francisco. ISBN-10: 1558605525.
[32] Wong, W. 2004. Data Mining for Early Disease Outbreak Detection, PhD thesis, Carnegie Mellon University.
[33] Wong, W., Moore, A., Cooper, G. and Wagner, M. 2003. Bayesian Network Anomaly Pattern Detection for Detecting Disease Outbreaks, Proc. of ICML03. ISBN: 1-57735-189-4.

Clifton Phua is a Research Fellow at the Data Mining Department of Institute of Infocomm Research (I2R), Singapore.
His current research interests are in security and healthcare-related data mining.

Kate Smith-Miles is a Professor and Head of the School of Mathematical Sciences at Monash University, Australia. Her current research interests are in neural networks, intelligent systems, and data mining.

Vincent Lee is an Associate Professor in the Clayton School of Information Technology, Monash University, Australia. His current research interests are in data and text mining for business intelligence.

Ross Gayler is a Senior Research and Development Consultant in Veda Advantage, Australia. His current research interests are in credit scoring.