Resilient Identity Crime Detection
Clifton Phua, Member, IEEE, Kate Smith-Miles, Senior Member, IEEE, Vincent Lee, and Ross Gayler
Abstract—Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity crime. The
existing non-data mining detection systems of business rules and scorecards, and known fraud matching have limitations. To address
these limitations and combat identity crime in real-time, this paper proposes a new multi-layered detection system complemented with
two additional layers: Communal Detection (CD) and Spike Detection (SD). CD finds real social relationships to reduce the suspicion
score, and is tamper-resistant to synthetic social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD
finds spikes in duplicates to increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on
a variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing legal behaviour,
and remove the redundant attributes. Experiments were carried out on CD and SD with several million real credit applications. Results
on the data support the hypothesis that successful credit application fraud patterns are sudden and exhibit sharp spikes in duplicates.
Although this research is specific to credit application fraud detection, the concept of resilience, together with the adaptivity and quality
data discussed in the paper, is general to the design, implementation, and evaluation of all detection systems.
Index Terms—data mining-based fraud detection, security, data stream mining, anomaly detection.
1 INTRODUCTION

IDENTITY CRIME is defined as broadly as possible in
this paper. At one extreme, synthetic identity fraud
refers to the use of plausible but fictitious identities.
These are effortless to create but more difficult to apply
successfully. At the other extreme, real identity theft
refers to illegal use of innocent people’s complete identity details. These can be harder to obtain (although large
volumes of some identity data are widely available) but
easier to successfully apply. In reality, identity crime can
be committed with a mix of both synthetic and real
identity details.
Identity crime has become prominent because there is
so much real identity data available on the Web, and confidential data accessible through unsecured mailboxes. It
has also become easy for perpetrators to hide their true
identities. This can happen in a myriad of insurance,
credit, and telecommunications fraud, as well as other
more serious crimes. In addition to this, identity crime
is prevalent and costly in developed countries that do
not have nationally registered identity numbers.
Data breaches which involve lost or stolen consumers’
identity information can lead to other frauds such as
tax returns, home equity, and payment card fraud. Consumers can incur thousands of dollars in out-of-pocket
expenses. The US law requires offending organisations
to notify consumers, so that consumers can mitigate the
harm. As a result, these organisations incur economic
damage, such as notification costs, fines, and lost business [24].
Credit applications are Internet or paper-based forms
• C. Phua is with the Data Mining Department, Institute for Infocomm
Research (I2R), Singapore.
E-mail: see https://sites.google.com/site/cliftonphua/
• K. Smith-Miles and V. Lee are with Monash University, and R. Gayler is
with Veda Advantage.
with written requests by potential customers for credit
cards, mortgage loans, and personal loans. Credit application fraud is a specific case of identity crime, involving
synthetic identity fraud and real identity theft.
As in identity crime, credit application fraud has
reached a critical mass of fraudsters who are highly
experienced, organised, and sophisticated [10]. Their visible patterns can be different to each other and constantly
change. They are persistent, due to the high financial
rewards, and the risk and effort involved are minimal.
Based on anecdotal observations of experienced credit
application investigators, fraudsters can use software
automation to manipulate particular values within an
application and increase frequency of successful values.
Duplicates (or matches) refer to applications which
share common values. There are two types of duplicates:
exact (or identical) duplicates have all the same values;
near (or approximate) duplicates have some same values
(or characters), some similar values with slightly altered
spellings, or both. This paper argues that each successful
credit application fraud pattern is represented by a sudden and sharp spike in duplicates within a short time,
relative to the established baseline level.
Duplicates are hard to avoid from the fraudsters' point-of-view because duplicates increase their success rate.
The synthetic identity fraudster has a low success rate, and
is likely to reuse fictitious identities which have been
successful before. The identity thief has limited time
because innocent people can discover the fraud early and
take action, so the thief will quickly use the same real identities
at different places.
It will be shown later in this paper that many fraudsters operate this way with these applications and that
their characteristic pattern of behaviour can be detected
by the methods reported. In short, the new methods are
based on white-listing and detecting spikes of similar
applications. White-listing uses real social relationships
on a fixed set of attributes. This reduces false positives
by lowering some suspicion scores. Detecting spikes in
duplicates, on a variable set of attributes, increases true
positives by adjusting suspicion scores appropriately.
Throughout this paper, data mining is defined as the
real-time search for patterns in a principled (or systematic) fashion. These patterns can be highly indicative of
early symptoms in identity crime, especially synthetic
identity fraud [22].
1.1 Main challenges for detection systems
Resilience is the ability to degrade gracefully under most real attacks. The basic question asked by all
detection systems is whether they can achieve resilience.
To do so, the detection system trades off a small degree
of efficiency (degrades processing speed) for a much
larger degree of effectiveness (improves security by detecting most real attacks). In fact, any form of security
involves trade-offs [26].
The detection system needs “defence-in-depth” with
multiple, sequential, and independent layers of defence
[25] to cover different types of attacks. These layers
are needed to reduce false negatives. In other words,
any successful attack has to pass every layer of defence
without being detected.
The two greatest challenges for the data mining-based
layers of defence are adaptivity and use of quality data.
These challenges need to be addressed in order to reduce
false positives.
Adaptivity accounts for morphing fraud behaviour, as
the attempt to observe fraud changes its behaviour. But
what is not obvious, yet equally important, is the need to
also account for changing legal (or legitimate) behaviour
within a changing environment. In the credit application domain, changing legal behaviour is exhibited by
communal relationships (such as rising/falling numbers
of siblings) and can be caused by external events (such
as introduction of organisational marketing campaigns).
This means legal behaviour can be hard to distinguish
from fraud behaviour, but it will be shown later in this
paper that they are indeed distinguishable from each
other.
The detection system needs to exercise caution with
applications which reflect communal relationships. It
also needs to make allowance for certain external events.
Quality data is highly desirable for data mining, and
data quality can be improved through the real-time
removal of data errors (or noise). The detection system
has to filter duplicates which have been re-entered due to
human error or for other reasons. It also needs to ignore
redundant attributes which have many missing values,
and other issues.
1.2 Existing identity crime detection systems
There are non-data mining layers of defence to protect
against credit application fraud, each with its unique
strengths and weaknesses.
The first existing defence is made up of business rules
and scorecards. In Australia, one business rule is the hundred-point physical identity check test which requires the applicant to provide sufficient point-weighted
identity documents face-to-face. They must add up to
at least one hundred points, where a passport is worth
seventy points. Another business rule is to contact (or
investigate) the applicant over the telephone or Internet. The above two business rules are highly effective,
but human resource intensive. To rely less on human
resources, a common business rule is to match an application’s identity number, address, or phone number
against external databases. This is convenient, but the
public telephone and address directories, semi-public
voters’ register, and credit history data can have data
quality issues of accuracy, completeness, and timeliness.
In addition, scorecards for credit scoring can catch a
small percentage of fraud which does not look creditworthy; but it also removes outlier applications which
have a higher probability of being fraudulent.
The second existing defence is known fraud matching.
Here, known frauds are complete applications which
were confirmed to have the intent to defraud and usually
periodically recorded into a blacklist. Subsequently, the
current applications are matched against the blacklist.
This has the benefit and clarity of hindsight because
patterns often repeat themselves. However, there are two
main problems in using known frauds. First, they are
untimely due to long time delays, in days or months, for
fraud to reveal itself, and be reported and recorded. This
provides a window of opportunity for fraudsters. Second, recording of frauds is highly manual. This means
known frauds can be incorrect [11], expensive, difficult
to obtain [21], [3], and have the potential of breaching
privacy.
In the real-time credit application fraud detection domain, this paper argues against the use of classification
(or supervised) algorithms which use class labels. In addition to the problems of using known frauds, these algorithms, such as logistic regression, neural networks, or
Support Vector Machines (SVM), cannot achieve scalability or handle the extreme imbalanced class [11] in credit
application data streams. As fraud and legal behaviour
changes frequently, the classifiers will deteriorate rapidly
and the supervised classification algorithms will need
to be trained on the new data. But the training time is
too long for real-time credit application fraud detection
because the new training data has too many derived
numerical attributes (converted from the original, sparse
string attributes) and too few known frauds.
This paper acknowledges that in another domain,
real-time credit card transactional fraud detection, there
are the same issues of scalability, extremely imbalanced
classes, and changing behaviour. For example, Fair Isaac, a company renowned for its predictive fraud analytics, has been successfully applying supervised classification
algorithms, including neural networks and SVM.
1.3 New data mining-based layers of defence
The main objective of this research is to achieve resilience
by adding two new, real-time, data mining-based layers
to complement the two existing non-data mining layers
discussed in the previous subsection. These new layers will improve detection of fraudulent applications because the
detection system can detect more types of attacks, better
account for changing legal behaviour, and remove the
redundant attributes.
These new layers are not human resource intensive.
They represent patterns in a score where the higher the
score for an application, the higher the suspicion of fraud
(or anomaly). In this way, only the highest scores require
human intervention. These two new layers, communal
and spike detection, do not use external databases, but
only the credit application database per se. And crucially,
these two layers are unsupervised algorithms which are
not completely dependent on known frauds but use
them only for evaluation.
The main contribution of this paper is the demonstration of resilience, with adaptivity and quality data
in real-time data mining-based detection algorithms.
The first new layer is Communal Detection (CD): the
whitelist-oriented approach on a fixed set of attributes.
To complement and strengthen CD, the second new layer
is Spike Detection (SD): the attribute-oriented approach
on a variable-size set of attributes.
The second contribution is the significant extension of
knowledge in credit application fraud detection because
publications in this area are rare. In addition, this research uses the key ideas from other related domains to
design the credit application fraud detection algorithms.
Finally, the last contribution is the recommendation of
credit application fraud detection as one of the many
solutions to identity crime. Being at the first stage of the
credit life cycle, credit application fraud detection also
prevents some credit transactional fraud.
Section 2 gives an overview of related work in credit
application fraud detection and other domains. Section
3 presents the justifications and anatomy of the CD
algorithm, followed by the SD algorithm. Before the
analysis and interpretation of CD and SD results, Section
4 considers the legal and ethical responsibility of handling application data, and describes the data, evaluation
measures, and experimental design. Section 5 concludes
the paper.
2 BACKGROUND
Many individual data mining algorithms have been designed, implemented, and evaluated in fraud detection.
Yet until now, to the best of the researchers’ knowledge,
resilience of data mining algorithms in a complete detection system has not been explicitly addressed.
Much work in credit application fraud detection remains proprietary and exact performance figures unpublished; therefore, there is no way to compare the CD and
SD algorithms against the leading industry methods
and techniques. For example, [14] has ID Score-Risk
which gives a combined view of each credit application’s
characteristics and their similarity to other industry-provided or Web identity's characteristics. In another
example, [7] has Detect which provides four categories of
policy rules to signal fraud, one of which is checking a
new credit application against historical application data
for consistency.
Case-Based Reasoning (CBR) is the only known prior
publication in the screening of credit applications [29].
CBR analyses the hardest cases which have been misclassified by existing methods and techniques. Retrieval
uses thresholded nearest neighbour matching. Diagnosis
utilises multiple selection criteria (probabilistic curve,
best match, negative selection, density selection, and
default) and resolution strategies (sequential resolution-default, best guess, and combined confidence) to analyse
the retrieved cases. CBR has twenty percent higher true
positive and true negative rates than common algorithms
on credit applications.
The CD and SD algorithms, which monitor the significant increase or decrease in amount of something
important (Section 3), are similar in concept to credit
transactional fraud detection and bio-terrorism detection. Peer Group Analysis [2] monitors inter-account
behaviour over time. It compares the cumulative mean
weekly amount between a target account and other
similar accounts (peer group) at subsequent time points.
The suspicion score is a t-statistic which determines
the standardised distance from the centroid of the peer
group. On credit card accounts, the time window to
calculate a peer group is thirteen weeks, and the future
time window is four weeks. Break Point Analysis [2]
monitors intra-account behaviour over time. It detects
rapid spending or sharp increases in weekly spending within a single account. Accounts are ranked by
the t-test. The fixed-length moving transaction window
contains twenty-four transactions: the first twenty for
training and the next four for evaluation on credit card
accounts. Bayesian networks [33] uncover simulated
anthrax attacks from real emergency department data.
[32] surveys algorithms for finding suspicious activity in
time for disease outbreaks. [9] uses time series analysis
to track early symptoms of synthetic anthrax outbreaks
from daily sales of retail medication (throat, cough, and
nasal) and some grocery items (facial tissues, orange
juice, and soup). Control-chart-based statistics, exponential weighted moving averages, and generalised linear
models were tested on the same bio-terrorism detection
data and alert rate [15].
The SD algorithm, which specifies how much the
current prediction is influenced by past observations
(subsection 3.3), is related to Exponentially Weighted
Moving Average (EWMA) in statistical process control
research [23]. In particular, like EWMA, the SD algorithm
performs linear forecasting on the smoothed time series,
and their advantages include low implementation and
computational complexity. In addition, the SD algorithm
is similar to change point detection in bio-surveillance
research, which maintains a cumulative sum (CUSUM)
of positive deviations from the mean [13]. Like CUSUM,
the SD algorithm raises an alert when the score/CUSUM
exceeds a threshold, and both detect change points
faster as they are sensitive to small shifts from the mean.
Unlike CUSUM, the SD algorithm weighs and chooses
string attributes, not numerical ones.
3 THE METHODS
This section is divided into four subsections to systematically explain the CD algorithm (first two subsections)
and the SD algorithm (last two subsections). Each subsection commences with a discussion of its purpose.
3.1 Communal Detection (CD)
This subsection motivates the need for CD and its adaptive approach.
Suppose there were two credit card applications that
provided the same postal address, home phone number,
and date of birth, but one stated the applicant’s name to
be John Smith, and the other stated the applicant’s name
to be Joan Smith. These applications could be interpreted
in three ways:
1) it could be a fraudster attempting to obtain multiple
credit cards using near duplicated data;
2) there could be twins living in the same house who
are both applying for a credit card;
3) or it could be the same person applying twice, with
a typographical error of one character in the first
name.
With the CD layer, any two similar applications could
be easily interpreted as (1) because this paper’s detection
methods use the similarity of the current application
to all prior applications (not just known frauds) as the
suspicion score. However, for this particular scenario,
CD would also recognize these two applications as either
(2) or (3) by lowering the suspicion score due to the
higher possibility that they are legitimate.
To account for legal behaviour and data errors, Communal Detection (CD) is the whitelist-oriented approach
on a fixed set of attributes. The whitelist, a list of
communal and self relationships between applications,
is crucial because it reduces the scores of these legal
behaviours and false positives. Communal relationships
are near duplicates which reflect the social relationships
from tight familial bonds to casual acquaintances: family
members, housemates, colleagues, neighbours, or friends
[17]. The family member relationship can be further
broken down into more detailed relationships such as
husband-wife, parent-child, brother-sister, male-female
cousin (or both male, or both female), as well as uncle-niece (or uncle-nephew, auntie-niece, auntie-nephew).
Self-relationships highlight the same applicant as a result
of legitimate behaviour (for simplicity, self-relationships
are regarded as communal relationships).
Broadly speaking, the whitelist is constructed by ranking link-types between applicants by volume. The larger
the volume for a link-type, the higher the probability of a
communal relationship. On when and how the whitelist
is constructed, please refer to Section 3.2, Step 6 of the
CD algorithm.
However, there are two problems with the whitelist.
First, there can be focused attacks on the whitelist by
fraudsters when they submit applications with synthetic
communal relationships. Although it is difficult to make
definitive statements that fraudsters will attempt this,
it is also wrong to assume that this will not happen.
The solution proposed in this paper is to make the
contents of the whitelist less predictable. The values of some parameters (distinct from an application's identity values) are automatically changed such
that the whitelist's link-types also change. In general,
tampering is not limited to hardware, but can also refer
to manipulating software such as code. For our domain,
tamper-resistance refers to making it more difficult for
fraudsters to manipulate or circumvent data mining by
providing false data.
Second, the volume and ranks of the whitelist’s real
communal relationships change over time. To make the
whitelist exercise caution with (or more adaptive to)
changing legal behaviour, the whitelist is continually
being reconstructed.
3.2 CD algorithm design
This subsection explains how the CD algorithm works in
real-time by giving scores when they are exact or similar
matches between categorical data; and in terms of its
nine inputs, three outputs, and six steps.
This research focuses on one rapid and continuous data
stream [19] of applications. For clarity, let G represent the
overall stream which contains multiple and consecutive
{. . . , gx−2 , gx−1 , gx , gx+1 , gx+2 , . . .} Mini-discrete streams.
• gx : current Mini-discrete stream which contains
multiple and consecutive {ux,1 , ux,2 , . . . , ux,p }
micro-discrete streams.
• x: fixed interval of the current month, fortnight, or
week in the year.
• p: variable number of micro-discrete streams in a
Mini-discrete stream.
Also, let ux,y represent the current micro-discrete
stream which contains multiple and consecutive
{vx,y,1 , vx,y,2 , . . . , vx,y,q } applications. The current
application’s links are restricted to previous applications
within a moving window, and this window can be
larger than the number of applications within the
current micro-discrete stream.
• y: fixed interval of the current day, hour, minute, or
second.
• q: variable number of applications in a microdiscrete stream.
Here, it is necessary to describe a single and continuous stream of applications as being made up of separate
chunks: a Mini-discrete stream is long-term (for example,
a month of applications); while a micro-discrete stream
is short-term (for example, a day of applications). They
help to specify precisely when and how the detection
system will automatically change its configurations. For
example, the CD algorithm reconstructs its whitelist at
the end of the month and resets its parameter values
at the end of the day; the SD algorithm does attribute
selection and updates CD attribute weights at the end
of the month. Also, for example, long-term previous
average score, long-term previous average links, and
average density of each attribute are calculated from data
in a Mini-discrete stream; short-term current average
score and short-term current average links are calculated
from data in a micro-discrete stream.
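To make this chunking concrete, the following minimal sketch in Java (the language the experiments were later coded in) partitions one timestamped stream into daily micro-discrete and monthly Mini-discrete streams, firing the reconfiguration hooks just described. All class and method names are our own illustrative choices, not identifiers from the authors' system.

```java
import java.time.*;

public class StreamChunker {
    private LocalDate currentDay = null;

    // Called for every arriving application, in time order.
    public void accept(long timestampMillis) {
        LocalDate day = Instant.ofEpochMilli(timestampMillis)
                               .atZone(ZoneOffset.UTC).toLocalDate();
        if (currentDay != null && !day.equals(currentDay)) {
            onMicroStreamEnd(currentDay);                    // CD Step 5: parameter change
            if (day.getMonthValue() != currentDay.getMonthValue()
                    || day.getYear() != currentDay.getYear()) {
                onMiniStreamEnd(YearMonth.from(currentDay)); // CD Step 6, SD Steps 4-5
            }
        }
        currentDay = day;
        // ... score the application with CD and SD here ...
    }

    private void onMicroStreamEnd(LocalDate day) {
        System.out.println("micro-discrete stream ended: " + day);
    }

    private void onMiniStreamEnd(YearMonth month) {
        System.out.println("Mini-discrete stream ended: " + month);
    }

    public static void main(String[] args) {
        StreamChunker c = new StreamChunker();
        long dayMs = 24L * 3600 * 1000;
        for (int d = 0; d < 40; d++) c.accept(d * dayMs); // 40 days crosses a month boundary
    }
}
```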
With this data stream perspective in mind, the CD
algorithm matches the current application against a moving window of previous applications. It accounts for
attribute weights which reflect the degree of importance
in attributes. The CD algorithm matches all links against
the whitelist to find communal relationships and reduce
their link score. It then calculates the current application's score using every link score and previous application score. At the end of the current micro-discrete
data stream, the CD algorithm determines the State-of-Alert (SoA) and updates one random parameter's value
such that it trades off effectiveness with efficiency, or
vice versa. At the end of the current Mini-discrete data
stream, it constructs the new whitelist.
Inputs
vi (current application)
W number of vj (moving window)
ℜx,link−type (link-types in current whitelist)
Tsimilarity (string similarity threshold)
Tattribute (attribute threshold)
η (exact duplicate filter)
α (exponential smoothing factor)
Tinput (input size threshold)
SoA (State-of-Alert)
Outputs
S(vi ) (suspicion score)
Same or new parameter value
New whitelist
CD algorithm
Step 1: Multi-attribute link [match vi against W number of
vj to determine if a single attribute exceeds Tsimilarity ; and
create multi-attribute links if near duplicates' similarity exceeds
Tattribute or an exact duplicate's time difference exceeds η]
Step 2: Single-link score [calculate single-link score by matching Step 1's multi-attribute links against ℜx,link−type ]
Step 3: Single-link average previous score [calculate average
previous scores from Step 1’s linked previous applications]
Step 4: Multiple-links score [calculate S(vi ) based on weighted
average (using α) of Step 2’s link scores and Step 3’s average
previous scores]
Step 5: Parameter’s value change [determine same or new
parameter value through SoA (for example, by comparing input
size against Tinput ) at end of ux,y ]
Step 6: Whitelist change [determine new whitelist at end of gx ]
TABLE 1
Overview of Communal Detection (CD) algorithm
Table 1 shows the data input, six most influential
parameters, and two adaptive parameters.
• vi : unscored current application. N is its number of
attributes. ai,k is the value of the k th attribute in
application vi .
• W : moving (or sliding) window of previous applications. It determines the short-term search space for
the current application. CD utilises an application-based window (such as the previous ten thousand
applications). vj is the scored previous application.
aj,k is the value of the k th attribute in application
vj .
• ℜx,link−type is a set of unique and sorted link-types
(in descending order by number of links), in the
link-type attribute of the current whitelist. M is the
number of link-types.
• Tsimilarity : string similarity threshold between two
values.
• Tattribute : attribute threshold which requires a minimum number of matched attributes to link two
applications.
• η: exact duplicate filter at the link level. It removes
links of exact duplicates from the same organisation
within minutes, which are likely to be data errors by
customers or employees.
• α: exponential smoothing factor. In CD, α gradually
discounts the effect of average previous scores as
the older scores become less relevant.
• Tinput : input size threshold. When the environment evolves significantly over time, the input size
threshold Tinput may have to be manually adjusted.
• SoA (State-of-Alert): condition of reduced, same, or
heightened watchfulness for each parameter.
Table 1 also shows the three outputs.
• S(vi ): CD suspicion score of current application.
• Same or new parameter values for each parameter.
• New whitelist.
While Table 1 gives an overview of the CD algorithm’s
six steps, the details in each step are presented below.
Step 1: Multi-attribute link
The first step of the CD algorithm matches every
current application’s value against a moving window of
previous applications’ values to find links.
$$e_k = \begin{cases} 1 & \text{if } \mathrm{JaroWinkler}(a_{i,k}, a_{j,k}) \ge T_{similarity} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
where ek is the single-attribute match between the
current value and a previous value. The first case uses
Jaro-Winkler(.) [30], which is case sensitive, and can also be
cross-matched between the current value and previous values from another similar attribute. The second case is a
non-match because values are not similar.
$$e_{i,j} = \begin{cases} e_1 e_2 \ldots e_N & \text{if } T_{attribute} \le \sum_{k=1}^{N} e_k \le N - 1 \\ & \quad \text{or } \left[ \sum_{k=1}^{N} e_k = N \text{ and } \mathrm{Time}(a_{i,k}, a_{j,k}) \ge \eta \right] \\ \varepsilon & \text{otherwise} \end{cases} \qquad (2)$$
where ei,j is the multi-attribute link (or binary string)
between the current application and a previous application. ε is the empty string. The first case uses Time(.)
which is the time difference in minutes. The second
case has no link (empty string) because it is not a near
duplicate, or it is an exact duplicate within the time filter.
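As a concrete illustration of Equations (1) and (2), the sketch below computes single-attribute matches with a standard Jaro-Winkler implementation and assembles the multi-attribute link as a binary string. All names are our own, the Jaro-Winkler code is the common textbook formulation rather than the authors' exact implementation, and cross-matching against similar attributes is omitted for brevity.

```java
public class MultiAttributeLink {

    /** Standard Jaro-Winkler string similarity (textbook formulation). */
    static double jaroWinkler(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] m1 = new boolean[len1], m2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = m2[j] = true; matches++; break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int t = 0, k = 0;                       // count transpositions
        for (int i = 0; i < len1; i++) {
            if (!m1[i]) continue;
            while (!m2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) t++;
            k++;
        }
        double m = matches;
        double jaro = (m / len1 + m / len2 + (m - t / 2.0) / m) / 3.0;
        int prefix = 0;                         // common prefix, capped at 4
        while (prefix < Math.min(4, Math.min(len1, len2))
                && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    /** Equation (1): single-attribute match e_k between two values. */
    static int singleAttributeMatch(String current, String previous, double tSimilarity) {
        return jaroWinkler(current, previous) >= tSimilarity ? 1 : 0;
    }

    /**
     * Equation (2): the multi-attribute link e_{i,j} as a binary string,
     * or null standing in for the empty string ε (no link).
     */
    static String multiAttributeLink(String[] vi, String[] vj, long minutesApart,
                                     double tSimilarity, int tAttribute, long eta) {
        int n = vi.length, sum = 0;
        StringBuilder link = new StringBuilder();
        for (int k = 0; k < n; k++) {
            int ek = singleAttributeMatch(vi[k], vj[k], tSimilarity);
            sum += ek;
            link.append(ek);
        }
        boolean nearDuplicate = tAttribute <= sum && sum <= n - 1;
        boolean slowExactDuplicate = sum == n && minutesApart >= eta; // η filter
        return (nearDuplicate || slowExactDuplicate) ? link.toString() : null;
    }

    public static void main(String[] args) {
        String[] v1 = {"John", "Smith", "1", "Circular road", "91234567", "1/1/1982"};
        String[] v2 = {"Jack", "Smyth", "1", "Circular road", "91234567", "1/1/1982"};
        System.out.println(multiAttributeLink(v1, v2, 500, 0.8, 3, 120)); // prints 011111
    }
}
```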
Step 2: Single-link communal detection
The second step of the CD algorithm accounts for attribute weights, and matches every current application’s
link against the whitelist to find communal relationships
and reduce their link score.
$$S(e_{i,j}) = \begin{cases} \sum_{k=1}^{N} (e_k \times w_k) \times w_z & \text{if } e_{i,j} \in \Re_{x,link\text{-}type} \text{ and } e_{i,j} \ne \varepsilon \\ \sum_{k=1}^{N} (e_k \times w_k) & \text{if } e_{i,j} \notin \Re_{x,link\text{-}type} \text{ and } e_{i,j} \ne \varepsilon \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
where S(ei,j ) is the single-link score. The terminology
“single-link score” is adopted over “multi-attribute link
score” to focus on a single link between two applications,
not on the matching of attributes between them. The first
case uses wk , the attribute weight with a default
value of 1/N , and wz , the weight of the z th
link-type in the whitelist. The second case is the greylist
(neither in the blacklist nor whitelist) link score. The last
case is when there is no multi-attribute link.
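A minimal sketch of Equation (3) follows, assuming the whitelist is represented as a map from link-type (the binary string e_{i,j}) to its link-type weight w_z; the names and representation are our own.

```java
import java.util.Map;

public class SingleLinkScore {

    /** Equation (3): score one link given per-attribute weights w_k. */
    static double score(String link, double[] w, Map<String, Double> whitelist) {
        if (link == null) return 0.0;           // no multi-attribute link (ε)
        double s = 0.0;
        for (int k = 0; k < link.length(); k++) {
            s += (link.charAt(k) - '0') * w[k]; // sum of e_k * w_k
        }
        Double wz = whitelist.get(link);
        return (wz != null) ? s * wz : s;       // whitelisted links are discounted by w_z
    }

    public static void main(String[] args) {
        double[] w = {1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0}; // default w_k = 1/N
        Map<String, Double> whitelist = Map.of("010101", 0.25, "011111", 0.5);
        System.out.println(score("011111", w, whitelist)); // communal: (5/6) * 0.5
        System.out.println(score("011110", w, whitelist)); // greylist: 4/6
    }
}
```

Note how the whitelisted (communal) link is discounted by its small link-type weight, while the greylist link keeps its full attribute-weighted sum.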
Step 3: Single-link average previous score
The third step of the CD algorithm is the calculation
of every linked previous applications’ score for inclusion
into the current application’s score. The previous scores
act as the established baseline level.
$$\beta_j = \begin{cases} \dfrac{S(v_j)}{E_O(v_j)} & \text{if } e_{i,j} \ne \varepsilon \text{ and } E_O(v_j) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
where βj is the single-link average previous score. Initially
there are no linked applications, so βj = 0 since S(vj ) = 0
and EO (vj ) = 0. S(vj ) is the suspicion score of a previous
application to which the current application links. S(vj )
was computed in the same way as S(vi ), since a previous
application was once
a current application. EO (vj ) is the number of outlinks
from the previous application. The first case gives the
average score of each previous application. The last case
is when there is no multi-attribute link.
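In code form, Equation (4) is a single guarded division; this sketch assumes we track each previous application's score S(v_j) and its number of outlinks E_O(v_j), and the variable names are ours.

```java
public class AveragePreviousScore {
    /** Equation (4): contribution β_j of one linked previous application. */
    static double singleLinkAveragePreviousScore(boolean linked, double sPrev, int outlinks) {
        if (linked && outlinks > 0) {
            return sPrev / outlinks;  // β_j = S(v_j) / E_O(v_j)
        }
        return 0.0;                   // no link, or no outlinks yet
    }
}
```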
Step 4: Multiple-links score
The fourth step of the CD algorithm is the calculation
of every current application’s score using every link and
previous application score.
$$S(v_i) = \sum_{v_j \in K(v_i)} [S(e_{i,j}) + \beta_j] \qquad (5)$$
where S(vi ) is the CD suspicion score of the current
application. K(vi ) is the set of previous applications
within the moving window to which the current application links. Therefore, a high score is the result
of strong links between the current application and the
previous applications (represented by S(ei,j )), the high
scores of linked previous applications (represented by
βj ), and a large number of linked previous applications
(represented by the summation over vj ∈ K(vi )).
$$S(v_i) = \sum_{v_j \in K(v_i)} [(1 - \alpha) \times S(e_{i,j}) + \alpha \times \beta_j] \qquad (6)$$
where Equation (6) incorporates α [6] into Equation
(5).
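The sketch below shows Equations (5) and (6) as one loop over the linked previous applications K(v_i); the array representation and names are our own.

```java
public class MultipleLinksScore {
    /** Equations (5)/(6): CD suspicion score over all linked previous applications. */
    static double cdSuspicionScore(double[] linkScores, double[] betas, double alpha) {
        double s = 0.0;
        for (int j = 0; j < linkScores.length; j++) {
            // Equation (6); with alpha = 0.5 this is half of Equation (5)'s
            // unweighted sum S(e_{i,j}) + β_j for each link.
            s += (1 - alpha) * linkScores[j] + alpha * betas[j];
        }
        return s;
    }
}
```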
Step 5: Parameter’s value change
At the end of the current micro-discrete data stream,
the adaptive CD algorithm determines the State-of-Alert
(SoA) and updates one random parameter's value such
that there is a trade-off between effectiveness and efficiency, or vice versa. This increases the tamper-resistance
in parameters.
$$SoA = \begin{cases} \text{low} & \text{if } q \ge T_{input} \text{ and } \Omega_{x-1} \ge \Omega_{x,y} \text{ and } \delta_{x-1} \ge \delta_{x,y} \\ \text{high} & \text{if } q < T_{input} \text{ and } \Omega_{x-1} < \Omega_{x,y} \text{ and } \delta_{x-1} < \delta_{x,y} \\ \text{medium} & \text{otherwise} \end{cases} \qquad (7)$$
where SoA is the state-of-alert at the end of every
micro-discrete data stream. Ωx−1 is the long-term previous average score and Ωx,y is the short-term current
average score. δx−1 is the long-term previous average
links and δx,y is the short-term current average links.
Collectively, these are termed output suspiciousness.
The first case sets SoA to low when input size is high
and output suspiciousness is low. The adaptive CD algorithm trades off one random parameter’s effectiveness
(degrades communal relationship security) for efficiency
(improves computation speed). For example, a smaller
moving window, fewer link-types in the whitelist, or
a larger attribute threshold decreases the algorithm’s
effectiveness but increases its efficiency.
Conversely, the second case sets SoA to high when its
conditions are the opposite of the first case. The adaptive
CD algorithm will trade off one random parameter’s
efficiency (degrades speed) for effectiveness (improves
security).
The last case sets SoA to medium. The adaptive CD
algorithm will not change any parameter’s value.
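Equation (7) and its three cases reduce to two comparisons against the long-term baselines, as the sketch below spells out; the method and variable names are our own.

```java
public class StateOfAlert {
    enum SoA { LOW, MEDIUM, HIGH }

    /** Equation (7): pick SoA from input size q and output suspiciousness. */
    static SoA stateOfAlert(int q, int tInput,
                            double omegaLong, double omegaShort,
                            double deltaLong, double deltaShort) {
        if (q >= tInput && omegaLong >= omegaShort && deltaLong >= deltaShort) {
            return SoA.LOW;   // high input, low suspiciousness: favour efficiency
        }
        if (q < tInput && omegaLong < omegaShort && deltaLong < deltaShort) {
            return SoA.HIGH;  // low input, rising suspiciousness: favour effectiveness
        }
        return SoA.MEDIUM;    // leave all parameter values unchanged
    }
}
```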
Step 6: Whitelist change
At the end of the current Mini-discrete data stream, the
adaptive CD algorithm constructs the new whitelist on
the current Mini-discrete stream’s links. This increases
the tamper-resistance in the whitelist.
i or j | Given name | Family name | Unit no. | Street name   | Home phone no. | Date of birth
1      | John       | Smith       | 1        | Circular road | 91234567       | 1/1/1982
2      | Joan       | Smith       | 1        | Circular road | 91234567       | 1/1/1982
3      | Jack       | Jones       | 3        | Square drive  | 93535353       | 3/2/1955
4      | Ella       | Jones       | 3        | Square drive  | 93535353       | 6/8/1957
5      | Riley      | Lee         | 2        | Circular road | 91235678       | 5/3/1983
6      | Liam       | Smyth       | 2        | Circular road | 91235678       | 1/1/1982

TABLE 2
Sample of 6 credit applications with 6 attributes
Table 2 provides a sample of 6 credit applications with
6 attributes, to show how communal relationships are
extracted from credit applications.
The whitelist is constructed from multi-attribute links
generated from Step 1 of the CD algorithm on the training data. In our simple illustration, the CD algorithm
is assumed to have the following parameter settings:
Tsimilarity = 0.8, Tattribute = 3, and M = 4. If Table 2
is used as training data, five multi-attribute links will be
generated: e1,2 = 011111, e1,6 = 010101, e2,6 = 010101,
e3,4 = 011110, and e5,6 = 001110. These multi-attribute
links capture communal relationships: John and Joan are
twins, Jack and Ella are married, Riley and Liam are
housemates, John and Joan are neighbours with Riley
and Liam; and John, Joan, and Liam share the same
birthday.
z | Link-type | No. | Weight
1 | 010101    | 2   | 0.25
2 | 011111    | 1   | 0.5
3 | 011110    | 1   | 0.75
4 | 001110    | 1   | 1

TABLE 3
Sample whitelist
Sample whitelist
Table 3 shows the sample whitelist constructed from
the credit applications in Table 2. A whitelist contains three
attributes: the link-type, which is a unique link determined
from aggregated links in the training data; its corresponding
number of links of this type; and its link-type weight.
There will be many link-types, so the number of link-types
is pre-determined by selecting the most frequent ones to be
in the whitelist. Specifically, the link-types in the whitelist
are processed in the following manner. The link-types are
first sorted in descending order by number of links. For the
highest ranked link-type, the link-type weight starts at 1/M.
Each subsequent link-type weight is then incrementally
increased by 1/M, until the lowest ranked link-type weight
is one. In other words, a higher ranked link-type is given
a smaller link-type weight and is most likely a communal
relationship.
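The sketch below reproduces this construction (Step 6) over the Table 2 links: count link-types in the training Mini-stream, keep the M most frequent, and assign weights 1/M, 2/M, ..., 1 in descending order of volume. Class and method names are our own.

```java
import java.util.*;

public class WhitelistBuilder {

    /** Build a whitelist mapping link-type to link-type weight w_z. */
    static Map<String, Double> build(List<String> trainingLinks, int m) {
        Map<String, Integer> counts = new HashMap<>();
        for (String link : trainingLinks) counts.merge(link, 1, Integer::sum);
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> b.getValue() - a.getValue()); // most frequent first;
                                                            // ties ordered arbitrarily
        Map<String, Double> whitelist = new LinkedHashMap<>();
        int kept = Math.min(m, ranked.size());
        for (int z = 0; z < kept; z++) {
            whitelist.put(ranked.get(z).getKey(), (z + 1) / (double) m); // 1/M, 2/M, ..., 1
        }
        return whitelist;
    }

    public static void main(String[] args) {
        List<String> links = List.of("011111", "010101", "010101", "011110", "001110");
        // 010101 (most frequent) gets weight 0.25, matching Table 3; the three
        // singleton link-types share ranks 2-4 in arbitrary order.
        System.out.println(build(links, 4));
    }
}
```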
3.3 Spike Detection (SD)
This subsection contrasts SD with CD; and presents
the need for SD, in order to improve resilience and
adaptivity.
Before proceeding with a description of Spike Detection (SD), it is necessary to reinforce that CD finds
real social relationships to reduce the suspicion score,
and is tamper-resistant to synthetic social relationships.
It is the whitelist-oriented approach on a fixed set of
attributes. In contrast, SD finds spikes to increase the
suspicion score, and is probe-resistant for attributes.
Probe-resistance reduces the chances a fraudster will
discover attributes used in the SD score calculation. It
is the attribute-oriented approach on a variable-size set
of attributes. A side note: SD cannot use a whitelist-oriented approach because it was not designed to create
multi-attribute links on a fixed-size set of attributes.
CD has a fundamental weakness in its attribute threshold. Specifically, CD must match at least three values
for our dataset. With fewer than three matched values,
our whitelist does not contain real social relationships
because some values, such as given name and unit
number, are not unique identifiers. The fraudster can
duplicate one or two important values which CD cannot
detect.
SD complements CD. The redundant attributes are
either too sparse where no patterns can be detected,
or too dense where no denser values can be found.
The redundant attributes are continually filtered out; only
selected attributes, in the form of not-too-sparse and not-too-dense attributes, are used for the SD suspicion score.
In this way, the exposure of the detection system to
probing of attributes is reduced because only one or two
attributes are adaptively selected.
Suppose there was a bank’s marketing campaign to
give attractive benefits for it’s new ladies’ platinum
credit card. This will cause a spike in the number of
legitimate credit card applications by women, which can
be erroneously interpreted by the system as a fraudster
attack.
To account for the changing legal behaviour caused
by external events, SD strengthens CD by providing attribute weights which reflect the degree of importance in
attributes. The attributes are adaptive for CD in the sense
that its attribute weights are continually determined.
This addresses external events such as the entry of new
organisations and exit of existing ones, and marketing
campaigns of organisations which do not contain any
patterns and are likely to cause three natural changes in
attribute weights. These changes are volume drift where
the overall volume fluctuates, population drift where the
volume of both fraud and legal classes fluctuates independently of each other, and concept drift which involves
changing legal characteristics that can become similar
to fraud characteristics. By tuning attribute weights, the
detection system makes allowance for these external
events.
In general, SD trades off effectiveness (degrades security because it has more false positives without filtering
out communal relationships and some data errors) for
efficiency (improves computation speed because it does
not match against the whitelist, and can compute each attribute in parallel on multiple workstations). In contrast,
CD trades off efficiency (degrades computation speed)
for effectiveness (improves security by accounting for
communal relationships and more data errors).
3.4 SD algorithm design
This subsection explains how the SD algorithm works in
real-time with the CD algorithm, and in terms of its six
inputs, two outputs, and five steps.
From the data stream point-of-view, using a series
of window steps, the SD algorithm matches the current application’s value against a moving window of
previous applications’ values. It calculates the current
value’s score by integrating all steps to find spikes. Then,
it calculates the current application’s score using all
values’ scores and attribute weights. Also, at the end of
the current Mini-discrete data stream, the SD algorithm
selects the attributes for the SD suspicion score, and
updates the attribute weights for CD.
Inputs
vi (current application)
W number of vj (moving window)
t (current step)
Tsimilarity (string similarity threshold)
θ (time difference filter)
α (exponential smoothing factor)
Outputs
S(vi ) (suspicion score)
wk (attribute weight)
SD algorithm
Step 1: Single-step scaled counts [match vi against W number
of vj to determine if a single value exceeds Tsimilarity and its
time difference exceeds θ]
Step 2: Single-value spike detection [calculate current value’s
score based on weighted average (using α) of t Step 1’s scaled
matches]
Step 3: Multiple-values score [calculate S(vi ) from Step 2’s
value scores and Step 4’s wk ]
Step 4: SD attributes selection [determine wk for SD at end of
gx ]
Step 5: CD attribute weights change [determine wk for CD at
end of gx ]
TABLE 4
Overview of Spike Detection (SD) algorithm

Table 4 shows the data input and five parameters.
• vi : unscored current application (previously introduced in subsection 3.1).
• W : in SD, it is a time-based window (such as the previous ten days).
• t: current step, also the number of steps in W .
• Tsimilarity : string similarity threshold between two values (previously described in subsection 3.1).
• θ: time difference filter at the link level. It is a simplified version of the exact duplicate filter.
• α: exponential smoothing factor. In SD, it gradually discounts the effect of previous steps of each value as the older steps become less relevant.
Table 4 also shows the two outputs.
• S(vi ): SD suspicion score of the current application.
• wk : in SD, each attribute weight is automatically updated at the end of the current Mini-discrete data stream.
While Table 4 gives an overview of the SD algorithm's five steps, the details of each step are presented below.

Step 1: Single-step scaled count
The first step of the SD algorithm matches every current value against a moving window of previous values, in steps.

$$a_{i,j} = \begin{cases} 1 & \text{if } \mathrm{JaroWinkler}(a_{i,k}, a_{j,k}) \ge T_{similarity} \text{ and } \mathrm{Time}(a_{i,k}, a_{j,k}) \ge \theta \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

where ai,j is the single-attribute match between the current value and a previous value. The first case uses Jaro-Winkler(.) [30], which is case sensitive and can also be cross-matched between the current value and previous values from another similar attribute, and Time(.), which is the time difference in minutes. The second case is a non-match because the values are not similar, or recur too quickly.

$$s_\tau(a_{i,k}) = \sum_{a_{j,k} \in L(a_{i,k})} \frac{a_{i,j}}{\kappa} \qquad (9)$$

where sτ (ai,k ) represents the scaled matches in each step (the moving window is made up of many steps) to remove volume effects. L(ai,k ) is the set of previous values within each step which the current value matches, and κ is the number of values in each step.

Step 2: Single-value spike detection
The second step of the SD algorithm is the calculation of every current value's score by integrating all steps to find spikes. The previous steps act as the established baseline level.

$$S(a_{i,k}) = (1 - \alpha) \times s_t(a_{i,k}) + \alpha \times \frac{\sum_{\tau=1}^{t-1} s_\tau(a_{i,k})}{t - 1} \qquad (10)$$

where S(ai,k ) is the current value score.
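A sketch of SD Steps 1 and 2 follows: per window step, count the current value's scaled matches (Equation (9)), then blend the current step against the average of earlier steps (Equation (10)). The matches() predicate is a stand-in for Equation (8)'s Jaro-Winkler test plus the time difference filter θ, and all names here are our own.

```java
import java.util.List;

public class SpikeScore {

    static boolean matches(String current, String previous) {
        return current.equals(previous); // stand-in for Jaro-Winkler >= T and Time >= θ
    }

    /** Equation (9): scaled matches s_τ(a_{i,k}) for one step of κ values. */
    static double scaledMatches(String current, List<String> stepValues) {
        int kappa = stepValues.size();
        if (kappa == 0) return 0.0;
        int matched = 0;
        for (String prev : stepValues) {
            if (matches(current, prev)) matched++; // each a_{i,j} is 0 or 1
        }
        return matched / (double) kappa;           // dividing by κ removes volume effects
    }

    /** Equation (10): blend current step s_t against the baseline of steps 1..t-1. */
    static double currentValueScore(double[] sByStep, double alpha) {
        int t = sByStep.length;                    // s_1 ... s_t; the last entry is current
        double baseline = 0.0;
        for (int tau = 0; tau < t - 1; tau++) baseline += sByStep[tau];
        if (t > 1) baseline /= (t - 1);
        return (1 - alpha) * sByStep[t - 1] + alpha * baseline;
    }

    public static void main(String[] args) {
        // A flat baseline followed by a sudden spike in duplicates.
        double[] s = {0.01, 0.01, 0.02, 0.01, 0.35};
        System.out.println(currentValueScore(s, 0.8)); // 0.2*0.35 + 0.8*0.0125 = 0.08
    }
}
```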
Step 3: Multiple-values score
The third step of the SD algorithm is the calculation of
every current application’s score using all values’ scores
and attribute weights.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, JANUARY 201X
$$S(v_i) = \sum_{k=1}^{N} S(a_{i,k}) \times w_k \qquad (11)$$
where S(vi ) is the SD suspicion score of the current
application.
Step 4: SD attributes selection
At the end of every current Mini-discrete data stream,
the fourth step of the SD algorithm selects the attributes
for the SD suspicion score. This also highlights the probe-reduction of selected attributes.
$$w_k = \begin{cases} 1 & \text{if } \dfrac{1}{2 \times N} \le \dfrac{\sum_{i=1}^{p \times q} S(a_{i,k})}{i \times \sum_{k=1}^{N} w_k} \le \dfrac{1}{N} + \sqrt{\dfrac{\sum_{k=1}^{N} \left( w_k - \frac{1}{N} \right)^2}{N}} \\ 0 & \text{otherwise} \end{cases} \qquad (12)$$
where wk is the SD attribute weight applied to the
SD attributes in Equation (11). The first case is the
average density of each attribute, or the sum of all value
scores within a Mini-discrete stream for one attribute,
relative to all other applications and attribute weights.
In addition, the first case retains only the best attributes’
weights within the lowerbound (half of default weight)
and upperbound (default weight plus one standard deviation), by setting redundant attributes’ weights to zero.
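The selection rule of Equation (12) is sketched below: keep an attribute (weight 1) only when its average density lies between half the default weight 1/N and the default weight plus one standard deviation of the current weights, and drop it (weight 0) otherwise. Here 'density[k]' stands for the equation's middle term, and the names are ours.

```java
public class AttributeSelection {
    /** Equation (12): retain only the best (not-too-sparse, not-too-dense) attributes. */
    static double[] selectSdAttributes(double[] density, double[] currentWeights) {
        int n = currentWeights.length;
        double defaultWeight = 1.0 / n;
        double variance = 0.0;
        for (double w : currentWeights) {
            variance += (w - defaultWeight) * (w - defaultWeight);
        }
        double lower = 1.0 / (2.0 * n);                         // half of default weight
        double upper = defaultWeight + Math.sqrt(variance / n); // default + one std dev
        double[] selected = new double[n];
        for (int k = 0; k < n; k++) {
            selected[k] = (density[k] >= lower && density[k] <= upper) ? 1.0 : 0.0;
        }
        return selected;
    }
}
```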
Step 5: CD attribute weights change
At the end of every current Mini-discrete data stream,
the fifth step of the SD algorithm updates the attribute
weights for CD.
$$w_k = \frac{\sum_{i=1}^{p \times q} S(a_{i,k})}{i \times \sum_{k=1}^{N} w_k} \qquad (13)$$
where wk is the SD attribute weight applied to the CD
attributes in Equation (3).
Standalone CD assumes all attributes are of equal
importance. The resilient combination of CD-SD means
that CD is provided attribute weights by SD, and these
attribute weights reflect the degree of importance of attributes. This is how CD and SD scores are combined
to give a single score.
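A deliberately simplified sketch of Equation (13) follows: each CD attribute weight becomes that attribute's share of all value scores accumulated over the Mini-discrete stream. We normalise the weights to sum to one, which is a simplification of the equation's exact denominator; 'totalScores[k]' is the sum of S(a_{i,k}) over the stream's p×q applications, and the names are ours.

```java
public class CdAttributeWeights {
    /** Simplified Equation (13): density-proportional CD attribute weights. */
    static double[] cdAttributeWeights(double[] totalScores) {
        int n = totalScores.length;
        double total = 0.0;
        for (double s : totalScores) total += s;
        double[] w = new double[n];
        for (int k = 0; k < n; k++) {
            // Denser (spikier) attributes get larger weights; with no signal at
            // all, fall back to standalone CD's default of equal importance, 1/N.
            w[k] = (total > 0.0) ? totalScores[k] / total : 1.0 / n;
        }
        return w;
    }
}
```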
4 EXPERIMENTAL RESULTS

4.1 Identity data - Real Application DataSet (RADS)
Substantial identity crime can be found in private
and commercial databases containing information collected about customers, employees, suppliers, and rule
violators. The same situation occurs in public and
government-regulated databases such as birth, death,
patient and disease registries; and taxpayer, residents' address, bankruptcy, and criminal lists.
To reduce identity crime, the most important textual
identity attributes, such as personal name, Social Security
Number (SSN), Date-of-Birth (DoB), and address, must
be used. The following publications support this argument: [16] ranks SSN as most important, followed by
personal name, DoB, and address. [17] assigns the highest
weights to permanent attributes (such as SSN and DoB),
followed by stable attributes (such as last name and
state), and transient (or ever-changing) attributes (such
as mobile phone number and email address). [27] states
that DoB, gender, and postcode can uniquely identify
more than eighty percent of the United States (US) population. [12], [20] regard name, gender, DoB, and address
as the most important attributes. The most important
identity attributes differ from database to database. They
are least likely to be manipulated, and are easiest to
collect and investigate. They also have the fewest missing
values, the fewest spelling and transcription errors, and no
encrypted values.
Extra precaution had to be taken in this project since
this is the first time, to the best of the researchers'
knowledge, that so much real identity data has been
released for original credit application fraud detection
research. Issues of privacy, confidentiality, and ethics
were of prime concern.
This real dataset was chosen because, at experimentation time, it had the most recent fraud behaviour.
Although this real dataset cannot be made available,
there is a synthetic dataset of fifty thousand credit applications which is available at https://sites.google.com/
site/cliftonphua/communal-fraud-scoring-data.zip.
The specific summaries and basic statistics of the real
credit application data are discussed below. For purposes
of confidentiality, the application volume and fraud
percentage in Figure 1 have been deliberately removed.
Also, the average fraud percentage (known fraud percentage in all applications) and specific attributes for
application fraud detection cannot be revealed.
There are thirteen months (m1 to m13) with several
million applications (Veda Advantage, 2006). Each day
(d1 to d31) has more than ten thousand applications.
This historical data is unsampled, time-stamped to the
millisecond, and modelled as data streams. Figure 1(a)
illustrates that the detection system has to handle a more
rapid and continuous data stream on weekdays than on
weekends.

Fig. 1. Real Application DataSet (RADS): (a) daily application volume for two months; (b) fraud percentage across months
There are about thirty raw attributes such as personal
names, addresses, telephone numbers, driver licence
numbers (or SSN), DoB, and other identity attributes
(but no link attribute). Only nineteen of the most important identity attributes (I to XIX) are selected. All
numerical attributes are treated as string attributes. Some
of these identifying attributes, including names, were
encrypted to preserve privacy. For our identity crime
detection data, its encrypted attributes are limited to exact matching because the particular encryption method
was not made known to us. But in a real application,
homomorphic encryption [18] or unencrypted attributes
would be used to allow string similarity matching. Another two problems are many missing values in some
attributes, and hash collisions in encrypted attributes
(different original values encrypted into the same encrypted value), but it is beyond the scope of this paper
to present any solution.
The imbalanced class is extreme, with less than one
percent of known frauds in all binary class-labeled (as
“fraud” or “legal”) applications. Figure 1(b) depicts that
known frauds are significantly understated in the provided applications. The main reason for fewer known
frauds is having only eight months (m7 to m14) of
known frauds linked to thirteen months of applications.
Six months (m1 to m6) of known frauds were not provided. This makes m6 to m10 appear to have the highest fraud
percentage, when in fact this is not true. Other reasons include
some frauds which were unlabeled, having been inadvertently overlooked. Some known frauds are labeled
once but not their duplicates, while some organisations
do not contribute known frauds.
The impact of fewer known frauds means algorithms
will produce poorer results and lead to incorrect evaluation. To reduce this negative impact and improve
scalability, the data has been rebalanced by retaining all
known frauds but randomly undersampling unknown
applications by ninety percent.
The applications come from multiple sources: thirty-one
organisations (s1 to s31). The top five of these
organisations (s1 to s5) can be considered
large (with at least ten thousand applications per month),
and more important than others, because they contribute
more income to the credit bureau. Each organisation
contributes their own number and type of attributes.
The data quality was enhanced through the cleaning
of two obvious data errors. First, a few organisations’
applications, with slightly more than ten percent of
all applications, were filtered. This was because some
important unstructured attributes were encrypted into
just one value. Also, several “dummy” organisations’
applications, comprising less than two percent of all
applications, were filtered. They were actually test values
particularly common in some months.
After the above data pre-processing activities, the actual experimental data provided significantly improved
results. This was observed using the parameter settings
in CD and SD (subsection 4.3). These results have been
omitted to focus on the results from CD and SD parameter settings and attributes.
In addition, the training and test datasets are defined as follows.
The CD, SD, and classification algorithms use eight
consecutive months (m6 to m13) out of thirteen months
data (each month is also known as a Mini-discrete stream
in this paper) where known frauds are not significantly
understated. For creating the whitelist, selecting attributes,
or setting attribute weights in the next month, the training set is the previous month’s data. For evaluation, the
test set is the current month’s data. Both training and test
datasets are separate from each other. For example, in
CD, the initial whitelist is constructed from m5 training
data, applied to m6 test data; and so on, until the final
whitelist is constructed from m12 training data, and
applied to m13 test data.
4.2 Evaluation measure
           | Known frauds | Unknowns
Alerts     | tp           | fp
Non-alerts | fn           | tn

TABLE 5
Confusion matrix
Table 5 shows the four main result categories for binary-class data with a given decision threshold. Alerts (or
alarms) refer to applications with scores which exceed
the decision threshold, and are subjected to responses such
as human investigation or outright rejection. Non-alerts
are applications with scores lower than the decision
threshold. tp, f p, f n, and tn are the number of true
positives (or hits), false positives (or false alarms), false
negatives (or misses), and true negatives (or normals),
respectively.
Measure             | Description
precision           | tp / (tp + fp)
recall, sensitivity | tp / (tp + fn)
(1 - specificity)   | fp / (fp + tn)
F-measure curve     | 2 × precision × recall / (precision + recall), plotted against X thresholds
ROC curve           | All scores ranked in descending order; sensitivity versus (1 - specificity), plotted against X thresholds

TABLE 6
Evaluation measures
Table 6 briefly summarises useful evaluation measures for scores. This paper uses the F -measure curve
[31] and Receiver Operating Characteristic (ROC) curve
[8] with eleven threshold values from zero to one to
compare all experiments. The additional thresholds are
needed because F -measure curves seldom dominate one
another over all thresholds. The F -measure curve is
recommended over other useful measures for the following reasons. First, for confidentiality reasons, precision-recall curves are not used as they would reveal true positives, false positives, and false negatives. Second, in
imbalanced class data, ROC curves, AUC, and accuracy
understate the false positive percentage because they use
true negatives [5]. Third, being a single-value measure,
NTOP-k [4] does not evaluate results for more than one
threshold. The ROC curve can be used to complement the F -measure curve because the former allows the reader to
directly interpret whether CD and SD are really reducing false
positives. Reducing false positives is very important because staff costs for manual investigation, and valuable
customers lost due to credit application rejection, are
very expensive.
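A sketch of how such an F-measure curve can be computed follows: alert on scores at or above each of the eleven thresholds from 0 to 1, then derive F = 2PR/(P + R) from the known-fraud labels. Variable names are ours, and, as noted below, zero scores would be removed before this step.

```java
public class FMeasureCurve {
    /** F-measure at eleven thresholds 0.0, 0.1, ..., 1.0. */
    static double[] fMeasureCurve(double[] scores, boolean[] isFraud) {
        double[] f = new double[11];
        for (int i = 0; i <= 10; i++) {
            double threshold = i / 10.0;
            int tp = 0, fp = 0, fn = 0;
            for (int j = 0; j < scores.length; j++) {
                boolean alert = scores[j] >= threshold;
                if (alert && isFraud[j]) tp++;
                else if (alert && !isFraud[j]) fp++;
                else if (!alert && isFraud[j]) fn++;
            }
            double precision = (tp + fp > 0) ? tp / (double) (tp + fp) : 0.0;
            double recall = (tp + fn > 0) ? tp / (double) (tp + fn) : 0.0;
            f[i] = (precision + recall > 0)
                    ? 2 * precision * recall / (precision + recall) : 0.0;
        }
        return f;
    }
}
```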
In this paper, the scores have two unique characteristics. First, the CD score distribution is heavily skewed
to the left, while SD score distribution is more skewed
to the right. Most scores are zero as values are usually
sparse. All the zero scores have been removed since
they are not relevant to decision making. This will
result in more realistic F -measures, although the number
of applications in each F -measure will most likely be
different. Second, some scores can exceed one since each
application can be similar to many others. In contrast,
classifier scores from naive Bayes, decision trees, logistic
regression, or SVM exhibit a normal distribution and
each score is between zero and one.
4.3 CD and SD’s experimental design
All experiments were performed on a dedicated server with two quad-core Xeon processors (eight 2.0 GHz cores) and 12 GB RAM,
running the Windows Server 2008 platform. Communal
and spike detection algorithms, as well as evaluation
measures, were coded in Java. The application data was
stored in a MySQL database. The plan here is to process
all real applications from RADS with the most influential
parameters and their values. These influential parameters are known to provide the best results based on
the experience from hundreds of previous experiments.
However, the best results are also dependent on setting
the right value for each influential parameter in practice,
as some parameters are sensitive to a change in their
value.
There are seven experiments which focus on specific claims in this paper: (1) No-whitelist, (2) CD-baseline,
(3) CD-adaptive, (4) SD-baseline, (5) SD-adaptive, (6) CD-SD-resilient, and (7) CD-SD-resilient-best.
The first three experiments address how much the CD algorithm reduces false positives. The no-whitelist
experiment uses zero link-types (M = 0) to avoid using the whitelist. The CD-baseline experiment has the
following parameter values (based on hundreds of previous CD experiments):
• W = set to what is convenient for experimentation (for reasons of confidentiality, the actual W cannot be given)
• M = 100
• T_similarity = 0.8
• T_attribute = 3
• η = 120
• α = 0.8
In other words, the CD-baseline uses a whitelist with the one hundred most frequent link-types, and sets the string
similarity threshold, attribute threshold, exact duplicate filter, and the exponential smoothing factor for
scores. To validate the usefulness of the adaptive CD algorithm's changing parameter values, the CD-adaptive
experiment has three parameters (W, M, T_similarity) whose values can be changed according to the
State-of-Alert (SoA).
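The illustrative class below collects the CD-baseline values into one configuration and shows one way the three SoA-sensitive parameters could be adjusted; the adjustment rule is hypothetical, as the paper does not specify how each SoA level changes W, M, and T_similarity.

```java
// Illustrative only: parameter names follow the paper, but this class and
// the State-of-Alert adjustment rule are assumptions, not the actual system.
public class CdParameters {
    int    windowSize;                 // W: moving window size (actual value confidential)
    int    linkTypes = 100;            // M: most frequent link-types in the whitelist
    double tSimilarity = 0.8;          // T_similarity: string similarity threshold
    int    tAttribute = 3;             // T_attribute: attribute threshold
    int    exactDuplicateFilter = 120; // eta: exact duplicate filter
    double alpha = 0.8;                // exponential smoothing factor for scores

    enum StateOfAlert { LOW, MEDIUM, HIGH }

    /** Adaptive CD: W, M, and T_similarity change with the State-of-Alert. */
    void adapt(StateOfAlert soa) {
        switch (soa) {
            case HIGH   -> { windowSize *= 2; linkTypes /= 2; tSimilarity = 0.9; }
            case MEDIUM -> { /* keep baseline values */ }
            case LOW    -> { linkTypes *= 2; tSimilarity = 0.7; }
        }
    }
}
```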
The fourth and fifth experiments show whether the SD algorithm increases detection power. The next experiment,
SD-baseline, has the following parameter values (based on hundreds of previous SD experiments):
• N = 19
• t = 10
• T_similarity = 0.8
• θ = 60
• α = 0.8
In other words, the SD-baseline uses all nineteen attributes and a moving window made up of ten window steps,
and sets the string similarity threshold, time difference filter, and the exponential smoothing factor for
steps. The SD-adaptive experiment selects the two best attributes for its suspicion score.
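As a minimal sketch of the exponential smoothing applied to window steps, assuming each of the t = 10 steps holds a duplicate count for the current attribute value, the EWMA below uses the paper's α = 0.8; the full SD scoring formula is not reproduced here.

```java
class StepSmoothing {
    /** Exponentially smooths the counts across the window steps (most recent last). */
    static double smoothSteps(double[] stepCounts, double alpha) {
        double smoothed = stepCounts[0];
        for (int i = 1; i < stepCounts.length; i++) {
            // EWMA: the current step is weighted by alpha, the history by (1 - alpha)
            smoothed = alpha * stepCounts[i] + (1 - alpha) * smoothed;
        }
        return smoothed;
    }
}
```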
The last two experiments highlight how well the CD-SD combination works. The CD-SD-resilient experiment is
actually CD-baseline using the attribute weights provided by SD-baseline; a sketch of this weighting follows
the parameter list below. To empirically evaluate the detection system, the final experiment is the
CD-SD-resilient-best experiment, with the best parameter setting below (without the adaptive CD algorithm's
changing parameter values):
• W = set to what is expected to be used in practice
• T_similarity = 1
• T_attribute = 4
• SD attribute weights
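One plausible reading of this combination is that CD's per-attribute similarities are re-weighted by SD's attribute weights, so that redundant attributes (weight near zero) stop contributing to the suspicion score; the sketch below is our assumption about the mechanics, not the paper's exact formula.

```java
class CdSdCombination {
    /** CD similarity over one multi-attribute link, re-weighted by SD attribute weights. */
    static double weightedSimilarity(double[] attributeSimilarities, double[] sdWeights) {
        double weightedSum = 0, weightSum = 0;
        for (int i = 0; i < attributeSimilarities.length; i++) {
            weightedSum += sdWeights[i] * attributeSimilarities[i];
            weightSum   += sdWeights[i];  // redundant attributes carry weight near zero
        }
        return (weightSum == 0) ? 0 : weightedSum / weightSum;
    }
}
```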
4.4 CD and SD's results and discussion
Fig. 2. F-measure curves of CD and SD experiments
The CD F-measure curves skew to the left. The CD-related F-measure curves start from 0.04 to 0.06 at
threshold 0, and peak from 0.08 to 0.25 at thresholds 0.2 or 0.3. On the other hand, the SD F-measure curves
skew to the right.
Fig. 3. ROC curves of CD and SD experiments
Without the whitelist, the results are inferior. From Figure 2 at threshold 0.2, the no-whitelist experiment
(F-measure below 0.09) performs more poorly than the CD-baseline experiment (F-measure above 0.1). From Figure
3, the no-whitelist experiment has about 10% more false positives than the CD-baseline experiment. This
verifies the hypothesis that the whitelist is crucial because it reduces the scores of legal behaviour, and
hence false positives; also, the larger the volume for a link-type, the higher the probability of a communal
relationship.
From Figure 2 at threshold 0.2, the CD-adaptive experiment (F-measure above 0.16) has a significantly higher
F-measure than the CD-baseline experiment. From Figure 3, the CD-adaptive experiment has about 5% fewer false
positives in the early part of the ROC curve than the CD-baseline experiment. The interpretation is that the
most useful parameters are the moving window and the number of link-types. More importantly, the adaptive CD
algorithm finds the balance between effectiveness and efficiency to produce significantly better results than
the CD-baseline experiment. This empirical evidence suggests that there is tamper-resistance in the parameters
and the whitelist, as some parameters' values and the whitelist's link-types are changed in a principled
fashion.
From Figure 2 at threshold 0.7, the SD-adaptive experiment (F-measure around 0.1) has a significantly higher
F-measure than the SD-baseline experiment. Also, the SD-adaptive experiment has almost the same F-measure as
the CD-baseline experiment, but at different thresholds. Since most attributes are redundant, the adaptive SD
algorithm only needs to select the two best attributes for the calculation of the suspicion score. This means
that the adaptive SD algorithm on the two best attributes produces better results than the SD algorithm on all
attributes, as well as similar results to the basic CD algorithm on all attributes.
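A simple way to realise this selection, assuming each attribute already carries an SD weight, is to rank the attributes by weight and keep the top k (here k = 2); the helper below is illustrative.

```java
import java.util.stream.IntStream;

class AttributeSelection {
    /** Returns the indices of the k highest-weighted attributes. */
    static int[] selectBestAttributes(double[] sdWeights, int k) {
        return IntStream.range(0, sdWeights.length)
                .boxed()
                .sorted((a, b) -> Double.compare(sdWeights[b], sdWeights[a]))  // descending by weight
                .limit(k)
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```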
Across thresholds 0.2 to 0.5, the CD-SD-resilient-best experiment (F-measure above 0.23) has an F-measure
which is more than twice the CD-baseline experiment's. This is the most apparent outcome of all experiments:
the CD algorithm, strengthened by the SD algorithm's attribute weights, and with the right parameter setting,
delivers superior results, despite an extremely imbalanced class (at least for the given dataset). In
addition, the results from the CD-SD-resilient experiment support the view that SD attribute weights
strengthen the CD algorithm; and resilience (CD-SD-resilient) is shown to be better than adaptivity
(CD-adaptive and SD-adaptive).
Fig. 4. F-measure curves of CD-SD-resilient-best parameters
Extending the CD-SD-resilient-best experiment, Figure 4 shows the results of doubling the most influential
parameters' values. W and T_attribute show significant increases in F-measure over most thresholds, and M
shows a slight increase at thresholds 0.2 to 0.4.
Results on the data support the argument that successful credit application fraud patterns are characterised
by sudden and sharp spikes in duplicates. However, this result is based on some assumptions and conditions
shown by the research to be critical for effective detection. A larger moving window and attribute threshold,
as well as exact matching and the whitelist, must be used. There must also be tamper-resistance in the
parameters and the whitelist. It is also assumed that SD attribute weights are used for SD attribute selection
(probe-reduction of attributes), and that SD attribute weights are used to change the CD attribute weights.
However, the results can be slightly incorrect because of the encryption of some attributes and the
significantly understated number of known frauds. Also, the solutions could not account for the effects of the
existing defences (business rules and scorecards, and known fraud matching) on the results.
4.5 Drilled-down results and discussion
The CD-SD-resilient-best experiment shows that the CD-SD combination method works best for all thirty-one
organisations as a whole. The same method may not work well for every organisation. Figure 5 shows the
detailed breakdown of the top-5 organisations' (s1 to s5) results from the CD-SD-resilient-best experiment.
Similar to the CD-SD-resilient-best experiment, the top-5 organisations' F-measure curves are skewed to the
left.
Fig. 5. F-measure curves of top-5 organisations
Across thresholds 0.2 to 0.5, two organisations, s1 (F-measure above 0.22) and s3 (F-measure above 0.19), have
F-measures comparable to the CD-SD-resilient-best experiment's (F-measure above 0.23). In contrast, for the
same thresholds, three organisations, s4 (F-measure above 0.16), s2 (F-measure above 0.08), and s5 (F-measure
above 0.05), have significantly lower F-measures. However, in the CD-baseline experiment, for thresholds 0.2
to 0.5, s5 performs better than s4. This implies that most methods or parameter settings can work well for
only some organisations.
4.6 Classifier-comparison experimental design, results, and discussion
Are classification algorithms suitable for the real-time credit application fraud detection domain? To answer
this question, four popular classification algorithms with default parameters in WEKA [31] were chosen for the
classifier experiments. The algorithms were Naive Bayes (NB), C4.5 Decision Tree (DT), Logistic Regression
(LoR), and Support Vector Machines (SVM), using the current state-of-the-art libSVM. A well-known data stream
classification algorithm, Very Fast Machine Learner (VFML), which is a Hoeffding decision tree, is also used
with default parameters in MOA [1]. They were applied to the same training and test data used by the CD and SD
algorithms, and there was an extra step to convert the string attributes to word vector ones. The following
experiments assume that ground truth is available at training time (see Section 1.2 for a description of the
problems in using known frauds).
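A minimal WEKA sketch of this set-up is shown below: string attributes are converted to word vectors with StringToWordVector, then one of the five classifiers (NB here) is trained and evaluated. The file paths are placeholders, and the MOA/VFML run is configured analogously.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");  // placeholder path
        Instances test  = DataSource.read("test.arff");   // placeholder path
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Extra step from the paper: convert string attributes to word vectors.
        StringToWordVector toVector = new StringToWordVector();
        toVector.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, toVector);
        Instances testVec  = Filter.useFilter(test, toVector);  // same dictionary as training

        NaiveBayes nb = new NaiveBayes();  // default parameters, as in the experiments
        nb.buildClassifier(trainVec);

        Evaluation eval = new Evaluation(trainVec);
        eval.evaluateModel(nb, testVec);
        System.out.println(eval.toSummaryString());
    }
}
```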
Classification algorithms are not the most accurate and scalable for this domain. Figure 6 compares the five
classifiers against the CD-SD-resilient-best experiment, with F-measure across eleven thresholds. Across
thresholds 0.2 to 0.5, the CD-SD-resilient-best experiment's F-measure can be several times higher than the
five classifiers': NB (F-measure above 0.08), LoR (F-measure above 0.05), VFML (F-measure above 0.04), and SVM
and DT (F-measure above 0.03).
Fig. 6. F-measure curves of five classification algorithms
Experiment(s)            Relative time
CD-SD-resilient-best     1
NB                       1.25
DT                       5
VFML                     18
LoR                      60
SVM                      156

TABLE 7
Relative time of five classification algorithms
Also, results did not improve from training the five classifiers on labeled multi-attribute links and then
applying them to the multi-attribute links in the test data. Table 7 measures the relative time of the five
classifiers, using the CD-SD-resilient-best experiment as the baseline. Time refers to the total system time
for the algorithm to complete. The CD-SD-resilient-best experiment is orders of magnitude faster than the
classifier experiments because it does not need to train on many word vector attributes with few known frauds.
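The relative times in Table 7 could be reproduced with a harness along the following lines, assuming wall-clock time is an acceptable proxy for the total system time; the baseline is the CD-SD-resilient-best run.

```java
class RelativeTiming {
    /** Runs one experiment and returns its time relative to the baseline (1.0 = baseline). */
    static double relativeTime(Runnable experiment, double baselineMillis) {
        long start = System.nanoTime();
        experiment.run();
        double elapsedMillis = (System.nanoTime() - start) / 1_000_000.0;
        return elapsedMillis / baselineMillis;
    }
}
```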
5 CONCLUSION
The main focus of this paper is Resilient Identity
Crime Detection; in other words, the real-time search
for patterns in a multi-layered and principled fashion,
to safeguard credit applications at the first stage of the
credit life cycle.
This paper describes an important domain that has many problems relevant to other data mining research. It has
documented the development and evaluation of the data mining layers of defence for a real-time credit
application fraud detection system. In doing so, this research produced three concepts (or "force
multipliers") which dramatically increase the detection system's effectiveness (at the expense of some
efficiency). These concepts are resilience (multi-layer defence), adaptivity (accounting for changing fraud
and legal behaviour), and quality data (real-time removal of data errors). These concepts are fundamental to
the design, implementation, and evaluation of all fraud detection, adversarial-related detection, and identity
crime-related detection systems.
The implementation of the CD and SD algorithms is practical because these algorithms are designed for actual
use, to complement the existing detection system. Nevertheless, there are limitations. The first limitation is
effectiveness, as scalability issues, the extremely imbalanced class, and time constraints dictated the use of
rebalanced data in this paper. The counter-argument is that, in practice, the algorithms can search with a
significantly larger moving window, number of link-types in the whitelist, and number of attributes. The
second limitation is in demonstrating the notion of adaptivity. While in the experiments CD and SD are updated
after every period, this is not a true evaluation, as the fraudsters do not get a chance to react and change
their strategy in response to CD and SD, as would occur if they were deployed in real life (the experiments
were performed on historical data).
ACKNOWLEDGMENTS
The authors are grateful to Dr. Warwick Graco and Mr. Kelvin Sim for their insightful comments. This research
was supported by the Australian Research Council (ARC) under Linkage Grant Number LP0454077.
REFERENCES
[1] Bifet, A. and Kirkby, R. 2009. Massive Online Analysis, Technical
Manual, University of Waikato.
[2] Bolton, R. and Hand, D. 2001. Unsupervised Profiling Methods for
Fraud Detection, Proc. of CSCC01.
[3] Brockett, P., Derrig, R., Golden, L., Levine, A. and Alpert, M.
2002. Fraud Classification using Principal Component Analysis of
RIDITs, The Journal of Risk and Insurance 69(3): pp. 341-371. DOI:
10.1111/1539-6975.00027.
[4] Caruana, R. and Niculescu-Mizil, A. 2004. Data Mining in Metric
Space: An Empirical Analysis of Supervised Learning Performance
Criteria, Proc. of SIGKDD04. DOI: 10.1145/1014052.1014063.
[5] Christen, P. and Goiser, K. 2007. Quality and Complexity Measures
for Data Linkage and Deduplication, in F. Guillet and H. Hamilton
(eds), Quality Measures in Data Mining, Vol. 43, Springer, United
States. DOI: 10.1007/978-3-540-44918-8.
[6] Cortes, C., Pregibon, D. and Volinsky, C. 2003. Computational Methods for Dynamic Graphs, Journal of
Computational and Graphical Statistics 12(4): pp. 950-970. DOI: 10.1198/1061860032742.
[7] Experian. 2008. Experian Detect: Application Fraud Prevention System. Whitepaper,
http://www.experian.com/products/pdf/experian_detect.pdf.
[8] Fawcett, T. 2006. An Introduction to ROC Analysis, Pattern Recognition Letters 27: pp. 861-874. DOI: 10.1016/j.patrec.2005.10.010.
[9] Goldenberg, A., Shmueli, G. and Caruana, R. 2002. Using Grocery Sales Data for the Detection of
Bio-Terrorist Attacks, Statistics in Medicine.
[10] Gordon, G., Rebovich, D., Choo, K. and Gordon, J. 2007. Identity
Fraud Trends and Patterns: Building a Data-Based Foundation
for Proactive Enforcement, Center for Identity Management and
Information Protection, Utica College.
[11] Hand, D. 2006. Classifier Technology and the Illusion
of Progress, Statistical Science 21(1): pp. 1-15. DOI:
10.1214/088342306000000060.
[12] Head, B. 2006. Biometrics Gets in the Picture, Information Age
August-September: pp. 10-11.
[13] Hutwagner, L., Thompson, W., Seeman, G., Treadwell, T. 2006.
The Bioterrorism Preparedness and Response Early Aberration
Reporting System (EARS), Journal of Urban Health 80: pp. 89-96.
PMID: 12791783.
[14] IDAnalytics. 2008. ID Score-Risk: Gain Greater Visibility into
Individual Identity Risk. Unpublished.
[15] Jackson, M., Baer, A., Painter, I. and Duchin, J. 2007. A Simulation
Study Comparing Aberration Detection Algorithms for Syndromic
Surveillance, BMC Medical Informatics and Decision Making 7(6).
DOI: 10.1186/1472-6947-7-6.
[16] Jonas, J. 2006. Non-Obvious Relationship Awareness (NORA),
Proc. of Identity Mashup.
[17] Jost, A. 2004. Identity Fraud Detection and Prevention. Unpublished.
[18] Kantarcioglu, M., Jiang, W. and Malin, B. 2008. A Privacy-Preserving Framework for Integrating
Person-Specific Databases, Privacy in Statistical Databases, Lecture Notes in Computer Science, 5262/2008:
pp. 298-314. DOI: 10.1007/978-3-540-87471-3_25.
[19] Kleinberg, J. 2005. Temporal Dynamics of On-Line Information Streams, in M. Garofalakis, J. Gehrke and R. Rastogi (eds),
Data Stream Management: Processing High-Speed Data Streams,
Springer, United States. ISBN: 978-3-540-28607-3.
[20] Kursun, O., Koufakou, A., Chen, B., Georgiopoulos, M., Reynolds,
K. and Eaglin, R. 2006. A Dictionary-Based Approach to Fast and
Accurate Name Matching in Large Law Enforcement Databases,
Proc. of ISI06. DOI: 10.1007/11760146.
[21] Neville, J., Simsek, O., Jensen, D., Komoroske, J., Palmer,
K. and Goldberg, H. 2005. Using Relational Knowledge Discovery to Prevent Securities Fraud, Proc. of SIGKDD05. DOI:
10.1145/1081870.1081922.
[22] Oscherwitz, T. 2005. Synthetic Identity Fraud: Unseen Identity
Challenge, Bank Security News 3: p. 7.
[23] Roberts, S. 1959. Control-Charts-Tests based on Geometric Moving Averages, Technometrics 1: pp. 239-250.
[24] Romanosky, S., Sharp, R. and Acquisti, A. 2010. Data Breaches
and Identity Theft: When is Mandatory Disclosure Optimal?, Proc.
of WEIS10 Workshop, Harvard University.
[25] Schneier, B. 2003. Beyond Fear: Thinking Sensibly about Security in an Uncertain World, Copernicus, New York. ISBN-10:
0387026207.
[26] Schneier, B. 2008. Schneier on Security, Wiley, Indiana. ISBN-10:
0470395354.
[27] Sweeney, L. 2002. k-Anonymity: A Model for Protecting Privacy,
International Journal of Uncertainty, Fuzziness Knowledge-Based
Systems: 10(5): pp. 557-570.
[28] VedaAdvantage. 2006. Zero-Interest Credit Cards Cause Record
Growth In Card Applications. Unpublished.
[29] Wheeler, R. and Aitken, S. 2000. Multiple Algorithms for
Fraud Detection, Knowledge-Based Systems 13(3): pp. 93-99. DOI:
10.1016/S0950-7051(00)00050-2.
[30] Winkler, W. 2006. Overview of Record Linkage and Current Research Directions, Technical Report RR 2006-2, U.S. Census Bureau.
[31] Witten, I. and Frank, E. 2000. Data Mining: Practical Machine
Learning Tools and Techniques with Java, Morgan Kauffman Publishers, San Francisco. ISBN-10: 1558605525.
[32] Wong, W. 2004. Data Mining for Early Disease Outbreak Detection, PhD thesis, Carnegie Mellon University.
[33] Wong, W., Moore, A., Cooper, G. and Wagner, M. 2003. Bayesian
Network Anomaly Pattern Detection for Detecting Disease Outbreaks, Proc. of ICML03. ISBN: 1-57735-189-4.
Clifton Phua is a Research Fellow at the Data Mining Department of the Institute for Infocomm Research (I2R),
Singapore. His current research interests are in security- and healthcare-related data mining.
Kate Smith-Miles is a Professor and Head of the School of Mathematical Sciences at Monash University,
Australia. Her current research interests are in neural networks, intelligent systems, and data mining.
Vincent Lee is an Associate Professor in the Clayton School of Information Technology, Monash University,
Australia. His current research interests are in data and text mining for business intelligence.
Ross Gayler is a Senior Research and Development Consultant at Veda Advantage, Australia. His current research
interests are in credit scoring.