A PROJECT REPORT ON
"CREDIT CARD FRAUD DETECTION"
BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
By
ASHWINI H V
1NH15IS143
CERTIFICATE
Certified that the project work entitled “CREDIT CARD FRAUD DETECTION”, carried out
by Ms. ASHWINI H V, USN 1NH15IS143, a bona fide student of NEW HORIZON COLLEGE
OF ENGINEERING, Bengaluru, is in partial fulfillment for the award of Bachelor of
Engineering in Information Science and Engineering of the Visvesvaraya Technological
University, Belgaum, during the year 2018-19. It is certified that all
corrections/suggestions indicated for Internal Assessment have been incorporated in the
report deposited in the departmental library.
The project report has been approved as it satisfies the academic requirements in
respect of Project work prescribed for the said Degree.
External Viva
1.
2.
DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING
DECLARATION
I hereby declare that I have followed the guidelines provided by the Institution in
preparing the project report, and that the report of the project titled “CREDIT CARD
FRAUD DETECTION” has been uniquely prepared by me after the completion of the project
work. I also confirm that the report is prepared only for my academic requirement and
that the results embodied in this report have not been submitted to any other University or
Institution for the award of any degree.
Name: ASHWINI H V
USN: 1NH15IS143
ABSTRACT
Nowadays the usage of credit cards has increased dramatically. As the credit card becomes the
most popular mode of payment for both online and regular purchases, cases of fraud associated
with it are also rising. Here we model the sequence of operations in credit card transaction
processing using a Hidden Markov Model (HMM) and show how it can be used for the detection of
frauds. An HMM is initially trained with the normal behavior of a cardholder. If an incoming credit
card transaction is not accepted by the trained HMM with sufficiently high probability, it is
considered to be fraudulent. At the same time, we try to ensure that genuine transactions are not
rejected. We present experimental results to show the effectiveness of our approach and
compare it with other techniques available in the literature.
ACKNOWLEDGEMENT
Any achievement, be it scholastic or otherwise, does not depend solely on individual effort
but on the guidance, encouragement and cooperation of intellectuals, elders and friends. A
number of personalities, in their own capacities, have helped me in carrying out this project. I
would like to take this opportunity to thank them all.
First and foremost, I thank the management, Dr. Mohan Manghnani, Chairman, New Horizon
Educational Institutions, for providing the necessary state-of-the-art infrastructure to carry out this project.
I would like to thank Dr. R J Anandhi, Professor and Head of the Department, Information
Science and Engineering, New Horizon College of Engineering, Bengaluru, for the constant
encouragement and support extended towards completing my project.
I deeply express my sincere gratitude to my guide Mrs. Baswaraju Swathi, Assistant Professor,
Department of ISE, New Horizon College of Engineering, Bengaluru, for her able guidance and
regular encouragement and assistance throughout my project period.
Last, but not least, I would like to thank my peers and friends who provided me with valuable
suggestions to improve my project.
ASHWINI H V
(1NH15IS143)
TABLE OF CONTENTS
ABSTRACT i
ACKNOWLEDGEMENT ii
LIST OF FIGURES vi
1. PREAMBLE 01
1.1 Introduction 01
1.4 Purpose 03
1.6 Existing System 05
1.7 Proposed System 06
2. LITERATURE SURVEY 07
5. IMPLEMENTATION 20
5.1 Different Modules of the Project 31
5.2 Flow Chart of the Proposed System 34
6. EXPERIMENTAL RESULTS 35
6.1 Outcomes of the Proposed System 35
7. TESTING 38
7.1 Testing and Validations 38
7.2 Testing Levels 38
7.2.1 Functional Testing 39
7.2.2 Non-Functional Testing 39
7.3 White Box Testing 41
7.4 Different Stages of Testing 43
7.4.1 Unit Testing 43
7.4.2 Integration Testing 44
7.4.3 System Testing 46
7.4.4 Acceptance testing 47
REFERENCES 49
LIST OF FIGURES
1 Data Requirements 13
2 System Architecture 16
7 Training test 32
CHAPTER 1
PREAMBLE
1.1 INTRODUCTION:
Today, data is available very easily all around the world; organizations from small to big store
information of high volume, variety, velocity and value. This information comes from a host of
sources such as social media followers, likes and comments, and users’ purchase behavior. All this
information is used for analysis and visualization of hidden data patterns. Early analysis of big data
was centered primarily on data volume, for example, public databases, biometrics and financial analysis.
For fraudsters, the credit card is an easy and friendly target, because a significant amount of
money can be obtained within a short period without much risk. To commit credit card fraud, fraudsters
try to steal sensitive information such as the credit card number, bank account and social security
number. Fraudsters try to make every fraudulent transaction look legitimate, which makes fraud
detection a challenging problem. The growth in credit card transactions suggests that approximately
70% of the people in the US could fall into the trap of these fraudsters.
The credit card dataset is highly imbalanced because it carries far more legitimate transactions than
fraudulent ones. That means a prediction model can achieve a very high accuracy score without
detecting a single fraudulent transaction. One better way to handle this kind of problem is to change
the class distribution, i.e., to sample the minority class. In minority sampling, the number of minority-class
training examples is increased in proportion to the majority class to raise the chance of correct prediction by the algorithm.
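As a minimal sketch of such minority oversampling, assuming a pandas DataFrame df with a binary class column where 1 marks a fraudulent transaction (the column names and values below are made up for illustration):
import pandas as pd
from sklearn.utils import resample

# Hypothetical transaction data: 'class' = 1 marks fraud, 0 marks a legitimate transaction.
df = pd.DataFrame({
    'amount': [20, 35, 500, 12, 48, 700, 15, 22],
    'class':  [0,  0,  1,   0,  0,  1,   0,  0],
})

majority = df[df['class'] == 0]
minority = df[df['class'] == 1]

# Randomly duplicate minority rows until both classes have the same number of examples.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced['class'].value_counts())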
1.2 Relevance of the Project
In general, a fraud is defined as a crime committed with the intention to damage a person, and it is also a
violation of law. Fraud may be committed for various reasons: for entertainment, to exploit a business or an
organization, to take revenge, to cause financial loss, to damage an identity, and so on. There are also several
types of fraud: bankruptcy fraud, identity theft, health fraud, religious fraud, credit card fraud,
insurance fraud, forgery, tax fraud and many more. Here we consider only credit card frauds, which
can be of two kinds:
To run a sustainable business, merchants need to make a profit, which is what is left after deducting the
costs of doing business from a company’s revenue. Therefore, a business’s tolerance for payment fraud is
a function of, among other things, its gross margin (sell price minus cost of goods sold). The lower the margin,
the lower the tolerance for payment fraud: with a 5% margin, for example, a single fraudulent $100 sale
wipes out the profit on twenty legitimate $100 sales.
In practice, when fraud occurs, the cardholder disputes the charge and the debit is usually cancelled,
which means either the cardholder’s bank or the merchant absorbs the loss (see [1] for more details).
Cumulatively, fraud represents a significant financial risk to the merchant and the issuing bank. To reduce
fraud, chip and pin technology, 3DSecure, and fraud detection techniques are used.
But if chip and pin technology and 3DSecure exist, why is fraud detection used? There are two main
reasons.
First, the total cost of chip and pin technology and 3DSecure is relatively high compared to the cost of
fraud detection. For example, while online merchants care about conversion, 3DSecure reduces it by several
percentage points (> 5%). Hence, when they have the option, many online merchants decide to deactivate
3DSecure and manage the risk of payment fraud themselves.
Second, adding more security layers to the buying process greatly reduces checkout velocity and, in turn,
convenience for the buyer. While convenience for buyers may look like a fuzzy concept at first, for
companies like Amazon, which pioneered one-click checkout, it’s a marketing argument and a means to
convert and grow revenues.
• Imbalanced data:
The credit card fraud detection data is imbalanced in nature: only a very small
percentage of all credit card transactions are fraudulent. This makes the
detection of fraudulent transactions very difficult and imprecise.
• Overlapping data:
Classification algorithms are usually faced with the problem of detecting new
types of normal or fraudulent patterns. Supervised and unsupervised fraud
detection systems are inefficient in detecting new patterns of normal and fraudulent
behavior, respectively.
• Fraud detection cost:
The system should take into account both the cost of the fraudulent behavior that is
detected and the cost of preventing it.
• Lack of standard metrics:
In this research, “A Novel Approach for Credit Card Fraud Detection” is designed. Credit card frauds are
increasing as there are millions of users worldwide. To stop these fraudulent transactions, a technique is
designed that uses a combination of the Hidden Markov Model, a behavior-based technique, and a Genetic
Algorithm. Each and every transaction is tested with the above-mentioned technique, and the fraud detection
system tests the transaction and detects fraud. The goal is accurate fraud detection with the least possible
number of false detections.
2. Bayesian Network
3. Neural Network
5. Genetic Algorithm
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data preprocessing is necessary because of the presence of unformatted real-world data. Mostly, real-
world data is composed of:
Inaccurate data (missing data) - There are many reasons for missing data, such as data not being collected
continuously, a mistake in data entry, technical problems with biometrics and much more.
Noisy data (erroneous data and outliers) - The reasons for the existence of noisy data could be a technical
problem with the device that gathers the data, a human mistake during data entry and much more.
Inconsistent data - Inconsistencies arise due to reasons such as duplication within the data, human data
entry, mistakes in codes or names, i.e., violation of data constraints, and much more.
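For illustration, a minimal pandas sketch of these cleaning steps; the file and column names follow the dataset used later in this report, while the imputation choices and the outlier threshold are arbitrary assumptions:
import pandas as pd

card_data = pd.read_csv('input/GermanDataset.csv')   # same file used in the implementation chapter

# Inaccurate data: drop rows missing a key field, impute remaining numeric gaps with the median.
card_data = card_data.dropna(subset=['purpose'])
card_data = card_data.fillna(card_data.median(numeric_only=True))

# Inconsistent data: remove duplicated records.
card_data = card_data.drop_duplicates()

# Noisy data: clip extreme outliers in a numeric column (99th percentile chosen arbitrarily).
upper = card_data['current_balance'].quantile(0.99)
card_data['current_balance'] = card_data['current_balance'].clip(upper=upper)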
Train and Test Data Creation
The data we use is usually split into training data and test data. The training set contains a known output
and the model learns on this data in order to be generalized to other data later on. We have the test
dataset (or subset) in order to test our model’s prediction on this subset.
Model Creation
The process of training an ML model involves providing an ML algorithm (that is, the learning algorithm)
with training data to learn from. The term ML model refers to the model artifact that is created by the
training process.
The training data must contain the correct answer, which is known as a target or target attribute. The
learning algorithm finds patterns in the training data that map the input data attributes to the target (the
answer that you want to predict), and it outputs an ML model that captures these patterns.
You can use the ML model to get predictions on new data for which you do not know the target. For
example, let's say that you want to train an ML model to predict whether an email is spam or not spam. You
would provide training data that contains emails for which you know the target (that is, a label that tells
whether an email is spam or not spam). The learning algorithm would train an ML model using this data, resulting in
a model that attempts to predict whether a new email will be spam or not spam.
In our project we are using Random Forest Algorithm to build our Model on Credit Card Fraud Dataset.
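A minimal, self-contained sketch of this training process with scikit-learn's RandomForestClassifier; the two-feature transactions and labels below are made up purely for illustration and are not the project's actual dataset:
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: each row is a transaction, the target marks fraud (1) or not (0).
X_train_demo = [[120, 3], [15, 1], [980, 7], [40, 2], [1500, 9], [25, 1]]
y_train_demo = [0, 0, 1, 0, 1, 0]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_demo, y_train_demo)        # learn patterns mapping features to the target

# Predict the target for a new, unlabelled transaction.
print(model.predict([[700, 6]]))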
Result Analysis
In this final phase, we test our model on the prepared dataset and measure its fraud detection
performance. To evaluate the performance of the created classifier and make it comparable to current
approaches, we use accuracy to measure its effectiveness. We consider the Fraud class as the negative
class and the Non-Fraud class as the positive class.
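A minimal sketch of this evaluation step with scikit-learn; the label vectors below are made up and stand in for the real y_test and predictions produced by the listing in the implementation chapter:
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = non-fraud (positive class), 0 = fraud (negative class).
y_test = [1, 1, 0, 1, 0, 1, 1, 0]
predictions = [1, 1, 0, 1, 1, 1, 0, 0]

print(accuracy_score(y_test, predictions))     # fraction of correct predictions
print(confusion_matrix(y_test, predictions))   # rows: actual class, columns: predicted class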
Flexibility
Sometimes you just don’t want to use what is already there but you want to define something of your
own (for example a cost function, a metric, a layer, etc.).
Although Keras 2 has been designed in such a way that you can implement almost everything you want,
we all know that low-level libraries provide more flexibility, and the same is the case with TF. You can tweak
TF much more than Keras.
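For example, a custom cost function can be defined as an ordinary function and passed to compile(). A minimal sketch assuming tf.keras (the penalty weight and layer sizes are arbitrary):
import tensorflow as tf
from tensorflow import keras

# A custom cost function: mean squared error with a small L1 penalty on the error.
def custom_loss(y_true, y_pred):
    err = y_true - y_pred
    return tf.reduce_mean(tf.square(err)) + 0.01 * tf.reduce_mean(tf.abs(err))

model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(28,)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss=custom_loss, metrics=['accuracy'])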
Functionality
Although Keras provides all the general-purpose functionality for building deep learning models, it
doesn’t provide as much as TF. TensorFlow offers more advanced operations than Keras. This
comes in very handy if you are doing research or developing some special kind of deep learning model.
Some examples regarding high level operations are:
Threading and Queues
Queues are a powerful mechanism for computing tensors asynchronously in a graph. Similarly, you can
execute multiple threads for the same Session for parallel computations and hence speed up your
operations.
Debugger
Another extra power of TF. With TensorFlow, you get a specialized debugger. It provides visibility into the
internal structure and states of running TensorFlow graphs. Insights from debugger can be used to
facilitate debugging of various types of bugs during both training and inference.
Control
The more control you have over your network, the better understanding you have of what’s going on
with it.
With TF, you get such a control over your network. You can control whatever you want in your network.
Operations on weights or gradients can be done like a charm in TF.
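A minimal sketch of this kind of low-level control in TensorFlow 2.x, computing a gradient with tf.GradientTape and applying one manual update (the values are arbitrary):
import tensorflow as tf

w = tf.Variable(3.0)                      # a trainable weight

with tf.GradientTape() as tape:
    loss = tf.square(w - 1.0)             # a toy loss

grad = tape.gradient(loss, w)             # d(loss)/dw = 2*(w - 1) = 4.0
w.assign_sub(0.1 * grad)                  # one manual gradient-descent step
print(w.numpy())                          # 2.6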
Numpy
NumPy, which stands for Numerical Python, is a Python package consisting of multidimensional array objects and
a collection of routines for processing those arrays. Using NumPy, mathematical and logical operations
on arrays can be performed. This section explains the basics of NumPy, such as its architecture and
environment, and also discusses the various array functions, types of indexing, etc. All this is explained
with the help of examples for better understanding.
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray, was also
developed, with some additional functionality. In 2005, Travis Oliphant created the NumPy package by
incorporating the features of Numarray into Numeric. There are many contributors to this open-source
project.
Operations using NumPy
Using NumPy, a developer can perform the following operations −
Mathematical and logical operations on arrays.
Fourier transforms and routines for shape manipulation.
Operations related to linear algebra. NumPy has in-built functions for linear algebra and random number
generation.
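A few of these operations in a short sketch (the values are arbitrary):
import numpy as np

a = np.array([[1, 2], [3, 4]])

print(a + 10)                                          # element-wise arithmetic
print(a > 2)                                           # element-wise logical comparison
print(np.fft.fft(np.array([1.0, 0.0, -1.0, 0.0])))     # Fourier transform
print(a.reshape(4))                                    # shape manipulation
print(np.linalg.inv(a))                                # linear algebra: matrix inverse
print(np.random.rand(3))                               # random number generation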
NumPy - A Replacement for MatLab
NumPy is often used along with packages like SciPy (Scientific Python) and Matplotlib (plotting library).
This combination is widely used as a replacement for MatLab, a popular platform for technical
computing. However, this Python-based alternative is now seen as a more modern and complete
programming environment.
It is open source, which is an added advantage of NumPy.
The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes
the collection of items of the same type. Items in the collection can be accessed using a zero-based
index.
Every item in an ndarray takes the same size of block in the memory. Each element in ndarray is an
object of data-type object (called dtype).
Any item extracted from ndarray object (by slicing) is represented by a Python object of one of array
scalar types. The following diagram shows a relationship between ndarray, data type object (dtype) and
array scalar type −
An instance of ndarray class can be constructed by different array creation routines described later in the
tutorial. The basic ndarray is created using an array function in NumPy as follows −
numpy.array
It creates an ndarray from any object exposing array interface, or from any method that returns an array.
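For example (values chosen arbitrarily):
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
print(arr.shape)     # (2, 3)
print(arr.dtype)     # float64
print(arr[0, 2])     # zero-based indexing: 3.0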
The ndarray objects can be saved to and loaded from the disk files. The IO functions available are −
load() and save() functions handle NumPy binary files (with .npy extension)
loadtxt() and savetxt() functions handle normal text files
NumPy introduces a simple file format for ndarray objects. This .npy file stores data, shape, dtype and
other information required to reconstruct the ndarray in a disk file such that the array is correctly
retrieved even if the file is on another machine with different architecture.
numpy.save()
The numpy.save() function stores the input array in a disk file with .npy extension.
import numpy as np
a = np.array([1,2,3,4,5])
np.save('outfile',a)
To reconstruct array from outfile.npy, use load() function.
import numpy as np
b = np.load('outfile.npy')
print(b)
It will produce the following output −
[1 2 3 4 5]
The save() and load() functions accept an additional Boolean parameter allow_pickle. A pickle in Python
is used to serialize and de-serialize objects before saving to or reading from a disk file.
savetxt()
The storage and retrieval of array data in simple text file format is done
with savetxt() and loadtxt() functions.
Example
import numpy as np
a = np.array([1,2,3,4,5])
np.savetxt('out.txt',a)
b = np.loadtxt('out.txt')
print(b)
It will produce the following output −
[ 1. 2. 3. 4. 5.]
The savetxt() and loadtxt() functions accept additional optional parameters such as header, footer, and
delimiter.
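For example (the file name and contents are arbitrary; the header line written by savetxt is prefixed with '#', which loadtxt skips as a comment by default):
import numpy as np

a = np.arange(6).reshape(2, 3)
np.savetxt('out.csv', a, delimiter=',', header='c1,c2,c3', fmt='%d')
b = np.loadtxt('out.csv', delimiter=',')
print(b)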
CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
card_data = pd.read_csv('input/GermanDataset.csv')
card_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
over_draft 1000 non-null int64
credit_usage 1000 non-null int64
credit_history 1000 non-null int64
purpose 1000 non-null object
current_balance 1000 non-null int64
Average_Credit_Balance 1000 non-null int64
employment 1000 non-null int64
location 1000 non-null int64
personal_status 1000 non-null int64
other_parties 1000 non-null int64
residence_since 1000 non-null int64
property_magnitude 1000 non-null int64
cc_age 1000 non-null int64
other_payment_plans 1000 non-null int64
housing 1000 non-null int64
existing_credits 1000 non-null int64
job 1000 non-null int64
num_dependents 1000 non-null int64
own_telephone 1000 non-null int64
foreign_worker 1000 non-null int64
class 1000 non-null int64
dtypes: int64(20), object(1)
memory usage: 160.2+ KB
card_data.describe()
plt.figure(figsize=(12,6))
card_data[card_data['num_dependents']==1]['current_balance'].hist(alpha=0.5,color='blue',
bins=30,label='num_dependents=1')
card_data[card_data['num_dependents']==2]['current_balance'].hist(alpha=0.5,color='red',
bins=30,label='num_dependents=2')
plt.legend()
plt.xlabel('Current Balance')
plt.figure(figsize=(15,7), dpi=90)
sns.countplot(x='purpose',hue='class',data=card_data,palette='Set1')
Out[7]:<matplotlib.axes._subplots.AxesSubplot at 0xb784170>
sns.jointplot(x='credit_usage',y='current_balance',data=card_data,color='purple')
plt.figure(figsize=(11,7))
sns.lmplot(y='current_balance',x='credit_usage',data=card_data,hue='num_dependents',
col='class',palette='Set1')
card_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
over_draft 1000 non-null int64
credit_usage 1000 non-null int64
credit_history 1000 non-null int64
purpose 1000 non-null object
current_balance 1000 non-null int64
Average_Credit_Balance 1000 non-null int64
employment 1000 non-null int64
location 1000 non-null int64
personal_status 1000 non-null int64
other_parties 1000 non-null int64
residence_since 1000 non-null int64
property_magnitude 1000 non-null int64
cc_age 1000 non-null int64
other_payment_plans 1000 non-null int64
housing 1000 non-null int64
existing_credits 1000 non-null int64
job 1000 non-null int64
num_dependents 1000 non-null int64
own_telephone 1000 non-null int64
foreign_worker 1000 non-null int64
class 1000 non-null int64
dtypes: int64(20), object(1)
memory usage: 160.2+ KB
cat_feats = ['purpose']
final_data = pd.get_dummies(card_data, columns=cat_feats, drop_first=True)
final_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
over_draft 1000 non-null int64
credit_usage 1000 non-null int64
credit_history 1000 non-null int64
current_balance 1000 non-null int64
Average_Credit_Balance 1000 non-null int64
employment 1000 non-null int64
location 1000 non-null int64
personal_status 1000 non-null int64
other_parties 1000 non-null int64
residence_since 1000 non-null int64
property_magnitude 1000 non-null int64
cc_age 1000 non-null int64
other_payment_plans 1000 non-null int64
housing 1000 non-null int64
existing_credits 1000 non-null int64
job 1000 non-null int64
num_dependents 1000 non-null int64
own_telephone 1000 non-null int64
foreign_worker 1000 non-null int64
class 1000 non-null int64
purpose_'new car' 1000 non-null uint8
purpose_'used car' 1000 non-null uint8
purpose_business 1000 non-null uint8
purpose_education 1000 non-null uint8
purpose_furniture/equipment 1000 non-null uint8
purpose_other 1000 non-null uint8
purpose_radio/tv 1000 non-null uint8
purpose_repairs 1000 non-null uint8
purpose_retraining 1000 non-null uint8
dtypes: int64(20), uint8(9)
memory usage: 165.1 KB
final_data.head()
from sklearn.model_selection import train_test_split
X = final_data.drop('class', axis=1)
y = final_data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
X_test.head(10)
X_train.to_excel('Traning_Testing/X_train.xlsx')
X_test.to_excel('Traning_Testing/X_test.xlsx')
y_train.to_excel('Traning_Testing/y_train.xlsx')
y_test.to_excel('Traning_Testing/y_test.xlsx')
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)                  # train the classifier on the training split
predictions = dtree.predict(X_test)         # predict the class of the unseen test transactions
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
5.1 Different modules of the project
• Training Testing
• Credit card
• German Dataset
Training Testing:
X_test
y_train
Credit card:
German Dataset
Snapshot 1
Snapshot 2
Snapshot 3
Snapshot 4
Snapshot 5
CHAPTER 7
TESTING
7.1 Testing and Validations
Validation is a complex process with many possible variations and options, so specifics vary from
database to database, but the general outline is:
Requirement Gathering
The Sponsor decides what the database is required to do based on regulations, company needs, and any
other important factors.
The requirements are documented and approved.
System Testing
Procedures to test the requirements are created and documented.
The version of the database that will be used for validation is set up.
The Sponsor approves the test procedures.
The tests are performed and documented.
Any needed changes are made. This may require another, shorter round of testing and documentation.
System Release
The validation documentation is finalized.
The database is put into production.
Benefits
Unit testing increases confidence in changing and maintaining code. If good unit tests are written and run
every time any code is changed, we will be able to promptly catch any defects introduced by the change.
Also, if the code is already made less interdependent to make unit testing possible, the unintended impact
of changes to any code is smaller.
Code is more reusable. In order to make unit testing possible, code needs to be modular, and modular
code is easier to reuse.
Development is faster. How? If you do not have unit testing in place, you write your code and perform
that fuzzy ‘developer test’ (you set some breakpoints, fire up the GUI, provide a few inputs that hopefully
hit your code and hope that you are all set). But if you have unit testing in place, you write the test,
write the code and run the test. Writing tests takes time, but that time is compensated by the smaller amount
of time it takes to run the tests; you need not fire up the GUI and provide all those inputs. And, of course,
unit tests are more reliable than ‘developer tests’. Development is faster in the long run too. How? The
effort required to find and fix defects found during unit testing is much smaller than the effort
required to fix defects found during system testing or acceptance testing.
The cost of fixing a defect detected during unit testing is lower than that of defects detected
at higher levels. Compare it with the cost (time, effort, destruction, humiliation) of a defect detected during
acceptance testing or when the software is live.
Debugging is easy. When a test fails, only the latest changes need to be debugged. With testing at higher
levels, changes made over the span of several days, weeks or months need to be scanned.
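As a minimal illustration in this project's language, a unit test for a hypothetical helper that flags over-limit transactions; the function name and threshold are made up for the example:
import unittest

def is_over_limit(amount, limit=1000):
    # Hypothetical helper: flag transactions that exceed the card limit.
    return amount > limit

class TestIsOverLimit(unittest.TestCase):
    def test_under_limit(self):
        self.assertFalse(is_over_limit(250))

    def test_over_limit(self):
        self.assertTrue(is_over_limit(1500))

if __name__ == '__main__':
    unittest.main()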
7.4.2 Integration Testing
INTEGRATION TESTING is a level of software testing where individual units are combined and tested as a group.
The purpose of this level of testing is to expose faults in the interaction between integrated units. Test drivers and
test stubs are used to assist in Integration Testing.
Integration testing: Testing performed to expose defects in the interfaces and in the interactions
between integrated components or systems. See also component integration
testing, system integration testing.
Component integration testing: Testing performed to expose defects in the interfaces and
interaction between integrated components.
System integration testing: Testing the integration of systems and packages; testing
interfaces to external organizations (e.g. Electronic Data Interchange, Internet).
Tasks
Integration Test Plan
Prepare
Review
Rework
Baseline
Integration Test Cases/Scripts
Prepare
Review
Rework
Baseline
Integration Test
7.4.3 System Testing
SYSTEM TESTING is a level of software testing where complete, integrated software is tested. The
purpose of this test is to evaluate the system’s compliance with the specified requirements.
System testing: The process of testing an integrated system to verify that it meets specified
requirements.
7.4.4 Acceptance Testing
ACCEPTANCE TESTING is a level of software testing where a system is tested for acceptability. The
purpose of this test is to evaluate the system’s compliance with the business requirements and assess
whether it is acceptable for delivery.
acceptance testing: Formal testing with respect to user needs, requirements, and business processes
conducted to determine whether or not a system satisfies the acceptance criteria and to enable the user,
customers or other authorized entity to determine whether or not to accept the system.
CHAPTER 8
CONCLUSION AND FUTURE ENHANCEMENT
8.1 Conclusion
This survey has explored almost all published fraud detection studies. It defines the adversary, the types
and subtypes of fraud, the technical nature of data, performance metrics, and the methods and
techniques. After identifying the limitations in methods and techniques of fraud detection, this paper
shows that this field can benefit from other related fields. Specifically, unsupervised approaches from
counterterrorism work, actual monitoring systems and text mining from law enforcement, and semi-
supervised and game-theoretic approaches from the intrusion and spam detection communities can
contribute to future fraud detection research. However, Fawcett and Provost (1999) show that there are
no guarantees when they successfully applied their fraud detection method to news story monitoring but
unsuccessfully to intrusion detection.
REFERENCES
1. Anderson, M. (2008). ‘From Subprime Mortgages to Subprime Credit Cards’. Communities and Banking, Federal Reserve Bank of Boston, pp. 21-23.
2. Anwer et al. (2009-2010). ‘Online Credit Card Fraud Prevention System for Developing Countries’. International Journal of Reviews in Computing, ISSN: 2076-3328, Vol. 2, pp. 62-70.
3. Arias, J.C. & Miller, R. (2009). ‘Market Analysis of Student about Credit Cards’. Business Intelligence Journal, Vol. 3, No. 1, pp. 23-36.
4. Bhatla, T.P. et al. (2003). ‘Understanding Credit Card Frauds’. Cards Business Review, 01, pp. 01-15.