Credit Card Fraud Detection (Data Analyst)
Dataset: The dataset is available at the given link; download it at your convenience.
About Dataset
Problem Statement:
Company ABC, a major credit card company, faces challenges with their existing fraud detection system. The
current system exhibits slow responsiveness in recognizing new patterns of fraud, leading to significant financial
losses. To address this issue, they have contracted us to design and implement an algorithm that can efficiently
identify and flag potentially fraudulent transactions for further investigation. The data provided consists of two tables:
"cc_info," containing general credit card and cardholder information, and "transactions," containing details of credit
card transactions that occurred between August 1st and October 30th.
Objective:
The primary goal of this project is to build an advanced fraud detection system using neural networks to identify
transactions that appear unusual and potentially fraudulent. By applying object-oriented programming (OOPs)
concepts, we aim to develop a scalable and modular solution that can handle large volumes of data and provide
valuable insights to Company ABC.
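The objective calls for an OOP-based, modular design, but the notebook below is written as flat cells. The class below is a hypothetical skeleton (the names FraudDetectionPipeline, fit and flag are illustrative, not from the brief) showing one way the pieces could be organized.
# Hypothetical skeleton of a modular, OOP-style fraud detector
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

class FraudDetectionPipeline:
    """Wraps scaling and a classifier so the same steps run in training and scoring."""

    def __init__(self, model=None):
        self.scaler = StandardScaler()
        self.model = model or RandomForestClassifier()

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "FraudDetectionPipeline":
        # Fit the scaler and the classifier on the training data
        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)
        return self

    def flag(self, X: pd.DataFrame) -> pd.Series:
        # Return 1 for transactions flagged as potentially fraudulent
        return pd.Series(self.model.predict(self.scaler.transform(X)), index=X.index)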
Data Dictionary
credit_card_limit: The credit limit associated with the credit card used in the transaction
transaction_dollar_amount: The dollar amount of the transaction
Long: The longitude coordinate of the transaction location
Lat: The latitude coordinate of the transaction location
Here's a comprehensive step-by-step guide and implementation for a credit card fraud detection
machine learning project. The project involves data preprocessing, model training using a Random
Forest Classifier, and anomaly detection using Isolation Forest.
Project Steps
1. Data Preprocessing
2. Handling Imbalanced Data
3. Feature Scaling
4. Model Training and Evaluation
Implementation Code
# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Fit a separate scaler per feature so each is standardized on its own distribution
amount_scaler = StandardScaler()
time_scaler = StandardScaler()
data['Amount'] = amount_scaler.fit_transform(data['Amount'].values.reshape(-1, 1))
data['Time'] = time_scaler.fit_transform(data['Time'].values.reshape(-1, 1))
Explanation of Code
1. Data Preprocessing:
○ Load the dataset using pandas.
○ Display basic information about the dataset to understand its structure.
2. Handling Imbalanced Data:
○ Use SMOTE to oversample the minority class (fraud cases) in the training set.
3. Feature Scaling:
○ Standardize the Amount and Time features using StandardScaler.
4. Model Training:
○ Split the dataset into training and testing sets using train_test_split.
○ Train a RandomForestClassifier on the oversampled training data.
○ Predict on the test set and evaluate using accuracy, ROC AUC score, and a
classification report.
○ Train an IsolationForest for anomaly detection on the training data.
○ Predict anomalies on the test set and evaluate (a consolidated code sketch of steps 1-4 follows this list).
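The four steps above are spread across the notebook cells below, so here is a minimal, self-contained sketch that ties them together. It assumes the Kaggle creditcard.csv layout (Time, V1-V28, Amount, Class); the file path, test_size, contamination and random_state values are illustrative assumptions, not the exact settings used later in this document.
# Consolidated sketch of steps 1-4 (illustrative settings)
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Data preprocessing: load the dataset and inspect its structure
data = pd.read_csv('creditcard.csv')  # assumed path
print(data.info())

# 3. Feature scaling: standardize Amount and Time (V1-V28 are already PCA components)
data[['Amount', 'Time']] = StandardScaler().fit_transform(data[['Amount', 'Time']])

X = data.drop('Class', axis=1)
y = data['Class']

# 4. Train/test split (test_size is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# 2. Handling imbalanced data: oversample the minority (fraud) class on the training set only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# 4. Supervised model: Random Forest trained on the resampled training data
rf = RandomForestClassifier(random_state=42)
rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

# 4. Anomaly detection: Isolation Forest trained on the training features
iso = IsolationForest(contamination=0.002, random_state=42)  # contamination is an assumption
iso.fit(X_train)
iso_pred = (iso.predict(X_test) == -1).astype(int)  # -1 means anomaly; map to the 0/1 fraud labels
print(classification_report(y_test, iso_pred))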
Additional Resources
This implementation provides a solid foundation for building and evaluating a credit card fraud
detection model using both classification and anomaly detection techniques. Feel free to customize
the code to explore other machine learning models or techniques.
SAMPLE CODE
# Importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
In [2]:
# Reading dataset
df=pd.read_csv('/kaggle/input/creditcard/creditcard.csv')
df.head(10)
Out[2]:
(output: the first 10 rows of the dataframe, showing columns Time, V1-V28, Amount and Class)
10 rows × 31 columns
In [3]:
# shape of dataset
df.shape
Out[3]:
(284807, 31)
In [4]:
# null values
df.isnull().sum()
Out[4]:
Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
Amount 0
Class 0
dtype: int64
No null values.
In [5]:
# class distribution
df['Class'].value_counts()
Out[5]:
Class
0 284315
1 492
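Only 492 of the 284,807 transactions are fraudulent (roughly 0.17%), which is why the imbalance-handling techniques below matter. A quick check, assuming df from the cells above:
# fraction of fraudulent transactions (Class == 1)
print(f"Fraud rate: {df['Class'].mean():.4%}")  # about 0.17% of all transactions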
In [6]:
# create X and Y
X=df.drop('Class', axis=1)
y=df['Class']
X,y
Out[6]:
( Time V1 V2 V3 V4 V5 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193
... ... ... ... ... ... ...
284802 172786.0 -11.881118 10.071785 -9.834783 -2.066656 -5.364473
284803 172787.0 -0.732789 -0.055080 2.035030 -0.738589 0.868229
284804 172788.0 1.919565 -0.301254 -3.249640 -0.557828 2.630515
284805 172788.0 -0.240440 0.530483 0.702510 0.689799 -0.377961
284806 172792.0 -0.533413 -0.189733 0.703337 -0.506271 -0.012546
Amount
0 149.62
1 2.69
2 378.66
3 123.50
4 69.99
... ...
284802 0.77
284803 24.79
284804 67.88
284805 10.00
284806 217.00
In [7]:
# import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV, train_test_split
1. Cross Validation like KFold
In [8]:
# KFold
log_class=LogisticRegression()
grid={'C':10.0**np.arange(-2,3), 'penalty': ['l1', 'l2']}
cv=KFold(n_splits=5, random_state=None, shuffle=False)
In [9]:
# train-test split (test_size=0.3 inferred from the 85,443-row test set below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [10]:
# grid search cv
clf=GridSearchCV(log_class, grid, cv=cv, n_jobs=-1, scoring='f1_macro')
clf.fit(X_train, y_train)
ConvergenceWarning: lbfgs failed to converge.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Out[10]:
GridSearchCV
estimator: LogisticRegression
LogisticRegression
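The ConvergenceWarning above is raised because lbfgs runs on unscaled features. As the warning itself suggests, one option is to scale inside a Pipeline (so each CV fold is scaled on its own training split) and raise max_iter. A hedged sketch with illustrative values; note that lbfgs does not support the l1 penalty, so only C is tuned here:
# scale + logistic regression in one Pipeline to avoid the convergence warning
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000, solver='lbfgs')),
])
grid = {'logreg__C': 10.0 ** np.arange(-2, 3)}
clf = GridSearchCV(pipe, grid, cv=KFold(n_splits=5), n_jobs=-1, scoring='f1_macro')
clf.fit(X_train, y_train)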
In [11]:
# Prediction and scores
y_pred=clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[85245 42]
[ 52 104]]
0.9988998513628969
precision recall f1-score support
2. RandomForest Classifier
In [12]:
In [13]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train, y_train)
Out[13]:
RandomForestClassifier
RandomForestClassifier()
In [14]:
# Prediction and scores
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[85288 5]
[ 38 112]]
0.9994967405170698
precision recall f1-score support
With the KFold/GridSearch logistic regression the confusion matrix showed 42 false positives and 52 false negatives; with the Random Forest these drop to 5 and 38. Tree-based models are therefore less affected by the imbalanced dataset, and hyperparameter tuning should improve the result further.
3. Class Weights
In [15]:
In [16]:
y_train.value_counts()
Out[16]:
Class
0 199020
1 344
In [17]:
# manually chosen class weights: weight the fraud class 100x
class_weight={0: 1, 1: 100}
class_weight
Out[17]:
{0: 1, 1: 100}
In [18]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(class_weight=class_weight)
classifier.fit(X_train, y_train)
Out[18]:
RandomForestClassifier
RandomForestClassifier(class_weight={0: 1, 1: 100})
In [19]:
Using class_weight, the false-positive count improves marginally over the unweighted Random Forest.
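The {0: 1, 1: 100} weights above are hand-picked. scikit-learn can also derive weights from the training distribution; a minimal sketch, assuming X_train and y_train from the cells above:
# derive class weights inversely proportional to class frequencies
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_train)
print(dict(zip([0, 1], weights)))  # the minority (fraud) class gets a much larger weight

# equivalent shortcut: let the forest compute balanced weights itself
classifier = RandomForestClassifier(class_weight='balanced')
classifier.fit(X_train, y_train)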
4. Under Sampling
When to use:
Under-sampling is useful when you have a large dataset, and the majority class significantly dominates the
minority class.
Advantages:
Helps reduce the size of the majority class, making the dataset more balanced. Can lead to faster training
times, especially when dealing with large datasets.
Disadvantages:
Loss of potentially valuable information from the majority class. May increase the risk of overfitting on the
reduced dataset.
In [20]:
# Installing library
!pip install imbalanced-learn
Requirement already satisfied: imbalanced-learn in
/opt/conda/lib/python3.10/site-packages (0.11.0)
Requirement already satisfied: numpy>=1.17.3 in
/opt/conda/lib/python3.10/site-packages (from imbalanced-learn) (1.23.5)
Requirement already satisfied: scipy>=1.5.0 in
/opt/conda/lib/python3.10/site-packages (from imbalanced-learn) (1.11.2)
Requirement already satisfied: scikit-learn>=1.0.2 in
/opt/conda/lib/python3.10/site-packages (from imbalanced-learn) (1.2.2)
Requirement already satisfied: joblib>=1.1.1 in
/opt/conda/lib/python3.10/site-packages (from imbalanced-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in
/opt/conda/lib/python3.10/site-packages (from imbalanced-learn) (3.1.0)
In [21]:
In [22]:
# import library
from collections import Counter
from imblearn.under_sampling import NearMiss
In [23]:
# performing fit
ns=NearMiss()
X_res,Y_res=ns.fit_resample(X_train, y_train)
print('No. of each class before fit ', Counter(y_train))
print('No. of each class after fit ', Counter(Y_res))
In [24]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_res, Y_res)
Out[24]:
RandomForestClassifier
RandomForestClassifier()
In [25]:
[[57841 27451]
[ 13 138]]
0.6785693386234097
precision recall f1-score support
5. Over Sampling
When to use:
Over-sampling is beneficial when you have a small dataset and the minority class is underrepresented.
Advantages:
Helps increase the size of the minority class, balancing the dataset. Mitigates the risk of losing important
information from the majority class.
Disadvantages:
Can lead to overfitting, as it duplicates the minority class examples. May result in increased training time
due to the larger dataset.
In [26]:
In [27]:
# import library
from imblearn.over_sampling import RandomOverSampler
In [28]:
# performing fit
ns=RandomOverSampler()
X_res,Y_res=ns.fit_resample(X_train, y_train)
print('No. of each class before fit ', Counter(y_train))
print('No. of each class after fit ', Counter(Y_res))
In [29]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_res, Y_res)
Out[29]:
RandomForestClassifier
RandomForestClassifier()
In [30]:
[[85289 5]
[ 30 119]]
0.9995903701883126
precision recall f1-score support
6. SMOTETomek
When to use:
SMOTETomek is suitable when you want to address class imbalance while simultaneously cleaning noisy or
borderline examples. It's especially useful when you suspect that there are noisy samples or overlapping
classes in your dataset. You might use SMOTETomek when you have a relatively low-dimensional feature
space.
Advantages:
It combines the strengths of both over-sampling (SMOTE) and under-sampling (Tomek links) techniques.
SMOTE generates synthetic examples for the minority class, making it larger. Tomek links are used to remove
noisy and borderline examples that could potentially confuse the classifier. It helps in improving the balance
of the dataset while reducing noise and potentially enhancing the classifier's performance.
Disadvantages:
Like SMOTE, the effectiveness of SMOTETomek can vary based on the dataset's characteristics. It may
require parameter tuning to balance the trade-off between over-sampling and under-sampling.
In [31]:
In [32]:
# import library
from imblearn.combine import SMOTETomek
In [33]:
# performing fit
ns=SMOTETomek()
X_res,Y_res=ns.fit_resample(X_train, y_train)
In [34]:
In [35]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_res, Y_res)
Out[35]:
RandomForestClassifier
RandomForestClassifier()
In [36]:
[[85261 28]
[ 20 134]]
0.9994382219725431
precision recall f1-score support
7. Easy Ensemble
When to use:
Easy Ensemble is a good choice when you have a highly imbalanced dataset and you are willing to invest
computational resources to create balanced subsets.
Advantages:
Creates multiple balanced subsets of the dataset. Reduces the risk of overfitting and provides robustness.
Disadvantages:
Requires multiple iterations and models, which can be computationally expensive. May not be suitable for
very large datasets.
In [37]:
In [38]:
# import library
from imblearn.ensemble import EasyEnsembleClassifier
In [39]:
eec = EasyEnsembleClassifier(random_state=42)
eec.fit(X_train, y_train)
Out[39]:
EasyEnsembleClassifier
EasyEnsembleClassifier(random_state=42)
In [40]:
[[82742 2550]
[ 16 135]]
0.9699682829488665
precision recall f1-score support
Reference link