
IDS 575: Statistical Models and Methods for Business Analytics

Project Report

CREDIT RISK MODEL - TAIWAN CREDIT CARD CUSTOMERS
Team Members:

Jeyenthi Venkitaraman (UIN: 665390415)
Shalini Singh (UIN: 663877768)

3rd December 2018

Abstract
This paper applies four data mining techniques to analyze the probability of default of credit card customers in Taiwan and compares their accuracy. A binary classification model is used to label customers as defaulters or non-defaulters, using the best parameters estimated for each technique. The aim was to identify the features that give the best predictive power. Among the four techniques used, Adaptive Boosting works best for this data in identifying defaulters.

Introduction
The credit card industry of the banking domain has always been a major concern for banks in terms of identifying legitimate customers. There is a strong need for risk prediction, especially in the financial industry, to help manage uncertainty. Banking operations are something we all come across in our daily lives. In recent years the use of credit cards has become very popular, as they are one of the most convenient payment options. However, this convenience comes with its own risk for the banks. As the number of customers using credit cards increases, more effort needs to go into managing the risk of delinquency. The overall objective of risk management is to utilize customers' past behavioral information (financial, demographic, and personal) and understand the patterns in it to make sound decisions that optimize profit.

The traditional approach to building a credit risk model, in which the probability of default is to be estimated, uses Logistic Regression, which not only gives a good accuracy rate but also produces easily interpretable results. With the recent advances in machine learning, however, it is a good time to explore other ways to build risk prediction models. For the purpose of this paper, four data mining techniques were explored: Naïve Bayes, Logistic Regression, Classification Trees, and AdaBoost. Credit risk here means the probability of a delay in the repayment of the credit granted (Paolo, 2001).

We expect to address one main question: are there methods other than Logistic Regression that perform well on this credit risk data in predicting defaulters? In the next section we review the four data mining techniques and their implications in related work on this subject. In Section 3 we discuss the problem setting, the models, and the parameter estimation methods. Section 4 walks through the experimental results and includes a model performance comparison. In Section 5 we analyze why certain models work better than others and present further analysis. Section 6 concludes with the relevant findings.

Related Work
To predict the probability of default, Yeh and Lien (2009) used six different data mining techniques: K-nearest neighbors (KNN), Logistic Regression, Naïve Bayes, Artificial Neural Networks, Classification Trees, and Discriminant Analysis. Their study focused on predicting the probabilities rather than just classifying customers as defaulters and non-defaulters. It applied the Sorting Smoothing Method (SSM) to estimate the real probability of default from each model. The area ratios of the lift charts were used to evaluate and compare the performance of the different models.

From the lift curves and the accuracy rates of the six techniques, they observed that on the training data, K-nearest neighbor classifiers and classification trees had the lowest error rates, with KNN having a higher area ratio than the other models. However, on the validation data, Artificial Neural Networks achieved the best performance, with the best area ratio and a relatively low error rate.

The above-mentioned paper used scatter plots, regression lines, and R-squared to assess the estimated real default probabilities. Of the methods compared, only Artificial Neural Networks showed high explanatory ability in terms of both R-squared and the regression line.

In this paper we study four data mining techniques, Logistic Regression, Classification Trees, AdaBoost, and Naïve Bayes, and assess their performance through cross-validation, ROC curves, and other metrics. The following section explains the models and methods, followed by a comparative analysis of the experimental results.

Models and Methods


Data Description:
The dataset contains information on default payments, demographic factors, credit data,
history of payment, and monthly bill statements of 30,000 credit card clients in Taiwan from April
2005 to September 2005. We used the following 23 variables as explanatory variables:
X1: Amount of the given credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6–X11: History of past payment. X6 = the repayment status in September, 2005; X7 = the
repayment status in August, 2005; . . .; X11 = the repayment status in April, 2005. The
measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2
= payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for
nine months and above.
X12–X17: Amount of bill statement. X12 = amount of bill statement in September, 2005; X13 =
amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
X18–X23: Amount of previous payment. X18 = amount paid in September, 2005; X19 = amount
paid in August, 2005; . . .; X23 = amount paid in April, 2005.

Exploratory Data Analysis:
There are 25 columns, all with numeric values. Our target attribute is 'default payment next month'; there are almost four times as many non-default cases as default cases.
The average age of the applicants is 35.5 years, with a standard deviation of 9.2. The average credit limit is 167,484. Its standard deviation is unusually large, with a maximum value of 1M, indicating high variance in the credit limit amounts.
Females constitute the higher proportion of credit card applicants (60%), and the education level of the applicants is mostly graduate school or university. The marital status is mostly either married or single. Also, the correlation of the repayment status between months decreases as the months grow farther apart, with the lowest correlations between September and April.
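
These summary statistics can be reproduced in a few lines of R. The sketch below is illustrative only: it assumes a hypothetical local CSV copy of the UCI file named default_credit.csv, with the target column read in as default.payment.next.month and the usual UCI column names (LIMIT_BAL, AGE, PAY_0 through PAY_6, and so on); adjust the names to match the actual file.

    # Load a local copy of the UCI "default of credit card clients" data (hypothetical file name)
    credit <- read.csv("default_credit.csv")
    dim(credit)                                       # expect 30,000 rows and 25 columns

    # Class balance: roughly four non-default cases for every default case
    prop.table(table(credit$default.payment.next.month))

    # Age and credit limit summaries quoted above
    mean(credit$AGE); sd(credit$AGE)                  # about 35.5 and 9.2
    mean(credit$LIMIT_BAL); max(credit$LIMIT_BAL)     # about 167,484 and 1,000,000

    # Correlation of repayment status across months (PAY_0 = September ... PAY_6 = April)
    cor(credit[, c("PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")])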


Predictive Modelling:
The dataset is randomly split into 70% training and 30% test sets. Since there are only 23 explanatory variables in the dataset, we used all of them to create our baseline models. In addition, we use derived variables, which are percentage transformations of the payment amounts. These variables have better predictive power and are more stable, so using them should give better predictions.
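
A minimal sketch of this setup in R follows. The seed and the exact form of the derived variable are assumptions: the report does not specify the percentage transformation, so the ratio below (amount paid divided by the corresponding bill amount) is only one plausible illustration.

    set.seed(575)                                                # arbitrary seed for reproducibility
    credit$default <- factor(credit$default.payment.next.month)  # target as a factor for the classifiers

    # Illustrative derived variable: September payment as a share of the September bill
    # (pmax guards against zero or negative bill amounts)
    credit$pay_ratio_sep <- credit$PAY_AMT1 / pmax(credit$BILL_AMT1, 1)

    # 70% training / 30% test split
    train_idx <- sample(seq_len(nrow(credit)), size = floor(0.7 * nrow(credit)))
    train <- credit[train_idx, ]
    test  <- credit[-train_idx, ]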

Decision Tree
The core idea is to recursively split the data on the attribute that gives the best split until some stopping conditions are met.

Baseline model: all explanatory variables.
Final model: only the important variables and the derived payment variables, with maxdepth = 6.
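
The maxdepth setting above matches the control parameter of the rpart package in R, so the sketch below assumes rpart; the variables on the right-hand side are placeholders, since the report does not list which variables were judged important.

    library(rpart)

    # Final tree: selected variables plus the derived payment ratio, limited to depth 6
    tree_fit <- rpart(default ~ LIMIT_BAL + PAY_0 + PAY_2 + pay_ratio_sep,   # placeholder variable set
                      data = train, method = "class",
                      control = rpart.control(maxdepth = 6))

    tree_pred <- predict(tree_fit, newdata = test, type = "class")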

Naïve Bayes Classifier


The Naïve Bayes classifier is based on Bayes' theorem and assumes that the effect of an attribute value on a given class is independent of the values of the other attributes.

Baseline model: all explanatory variables.
Final model: only the important variables and the derived payment variables, with no Laplace smoothing.
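
A sketch with the naiveBayes function from the e1071 package, whose laplace argument corresponds to the Laplace smoothing mentioned above (laplace = 0 disables it); the predictor set is again a placeholder.

    library(e1071)

    nb_fit <- naiveBayes(default ~ LIMIT_BAL + PAY_0 + PAY_2 + pay_ratio_sep,  # placeholder variables
                         data = train, laplace = 0)                            # no Laplace smoothing

    nb_pred <- predict(nb_fit, newdata = test)                 # predicted classes
    nb_prob <- predict(nb_fit, newdata = test, type = "raw")   # class probabilities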

Logistic Regression
Logistic regression is used to predict the probability of occurrence of an event by fitting the data to a logistic curve. A logistic regression model specifies that an appropriate function of the fitted probability of the event is a linear function of the observed values of the available explanatory variables.

Baseline model: all explanatory variables.
Final model: only the important variables and the derived payment variables.
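
A sketch of the logistic regression fit with base R's glm; again, the predictor set is a placeholder and a 0.5 cutoff is assumed for turning probabilities into class labels.

    logit_fit <- glm(default ~ LIMIT_BAL + PAY_0 + PAY_2 + pay_ratio_sep,  # placeholder variables
                     data = train, family = binomial)

    logit_prob <- predict(logit_fit, newdata = test, type = "response")    # estimated P(default)
    logit_pred <- ifelse(logit_prob > 0.5, "1", "0")                       # assumed 0.5 cutoff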

Adaptive Boosting
AdaBoost is a method in which the outputs of other learning algorithms ('weak learners') are combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of the instances misclassified by previous classifiers.

Baseline model: all explanatory variables.
Final model: only the important variables and the derived payment variables, with mfinal = 100.
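
The mfinal parameter above corresponds to the boosting function in the adabag package (the number of boosting iterations over classification-tree weak learners), so the sketch below assumes adabag; the predictor set is a placeholder.

    library(adabag)

    ada_fit <- boosting(default ~ LIMIT_BAL + PAY_0 + PAY_2 + pay_ratio_sep,  # placeholder variables
                        data = train, mfinal = 100)            # 100 boosting iterations

    ada_out  <- predict(ada_fit, newdata = test)
    ada_pred <- ada_out$class                                  # predicted labels
    ada_prob <- ada_out$prob                                   # class probabilities for ROC / PR curves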

Experimental Results

Baseline models

Methods               Train Accuracy   CV Accuracy   Precision   Recall   F1-Score
Decision Tree         0.9945           0.7232        0.8246      0.8154   0.82
Naïve Bayes           0.7469           0.7376        0.8688      0.778    0.8209
Logistic Regression   0.8117           0.808         0.8142      0.974    0.8869
Adaptive Boosting     0.8226           0.8146        0.8339      0.9492   0.8878

Final models

Methods               Train Accuracy   CV Accuracy   Precision   Recall   F1-Score   AUC
Decision Tree         0.8236           0.815         0.832       0.9532   0.8885     0.6997
Naïve Bayes           0.8128           0.8091        0.8403      0.9299   0.8828     0.7173
Logistic Regression   0.8107           0.8055        0.812       0.974    0.8856     0.7002
Adaptive Boosting     0.8232           0.8166        0.8353      0.9501   0.889      0.7496
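
For reference, these test-set metrics can be computed from the held-out predictions. Below is a minimal sketch using the AdaBoost output from above and the pROC package for the AUC; the reported precision and recall values are consistent with the non-default class being treated as the positive class, which is assumed here.

    library(pROC)

    # Confusion matrix on the test set (rows = predicted, columns = actual)
    cm <- table(predicted = ada_pred, actual = test$default)

    accuracy  <- sum(diag(cm)) / sum(cm)
    precision <- cm["0", "0"] / sum(cm["0", ])   # non-default ("0") taken as the positive class
    recall    <- cm["0", "0"] / sum(cm[, "0"])
    f1        <- 2 * precision * recall / (precision + recall)

    # AUC of the ROC curve from the predicted probability of default
    # (column 2 of adabag's probability matrix is assumed to correspond to class "1")
    auc(roc(test$default, ada_prob[, 2]))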
ROC curves and PR curves were plotted for the four final models.
Below are the findings from the various modeling techniques:
1. Almost all the models perform similarly in terms of accuracy rate.
2. The decision tree model has a low AUC of the ROC curve compared to the other models.
3. Naïve Bayes and Logistic Regression have lower accuracy and AUC of the ROC curve compared to Adaptive Boosting.
4. The precision of the AdaBoost model at low true positive rates is very high, which indicates correct predictions by the model even at low TPR levels.
5. The AUC of the ROC curve for the AdaBoost model is the highest among all the models, which can also be seen from the ROC plot.

Thus, from the above observations, we conclude that Adaptive Boosting works best for our data.

Discussion and Further Analysis



Adaptive Boosting is a linear combination of weak learners, which makes the model easy to tune while improving performance. AdaBoost is a powerful classification algorithm and requires little tweaking of parameters. Thus, it is a good method to use when there are not many features in the model, and we can see that it performs the best on our data.
Decision trees produce rules that are relatively easy to use, understand and implement. However,
they are sensitive to training data. Even a small change in the training data can cause large changes
in the tree. Thus, care needs to be taken while changing the features or altering the dataset.

The Naive Bayes classifier is computationally fast and simple to implement. However, it relies on the independence assumption, which may not hold and can hurt accuracy. Also, without the Laplace estimator, a single zero conditional probability can zero out an entire class posterior.

Logistic Regression is a traditional approach to building probability-of-default models. It is easily interpretable and computationally cheap. However, it can suffer from high bias at times, and it works better with large sample sizes.
There are certain examples in our dataset that were predicted correctly by one model but incorrectly by another. No model is perfect, and one reason such disagreements happen is that each model gives importance to different characteristics of the explanatory variables when making predictions. These disagreements need to be analyzed further to understand the unique behavior of the different models.

Conclusion
This paper used four data mining techniques to analyze the probability of default of credit card customers in Taiwan and compared their accuracy. As a traditional approach, logistic regression has long been used for credit risk classification, as it is not only easy to interpret but also yields the probabilities of default directly. The aim of this paper was therefore to find other machine learning methods that can perform better. The four models showed little difference in accuracy rates. However, the AUC metric was the deciding factor in this study, and AdaBoost turned out to perform better than the others. Therefore, industry should start to explore methods like AdaBoost that go beyond the logistic regression model in classifying defaulters. Further study is needed to implement various machine learning algorithms not only to classify defaulters but also to go a step further and predict the probability of default, which will be more appropriate for banks when making decisions and formulating policies.

References
1. Berry, M., & Linoff, G. (2000). Mastering data mining: The art and science of customer relationship management. New York: John Wiley & Sons, Inc.

2. Paolo, G. (2001). Bayesian data mining, with application to benchmarking and credit scoring. Applied Stochastic Models in Business and Industry, 17, 69–81.

3. UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

4. Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of
probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
