Graduation thesis

UNIVERSITY OF ECONOMICS AND LAW

FACULTY OF FINANCE AND BANKING

______________

GRADUATION THESIS PROJECT

APPLICATION OF MACHINE LEARNING IN ANALYZING
LOAN REPAYMENT CAPABILITY OF INDIVIDUAL
CUSTOMERS AT BAC A COMMERCIAL JOINT STOCK
BANK - BINH THUAN BRANCH

Instructor: Ph.D LE DUC QUANG TU

Student: NGUYEN KHANH HUYEN

Student code: K204141919

Class: K20414C

HO CHI MINH CITY, 2024



STATEMENT OF AUTHORSHIP
I hereby declare that the research topic "Application of Machine learning in
analyzing loan repayment capability of individual customers at Bac A commercial joint
stock bank - Binh Thuan branch" is my independent research work. The data set used in
this report is a public data set that has been approved by the leadership of Bac A Bank -
Binh Thuan branch. The data set is only for learning and research purposes. The content
and results of the report are honest and do not violate research ethics. All sources
referenced and used by the author are clearly stated in the references section.
In addition, the report also uses a number of comments, assessments and data from
other authors and other organizations, which are all quoted and annotated.
I take full responsibility for my commitment.

Ho Chi Minh City, March 25th, 2024


Author
ACKNOWLEDGEMENT
During the process of completing this report, I received valuable guidance and
help from my instructors and the bank's leadership team. With deep respect and
gratitude, I would like to express my sincere thanks to:
Ph.D. Le Duc Quang Tu, my beloved teacher, who wholeheartedly helped and
guided me in the process of studying and completing this report.
The Board of Directors of Bac A Commercial Joint Stock
Bank - Binh Thuan Branch, for creating favorable conditions for me in the process of
collecting research data and for providing the data, documents and other information
necessary to complete the report.
I sincerely appreciate your support.
TABLE OF CONTENTS
STATEMENT OF AUTHORSHIP
ACKNOWLEDGEMENT
LIST OF ACRONYMS
ABSTRACT
1. INTRODUCTION
1.1 Reason for choosing the topic.
1.2 Research objectives.
1.3 Research subjects and Scope.
1.4 Research Significance.
1.5 Outline.
2. LITERATURE REVIEWS
2.1 Overview of Personal Credit Theory, Repayment ability, and Factors Influencing Repayment Ability.
2.2 Literature Review.
3. RESEARCH METHODOLOGY
3.1 Data.
3.2 Data processing methods
3.3 Models.
4. RESULTS AND DISCUSSION
4.1 Descriptive statistical results.
4.2 Model results.
4.2.1 Logistic Regression Model.
4.2.2 Support Vector Machine Model.
4.2.3 Decision Tree model results
4.3 Discussion.
5. CONCLUSION AND RECOMMENDATIONS
5.1 Conclusion.
5.2 Recommendations.
5.3 Limitations
REFERENCES
APPENDIX 1
LIST OF FIGURES
Figure 3. 1: Data Visualization Process.
Figure 3. 2: Correlation Chart
Figure 3. 3: Describe variable Experience
Figure 3. 4: Mortgage distribution
Figure 3. 5: Result of missing and duplicate values
Figure 3. 6: Frequency percentage of Target classes among Training and Test sets
Figure 4. 1: Categorical feature vs Target stacked barplots
Figure 4. 2: Numerical features vs Target distribution
Figure 4. 3: Confusion matrix and ROC Curve for Test Data (Logistic Regression)
Figure 4. 4: Features Importance of Logistic regression
Figure 4. 5: Confusion matrix and ROC Curve for Test Data (SVM)
Figure 4. 6: Features importance of SVM model
Figure 4. 7: Confusion matrix and ROC Curve for Test Data (Decision Tree)
Figure 4. 8: Features importance of Decision tree model
Figure 4. 9: F1-Score for class "0"
Figure 4. 10: Decision Tree

LIST OF TABLES
Table 3. 1: Data Description
Table 4. 1: Logistic regression model result
Table 4. 2: SVM model result
Table 4. 3: Decision Tree model results
LIST OF ACRONYMS

Acronyms    Meaning
KHCN        Individual customers (Vietnamese: khách hàng cá nhân)
SVM         Support Vector Machine
ABSTRACT
This report is conducted with the purpose of "Analyzing the loan repayment
ability of individual customers at Bac A Bank - Binh Thuan Branch". To assess the
repayment ability of customers, we need to rely on criteria that reflect the customers'
situation accurately to minimize risks for the bank. The study will clarify which factors
affect the loan repayment ability of individual customers and how these factors
influence loan repayment when considering credit approval for customers of Bac A
Bank - Binh Thuan Branch.
The research utilizes secondary data from a random sample of 1779
individual customers with complete records in the Bac A Bank - Binh Thuan Branch
database. Machine learning models: Logistic Regression, Support Vector Machine
classification model, and Decision Tree model are used to evaluate the impact of
independent factors on customers' loan repayment ability. Data processing and model
validation are conducted using the Python programming language.
Out of the initial 11 expected factors that could influence customers' loan
repayment ability, the results from the machine learning models indicate that Income,
Education level, average credit card spending, and deposit account are significant
factors affecting customers' debt repayment ability. The Logistic
Regression model results likewise show that Income and Education level have a
significant impact on customers' loan repayment ability.
Based on the analysis results, the author proposes some solutions and
recommendations related to the operations of Bac A Bank - Binh Thuan Branch to
enhance customers' loan repayment ability in the future.
Keywords: machine learning in finance, loan repayment, the repayment ability of
customers.
1. INTRODUCTION
1.1 Reason for choosing the topic.
Currently, the activities of commercial banks continue to expand and improve
service quality, among which credit activities stand out as the most traditional and
crucial. This activity brings in the largest income for commercial banks but also entails
the highest risks. Therefore, bank managers always pay special attention to this activity.
Risk in credit activities is something banks cannot eliminate but can only mitigate and
minimize damages through various measures, significantly contributing to building safe
and effective credit operations for growth. Success in bank management, in general, and
risk management in credit activities, in particular, is when the bank incurs losses equal
to or less than expected.
The next business strategy of Bac A Bank is to increase retail credit. In the recent
past, due to the establishment of the branch in Binh Thuan coinciding with the outbreak
of Covid-19, credit risk management activities for individual customers have not been
adequately emphasized and thus not truly effective. Therefore, a problem arises: how
to manage credit risks for individual customers rigorously and effectively to enhance
credit quality, which is essential for the branch's survival.
In the current trend, applying technology to bank management is crucial,
because data-driven tools can make risk assessment more accurate. The significant
development of Machine Learning has changed the traditional methods of bank management.
Understanding this, I decided to use Machine Learning technology to address the issue
of managing credit risks for individual customers.
During my internship at the bank, I directly conducted and managed credit
operations, especially personal loans. I am highly interested in credit risks, as reflected
in the ability of individual customers to repay debts. Hence, I propose measures and
solutions to minimize credit risks. Therefore, I chose the research topic "Analyzing the
loan repayment ability of individual customers at Bac A Bank - Binh Thuan Branch."
1.2 Research objectives.
The study has three objectives: identifying and measuring the factors influencing
the loan repayment ability of individual customers (KHCN) at Bac A Bank - Binh Thuan
Branch; comparing the results between machine learning and traditional regression
models; and proposing solutions based on the analysis results to enhance credit
operations.
1.3 Research subjects and Scope.
Research subject: Factors influencing the loan repayment ability of individual
customers at Bac A Bank - Binh Thuan Branch.
Scope of the study: The research focuses on investigating the factors influencing
the loan repayment ability of individual customers at Bac A Bank - Binh Thuan Branch.
1.4 Research Significance.
The research results serve as reference materials for the leadership of Bac A
Bank - Binh Thuan Branch, providing insights into factors affecting the loan repayment
ability of individual customers at the branch. This contributes to establishing criteria
and conditions in the lending process and credit approval, limiting risks, reducing bad
debts, and ensuring safe, effective, and sustainable credit growth at the Binh Thuan
Branch in the future.
1.5 Outline.
Chapter 1: Introduction.
Chapter 2: Literature Reviews
Chapter 3: Methodology
Chapter 4: Results and Discussion
Chapter 5: Conclusion and Recommendations
2. LITERATURE REVIEWS
2.1 Overview of Personal Credit Theory, Repayment ability, and Factors
Influencing Repayment Ability.
2.1.1 Personal Credit.
According to "Monetary Banking" (2008) by author Nguyen Minh Kieu: "Credit,
derived from the Latin word creditium and the English word credit, means trust and
confidence. In the Vietnamese folk language, credit means borrowing and lending.
Financially, credit is the transfer of capital usage rights from the owner to the user for
a certain period with a certain cost." And "Credit activities can be divided into various
forms based on different criteria, among which, if based on the borrowing party, it can
be divided into personal credit and corporate credit."
According to the Law on Credit Institutions in 2010, Article 4, Clause 14
stipulates: "Credit granting is an agreement to allow organizations or individuals to use
a sum of money or commit to use a sum of money, repayable through lending,
discounting, financial leasing, payment guarantee, bank guarantee, and other credit
granting transactions." "Lending is a form of credit granting, whereby the lender
provides or commits to provide the customer with a sum of money for a specific purpose
for a specified period under agreed-upon terms with the principle of repayment of both
principal and interest."
The concept of personal credit is mentioned in the "Dictionary of Finance and
Banking" by Law and Smullen (2005), where personal credit "is the amount of money
or assets provided by financial institutions to an individual after assessing the risk of
that individual, and the lending institution will receive the principal and interest after a
certain period as agreed upon."
Personal credit in this study is a form of credit where the Bank plays the role of
providing a sum of money (the lender) to individual customers (the borrower), after
assessing the risk of the customer, and the bank will receive the principal and interest
after an agreed-upon period.
Personal credit is a type of credit, so personal credit also has the characteristics
of credit:
- Credit is based on trust. Banks only extend credit to customers, individuals, or
businesses when they have confidence that the customer will use the borrowed
capital for the agreed-upon purpose effectively and have the ability to repay the
debt (principal, interest) on time.
- Credit is the temporary transfer of a value amount. Banks act as financial
intermediaries, both as borrowers and lenders. The bank's loan capital is sourced
from funds mobilized from customer sources in the economy. Therefore, all bank
loans to customers must have a time limit to ensure that the bank can repay the
mobilized capital.
- Credit is the temporary transfer of a value amount based on the principle of
repayment of both principal and interest. The borrower must pay additional
interest on top of the principal, which is the cost of using borrowed capital. This
is a source of offsetting operating costs as well as generating profits for the bank.
Personal credit also has some specific characteristics: Small loan amounts but a large
number of loans. The interest rate on personal loans is usually higher than the interest
rate for business loans. The demand for personal loans is sensitive to economic
conditions, increasing as the economy expands and decreasing as the economy
contracts. The source of repayment for individual customers mainly depends on income
from salaries, property rentals, and business income. This repayment source may
experience significant fluctuations, depending on their work process, skills, and
experience, making it difficult to control these income sources. Personal loans often
have high risks due to the low quality of financial information provided by customers.
In addition, credit quality is used to reflect the level of risk in the overall loan
portfolio of a credit institution (or also called lending quality). According to Decision
493/2005/QD-NHNN, "The bad debt ratio on the total outstanding balance is the ratio
used to evaluate the credit quality of credit institutions." Thus, the lower the bad debt
ratio, the higher the credit quality, and vice versa. A fully repaid loan, on time, is a loan
with good credit quality, meaning that the customer has a good repayment ability, able
to repay the debt; whereas a loan that the customer fails to repay in full or cannot repay
represents poor credit quality, or in other words, a loan with potential credit risk.
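
The ratio quoted above can be illustrated with a short calculation. The sketch below is only an illustration: the portfolio balances are invented, the function name is hypothetical, and treating debt groups 3 to 5 as bad debt is an assumption based on the five-group debt classification of Decision 493/2005/QD-NHNN.

```python
def bad_debt_ratio(balances_by_group):
    """Share of the total outstanding balance that is classified as bad debt.

    balances_by_group maps a debt group (1..5) to its outstanding balance.
    Assumption: groups 3, 4 and 5 count as bad debt.
    """
    total = sum(balances_by_group.values())
    if total == 0:
        raise ValueError("total outstanding balance is zero")
    bad = sum(v for group, v in balances_by_group.items() if group >= 3)
    return bad / total


# Hypothetical loan portfolio, balances in million VND
portfolio = {1: 9000.0, 2: 700.0, 3: 150.0, 4: 100.0, 5: 50.0}
ratio = bad_debt_ratio(portfolio)
print(f"Bad debt ratio: {ratio:.2%}")
```

A lower value of this ratio corresponds to higher credit quality in the sense described above.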
According to Joel Bessis' book "Risk Management in Banking" (2015), credit
risk is understood as "the losses due to customers' failure to repay debts or the
deterioration of the credit quality of loans."
2.1.2 Repayment ability.
According to Alex White (2008) in a study on the repayment ability of individual
customers, "the repayment ability of customers is the ability of customers to generate
enough income during the loan period to ensure regular repayments."
According to the regulations of Vietnamese law, specifically Decision
493/2005/QD-NHNN, which regulates debt classification, provisioning, and use of
reserves to handle credit risks in banking activities of credit institutions, "standard debt
is debt evaluated by credit institutions as having the ability to fully recover the principal
and interest on time." Therefore, a loan is considered effective when it is repaid with
both interest and principal on time.
2.1.3 Factors Influencing the Repayment Ability of Individual Customers
2.1.3.1 Factors from the Bank.
Firstly, the bank's business strategy. The business strategy will make decisions
about products that meet customer needs, thereby gaining a competitive advantage in
the market. Based on the business strategy, the bank will plan activities to achieve its
goals.
Secondly, the bank's policies and regulations. Policies on customer care before
and after lending are always of concern to the bank. Regulations on interest rates, credit
fees, and other regulations set by the State Bank.
Thirdly, the quality of credit staff. Credit staff are the ones who directly interact
with customers, receive information files, guide loan procedures, and appraise files.
Credit staff must have professional qualifications, business skills, analytical ability, and
evaluation ability, and be responsible in their work based on selecting customers with
sufficient legal capacity, sufficient financial capacity, and ethical qualifications. Then,
credit activities will be safe, fast, and efficient.
Fourthly, information work. With the information received, the bank will
conduct credit analysis to assess the current and potential ability of customers to use
capital and repay loans. This is also a very important factor in deciding whether to grant
a loan or not. It serves as the basis for making credit granting decisions.
Fifthly, the bank's technology. Modern technology helps banks provide modern,
diverse services to meet the increasing and diverse needs of customers. Meanwhile, the
nature of individual customer lending operations involves transactions with a large and
diverse customer base, requiring the bank to process a large number of loan contracts.
Therefore, the modern technology system of the bank both saves time and effort for
credit staff and minimizes errors during transactions with customers.
2.1.3.2 Factors from the Borrower.
Firstly, the financial capacity of the borrower. Customers will be granted credit
when they meet all requirements regarding financial capacity to fulfill repayment
obligations. The bank needs to pay attention to the transparency of the repayment
source.
Secondly, customer needs, habits, and ethics. In addition to the above factors,
external objective factors also influence lending to individual customers, such as
customer ethics. If customers have a good repayment conscience and low credit risk, it
will stimulate banks to expand lending activities, and regulations will not be overly
stringent.
2.1.3.3 Other Factors.
Firstly, the market characteristics where the bank operates. If it is an urban area
or an area with a dense population, a fairly high income level, and a high level of
education, the demand for loans from individual customers will be higher compared to
rural or remote areas where most households depend on farming all year round.
Secondly, the economic and political environment. The economic and political
environment affects the lending activities of individual customers. If the economy is
developing well, the average income per capita is high, and the political environment is
stable, lending activities for individual customers will also proceed smoothly, develop
steadily, and minimize complications. If there is fierce competition among banks to
attract customers, lending activities of the bank for individual customers will face many
difficulties...
2.2 Literature Review.


In their study, Kleimeier and Thanh (2006) proposed a credit scoring model for
loans at retail banks in Vietnam. The authors used 16 factors to assess customers'
repayment ability, listed in order of influence: Transaction history with the bank,
Gender, Number of previous loans, Credit debt, Loan duration, Deposits account,
Region, Education level, Housing status, Current account, Additional collateral,
Duration of residence at current address, Marital status, Landline phone, Loan purpose,
and Collateral assets. These factors were incorporated into a logistic regression model
to differentiate customers based on their repayment ability. The study identified 8
factors significantly impacting customers' repayment ability: Transaction history with
the bank, Gender, Number of previous loans, Loan duration, Deposit account, Duration
of residence at current address, Landline phone, Loan purpose.
Wongnaa (2013) collected primary and secondary data. Primary data was
gathered from 100 households growing cassava in four areas of Sene district, including
Kwame Danso, Lemu, Kyeamerom, and Bassa, through convenience sampling
questionnaires. Additionally, secondary data was collected from the internet, scientific
articles, and library documents. The study analyzed 12 factors: Education level, Years
of farming experience, Age, Frequency of monitoring and supervision, Income, Gender,
Marital status, Household size, Occupation, Loan amount, Loan interest rate, Farm size.
The results showed that factors such as "Education level," "Years of farming
experience," "Age," "Frequency of monitoring and supervision," and "Income"
positively influenced households' repayment ability. Moreover, the "Gender" factor
affected repayment ability, with female customers demonstrating higher willingness to
repay loans compared to male customers, and the "Marital status" factor indicated that
unmarried customers were more willing to repay loans compared to married ones.
Joseph's (2013) research identified factors influencing credit risk for loans at
SACCOS Bank, Tanzania, among customers in Morogoro, Dodoma, and Kilimanjaro.
The study focused on factors such as age, education level, marital status, household
size, debt, loan term, and borrower experience. The findings indicated that only
"Education" and "Loan debt" significantly impacted credit risk, while other factors did
not. The study's limitations included a small sample size of 431 borrowers and the
omission of important factors such as "Loan term" and "Gender" from the analysis.
Vuong Quan Hoang (2006) applied a Logistic regression model to data on
1,727 customers. Independent variables such as monthly income,
income disparity, expenditure, and customer asset value had a positive correlation with
customers' repayment ability. Conversely, age, education level, occupation type, marital
status, residence, residence duration, number of dependents, and transportation
negatively impacted repayment ability.
The study "Factors affecting household's timely repayment ability in the Mekong
Delta, Vietnam" by Nguyễn Thị Hồng Nga and Võ Thị Vân Na (2018) aimed to
determine factors influencing households' ability to repay loans in the Mekong Delta,
Vietnam. The survey involved 326 farmers, with 135 currently indebted to credit
institutions. Using binary logistic regression and SPSS ver. 22, the study measured the
impact of factors including age, education level, household size, number of dependents,
farm size, farm income, natural disasters, and interest rates on households' repayment
ability. The results revealed that education level, household size, farm size, and farm
income significantly affected households' repayment ability, with farm size, farm
income, natural disasters, and interest rates negatively impacting repayment ability.
Nguyen's (2012) study evaluated the repayment ability of individual customers
at Viet Thai Bank. Employing both qualitative and quantitative methods, the study
predominantly used Excel and SPSS software to analyze data using a Logistic
regression model. The results identified 15 factors affecting customers' repayment
ability, including age, education level, occupation, work experience, current job tenure,
housing status, personal income, household income, repayment history with VSB, late
payment history, total current debt, other bank services, and deposit account balance.
3. RESEARCH METHODOLOGY
3.1 Data.
In this study, the author utilizes the credit dataset of individual customers at
Bac A Bank - Binh Thuan Branch to assess the influence of various factors on the
loan repayment ability of individual customers. The study employs 1779 repayment
records of individual customers. The dependent variable is the binary variable Y
(Y = 0: timely repayment, Y = 1: late repayment), and there are 11 explanatory
independent variables covering customer information and repayment data.
Table 3. 1: Data Description

Name | Symbol | Description | Data type | References
-----|--------|-------------|-----------|-----------
Personal Loan | Personal Loan | Are customers paying their debts on time? (0: on time, 1: not on time). Target variable | Boolean | Thanh (2006)
Age | Age | Customer's age in completed years | Numerical | Nguyen (2012); Wongnaa (2013)
Experience | Experience | Number of years of professional experience | Numerical | Wongnaa (2013)
Income | Income | Annual income of the customer (million VND) | Numerical | Hoang (2006)
Household size | Family | Family size of the customer | Categorical | Hoang (2006)
Average Credit Card spending | Avg of CC | Average spending on credit cards per month (million VND) | Numerical | Nguyen (2012)
Education level | Education | Education level (1: other; 2: high school; 3: university) | Categorical | Joseph (2013)
Mortgage Value | Mortgage | Value of house mortgage, if any (million VND) | Numerical | Nguyen (2018); Thanh (2006)
Securities Account | SA | The customer has a securities account with the bank (0: yes, 1: no) | Boolean | Thanh (2006)
Deposit account | DA | The customer has a certificate of deposit account with the bank (0: yes, 1: no) | Boolean | Nguyen (2012); Thanh (2006)
Banking services | Online | The customer uses internet banking services (0: yes, 1: no) | Boolean | Nguyen (2012)
Credit Card | CC | The customer has a credit card issued by the bank (0: yes, 1: no) | Boolean | Nguyen (2012)

3.2 Data processing methods


The research employs the Python programming language running on Jupyter
Notebook within the Anaconda Navigator software to conduct the data visualization
process. To support model validation, data cleaning, outlier treatment, data balancing,
and analysis are crucial. Through the data analysis process, we will identify the
characteristics of the data, thereby addressing any issues within the dataset. Firstly, we
perform Spearman's rank correlation test to check for multicollinearity. Next, we
identify noisy and outlier values and proceed to remove them. After removing outliers
and noise, we check for any missing values. We also examine for duplicate values.

[Flowchart: Raw data -> Correlation -> Noise Treatment -> Outliers Treatment -> Missing Value Treatment -> Duplicate Values Treatment -> Clean data]

Figure 3. 1: Data Visualization Process.



Figure 3. 2: Correlation Chart

The Correlation Chart (Figure 3.2) reveals that the target variable Personal Loan
is strongly correlated with the variables Income, Avg of CC, and DA. The
variable Experience shows high multicollinearity with the variable Age. The Avg of CC
variable shows an acceptable level of multicollinearity with the Income variable
(ρ = 0.59).
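
To make the correlation check concrete, the following is a minimal sketch of Spearman's rank correlation written with the Python standard library only. It is not the code used in the study: the function names are hypothetical and the sample values are invented. The idea is to rank each variable (tied values share their average rank) and then compute the Pearson correlation of the ranks.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented sample values, for illustration only
age = [25, 32, 41, 47, 53, 60]
experience = [1, 7, 15, 22, 28, 35]
income = [40, 150, 60, 90, 30, 120]

rho_age_exp = spearman(age, experience)
rho_age_inc = spearman(age, income)
print(round(rho_age_exp, 3), round(rho_age_inc, 3))
```

In this toy sample, Age and Experience increase together perfectly, which mirrors the kind of redundancy between the two variables reported above, while Age and Income are almost unrelated.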

Figure 3. 3: Describe variable Experience

The variable Experience contains 19 negative values (Figure 3.3). According to
the description, this variable represents the years of work experience, so it cannot have
negative values. These negative values are considered noise and need to be removed.
After noise treatment, only the Mortgage variable has a Kurtosis below 3.
Therefore, we proceed to use the Z-score technique to detect outlier values. The result
in the figure shows that there are 42 outlier values (Figure 3.4). These values will be
removed from the dataset.
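
The two cleaning steps described above can be sketched as follows. This is an illustrative example rather than the study's own code: the records are invented, the function names are hypothetical, and the cut-off of 3 standard deviations is an assumed, commonly used threshold that the text does not state.

```python
from statistics import mean, pstdev


def drop_negative(rows, column):
    """Remove rows whose value in `column` is negative (treated as noise)."""
    return [r for r in rows if r[column] >= 0]


def drop_zscore_outliers(rows, column, threshold=3.0):
    """Remove rows whose absolute z-score in `column` exceeds `threshold`."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all values identical, so nothing can be an outlier
        return list(rows)
    return [r for r in rows if abs((r[column] - mu) / sigma) <= threshold]


# Hypothetical records: one negative Experience and one extreme Mortgage
rows = [{"Experience": i % 10, "Mortgage": 100 + i} for i in range(20)]
rows[3]["Experience"] = -2
rows[7]["Mortgage"] = 5000

no_noise = drop_negative(rows, "Experience")
clean = drop_zscore_outliers(no_noise, "Mortgage")
print(len(rows), len(no_noise), len(clean))
```

Note that the z-score is computed after the noisy rows are dropped, so the mean and standard deviation are not distorted by records that are removed anyway.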

Figure 3. 4: Mortgage distribution



Next, we proceed to check for missing values and duplicated values. The result
in the figure shows that there are no missing values or duplicate values (Figure 3.5).
The dataset is complete with all observations and clean.
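
A check of this kind can be sketched in a few lines. The example below is an assumption-laden illustration: records are represented as dictionaries, a missing value is represented by None, the function names are hypothetical, and the sample rows are invented.

```python
def count_missing(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent fingerprint
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Invented sample: one missing Income and one repeated record
sample = [
    {"Age": 30, "Income": 80},
    {"Age": 45, "Income": None},
    {"Age": 30, "Income": 80},
]
missing = count_missing(sample)
duplicates = count_duplicates(sample)
print(missing, duplicates)
```

For a clean dataset such as the one described above, every count returned by these two functions would be zero.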

Figure 3. 5: Result of missing and duplicate values

After processing the data, we split the dataset into a training set and a test set.
Next, we select relevant features for model construction and proceed with validation.
After building and validating the model, we choose the most suitable model with high
accuracy. Based on the model's results, we identify the factors influencing the
repayment ability of KHCN.
In supervised machine learning, it's essential to conduct a train-test split to
evaluate the model's performance post-training. This involves partitioning the dataset
into two subsets: the training set, utilized to train the model, and the test set, employed
to gauge the model's effectiveness on fresh data. By dividing the data in this manner,
we can assess how well the model generalizes to new instances. This approach aids in
pinpointing any biases or variances within the model and ensures its ability to generalize
effectively across unseen examples. The dataset is imbalanced, so we use
stratified sampling to ensure that the training and test sets preserve the same class proportions.
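
A stratified split can be sketched as below. This is a minimal standard-library illustration, not the study's code: the 90/10 class mix, the 20% test share, the seed and the function name are all assumptions chosen for the example. Each class is shuffled and split separately, so both subsets keep the original class proportions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps its share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within the class
        n_test = round(len(indices) * test_size)   # test share taken per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced target: 90 on-time (0) and 10 late (1) repayments
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Libraries such as scikit-learn offer the same behaviour through the `stratify` argument of `train_test_split`; the hand-written version is shown only to make the mechanism visible.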

Figure 3. 6: Frequency percentage of Target classes among Training and Test sets

As observed (Figure 3.6), the samples are randomly partitioned while maintaining
consistent proportions of each class within both the training and test sets. This is the
final step in the data preprocessing process.
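
A stratified split of this kind can be sketched as below. This is a standard-library illustration, not the thesis's actual code: the 20% test fraction, the random seed, the function name `stratified_split`, and the synthetic 90/10 label vector are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random partition within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def minority_share(labels, indices):
    """Fraction of the selected observations that belong to class 1."""
    return sum(labels[i] for i in indices) / len(indices)

# Hypothetical imbalanced target: 90% class 0 and 10% class 1.
labels = [0] * 900 + [1] * 100
train_idx, test_idx = stratified_split(labels)
train_share = minority_share(labels, train_idx)
test_share = minority_share(labels, test_idx)
```

Because the sampling is done class by class, both subsets end up with the same minority share as the full dataset, which is the property the frequency chart is meant to confirm.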
3.3 Models.
The machine learning models used to analyze the factors influencing the
repayment ability of individual customers are the Logistic Regression model, the SVM
classification model, and the Decision Tree model. The Logistic Regression model has been
widely used in previous studies, so this study also employs it as a baseline for an
objective comparison. The SVM and Decision Tree models are applied alongside it with the
aim of obtaining more objective and accurate results.
The Logistic Regression model predicts an outcome mapped to the interval from 0 to 1
through the logistic function, meaning the prediction can be interpreted as the probability
of the class. The model itself is still "linear," so it performs well when the classes
can be linearly separated (meaning they can be separated by a single decision surface).
Logistic regression can also be regularized by penalizing coefficients with an adjustable
penalty strength. The Logistic Regression model is represented as follows:
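$$\rho(x) = P[Y = 1 \mid x] = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}$$

The regression coefficients $\beta_0, \beta_1, \ldots, \beta_k$ are estimated using the method of
maximum likelihood.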
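
To make the logistic function concrete, the sketch below evaluates the predicted probability for one observation. The coefficient values and feature values are hypothetical and chosen only so the arithmetic is easy to follow; they are not estimates from the thesis.

```python
import math

def logistic_probability(intercept, coefficients, features):
    """P[Y = 1 | x] = e^z / (1 + e^z), with z = b0 + b1*x1 + ... + bk*xk."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Two algebraically equivalent forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients and one hypothetical observation.
beta_0 = -1.0
betas = [0.02, 0.5]
x = [50.0, 2.0]

p = logistic_probability(beta_0, betas, x)  # z = -1 + 1 + 1 = 1
```

With a linear score of z = 1 the function returns roughly 0.73, and a score of z = 0 returns exactly 0.5, which is the usual classification cut-off.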

The Support Vector Machine (SVM) model uses a mechanism called a kernel,
which essentially measures the similarity between two observations. The SVM algorithm
then finds a decision boundary that maximizes the margin, that is, the distance to the
nearest members of the different classes. An SVM with a linear kernel behaves similarly to
logistic regression, so in practice the benefits of SVM often come from using
nonlinear kernels to model nonlinear decision boundaries, and there are multiple kernels
to choose from. SVMs are also quite robust against overfitting, especially in
high-dimensional spaces.
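
The difference between a linear and a nonlinear kernel can be illustrated with two small functions. This sketch assumes the radial basis function (RBF) kernel as the nonlinear example and a `gamma` of 0.5; the thesis does not state which kernel or parameters were used.

```python
import math

def linear_kernel(a, b):
    """Dot product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """Similarity that decays with squared Euclidean distance (1.0 for identical points)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Hypothetical, already-scaled feature vectors.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical observations
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])    # squared distance of 25
dot = linear_kernel([1.0, 2.0], [3.0, 4.0])
```

The RBF value shrinks toward zero as two observations move apart, which is what lets the SVM draw curved decision boundaries in the original feature space.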
Decision trees are a type of white-box machine learning algorithm. They
partition the feature space through a sequence of explicit internal decision rules that
can be inspected directly, unlike black-box algorithms such as neural networks. Their
training time is generally shorter than that of neural networks. The time complexity of
decision trees is a function of the number of records and the number of attributes in the
given data. Decision trees are a non-parametric (distribution-free) method and do not rely
on probability distribution assumptions. They can handle high-dimensional data with good
accuracy.
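
How a tree chooses one of its internal decisions can be sketched with a single threshold search. The example assumes the Gini impurity criterion, which is common in CART-style trees; the thesis does not say which splitting criterion was used, and the income values and labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for a balanced binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, gini(labels))  # baseline: no split at all
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical incomes (million VND) and hypothetical class labels.
income = [40, 55, 70, 95, 110, 130]
target = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_threshold(income, target)
```

A full tree repeats this search over every attribute and then recurses into each side of the chosen split, which is why training cost grows with both the number of records and the number of attributes.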

4. RESULTS AND DISCUSSION


4.1 Descriptive statistical results.
To understand the data set, we need to consider the characteristics of each
variable. The results are shown in Appendix 1.
Numeric variables include ID, Age, Experience, Income, Avg of CC, and
Mortgage. Categorical variables consist of Family and Education. Additionally, there
are 5 Boolean variables: Personal Loan, SA, DA, Online, and CC. Notably, there are no
missing values or duplicates in the dataset. However, the Experience variable contains
negative values, which is illogical and requires correction. The ID variable, uniformly
distributed, serves merely as an identifier and lacks substantive information for
modeling purposes.
Categorical Variables:
- Education: 50% of customers hold a bachelor's degree, 29% have a high school
education, and 28% have other educational levels.
- Family: Customers have dependents ranging from 1 to 4. Approximately 29% of
customers have 1 dependent, 25% have 2, 19.5% have 3, and 26.4% have 4
dependents.
Boolean Variables:
- Personal Loan: 89.9% of customers repay their loans on time. The dataset is
imbalanced.
- DA: About 94% of customers do not have a CD account at the bank.
- CC: 70.8% of customers do not use credit cards.
- Online: Approximately 60% of customers do not use internet banking services.
- SA: Around 90% of customers do not have a Securities Account at the bank.
Numerical Variables:
- Age: The curve is fairly balanced. The average age of customers is 45 years with
a standard deviation of 11.5.
- Avg of CC: The curve is positively skewed. The average monthly spending on
credit cards is 1.96 million VND, with a standard deviation of 1.78.
- Income: The curve is skewed towards the positive side. The average income of
customers in the dataset is 74.17 million VND, with a standard deviation of 46.
- Mortgage: The average mortgage value is 55.86 million VND. The curve is
positively skewed and contains many outliers.
- Experience: The results show values below 0.
The dataset includes an ID variable, which has unique values, so it will be removed
from the dataset.

Figure 4.1: Categorical feature vs Target stacked barplots

Observing Figure 4.1, we notice that:


- Customers with a CD Account tend to have a higher rate of timely loan
repayment compared to those without a CD Account.
- Customers with higher education levels are less likely to have late loan
repayment compared to those with a high school or other education level.
- Customers with more dependents tend to have a higher tendency for late loan
repayment.
- Having a Securities Account, online account, or using a credit card seems to have
little impact on the likelihood of timely loan repayment.

Figure 4.2: Numerical features vs Target distribution

Figure 4.2 illustrates:


- Customers who spend more on credit cards are more likely to have late loan
repayment.
- Customers with higher income are more likely to have timely loan repayment.
- For loans with larger mortgage values, customers are more likely to have timely
loan repayment.
- Customer age and work experience do not affect their loan repayment ability.
Since the Experience variable has 19 removed values, its information content is
lower than that of the Age variable. Therefore, the Experience variable is
excluded from the model.
4.2 Model results.
After examining the correlations and relationships between variables, changes
have been made to the explanatory variables. The model retains the target variable,
Personal Loan, and the explanatory variables include: Income, Education, CD Account,
Family, Credit Card, CCAvg, Online, Securities Account, Age, Mortgage, and
ZipCode. In this project, our primary objective is to accurately identify the factors
influencing customers' ability to make timely payments. The evaluation metrics selected
below play a crucial role in assessing the model's efficacy in identifying these customers.
Recall: the proportion of actual positive cases that the model correctly
identifies. A high recall signifies fewer false negatives, which is advantageous as
it ensures the model does not overlook customers capable of timely repayment.
Precision: the proportion of cases predicted as positive by the model that are
actually positive. High precision implies fewer false positives, which is essential in
ensuring the model does not misidentify customers with limited repayment capabilities
as potential candidates.
The F1 score provides a balanced evaluation of recall and precision, calculated
as their harmonic mean. A high F1 score indicates a compromise between identifying
as many customers as possible who repay their loans on time (high recall) and
minimizing false positives (high precision).
In this project, both recall and precision for class '0' are crucial, making the F1
score for class '0' the most important metric. This balance is vital for banks, as it
supports safer lending decisions.
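
The three metrics above can be computed directly from the cells of a confusion matrix. The sketch below is illustrative: the function name `classification_metrics` and the counts passed to it are assumptions made for the example and are not read from the confusion matrices in the figures.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts.

    tp/fp: true and false positives; fn/tn: false and true negatives.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for a 200-case test set.
metrics = classification_metrics(tp=45, fp=5, fn=15, tn=135)
```

With these counts precision is 0.90 and recall is 0.75, and the F1 score of about 0.82 sits between them but closer to the lower value, which is the behavior that makes the harmonic mean useful on imbalanced data.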
4.2.1 Logistic Regression Model.

Table 4.1: Logistic regression model result

Accuracy    Precision    Recall    F1-Score    AUC
95.11%      78.57%       66.67%    72.13%      96.51%
The results of the Logistic Regression model evaluation (Table 4.1) show an
accuracy of about 95%, an F1-score of 72%, and an AUC of 96.5%. Out of 348 test cases,
the model misclassified 17 when predicting customers' loan repayment
capability (Figure 4.3).

Figure 4.3: Confusion matrix and ROC Curve for Test Data (Logistic Regression)

The logistic regression model indicates that there are 8 variables influencing
customers' repayment ability (Figure 4.4). The influential variables include: Income,
Education, DA, CC, SA, Avg of CC, Online, and Mortgage. Among these, income has
the most significant impact, followed by education level, while the mortgage value has
the least impact on customers' repayment ability.

Figure 4.4: Features importance of Logistic regression

4.2.2 Support Vector Machine Model.


Table 4.2: SVM model result

Accuracy    Precision    Recall    F1-Score    AUC
98.28%      93.55%       87.88%    90.62%      98.61%
The SVM model achieved an AUC of 98.6%, an F1-score of 90.6%, and a precision of
93.6% (Table 4.2). Out of 348 customers, the model correctly predicted
342 cases (with 6 misclassifications) regarding customers' ability to make timely loan
payments (Figure 4.5).

Figure 4.5: Confusion matrix and ROC Curve for Test Data (SVM)

The SVM model also indicates that there are 8 variables influencing customers'
repayment ability, similar to the logistic regression model (Figure 4.6). However, in
the SVM model, the credit card variable and mortgage value are insignificant. Income
and education level have the most significant impact, followed by the number of
dependents, deposit account, average credit card spending, and banking services.
Securities account and age variables have negligible impact.

Figure 4.6: Features importance of SVM model

4.2.3 Decision Tree model results


Table 4.3: Decision Tree model results

Accuracy    Precision    Recall    F1-Score    AUC
98.85%      93.94%       93.94%    93.94%      99.52%
The Decision Tree model has an AUC of 99.5%, with F1-Score and Precision both
achieving 93.94% (Table 4.3). Out of 348 customers, the model correctly predicts 344
cases (with 4 errors) when predicting customers' ability to repay debts on time (Figure
4.7).

Figure 4.7: Confusion matrix and ROC Curve for Test Data (Decision Tree)

Income and education level are the variables with the greatest impact, followed by
average credit card spending and household size. Similar to the results of the
previous two models, the Decision Tree model's outcome also includes 8 influential
variables. Credit card and mortgage value have minimal impact. Securities account
and banking services are two variables that do not influence the model (Figure 4.8).

Figure 4.8: Features importance of Decision tree model


4.3 Discussion.
The Logistic regression model has been widely used by researchers in previous
studies to investigate customers' loan repayment ability. However, when compared to
machine learning models, the Logistic model has the lowest accuracy (Figure 4.9).
The Decision Tree model has the highest accuracy, making fewer prediction errors.

Figure 4.9: F1-Score for class "0"
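
The ranking of the three models by accuracy can be re-derived from the misclassification counts reported in Section 4.2 (17, 6, and 4 errors out of 348 test cases). The short sketch below does only that arithmetic; the variable and function names are illustrative.

```python
TEST_CASES = 348  # test-set size reported in Section 4.2

# Misclassification counts reported for each model in Section 4.2.
errors = {"Logistic Regression": 17, "SVM": 6, "Decision Tree": 4}

def accuracy_from_errors(n_cases, n_errors):
    """Share of correctly classified cases."""
    return (n_cases - n_errors) / n_cases

accuracy = {name: accuracy_from_errors(TEST_CASES, e) for name, e in errors.items()}
best_model = max(accuracy, key=accuracy.get)

for name, acc in accuracy.items():
    print(f"{name}: {acc:.2%}")
```

Because accuracy depends only on the error count for a fixed test set, the model with the fewest misclassifications is necessarily the most accurate, which is the comparison the discussion relies on.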

The Decision Tree model has the highest accuracy, so the author builds a
decision tree to provide conditions the bank can use when deciding whether to disburse
loans to customers. According to the results of the decision tree (Figure 4.10), customers
with an income of 108,500,000 VND or above are those who repay their loans on time. If
a customer's income falls between 92,000,000 VND and 108,500,000 VND, the bank should
check that average credit card spending is below 35,400,000 VND and that the customer's
educational level is high school or above. Additionally, the bank
should consider household size, which should be two or fewer.
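
The conditions described in this paragraph can be written as a small screening function. This is a loose sketch under stated assumptions, not the fitted tree itself: it assumes the first rule applies to incomes at or above 108,500,000 VND, that the three secondary conditions must all hold together, and it uses hypothetical outcome labels ("needs review", "not covered by the stated rules") for cases the text does not describe.

```python
HIGH_INCOME = 108_500_000   # VND, threshold quoted in the text
MID_INCOME = 92_000_000     # VND, threshold quoted in the text
CC_AVG_LIMIT = 35_400_000   # VND, threshold quoted in the text
MAX_FAMILY_SIZE = 2         # "two or fewer"

def screen_applicant(income, cc_avg, family_size, high_school_or_above):
    """Return a coarse, hypothetical label based on the rules in the text."""
    if income >= HIGH_INCOME:
        return "likely on time"
    if MID_INCOME <= income < HIGH_INCOME:
        # Assumption: all three secondary conditions must hold together.
        if (cc_avg < CC_AVG_LIMIT
                and high_school_or_above
                and family_size <= MAX_FAMILY_SIZE):
            return "likely on time"
        return "needs review"
    # The text gives no rule for incomes below 92,000,000 VND.
    return "not covered by the stated rules"

result = screen_applicant(income=100_000_000, cc_avg=2_000_000,
                          family_size=2, high_school_or_above=True)
```

Writing the rules this way makes the gaps visible: any branch the paragraph does not mention has to be read from the tree diagram before the function could be used in practice.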

Figure 4.10: Decision Tree


5. CONCLUSION AND RECOMMENDATIONS


5.1 Conclusion.
Understanding and analyzing the factors influencing customers' loan repayment
ability is crucial for banks to better assess and evaluate loan applicants, thereby
minimizing risks in lending activities. From the data processing methods and model
testing conducted, the study has yielded the following results:
The Logistic regression model, commonly used in previous research, yielded the
lowest accuracy. The SVM and Decision Tree models provided higher accuracy and made
fewer prediction errors. The study has thus shown that these machine learning models
analyze customer loan repayment ability more effectively than the conventional
Logistic regression model.
In general, variables such as income, education level, average credit card
spending, and deposit account are certain factors that will affect customers' loan
repayment ability. Variables such as age, mortgage, credit cards, and securities accounts
have little or no impact. Income and education level are the most influential factors
across all three models.
The study identified the factors influencing customers' loan repayment ability
from the dataset provided by Bac A Bank - Binh Thuan branch. The dataset comprises
11 variables and 1779 observations for each variable. After processing, the data had no
missing values but contained noise, specifically 19 negative values in the Experience
variable. These illogical values were removed. Additionally, the dataset was heavily
imbalanced, and the author conducted upsampling so that the models would be more
objective.
Compared to previous studies, this research used fewer variables and
observations. However, it has also validated the factors influencing customers' loan
repayment ability. The study has shown the significance of variables such as
average credit card spending, deposit accounts, and education level. However, for
variables like household size and mortgage value, the model's results indicate minimal
impact, while in Thanh's (2006) study, these were considered insignificantly influential
variables. In specific situations like natural disasters, as per Nguyen's (2018) study, the
income variable positively influences loan repayment ability. However, this study
considers only normal conditions. The research has demonstrated that income
and education level are the most influential variables, consistent with previous studies
by Vuong (2006), Wongnaa (2013), and Joseph (2013). Regarding age, while the
machine learning models indicate it has little impact on customers' loan repayment
ability, the quantitative analyses of Nguyen (2012) and Nguyen (2018), which used
traditional binary Logistic regression models estimated in SPSS,
suggest it significantly affects individual customers' loan repayment ability.
5.2 Recommendations.
In addition to the assessment tools currently in use, banks should consider additional
analytical and predictive models to determine priority factors for loan assessment.
Income is a direct factor that significantly influences customers' repayment ability.
Customers with higher incomes are more likely to repay their loans on time compared
to those with lower incomes. However, banks need to be cautious in assessing
customers' income to avoid customers falsely declaring income to borrow more than
their actual capacity. Accurate information in loan applications is crucial as it
determines the quality of assessment. Therefore, proper storage and management of
information and credit profiles are essential to ensure accurate assessment. To mitigate
credit risks, banks should consider customers' educational levels. Higher education
levels imply better understanding of expertise and legal matters. To verify customers'
educational levels accurately and prevent false declarations, banks can request
additional credentials from customers.
5.3 Limitations
Due to time and scope constraints, the study still has several limitations:
Firstly, compared to previous experimental studies, this research uses a smaller
sample size and fewer variables. Factors influencing customers' abilities consist of both
objective and subjective factors. The study only considers 10 factors and needs to
expand and examine more factors.
Secondly, the dependent variable only considers two possibilities: timely
repayment and late repayment. Customers' repayment behavior should consider
whether they fully repay or partially repay their debts.
REFERENCES
Vietnamese documents:
1. Hoang, V. Q., Hung, D. G., Huu, N. V., & Ngoc, T. M. (2006). Statistical
method for constructing individual credit rating models. Vietnam Journal of
Mathematical Applications, 4(2), 1-16.
2. Law No. 47/2010/QH12 of the National Assembly dated June 16, 2010
promulgated on the law on credit institutions
3. Vo, D. L., Nguyen, H. S., Dao, N. L., Do, D. D., Trinh, T. H. M., Pham, V. D.,
... & Nguyen, T. K. T. (2008). Current Inflation in Vietnam: Causes and
Solutions. National University of Hanoi.
4. Vo, T. V. N., & Nguyen, T. H. N. Factors Affecting the Household's
Repayment Ability on Time in the Mekong Delta, Vietnam.
Foreign documents:
1. Bessis, J. (2011). Risk Management in Banking. John Wiley & Sons.
2. Idris, I. Python Data Analysis.
3. Han, J., & Kamber, M. Data Mining: Concepts and Techniques, Second
Edition.
4. Kang, D., Raghavan, D., Bailis, P., & Zaharia, M. (2020). Model assertions for
monitoring and improving ML models. Proceedings of Machine Learning and
Systems, 2, 481-496.
5. Magali, J. J. (2013). Factors Affecting Credit Default Risks for Rural Savings
and Credits Cooperative Societies (SACCOs) in Tanzania. European Journal
of Business and Management, 5(32), 60-73.
6. Wongnaa, C. A., & Awunyo-Vitor, D. (2013). Factors Affecting Loan
Repayment Performance Among Yam Farmers in the Sene District, Ghana.
Agris On-line Papers in Economics and Informatics, 5(665-2016-44943), 111-
122.

APPENDIX 1