Graduation Thesis
Class: K20414C
ACKNOWLEDGEMENT
LIST OF ACRONYMS
ABSTRACT
1. INTRODUCTION
REFERENCES
APPENDIX 1
LIST OF FIGURES
Figure 3.1: Data Visualization Process
Figure 3.2: Correlation Chart
Figure 3.3: Description of the Experience variable
Figure 3.4: Mortgage distribution
Figure 3.5: Result of missing and duplicate values
Figure 3.6: Frequency percentage of Target classes among Training and Test sets
LIST OF TABLES
Table 3.1: Data Description
Acronyms Meaning
KHCN individual customers
SVM Support Vector Machine
ABSTRACT
This report is conducted with the purpose of "Analyzing the loan repayment
ability of individual customers at Bac A Bank - Binh Thuan Branch". To assess the
repayment ability of customers, we need to rely on criteria that reflect the customers'
situation accurately to minimize risks for the bank. The study will clarify which factors
affect the loan repayment ability of individual customers and how these factors
influence loan repayment when considering credit approval for customers of Bac A
Bank - Binh Thuan Branch.
The research uses secondary data from a random sample of 1779 individual customers
with complete records in the Bac A Bank - Binh Thuan Branch database. Three machine
learning models, Logistic Regression, Support Vector Machine (SVM) classification, and
Decision Tree, are used to evaluate the impact of independent factors on customers'
loan repayment ability. Data processing and model validation are conducted in the
Python programming language.
Of the 11 factors initially expected to influence customers' loan repayment
ability, the machine learning models indicate that Income, Education level, average
credit card spending, and deposit account are significant factors. The Logistic
Regression model results likewise show that Income and Education level have a
significant impact on customers' loan repayment ability.
Based on the analysis results, the author proposes some solutions and
recommendations related to the operations of Bac A Bank - Binh Thuan Branch to
enhance customers' loan repayment ability in the future.
Keywords: machine learning in finance, loan repayment, the repayment ability of
customers.
1. INTRODUCTION
1.1 Reason for choosing the topic.
Commercial banks continue to expand their activities and improve service quality,
and among these activities credit stands out as the most traditional and crucial. It
generates the largest share of income for commercial banks but also carries the highest
risk, so bank managers pay special attention to it. Credit risk cannot be eliminated;
banks can only mitigate it and limit the resulting losses through various measures,
which contributes significantly to safe and effective credit operations. Bank management
in general, and credit risk management in particular, is considered successful when the
bank's losses are equal to or less than expected.
The next business strategy of Bac A Bank is to increase retail credit. Because the
Binh Thuan branch was established just as Covid-19 broke out, credit risk management
for individual customers has not received adequate emphasis and has not been truly
effective. The problem is therefore how to manage credit risks for individual customers
rigorously and effectively in order to enhance credit quality, which is essential for
the branch's survival.
Applying technology to bank management is now crucial, because it can yield more
accurate results. The significant development of Machine Learning has changed the
traditional methods of bank management. For this reason, I decided to use Machine
Learning to address the issue of managing credit risks for individual customers.
During my internship at the bank, I directly conducted and managed credit
operations, especially personal loans. I am particularly interested in credit risk as
reflected in the ability of individual customers to repay their debts, and I aim to
propose measures and solutions to minimize it. I therefore chose the research topic
"Analyzing the loan repayment ability of individual customers at Bac A Bank - Binh
Thuan Branch."
1.2 Research objectives.
Identifying and measuring the factors influencing the loan repayment ability of
individual customers (KHCN) at Bac A Bank - Binh Thuan Branch. Comparing the
2. LITERATURE REVIEW
2.1 Overview of Personal Credit Theory, Repayment ability, and Factors
Influencing Repayment Ability.
2.1.1 Personal Credit.
According to "Monetary Banking" (2008) by author Nguyen Minh Kieu: "Credit,
derived from the Latin word creditium and the English word credit, means trust and
confidence. In the Vietnamese folk language, credit means borrowing and lending.
Financially, credit is the transfer of capital usage rights from the owner to the user for
a certain period with a certain cost." And "Credit activities can be divided into various
forms based on different criteria, among which, if based on the borrowing party, it can
be divided into personal credit and corporate credit."
According to the Law on Credit Institutions in 2010, Article 4, Clause 14
stipulates: "Credit granting is an agreement to allow organizations or individuals to use
a sum of money or commit to use a sum of money, repayable through lending,
discounting, financial leasing, payment guarantee, bank guarantee, and other credit
granting transactions." "Lending is a form of credit granting, whereby the lender
provides or commits to provide the customer with a sum of money for a specific purpose
for a specified period under agreed-upon terms with the principle of repayment of both
principal and interest."
The concept of personal credit is mentioned in the "Dictionary of Finance and
Banking" by Law and Smullen (2005), where personal credit "is the amount of money
or assets provided by financial institutions to an individual after assessing the risk of
that individual, and the lending institution will receive the principal and interest after a
certain period as agreed upon."
Personal credit in this study is a form of credit where the Bank plays the role of
providing a sum of money (the lender) to individual customers (the borrower), after
assessing the risk of the customer, and the bank will receive the principal and interest
after an agreed-upon period.
Personal credit is a type of credit, so personal credit also has the characteristics
of credit:
capital and repay loans. This is also a very important factor in deciding whether to grant
a loan or not. It serves as the basis for making credit granting decisions.
Fifthly, the bank's technology. Modern technology helps banks provide modern,
diverse services to meet the increasing and diverse needs of customers. Meanwhile, the
nature of individual customer lending operations involves transactions with a large and
diverse customer base, requiring the bank to process a large number of loan contracts.
Therefore, the modern technology system of the bank both saves time and effort for
credit staff and minimizes errors during transactions with customers.
2.1.3.2 Factors from the Borrower.
Firstly, the financial capacity of the borrower. Customers will be granted credit
when they meet all requirements regarding financial capacity to fulfill repayment
obligations. The bank needs to pay attention to the transparency of the repayment
source.
Secondly, customer needs, habits, and ethics. In addition to the above factors,
external objective factors also influence lending to individual customers, such as
customer ethics. If customers have a good repayment conscience and low credit risk, it
will stimulate banks to expand lending activities, and regulations will not be overly
stringent.
2.1.3.3 Other Factors.
Firstly, the characteristics of the market where the bank operates. In urban or
densely populated areas with fairly high income and education levels, demand for loans
from individual customers is higher than in rural or remote areas where residents
depend mainly on year-round farming.
Secondly, the economic and political environment. The economic and political
environment affects the lending activities of individual customers. If the economy is
developing well, the average income per capita is high, and the political environment is
stable, lending activities for individual customers will also proceed smoothly, develop
steadily, and minimize complications. If there is fierce competition among banks to
attract customers, lending activities of the bank for individual customers will face many
difficulties...
not. The study's limitations included a small sample size of 431 borrowers and the
omission of important factors such as "Loan term" and "Gender" from the analysis.
Vuong Quan Hoang (2006) applied statistical methods with a logistic regression
model to 1,727 customers. Independent variables such as monthly income, income
disparity, expenditure, and customer asset value were positively correlated with
customers' repayment ability. Conversely, age, education level, occupation type,
marital status, residence, residence duration, number of dependents, and
transportation negatively affected repayment ability.
The study "Factors affecting household's timely repayment ability in the Mekong
Delta, Vietnam" by Nguyễn Thị Hồng Nga and Võ Thị Vân Na (2018) aimed to
determine factors influencing households' ability to repay loans in the Mekong Delta,
Vietnam. The survey involved 326 farmers, with 135 currently indebted to credit
institutions. Using binary logistic regression and SPSS ver. 22, the study measured the
impact of factors including age, education level, household size, number of dependents,
farm size, farm income, natural disasters, and interest rates on households' repayment
ability. The results revealed that education level, household size, farm size, and farm
income significantly affected households' repayment ability, with farm size, farm
income, natural disasters, and interest rates negatively impacting repayment ability.
Nguyen's (2012) study evaluated the repayment ability of individual customers
at Viet Thai Bank. Employing both qualitative and quantitative methods, the study
used mainly Excel and SPSS to analyze the data with a logistic regression model.
The results identified 15 factors affecting customers' repayment ability, including
age, education level, occupation, work experience, current job tenure, housing status,
personal income, household income, repayment history with VSB, late payment history,
total current debt, other bank services, and deposit account balance.
3. RESEARCH METHODOLOGY
3.1 Data.
In this study, the author uses the credit dataset of individual customers at
Bac A Bank - Binh Thuan Branch to assess the influence of various factors on the
loan repayment ability of individual customers. The dataset contains 1779 repayment
records from individual customers. The dependent variable is the binary variable Y
(Y = 0: timely repayment, Y = 1: late repayment), and there are 11 explanatory
variables covering customer information and repayment data.
Table 3.1: Data Description
Name | Symbol | Description | Data type | References
Personal Loan | Personal Loan | Target variable: are customers paying their debts on time? (0: on time, 1: not on time) | Boolean | Thanh (2006)
Age | Age | Customer's age in completed years | Numerical | Nguyen (2012); Wongnaa (2013)
[Figure: data-processing flowchart with the steps Missing Value Treatment, Noise Treatment, Duplicate Values Treatment, and Outliers Treatment, leading to Clean data]
The Correlation Chart (Figure 3.2) shows that the Personal Loan variable is
strongly correlated with the variables Income, Avg of CC, and DA. The variable
Experience is highly collinear with the variable Age. The Avg of CC variable shows a
moderate, acceptable level of correlation with the Income variable (correlation
coefficient of 0.59).
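A correlation check of this kind can be sketched in Python with pandas. This is only an illustration under stated assumptions: the data are synthetic, the column names (Age, Experience, Income, CCAvg) are stand-ins, and the 0.9 cut-off is a hypothetical choice, not a value taken from the study.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bank data; names and values are illustrative only.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(23, 65, size=n)
experience = age - 23 + rng.integers(-2, 3, size=n)    # built to track Age closely
income = rng.gamma(shape=2.0, scale=35.0, size=n)
cc_avg = 0.02 * income + rng.normal(0.0, 1.0, size=n)  # built to correlate with Income

df = pd.DataFrame({"Age": age, "Experience": experience,
                   "Income": income, "CCAvg": cc_avg})

# Pearson correlation matrix of all numeric columns.
corr = df.corr()

# List variable pairs whose absolute correlation exceeds a chosen threshold.
threshold = 0.9  # hypothetical cut-off for "too collinear to keep both"
high_pairs = [(a, b, float(corr.loc[a, b]))
              for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if abs(corr.loc[a, b]) > threshold]

print(corr.round(2))
print(high_pairs)
```

In this synthetic setup only the Age and Experience pair crosses the threshold, which mirrors the kind of redundancy described above; one of the two variables would then be a candidate for removal.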
Next, we check for missing values and duplicated values. The result shows that
there are no missing values or duplicate values (Figure 3.5). The dataset is therefore
complete and clean.
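The missing-value and duplicate check can be sketched with pandas as follows. The small data frame is made up for illustration and deliberately contains one gap and one repeated row (unlike the study's dataset, which has neither); the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; values are invented and include one gap and one repeated row.
df = pd.DataFrame({
    "Age": [25, 40, 40, 31],
    "Income": [49.0, 120.0, 120.0, np.nan],
    "Education": [1, 3, 3, 2],
})

missing_per_column = df.isna().sum()       # number of missing cells in each column
n_duplicates = int(df.duplicated().sum())  # rows identical to an earlier row

print(missing_per_column)
print("duplicate rows:", n_duplicates)

# One possible treatment: drop repeated rows, then drop rows that still have gaps.
clean = df.drop_duplicates().dropna()
print(len(clean), "rows remain")
```

When both counts are zero, as reported for the study's data, no treatment step is needed.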
After processing the data, we split the dataset into a training set and a test set.
Next, we select relevant features for model construction and proceed with validation.
After building and validating the models, we choose the most suitable model, the one
with the highest accuracy. Based on that model's results, we identify the factors
influencing the repayment ability of individual customers (KHCN).
In supervised machine learning, a train-test split is essential for evaluating
the model's performance after training. The dataset is partitioned into two subsets:
the training set, used to fit the model, and the test set, used to gauge the model's
effectiveness on data it has not seen. Dividing the data this way shows how well the
model generalizes to new instances and helps reveal bias or variance problems. The
dataset here is imbalanced, so we use a stratified split, which keeps the proportion
of each target class the same in the training and test sets.
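A stratified split can be sketched with scikit-learn's train_test_split. The data below are synthetic, and the 10% late-repayment share, the 70/30 split ratio, and the random seed are assumptions chosen for illustration; the thesis does not state them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 100 late payers (1) and 900 on-time payers (0).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 100 + [0] * 900)

# stratify=y keeps the class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print("share of class 1 in train:", y_train.mean())
print("share of class 1 in test: ", y_test.mean())
```

Without the stratify argument, a random split of an imbalanced dataset can leave the minority class over- or under-represented in the test set, which distorts the evaluation metrics.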
Figure 3.6: Frequency percentage of Target classes among Training and Test sets
The Support Vector Machine (SVM) model relies on a kernel, a function that
measures the similarity between two observations. The SVM algorithm then finds the
decision boundary that maximizes the margin, that is, the distance between the boundary
and the nearest members of each class. An SVM with a linear kernel produces a linear
decision boundary, much like logistic regression, so in practice the benefit of SVM
often comes from using nonlinear kernels to model nonlinear decision boundaries, and
there are multiple kernels to choose from. SVMs are also fairly robust against
overfitting, especially in high-dimensional spaces.
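The difference between a linear and a nonlinear kernel can be illustrated with scikit-learn's SVC. This sketch uses a synthetic two-ring dataset (make_circles), which is an assumption made purely to show a boundary that no straight line can separate; it is not the bank data, and the kernels compared here are not necessarily those used in the study.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the classes cannot be separated by a straight line.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Same algorithm, two kernels: a linear one and a nonlinear (RBF) one.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", round(linear_acc, 3))
print("RBF kernel accuracy:   ", round(rbf_acc, 3))
```

On data like this the RBF kernel clearly outperforms the linear one; on data that are close to linearly separable, the two would perform similarly.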
Decision trees are a white-box machine learning algorithm: they partition the
feature space through a sequence of explicit splitting rules, so their internal decision
logic can be inspected, which is not possible with black-box algorithms such as neural
networks. They also train faster than neural networks. Their time complexity is a
function of the number of records and the number of attributes in the data. Decision
trees are a non-parametric (distribution-free) method and do not rely on probability
distribution assumptions. They can handle high-dimensional data with good accuracy.
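The white-box property can be seen by printing the rules of a fitted tree. The sketch below uses scikit-learn with synthetic data; the feature labels are hypothetical names attached to random columns, and the depth limit of 3 is an assumption made to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the labels below are hypothetical and carry no real meaning.
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ["Income", "Education", "CCAvg", "Family"]

# A shallow tree keeps the rule set small enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text renders the learned splits as nested if/else style rules.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each printed line is a threshold test on one feature, ending in a predicted class, which is what allows a credit officer to trace exactly why a given prediction was made.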
- Mortgage: The average mortgage value is 55.86 million VND. The curve is
positively skewed and contains many outliers.
- Experience: Some records have values below 0.
The dataset includes an ID variable, which has unique values, so it will be removed
from the dataset.
- Having a Securities Account, online account, or using a credit card seems to have
little impact on the likelihood of timely loan repayment.
- Customer age and work experience do not affect their loan repayment ability.
Since the Experience variable has 19 removed values, its information content is
lower than that of the Age variable. Therefore, the Experience variable is
4.2 Model results.
After examining the correlations and relationships between variables, we revised
the set of explanatory variables. The model retains the target variable, Personal Loan,
and the following explanatory variables: Income, Education, CD Account, Family, Credit
Card, CCAvg, Online, Securities Account, Age, Mortgage, and ZipCode. Our primary
objective in this project was to accurately identify the factors influencing customers'
ability to make timely payments. The evaluation metrics below play a crucial role in
assessing the model's efficacy in identifying these customers.
Recall measures the proportion of actual positive cases that the model correctly
identifies. A high recall signifies fewer false negatives, which is advantageous because
it ensures the model doesn't overlook customers capable of timely repayment.
Precision measures the proportion of cases the model labels as positive that are
actually positive. High precision implies fewer false positives, which is essential to
ensure the model doesn't misidentify customers with limited repayment capability as
reliable borrowers.
The F1 score provides a balanced evaluation of recall and precision, calculated
as their harmonic mean. A high F1 score indicates a compromise between identifying
as many customers as possible who repay their loans on time (high recall) and
minimizing false positives (high precision).
In this project, both recall and precision for class '0' were crucial, making the F1
score for class '0' the most important metric. A high F1 score signifies a balance between
identifying customers likely to pay their debts on time (high recall) and minimizing
false positives (high precision). This balance is vital for the bank, which aims to
extend credit safely.
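The three metrics can be computed directly from the prediction counts. The following is a minimal sketch in plain Python that treats class 0 (on-time repayment) as the positive class, as described above; the function name and the eight example labels are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=0):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented labels: 0 = on-time repayment, 1 = late repayment.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=0)
print(precision, recall, f1)
```

In this example there are 4 true positives, 1 false positive, and 1 false negative for class 0, so precision, recall, and F1 all equal 0.8.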
4.2.1 Logistic Regression Model.
Figure 4.3: Confusion matrix and ROC Curve for Test Data (Logistic Regression)
The logistic regression model indicates that there are 8 variables influencing
customers' repayment ability (Figure 4.4). The influential variables include: Income,
Education, DA, CC, SA, Avg of CC, Online, and Mortgage. Among these, income has
the most significant impact, followed by education level, while the mortgage value has
the least impact on customers' repayment ability.
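One common way to rank variables by influence in a logistic regression is to standardize the features and compare the absolute values of the fitted coefficients. The sketch below illustrates that idea with scikit-learn; it is not necessarily the procedure used in this study, and the data are synthetic, generated so that a variable labelled Income matters most and one labelled Mortgage matters least.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic predictors; the names are labels for illustration only.
rng = np.random.default_rng(0)
n = 2000
income = rng.normal(size=n)
education = rng.normal(size=n)
mortgage = rng.normal(size=n)

# Synthetic target built so that Income has the largest effect and Mortgage the smallest.
logit = 2.0 * income + 1.0 * education + 0.1 * mortgage
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Standardizing puts all coefficients on a comparable scale.
X = StandardScaler().fit_transform(np.column_stack([income, education, mortgage]))
names = ["Income", "Education", "Mortgage"]

model = LogisticRegression().fit(X, y)

# Rank variables by the absolute size of their standardized coefficient.
ranking = sorted(zip(names, np.abs(model.coef_[0])),
                 key=lambda item: item[1], reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:.2f}")
```

The ranking recovers the order built into the synthetic data. With real data, coefficient size should be read together with its statistical significance before a variable is called influential.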
Figure 4.5: Confusion matrix and ROC Curve for Test Data (SVM)
The SVM model also indicates that there are 8 variables influencing customers'
repayment ability, similar to the logistic regression model (Figure 4.6). However, in
the SVM model, the credit card variable and mortgage value are insignificant. Income
and education level have the most significant impact, followed by the number of
dependents, deposit account, average credit card spending, and banking services.
Securities account and age variables have negligible impact.
Figure 4.7: Confusion matrix and ROC Curve for Test Data (Decision Tree)
Income and education level are the variables with the greatest impact. Following them
are average credit card spending and household size. Similar to the results of the
previous two models, the Decision Tree model's outcome also includes 8 influential
variables. Credit card and mortgage value have minimal impact. Securities account
and banking services are two variables that do not influence the model (Figure 4.8).
4.3 Discussion.
The Logistic regression model has been widely used by researchers in previous
studies to investigate customers' loan repayment ability. However, when compared to
machine learning models, the Logistic model has the lowest accuracy (Figure 4.9).
The Decision Tree model has the highest accuracy, making the fewest prediction
errors, so the author builds a decision tree to provide conditions the bank can use
when deciding whether to disburse loans to customers. According to the decision tree
(Figure 4.10), customers with an income of 108,500,000 VND will be those who repay
their loans on time. If a customer's income falls between 92,000,000 VND and
108,500,000 VND, the bank must also consider whether average credit card spending is
below 35,400,000 VND and whether the customer's educational level is high school or
above. Additionally, the bank should consider household size, which should be two or
fewer.
we are only considering normal conditions. The research has demonstrated that income
and education level are the most influential variables, consistent with previous studies
by Vuong (2006), Wongnaa (2013), and Magali (2013). Regarding age, the machine learning
models indicate that it has little impact on customers' loan repayment ability, whereas
the quantitative analyses and traditional binary logistic regression models run in SPSS
by Nguyen (2012) and Nguyen (2018) suggest that it significantly affects individual
customers' loan repayment ability.
5.2 Recommendations.
In addition to the assessment tools currently in use, banks should consider additional
analytical and predictive models to determine priority factors for loan assessment.
Income is a direct factor that significantly influences customers' repayment ability.
Customers with higher incomes are more likely to repay their loans on time compared
to those with lower incomes. However, banks need to be cautious in assessing
customers' income to avoid customers falsely declaring income to borrow more than
their actual capacity. Accurate information in loan applications is crucial as it
determines the quality of assessment. Therefore, proper storage and management of
information and credit profiles are essential to ensure accurate assessment. To mitigate
credit risks, banks should consider customers' educational levels. Higher education
levels imply better understanding of expertise and legal matters. To verify customers'
educational levels accurately and prevent false declarations, banks can request
additional credentials from customers.
5.3 Limitations
Due to time and scope constraints, the study still has several limitations:
Firstly, compared to previous experimental studies, this research uses a smaller
sample size and fewer variables. Factors influencing customers' abilities consist of both
objective and subjective factors. The study only considers 10 factors and needs to
expand and examine more factors.
Secondly, the dependent variable only considers two possibilities: timely
repayment and late repayment. Customers' repayment behavior should consider
whether they fully repay or partially repay their debts.
REFERENCES
Vietnamese documents:
1. Hoang, V. Q., Hung, D. G., Huu, N. V., & Ngoc, T. M. (2006). Statistical
method for constructing individual credit rating models. Vietnam Journal of
Mathematical Applications, 4(2), 1-16.
2. National Assembly. (2010). Law No. 47/2010/QH12 on Credit Institutions, dated
June 16, 2010.
3. Vo, D. L., Nguyen, H. S., Dao, N. L., Do, D. D., Trinh, T. H. M., Pham, V. D.,
... & Nguyen, T. K. T. (2008). Current Inflation in Vietnam: Causes and
Solutions. National University of Hanoi.
4. Vo, T. V. N., & Nguyen, T. H. N. (2018). Factors Affecting the Household's
Repayment Ability on Time in the Mekong Delta, Vietnam.
Foreign documents:
1. Bessis, J. (2011). Risk Management in Banking. John Wiley & Sons
2. Idris, I. Python Data Analysis.
3. Han, J., & Kamber, M. Data Mining: Concepts and Techniques (2nd ed.).
4. Kang, D., Raghavan, D., Bailis, P., & Zaharia, M. (2020). Model assertions for
monitoring and improving ML models. Proceedings of Machine Learning and
Systems, 2, 481-496.
5. Magali, J. J. (2013). Factors Affecting Credit Default Risks for Rural Savings
and Credits Cooperative Societies (SACCOs) in Tanzania. European Journal
of Business and Management, 5(32), 60-73.
6. Wongnaa, C. A., & Awunyo-Vitor, D. (2013). Factors Affecting Loan
Repayment Performance Among Yam Farmers in the Sene District, Ghana.
Agris On-line Papers in Economics and Informatics, 5(665-2016-44943), 111-
122.
APPENDIX 1