Credit risk evaluation model with textual features

Description

This repository reimplements the results of the paper "Credit risk evaluation model with textual features from loan descriptions for P2P lending", Zhang et al. (2020). It was built as part of my Master's thesis research at NTNU (Uncertainty Quantification for Language Models in Safety-Critical Systems). The details of the model are given below.

Note

Not everything has been done exactly as in the paper, sometimes because the technology has improved since then and sometimes for lack of knowledge on my side.

Dataset

The dataset is the LendingClub loan data from the American market.

Since LendingClub no longer provides this data, the dataset used here was taken from Kaggle.

This section presents the dataset, describes the feature selection (done as in the original paper), and compares the characteristics of the resulting data to those of the data used in the paper.

Definitions and Descriptions of Variables from LendingClub

This section gives the variable names and their descriptions. A few variables were not in the original dataset and have been computed from existing ones. The code for this process can be found in src/data/features.py.

Variable Name	Data Type	Description	Code Variable Name / Computation
Target Variable
Loan status	Categorical	Whether the borrower has or has not paid the loan in full	`loan_status`
Loan characteristics
Loan amount	Numerical (log)	The total amount committed to the loan	`log(loan_amnt)`
Term	Categorical	Number of payments on the loan. Values can be either 36 or 60	`term`
Interest rate	Numerical	Nominal interest on the loan	`int_rate`
Loan purpose	Categorical	Purpose of loan request, 13 purposes included	`purpose`
Creditworthiness features
FICO score	Numerical	Borrower’s FICO score at loan origination	(`fico_range_high` + `fico_range_low`) / 2
Credit level	Categorical	Internal credit risk level assigned by the platform	`grade`
Inquiries last 6 months	Numerical	The number of inquiries within the last 6 months	`inq_last_6mths`
Revolving utilization rate	Numerical	The amount of credit the borrower is using relative to all available revolving credit	`revol_util`
Delinquency 2 years	Numerical	The number of delinquencies within the past 2 years	`delinq_2yrs`
Public record	Numerical	The number of derogatory public records	`pub_rec`
Open credit lines	Numerical	The number of open credit lines	`open_acc`
Revolving income ratio	Numerical	The ratio of revolving line to monthly income	`revol_bal / (annual_inc / 12)`[*]
Total account	Numerical	The total number of credit lines currently	`total_acc`
Credit age	Numerical	The number of months from the time at which the borrower opened his or her first credit card to the loan request	`issue_d - months_since_earliest_cr_line`
Solvency features
Annual income	Numerical (log)	The self-reported annual income provided by borrower	`log(annual_inc)`
Employ length	Numerical	Employment length in years	`emp_length` converted to months
House ownership	Categorical	House ownership provided by the borrower	`home_ownership`
Income verification	Categorical	The status of income verification. Verified, source verified, or not verified	`verification_status`
DTI	Numerical	Debt-to-income ratio	`dti`
Description feature
Description length	Numerical	The length of the loan description	`desc.length()`

[*]: Malekipirbazari et al. (2015)
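
The computed variables of the table can be sketched in a few lines. This is only an illustration of the formulas listed above, not the code of src/data/features.py: the function name `derive_features`, the output keys and the example record are hypothetical, and the description length is counted in words, as in the data selection step.

```python
import math


def derive_features(row):
    """Compute the derived variables of the table from one raw loan record (a dict)."""
    monthly_income = row["annual_inc"] / 12
    return {
        "log_loan_amnt": math.log(row["loan_amnt"]),
        "log_annual_inc": math.log(row["annual_inc"]),
        # Mean of the two bounds of the FICO range.
        "fico_score": (row["fico_range_high"] + row["fico_range_low"]) / 2,
        # Revolving balance relative to monthly income.
        "revol_inc_ratio": row["revol_bal"] / monthly_income,
        # Description length, counted in whitespace-separated words (assumption).
        "desc_length": len(row["desc"].split()),
    }


# Hypothetical record, for illustration only.
example = {
    "loan_amnt": 10000,
    "annual_inc": 60000,
    "fico_range_high": 704,
    "fico_range_low": 700,
    "revol_bal": 15000,
    "desc": "Consolidating two credit cards into one payment",
}
features = derive_features(example)
```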

Data selection

The data selection follows the one in the original paper as closely as possible.

As in the article, I used the loans issued during the period 2007–2014. After data cleaning (discarding loans with missing values or with a description shorter than 20 words), I obtained 69,276 loan records (70,488 in the paper), of which 10,151 (10,534 in the paper), i.e. 14.65% (14.94% in the paper), ultimately ended in default.
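
The selection rule can be sketched as a simple filter. This is a minimal illustration under assumptions: the helper `keep_loan`, the `issue_year` field and the toy records are hypothetical (the real dataset stores the issue date in another form), and a missing value is represented here by `None` or an empty string.

```python
def keep_loan(row, min_words=20):
    """Return True if the record has no missing value and a long enough description."""
    has_missing = any(value is None or value == "" for value in row.values())
    if has_missing:
        return False
    return len(row["desc"].split()) >= min_words


# Toy records: only the first one passes every rule.
loans = [
    {"loan_amnt": 5000, "issue_year": 2012, "desc": "word " * 25},
    {"loan_amnt": 8000, "issue_year": 2013, "desc": "too short"},
    {"loan_amnt": None, "issue_year": 2011, "desc": "word " * 30},
    {"loan_amnt": 7000, "issue_year": 2016, "desc": "word " * 40},
]
selected = [r for r in loans if 2007 <= r["issue_year"] <= 2014 and keep_loan(r)]
```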

Comparison of summary statistics of the loan description length

Paper samples
Status	Size	Mean	Std	Q1	Q2	Q3
Paid	59954	56.69	45.80	30	45	63
Default	10534	55.78	40.52	29	44	63
All	70488	56.56	40.76	29	45	63
Samples processed
Status	Size	Mean	Std	Q1	Q2	Q3
Paid	58742	60.96	60.95	32	47	64
Default	10151	61.52	58.24	21	46	64
All	69276	61.04	54.64	31	47	64

Comparison of the cross table on continuous variables for LendingClub data (paper data vs. processed data)

Variable	Default (Mean)	Default (Median)	Default (Std)	Paid off (Mean)	Paid off (Median)	Paid off (Std)	Pbc
Loan amount (Log)	9.444 / 9.447	9.556 / 9.575	0.651 / 0.648	9.346 / 9.341	9.393 / 9.393	0.652 / 0.654	−0.053 / -0.057
Interest rate	0.154 / 0.155	0.153 / 0.153	0.042 / 0.043	0.128 / 0.128	0.125 / 0.125	0.042 / 0.042	−0.216 / -0.217
Annual income (Log)	10.956 / 10.969	10.951 / 10.954	0.499 / 0.491	11.046 / 11.056	11.035 / 11.051	0.513 / 0.509	0.062 / 0.061
DTI	0.173 / 0.172	0.173 / 0.172	0.075 / 0.074	0.159 / 0.158	0.157 / 0.156	0.075 / 0.074	−0.064 / -0.066
FICO score	695.500 / 696.006	687.000 / 692.000	28.238 / 28.278	706.960 / 707.796	702.000 / 702.000	33.667 / 33.747	0.123 / 0.125
Delinquency in 2 years	0.190 / 0.211	0.000 / 0.000	0.490 / 0.660	0.170 / 0.189	0.000 / 0.000	0.470 / 0.603	−0.012 / -0.013
Open credit lines	10.870 / 10.817	10.000 / 10.000	4.760 / 4.713	10.580 / 10.531	10.000 / 10.000	4.615 / 4.559	−0.022 / -0.022
Inquiries last 6 months	0.960 / 0.999	1.000 / 1.000	1.023 / 1.122	0.770 / 0.796	0.000 / 0.000	0.955 / 1.034	−0.069 / -0.068
Public records	0.100 / 0.088	0.000 / 0.000	0.320 / 0.331	0.080 / 0.70	0.000 / 0.000	0.294 / 0.292	−0.020 / -0.022
Revolving to income	2.940 / 2.935	2.499 / 2.496	2.289 / 2.290	2.736 / 2.732	2.297 / 2.294	2.205 / 2.199	−0.033 / -0.033
Revolving utilization	0.600 / 0.602	0.630 / 0.634	0.239 / 0.241	0.546 / 0.547	0.569 / 0.572	0.252 / 0.254	−0.076 / -0.077
Total account	23.860 / 23.681	22.000 / 22.000	11.132 / 11.11	24.110 / 23.900	23.000 / 22.000	11.257 / 11.161	0.008 / 0.007
Credit age	174.440 / 170.694	159.000 / 156.000	81.699 / 78.499	177.400 / 174.493	163.000 / 160.000	81.983 / 79.591	0.013 / 0.017
Description length	55.770 / 61.520	44.000 / 46.000	40.523 / 58.236	56.690 / 60.960	45.000 / 47.000	40.799 / 53.995	0.008 / -0.004*

In each cell, the values are given as {paper} / {processed}. The results can be reproduced by running read_csv.ipynb. Pbc stands for point-biserial correlation.
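
For reference, the point-biserial correlation between a continuous variable and a binary label can be computed as below. This is a sketch and not the notebook code: the function name and the toy values are hypothetical. The signs in the table (for instance a negative value for the interest rate, which is higher on defaulted loans) suggest that paid-off loans are coded as 1, and the example assumes that coding.

```python
import math


def point_biserial(values, labels):
    """Point-biserial correlation between a continuous variable and a 0/1 label."""
    n = len(values)
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    mean_all = sum(values) / n
    # Population standard deviation of the whole sample.
    std_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return (mean1 - mean0) / std_all * math.sqrt(len(group1) * len(group0) / n**2)


# Toy data: the two unpaid loans (label 0) carry the highest interest rates.
int_rate = [0.10, 0.12, 0.11, 0.18, 0.20, 0.13]
paid = [1, 1, 1, 0, 0, 1]
r = point_biserial(int_rate, paid)  # negative, as in the table
```

With the population standard deviation, this value is identical to the Pearson correlation between the variable and the 0/1 label.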

Preprocessing

Hard features

The first step is the feature selection for the model. The hard features (the non-textual features) are then one-hot encoded where they are categorical. The data is then split into 80% for the training set, 10% for the dev set and 10% for the test set.

A resampling strategy is applied to the training set to address the class imbalance: the loans in default are over-sampled and the positive samples (the majority class) are under-sampled. Advanced techniques such as SMOTE (Chawla et al., 2002) for over-sampling or Tomek links (Tomek, 1976) for under-sampling could not be used here, because of the textual features and because I did not find a way to track the ids of the resampled loans. Since the original paper does not mention a specific technique, I decided to use both random over-sampling and random under-sampling (Brownlee, 2021).

Once the training set is balanced, the hard features are normalized by standardization. Meanwhile, the textual features are processed as described in the Textual features section. Finally, the model is trained on both the train and dev sets.
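
The split and the random resampling can be sketched on row indices, which keeps the hard features and the text of each loan aligned. This is a minimal sketch under assumptions, not the repository code: the function names, the target size (both classes brought to half of the training set) and the label coding (1 for default) are hypothetical.

```python
import random


def split_indices(n, seed=0):
    """Shuffle the indices 0..n-1 and split them 80% / 10% / 10%."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]


def balance(train_idx, labels, seed=0):
    """Randomly over-sample defaults (1) and under-sample the other class (0).

    Assumes the defaults are the minority and fewer than half of the training set.
    """
    rng = random.Random(seed)
    minority = [i for i in train_idx if labels[i] == 1]
    majority = [i for i in train_idx if labels[i] == 0]
    target = (len(minority) + len(majority)) // 2  # hypothetical target size per class
    over = minority + rng.choices(minority, k=target - len(minority))  # with replacement
    under = rng.sample(majority, target)  # without replacement
    balanced = over + under
    rng.shuffle(balanced)
    return balanced


# Toy labels with roughly the imbalance of the dataset (15% defaults).
labels = [1] * 15 + [0] * 85
train_idx, dev_idx, test_idx = split_indices(len(labels))
balanced_idx = balance(train_idx, labels)
```

Only the training indices are resampled; the dev and test sets keep their original class distribution.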

Textual features

The preprocessing of the textual features is done in two steps: first the segmentation, then the conversion of the segmented words into embeddings.

Segmentation

Following the paper, I used the CoreNLP model from Manning et al. (2014) to segment the textual features. However, the technology has evolved since then and has been incorporated into the Python package stanza. So instead of using the Java server provided by the authors of CoreNLP, I chose to use stanza.Pipeline, which is better suited to Python execution. Moreover, the authors of the paper I tried to replicate do not mention which options (called annotations) of the CoreNLP.Pipeline they used, so I felt free to try several of them. I also added an option to prune the words included in the nltk.stopwords set. The results are given in the Results section.
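
The effect of the stopword option can be illustrated without any dependency. In this sketch the regular-expression tokenizer is only a stand-in for the stanza pipeline, the small `STOPWORDS` set is a stand-in for the nltk list, and the lowercasing is an assumption; none of these names come from the repository.

```python
import re

# Tiny stand-in for the nltk English stopword list (assumption for illustration).
STOPWORDS = {"i", "a", "an", "the", "to", "my", "and", "of", "for", "this", "will"}


def segment(text, remove_stopwords=False):
    """Lowercase, split into word tokens, and optionally drop the stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


desc = "I will use this loan to pay off my credit cards"
with_stopwords = segment(desc)
without_stopwords = segment(desc, remove_stopwords=True)
# without_stopwords is ['use', 'loan', 'pay', 'off', 'credit', 'cards']
```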

Word embedding

After tokenizing the words, I used the pre-trained GloVe word vectors, from Pennington et al. (2014), to create the embeddings of the tokens, as in the original paper. The glove.6B release provides pre-trained vectors in 50, 100, 200 and 300 dimensions. As described in The model section, and since my goal is to replicate the best model from the paper, the multi-head attention mechanism, from Vaswani et al. (2017), uses 8 heads. The embedding dimension has to be divisible by the number of heads, and the article does not mention the dimension of the Transformer Encoder (TE) model, so I chose the 200-dimension vectors, the only one of these sizes divisible by 8. The implementation itself was done with spacy.
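
As an illustration of this step, the sketch below parses vectors in the GloVe text format (one token per line, followed by its components) and checks which embedding sizes are compatible with 8 attention heads. It does not use spacy and is not the repository code: the function names, the toy 4-dimensional vectors and the zero vector for unknown tokens are assumptions.

```python
def load_glove(lines):
    """Parse the GloVe text format: a token followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def embed(tokens, vectors, dim):
    """Map tokens to vectors, with a zero vector for out-of-vocabulary tokens."""
    return [vectors.get(t, [0.0] * dim) for t in tokens]


# Toy 4-dimensional vectors standing in for the 200d file.
toy_lines = ["loan 0.1 0.2 0.3 0.4", "credit -0.5 0.0 0.5 1.0"]
vectors = load_glove(toy_lines)
matrix = embed(["loan", "unknownword", "credit"], vectors, dim=4)

# Multi-head attention splits the embedding evenly across the heads,
# so the dimension must be divisible by the number of heads.
compatible = [d for d in (50, 100, 200, 300) if d % 8 == 0]  # [200]
```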

The model

The base of the model is a Transformer Encoder (TE) from Vaswani et al. (2017), into which the processed textual features are fed. As in the original paper, the TE is composed of 1 layer with 8 heads for the multi-head attention mechanism, and its feed-forward network (FFN) has 50 neurons with the rectified linear unit (ReLU) as activation function. Since the TE is used here as a feature extractor, only the first element of its output sequence is used. It is concatenated with the hard features and fed into a feed-forward neural network (FNN) with 2 hidden layers of 10 neurons each, separated by a ReLU activation function. The output layer consists of two neurons passed through a Softmax layer.

The model is trained with a binary cross-entropy loss function and the Adam optimizer with a learning rate of 0.0001 and a weight decay of 0.0001 to avoid over-fitting. An early-stopping strategy is applied during training. The model is evaluated on the test set using the ROC-AUC and G-MEAN metrics, described in the Metrics section.

A dropout rate of 0.3 (the value is not specified in the article) is used in the TE and after each layer of the FNN to avoid overfitting.
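
The architecture described above can be sketched in PyTorch as follows. This is a minimal sketch, not the code of the repository: the class name `TextCreditModel`, the number of hard features (`n_hard=30`) and the returned raw logits are assumptions, and positional encoding and padding handling are left out because they are not described here.

```python
import torch
from torch import nn


class TextCreditModel(nn.Module):
    """One-layer Transformer Encoder as text feature extractor, followed by a small FNN."""

    def __init__(self, embed_dim=200, n_hard=30, n_heads=8, ffn_dim=50,
                 hidden=10, dropout=0.3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dim_feedforward=ffn_dim,
            dropout=dropout, activation="relu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim + n_hard, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, 2),  # two output neurons; softmax is applied outside
        )

    def forward(self, embeddings, hard):
        encoded = self.encoder(embeddings)   # (batch, seq_len, embed_dim)
        text_repr = encoded[:, 0, :]         # first element of the output sequence
        return self.classifier(torch.cat([text_repr, hard], dim=1))


model = TextCreditModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Forward pass on random inputs: 4 loans, 12 tokens each, 30 hard features.
model.eval()
with torch.no_grad():
    logits = model(torch.randn(4, 12, 200), torch.randn(4, 30))
probs = torch.softmax(logits, dim=1)  # one probability per class
```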

Metrics

The model is evaluated on the test set using the ROC-AUC and G-MEAN metrics. The ROC-AUC is the area under the receiver operating characteristic curve, which is a plot of the true positive rate against the false positive rate. The G-MEAN is the geometric mean of the true positive rate and the true negative rate. The ROC-AUC measures the model's ability to distinguish between the positive and negative classes, while the G-MEAN measures its ability to balance the true positive and true negative rates. Both are effective metrics for imbalanced data sets (Gong et al., 2017).

The G-mean is defined as follows:

$$G_{mean} = \sqrt{\left(\frac{TP}{TP + FN}\right)\left(\frac{TN}{TN + FP}\right)}$$
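
Both metrics can be computed by hand on a small example. The sketch below is for illustration only (the function names, the toy scores and the 0.5 decision threshold are assumptions). The ROC-AUC is computed through its rank interpretation, the probability that a random positive sample receives a higher score than a random negative one, which is quadratic in the number of samples but easy to check.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of the true positive rate and the true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count for 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(y_true, scores)  # 12 of the 15 pairs are ranked correctly: 0.8
gm = g_mean(y_true, y_pred)    # sqrt((2/3) * (3/5)), about 0.632
```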

Results

Measures	TE from the paper	TE with stopwords	TE without stopwords
ROC-AUC (%)	70.30	65.56	65.91
G-MEAN (%)	65.32	66.10	66.48

Installation

Clone the repository:

git clone git@github.com:art-test-stack/credit_risk_eval_model.git

Navigate to the project directory:
```
cd credit_risk_eval_model
```
Create a virtual environment

For example, I use virtualenv:
```
virtualenv -p python3.10 venv
```
Install the Python dependencies:
```
pip install -r requirements.txt
```
Download the GloVe vectors from the GloVe website and put the 200d model file in the model/glove.6B/ folder (more details in that README.md).
Download the LendingClub dataset from Kaggle and put the .csv file in the data/ folder.

Warning

Make sure that the .csv file name matches the one in the variable LOANS_FILE from utils.py.

Usage

Once all the installation steps are done, you can preprocess the data, then train and evaluate the model by running the following command:

python main.py

You can add arguments to the command line to change the model's parameters. For example, if you have already run the preprocessing and want to train another model on the same data, you can skip it by running:

python main.py --skip_preprocessing

Here is the list of the available arguments:

--no_stopwords (flag): Disable the stopwords from the nltk package.
--not_balance_dataset (flag): Do not balance the training set.
--skip_preprocessing (flag): Skip the preprocessing step.
--not_verbose (flag): Disable the printed information during training.
--epochs: Number of epochs to train the model. Default: 500
--batch_size: Batch size for training. Default: 1024
--dropout: Dropout rate for the model. Default: 0.3
--early_stopping_patience: Number of epochs with no improvement after which training will be stopped. Default: 100
--early_stopping_min_delta: Minimum value to improve model in early stopping. Default: 0.0001

--model_name: Name of the trained model to find its .pt file and its training loss. Default: 'model.pt'
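
For readers who want to reuse these options in their own script, the list above maps naturally onto argparse. This is a hypothetical sketch and not the parser of main.py: the function name `build_parser` and the argument types are assumptions, while the option names and defaults are taken from the list above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (the types are assumptions)."""
    parser = argparse.ArgumentParser(description="Preprocess, train and evaluate the model")
    parser.add_argument("--no_stopwords", action="store_true")
    parser.add_argument("--not_balance_dataset", action="store_true")
    parser.add_argument("--skip_preprocessing", action="store_true")
    parser.add_argument("--not_verbose", action="store_true")
    parser.add_argument("--epochs", type=int, default=500)
    parser.add_argument("--batch_size", type=int, default=1024)
    parser.add_argument("--dropout", type=float, default=0.3)
    parser.add_argument("--early_stopping_patience", type=int, default=100)
    parser.add_argument("--early_stopping_min_delta", type=float, default=0.0001)
    parser.add_argument("--model_name", type=str, default="model.pt")
    return parser


# Parse an explicit argument list instead of the real command line.
args = build_parser().parse_args(["--skip_preprocessing", "--epochs", "50"])
```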

References

Zhang, et al. (2020). "Credit risk evaluation model with textual features from loan descriptions for P2P lending." Electronic Commerce Research and Applications, 39, 100989.
Milad Malekipirbazari and Vural Aksakalli. (2015). "Risk assessment in social lending via random forests." Expert Systems with Applications, 42(10), 4621–4631.
Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research, 16, 321–357.
Ivan Tomek. (1976). "Two modifications of CNN." IEEE Transactions on Systems, Man, and Cybernetics, 6, 769–772.
Jason Brownlee. (2021). "Random Oversampling and Undersampling for Imbalanced Classification."
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. (2014). "The Stanford CoreNLP Natural Language Processing Toolkit." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. (2014). "GloVe: Global Vectors for Word Representation."
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. (2017). "Attention is All You Need." Advances in Neural Information Processing Systems, 30, 5998–6008.
Joonho Gong and Hyunjoong Kim. (2017). "RHSBoost: Improving classification performance in imbalance data." Computational Statistics & Data Analysis, 111, 1–13.

License

This project is licensed under the MIT License - see the LICENSE file for details.
