
Credit risk evaluation model with textual features

Description

This repository reimplements the results of the paper "Credit risk evaluation model with textual features from loan descriptions for P2P lending", Zhang et al. (2020). It was made as part of my Master's thesis research at NTNU (Uncertainty Quantification for Language Models in Safety-Critical Systems). The model's specifics are given below.

Note

Not everything has been done exactly as in the paper, sometimes because the technologies have improved and sometimes for lack of knowledge.

Dataset

The dataset consists of LendingClub loan data from the American market [2].

Since LendingClub no longer provides this data, the dataset used here was taken from Kaggle.

This section presents the dataset used, describes the feature selection (as in the original paper), and compares its characteristics to those of the dataset used in the paper.

Definitions and Descriptions of Variables from LendingClub

This section gives the variable names and their descriptions. A few variables were not in the original dataset and have been computed from existing variables; the code for this process can be found in src/data/features.py.

| Variable Name | Data Type | Description | Code Variable Name / Computation |
|---|---|---|---|
| **Target variable** | | | |
| Loan status | Categorical | Whether or not the borrower has paid the loan in full | `loan_status` |
| **Loan characteristics** | | | |
| Loan amount | Numerical (log) | The total amount committed to the loan | `log(loan_amnt)` |
| Term | Categorical | Number of payments on the loan; either 36 or 60 | `term` |
| Interest rate | Numerical | Nominal interest rate on the loan | `int_rate` |
| Loan purpose | Categorical | Purpose of the loan request; 13 purposes included | `purpose` |
| **Creditworthiness features** | | | |
| FICO score | Numerical | Borrower's FICO score at loan origination | `(fico_range_high + fico_range_low) / 2` |
| Credit level | Categorical | Internal credit risk level assigned by the platform | `grade` |
| Inquiries last 6 months | Numerical | Number of inquiries within the last 6 months | `inq_last_6mths` |
| Revolving utilization rate | Numerical | Amount of credit the borrower is using relative to all available revolving credit | `revol_util` |
| Delinquency 2 years | Numerical | Number of delinquencies within the past 2 years | `delinq_2yrs` |
| Public record | Numerical | Number of derogatory public records | `pub_rec` |
| Open credit lines | Numerical | Number of open credit lines | `open_acc` |
| Revolving income ratio | Numerical | Ratio of the revolving line to monthly income | `revol_bal / (annual_inc / 12)` [*] |
| Total account | Numerical | Total number of credit lines currently on file | `total_acc` |
| Credit age | Numerical | Number of months from the time the borrower opened his or her first credit line to the loan request | `issue_d - months_since_earliest_cr_line` |
| **Solvency features** | | | |
| Annual income | Numerical (log) | Self-reported annual income provided by the borrower | `log(annual_inc)` |
| Employment length | Numerical | Employment length in years | `emp_length`, converted to months |
| Home ownership | Categorical | Home ownership status provided by the borrower | `home_ownership` |
| Income verification | Categorical | Status of income verification: verified, source verified, or not verified | `verification_status` |
| DTI | Numerical | Debt-to-income ratio | `dti` |
| **Description feature** | | | |
| Description length | Numerical | The length of the loan description | `desc.length()` |

[*]: Malekipirbazari and Aksakalli (2015)
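For illustration, here is a minimal pandas sketch of how the computed variables above could be derived. Column names follow the LendingClub/Kaggle schema; the actual code in src/data/features.py may differ.

```python
import numpy as np
import pandas as pd

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    # FICO score: midpoint of the reported range
    df["fico_score"] = (df["fico_range_high"] + df["fico_range_low"]) / 2
    # Revolving income ratio: revolving balance over monthly income [*]
    df["revol_income_ratio"] = df["revol_bal"] / (df["annual_inc"] / 12)
    # Credit age: months between the first credit line and the loan issue date
    issue = pd.to_datetime(df["issue_d"], format="%b-%Y")
    earliest = pd.to_datetime(df["earliest_cr_line"], format="%b-%Y")
    df["credit_age"] = (issue.dt.year - earliest.dt.year) * 12 + (issue.dt.month - earliest.dt.month)
    # Log transforms used by the model
    df["log_loan_amnt"] = np.log(df["loan_amnt"])
    df["log_annual_inc"] = np.log(df["annual_inc"])
    return df
```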

Data selection

The data selection has been done as closely as possible to that of the original paper.

Hence, as mentioned in the article, I used the loans issued during the period 2007–2014. After a data cleaning step (discarding loans with missing values or with a description of fewer than 20 words), I obtained 69,276 loan records (70,488 in the paper), of which 10,151 (against 10,534), i.e. 14.65% (against 14.94%), ultimately ended in a default status.
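A sketch of this selection step is given below. The file name is hypothetical, and I assume "Charged Off" marks the default status; the cleaning rules themselves are those stated above.

```python
import pandas as pd

# Hypothetical file name; see the Installation section for where the CSV goes
df = pd.read_csv("data/loans.csv", low_memory=False)

# Keep loans issued between 2007 and 2014
issue_year = pd.to_datetime(df["issue_d"], format="%b-%Y").dt.year
df = df[(issue_year >= 2007) & (issue_year <= 2014)]

# Keep terminated loans only and binarize the target
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])]
df["default"] = (df["loan_status"] == "Charged Off").astype(int)

# Discard loans with missing values in the selected columns
# (subset shown; the full list is in the variable table above)
selected = ["loan_amnt", "term", "int_rate", "purpose", "annual_inc", "dti", "desc"]
df = df.dropna(subset=selected)

# Discard loans whose description is shorter than 20 words
df = df[df["desc"].str.split().str.len() >= 20]
```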

Comparison table: summary statistics of loan description length

Paper samples

| Status | Size | Mean | Std | Q1 | Q2 | Q3 |
|---|---|---|---|---|---|---|
| Paid | 59,954 | 56.69 | 45.80 | 30 | 45 | 63 |
| Default | 10,534 | 55.78 | 40.52 | 29 | 44 | 63 |
| All | 70,488 | 56.56 | 40.76 | 29 | 45 | 63 |

Samples processed

| Status | Size | Mean | Std | Q1 | Q2 | Q3 |
|---|---|---|---|---|---|---|
| Paid | 58,742 | 60.96 | 60.95 | 32 | 47 | 64 |
| Default | 10,151 | 61.52 | 58.24 | 21 | 46 | 64 |
| All | 69,276 | 61.04 | 54.64 | 31 | 47 | 64 |

Comparison of the cross table on continuous variables for LendingClub data (paper's vs. processed data)

| Variable | Default (Mean) | Default (Median) | Default (Std) | Paid off (Mean) | Paid off (Median) | Paid off (Std) | Pbc |
|---|---|---|---|---|---|---|---|
| Loan amount (log) | 9.444 / 9.447 | 9.556 / 9.575 | 0.651 / 0.648 | 9.346 / 9.341 | 9.393 / 9.393 | 0.652 / 0.654 | −0.053 / −0.057 |
| Interest rate | 0.154 / 0.155 | 0.153 / 0.153 | 0.042 / 0.043 | 0.128 / 0.128 | 0.125 / 0.125 | 0.042 / 0.042 | −0.216 / −0.217 |
| Annual income (log) | 10.956 / 10.969 | 10.951 / 10.954 | 0.499 / 0.491 | 11.046 / 11.056 | 11.035 / 11.051 | 0.513 / 0.509 | 0.062 / 0.061 |
| DTI | 0.173 / 0.172 | 0.173 / 0.172 | 0.075 / 0.074 | 0.159 / 0.158 | 0.157 / 0.156 | 0.075 / 0.074 | −0.064 / −0.066 |
| FICO score | 695.500 / 696.006 | 687.000 / 692.000 | 28.238 / 28.278 | 706.960 / 707.796 | 702.000 / 702.000 | 33.667 / 33.747 | 0.123 / 0.125 |
| Delinquency in 2 years | 0.190 / 0.211 | 0.000 / 0.000 | 0.490 / 0.660 | 0.170 / 0.189 | 0.000 / 0.000 | 0.470 / 0.603 | −0.012 / −0.013 |
| Open credit lines | 10.870 / 10.817 | 10.000 / 10.000 | 4.760 / 4.713 | 10.580 / 10.531 | 10.000 / 10.000 | 4.615 / 4.559 | −0.022 / −0.022 |
| Inquiries last 6 months | 0.960 / 0.999 | 1.000 / 1.000 | 1.023 / 1.122 | 0.770 / 0.796 | 0.000 / 0.000 | 0.955 / 1.034 | −0.069 / −0.068 |
| Public records | 0.100 / 0.088 | 0.000 / 0.000 | 0.320 / 0.331 | 0.080 / 0.070 | 0.000 / 0.000 | 0.294 / 0.292 | −0.020 / −0.022 |
| Revolving to income | 2.940 / 2.935 | 2.499 / 2.496 | 2.289 / 2.290 | 2.736 / 2.732 | 2.297 / 2.294 | 2.205 / 2.199 | −0.033 / −0.033 |
| Revolving utilization | 0.600 / 0.602 | 0.630 / 0.634 | 0.239 / 0.241 | 0.546 / 0.547 | 0.569 / 0.572 | 0.252 / 0.254 | −0.076 / −0.077 |
| Total account | 23.860 / 23.681 | 22.000 / 22.000 | 11.132 / 11.110 | 24.110 / 23.900 | 23.000 / 22.000 | 11.257 / 11.161 | 0.008 / 0.007 |
| Credit age | 174.440 / 170.694 | 159.000 / 156.000 | 81.699 / 78.499 | 177.400 / 174.493 | 163.000 / 160.000 | 81.983 / 79.591 | 0.013 / 0.017 |
| Description length | 55.770 / 61.520 | 44.000 / 46.000 | 40.523 / 58.236 | 56.690 / 60.960 | 45.000 / 47.000 | 40.799 / 53.995 | 0.008 / −0.004* |

In each cell, the data is given as {paper} / {processed}. The results can be reproduced by running read_csv.ipynb. Pbc stands for point-biserial correlation.
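The point-biserial correlation is simply the Pearson correlation between a binary and a continuous variable. A minimal check with scipy, reusing `df` from the selection sketch above (the `int_rate` column is just an example):

```python
from scipy import stats

# Pbc between the binary default indicator and the interest rate
r, p = stats.pointbiserialr(df["default"], df["int_rate"])
print(f"Interest rate: Pbc = {r:.3f} (p = {p:.1e})")
```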

Preprocessing

Hard features

The first step is the feature selection for the model. The hard features (i.e., non-textual features) are one-hot encoded. The data is then split into 80% training, 10% dev, and 10% test sets. To address the class imbalance problem, the training set is resampled: the loans in default are over-sampled and the positive samples are under-sampled. Advanced techniques such as SMOTE (Chawla et al., 2002) for over-sampling or Tomek links (Tomek, 1976) for under-sampling could not be used here because of the textual features: I did not find a way to keep track of the sample ids. Since the original paper does not mention a specific technique, I used both random over-sampling and random under-sampling (Brownlee, 2021). Once the training set is balanced, the hard features are normalized by standardization. Meanwhile, the textual features are processed as described in the Textual features section. Finally, the model is trained on both the train and dev sets.
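A sketch of this split/resample/standardize pipeline is shown below. Resampling row indices keeps the hard features and the descriptions aligned; the midpoint class-size target and the column subset are my assumptions, since the paper specifies neither.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 80/10/10 split, stratified on the target
train, rest = train_test_split(df, test_size=0.2, stratify=df["default"], random_state=0)
dev, test = train_test_split(rest, test_size=0.5, stratify=rest["default"], random_state=0)

# Random over-sampling of defaults, random under-sampling of paid loans
pos = train[train["default"] == 1]
neg = train[train["default"] == 0]
n_target = (len(pos) + len(neg)) // 2  # assumed midpoint target; the paper gives no ratio
pos_idx = rng.choice(pos.index, size=n_target, replace=True)   # over-sample
neg_idx = rng.choice(neg.index, size=n_target, replace=False)  # under-sample
train_bal = train.loc[np.concatenate([pos_idx, neg_idx])]

# Standardize the hard features, fitting on the balanced training set only
hard_cols = ["int_rate", "dti", "fico_score"]  # illustrative subset
scaler = StandardScaler().fit(train_bal[hard_cols])
X_train = scaler.transform(train_bal[hard_cols])
```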

Textual features

The textual features are preprocessed in two steps: first segmentation, then conversion of the segmented words into embeddings.

Segmentation

Following the paper, I used the CoreNLP model from Manning et al. (2014) for the segmentation of the textual features. However, the technology has evolved since then and has been incorporated into the Python package stanza. So instead of using the Java server provided by the CoreNLP authors, I chose stanza.Pipeline, which is better optimized for Python execution. Moreover, the authors of the paper I replicate do not mention which options (called annotators) of the CoreNLP pipeline they used, so I felt free to try a few different ones. I also added an option to prune the words included in the nltk stopwords set. The results are provided in the Results section.
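A minimal stanza tokenization sketch with the optional nltk stopword pruning; the choice of the tokenize processor and the lowercasing are my assumptions, since the paper does not specify them:

```python
import nltk
import stanza

nltk.download("stopwords", quiet=True)
stanza.download("en", verbose=False)

from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))
nlp = stanza.Pipeline(lang="en", processors="tokenize", verbose=False)

def segment(text: str, prune_stopwords: bool = True) -> list[str]:
    doc = nlp(text)
    tokens = [tok.text.lower() for sent in doc.sentences for tok in sent.tokens]
    if prune_stopwords:
        tokens = [t for t in tokens if t not in STOP]
    return tokens

print(segment("I need this loan to consolidate my credit card debt."))
```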

Word embedding

After tokenizing the words, I used the pre-trained GloVe word vectors from Pennington et al. (2014) to create the token embeddings, as in the original paper. GloVe provides several pre-trained models, with embedding dimensions of 100, 200, and 300. As described in the Model section, and since my goal is to replicate the best model from the paper, the multi-head attention mechanism from Vaswani et al. (2017) uses 8 heads; the embedding dimension must be divisible by the number of heads, and 200 is the only one of these sizes for which this holds, so I chose the 200-dimension model (the article does not mention the dimension used for the Transformer Encoder (TE)). The embedding lookup itself was implemented with spacy.
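Independently of spacy, the downloaded vectors can also be loaded directly from the text file; a plain-Python sketch, shown only to make the lookup concrete (out-of-vocabulary tokens are mapped to a zero vector, which is my assumption):

```python
import numpy as np

EMBED_DIM = 200

def load_glove(path: str = "model/glove.6B/glove.6B.200d.txt") -> dict[str, np.ndarray]:
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

glove = load_glove()

def embed(tokens: list[str]) -> np.ndarray:
    zero = np.zeros(EMBED_DIM, dtype=np.float32)  # assumed OOV handling
    return np.stack([glove.get(t, zero) for t in tokens])
```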

The model

The base of the model is a Transformer Encoder (TE) from Vaswani et al. (2017). The processed textual features are fed into the TE. As in the original paper, the TE is composed of 1 layer with 8 heads for the multi-head attention mechanism; its feed-forward network (FFN) has 50 neurons and uses the rectified linear unit (ReLU) activation function. Since the TE is used here as a feature extractor, only the first sequence position of the TE output is kept. It is concatenated with the hard features and fed into a feed-forward neural network (FNN) with 2 hidden layers of 10 neurons each, separated by ReLU activations. The output layer consists of two neurons passed through a softmax.

The model is trained with a binary cross-entropy loss and the Adam optimizer, with a learning rate of 0.0001 and a weight decay of 0.0001 to avoid over-fitting. An early-stopping strategy is applied during training. The model is evaluated on the test set using the ROC-AUC and G-mean metrics, described in the Metrics section.

A dropout rate of 0.3 (the value is not specified in the article) is used in the TE and after each layer of the FNN to avoid overfitting.
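For concreteness, a minimal PyTorch sketch of this architecture. The layer sizes follow the description above; the number of hard features (`n_hard`), the dropout placement, and the use of CrossEntropyLoss (equivalent to a two-neuron softmax output trained with cross-entropy) are my assumptions.

```python
import torch
import torch.nn as nn

class CreditRiskModel(nn.Module):
    def __init__(self, embed_dim: int = 200, n_hard: int = 30, dropout: float = 0.3):
        super().__init__()
        # 1 encoder layer, 8 heads, FFN of 50 neurons, ReLU, as in the paper
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, dim_feedforward=50,
            dropout=dropout, activation="relu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        # FNN with 2 hidden layers of 10 neurons, dropout after each layer
        self.head = nn.Sequential(
            nn.Linear(embed_dim + n_hard, 10), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(10, 10), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(10, 2),  # softmax is applied in the loss / at inference
        )

    def forward(self, text_emb: torch.Tensor, hard: torch.Tensor) -> torch.Tensor:
        # Keep only the first sequence position as the text feature
        h = self.encoder(text_emb)[:, 0, :]
        return self.head(torch.cat([h, hard], dim=1))

model = CreditRiskModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()
```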

Metrics

The model is evaluated on the test set using the ROC-AUC and G-mean metrics. The ROC-AUC is the area under the receiver operating characteristic curve, a plot of the true positive rate against the false positive rate. The G-mean is the geometric mean of the true positive rate and the true negative rate. The ROC-AUC measures the model's ability to distinguish between the positive and negative classes, while the G-mean measures its ability to balance the true positive and true negative rates. Both are effective metrics for imbalanced data sets (Gong and Kim, 2017).

The G-mean is defined as follows:

$$G_{mean} = \sqrt{\left(\frac{TP}{TP + FN}\right)\left(\frac{TN}{TN + FP}\right)}$$
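Both metrics are straightforward to compute from the test-set predictions, for instance with scikit-learn; a sketch, assuming a 0.5 decision threshold for the G-mean:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray) -> dict[str, float]:
    # ROC-AUC from the predicted default probabilities
    auc = roc_auc_score(y_true, y_prob)
    # G-mean from the thresholded predictions
    tn, fp, fn, tp = confusion_matrix(y_true, y_prob >= 0.5).ravel()
    g_mean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
    return {"roc_auc": auc, "g_mean": g_mean}
```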

Results

| Measures | TE from the paper | TE with stopwords | TE without stopwords |
|---|---|---|---|
| ROC-AUC (%) | 70.30 | 65.56 | 65.91 |
| G-MEAN (%) | 65.32 | 66.10 | 66.48 |

Installation

  1. Clone the repository:

    git clone git@github.com:art-test-stack/credit_risk_eval_model.git
  2. Navigate to the project directory:

    cd credit_risk_eval_model
  3. Create a virtual environment

    For example, using virtualenv:

    virtualenv -p python3.10 venv
  4. Install the Python dependencies:

    pip install -r requirements.txt
  5. Download GloVe from the GloVe website and put the 200d model file in the model/glove.6B/ folder (more details in its README.md).

  6. Download the LendingClub dataset from Kaggle and put the .csv file in the data/ folder.

Warning

Make sure that the .csv file name matches the one in the variable LOANS_FILE from utils.py.

Usage

Once all the installation steps are done, you can preprocess the data, then train and evaluate the model by running:

python main.py

You can then add different arguments to the command line to change the model's parameters. For example, if you have already run the preprocessing and want to train another model on the same data, you can skip it by running:

python main.py --skip_preprocessing

Here is the list of available arguments (a sketch of the corresponding parser follows the list):

  • --no_stopwords (flag): Remove the nltk stopwords during preprocessing (the "TE without stopwords" variant in Results).

  • --not_balance_dataset (flag): Do not balance the training set.

  • --skip_preprocessing (flag): Skip the preprocessing step.

  • --not_verbose (flag): Disable printing information during training.

  • --epochs: Number of epochs to train the model. Default: 500

  • --batch_size: Batch size for training. Default: 1024

  • --dropout: Dropout rate for the model. Default: 0.3

  • --early_stopping_patience: Number of epochs with no improvement after which training is stopped. Default: 100

  • --early_stopping_min_delta: Minimum improvement counted by early stopping. Default: 0.0001

  • --model_name: Name of the trained model, used to locate its .pt file and training loss. Default: 'model.pt'
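For reference, the flag list above corresponds to an argparse configuration along these lines; this is only a sketch, and the actual parser in main.py may differ:

```python
import argparse

parser = argparse.ArgumentParser(description="Credit risk evaluation model")
parser.add_argument("--no_stopwords", action="store_true", help="remove nltk stopwords")
parser.add_argument("--not_balance_dataset", action="store_true", help="do not balance the training set")
parser.add_argument("--skip_preprocessing", action="store_true", help="skip the preprocessing step")
parser.add_argument("--not_verbose", action="store_true", help="disable training logs")
parser.add_argument("--epochs", type=int, default=500)
parser.add_argument("--batch_size", type=int, default=1024)
parser.add_argument("--dropout", type=float, default=0.3)
parser.add_argument("--early_stopping_patience", type=int, default=100)
parser.add_argument("--early_stopping_min_delta", type=float, default=0.0001)
parser.add_argument("--model_name", type=str, default="model.pt")
args = parser.parse_args()
```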

References

  1. Zhang, et al. (2020). "Credit risk evaluation model with textual features from loan descriptions for P2P lending." Electronic Commerce Research and Applications, 39, 100989. pdf

  2. Malekipirbazari, M., and Aksakalli, V. (2015). "Risk assessment in social lending via random forests." Expert Systems with Applications, 42(10), 4621–4631. pdf

  3. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research, 16, 321–357.

  4. Tomek, I. (1976). "Two modifications of CNN." IEEE Transactions on Systems, Man, and Cybernetics, 6, 769–772.

  5. Brownlee, J. (2021). "Random Oversampling and Undersampling for Imbalanced Classification." Machine Learning Mastery. Retrieved from article.

  6. Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). "The Stanford CoreNLP Natural Language Processing Toolkit." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60. pdf

  7. Pennington, J., Socher, R., and Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." Proceedings of EMNLP 2014. pdf

  8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems, 30, 5998–6008. pdf

  9. Gong, J., and Kim, H. (2017). "RHSBoost: Improving classification performance in imbalance data." Computational Statistics & Data Analysis, 111, 1–13. pdf

License

This project is licensed under the MIT License - see the LICENSE file for details.
