EasyChair Preprint № 8693
Big Data and Machine Learning for Credit Risk Using PySpark
Afshin Ashofteh
1.1 Introduction
Risk management with the ability to incorporate new and Big Data sources
and benefit from emerging technologies such as cloud and parallel computing
platforms is critically important for financial service providers, supervisory
authorities, and regulators if they are to remain competitive and relevant [1].
Financial institutions’ growing interest in non-traditional data might be hypothesized to be a reaction to the most recent financial crisis.
Afshin Ashofteh
NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, Campus de Campolide, 1070-312 Lisboa, Portugal. e-mail: [email protected]
However, the financial crisis not only prompted several statutory and super-
visory initiatives that require significant disclosure of data but also provided
a positive atmosphere for taking advantage of new data sources such as
non-traditional data sets [2, 3].
This increased acceptance of non-traditional data has both supply-side and demand-side drivers. On the supply side, technological advances such as mobile phones [4], together with growing storage capacity and computing power at falling cost, have fueled the rise of new data sources. In addition,
mobile data and social data have recently been used to monitor different
risks [5]. On the demand side, loan providers are becoming more interested
in learning how data analysis may improve credit scoring and lower the risk
of default [6].
Some of the largest and most established financial institutions such as
banks, insurance companies, payday lenders, peer-to-peer lending platforms,
microfinance providers, leasing companies, and payment by installment com-
panies are now taking a fresh look at their customers’ transactional data
to enhance the early detection of fraud. They use innovative machine learn-
ing models that exploit novel data sources like Big Data, social data, and
mobile data. Credit risk management may benefit in the long run if these ad-
vancements result in better credit choices. However, there are shorter-term
hazards if early users of non-traditional data credit scoring mostly disregard
the model risk and technical aspects of new methods that might affect credit
scoring [7]. For instance, one crucial issue in credit evaluation is the class
imbalance resulting from distress situations for loan providers [8]. These distress situations are relatively infrequent events, which makes imbalanced data very common in credit scoring. In addition, the limited information available for distinguishing dynamic fraud from genuine customer behavior in a highly sparse and imbalanced environment makes default forecasting more challenging.
Even though banks and loan providers must follow various regulations to reduce or eliminate credit risk, regulatory changes can alter the microfinance environment that generates the distribution of non-traditional data. Such changes may shift the probability distribution function of credit scores over time, in which case the reliability of models built on historical data decreases dramatically. This time dependency of the training process calls for new approaches that deal with such situations and avoid interruptions of machine learning pipelines for Big Data over time [9]. These issues in credit evaluation show the importance of comparing machine learning techniques when evaluating the model risk for
credit scoring. Figure 1.1 summarizes the application of Big Data and small
data in credit risk analytics.
This paper presents greater insight into how non-traditional data in credit
scoring challenges the model risk and addresses the need to develop new credit
scoring models. The rest of the paper is structured as follows. Section 1.2 describes the PySpark code for data processing as the first step of a personal credit evaluation. Section 1.3 presents the model-building and model-evaluation methods. The results are shown in Section 1.4 for a complete personal credit evaluation. Finally, Section 1.5 contains some concluding remarks.
Fig. 1.1 Graphical summary of Big Data analytics on credit risk evaluation
Table 1.1 Loan status in the combined dataset (without duplicates), covering the personal credit history of customers, 2007-2018.

Loan status          Default (bad)   Not default (good)
Current                    -              603,273
Fully Paid                 -              331,528
In Grace Period            -                5,151
Charged Off             94,285                -
Late (31-120 days)      12,154                -
Late (16-30 days)        2,162                -
Default                     22                -
Table 1.2 Analytical base table for the Lending Club loan dataset (columns: Row, Attribute, Description, Scale).
The dataset is in CSV format, which Spark can read into a distributed data frame. Additionally, it is exportable to MLeap, a standard serializa-
tion format and execution engine for machine learning pipelines. It supports
Spark, Scikit-learn, and TensorFlow for training pipelines and exporting them
to an MLeap Bundle. As a result, if we continue working within Spark, it is straightforward to import data in batches, i.e., collections of data points grouped within a specific time interval; another term often used for this concept is a data window (see Appendix 1 – Lines 1-7).
This study started with the CSV file format. However, it is not an optimal
format for Spark. The Parquet format is a columnar data store, allowing Spark to process only the data necessary for an operation instead of reading the entire dataset. Parquet therefore gives Spark more flexibility in accessing the data and improves performance on large datasets (see Appendix 1 – Line 9).
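A minimal sketch of this step is shown below; it assumes a Databricks-style environment where the SparkSession is available as spark, and the file paths are illustrative (the exact options used in this study are in Appendix 1, Lines 1-9).

  # Read the raw CSV file into a Spark data frame (header in the first row,
  # all columns kept as strings; casting is done later).
  loan_df = (spark.read.format("csv")
             .option("header", "true")
             .option("inferSchema", "false")
             .option("sep", ",")
             .load("/FileStore/tables/loan.csv"))   # illustrative path

  # Persist the data as Parquet so that later operations read only the
  # columns they actually need.
  loan_df.write.parquet("/FileStore/tables/loan.parquet", mode="overwrite")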
Before starting to work with Spark, it is recommended first to check the SparkContext and the Spark version, and to list the tables available in the Spark catalog (see Appendix 1 – Lines 10-13).
1 See PySpark program and dataset here: https://github.com/AfshinAshofteh/creditscore pyspark.git
According to the data schema, all attributes are of type String. Before continuing with Spark, it is essential to convert the data types to numeric types because the Spark ML routines used here expect numeric inputs. That means all the relevant data frame columns must be either integers or decimals (called "Doubles" in Spark). Therefore, the .cast() method is used to convert all the numeric columns of the loan data frame (for instance, see Appendix 1 – Lines 14-17). The other attributes are treated similarly according to Table 1.3.
Additionally, the emp_length column was converted into a numeric type (see Appendix 1 – Lines 18-20), and the multiple levels of the verification_status attribute were mapped into a single level (see Appendix 1 – Line 21). Finally, the target vector default_loan was created from the loan_status feature by classifying the data into two values: users with poor credit (default), including Default, Charged Off, Late (31-120 days), and Late (16-30 days), and users with good credit (not default), including Fully Paid, Current, and In Grace Period (see Table 1.4 and Appendix 1 – Line 22).
Three new measures were created to increase the model’s accuracy and de-
crease the data dimension by removing less critical attributes according to
the ExtraTreesClassifier approach. For this purpose, we must know that Spark data frames are immutable: they cannot be changed, and columns cannot be updated in place. If we want to add a new column to a data frame, we must create a new data frame; to overwrite the original data frame, we reassign the data frame returned by the .withColumn() command.
The first measure is the length of credit in years, obtained by subtracting the year of the earliest credit line from the loan issue year (see Appendix 1 – Line 23).
The second measure is the total amount of money earned or lost per loan, showing how much of the total loan amount each person still has to repay to the bank; it is computed by subtracting the total loan payments from the total loan amount (see Appendix 1 – Line 24).
Finally, the third measure is the total loan per customer. Customers in this database could have multiple loans, so it is necessary to aggregate the loan amounts by the customers' member IDs. Then, in line with the Basel accords and routine banking risk management, the maximum and minimum amounts can be reported and reviewed by risk managers to check for concentration risk against the risk appetite statement of the financial institution (see Appendix 1 – Lines 25-29).
The primary purpose of this section is to decide how to impute the missing values and deal with outliers.
For a large-scale dataset such as this one, NULL values are to be expected, and handling them in Spark offers three options: keep, replace, or remove. Empty or N/A values cannot be passed to the modeling step in Spark because they cause training errors, so the missing data must be imputed or removed rather than kept. This PySpark code uses the fillna() command to replace missing values with the mean for continuous variables, the median for discrete ordinal ones, and the mode (the most frequent value) for nominal features. Additionally, variables with more than half of the sample missing were discarded (see example in Appendix 1 – Lines 30-34).
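A minimal sketch of this imputation logic is given below; the column names are illustrative, and the actual commands are in Appendix 1, Lines 30-34. It works on the pandas conversion of the Spark data frame used above for the null-value inspection.

  # Continuous variable: impute with the mean (column names are illustrative).
  pandas_df["annual_inc"] = pandas_df["annual_inc"].fillna(pandas_df["annual_inc"].mean())
  # Discrete ordinal variable: impute with the median.
  pandas_df["emp_length"] = pandas_df["emp_length"].fillna(pandas_df["emp_length"].median())
  # Nominal variable: impute with the mode (most frequent value).
  pandas_df["home_ownership"] = pandas_df["home_ownership"].fillna(pandas_df["home_ownership"].mode()[0])
  # Discard variables with more than half of the sample missing.
  pandas_df = pandas_df.dropna(axis=1, thresh=len(pandas_df) // 2)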
The processing of outliers in this paper follows three principles (a minimal sketch is given after the list):
1. We need to consider the reasonable data range in each attribute and delete
the sample data with outliers. This paper uses a simple subsetting for
indexing the rows with outliers, removes the outliers with an index equal
to TRUE, and checks again if the outliers are removed according to the
criteria (see example in Appendix 1 – Lines 35-36).
2. Then, this paper uses cross tables to find possible errors. Cross tables for paired attributes with min and max as aggregate functions (aggfunc = "min" and aggfunc = "max") can reveal possible errors that exceed the minimum or maximum of an attribute (see example in Appendix 1 – Line 37).
3. Finally, box plots with the interquartile rule are used to measure the spread
and variability in our dataset. According to this rule, data points below
Q1-1.5*IQR or above Q3+1.5*IQR are viewed as being too far from the
central values.
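The sketch below illustrates these three checks with hypothetical column names and cut-off values; the code actually used is in Appendix 1, Lines 35-37.

  import pandas as pd

  # 1. Range check: index the rows with outliers, remove them, and re-check.
  outlier_idx = pandas_df["annual_inc"] > 1_000_000           # illustrative cut-off
  pandas_df = pandas_df[~outlier_idx]
  print((pandas_df["annual_inc"] > 1_000_000).sum())          # should now be 0

  # 2. Cross table with min/max aggregation to expose impossible combinations.
  print(pd.crosstab(pandas_df["grade"], pandas_df["term"],
                    values=pandas_df["int_rate"], aggfunc="max"))

  # 3. Interquartile-range rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
  q1, q3 = pandas_df["int_rate"].quantile([0.25, 0.75])
  iqr = q3 - q1
  too_far = (pandas_df["int_rate"] < q1 - 1.5 * iqr) | (pandas_df["int_rate"] > q3 + 1.5 * iqr)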
Table 1.5 shows the criteria for excluding some variables or grouping certain
variables.
For example, the variable issue_d shows similar IVs for its first four categories (Aug-18, Dec-18, Oct-18, and Nov-18) and for the following five categories (Sep-18, Dec-15, Jun-18, Jul-18, and May-18). Therefore, they could be grouped into two new categories. Furthermore, the results from the IV analysis show strong prediction power for most variables (e.g., term, grade, home_ownership, verification_status, etc.), and none were transformed.
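As an illustration of how the information value behind Table 1.5 can be computed for one categorical attribute, the following sketch uses a hypothetical helper function and assumes a 0/1 default indicator column; it is not the exact code of the study.

  import numpy as np

  def information_value(df, feature, target):
      # Weight of evidence / information value of one categorical feature;
      # `target` is assumed to be 1 for default and 0 otherwise.
      grouped = df.groupby(feature)[target].agg(["sum", "count"])
      bad = grouped["sum"]
      good = grouped["count"] - grouped["sum"]
      dist_bad = bad / bad.sum()
      dist_good = good / good.sum()
      woe = np.log((dist_good + 1e-6) / (dist_bad + 1e-6))   # small constant avoids log(0)
      return ((dist_good - dist_bad) * woe).sum()

  # Example: IV of the issue date, used to decide which categories to merge.
  # information_value(pandas_df, "issue_d", "default_flag")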
When the data treatment is completed, there are 1,043,423 customers in rows
and 35 features in the dataset, including four categorical and 31 numeric at-
tributes in addition to one binary target variable with two values default and
not default. After finalizing the data treatment, the Spark Cache is used to
optimize the final dataset for iterative and interactive Spark applications and
improve job performance (see Appendix 1 – Line 43). The dataset contains 331,528 fully paid loans (see Table 1.1 and Appendix 1 – Lines 44-45). Considering the imbalance of default and non-default loans in the dataset, bad (default) customers are far fewer than good customers, which may cause prediction deviation in some modeling approaches such as logistic regression. For these models, the paper applies a paired-sample technique to the training base by randomly selecting good customers to match the total number of bad clients. This undersampling method, or an equivalent approach such as the synthetic minority oversampling technique (SMOTE), is essential to increase the efficiency of models that suffer from imbalanced datasets [10].
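A minimal sketch of this undersampling step is given below; it assumes the treated Spark data frame is called final_df, that default_loan takes the string values "true" and "false" as created above, and an illustrative random seed.

  # Split the data by class.
  bad_df  = final_df.filter(final_df.default_loan == "true")    # minority class
  good_df = final_df.filter(final_df.default_loan == "false")   # majority class

  # Sample the majority class down to (approximately) the size of the minority class.
  ratio = bad_df.count() / good_df.count()
  balanced_df = bad_df.unionByName(good_df.sample(withReplacement=False,
                                                  fraction=ratio, seed=42))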
Table 1.6 shows that the data set was randomly divided into two groups,
65% for model training (678,548 observations) and the other 35% for the test
set (364,875 observations) to apply different algorithms.
For dividing the dataset into test and train sets, the .randomSplit() function was used in PySpark, equivalent to train_test_split() in Python's scikit-learn (see Appendix 1 – Line 46). The training dataset for developing the model would normally cover twelve months, so that a full annual cycle captures the seasonality of each month. Additionally, some recent months could be reserved solely for testing the optimal model on an unseen dataset.
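A minimal sketch of the caching and splitting steps, assuming the treated data frame is named final_df (the seed is illustrative; the actual calls are in Appendix 1, Lines 43-46):

  # Cache the treated dataset so that repeated passes do not re-read the source.
  final_df.cache()

  # 65% / 35% random split into training and test sets.
  train, valid = final_df.randomSplit([0.65, 0.35], seed=42)
  print(train.count(), valid.count())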
Financial institutions must predict customers' credit risk over time with minimum model risk. Recently, machine learning models have been applied to Big Data to determine whether a person is eligible to receive a loan. However, pre-processing for data quality and finding the best hyperparameters remain challenging.
1.3.1 Method
According to the dataset, we have a credit history of the customers, and this
study tries to predict the loan status of the customers by applying statisti-
cal learning and machine learning algorithms. It helps loan providers estimate the probability of default to determine whether or not a loan should be
granted. For this purpose, this paper makes a preliminary statistical analysis
of the credit dataset. Then, different models were developed to predict the
probability of default. The models include Logistic regression, Decision tree,
Random Forest, Neural network, and Support vector machine.
Finally, the results (predictive power of models) were evaluated by evalu-
ation metrics such as the Area Under the ROC Curve (AUC) and the Mean
F1-Score. Receiver operating characteristic (ROC) curves show the statistical performance of the models. In the ROC chart, the horizontal axis represents the false positive rate (one minus the specificity), and the vertical axis shows the sensitivity. The greater the area between the curve and the baseline, the better the performance in default prediction. After investigating the characteristics of the new credit score model, the research employs the area under the ROC curve to compare the classification accuracy and evaluate how well this credit scoring model per-
forms. The F1 score, commonly used in information retrieval, measures the model's accuracy using precision (p) and recall (r). Precision is the ratio of true positives (tp) to all predicted positives, p = tp / (tp + fp), and recall is the ratio of true positives to all actual positives, r = tp / (tp + fn). The F1 score is given by

F1 = 2 (p · r) / (p + r)
The F1 metric weights recall and precision equally, and a good retrieval
algorithm will simultaneously maximize precision and recall. Thus, moder-
ately good performance on both will be favored over excellent performance
on one and poor performance on the other.
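For concreteness, a small numerical sketch of this computation with illustrative confusion-matrix counts:

  # Illustrative counts of true positives, false positives, and false negatives.
  tp, fp, fn = 900, 150, 250
  p = tp / (tp + fp)            # precision
  r = tp / (tp + fn)            # recall
  f1 = 2 * p * r / (p + r)      # harmonic mean of precision and recall
  print(round(p, 3), round(r, 3), round(f1, 3))   # 0.857 0.783 0.818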
To create a ROC plot in PySpark, we need a library that is not installed by default in the Databricks Runtime for Machine Learning. First, we have to install plotnine, a Python implementation of the ggplot2 grammar of graphics, and its dependencies (see Appendix 1 – Lines 47-48). Second, the PyPI mlflow package can be installed on the cluster to track model development and package the code into reproducible runs.
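One way to perform these installations in a Databricks notebook is sketched below (a notebook-scoped %pip install; installing the libraries on the cluster through the UI is an alternative):

  %pip install plotnine mlflow

  import mlflow
  import mlflow.spark                               # Spark flavour used to log pipeline models
  from plotnine import ggplot, aes, geom_line       # ggplot2-style plotting in Python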
This section builds and evaluates supervised models in PySpark for personal
credit rating evaluation. This paper applies the obtained dataset to Logis-
tic regression, Decision tree, Random Forest, Neural network, and Support
vector machine.
The model-building phase started with three penalized regression methods: Lasso, Ridge, and ElasticNet. These methods eliminate variables that contribute to overfitting without compromising out-of-sample accuracy. They apply L1 and L2 penalties during training and have several hyperparameters (maxIter, elasticNetParam, and regParam) that must be set to determine how much weight is given to each of the L1 and L2 penalties and, consequently, which model is fitted. The elasticNetParam for Ridge regression is 0, for LASSO it is 0.99, and for ElasticNet regression it is 0.5 (see Appendix 1 – Lines 64-66). The results from this notebook in Databricks were tracked for storing the outcomes and comparing the accuracy of the different models (see Appendix 1 – Lines 67-80). Then the logistic regression model was built (see Appendix 1 – Line 81). A pipeline was defined, which includes standardizing the data, imputing missing values, and encoding the categorical columns (see Appendix 1 – Lines 82-83). Setting up MLflow tracking of the model and its input parameters is useful for reproducibility, allowing the model to be logged and reviewed later (see Appendix 1 – Lines 84-88). Finally, the accuracy measures were calculated by logging the ROC curve (see Appendix 1 – Lines 89-93), setting the maximum-F1 threshold for predicting loan default with a balance between true positives and false positives (see Appendix 1 – Lines 94-99), scoring the customers (see Appendix 1 – Lines 100-114), and logging the results (see Appendix 1 – Lines 115-116).
A ten-fold cross-validation (leaving one subset out in turn) examines the between-sample variation of the default prediction. This paper divides the available data into ten disjoint subsets, trains the models on nine subsets, and evaluates the model-selection criterion on the tenth. This procedure is then repeated for all combinations of subsets using the Python API of Apache Spark (see Appendix 1 – Lines 117-118). Finally, this paper uses the MLflow UI built into the Community Edition of Databricks to compare the models and choose the ultimate best model. The best model might be selected with an AUC greater than a threshold (see Appendix 1 – Lines 124-127) or with the maximum AUC (see Appendix 1 – Lines 126-130). The details of the best model with maximum AUC can be inspected (see Appendix 1 – Lines 131-132), and the model can be scored with the test data (see Appendix 1 – Lines 133-134). This final model can predict the amount of money earned or lost per loan (remain = total loan amount - loan payments) and the outstanding loan balance (see Appendix 1 – Line 135).
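A minimal sketch of such a cross-validated search with Spark ML is shown below; the parameter grid and the ten-fold setting are illustrative, and the study's own code, together with the MLflow UI comparison, is in Appendix 1, Lines 117-135.

  from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
  from pyspark.ml.evaluation import BinaryClassificationEvaluator

  # Grid over the regularization hyperparameters of the logistic regression `lr`
  # defined earlier in the pipeline (values are illustrative).
  grid = (ParamGridBuilder()
          .addGrid(lr.regParam, [0.01, 0.1, 0.3])
          .addGrid(lr.elasticNetParam, [0.0, 0.5, 0.99])
          .build())

  cv = CrossValidator(estimator=pipeline,
                      estimatorParamMaps=grid,
                      evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                      numFolds=10, seed=42)
  cv_model = cv.fit(train)
  print(max(cv_model.avgMetrics))   # best cross-validated AUC over the grid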
As a result, the Ridge method shows better performance than Lasso. The logistic regression in this paper is therefore based on the Ridge penalty, with elasticNetParam = 0 and regParam = 0.3 as the best hyperparameters.
The results show that A-grade loans have the lowest interest rate because of the minimum evaluated risk for these customers. A significant share of loans is allocated to grade A and B customers, with the minimum interest rate and minimum risk of default. The volume then declines across grades D, E, F, and G because banks typically apply criteria to reject high-risk applications. The optimal cut-off for the logistic regression is 0.167.
This study finds a high false negative rate in every approach; this rate represents an unexpected loss for the bank. The false positive rate also implies a loss on the bank's balance sheet, since it prevents new business from growing. These two rates are summarized in the F1 score, indicating a trade-off between false positives and false negatives. As a trade-off between
model sensitivity and specificity, AUC in Table 1.7 shows almost the same
performance among the logistic regression, decision tree, and random forest.
However, the logistic regression model obtained a higher F1 score (i.e., 0.815)
than the decision tree and random forest, with F1 scores of 0.766 and 0.577,
respectively (see Appendix 2). Overall, the logistic regression performs the
best, and the support vector machine performs the worst with three times
more training time compared to other algorithms.
1.5 Conclusion
Acknowledgment
The author of this paper would like to thank José L. CERVERA-FERRI
(CEO of DevStat) for his invitation to CARMA 2018 (International Con-
ference on Advanced Research Methods and Analytics) at the Polytechnic
University of Valencia, which motivated this research.
Appendix 1
1. file_location = "/FileStore/tables/loan.CSV"
2. file_location = "/FileStore/tables/loan_complete.CSV"
3. file_type = "CSV"
4. infer_schema = "false"
5. first_row_is_header = "true"
6. delimiter = ","
7. loan_df = spark.read.format(file_type).option("inferSchema", infer_schema).option("header", first_row_is_header).option("sep", delimiter).load(file_location)
8. print(" >>>>>>> " + str(loan_df.count()) + " loans opened in this data set!")
9. loan_df.write.parquet("AA_DFW_ALL.parquet", mode="overwrite")
10. print(sc)
11. print(sc.version)
12. spark.catalog.listTables()
13. display(loan_df)
14. loan_df = loan_df.withColumn("loan_amnt", loan_df.loan_amnt.cast("integer"))\
15. .withColumn("int_rate", regexp_replace("int_rate", "%", "").cast("float"))\
16. .withColumn("revol_util", regexp_replace("revol_util", "%", "").cast("float"))\
17. .withColumn("issue_year", substring(loan_df.issue_d, 5, 4).cast("double"))
18. loan_df = loan_df.withColumn("emp_length", trim(regexp_replace(loan_df.emp_length, "([ ]*+[a-zA-Z].*)|(n/a)", "")))
19. loan_df = loan_df.withColumn("emp_length", trim(regexp_replace(loan_df.emp_length, "< 1", "0")))
20. loan_df = loan_df.withColumn("emp_length", trim(regexp_replace(loan_df.emp_length, "10\\+", "10")).cast("float"))
21. loan_df = loan_df.withColumn("verification_status", trim(regexp_replace(loan_df.verification_status, "Source Verified", "Verified")))
22. loan_df = loan_df.filter(loan_df.loan_status.isin(["Default", "Charged Off", "Late (31-120 days)", "Late (16-30 days)", "Fully Paid", "Current"])).withColumn("default_loan", (~(loan_df.loan_status.isin(["Fully Paid", "In Grace Period", "Current"]))).cast("string"))
23. loan_df = loan_df.withColumn("credit_length_in_years", (loan_df.issue_year - loan_df.earliest_year))
24. loan_df = loan_df.withColumn("remain", round(loan_df.loan_amnt - loan_df.total_pymnt, 2))
25. customer_df = loan_df.groupBy("member_id").agg(f.sum("loan_amnt").alias("sumLoan"))
26. loan_max_df = customer_df.agg({"sumLoan": "max"}).collect()[0]
27. customer_max_loan = loan_max_df["max(sumLoan)"]
28. print(customer_df.agg({"sumLoan": "max"}).collect()[0], customer_df.agg({"sumLoan": "min"}).collect()[0])
29. print(customer_df.filter("sumLoan = " + str(customer_max_loan)).collect())
30. pandas_df = loan_intrate_income.toPandas()
31. null_columns = pandas_df.columns[pandas_df.isnull().any()]
32. pandas_df[null_columns].isnull().sum()
74. [imputers] + \
75. [VectorAssembler(inputCols=featureCols, outputCol="features"), \
76. StringIndexer(inputCol=labelCol, outputCol="label")]
77. scaler = StandardScaler(inputCol="features",
78. outputCol="scaledFeatures",
79. withStd=True,
80. withMean=True)
81. lr = LogisticRegression(maxIter=maxIter, elasticNetParam=elasticNetParam, regParam=regParam, featuresCol="scaledFeatures")
82. pipeline = Pipeline(stages=model_matrix_stages + [scaler] + [lr])
83. glm_model = pipeline.fit(train)
84. mlflow.log_param("algorithm", "SparkML GLM regression")  # put a name for the algorithm
85. mlflow.log_param("regParam", regParam)
86. mlflow.log_param("maxIter", maxIter)
87. mlflow.log_param("elasticNetParam", elasticNetParam)
88. mlflow.spark.log_model(glm_model, "glm_model")  # log the model
89. lr_summary = glm_model.stages[len(glm_model.stages) - 1].summary
90. roc_pd = lr_summary.roc.toPandas()
91. fpr = roc_pd["FPR"]
92. tpr = roc_pd["TPR"]
93. roc_auc = metrics.auc(roc_pd["FPR"], roc_pd["TPR"])
94. fMeasure = lr_summary.fMeasureByThreshold
95. maxFMeasure = fMeasure.groupBy().max("F-Measure").select("max(F-Measure)").head()
96. maxFMeasure = maxFMeasure["max(F-Measure)"]
97. fMeasure = fMeasure.toPandas()
98. bestThreshold = float(fMeasure[fMeasure["F-Measure"] == maxFMeasure]["threshold"])
99. lr.setThreshold(bestThreshold)
100. def extract(row):
101. return (row.remain,) + tuple(row.probability.toArray().tolist()) + (row.label,) + (row.prediction,)
102. def score(model, data):
103. pred = model.transform(data).select("remain", "probability", "label", "prediction")
104. pred = pred.rdd.map(extract).toDF(["remain", "p0", "p1", "label", "prediction"])
105. return pred
106. def auc(pred):
107. metric = BinaryClassificationMetrics(pred.select("p1", "label").rdd)
108. return metric.areaUnderROC
109. glm_train = score(glm_model, train)
110. glm_valid = score(glm_model, valid)
111. glm_train.registerTempTable("glm_train")
Appendix 2
The first branch of the decision tree shows that if the value of out_prncp is greater than 0.01, the application automatically receives a default probability of 0.03%. When an applicant does not meet the demanded value, the proposal should be checked for total payment against a limit of 5,000. Finally, the branches show how the application should go through the confirmation process.
References
1. C. Onay and E. Öztürk, "A review of credit scoring research in the age of Big Data," J. Financ. Regul. Compliance, vol. 26, no. 3, pp. 382–405, Jul. 2018.
2. A. Ashofteh, "Mining Big Data in statistical systems of the monetary financial institutions (MFIs)," in International Conference on Advanced Research Methods and Analytics (CARMA), 2018, doi: 10.4995/carma2018.2018.8570.
3. M. Óskarsdóttir, C. Bravo, C. Sarraute, J. Vanthienen, and B. Baesens, "The value of big data for credit scoring: Enhancing financial inclusion using mobile phone data and social network analytics," Appl. Soft Comput. J., vol. 74, pp. 26–39, Jan. 2019.
4. J. S. Pedro, D. Proserpio, and N. Oliver, "MobiScore: Towards universal credit scoring from mobile phone data," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2015, vol. 9146, pp. 195–207.
5. D. Björkegren and D. Grissen, "Behavior revealed in mobile phone usage predicts credit repayment," arXiv, 09-Dec-2017.