Machine Learning for Credit Risks Data

Table of Contents

  1. Background
  2. Resampling Data
  3. Ensemble Learning

Background

  • Mortgages, student loans, auto loans, and debt consolidation are just a few examples of credit and loans that people seek online.

  • Peer-to-peer lending services such as Loans Canada and Mogo let investors loan people money without using a bank.

  • However, because investors always want to mitigate risk, a client has asked that you help them predict credit risk with machine learning techniques.

  • In this project you will build and evaluate several machine learning models to predict credit risk using data you'd typically see from peer-to-peer lending services.

  • Credit risk is an inherently imbalanced classification problem (the number of good loans far exceeds the number of at-risk loans), so you will need to employ different techniques for training and evaluating models with imbalanced classes; the sketch at the end of this section shows why plain accuracy is misleading here.

  • You will use the imbalanced-learn and Scikit-learn libraries to build and evaluate models using the following two techniques:
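
  • For a feel of why this matters, here is a minimal sketch (synthetic labels, not the project data) showing how plain accuracy misleads on imbalanced classes:

    from sklearn.metrics import accuracy_score, balanced_accuracy_score

    # Synthetic example: 95 good loans for every 5 at-risk loans
    y_true = ["low_risk"] * 95 + ["high_risk"] * 5

    # A useless model that always predicts the majority class
    y_pred = ["low_risk"] * 100

    print(accuracy_score(y_true, y_pred))           # 0.95 -- looks excellent
    print(balanced_accuracy_score(y_true, y_pred))  # 0.50 -- chance level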

Resampling Data

  • Use the imbalanced-learn library to resample the LendingClub data, then build and evaluate logistic regression classifiers on the resampled data. The imports these snippets rely on are listed just below.

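  • The snippets in this section and the next assume roughly the following imports (a reconstruction for reference; the original notebook may group them differently):

    from pathlib import Path
    from collections import Counter

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score, confusion_matrix

    from imblearn.metrics import classification_report_imbalanced
    from imblearn.over_sampling import RandomOverSampler, SMOTE
    from imblearn.under_sampling import ClusterCentroids
    from imblearn.combine import SMOTEENN
    from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
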
  • To begin:

    • Read the CSV into a DataFrame.
      # Load the data
      file_path = Path('Resources/lending_data.csv')
      df = pd.read_csv(file_path)
    • Split the data into Training and Testing sets.
      # Create our features
      X = df.copy()
      X.drop(columns=["loan_status", "homeowner"], inplace=True)
      
      # Create our target
      y = df["loan_status"]
      
      # Create X_train, X_test, y_train, y_test
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
    • Scale the training and testing data using the StandardScaler from sklearn.preprocessing.
      # Create the StandardScaler instance
      scaler = StandardScaler()
      
      # Fit the Standard Scaler with the training data
      # When fitting scaling functions, only train on the training dataset
      X_scaler = scaler.fit(X_train)
      
      # Scale the training and testing data
      X_train_scaled = X_scaler.transform(X_train)
      X_test_scaled = X_scaler.transform(X_test)
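
    • Note that the snippets below fit the logistic regression on the unscaled features. If you want the scaling applied consistently at fit and predict time, a Pipeline is one option (a sketch, not part of the original notebook):

      from sklearn.pipeline import make_pipeline

      # Bundling the scaler and the classifier applies the same transform at
      # fit and predict time, avoiding train/test leakage by construction
      pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs', random_state=1))
      pipe.fit(X_train, y_train)
      y_pred_pipe = pipe.predict(X_test)
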
  • Use the provided code to run a Simple Logistic Regression:

    • Fit the logistic regression classifier.
      # Train the Logistic Regression model
      lr_model = LogisticRegression(solver='lbfgs', random_state=1)
      lr_model.fit(X_train, y_train)
    • Calculate the balanced accuracy score.
      # Calculate the balanced accuracy score
      y_pred_lr = lr_model.predict(X_test)
      lr_score = balanced_accuracy_score(y_test, y_pred_lr)
      lr_score
    • Display the confusion matrix.
      # Display the confusion matrix
      confusion_matrix(y_test, y_pred_lr)
    • Print the imbalanced classification report.
      # Print the imbalanced classification report
      print(classification_report_imbalanced(y_test, y_pred_lr))
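
    • In the report, each row's geo column is the geometric mean of sensitivity and specificity; for a binary problem, imbalanced-learn can also compute this value directly, which is handy for the questions below (a short sketch using the predictions above):

      from imblearn.metrics import geometric_mean_score

      # Geometric mean of recall across the two classes
      print(geometric_mean_score(y_test, y_pred_lr))
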
  • Next you will:

    • Oversample the data using the Naive Random Oversampler algorithm.

      # Resample the training data with the RandomOverSampler
      ros = RandomOverSampler(random_state=1)
      X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
      
      # View the count of target classes with Counter
      Counter(y_resampled)
      
      # Train the Logistic Regression model using the resampled data
      ros_model = LogisticRegression(solver='lbfgs', random_state=1)
      ros_model.fit(X_resampled, y_resampled)
      
      # Calculate the balanced accuracy score on the test data
      y_pred_ros = ros_model.predict(X_test)
      ros_score = balanced_accuracy_score(y_test, y_pred_ros)
      ros_score
      
      # Display the confusion matrix
      confusion_matrix(y_test, y_pred_ros)
      
      # Print the imbalanced classification report
      print(classification_report_imbalanced(y_test, y_pred_ros))
    • Oversample the data using the SMOTE algorithm.

      # Resample the training data with SMOTE
      X_resampled, y_resampled = SMOTE(random_state=1, sampling_strategy=1.0).fit_resample(X_train, y_train)
      
      # Train the Logistic Regression model using the resampled data
      smote_model = LogisticRegression(solver='lbfgs', random_state=1)
      smote_model.fit(X_resampled, y_resampled)
      
      # Calculate the balanced accuracy score on the test data
      y_pred_smote = smote_model.predict(X_test)
      smote_score = balanced_accuracy_score(y_test, y_pred_smote)
      smote_score
      
      # Display the confusion matrix
      confusion_matrix(y_test, y_pred_smote)
      
      # Print the imbalanced classification report
      print(classification_report_imbalanced(y_test, y_pred_smote))
    • Undersample the data using the Cluster Centroids algorithm.

      # Resample the data using the ClusterCentroids resampler
      cc = ClusterCentroids(random_state=1)
      X_resampled, y_resampled = cc.fit_resample(X_train, y_train)
      
      # Train the Logistic Regression model using the resampled data
      cc_model = LogisticRegression(solver='lbfgs', random_state=1)
      cc_model.fit(X_resampled, y_resampled)
      
      # Calculate the balanced accuracy score on the test data
      y_pred_cc = cc_model.predict(X_test)
      cc_score = balanced_accuracy_score(y_test, y_pred_cc)
      cc_score
      
      # Display the confusion matrix
      confusion_matrix(y_test, y_pred_cc)
      
      # Print the imbalanced classification report
      print(classification_report_imbalanced(y_test, y_pred_cc))
    • Combine over- and undersampling using the SMOTEENN algorithm.

      # Resample the training data with SMOTEENN
      smoteenn = SMOTEENN(random_state=1)
      X_resampled, y_resampled = smoteenn.fit_resample(X_train, y_train)
      
      # Train the Logistic Regression model using the resampled data
      smoteenn_model = LogisticRegression(solver='lbfgs', random_state=1)
      smoteenn_model.fit(X_resampled, y_resampled)
      
      # Calculate the balanced accuracy score on the test data
      y_pred_smoteenn = smoteenn_model.predict(X_test)
      smoteenn_score = balanced_accuracy_score(y_test, y_pred_smoteenn)
      smoteenn_score
      
      # Display the confusion matrix
      confusion_matrix(y_test, y_pred_smoteenn)
      
      # Print the imbalanced classification report
      print(classification_report_imbalanced(y_test, y_pred_smoteenn))
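
  • To see all four resampling strategies side by side, a compact sketch (not from the original notebook) that refits each model on the same split:

    samplers = {
        "RandomOverSampler": RandomOverSampler(random_state=1),
        "SMOTE": SMOTE(random_state=1, sampling_strategy=1.0),
        "ClusterCentroids": ClusterCentroids(random_state=1),
        "SMOTEENN": SMOTEENN(random_state=1),
    }

    for name, sampler in samplers.items():
        X_res, y_res = sampler.fit_resample(X_train, y_train)
        model = LogisticRegression(solver='lbfgs', random_state=1).fit(X_res, y_res)
        print(name, balanced_accuracy_score(y_test, model.predict(X_test)))
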
  • Use the above to answer the following questions:

    • Which model had the best balanced accuracy score?
      • SMOTEENN Model
    • Which model had the best recall score?
      • SMOTEENN Model
    • Which model had the best geometric mean score?
      • SMOTEENN Model

Ensemble Learning

  • In this section, you will train and compare two different ensemble classifiers to predict loan risk and evaluate each model.

  • You will use the Balanced Random Forest Classifier and the Easy Ensemble Classifier.

  • Refer to the documentation for each of these to read about the models and see examples of the code.

  • To begin:

    • Read the data into a DataFrame using the provided starter code.
      # Load the data
      file_path = Path('Resources/LoanStats_2019Q1.csv')
      df = pd.read_csv(file_path)
    • Split the data into training and testing sets.
      # Create our features (drop the target and the non-numeric columns)
      X = df.drop(columns=["loan_status", "home_ownership", "verification_status", "issue_d", "pymnt_plan", "initial_list_status", "next_pymnt_d", "application_type", "hardship_flag", "debt_settlement_flag"])
      
      # Create our target
      y = df["loan_status"]
      
      # Split the X and y into X_train, X_test, y_train, y_test
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
    • Scale the training and testing data using the StandardScaler from sklearn.preprocessing.
      # Create the StandardScaler instance
      scaler = StandardScaler()
      
      # Fit the Standard Scaler with the training data
      # When fitting scaling functions, only train on the training dataset
      X_scaler = scaler.fit(X_train)
      
      # Scale the training and testing data
      X_train_scaled = X_scaler.transform(X_train)
      X_test_scaled = X_scaler.transform(X_test)
  • Part one: Balanced Random Forest Classifier

    • Train the model using the quarterly data from LendingClub provided in the Resources folder.
      # Train the BalancedRandomForestClassifier on the scaled training data
      brfc_model = BalancedRandomForestClassifier(n_estimators=1000, random_state=1)
      brfc_model.fit(X_train_scaled, y_train)
      
      # View the count of target classes with Counter
      Counter(y_train)
    • Calculate the balanced accuracy score from sklearn.metrics.
      # Calculate the balanced accuracy score
      y_pred_brfc = brfc_model.predict(X_test_scaled)
      balanced_accuracy_score(y_test, y_pred_brfc)
    • Display the confusion matrix from sklearn.metrics.
      # Display the confusion matrix
      confusion_matrix(y_test, y_pred_brfc)
    • Generate a classification report using classification_report_imbalanced from imbalanced-learn.
      # Print the imbalanced classification report
      print(classification_report_imbalanced(y_test, y_pred_brfc))
    • For the balanced random forest classifier only, print the feature importance sorted in descending order (most important feature to least important) along with the feature score.
      # List the features sorted in descending order by feature importance
      importances = brfc_model.feature_importances_
      importances_sorted = sorted(zip(importances, X.columns), reverse=True)
      display(importances_sorted)
      
      # Plot Importances
      indices = np.argsort(importances)
      fig, ax = plt.subplots(figsize=(10,20))
      ax.barh(range(len(importances)), importances[indices])
      ax.set_yticks(range(len(importances)))
      _ = ax.set_yticklabels(np.array(X_train.columns)[indices])
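
    • Equivalently (a sketch, not in the original notebook), a pandas Series gives the same ranking and makes the top three easy to read off:

      # Same ranking as importances_sorted, via a labeled Series
      feature_importances = pd.Series(importances, index=X.columns).sort_values(ascending=False)
      print(feature_importances.head(3))
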
  • Part Two: Easy Ensemble Classifier

    # Train the Easy Ensemble Classifier
    eec_model = EasyEnsembleClassifier(random_state=1)
    eec_model.fit(X_train, y_train)
    
    # Calculate the balanced accuracy score
    y_pred_eec = eec_model.predict(X_test)
    print(balanced_accuracy_score(y_test, y_pred_eec))
    
    # Display the confusion matrix
    confusion_matrix(y_test, y_pred_eec)
    
    # Print the imbalanced classification report
    print(classification_report_imbalanced(y_test, y_pred_eec))
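
  • To compare the two ensemble models directly (a sketch; each model predicts on the inputs it was trained with above):

    for name, model, X_eval in [
        ("BalancedRandomForestClassifier", brfc_model, X_test_scaled),
        ("EasyEnsembleClassifier", eec_model, X_test),
    ]:
        print(name, balanced_accuracy_score(y_test, model.predict(X_eval)))
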
  • Use the above to answer the following questions:

    • Which model had the best balanced accuracy score?
      • Balanced Random Forest Classifier
    • Which model had the best recall score?
      • Balanced Random Forest Classifier
    • Which model had the best geometric mean score?
      • Balanced Random Forest Classifier
    • What are the top three features?
      • total_rec_prncp, total_rec_int, last_pymnt_amnt

© 2021 Trilogy Education Services, a 2U, Inc. brand. All Rights Reserved.