
Loan_Underwriting

Predicting Loan Defaults for Banca Massiccia

Click HERE to see the full and detailed script

Project Overview

This was a project for my Machine Learning in Finance course at NYU. It aimed to enhance the bank’s loan underwriting process. Using machine learning, I developed a model to predict the one-year Probability of Default (PD) for prospective borrowers, enabling risk-based pricing and more informed loan decisions.

Workflow

Approach

The approach involved training a logistic regression model on historical bank transaction data. The focus was on key financial indicators to forecast the probability of default for new borrowers.

Techniques

Outcome

The model outputs a predicted default probability, which can help the bank determine suitable interest rates and underwriting fees. Using the above-mentioned techniques, the model’s AUC improved from the baseline result of 0.701 to 0.7761.

Usage

Key Files

How to Run

  1. Ensure all dependencies are installed.
  2. Run estimation.py to train the model and generate model.pkl, along with the parameter values required by harness.py.
  3. Use harness.py with your test dataset to get predictions. This script will call prediction.py for necessary processing and prediction steps.
     python3 harness.py --input_csv <input file in csv> --output_csv <output csv file path to which the predictions are written>

Data Understanding and Preparation

Problem Formulation

I applied a business-context-aware approach, considering financial factors such as profitability, leverage, and liquidity in the model.

Data Imputation

Missing values were first filled in from related financial variables. For example, a variable called roe had some missing values, which were replaced by profit / total equity. After this finance-based imputation, less than 1% of values were still missing in some variables. These were handled using median imputation.
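The two-step imputation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names `profit` and `total_equity` and the sample values are assumptions made up for the example (only `roe` is named in this README), and pandas is assumed to be available.

```python
import numpy as np
import pandas as pd


def impute_financials(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing roe from its definition, then median-impute what is left."""
    out = df.copy()

    # Step 1 (finance-based): roe = profit / total equity.
    # Zero equity is treated as missing to avoid dividing by zero.
    equity = out["total_equity"].replace(0, np.nan)
    out["roe"] = out["roe"].fillna(out["profit"] / equity)

    # Step 2 (fallback): median imputation for any values still missing.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({
    "profit": [10.0, 20.0, 5.0, 8.0],
    "total_equity": [100.0, 200.0, 0.0, 80.0],
    "roe": [0.10, np.nan, np.nan, 0.12],
})
imputed = impute_financials(sample)
print(imputed)
```

In a real train/test setup, the medians would normally be computed on the training data only and then reused for the test data, so that no information leaks from the test set.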

Engineered Features

Feature Selection

Baseline Model

A baseline model was developed using a “kitchen sink” approach, where all variables were initially included. This model served as a benchmark for the performance of the refined model. The AUC for this model was 0.701.
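For illustration, a "kitchen sink" baseline of this kind might look like the sketch below. It assumes scikit-learn and uses synthetic data generated on the spot, so the AUC it prints has no relation to the 0.701 reported above. The feature count, coefficients, and random split are all assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data: 6 features, 2 of which drive default.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 6))
logits = -2.0 + 1.0 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# "Kitchen sink": every available variable goes into the model unfiltered.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

# Predicted probability of default (class 1) and the resulting AUC.
pd_hat = baseline.predict_proba(X_test)[:, 1]
baseline_auc = roc_auc_score(y_test, pd_hat)
print(f"baseline AUC: {baseline_auc:.3f}")
```

A random split is used here only to keep the sketch short; it is not the walk-forward evaluation this README describes.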

Univariate Analysis

Multivariate Analysis

Check Multicollinearity

Logistic Regression is sensitive to multicollinearity. Multicollinearity occurs when predictor variables are highly correlated with other predictor variables. As a result, it is hard for the model to estimate the effect of each predictor independently. This can result in unstable coefficients whose signs can flip or that become very sensitive to small changes in the model.

Variance Inflation Factor (VIF) was computed for the final variables using a threshold of 5 (at which about 80% of a variable's variance can be explained by the other variables), and variables with a VIF score above 5 were removed.
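As a rough sketch of the VIF check, the code below computes VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing variable j on all the others, using only NumPy. At a VIF of 5, R_j² = 0.8, which is where the 80% figure comes from. The iterative "drop the worst variable and recompute" loop is one common variant and is an assumption here, as are the variable names and the synthetic data. The project may simply have removed every variable above the threshold in one pass.

```python
import numpy as np


def vif_scores(X: np.ndarray) -> np.ndarray:
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the rest."""
    n, k = X.shape
    scores = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        scores[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return scores


def drop_high_vif(X: np.ndarray, names, threshold: float = 5.0):
    """Drop the highest-VIF column repeatedly until all VIFs are <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names


# Hypothetical data: "combo" is almost exactly the sum of the other two.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
combo = a + b + rng.normal(scale=0.1, size=500)
X_all = np.column_stack([a, b, combo])

X_kept, kept = drop_high_vif(X_all, ["leverage", "liquidity", "combo"])
print(kept, vif_scores(X_kept))
```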

Model Evaluation and Interpretation

Walk-Forward Analysis

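Implemented walk-forward analysis, in which the model is trained on earlier periods and evaluated on the period that follows. This mimics how a model would be used in practice in the financial industry and checks that performance is robust over time.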
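A minimal sketch of how walk-forward splits can be generated is shown below. It assumes an expanding training window and yearly periods. The years, the `min_train_years` parameter, and the function name are hypothetical, and the project may have used a different window scheme.

```python
from typing import Iterator, List, Sequence, Tuple


def walk_forward_splits(
    years: Sequence[int], min_train_years: int = 2
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training years, test year) pairs with an expanding training window."""
    ordered = sorted(set(years))
    for i in range(min_train_years, len(ordered)):
        # Train on everything strictly before the test year.
        yield ordered[:i], ordered[i]


# Hypothetical years, for illustration only.
splits = list(walk_forward_splits([2008, 2009, 2010, 2011, 2012]))
for train_years, test_year in splits:
    print(f"train on {train_years} -> test on {test_year}")
```

In each step the model would be refit on rows from the training years and scored (for example with AUC) on rows from the test year, so no future information is used for any prediction.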

Calibration

Interpreting Coefficients and P-value

Conclusion and Future Work

The final variables used for the model were:

The AUC using these variables was 0.7761. Using financial variables gave a clear improvement over the baseline model (AUC of 0.701).

Further improvements may be possible with more sophisticated models. For example, a non-parametric tree-based model such as XGBoost or Random Forest could be tested to see how it affects the AUC. Further assessment would be needed to decide whether any improvement in AUC outweighs the explainability of Logistic Regression.