Project Description

There was an interesting data analytics competition on Kaggle, and I thought I would give it a try; the full description is on the competition page.

Many people struggle to get loans due to insufficient or non-existent credit histories. Unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience.

In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.

The dataset is hosted on Kaggle. First place on the leaderboard has a roc_auc score of 0.80570.

For comparison, my results are included at the end.

Data Standardization

The data files contain different types of raw data. Numeric columns are normalized. Text (categorical) columns are flagged and divided into buckets (e.g., Monday through Friday as weekday, Saturday and Sunday as weekend). Columns with too many null values get their own flag, treating null as its own type.
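A minimal sketch of this step with pandas and scikit-learn; the function name, column lists, and bucket rule below are illustrative, not the actual pipeline:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    def standardize(df, numeric_cols, weekday_col, sparse_cols):
        # Normalize numeric columns to zero mean and unit variance
        df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
        # Bucket the weekday text column: Saturday/Sunday -> weekend, else weekday
        weekend = {'SATURDAY', 'SUNDAY'}
        df[weekday_col + '_IS_WEEKEND'] = df[weekday_col].str.upper().isin(weekend).astype(int)
        # For columns with many nulls, flag null as its own type
        for col in sparse_cols:
            df[col + '_IS_NULL'] = df[col].isnull().astype(int)
        return df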

Data Aggregation

For numeric data, we aggregate using mean, max, and min grouped by ID. For flag data, we count the number of 1's. After merging the files, the resulting master dataset has 307,511 rows (samples) and 779 columns (features).
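A sketch of the aggregation with pandas; bureau and SK_ID_CURR match this competition's file layout, while the column lists are placeholders:

    import pandas as pd

    def aggregate(child, key, numeric_cols, flag_cols):
        # Mean, max, min per ID for numeric columns
        num_agg = child.groupby(key)[numeric_cols].agg(['mean', 'max', 'min'])
        num_agg.columns = ['_'.join(c) for c in num_agg.columns]
        # Count of 1's per ID for flag columns
        flag_agg = child.groupby(key)[flag_cols].sum().add_suffix('_COUNT')
        return num_agg.join(flag_agg)

    # master = application.merge(aggregate(bureau, 'SK_ID_CURR', num_cols, flag_cols),
    #                            on='SK_ID_CURR', how='left')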

Feature Selection

Due to the large amount of data and our limited computational power, we decided to do some feature selection. Feature selection also helps prevent over-fitting and reduces variance. We tried three methods (the first two are sketched below; forward selection gets its own section):

  1. Variance threshold
  2. Recursive feature elimination
  3. Forward feature selection
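The first two methods have direct scikit-learn implementations; a sketch, where the threshold, estimator, and number of features to keep are assumed values:

    from sklearn.feature_selection import RFE, VarianceThreshold
    from sklearn.linear_model import LogisticRegression

    # Drop near-constant features (threshold is an assumed value)
    X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X)

    # Recursively drop the weakest features according to a linear model
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=100)
    X_rfe = rfe.fit_transform(X_reduced, y)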

Forward Feature Selection

The idea is to start from an empty set of features:

  1. Select the single feature from the whole feature set with the best evaluation score on the cross-validation set.
  2. With the best features selected so far, add the one feature from the rest of the feature set that gives the best score.
  3. Repeat until the evaluation score on the cross-validation set no longer changes.

With linear regression as the model and roc_auc as the evaluation metric, we narrowed down to 70 features; the corresponding roc_auc score on the testing set is 0.7856. The loop is sketched below.
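A sketch of that greedy loop; per the text it uses plain linear regression, whose predictions serve as scores for roc_auc, and the stopping tolerance is an assumption:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def forward_select(X, y, tol=1e-4):
        # Greedily add the feature that most improves cross-validated roc_auc
        selected, remaining, best_score = [], list(X.columns), 0.0
        while remaining:
            scores = {f: cross_val_score(LinearRegression(), X[selected + [f]], y,
                                         scoring='roc_auc', cv=5).mean()
                      for f in remaining}
            best_f = max(scores, key=scores.get)
            if scores[best_f] - best_score <= tol:
                break  # score no longer improves: stop
            selected.append(best_f)
            remaining.remove(best_f)
            best_score = scores[best_f]
        return selected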

Modeling

  1. Bayesian network
  2. Random forest
  3. Gradient boosting
  4. Logistic regression
  5. SVM
  6. Neural network

Chow-Liu Tree

Accuracy: 0.91

Approach: discretize the range of each feature into N uniform bins.

Problem: binning loses information from the continuous features.
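The binning step can be written with pandas; a sketch, with the bin count N and the DataFrame X assumed:

    import pandas as pd

    N = 10  # number of uniform bins per feature (assumed)
    # Equal-width binning: each continuous column becomes integer bin indices
    X_binned = X.apply(lambda col: pd.cut(col, bins=N, labels=False))

Equal-width bins make the features discrete for the Chow-Liu structure learner, at the cost of collapsing within-bin variation; that is the information loss noted above.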

Random Forest

Using randomized search for the hyper-parameters, the best model achieves:

roc_auc score on training: 0.756522325610385

roc_auc score on testing: 0.765107802812631
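A sketch of what this search looks like; the parameter distributions are assumptions, since the actual search space is not recorded here:

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # Parameter ranges are assumed, not the original grid
    param_dist = {'n_estimators': randint(100, 500),
                  'max_depth': randint(3, 15),
                  'min_samples_leaf': randint(1, 50)}
    search = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1), param_dist,
                                n_iter=20, scoring='roc_auc', cv=3)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)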

Gradient Boosting

Using randomized search for the hyper-parameters, the best model achieves:

roc_auc score on training: 0.7558312252360753

roc_auc score on testing: 0.7593559233006815

Logistic Regression (L1 penalty)

Using grid search for the hyper-parameters, the best model achieves:

roc_auc score on training: 0.7679654092120825

roc_auc score on testing: 0.7635580017067386

Logistic Regression (L2 penalty)

Using grid search for the hyper-parameters, the best model achieves:

roc_auc score on training: 0.768355859378838

roc_auc score on testing: 0.7648156359451366
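A sketch covering the grid search for both penalties above; the C grid is an assumption, and liblinear is chosen because it supports both L1 and L2:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    for penalty in ('l1', 'l2'):
        grid = GridSearchCV(LogisticRegression(penalty=penalty, solver='liblinear'),
                            {'C': [0.001, 0.01, 0.1, 1, 10]},
                            scoring='roc_auc', cv=5)
        grid.fit(X_train, y_train)
        print(penalty, grid.best_params_, grid.best_score_)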

SVM

Using randomized search for the hyper-parameters, the best model achieves:

roc_auc score on training: 0.66843200323323

roc_auc score on testing: 0.653836647907179

Neural Network

Using randomized search for the hyper-parameters, the best model achieves:

roc_auc score on training: 0.750743149087918

roc_auc score on testing: 0.7633249087734756

Changsong Li, PhD September 2018