In this chapter, we are going to apply different algorithms with the aim of obtaining a good model using combinations of our predictors. The most common algorithm that's used in credit risk applications, such as credit scoring and rating, is logistic regression. In this chapter, we will see how other algorithms can be applied to solve some of the weaknesses of logistic regression.
Mathematically, a binary logistic model has a categorical dependent variable with two possible values. In our example, these values relate to whether or not a bank is solvent.
In a logistic model, log odds refers to the logarithm of the odds for a class, which is a linear combination of one or more independent variables, as follows:
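Written out in standard notation, the relationship takes the form below. The symbols are chosen here for illustration: p is the probability of the class of interest, x_1, ..., x_k are the independent variables, and β_0, ..., β_k are the coefficients.

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```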
The coefficients (beta values, β) of the logistic regression model are estimated by maximum likelihood. Maximum likelihood estimation searches for the coefficient values under which the observed outcomes are most probable, that is, the values that make the probabilities predicted by the model agree as closely as possible with the observed cases.
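The idea of maximum likelihood can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for the illustration (the sample size, the variable names x1 and x2, and the true coefficients); it is not the chapter's bank dataset. The sketch minimizes the negative log-likelihood with optim() and compares the result with glm().

```r
# Minimal sketch: maximum likelihood for logistic regression on simulated data.
set.seed(1234)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)

# True coefficients, chosen arbitrarily for the illustration
beta_true <- c(-0.5, 1.0, -2.0)
X <- cbind(1, x1, x2)                          # design matrix with intercept
p <- as.vector(1 / (1 + exp(-X %*% beta_true)))  # inverse logit
y <- rbinom(n, size = 1, prob = p)

# Negative log-likelihood of the binary logistic model
neg_log_lik <- function(beta) {
  eta <- as.vector(X %*% beta)
  -sum(y * eta - log1p(exp(eta)))
}

# Maximizing the likelihood is the same as minimizing its negative
fit_optim <- optim(par = c(0, 0, 0), fn = neg_log_lik, method = "BFGS")
fit_glm   <- glm(y ~ x1 + x2, family = binomial())

# Both routes should give approximately the same estimates
round(cbind(optim = fit_optim$par, glm = coef(fit_glm)), 3)
```

In practice, glm() uses iteratively reweighted least squares rather than a general-purpose optimizer, but both approaches target the same likelihood.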
Logistic regression is sensitive to outliers and to multicollinearity, so extreme values should be treated and highly correlated variables should be avoided. Logistic regression in R can be applied as follows:
set.seed(1234)
LogisticRegression=glm(train$Default~.,data=train[,2:ncol(train)],family=binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
The code runs, but a warning message appears: some fitted probabilities are numerically indistinguishable from 0 or 1. If the variables are highly correlated or collinearity exists, the parameter estimates become unstable and their variances are inflated.
The high variance is not due to accurate or good predictors, but is instead due to a misspecified model with redundant predictors. Thus, the maximum likelihood is increased by simply adding more parameters, which results in overfitting.
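The effect of a redundant predictor on the variance of the estimates can be seen in a small simulated sketch. The data, the variable names, and the noise level are assumptions made for the illustration; x2 is built as a near copy of x1.

```r
# Minimal sketch: a nearly duplicated predictor inflates standard errors.
set.seed(1234)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)        # almost a copy of x1
y  <- rbinom(n, 1, plogis(-1 + x1))   # only x1 drives the outcome

fit_single <- glm(y ~ x1, family = binomial())
fit_both   <- glm(y ~ x1 + x2, family = binomial())

# Standard error of the x1 coefficient with and without the redundant x2
se_single <- coef(summary(fit_single))["x1", "Std. Error"]
se_both   <- coef(summary(fit_both))["x1", "Std. Error"]

c(correlation = cor(x1, x2), se_single = se_single, se_both = se_both)
```

The standard error of x1 grows sharply once x2 enters the model, even though x2 adds almost no information.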
We can observe the parameters of the model with the summary() function:
summary(LogisticRegression)
##
## Call:
## glm(formula = train$Default ~ ., family = binomial(), data = train[,
## 2:ncol(train)])
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9330 -0.0210 -0.0066 -0.0013  4.8724
##
## Coefficients:
##                   Estimate   Std. Error z value Pr(>|z|)
## (Intercept) -11.7599825009 6.9560247460  -1.691   0.0909 .
## UBPRE395     -0.0575725641 0.0561441397  -1.025   0.3052
## UBPRE543      0.0014008963 0.0294470630   0.048   0.9621
## ....                 .....         ....    ....     ....
## UBPRE021     -0.0114148389 0.0057016025  -2.002   0.0453 *
## UBPRE023      0.4950212919 0.2459506994   2.013   0.0441 *
## UBPRK447     -0.0210028916 0.0192296299  -1.092   0.2747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2687.03 on 7090 degrees of freedom
## Residual deviance: 284.23 on 6982 degrees of freedom
## AIC: 502.23
##
## Number of Fisher Scoring iterations: 13
Looking at the p-values in the last column of the preceding table, we can see that most of the variables are not statistically significant. In cases like this, the number of variables in the regression should be reduced, or another approach should be followed, such as a penalized (regularized) method.
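Rather than reading the table by eye, the p-values can be extracted from the summary object. The following is a minimal sketch on simulated data, where the data frame d, the predictors V1 to V6, and the 0.05 threshold are assumptions made for the illustration.

```r
# Minimal sketch: list the predictors whose p-value falls below a threshold.
set.seed(1234)
n <- 400
d <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))   # columns V1..V6
d$Default <- rbinom(n, 1, plogis(-1 + 1.5 * d$V1))   # only V1 matters

fit <- glm(Default ~ ., data = d, family = binomial())

# coef(summary(...)) returns the coefficient table as a matrix
coef_table  <- coef(summary(fit))
p_values    <- coef_table[-1, "Pr(>|z|)"]            # drop the intercept
significant <- names(p_values)[p_values < 0.05]

significant
```

The same extraction should work on the chapter's model by calling coef(summary(LogisticRegression)).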
There are three common approaches to using regularized methods:
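Assuming the three approaches meant here are the usual ones, ridge (L2 penalty), lasso (L1 penalty), and elastic net (a mix of both), their objectives can be sketched as follows. The notation is chosen for this sketch: ℓ(β) is the log-likelihood of the logistic model, λ ≥ 0 controls the strength of the penalty, and α ∈ [0, 1] balances the two penalties.

```latex
\begin{aligned}
\text{Ridge:} \quad & \min_{\beta}\; -\ell(\beta) + \lambda \sum_{j=1}^{k} \beta_j^{2} \\
\text{Lasso:} \quad & \min_{\beta}\; -\ell(\beta) + \lambda \sum_{j=1}^{k} \lvert \beta_j \rvert \\
\text{Elastic net:} \quad & \min_{\beta}\; -\ell(\beta) + \lambda \left( \alpha \sum_{j=1}^{k} \lvert \beta_j \rvert + \frac{1-\alpha}{2} \sum_{j=1}^{k} \beta_j^{2} \right)
\end{aligned}
```

Under this parameterization, α = 1 gives the lasso and α = 0 gives ridge. The exact scaling of the penalty terms varies between software packages.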
In this section, we will see how these methods can be implemented in R. For these models, we will use the h2o package. It provides an open source predictive analytics platform for machine learning that is in-memory, distributed, fast, and scalable. It helps in building models on big data and is well suited to enterprise applications that require production-quality models.
For more information on the h2o package, please visit its documentation at https://cran.r-project.org/web/packages/h2o/index.html.
This package is very useful because it brings several common machine learning algorithms together in one package. Moreover, these algorithms run in parallel, even on our own computer, which makes them very fast. The package includes generalized linear models, naïve Bayes, distributed random forest, gradient boosting, and deep learning, among others.
It is not necessary to have a high level of programming knowledge, because the package comes with a user interface.
Let's see how the package works. First, the package should be loaded:
library(h2o)
Use the h2o.init function to start a local H2O instance, or to connect to one that is already running. This function accepts other options that can be found in the package documentation:
h2o.init()
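As a sketch of two commonly used options, the call below sets the number of threads and the memory limit. The values are assumptions chosen for the illustration and should be adapted to the machine in use.

```r
library(h2o)

# nthreads = -1 asks H2O to use all available cores;
# max_mem_size caps the memory of the underlying Java process.
# Both values are illustrative, not recommendations.
h2o.init(nthreads = -1, max_mem_size = "2G")

# Print basic information about the running cluster
h2o.clusterInfo()
```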
The first step toward building our model involves placing our data in the H2O cluster/Java process. Before this step, we will ensure that our target is considered as a factor variable:
train$Default<-as.factor(train$Default)
test$Default<-as.factor(test$Default)
Now, let's upload our data to the h2o cluster:
as.h2o(train[,2:ncol(train)],destination_frame="train")
as.h2o(test[,2:ncol(test)],destination_frame="test")
If you close R and restart it later, you will need to upload the datasets again, as in the preceding code.
We can check that the data has been uploaded correctly with the following command:
h2o.ls()
## key
## 1 test
## 2 train
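It can also be convenient to keep an R handle to each uploaded frame, so that it can be inspected or passed to modeling functions directly. The following is a minimal sketch with a toy data frame; the names toy, toy_h2o, and same_frame, and the values, are hypothetical and only stand in for the chapter's train and test sets.

```r
library(h2o)
h2o.init()

# Toy data standing in for the chapter's training set (hypothetical values)
toy <- data.frame(x = c(0.1, 0.5, 0.9, 0.3),
                  Default = as.factor(c(0, 1, 1, 0)))

# Assigning the result keeps an R handle to the uploaded frame
toy_h2o <- as.h2o(toy, destination_frame = "toy")

dim(toy_h2o)                         # rows and columns as seen by H2O
h2o.ls()                             # "toy" should appear among the keys

# A handle can also be recovered later from the key alone
same_frame <- h2o.getFrame("toy")
```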
The package contains an easy-to-use interface that allows us to create different models from our browser. In general, the interface can be opened by entering the following address in our web browser: http://localhost:54321/flow/index.html. A page like the one shown in the following screenshot will appear. In the Model tab, we can see a list of all the models that are implemented in this package:
First, we are going to develop regularization models. For that, Generalized Linear Modelling… must be selected. This module includes the following:
- Gaussian regression
- Poisson regression
- Binomial regression (classification)
- Multinomial classification
- Gamma regression
- Ordinal regression
As shown in the following screenshot, we should fill in the necessary parameters to train our model:
We will fill in the following fields:
- model_id: The name that will be used to refer to the model.
- training_frame: The dataset used to build and train the model. This will be our training dataset.
- validation_frame: The dataset used to check the accuracy of the model.
- nfolds: The number of folds used for cross-validation. In our case, the nfolds value is 5.
- seed: This specifies the seed of the Random Number Generator (RNG) that is used by the components of the algorithm that require random numbers.
- response_column: This is the column to use as the dependent variable. In our case, the column is named Default.
- ignored_columns: In this section, it is possible to ignore variables in the training process. In our case, all of the variables are considered relevant.
- ignore_const_cols: This is a flag that indicates that the package should ignore columns whose value is constant.
- family: This specifies the model type. In our case, we want to train a regression model, so the family sho...