Machine Learning with R Quick Start Guide

Name: Machine Learning with R Quick Start Guide
Author: Iván Pastor Sanz

A beginner's guide to implementing machine learning techniques from scratch using R 3.5

250 pages
English

About This Book

Learn how to use R to apply powerful machine learning methods and gain insight into real-world applications using clustering, logistic regressions, random forests, support vector machine, and more.

Key Features

Use R 3.5 to implement real-world examples in machine learning
Implement key machine learning algorithms to understand the working mechanism of smart models
Create end-to-end machine learning pipelines using modern libraries from the R ecosystem

Book Description

Machine Learning with R Quick Start Guide takes you on a data-driven journey that starts with the very basics of R and machine learning. It gradually builds upon core concepts so you can handle the varied complexities of data and understand each stage of the machine learning pipeline.

From data collection to implementing Natural Language Processing (NLP), this book covers it all. You will implement key machine learning algorithms to understand how they are used to build smart models. You will cover tasks such as clustering, logistic regressions, random forests, support vector machines, and more. Furthermore, you will also look at more advanced aspects such as training neural networks and topic modeling.

By the end of the book, you will be able to apply the concepts of machine learning, deal with data-related problems, and solve them using the powerful yet simple language that is R.

What you will learn

Introduce yourself to the basics of machine learning with R 3.5
Get to grips with R techniques for cleaning and preparing your data for analysis and visualize your results
Learn to build predictive models with the help of various machine learning techniques
Use R to visualize data spread across multiple dimensions and extract useful features
Use interactive data analysis with R to get insights into data
Implement supervised and unsupervised learning, and NLP using R libraries

Who this book is for

This book is for graduate students, aspiring data scientists, and data analysts who wish to enter the field of machine learning and are looking to implement machine learning techniques and methodologies from scratch using R 3.5. A working knowledge of the R programming language is expected.

Information

Publisher: Packt Publishing
Year: 2019
ISBN: 9781838647056
Topic: Computer Science
Subtopic: Artificial Intelligence (AI) & Semantics

Predicting Failures of Banks - Multivariate Analysis

In this chapter, we are going to apply different algorithms with the aim of obtaining a good model using combinations of our predictors. The most common algorithm that's used in credit risk applications, such as credit scoring and rating, is logistic regression. In this chapter, we will see how other algorithms can be applied to solve some of the weaknesses of logistic regression.

In this chapter, we will be covering the following topics:

Logistic regression
Regularized methods
Testing a random forest model
Gradient boosting
Deep learning in neural networks
Support vector machines
Ensembles
Automatic machine learning

Logistic regression

Mathematically, a binary logistic model has a dependent variable with two categorical values. In our example, these values relate to whether or not a bank is solvent.

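In a logistic model, the log odds are the logarithm of the odds for a class, and they are modeled as a linear combination of one or more independent variables, as follows:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

Here, $p$ is the probability of the class of interest and $x_1, \dots, x_n$ are the independent variables.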

The coefficients (beta values, β) of the logistic regression are estimated using maximum likelihood estimation. Maximum likelihood estimation finds the values of the regression coefficients that maximize the likelihood of the observed outcomes, that is, the values that bring the probabilities predicted by the model as close as possible to the classes that were actually observed.
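To make the idea of maximum likelihood estimation concrete, the following is a minimal sketch on simulated data (not the bank dataset used in this chapter). The variable names, the true coefficients, and the use of optim() are assumptions made only for illustration. The sketch minimizes the negative log-likelihood by hand and compares the result with glm():

```r
# Simulated data (hypothetical; not the book's bank dataset)
set.seed(1234)
n <- 500
x <- rnorm(n)
p <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))   # true probabilities
y <- rbinom(n, size = 1, prob = p)

# Negative log-likelihood of a binary logistic model with one predictor
neg_log_lik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x          # linear predictor, i.e. the log odds
  -sum(y * eta - log1p(exp(eta)))       # Bernoulli log-likelihood, negated
}

# Minimizing the negative log-likelihood maximizes the likelihood
fit_optim <- optim(c(0, 0), neg_log_lik, x = x, y = y, method = "BFGS")
fit_glm <- glm(y ~ x, family = binomial())

# The two sets of estimates should agree closely
rbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))
```

In practice, glm() does this work for us; the manual version is only meant to show what is being optimized.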

Logistic regression is very sensitive to the presence of outlier values and to highly correlated predictors, so both should be dealt with before fitting the model. Logistic regression in R can be applied as follows:

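set.seed(1234)
# Fit a binary logistic regression of Default on the predictors in columns 2..ncol(train)
LogisticRegression <- glm(train$Default ~ ., data = train[, 2:ncol(train)], family = binomial())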
 ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

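The code runs, but a warning message appears: some fitted probabilities are numerically 0 or 1, which typically points to (quasi-)complete separation of the classes or to a model with too many parameters. If the variables are highly correlated or collinearity exists, it is expected that the model parameters and their variance are inflated.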

The high variance is not due to accurate or good predictors, but is instead due to a misspecified model with redundant predictors. Thus, the maximum likelihood is increased by simply adding more parameters, which results in overfitting.
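The effect of redundant predictors on the variance of the estimates can be illustrated with a small sketch. The data is simulated and all names are hypothetical: x2 is built as a noisy copy of x1, and the standard error of the x1 coefficient is compared with and without x2 in the model. The variance inflation factor (VIF) is computed by hand as 1 / (1 - R^2):

```r
# Simulated data (hypothetical): x2 is almost a copy of x1
set.seed(1234)
n <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
y <- rbinom(n, 1, plogis(0.5 * x1))

# Standard error of the x1 coefficient without and with the redundant predictor
se_single <- summary(glm(y ~ x1, family = binomial()))$coefficients["x1", "Std. Error"]
se_both <- summary(glm(y ~ x1 + x2, family = binomial()))$coefficients["x1", "Std. Error"]

# Variance inflation factor of x1, from the R^2 of regressing x1 on x2
r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)

c(se_single = se_single, se_both = se_both, vif_x1 = vif_x1)
```

The standard error of x1 grows sharply once x2 enters the model, even though x2 adds almost no new information.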

We can observe the parameters of the model with the summary() function:

summary(LogisticRegression)
 ## 
 ## Call:
 ## glm(formula = train$Default ~ ., family = binomial(), data = train[, 
 ## 2:ncol(train)])
 ## 
 ## Deviance Residuals: 
 ##     Min       1Q   Median       3Q      Max 
 ## -3.9330  -0.0210  -0.0066  -0.0013   4.8724 
 ## 
 ## Coefficients:
 ##                   Estimate    Std. Error z value Pr(>|z|)  
 ## (Intercept) -11.7599825009  6.9560247460  -1.691   0.0909 .
 ## UBPRE395     -0.0575725641  0.0561441397  -1.025   0.3052  
 ## UBPRE543      0.0014008963  0.0294470630   0.048   0.9621  
 ## ....                 .....          ....    ....     ....
 ## UBPRE021     -0.0114148389  0.0057016025  -2.002   0.0453 *
 ## UBPRE023      0.4950212919  0.2459506994   2.013   0.0441 *
 ## UBPRK447     -0.0210028916  0.0192296299  -1.092   0.2747  
 ## ---
 ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 ## 
 ## (Dispersion parameter for binomial family taken to be 1)
 ## 
 ## Null deviance: 2687.03 on 7090 degrees of freedom
 ## Residual deviance: 284.23 on 6982 degrees of freedom
 ## AIC: 502.23
 ## 
 ## Number of Fisher Scoring iterations: 13

Judging by the p-values in the last column of the preceding table, most of the variables are not statistically significant. In cases like this, the number of variables in the regression should be reduced, or another approach should be followed, such as a penalized or regularization method.
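Rather than reading the table by eye, the p-values can be extracted from the summary object. The following sketch uses simulated data with hypothetical column names (X1 to X5), where only X1 actually drives the response. The 0.05 threshold is a conventional choice, not one taken from the text:

```r
# Simulated data (hypothetical): only X1 is related to Default
set.seed(1234)
n <- 500
sim <- data.frame(matrix(rnorm(n * 5), ncol = 5))        # columns X1..X5
sim$Default <- rbinom(n, 1, plogis(-1 + 1.5 * sim$X1))

fit <- glm(Default ~ ., data = sim, family = binomial())

# The coefficient table is a matrix; the last column holds the p-values
coefs <- summary(fit)$coefficients
p_values <- coefs[-1, "Pr(>|z|)"]                        # drop the intercept row
significant <- names(p_values)[p_values < 0.05]
significant
```

The same extraction could be applied to the LogisticRegression object fitted earlier, for example to count how many predictors fall below the chosen threshold.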

Regularized methods

There are three common approaches to using regularized methods:

Lasso
Ridge
Elastic net
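The three methods differ only in the penalty that is added to the loss function. As a rough sketch, assuming the parameterization used by glmnet and H2O (where alpha mixes the two penalties and lambda sets the overall strength), the penalty can be written as a small function. The function name and the example coefficients are hypothetical:

```r
# Elastic net penalty, assuming the glmnet/H2O parameterization:
# lambda * (alpha * sum(|beta|) + (1 - alpha) / 2 * sum(beta^2))
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) / 2 * sum(beta^2))
}

beta <- c(0.5, -2, 0)   # illustrative coefficients, intercept excluded

lasso <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)    # L1 penalty only
ridge <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)    # L2 penalty only
mixed <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)  # elastic net

c(lasso = lasso, ridge = ridge, mixed = mixed)
```

With alpha = 1 the penalty is the lasso (sum of absolute values), with alpha = 0 it is ridge (sum of squares), and intermediate values give the elastic net.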

In this section, we will see how these methods can be implemented in R. For these models, we will use the h2o package. H2O is an open source platform for machine learning and predictive analysis that works in memory and is distributed, fast, and scalable. It helps in creating models that are built on big data and is well suited to enterprise applications, as it makes it easier to bring models to production quality.

For more information on the h2o package, please visit its documentation at https://cran.r-project.org/web/packages/h2o/index.html.

This package is very useful because it brings together several common machine learning algorithms in one package. Moreover, these algorithms can be executed in parallel on our own computer, which makes them very fast. The package includes generalized linear models, naïve Bayes, distributed random forest, gradient boosting, and deep learning, among others.

It is not necessary to have a high level of programming knowledge, because the package comes with a user interface.

Let's see how the package works. First, the package should be loaded:

library(h2o)

Use the h2o.init() function to initialize H2O. It starts a local H2O cluster, or connects to one that is already running, and accepts other options that can be found in the package documentation:

h2o.init()

The first step toward building our model involves placing our data in the H2O cluster/Java process. Before this step, we will ensure that our target is considered as a factor variable:

train$Default <- as.factor(train$Default)
test$Default <- as.factor(test$Default)

Now, let's upload our data to the h2o cluster:

as.h2o(train[, 2:ncol(train)], destination_frame = "train")

as.h2o(test[, 2:ncol(test)], destination_frame = "test")

If you close R and restart it later, you will need to upload the datasets again, as in the preceding code.

We can check that the data has been uploaded correctly with the following command:

h2o.ls()

 ## key
 ## 1 test
 ## 2 train

The package includes an easy-to-use, browser-based interface that allows us to create different models. While H2O is running, the interface can usually be opened by entering the following address in our web browser: http://localhost:54321/flow/index.html. You will see a page like the one that's shown in the following screenshot. In the Model tab, we can see a list of all of the available models that are implemented in this package:

First, we are going to develop regularization models. For that, Generalized Linear Modelling… must be selected. This module includes the following:

Gaussian regression
Poisson regression
Binomial regression (classification)
Multinomial classification
Gamma regression
Ordinal regression

As shown in the following screenshot, we should fill in the necessary parameters to train our model:

We will fill in the following fields:

model_id: The name by which the model will be referenced.
training_frame: The dataset used to build and train the model; in our case, the training dataset.
validation_frame: The dataset used to check the accuracy of the model.
nfolds: The number of folds used for cross-validation. In our case, the nfolds value is 5.
seed: The seed for the Random Number Generator (RNG), which is used by the components of the algorithm that require random numbers.
response_column: This is the column to use as the dependent variable. In our case, the column is named Default.
ignored_columns: The variables to exclude from the training process. In our case, all of the variables are considered relevant, so none are ignored.
ignore_const_cols: A flag that tells the package to ignore constant variables.
family: This specifies the model type. In our case, we want to train a regression model, so the family sho...
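The same fields can also be set from R instead of the browser. The following is only a sketch of how the form might map onto h2o.glm(); it is not taken from the text. It assumes that a cluster is running, that the frames were uploaded under the keys "train" and "test", and that the test frame is used for validation. The function name, the model_id value, the seed, and the alpha argument are hypothetical choices. The family is left as an argument because its value is cut off in the description above:

```r
# Sketch only: calling this function requires the h2o package, a running
# cluster (h2o.init()), and frames uploaded under the keys "train" and "test".
fit_regularized_glm <- function(family, alpha = 1, model_id = "glm_sketch") {
  train_hex <- h2o::h2o.getFrame("train")   # handle to the uploaded training frame
  test_hex <- h2o::h2o.getFrame("test")     # assumed here to serve as validation data
  h2o::h2o.glm(
    y = "Default",                 # response_column; all other columns are predictors
    training_frame = train_hex,
    validation_frame = test_hex,
    nfolds = 5,                    # cross-validation folds, as in the form
    seed = 1234,                   # illustrative seed
    ignore_const_cols = TRUE,
    family = family,               # supplied by the caller
    alpha = alpha,                 # 1 = lasso, 0 = ridge, in between = elastic net
    model_id = model_id            # hypothetical name
  )
}

# Example call, assuming a two-class target (an assumption, not from the text):
# model <- fit_regularized_glm(family = "binomial", alpha = 1)
```

Wrapping the call in a function keeps the sketch loadable even when no cluster is available; nothing is sent to H2O until the function is actually called.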

Title Page
Copyright and Credits
About Packt
Contributors
Preface
R Fundamentals for Machine Learning
Predicting Failures of Banks - Data Collection
Predicting Failures of Banks - Descriptive Analysis
Predicting Failures of Banks - Univariate Analysis
Predicting Failures of Banks - Multivariate Analysis
Visualizing Economic Problems in the European Union
Sovereign Crisis - NLP and Topic Modeling
Other Books You May Enjoy