You're reading from Practical Predictive Analytics Analyse current and historical data to predict future trends using R, Spark, and more

Product type Paperback

Published in Jun 2017

Publisher Packt

ISBN-13 9781785886188

Length 576 pages

Edition 1st Edition

Author (1):

Winters

Spark machine learning using logistic regression

Now that we have constructed our test and training datasets, we will begin by building a logistic regression model that predicts an outcome of 1 or 0. As you will recall, 1 designates diabetes detected, while 0 designates diabetes not detected.

The syntax of a Spark glm is very similar to that of a normal glm in base R. Specify the model using formula notation, and be sure to set family = "binomial" to indicate that the outcome variable takes only two values:

# Run the glm model on the training dataset and assign it to an object named "model"
model <- spark.glm(outcome ~ pregnant + glucose + pressure + triceps + insulin + pedigree + age,
                   family = "binomial", maxIter = 100, data = df)
summary(model)
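
For comparison, here is a minimal sketch of the same kind of model fitted with base R's glm, to show how closely the two calls mirror each other. It runs without a Spark session. The data frame toy, its columns, and the simulated coefficients are hypothetical stand-ins, not the diabetes data used in this chapter. In base R the iteration cap is passed through glm.control rather than a maxIter argument.

```r
# Synthetic stand-in data (hypothetical; NOT the chapter's diabetes dataset).
set.seed(42)
n <- 200
toy <- data.frame(
  glucose = rnorm(n, mean = 120, sd = 30),
  age     = round(runif(n, min = 21, max = 70))
)

# Simulate a binary outcome whose log-odds rise with glucose and age.
log_odds <- -8 + 0.04 * toy$glucose + 0.05 * toy$age
toy$outcome <- rbinom(n, size = 1, prob = plogis(log_odds))

# Base R counterpart of the spark.glm call: same formula notation and
# the same family = "binomial" argument.
local_model <- glm(outcome ~ glucose + age,
                   family = "binomial",
                   data = toy,
                   control = glm.control(maxit = 100))
summary(local_model)
```

The discussion that follows refers to the summary of the Spark model, not to this toy fit.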

Examining the output:

You can observe the coefficients of the model in the Estimate column. You can also see that the residuals range from -2.54 to +2.40, which is roughly 2.5 standard deviations in each direction, and that the median value (not the mean) is supplied, which is -0.326....
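
To make these summary quantities concrete, the sketch below pulls the same kinds of values out of a small base R logistic model: the Estimate column, the corresponding odds ratios, and the minimum, quartiles, median, and maximum of the residuals. It assumes the residuals reported are deviance residuals. The data are simulated and hypothetical, so the numbers will not match the output discussed above.

```r
# Simulated data (hypothetical; not the chapter's diabetes dataset).
set.seed(1)
n <- 300
toy <- data.frame(glucose = rnorm(n, mean = 120, sd = 30))
toy$outcome <- rbinom(n, size = 1, prob = plogis(-6 + 0.05 * toy$glucose))

fit <- glm(outcome ~ glucose, family = "binomial", data = toy)

# The Estimate column of the coefficient table is on the log-odds scale.
estimates <- coef(summary(fit))[, "Estimate"]

# Exponentiating a coefficient gives the multiplicative change in the odds
# for a one-unit increase in that predictor.
odds_ratios <- exp(estimates)

# Deviance residuals summarised by min, quartiles, median, and max.
dev_res <- residuals(fit, type = "deviance")
res_summary <- quantile(dev_res)

print(estimates)
print(odds_ratios)
print(res_summary)
```

A median residual well away from zero, or a minimum and maximum of very different magnitude, is a hint that the residuals are skewed and the fit deserves a closer look.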