Spark machine learning using logistic regression
Now that we have constructed our test and training datasets, we will begin by building a logistic regression model which will predict the outcome 1 or 0. As you will recall, 1 designates diabetes detected, while 0 designates diabetes not detected.
The syntax of a Spark glm is very similar to a normal glm. Specify the model using formula notation. Be sure to specify family = "binomial" to indicate that the outcome variable has only two outcomes:
# run glm model on Training dataset and assign it to object named "model"
model <- spark.glm(outcome ~ pregnant + glucose + pressure + triceps + insulin + pedigree + age,
                   family = "binomial", maxIter = 100, data = df)
summary(model)
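For comparison, here is a minimal sketch of the equivalent call in base R, which runs without a Spark session. The data is simulated purely for illustration: the data frame local_df, its values, and the object local_model are hypothetical stand-ins, not the diabetes data used in this chapter. The formula and family arguments are written the same way as in spark.glm; the iteration cap that spark.glm takes as maxIter is passed to base glm through glm.control(maxit = ...).

```r
# Simulated stand-in data (hypothetical, NOT the chapter's diabetes dataset)
set.seed(42)
n <- 200
local_df <- data.frame(
  glucose = rnorm(n, mean = 120, sd = 30),
  age     = round(runif(n, min = 21, max = 70))
)

# Simulate a 0/1 outcome whose log-odds rise with glucose and age
log_odds <- -8 + 0.05 * local_df$glucose + 0.03 * local_df$age
local_df$outcome <- rbinom(n, size = 1, prob = plogis(log_odds))

# Base R counterpart of the spark.glm call: same formula notation,
# same family = "binomial"; maxit plays the role of maxIter
local_model <- glm(outcome ~ glucose + age,
                   family = "binomial",
                   data = local_df,
                   control = glm.control(maxit = 100))
summary(local_model)
```

The main practical difference is the data argument: spark.glm expects a SparkDataFrame, while base glm expects an ordinary local data frame.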
Examining the output:
You can observe the coefficients of the model in the Estimate column. You can also see that the residuals range from -2.54 to +2.40, which spans roughly 2.5 standard deviations on either side of zero, and that the median value (not the mean) is supplied, which is -0.326....
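To make the residual summary more concrete, the following sketch fits a base R logistic regression on simulated data and computes the same kind of five-number summary by hand. It assumes that the residuals reported in the model summary are deviance residuals, as they are in base R. The data frame sim and the objects local_fit, res_summary, and odds_ratios are hypothetical names used only for illustration. The sketch also exponentiates the coefficients, which is a common way to read logistic regression estimates as odds ratios.

```r
# Simulated data (hypothetical, NOT the chapter's diabetes dataset)
set.seed(7)
n <- 300
sim <- data.frame(glucose = rnorm(n, mean = 120, sd = 30))
sim$outcome <- rbinom(n, size = 1, prob = plogis(-6 + 0.05 * sim$glucose))

local_fit <- glm(outcome ~ glucose, family = "binomial", data = sim)

# One deviance residual per row: positive when the outcome is 1,
# negative when it is 0
dev_res <- residuals(local_fit, type = "deviance")

# Minimum, first quartile, median, third quartile, maximum
res_summary <- quantile(dev_res)
print(res_summary)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios
odds_ratios <- exp(coef(local_fit))
print(odds_ratios)
```

An odds ratio above 1 for a predictor means that higher values of that predictor are associated with higher odds of the outcome being 1.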