
Heart Diseases Prediction using Spark 2.0.0


In this article, Md. Rezaul Karim and Md. Mahedi Kaysar, the authors of the book Large Scale Machine Learning with Spark, discuss how to develop a large-scale heart disease prediction pipeline with Spark 2.0.0, covering steps such as taking input, parsing, creating labeled points for regression, model training, model saving, and finally predictive analytics using the trained model. They develop a large-scale machine learning application using several classifiers, namely the random forest, decision tree, and linear regression classifiers. To make this happen, the following steps will be covered:

  • Data collection and exploration
  • Loading required packages and APIs
  • Creating an active Spark session
  • Data parsing and creating an RDD of labeled points
  • Splitting the RDD of labeled points into training and test sets
  • Training the model
  • Model saving for future use
  • Predictive analysis using the test set
  • Predictive analytics using the new dataset
  • Performance comparison among different classifiers


Background

Machine learning combined with big data is a radical combination that has created a great impact in research, academia, and industry, as well as in the biomedical sector. In biomedical data analytics, this combination promises better diagnosis and prognosis on real datasets, and thus better healthcare. Moreover, life science research is also entering the big data era, since datasets are being generated and produced at an unprecedented rate. This imposes great challenges on machine learning and bioinformatics tools and algorithms to find the VALUE out of the big data criteria: volume, velocity, variety, veracity, visibility, and value.

In this article, we will show how to predict the possibility of future heart disease by using Spark machine learning APIs including Spark MLlib, Spark ML, and Spark SQL.

Data collection and exploration

In recent times, biomedical research has advanced rapidly, and more and more life science datasets are being generated, many of them openly available. For simplicity and ease, however, we decided to use the Cleveland database, because to date most researchers who have applied machine learning techniques to biomedical data analytics have used this dataset. According to the dataset description at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names, the heart disease dataset is one of the most used and best-studied datasets in biomedical data analytics and machine learning research.

The dataset is freely available at the UCI machine learning dataset repository at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. The data contains 76 attributes in total; however, most published research papers use a subset of only 14 of these fields. The goal field indicates whether heart disease is present or absent. It has 5 possible values, ranging from 0 to 4: the value 0 signifies no presence of heart disease, values 1 and 2 signify that the disease is present but in a primary stage, and values 3 and 4 indicate a strong possibility of heart disease. Biomedical laboratory experiments with the Cleveland dataset have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0). In short, the higher the value, the stronger the evidence of the presence of the disease. Another point is that privacy is an important concern in biomedical data analytics, as in all kinds of diagnosis and prognosis. Therefore, the names and social security numbers of the patients were removed from the dataset to avoid privacy issues, and those values have been replaced with dummy values instead.

It should be noted that three files have been processed, containing the Cleveland, Hungarian, and Switzerland datasets; all four unprocessed files also exist in this directory. To demonstrate the example, we will use the Cleveland dataset for training and evaluating the models, while the Hungarian dataset will be used to reuse the saved model. As already mentioned, although the total number of attributes is 76 (including the predicted attribute), we will, like other ML/biomedical researchers, use only 14 attributes with the following attribute information:

No. | Attribute name | Explanation
1   | age       | Age in years
2   | sex       | Sex (1 = male; 0 = female)
3   | cp        | Chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
4   | trestbps  | Resting blood pressure (in mm Hg on admission to the hospital)
5   | chol      | Serum cholesterol in mg/dl
6   | fbs       | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7   | restecg   | Resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = probable or definite left ventricular hypertrophy by Estes' criteria)
8   | thalach   | Maximum heart rate achieved
9   | exang     | Exercise-induced angina (1 = yes; 0 = no)
10  | oldpeak   | ST depression induced by exercise relative to rest
11  | slope     | Slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
12  | ca        | Number of major vessels (0-3) coloured by fluoroscopy
13  | thal      | Thallium scan result (3 = normal; 6 = fixed defect; 7 = reversible defect)
14  | num       | Diagnosis of heart disease, angiographic disease status (0 = < 50% diameter narrowing; 1 = > 50% diameter narrowing)

Table 1: Dataset characteristics

Note that several missing attribute values are distinguished with the value -9.0. The Cleveland dataset contains the following class distribution:

Database:     0     1     2     3     4   Total

Cleveland:  164    55    36    35    13     303

A sample snapshot of the dataset is given as follows:


Figure 1: Snapshot of the Cleveland's heart diseases dataset

Loading required packages and APIs

The following packages and APIs need to be imported for our purpose. We believe the imports are self-explanatory if you have minimal working experience with Spark 2.0.0:

import java.util.HashMap;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.DenseVector;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.example.SparkSession.UtilityForSparkSession;
import scala.Tuple2;

Creating an active Spark session

SparkSession spark = UtilityForSparkSession.mySession();

Here is the UtilityForSparkSession class that creates and returns an active Spark session:

import org.apache.spark.sql.SparkSession;
public class UtilityForSparkSession {
  public static SparkSession mySession() {
    SparkSession spark = SparkSession
                             .builder()
                             .appName("UtilityForSparkSession")
                             .master("local[*]")
                             .config("spark.sql.warehouse.dir", "E:/Exp/")
                             .getOrCreate();
    return spark;
  }
}

Note that here, on the Windows 7 platform, we have set the Spark SQL warehouse to "E:/Exp/"; set your path accordingly based on your operating system.
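
If you are on Linux or macOS instead, here is a minimal sketch of the same builder with an assumed writable local directory (the path is only an example, not a requirement):

SparkSession spark = SparkSession
                         .builder()
                         .appName("UtilityForSparkSession")
                         .master("local[*]")
                         .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") // example path; choose any writable directory
                         .getOrCreate();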

Data parsing and creating an RDD of labeled points

We take the input as a simple text file, parse the lines, and create an RDD of labeled points that will be used for the classification and regression analysis. We also specify the input source and the number of partitions; adjust the number of partitions based on your dataset size. Here the number of partitions has been set to 2:

String input = "heart_diseases/processed_cleveland.data";
Dataset<Row> my_data = spark.read().format("com.databricks.spark.csv").load(input);
my_data.show(false);
RDD<String> linesRDD = spark.sparkContext().textFile(input, 2);

Since a JavaRDD cannot be created directly from the text file, we created a plain RDD first so that it can be converted into a JavaRDD when necessary. Now let's create the JavaRDD of LabeledPoint; converting the RDD to a JavaRDD to serve our purpose goes as follows:

JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() {
      @Override
      public LabeledPoint call(String row) throws Exception {
        String line = row.replaceAll("\\?", "999999.0"); // '?' marks a missing value; escape it for the regex
        String[] tokens = line.split(",");
        Integer last = Integer.parseInt(tokens[13]);
        double[] features = new double[13];
        for (int i = 0; i < 13; i++) {
          features[i] = Double.parseDouble(tokens[i]);
        }
        Vector v = new DenseVector(features);
        Double value = 0.0;
        if (last.intValue() > 0)
          value = 1.0;
        LabeledPoint lp = new LabeledPoint(value, v);
        return lp;
      }
    });

Using the replaceAll() method, we handle invalid entries such as missing values, which are marked in the original file with the ? character. To get rid of missing or invalid values, we replace them with a very large sentinel value that has no bearing on the original classification or predictive results. The reason behind this is that missing or sparse data can lead to highly misleading results.
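
If you prefer not to impute a sentinel value, an alternative (shown only as a sketch; it simply drops incomplete records and therefore shrinks the dataset) is to filter out any line containing the ? character before mapping to labeled points:

JavaRDD<String> completeLines = linesRDD.toJavaRDD().filter(new Function<String, Boolean>() {
      @Override
      public Boolean call(String row) {
        return !row.contains("?"); // keep only records with no missing values
      }
    });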

Splitting the RDD of labeled points into training and test sets

Well, in the previous step we created the RDD of labeled points that can be used for the regression or classification task. Now we need to split the data into a training set and a test set, as follows:

  double[] weights = {0.7, 0.3};
  long split_seed = 12345L;
  JavaRDD<LabeledPoint>[] split = data.randomSplit(weights, split_seed);
  JavaRDD<LabeledPoint> training = split[0];
  JavaRDD<LabeledPoint> test = split[1];

If you look at the preceding code segment, you will see that we split the RDD of labeled points so that 70% goes to the training set and 30% to the test set; the randomSplit() method does this split. You can optionally persist the resulting RDDs so that their values are kept across operations after the first time they are computed (a new storage level can only be assigned if the RDD does not have one set yet); see the sketch below. The split seed is a long integer that makes the split random yet reproducible, so the result does not change on each run or iteration during model building or training.
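
A minimal sketch of caching the splits (the training set will be scanned repeatedly during the SGD iterations in the next step); using persist() with an explicit storage level would additionally require importing org.apache.spark.storage.StorageLevel:

training.cache(); // keep the training split in memory across iterations
test.cache();     // the test split is reused for every evaluation below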

Training the model and predicting the possibility of heart disease

In the first place, we will train the linear regression model, which is the simplest regression classifier.

final double stepSize = 0.0000000009;
final int numberOfIterations = 40; 
LinearRegressionModel model = LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numberOfIterations, stepSize);  

As you can see, the preceding code trains a linear regression model with no regularization, using stochastic gradient descent (SGD). It solves the least squares regression formulation f(weights) = (1/n) * ||A * weights - y||^2, that is, the mean squared error. Here the data matrix A has n rows, and the input RDD holds the rows of A, each with its corresponding right-hand-side label y. To train the model, the method takes the training set, the number of iterations, and the step size; we provide somewhat arbitrary values for the last two parameters.
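
Because a regression model emits continuous predictions, the mean squared error (MSE) on the test set is a more natural quality measure than exact-match accuracy. Here is a minimal sketch, assuming Java 8 (so that model is effectively final) and one extra import, org.apache.spark.api.java.function.DoubleFunction:

double testMSE = test.mapToDouble(new DoubleFunction<LabeledPoint>() {
      @Override
      public double call(LabeledPoint p) {
        double error = model.predict(p.features()) - p.label();
        return error * error; // squared error for this example
      }
    }).mean();
System.out.println("Test MSE: " + testMSE);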

Model saving for future use

Now let's save the model we just created for future use. It's pretty simple: just use the following code, specifying the storage location as follows:

String model_storage_loc = "models/heartModel";  
model.save(spark.sparkContext(), model_storage_loc);

Once the model is saved in your desired location, you will see the following output in your Eclipse console:


Figure 2: The log after model saved to the storage

Predictive analysis using the test set

Now let's calculate the prediction score on the test dataset:

JavaPairRDD<Double, Double> predictionAndLabel =
          test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
            @Override
            public Tuple2<Double, Double> call(LabeledPoint p) {
              return new Tuple2<>(model.predict(p.features()), p.label());
            }
          });  

Compute the accuracy of the prediction:

double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
          @Override
          public Boolean call(Tuple2<Double, Double> pl) {
            return pl._1().equals(pl._2());
          }
        }).count() / (double) test.count();
System.out.println("Accuracy of the classification: "+accuracy);  

The output goes as follows:

Accuracy of the classification: 0.0 

Performance comparison among different classifiers

Unfortunately, there is no prediction accuracy at all, right? One immediate cause is that the linear regression model produces continuous predictions, which we compare with the 0/1 labels for exact equality, so the predictions almost never match (a simple thresholding fix is sketched after the list below). More generally, there might be several reasons for this, including:

  • The dataset characteristics
  • Model selection
  • Parameter selection, also known as hyperparameter tuning
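
A minimal sketch of the thresholding fix mentioned above (0.5 is an arbitrary cut-off chosen here for illustration; it reuses the linear regression model and the test set from the previous steps):

JavaPairRDD<Double, Double> thresholdedPredictionAndLabel =
          test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
            @Override
            public Tuple2<Double, Double> call(LabeledPoint p) {
              double raw = model.predict(p.features());   // continuous prediction
              double predicted = raw > 0.5 ? 1.0 : 0.0;   // binarize at 0.5
              return new Tuple2<>(predicted, p.label());
            }
          });
double thresholdedAccuracy = thresholdedPredictionAndLabel
          .filter(new Function<Tuple2<Double, Double>, Boolean>() {
            @Override
            public Boolean call(Tuple2<Double, Double> pl) {
              return pl._1().equals(pl._2());
            }
          }).count() / (double) test.count();
System.out.println("Accuracy after thresholding: " + thresholdedAccuracy);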

Well, for simplicity, we assume the dataset is okay, since, as already mentioned, it is a widely used dataset employed by many machine learning researchers around the globe. Now, what next? Let's consider another classifier algorithm, for example the random forest or decision tree classifier. What about the random forest? Let's go for the random forest classifier in second place; just use the code below to train the model using the training set:

Integer numClasses = 2; // Binary classification: heart disease present (1) or absent (0)
// An empty map means all features are treated as continuous; add categorical feature
// indices and their arities here to constrain the tree construction.
HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
Integer numTrees = 5; // Use more in practice.
String featureSubsetStrategy = "auto"; // Let the algorithm choose the best strategy
String impurity = "gini"; // Information gain (entropy) and variance reduction are also available
Integer maxDepth = 20; // Set the maximum depth accordingly
Integer maxBins = 40; // Set the number of bins accordingly
Integer seed = 12345; // A fixed seed makes the run reproducible
final RandomForestModel model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);

We believe the parameters used by the trainClassifier() method are self-explanatory, so we leave it to the reader to learn the significance of each parameter. Fantastic! We have trained the model using the random forest classifier and can save this model too for future use. Now, if you reuse the same code described in the Predictive analysis using the test set step, you should get output like the following:

Accuracy of the classification: 0.7843137254901961 

Much better, right? If you are still not satisfied, you can try another classifier model, such as the Naïve Bayes classifier. Before moving on, make sure the model saved for reuse is this random forest model, as sketched below.
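
A minimal sketch of saving the random forest model so it can be reloaded in the next section (assuming any model previously saved at model_storage_loc has been removed first, since save() typically fails if the path already exists):

model.save(spark.sparkContext(), model_storage_loc); // persist the random forest model for reuse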

Predictive analytics using the new dataset

As we already mentioned, we saved the model for future use; now we should take the opportunity to use the same model on a new dataset. The reason is that, if you recall the steps, we trained the model using the training set and evaluated it using the test set. Now, what if you have more data, or new data, available to be used? Would you retrain the model? Of course not, since you would have to iterate over several steps and sacrifice valuable time and cost.

Therefore, it would be wiser to use the already trained model and measure its performance on a new dataset. Well, now let's reuse the stored model. Note that you will have to load it with the same kind of model that was trained and saved: for example, if the model was trained and saved with the random forest classifier, it has to be loaded as a random forest model. Therefore, we will use the random forest model to load the saved model when using the new dataset. Just use the following code to do that.

Now create an RDD of labeled points from the new dataset (that is, the Hungarian database with the same 14 attributes):

String new_data = "heart_diseases/processed_hungarian.data";
RDD<String> linesRDD = spark.sparkContext().textFile(new_data, 2);
    
JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() {
      @Override
      public LabeledPoint call(String row) throws Exception {
        String line = row.replaceAll("\\?", "999999.0"); // '?' marks a missing value; escape it for the regex
        String[] tokens = line.split(",");
        Integer last = Integer.parseInt(tokens[13]);
        double[] features = new double[13];
        for (int i = 0; i < 13; i++) {
          features[i] = Double.parseDouble(tokens[i]);
        }
        Vector v = new DenseVector(features);
        Double value = 0.0;
        if (last.intValue() > 0)
          value = 1.0;
        LabeledPoint p = new LabeledPoint(value, v);
        return p;
      }
    });

Now let's load the saved model as a random forest model, as follows:

RandomForestModel model2 = 
RandomForestModel.load(spark.sparkContext(), model_storage_loc);

Now let's calculate the predictions on the new dataset:

JavaPairRDD<Double, Double> predictionAndLabel =
          data.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
            @Override
            public Tuple2<Double, Double> call(LabeledPoint p) {
              return new Tuple2<>(model2.predict(p.features()), p.label());
            }
          });

Now calculate the accuracy of the prediction as follows:

double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
          @Override
          public Boolean call(Tuple2<Double, Double> pl) {
            return pl._1().equals(pl._2());
          }
        }).count() / (double) data.count();
System.out.println("Accuracy of the classification: "+accuracy);  

We got the following output:

Accuracy of the classification: 0.7380952380952381

For more interesting machine learning applications, such as spam filtering, topic modelling for real-time streaming data, handling graph data for machine learning, market basket analysis, neighborhood clustering analysis, air flight delay analysis, making the ML application adaptable, model saving and reusing, hyperparameter tuning and model selection, breast cancer diagnosis and prognosis, heart disease prediction, optical character recognition, hypothesis testing, dimensionality reduction for high-dimensional data, large-scale text manipulation, and many more, see the book. Moreover, the book also covers how to scale up ML models to handle massive datasets on cloud computing infrastructure, and some best practices in machine learning techniques are discussed as well.

In a nutshell, many useful and exciting applications have been developed using the following machine learning algorithms:

  • Linear Support Vector Machine (SVM)
  • Linear Regression
  • Logistic Regression
  • Decision Tree Classifier
  • Random Forest Classifier
  • K-means Clustering
  • LDA topic modelling from static and real-time streaming data
  • Naïve Bayes classifier
  • Multilayer Perceptron classifier for deep classification
  • Singular Value Decomposition (SVD) for dimensionality reduction
  • Principal Component Analysis (PCA) for dimensionality reduction
  • Generalized Linear Regression
  • Chi Square Test (for goodness of fit test, independence test, and feature test)
  • Kolmogorov-Smirnov test for hypothesis testing
  • Spark Core for Market Basket Analysis
  • Multi-label classification
  • One Vs Rest classifier
  • Gradient Boosting classifier
  • ALS algorithm for movie recommendation
  • Cross-validation for model selection
  • Train-validation split for model selection
  • RegexTokenizer, StringIndexer, StopWordsRemover, HashingTF and TF-IDF for text manipulation

Summary

In this article, we saw how beneficial large-scale machine learning with Spark can be in virtually any field.
