In this article by Alexey Grigorev, author of the book Mastering Java for Data Science, we will look at how to do pre-processing of data in Java and how to do Exploratory Data Analysis inside and outside Java. Now that we have covered the foundations, we are ready to start creating Machine Learning models.

First, we start with Supervised Learning. In the supervised setting, we have some information attached to each observation, called labels, and we want to learn from it and predict it for observations without labels.

There are two types of labels: the first are discrete and finite, such as True/False or Buy/Sell, and the second are continuous, such as salary or temperature. These types correspond to two types of Supervised Learning: Classification and Regression. We will talk about them in this article.

This article covers:

Classification problems
Regression Problems
Evaluation metrics for each type
Overview of available implementations in Java

Classification

In Machine Learning, the classification problem deals with discrete targets with a finite set of possible values.

Binary Classification is the most common type of classification problem: the target variable can have only two possible values, such as True/False, Relevant/Not Relevant, Duplicate/Not Duplicate, Cat/Dog and so on.

Sometimes the target variable can have more than two outcomes, for example: colors, category of an item, model of a car, and so on, and we call it Multi-Class Classification. Typically each observation can only have one label, but in some settings an observation can be assigned several values.

Typically, Multi-Class Classification can be converted to a set of Binary Classification problems, which is why we will mostly concentrate on Binary Classification.
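
One common way to do this conversion is the one-vs-rest scheme: for each class we build a binary problem in which that class is the positive label and all other classes are negative. The sketch below shows only the label conversion. The names OneVsRest and binarize are made up for this illustration and do not come from any of the libraries discussed in this article.

```java
// Illustrative sketch (hypothetical helper, not part of any library):
// converts multi-class labels 0..numClasses-1 into one binary label array per class.
final class OneVsRest {

    static int[][] binarize(int[] labels, int numClasses) {
        int[][] binary = new int[numClasses][labels.length];
        for (int c = 0; c < numClasses; c++) {
            for (int i = 0; i < labels.length; i++) {
                // 1 if the observation belongs to class c, 0 for every other class
                binary[c][i] = labels[i] == c ? 1 : 0;
            }
        }
        return binary;
    }
}
```

Each row of the result can then be used as the label array of a separate binary classifier; at prediction time one would typically pick the class whose classifier gives the highest score.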

Binary Classification Models

There are many models for solving the binary classification problem and it is not possible to cover all of them. We will briefly cover the ones that are most often used in practice.

They include:

Logistic Regression
Support Vector Machines
Decision Trees
Neural Networks

We assume that you are already familiar with these methods. Deep familiarity is not required, but for more information you can check the following books:

An Introduction to Statistical Learning by G. James, D. Witten, T. Hastie and R. Tibshirani
Python Machine Learning by S. Raschka

When it comes to libraries, we will cover the following: Smile, JSAT, LIBSVM, LIBLINEAR and Encog.

Smile

Smile (Statistical Machine Intelligence and Learning Engine) is a library with a large set of classification and other machine learning algorithms. You can see the full list here: https://github.com/haifengl/smile.

The library is available on Maven Central and the latest version at the moment of writing is 1.1.0. To include it in your project, add the following dependency:

<dependency> 
  <groupId>com.github.haifengl</groupId> 
  <artifactId>smile-core</artifactId> 
  <version>1.1.0</version> 
</dependency>

It is being actively developed; new features and bug fixes are added quite often, but releases are less frequent. We recommend using the latest available version of Smile; to get it, you need to build it from the sources. To do this:

Install sbt, a tool for building Scala projects. You can follow these instructions: http://www.scala-sbt.org/release/docs/Manual-Installation.html
Use git to clone the project from https://github.com/haifengl/smile
To build the library and publish it to the local Maven repository, run the following command:
```
sbt core/publishM2
```

The Smile library consists of several sub-modules, such as smile-core, smile-nlp, smile-plot and so on. So, after building it, add the following dependency to your pom:

<dependency> 
  <groupId>com.github.haifengl</groupId> 
  <artifactId>smile-core</artifactId> 
  <version>1.2.0</version> 
</dependency>

The models from Smile expect the data in the form of a two-dimensional array of doubles, and the label information as a one-dimensional array of integers. For binary models, the values should be 0 or 1. Some models in Smile can handle Multi-Class Classification problems, so it is possible to have more labels.

The models are built using the Builder pattern: you create a special class, set some parameters and at the end it returns the object it builds. In Smile this builder class is typically called Trainer, and all models should have a trainer for them.

For example, consider training a Random Forest model:

double[][] X = ... // training data, one row per observation
int[] y = ... // 0 and 1 labels
RandomForest model = new RandomForest.Trainer()
    .setNumTrees(100) 
    .setNodeSize(4)
    .setSamplingRates(0.7)
    .setSplitRule(SplitRule.ENTROPY)
    .setNumRandomFeatures(3)
    .train(X, y);

The RandomForest.Trainer class takes in a set of parameters and the training data, and at the end produces the trained Random Forest model. The implementation of Random Forest from Smile has the following parameters:

numTrees: number of trees to train in the model.
nodeSize: the minimum number of items in the leaf nodes.
samplingRate: the ratio of training data used to grow each tree.
splitRule: the impurity measure used for selecting the best split.
numRandomFeatures: the number of features the model randomly chooses for selecting the best split.

Similarly, a logistic regression is trained as follows:

LogisticRegression lr = new LogisticRegression.Trainer()
        .setRegularizationFactor(lambda)
        .train(X, y);

Once we have a model, we can use it for predicting the label of previously unseen items. For that we use the predict method:

double[] row = // data 
int prediction = model.predict(row);

This code outputs the most probable class for the given item. However, often we are interested not in the label itself, but in the probability of having the label. If a model implements the SoftClassifier interface, then it is possible to get these probabilities like this:

double[] probs = new double[2];
model.predict(row, probs);

After running this code, the probs array will contain the probabilities.

JSAT

JSAT (Java Statistical Analysis Tool) is another Java library which contains a lot of implementations of common Machine Learning algorithms. You can check the full list of implemented models at https://github.com/EdwardRaff/JSAT/wiki/Algorithms.

To include JSAT in a Java project, add this to the pom:

<dependency> 
  <groupId>com.edwardraff</groupId> 
  <artifactId>JSAT</artifactId> 
  <version>0.0.5</version> 
</dependency>

Unlike Smile, which just takes arrays of doubles, JSAT requires a special wrapper class for data instances. If we have an array, it is converted to the JSAT representation like this:

double[][] X = ... // data
int[] y = ... // labels

// use more categories here for multi-class classification
CategoricalData binary = new CategoricalData(2); 

List<DataPointPair<Integer>> data = new ArrayList<>(X.length);
for (int i = 0; i < X.length; i++) {
    int target = y[i];
    DataPoint row = new DataPoint(new DenseVector(X[i]));
    data.add(new DataPointPair<Integer>(row, target));
}

ClassificationDataSet dataset = new ClassificationDataSet(data, binary);

Once we have prepared the dataset, we can train a model. Let us consider the Random Forest classifier again:

RandomForest model = new RandomForest();
model.setFeatureSamples(4);
model.setMaxForestSize(150);
model.trainC(dataset);

First, we set some parameters for the model, and then, at the end, we call the trainC method (which means “train a classifier”).

In the JSAT implementation, Random Forest has fewer options for tuning: only the number of features to select and the number of trees to grow.

There are several implementations of Logistic Regression. The usual Logistic Regression model does not have any parameters, and it is trained like this:

LogisticRegression model = new LogisticRegression();
model.trainC(dataset);

If we want to have a regularized model, then we need to use the LogisticRegressionDCD class (DCD stands for “Dual Coordinate Descent” - this is the optimization method used to train the logistic regression). We train it like this:

LogisticRegressionDCD model = new LogisticRegressionDCD();
model.setMaxIterations(maxIterations);
model.setC(C);
model.trainC(dataset);

In this code, C is the regularization parameter, and the smaller values of C correspond to stronger regularization effect.

Finally, for outputting the probabilities, we can do the following:

double[] row = // data
DenseVector vector = new DenseVector(row);
DataPoint point = new DataPoint(vector);
CategoricalResults out = model.classify(point);
double probability = out.getProb(1);

The class CategoricalResults contains a lot of information, including probabilities for each class and the most likely label.

LIBSVM and LIBLINEAR

Next we consider two similar libraries: LIBSVM and LIBLINEAR.

LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) is a library that implements Support Vector Machine models, including support vector classifiers.
LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) is a library for fast linear classification algorithms such as Linear SVM and Logistic Regression.

Both these libraries come from the same research group and have very similar interfaces.

LIBSVM is implemented in C++ and has an officially supported java version. It is available on Maven Central:

<dependency>
  <groupId>tw.edu.ntu.csie</groupId>
  <artifactId>libsvm</artifactId>
  <version>3.17</version>
</dependency>

Note that the Java version of LIBSVM is not updated as often as the C++ version. Nevertheless, the version above is stable and should not contain bugs, but it might be slower than the C++ version.

To use SVM models from LIBSVM, you first need to specify the parameters. For this, you create an svm_parameter object. Inside, you can specify many parameters, including:

the kernel type (RBF, POLY or LINEAR),
the regularization parameter C,
probability, which you can set to 1 to be able to get probabilities,
svm_type, which should be set to C_SVC to tell the library that the model is a classifier.

Here is an example of how you can configure an SVM classifier with the Linear kernel which can output probabilities:

svm_parameter param = new svm_parameter();
param.svm_type = svm_parameter.C_SVC;
param.kernel_type = svm_parameter.LINEAR;
param.probability = 1;
param.C = C;

// default parameters
param.cache_size = 100;
param.eps = 1e-3;
param.p = 0.1;
param.shrinking = 1;

The polynomial kernel is specified by the following formula, where <u, v> is the dot product of the two vectors:

K(u, v) = (gamma * <u, v> + coef0)^degree

It has three additional parameters: gamma, coef0 and degree; and also C, the regularization parameter. You can specify it like this:

svm_parameter param = new svm_parameter();
param.svm_type = svm_parameter.C_SVC;
param.kernel_type = svm_parameter.POLY;
param.C = C;
param.degree = degree;
param.gamma = 1;
param.coef0 = 1;
param.probability = 1;
// plus defaults from the above

Finally, the Gaussian kernel (or RBF) has the following formula:

K(u, v) = exp(-gamma * ||u - v||^2)

So there is one parameter, gamma, which controls the width of the Gaussians. We can specify the model with the RBF kernel like this:

svm_parameter param = new svm_parameter();
param.svm_type = svm_parameter.C_SVC;
param.kernel_type = svm_parameter.RBF;
param.C = C;
param.gamma = gamma;
param.probability = 1;
// plus defaults from the above
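
To make the two kernels more concrete, here is a plain-Java sketch that evaluates them on dense vectors, following the way LIBSVM documents them. The Kernels class is only an illustration and is not part of LIBSVM, which computes kernel values internally on its own sparse representation.

```java
// Illustrative sketch: dense-vector versions of two LIBSVM kernels.
// Kernels is a hypothetical helper class, not part of LIBSVM.
final class Kernels {

    // Polynomial kernel: (gamma * <u, v> + coef0) ^ degree
    static double poly(double[] u, double[] v, double gamma, double coef0, int degree) {
        return Math.pow(gamma * dot(u, v) + coef0, degree);
    }

    // RBF (Gaussian) kernel: exp(-gamma * ||u - v||^2)
    static double rbf(double[] u, double[] v, double gamma) {
        double squaredDistance = 0.0;
        for (int i = 0; i < u.length; i++) {
            double diff = u[i] - v[i];
            squaredDistance += diff * diff;
        }
        return Math.exp(-gamma * squaredDistance);
    }

    // Dot product of two vectors of the same length
    private static double dot(double[] u, double[] v) {
        double sum = 0.0;
        for (int i = 0; i < u.length; i++) {
            sum += u[i] * v[i];
        }
        return sum;
    }
}
```

The RBF value is 1 for identical vectors and decays towards 0 as the vectors move apart; a larger gamma makes it decay faster, which corresponds to narrower Gaussians.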

Once we have created the configuration object, we need to convert the data to the right format. The LIBSVM command line application reads files in the SVMLight format, so the library also expects a sparse data representation.

For a single array, the conversion is as follows:

double[] dataRow = // single row vector
svm_node[] svmRow = new svm_node[dataRow.length];

for (int j = 0; j < dataRow.length; j++) {
    svm_node node = new svm_node();
    node.index = j;
    node.value = dataRow[j];
    svmRow[j] = node;
}

For a matrix, we do this for every row:

double[][] X = ... // data
int n = X.length;
svm_node[][] nodes = new svm_node[n][];

for (int i = 0; i < n; i++) {
    nodes[i] = wrapAsSvmNode(X[i]);
}

Here wrapAsSvmNode is a function that wraps a vector into an array of svm_node objects.

Now we can put the data and the labels together into an svm_problem object:

double[] y = ... // labels 
svm_problem prob = new svm_problem();
prob.l = n;
prob.x = nodes;
prob.y = y;

Now using the data and the parameters we can train the SVM model:

svm_model model = svm.svm_train(prob, param);

Once the model is trained, we can use it for classifying unseen data. Getting probabilities is done this way:

double[][] X = // test data
int n = X.length;
double[] results = new double[n];
double[] probs = new double[2];

for (int i = 0; i < n; i++) {
    svm_node[] row = wrapAsSvmNode(X[i]);
    svm.svm_predict_probability(model, row, probs);
    results[i] = probs[1];
}

Since we set param.probability = 1, we can use the svm.svm_predict_probability method to predict probabilities. As in Smile, the method takes an array of doubles and writes the class probabilities into it, so after the call we can read them from that array.

Finally, while training, LIBSVM outputs a lot of things on the console. If we are not interested in this output, we can disable it with the following code snippet:

svm.svm_set_print_string_function(s -> {});

Just add this in the beginning of your code and you will not see this anymore.

The next library is LIBLINEAR, which provides very fast and high performing linear classifiers such as SVM with Linear Kernel and Logistic Regression. It can easily scale to tens and hundreds of millions of data points.

Unlike LIBSVM, there is no official Java version of LIBLINEAR, but there is an unofficial Java port available at http://liblinear.bwaldvogel.de/. To use it, include the following:

<dependency>
  <groupId>de.bwaldvogel</groupId>
  <artifactId>liblinear</artifactId>
  <version>1.95</version>
</dependency>

The interface is very similar to LIBSVM. First, you define the parameters:

SolverType solverType = SolverType.L1R_LR;
double C = 0.001;
double eps = 0.0001; 
Parameter param = new Parameter(solverType, C, eps);

We have to specify three parameters here:

solverType: defines the model which will be used;
C: the amount of regularization; the smaller C, the stronger the regularization;
eps: the tolerance for stopping the training process. A reasonable default is 0.0001.

For classification there are the following solvers:

Logistic Regression: L1R_LR or L2R_LR
SVM: L1R_L2LOSS_SVC or L2R_L2LOSS_SVC

According to the official FAQ (which can be found here: https://www.csie.ntu.edu.tw/~cjlin/liblinear/FAQ.html) you should:

Prefer SVM to Logistic Regression, as it trains faster and its accuracy is usually competitive.
Try L2 regularization first, unless you need a sparse solution; in this case use L1.
The default solver is the L2-regularized linear support vector classifier solving the dual problem. If it is slow, try solving the primal problem instead.

Then you define the dataset. Like previously, let's first see how to wrap a row:

double[] row = // data
int m = row.length;
Feature[] result = new Feature[m];

for (int i = 0; i < m; i++) {
    result[i] = new FeatureNode(i + 1, row[i]);
}

Please note that we add 1 to the index – the 0 is the bias term, so the actual features should start from 1.

We can put this into a function wrapRow and then wrap the entire dataset as follows:

double[][] X = // data
int n = X.length;
Feature[][] matrix = new Feature[n][];
for (int i = 0; i < n; i++) {
    matrix[i] = wrapRow(X[i]);
}

Now we can create the Problem class with the data and labels:

double[] y = // labels

Problem problem = new Problem();
problem.x = wrapMatrix(X);
problem.y = y;
problem.n = X[0].length + 1;
problem.l = X.length;

Note that here we also need to provide the dimensionality of the data, and it's the number of features plus one. We need to add one because it includes the bias term.

Now we are ready to train the model:

Model model = Linear.train(problem, param);

When the model is trained, we can use it to classify unseen data. In the following example we will output probabilities:

double[] dataRow = ... // data
Feature[] row = wrapRow(dataRow);
double[] probs = new double[2]; // one probability per class
Linear.predictProbability(model, row, probs);
double result = probs[1];

The code above works fine for the Logistic Regression model, but it will not work for SVM: SVM cannot output probabilities, so the code above will throw an error for solvers like L1R_L2LOSS_SVC. What you can do instead is get the raw output:

double[] values = new double[1];
Feature[] row = wrapRow(dataRow);
Linear.predictValues(model, row, values);
double result = values[0];

In this case the results will not contain probability, but some real value. When this value is greater than zero, the model predicts that the class is positive.

If we would like to map this value to the [0, 1] range, we can use the sigmoid function for that:

public static double[] sigmoid(double[] scores) {
    double[] result = new double[scores.length];

    for (int i = 0; i < result.length; i++) {
        result[i] = 1 / (1 + Math.exp(-scores[i]));
    }

    return result;
}

Finally, like LIBSVM, LIBLINEAR also outputs a lot of things to standard output. If you do not wish to see it, you can mute it with the following code:

PrintStream devNull = new PrintStream(new NullOutputStream());
Linear.setDebugOutput(devNull);

Here, we use NullOutputStream from Apache Commons IO, which discards everything written to it, so the screen stays clean.

When to use LIBSVM and when to use LIBLINEAR? For large datasets it is often not feasible to use kernel methods. In this case you should prefer LIBLINEAR. Additionally, LIBLINEAR is especially good for Text Processing purposes such as Document Classification.

Encog

Finally, we consider a library for training Neural Networks: Encog. It is available on Maven Central and can be added with the following snippet:

<dependency> 
  <groupId>org.encog</groupId> 
  <artifactId>encog-core</artifactId> 
  <version>3.3.0</version> 
</dependency>

With this library you first need to specify the network architecture:

BasicNetwork network = new BasicNetwork();
network.addLayer(new BasicLayer(new ActivationSigmoid(), true, noInputNeurons));
network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 30));
network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 1));
network.getStructure().finalizeStructure();
network.reset();

Here we create a network with one input layer, one inner layer with 30 neurons and one output layer with 1 neuron. In each layer we use sigmoid as the activation function and add the bias input (the true parameter).

The last line randomly initializes the weights in the network.

For both input and output Encog expects two-dimensional double arrays. In case of binary classification we typically have a one dimensional array, so we need to convert it:

double[][] X = // data
double[] y = // labels
double[][] y2d = new double[y.length][];

for (int i = 0; i < y.length; i++) {
    y2d[i] = new double[] { y[i] };
}

Once the data is converted, we wrap it into a special wrapper class:

MLDataSet dataset = new BasicMLDataSet(X, y2d);

Then this dataset can be used for training:

MLTrain trainer = new ResilientPropagation(network, dataset);
double lambda = 0.01;
trainer.addStrategy(new RegularizationStrategy(lambda));

int noEpochs = 101;
for (int i = 0; i < noEpochs; i++) {
    trainer.iteration();
}

There are a lot of other Machine Learning libraries available in Java, for example Weka, H2O, JavaML and others. It is not possible to cover all of them, but you can try them and see if you like them more than the ones we have covered.

Evaluation

We have covered many Machine Learning libraries, and many of them implement the same algorithms like Random Forest or Logistic Regression. Also, each individual model can have many different parameters: a Logistic Regression has the regularization coefficient, an SVM is configured by setting the kernel and its parameters.

How do we select the best single model out of so many possible variants?

For that we first define some evaluation metric and then select the model which achieves the best possible performance with respect to this metric. For binary classification we can select one of the following metrics:

Accuracy and Error
Precision, Recall and F1
AUC (AU ROC)

After that we look at how to validate the results:

Result Validation
K-Fold Cross Validation
Training, Validation and Testing

Accuracy

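Accuracy is the fraction of examples for which the model predicted the correct label. Calculating it is trivial:

double[] actual = ... // true labels, 0.0 or 1.0
double[] proba = ... // predicted probabilities
double threshold = 0.5; // decision threshold; 0.5 is only an example value
int n = actual.length;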

double[] prediction = Arrays.stream(proba).map(p -> p > threshold ? 1.0 : 0.0).toArray();
int correct = 0;

for (int i = 0; i < n; i++) {
    if (actual[i] == prediction[i]) {
        correct++;
    }
}

double accuracy = 1.0 * correct / n;

Accuracy is the simplest evaluation metric and everybody understands it.

Precision, Recall and F1

In some cases accuracy is not the best measure of model performance.

For example, suppose we have an unbalanced dataset: only 1% of the examples are positive. Then a model which always predicts negative is right in 99% of the cases, and hence will have an accuracy of 0.99. But this model is not useful.
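
The arithmetic behind this example can be checked with a small sketch. The class AlwaysNegativeBaseline and its methods are hypothetical helpers written for this illustration; labels are encoded as 1.0 for positive and 0.0 for negative examples.

```java
import java.util.Arrays;

// Illustrative sketch: accuracy of a classifier that always predicts the negative class.
final class AlwaysNegativeBaseline {

    // labels are 1.0 for positive and 0.0 for negative examples
    static double accuracy(double[] actual) {
        int correct = 0;
        for (double label : actual) {
            if (label == 0.0) { // the constant prediction is 0.0
                correct++;
            }
        }
        return 1.0 * correct / actual.length;
    }

    // builds a label array of the given size whose first `positives` entries are 1.0
    static double[] labels(int size, int positives) {
        double[] y = new double[size];
        Arrays.fill(y, 0, positives, 1.0);
        return y;
    }
}
```

With 1000 examples of which 10 are positive, accuracy(labels(1000, 10)) is 0.99, even though the model never finds a single positive example.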

There are alternatives to accuracy that can overcome this problem. Precision and Recall are among these metrics: they both look at the fraction of positive items that the model correctly recognized.

They can be calculated using the Confusion Matrix: a table which summarizes the performance of a binary classifier:

[Figure: confusion matrix of a binary classifier, with the cells TP, FP, FN and TN]

Precision is the fraction of correctly predicted positive items among all items the model predicted positive. In terms of the confusion matrix, Precision is TP / (TP + FP).
Recall is the fraction of correctly predicted positive items among items that are actually positive. With values from the confusion matrix, Recall is TP / (TP + FN).
It is often hard to decide whether one should optimize Precision or Recall. But there is another metric which combines both Precision and Recall into one number: the F1 score, which is the harmonic mean of Precision and Recall.

For calculating Precision and Recall, we first need to calculate the values for the cells of the confusion matrix:

int tp = 0, tn = 0, fp = 0, fn = 0;

for (int i = 0; i < actual.length; i++) {
    if (actual[i] == 1.0 && proba[i] > threshold) {
        tp++;
    } else if (actual[i] == 0.0 && proba[i] <= threshold) {
        tn++;
    } else if (actual[i] == 0.0 && proba[i] > threshold) {
        fp++;
    } else if (actual[i] == 1.0 && proba[i] <= threshold) {
        fn++;
    }
}

Then we can use the values to calculate Precision and Recall:

double precision = 1.0 * tp / (tp + fp);
double recall = 1.0 * tp / (tp + fn);

Finally, F1 can be calculated using the following formula:

double f1 = 2 * precision * recall / (precision + recall);
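
Note that the snippets above divide by tp + fp and tp + fn. If the model predicts no positives at all, or there are no positive examples, these sums are zero and the double division silently produces NaN. A defensive variant could look like the sketch below. The BinaryMetrics class is a hypothetical helper, and returning 0.0 for an empty denominator is a convention chosen for this sketch, not something the libraries above prescribe.

```java
// Illustrative sketch: Precision, Recall and F1 from confusion-matrix counts.
// Returning 0.0 for an empty denominator is a convention chosen for this sketch.
final class BinaryMetrics {

    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : 1.0 * tp / (tp + fp);
    }

    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : 1.0 * tp / (tp + fn);
    }

    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }
}
```

For example, with tp = 8, fp = 2 and fn = 4, Precision is 8 / 10 = 0.8, Recall is 8 / 12 (about 0.667) and F1 is 16 / 22 (about 0.727). A model that always predicts negative has tp = 0 and fp = 0, so with this convention all three metrics are 0.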

ROC and AU ROC (AUC)

The metrics above are good for binary classifiers which produce hard output: they only tell whether an item should be assigned the positive or the negative label.

If instead our model outputs some score such that the higher the values of the score the more likely the item is to be positive, then the binary classifier is called a ranking classifier.

Most of the models can output probabilities of belonging to a certain class, and we can use them to rank examples such that the positive ones are likely to come first.

The ROC Curve visually tells us how well a ranking classifier separates positive examples from negative ones. The way a ROC curve is built is the following:

We sort the observations by their score and then starting from the origin we go up if the observation is positive and right if it is negative.

This way, in the ideal case, we first always go up, and then always go right – and this will result in the best possible ROC curve. In this case we can say that the separation between positive and negative examples is perfect.

If the separation is not perfect, but still OK, the curve will go up for positive examples, but sometimes will turn right when a misclassification occurred.

Finally, a bad classifier will not be able to tell positive and negative examples apart and the curve would alternate between going up and right.

[Figure: example ROC curves with the diagonal baseline]

The diagonal line on the plot represents the baseline: the performance that a random classifier would achieve. The further the curve is from the baseline, the better.

Unfortunately, there is no available easy-to-use implementation of ROC curves in Java.

So the algorithm for drawing a ROC curve is the following:

Let POS be number of positive labels, and NEG be the number of negative labels
Order data by the score, decreasing
Start from (0, 0)
For each example in the sorted order,
- if the example is positive, move 1 / POS up in the graph,
- otherwise, move 1 / NEG right in the graph.

This is a simplified algorithm and assumes that the scores are distinct. If the scores aren't distinct, and there are different actual labels for the same score, some adjustment needs to be made.
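
The steps above can be sketched in plain Java as follows. This is only an illustration of the simplified algorithm, not the RocCurve class from the book's source code: the RocPoints name is hypothetical, the scores are assumed to be distinct, and the data is assumed to contain at least one positive and one negative example. It computes the points of the curve and leaves the plotting aside.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch of the simplified ROC algorithm; assumes distinct scores
// and at least one positive and one negative example. RocPoints is a hypothetical helper.
final class RocPoints {

    // returns the (x, y) points of the curve, starting at (0, 0); actual holds 1.0 / 0.0 labels
    static double[][] compute(double[] actual, double[] scores) {
        int n = actual.length;
        long pos = Arrays.stream(actual).filter(a -> a == 1.0).count();
        long neg = n - pos;

        // indexes ordered by score, decreasing
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));

        double[][] points = new double[n + 1][2]; // points[0] is the origin
        double x = 0.0;
        double y = 0.0;
        for (int k = 0; k < n; k++) {
            if (actual[order[k]] == 1.0) {
                y += 1.0 / pos; // positive example: move up
            } else {
                x += 1.0 / neg; // negative example: move right
            }
            points[k + 1][0] = x;
            points[k + 1][1] = y;
        }
        return points;
    }
}
```

For labels {1, 0, 1, 0} with scores {0.9, 0.8, 0.7, 0.1}, the curve goes up, right, up and right again, ending at (1, 1).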

This algorithm is implemented in the class RocCurve, which you will find in the source code. You can use it as follows:

RocCurve.plot(actual, prediction);

Calling it will create a plot similar to this one:

[Figure: ROC curve plot produced by RocCurve.plot]

The area under the curve says how good the separation is. If the separation is very good, then the area will be close to one. But if the classifier cannot distinguish between positive and negative examples, the curve will go around the random baseline curve, and the area will be close to 0.5.

Area Under the Curve is often abbreviated as AUC, or, sometimes, AU ROC – to emphasize that the Curve is a ROC Curve.

AUC has a very nice interpretation: the value of AUC corresponds to probability that a randomly selected positive example is scored higher than a randomly selected negative example. Naturally, if this probability is high, our classifier does a good job separating positive and negative examples.

This makes AUC a go-to evaluation metric for many cases, especially when the dataset is unbalanced, in the sense that there are a lot more examples of one class than another.
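
The probabilistic interpretation can be turned directly into code: we go over all pairs of one positive and one negative example and count how often the positive one gets the higher score. The sketch below does exactly that. The PairwiseAuc class is a hypothetical helper, counting ties as half a correct pair is a common convention assumed here, and the running time is quadratic, so it is meant for understanding and for checking results on small data sets rather than for production use.

```java
// Illustrative sketch: AUC computed directly from its probabilistic interpretation.
// Quadratic in the number of examples, so only suitable for small data sets.
final class PairwiseAuc {

    // truth holds 1 / 0 labels; ties between scores count as half a correct pair
    static double auc(int[] truth, double[] scores) {
        double correctPairs = 0.0;
        long totalPairs = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] != 1) {
                continue; // i must be a positive example
            }
            for (int j = 0; j < truth.length; j++) {
                if (truth[j] != 0) {
                    continue; // j must be a negative example
                }
                totalPairs++;
                if (scores[i] > scores[j]) {
                    correctPairs += 1.0;
                } else if (scores[i] == scores[j]) {
                    correctPairs += 0.5;
                }
            }
        }
        return correctPairs / totalPairs;
    }
}
```

For labels {1, 0, 1, 0} with scores {0.9, 0.8, 0.7, 0.1} there are four positive-negative pairs and in three of them the positive example is scored higher, so the AUC is 0.75.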

Luckily, there are implementations of AUC in Java. For example, it is implemented in Smile. You can use it like this:

double[] predicted = ... // predicted probabilities of the positive class
int[] truth = ... // actual labels, 0 or 1
double auc = AUC.measure(truth, predicted);

Result Validation

When learning from data there is always the danger of overfitting. Overfitting occurs when the model starts learning the noise in the data instead of detecting useful patterns. It is always important to check if a model overfits – otherwise it will not be useful when applied to unseen data.

The typical and most practical way of checking whether a model overfits or not is to emulate “unseen data” – that is, take a part of the available labeled data and do not use it for training.

This technique is called “hold out”: we hold out a part of the data and use it only for evaluation.

[Figure: hold-out split of the data into a training part and a test part]

Often we shuffle the original data set before splitting. In many cases we make a simplifying assumption that the order of data is not important – that is, one observation has no influence on another. In this case shuffling the data prior to splitting will remove effects that the order of items might have.

On the other hand, if the data is a time series, then shuffling it is not a good idea, because there is some dependence between observations.

So, let us implement the hold out split. We assume that the data that we have is already represented by X – a two-dimensional array of doubles with features and y – a one-dimensional array of labels.

First, let us create a helper class for holding the data:

public class Dataset {
    private final double[][] X;
    private final double[] y;
    // constructor and getters are omitted
}

Splitting our dataset should produce two datasets, so let us create a class for that as well:

public class Split {
    private final Dataset train;
    private final Dataset test;
    // constructor and getters are omitted
}

Now suppose we want to split the data into two parts: train and test. We also want to control the size of the train set; we will do it using a testRatio parameter: the fraction of items that should go to the test set.

So the first thing we do is generate an array with indexes and then split it according to testRatio:

int[] indexes = IntStream.range(0, dataset.length()).toArray();
int trainSize = (int) (indexes.length * (1 - testRatio));
int[] trainIndex = Arrays.copyOfRange(indexes, 0, trainSize);
int[] testIndex = Arrays.copyOfRange(indexes, trainSize, indexes.length);

If we want to shuffle the data, we shuffle the indexes array before it is split into trainIndex and testIndex:

Random rnd = new Random(seed);

for (int i = indexes.length - 1; i > 0; i--) {
    int index = rnd.nextInt(i + 1);
    int tmp = indexes[index];
    indexes[index] = indexes[i];
    indexes[i] = tmp;
}

Then we can select instances for the training set as follows:

int trainSize = trainIndex.length;
double[][] trainX = new double[trainSize][];
double[] trainY = new double[trainSize];
for (int i = 0; i < trainSize; i++) {
    int idx = trainIndex[i];
    trainX[i] = X[idx];
    trainY[i] = y[idx];
}

And then finally wrap it into our Dataset class:

Dataset train = new Dataset(trainX, trainY);

If we repeat the same for the test set, we can put both train and test sets into a Split object:

Split split = new Split(train, test);

And now we can use the train fold for training and the test fold for testing the models.

If we put all the code above into a function of the Dataset class, for example, trainTestSplit, we can use it as follows:

Split split = dataset.trainTestSplit(0.2);
Dataset train = split.getTrain();
// train the model using train.getX() and train.getY()

Dataset test = split.getTest();
// test the model using test.getX(); test.getY();
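
For reference, the individual steps of the hold-out split can be put together in one self-contained sketch. It works on plain arrays and row indexes rather than on the Dataset and Split classes, and the names HoldOut, splitIndexes and selectRows are hypothetical, chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.stream.IntStream;

// Illustrative sketch: shuffled hold-out split of row indexes.
// HoldOut and its method names are hypothetical, not the book's Dataset API.
final class HoldOut {

    // returns { trainIndexes, testIndexes } for a data set with n rows
    static int[][] splitIndexes(int n, double testRatio, long seed) {
        int[] indexes = IntStream.range(0, n).toArray();

        // Fisher-Yates shuffle, done before the array is split
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = indexes[j];
            indexes[j] = indexes[i];
            indexes[i] = tmp;
        }

        int trainSize = (int) (n * (1 - testRatio));
        int[] train = Arrays.copyOfRange(indexes, 0, trainSize);
        int[] test = Arrays.copyOfRange(indexes, trainSize, n);
        return new int[][] { train, test };
    }

    // selects the rows of X listed in idx (rows are shared, not copied)
    static double[][] selectRows(double[][] X, int[] idx) {
        double[][] result = new double[idx.length][];
        for (int i = 0; i < idx.length; i++) {
            result[i] = X[idx[i]];
        }
        return result;
    }
}
```

Using a fixed seed makes the split reproducible, which is convenient when comparing several models on the same train and test parts.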

K-Fold Cross Validation

Holding out only one part of the data may not always be the best option. What we can do instead is split it into K parts and, in turn, use each part (1/K of the data) for testing while training on the remaining parts.

This is called K-Fold Cross-Validation: it not only gives the performance estimation, but also the possible spread of the error. Typically we are interested in models which give good and consistent performance. K-Fold Cross-Validation helps us to select such models.

The way we prepare the data for K-Fold Cross-Validation is the following:

First, split the data into K parts
Then for each of these parts
- Take one part as the validation set
- Take the remaining K-1 parts as the training set

If we translate this into Java, the first step will look like this:

int[] indexes = IntStream.range(0, dataset.length()).toArray();
int[][] foldIndexes = new int[k][];

int step = indexes.length / k;
int beginIndex = 0;

for (int i = 0; i < k - 1; i++) {
    foldIndexes[i] = Arrays.copyOfRange(indexes, beginIndex, beginIndex + step);
    beginIndex = beginIndex + step;
}

foldIndexes[k - 1] = Arrays.copyOfRange(indexes, beginIndex, indexes.length);

This creates an array of indexes for each fold. You can also shuffle the indexes array as previously.
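
To see what happens when the number of items is not divisible by K, the same logic can be wrapped into a small method and tried on concrete numbers. FoldIndexes is a hypothetical helper used only for this illustration.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative sketch: contiguous fold indexes, where the last fold takes the remainder.
final class FoldIndexes {

    static int[][] split(int n, int k) {
        int[] indexes = IntStream.range(0, n).toArray();
        int[][] foldIndexes = new int[k][];
        int step = n / k;
        int beginIndex = 0;
        for (int i = 0; i < k - 1; i++) {
            foldIndexes[i] = Arrays.copyOfRange(indexes, beginIndex, beginIndex + step);
            beginIndex += step;
        }
        // the last fold gets everything that is left
        foldIndexes[k - 1] = Arrays.copyOfRange(indexes, beginIndex, n);
        return foldIndexes;
    }
}
```

For 10 items and K = 3 the folds have sizes 3, 3 and 4, so the last fold can be larger than the others by up to K - 1 items.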

Now we can create splits from each fold:

List<Split> result = new ArrayList<>();
	
for (int i = 0; i < k; i++) {
    int[] testIdx = foldIndexes[i];
    int[] trainIdx = combineTrainFolds(foldIndexes, indexes.length, i);
    result.add(Split.fromIndexes(dataset, trainIdx, testIdx));
}

In the code above we have two additional methods:

combineTrainFolds takes the K-1 arrays with training indexes and combines them into one
Split.fromIndexes creates a split given train and test indexes.

We have already covered the second function when we created a simple hold-out test set.

And the first function, combineTrainFolds, is implemented like this:

private static int[] combineTrainFolds(int[][] folds, int totalSize, int excludeIndex) {
    int size = totalSize - folds[excludeIndex].length;
    int[] result = new int[size];

    int start = 0;
    for (int i = 0; i < folds.length; i++) {
        if (i == excludeIndex) {
            continue;
        }
        int[] fold = folds[i];
        System.arraycopy(fold, 0, result, start, fold.length);
        start = start + fold.length;
    }

    return result;
}
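
To check that the two pieces fit together, the sketch below repeats the splitting logic and combineTrainFolds on plain index arrays, without the Dataset and Split classes. The class name KFoldIndexes and the method splitIntoFolds are illustrative names only.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class KFoldIndexes {

    // Cuts the indexes 0..n-1 into k consecutive folds; the last fold takes the remainder.
    static int[][] splitIntoFolds(int n, int k) {
        int[] indexes = IntStream.range(0, n).toArray();
        int[][] foldIndexes = new int[k][];
        int step = n / k;
        int beginIndex = 0;
        for (int i = 0; i < k - 1; i++) {
            foldIndexes[i] = Arrays.copyOfRange(indexes, beginIndex, beginIndex + step);
            beginIndex += step;
        }
        foldIndexes[k - 1] = Arrays.copyOfRange(indexes, beginIndex, n);
        return foldIndexes;
    }

    // Concatenates every fold except the one at excludeIndex.
    static int[] combineTrainFolds(int[][] folds, int totalSize, int excludeIndex) {
        int[] result = new int[totalSize - folds[excludeIndex].length];
        int start = 0;
        for (int i = 0; i < folds.length; i++) {
            if (i == excludeIndex) {
                continue;
            }
            System.arraycopy(folds[i], 0, result, start, folds[i].length);
            start += folds[i].length;
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] folds = splitIntoFolds(10, 3);
        for (int i = 0; i < folds.length; i++) {
            int[] train = combineTrainFolds(folds, 10, i);
            System.out.println("validation=" + Arrays.toString(folds[i])
                    + " train=" + Arrays.toString(train));
        }
    }
}
```

With 10 rows and 3 folds, the folds have sizes 3, 3 and 4, because the last fold absorbs the remainder of the integer division.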

Again, we can put the code above into a function of the Dataset class and call it as follows:

List<Split> folds = train.kfold(3);

Now that we have a list of Split objects, we can create a special function for performing Cross-Validation:

public static DescriptiveStatistics crossValidate(List<Split> folds, 
        Function<Dataset, Model> trainer) {
    double[] aucs = folds.parallelStream().mapToDouble(fold -> {
        Dataset foldTrain = fold.getTrain();
        Dataset foldValidation = fold.getTest();
        Model model = trainer.apply(foldTrain);
        return auc(model, foldValidation);
    }).toArray();

    return new DescriptiveStatistics(aucs);
}

This function takes a list of folds and a callback which creates and trains a model on the given training data. After the model is trained, we calculate AUC for it on the validation part of the fold.

Additionally, we take advantage of Java's parallel streams (parallelStream) to train the models on each fold at the same time.

Finally, we put the AUCs calculated on each fold into a DescriptiveStatistics object, which can later on be used to return the mean and the standard deviation of the AUCs. As you probably remember, the DescriptiveStatistics class comes from the Apache Commons Math library.
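
For reference, the two summary numbers can also be computed by hand. The sketch below uses only the standard library and a made-up class name, FoldSummary. It assumes the sample standard deviation (dividing by n - 1), which, according to the Commons Math documentation, is what DescriptiveStatistics reports by default.

```java
final class FoldSummary {

    // Arithmetic mean of the per-fold scores.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Sample standard deviation: divides the sum of squared deviations by n - 1.
    // Returns 0.0 when there are fewer than two values.
    static double sampleStd(double[] values) {
        if (values.length < 2) {
            return 0.0;
        }
        double m = mean(values);
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumOfSquares / (values.length - 1));
    }
}
```

With only a handful of folds the standard deviation is a rough estimate, but it is usually enough to tell a stable model from an unstable one.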

Let us consider an example. Suppose we want to use Logistic Regression from LIBLINEAR and select the best value for the regularization parameter C. We can use the function above this way:

double[] Cs = { 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0 };

for (double C : Cs) {
    DescriptiveStatistics summary = crossValidate(folds, fold -> {
        Parameter param = new Parameter(SolverType.L1R_LR, C, 0.0001);
        return LibLinear.train(fold, param);
    });

    double mean = summary.getMean();
    double std = summary.getStandardDeviation();
    System.out.printf("L1 logreg C=%7.3f, auc=%.4f ± %.4f%n", C, mean, std);
}

Here, LibLinear.train is a helper method which takes a Dataset object and a Parameter object and then trains a LIBLINEAR model. This will print the AUC for all provided values of C, so you can see which one is the best and pick the one with the highest mean AUC.
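
Instead of reading the best value off the printed output, the selection can also be done in code. Below is a minimal sketch, assuming a scoring callback that returns the mean validation AUC for a given value of C. The class ParameterSearch and the method bestByMeanScore are hypothetical names; in the setting above, the callback would wrap a call to crossValidate(...) followed by getMean().

```java
import java.util.function.DoubleUnaryOperator;

final class ParameterSearch {

    // Returns the candidate with the highest score; on ties the earliest candidate wins.
    static double bestByMeanScore(double[] candidates, DoubleUnaryOperator meanScore) {
        if (candidates.length == 0) {
            throw new IllegalArgumentException("no candidates");
        }
        double best = candidates[0];
        double bestScore = meanScore.applyAsDouble(best);
        for (int i = 1; i < candidates.length; i++) {
            double score = meanScore.applyAsDouble(candidates[i]);
            if (score > bestScore) {
                bestScore = score;
                best = candidates[i];
            }
        }
        return best;
    }
}
```

This sketch looks only at the mean. Since we are interested in consistent performance as well, it is worth checking the standard deviation of the winning value before settling on it.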

Training, Validation and Testing

When doing Cross-Validation there is still a danger of overfitting. Since we run a lot of different experiments on the same validation set, we might accidentally pick a model which just happened to do well on the validation set, but which may later fail to generalize to unseen data.

The solution to this problem is to hold out a test set at the very beginning and not touch it at all until we have selected what we think is the best model. We use it only for evaluating the final model.

So how do we select the best model? We can do validation on the remaining train data, using either a single hold-out validation set or K-Fold Cross-Validation. In general you should prefer K-Fold Cross-Validation because it also gives you the spread of performance, and you may use that spread for model selection as well.

The typical workflow should be the following:

(0) Select some metric for validation, e.g. accuracy or AUC.
(1) Split all the data into train and test sets
(2) Split the training data further and hold out a validation dataset or split it into K folds
(3) Use the validation data for model selection and parameter optimization
(4) Select the best model according to the validation set and evaluate it against the hold out test set

It is important to avoid looking at the test set too often. It should be used only occasionally for final evaluation to make sure the selected model does not overfit. If the validation scheme is set up properly, the validation score should correspond to the final test score. If this happens, we can be sure that the model does not overfit and is able to generalize to unseen data.

Using the classes and the code we created previously, it translates to the following Java code:

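Dataset data = new Dataset(X, y);
// hold out a test set first, using the hold-out split we created earlier
// (the method name trainTestSplit and the 0.2 test ratio are assumptions here)
Split split = data.trainTestSplit(0.2);
Dataset train = split.getTrain();
List<Split> folds = train.kfold(3);
// now use crossValidate(folds, ...) to select the best model

Dataset test = split.getTest();
// do final evaluation of the best model on test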

With this information we are ready to do a project on Binary Classification.

Summary

In this article we spoke about supervised machine learning and about two common supervised problems: Classification and Regression. We also covered the commonly used algorithms, the libraries in which they are implemented, and how to evaluate the performance of these algorithms.

There is another family of Machine Learning algorithms that do not require the label information: these methods are called Unsupervised Learning.