Supervised learning task
As in the previous chapter, we need to prepare the training and validation data. Here, we reuse the Spark API to split the data:
val trainValidSplits = inputData.randomSplit(Array(0.8, 0.2))
val (trainData, validData) = (trainValidSplits(0), trainValidSplits(1))
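Because the split is random, repeated runs can yield different training and validation sets, which makes hyperparameter comparisons harder to reproduce. The following is a minimal sketch of pinning the seed, not the chapter's own code: the `SplitSketch` object, the `split` helper, the toy data frame standing in for `inputData`, and the seed value are all assumptions made for illustration. It relies on the two-argument form of `randomSplit`, which accepts a seed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SplitSketch {
  // Hypothetical helper: 80/20 split with a fixed seed, so that reruns over the
  // same input (with the same partitioning) give the same two subsets.
  def split(df: DataFrame, seed: Long): (DataFrame, DataFrame) = {
    val Array(train, valid) = df.randomSplit(Array(0.8, 0.2), seed)
    (train, valid)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for inputData (assumption); the real frame holds the
    // review vectors and labels prepared earlier in the chapter.
    val inputData = (1 to 1000).map(i => (i, i % 2)).toDF("id", "label").cache()

    val (trainData, validData) = split(inputData, 42L)
    // The split is approximate, so the counts are close to, not exactly, 800 and 200.
    println(s"train=${trainData.count()}, valid=${validData.count()}")

    spark.stop()
  }
}
```

Caching the input before splitting keeps both subsets derived from the same materialized rows, which matters when the upstream transformations are not deterministic.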
Now, let's perform a grid search using a simple decision tree and a few hyperparameters:
val gridSearch = for (
  hpImpurity <- Array("entropy", "gini");
  hpDepth <- Array(5, 20);
  hpBins <- Array(10, 50)) yield {
  println(s"Building model with: impurity=${hpImpurity}, depth=${hpDepth}, bins=${hpBins}")
  val model = new DecisionTreeClassifier()
    .setFeaturesCol("reviewVector")
    .setLabelCol("label")
    .setImpurity(hpImpurity)
    .setMaxDepth(hpDepth)
    .setMaxBins(hpBins)
    .fit(trainData)
  val preds = model.transform(validData)
  val auc = new BinaryClassificationEvaluator().setLabelCol("label")
    .evaluate(preds)
  (hpImpurity...