Let's do some (model) training!
At this point, we have a numeric representation of the textual data, which captures the structure of the reviews in a simple way. Now, it is time for model building. First, we will select the columns that we need for training and split the resulting dataset. We will keep the generated row_id
column in the dataset; however, we will not use it as an input feature, but only as a simple unique row identifier:
val splits = tfIdfTokens.select("row_id", "label", idf.getOutputCol).randomSplit(Array(0.7, 0.1, 0.1, 0.1), seed = 42)
val (trainData, testData, transferData, validationData) = (splits(0), splits(1), splits(2), splits(3))
Seq(trainData, testData, transferData, validationData).foreach(_.cache())
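If you want to see how the reviews were actually distributed, a quick count per split can be printed. This check is not part of the original listing, and the exact numbers will vary slightly because randomSplit only approximates the requested weights:

// Hypothetical sanity check: report how many rows ended up in each split.
// randomSplit only approximates the 0.7/0.1/0.1/0.1 weights, so the counts
// will not match the ratios exactly.
Seq("train" -> trainData, "test" -> testData,
    "transfer" -> transferData, "validation" -> validationData)
  .foreach { case (name, data) => println(s"$name: ${data.count()} rows") }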
Notice that we have created four different subsets of our data: a training dataset, testing dataset, transfer dataset, and a final validation dataset. The transfer dataset will be explained later on in the chapter, but everything else should appear very familiar to you already from...