Text classification with Spark 2.0
In this section, we will use the libsvm version of the 20newsgroup data with the Spark DataFrame-based APIs to classify text documents. The current version of Spark supports libsvm version 3.22 (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
Download the libsvm-formatted 20newsgroup data from the following link and copy the output folder under Spark-2.0.x:
https://1drv.ms/f/s!Av6fk5nQi2j-iF84quUlDnJc6G6D
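For orientation, each line of a libsvm file holds one labeled example in the form label index:value index:value ..., where indices are one-based and only non-zero features are listed. The following is a minimal sketch in plain Scala of how such a line can be read. The object name LibSVMLineSketch, the helper parseLine, and the sample line are made up for illustration and are not part of Spark or of the downloaded data; Spark's own libsvm reader does this parsing for you.

```scala
object LibSVMLineSketch {
  // Parses one libsvm line "label idx:val idx:val ..." into
  // (label, zero-based indices, values). Hypothetical helper for illustration.
  def parseLine(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { token =>
      val Array(index, value) = token.split(":")
      // libsvm indices are one-based; shift to zero-based, as Spark does on load
      (index.toInt - 1, value.toDouble)
    }
    (label, pairs.map(_._1), pairs.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample line: class 3 with three non-zero features
    val (label, indices, values) = parseLine("3 1:0.5 7:1.0 42:2.0")
    println(s"label=$label indices=${indices.mkString(",")} values=${values.mkString(",")}")
  }
}
```

Because only non-zero entries are stored, the format suits sparse bag-of-words features such as those derived from newsgroup posts.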
Import the appropriate packages from org.apache.spark.ml and create the wrapper Scala object:
package org.apache.spark.examples.ml

import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object DocumentClassificationLibSVM {
  def main(args: Array[String]): Unit = {
  }
}
Next, we will load the libsvm data into a Spark DataFrame:
val spConfig...
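As a rough, self-contained sketch of the loading step (not necessarily the exact code used in this section), the libsvm data source built into Spark can read the files directly through a SparkSession. The object name LoadLibSVMSketch, the helper loadLibSVM, the local master setting, and the default path output/20news-libsvm are assumptions for illustration; substitute the location of the folder you downloaded.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadLibSVMSketch {
  // Reads libsvm-formatted files into a DataFrame with two columns:
  // "label" (Double) and "features" (ML Vector).
  def loadLibSVM(spark: SparkSession, path: String): DataFrame =
    spark.read.format("libsvm").load(path)

  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative choices, not requirements
    val spark = SparkSession.builder()
      .appName("LoadLibSVMSketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical default path; pass the real data location as the first argument
    val path = args.headOption.getOrElse("output/20news-libsvm")

    val data = loadLibSVM(spark, path)
    data.printSchema()
    data.show(5)

    spark.stop()
  }
}
```

The resulting label and features columns match the default input column names expected by estimators such as NaiveBayes, so the DataFrame can be split and passed to a classifier without renaming.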