Word2Vec with Spark ML on the 20 Newsgroups dataset
In this section, we look at how to use the Spark ML DataFrame API, available in its newer form from Spark 2.0.x, to create a Word2Vec model.
First, we create a DataFrame from the dataset (here, the alt.atheism group of the training split):
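import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val spConfig = (new SparkConf).setMaster("local").setAppName("SparkApp")
val spark = SparkSession
  .builder
  .appName("Word2Vec Sample")
  .config(spConfig)
  .getOrCreate()
import spark.implicits._

// wholeTextFiles returns an RDD of (path, content) pairs, one pair per file
val rawDF = spark.sparkContext
  .wholeTextFiles("./data/20news-bydate-train/alt.atheism/*")

// Keep only the file content; drop control characters (including newlines) and '(' characters
val temp = rawDF.map(x => x._2.filter(_ >= ' ').filter(!_.toString.startsWith("(")))

// Split each document on spaces and wrap it in Tuple1 so that toDF yields a single "text" column
val textDF = temp.map(x => x.split(" ")).map(Tuple1.apply).toDF("text")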
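To see what the character filtering in the listing above does to a document, the following is a minimal sketch in plain Scala with no Spark dependency. The object name, the method names, and the sample string are hypothetical and are used only for illustration. It suggests that removing every character below the space character also removes line breaks, so the last word of one line is joined to the first word of the next. The second method is one possible variant, not part of the original code, that turns control characters into spaces instead.

```scala
object TokenizeSketch {
  // Same character-level filtering as the listing: drop control characters
  // (anything below ' ') and '(' characters, then split on single spaces.
  def tokenizeAsInListing(doc: String): Array[String] =
    doc.filter(_ >= ' ').filter(_ != '(').split(" ")

  // Hypothetical variant (an assumption, not the book's code): replace control
  // characters with spaces so that line breaks still separate words.
  def tokenizeKeepingLineBreaks(doc: String): Array[String] =
    doc.map(c => if (c < ' ') ' ' else c)
      .filter(_ != '(')
      .split("\\s+")
      .filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val sample = "first line\nsecond (line) here"
    // prints: first|linesecond|line)|here
    println(tokenizeAsInListing(sample).mkString("|"))
    // prints: first|line|second|line)|here
    println(tokenizeKeepingLineBreaks(sample).mkString("|"))
  }
}
```

Whether the fused tokens matter depends on the corpus. For a quick demonstration they mostly add noise to the vocabulary, but for a model we intend to query, a tokenizer that respects line breaks is probably the safer choice.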
Next, we create an instance of the Word2Vec class and train the model on the textDF DataFrame created above:
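import org.apache.spark.ml.feature.Word2Vec

// Learn 3-dimensional word vectors; minCount 0 keeps every token in the vocabulary
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(textDF)
val result = model.transform...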
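As a rough illustration of how a fitted model of this kind might be inspected, the sketch below trains Word2Vec with the same settings on a tiny in-memory corpus. The object name, the toy sentences, the query word "quick", and the fixed seed are all assumptions made for this example; they do not come from the 20 Newsgroups data. It uses getVectors to list the learned word vectors, findSynonyms to look up the nearest words, and transform to obtain one averaged vector per document.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

object Word2VecInspectSketch {
  // Trains on a toy corpus (hypothetical data) and returns the size of the
  // first document vector produced by transform.
  def run(): Int = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("Word2Vec Inspect Sketch")
      .getOrCreate()
    import spark.implicits._

    // Each row holds one tokenized document in a single "text" column
    val toyDF = Seq(
      "the quick brown fox".split(" "),
      "the lazy dog sleeps".split(" "),
      "the quick dog barks".split(" ")
    ).map(Tuple1.apply).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
      .setSeed(42L) // fixed seed, chosen arbitrarily for this sketch
      .fit(toyDF)

    // One row per vocabulary word, with columns "word" and "vector"
    model.getVectors.show(false)

    // The two words closest to "quick", with columns "word" and "similarity"
    model.findSynonyms("quick", 2).show(false)

    // transform adds the "result" column: the average of each document's word vectors
    val docVectors = model.transform(toyDF)
    val size = docVectors.select("result").head.getAs[Vector](0).size

    spark.stop()
    size
  }

  def main(args: Array[String]): Unit =
    println(s"Document vector size: ${run()}")
}
```

With so little data the similarities themselves are not meaningful. The point is only to show the shape of the outputs: the document vector has as many components as the value passed to setVectorSize.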