CountVectorizer
CountVectorizer is used to convert a collection of text documents into vectors of token counts, essentially producing a sparse representation of each document over the vocabulary. The end result is a vector of features, which can then be passed to other algorithms. Later on, we will see how to use the output of the CountVectorizer in the LDA algorithm to perform topic detection.
In order to invoke CountVectorizer, you need to import the package:
import org.apache.spark.ml.feature.CountVectorizer
First, you need to initialize a CountVectorizer Estimator, specifying the input column and the output column. Here, we choose the filteredWords column created by the StopWordsRemover and generate the output column features:
scala> val countVectorizer = new CountVectorizer().setInputCol("filteredWords").setOutputCol("features")
countVectorizer: org.apache.spark.ml.feature.CountVectorizer = cntVec_555716178088
Next, invoking the fit()
function on the dataset yields an output Transformer:
scala...
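As a self-contained sketch of the fit and transform steps (the DataFrame here is hypothetical example data standing in for the StopWordsRemover output; in practice you would use your own filteredWords column):

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CountVectorizerExample").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the StopWordsRemover output
val df = Seq(
  (0, Array("spark", "makes", "distributed", "computing", "simple")),
  (1, Array("spark", "spark", "streaming"))
).toDF("id", "filteredWords")

val countVectorizer = new CountVectorizer()
  .setInputCol("filteredWords")
  .setOutputCol("features")

// fit() scans the corpus to build the vocabulary and returns a
// CountVectorizerModel, which is a Transformer
val model = countVectorizer.fit(df)

// transform() appends the sparse token-count vectors as the features column
model.transform(df).select("filteredWords", "features").show(false)
```

Because CountVectorizer is an Estimator, fit() must run over the whole dataset first to learn the vocabulary; the resulting model can then transform this or any other DataFrame with the same input column.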