Tokenization
The Tokenizer converts the input string to lowercase and then splits it on whitespace into individual tokens. A given sentence is split into words either using the default space delimiter or using a custom regular-expression-based tokenizer. In either case, the input column is transformed into an output column; in particular, the input column is usually a String and the output column is a Sequence of Words.
Tokenizers are made available through two classes, imported next: the Tokenizer and the RegexTokenizer:
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.RegexTokenizer
First, you need to initialize a Tokenizer, specifying the input column and the output column:
scala> val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_942c8332b9d8
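The transform() call that follows assumes a DataFrame named sentenceDF with a String column called "sentence", matching the tokenizer's input column. A minimal sketch of how such a DataFrame might be created, using hypothetical sample sentences:

scala> // Hypothetical sample data; any DataFrame with a String column
scala> // named "sentence" would work here.
scala> val sentenceDF = spark.createDataFrame(Seq(
     |   (0, "Hello World Wide Web"),
     |   (1, "Spark ML makes feature engineering simple")
     | )).toDF("id", "sentence")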
Next, invoking the transform() function on the input dataset yields an output dataset:
scala> val wordsDF = tokenizer.transform(sentenceDF)
wordsDF...
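The plain Tokenizer always splits on whitespace; to split on a custom pattern, you can use the RegexTokenizer instead. A minimal sketch, assuming the same sentenceDF as above; the pattern "\\W+" (split on runs of non-word characters) is an illustrative choice, not the default:

scala> // setPattern() supplies the regular expression used to split;
scala> // by default, RegexTokenizer splits on whitespace ("\\s+").
scala> val regexTokenizer = new RegexTokenizer()
     |   .setInputCol("sentence")
     |   .setOutputCol("words")
     |   .setPattern("\\W+")
scala> val regexWordsDF = regexTokenizer.transform(sentenceDF)
scala> regexWordsDF.select("words").show(false)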