Understanding tokenizers
We have seen earlier that an analyzer can be either a single class or a combination of defined tokenizer and filter classes.
The analyzer executes the analysis process in two steps (a configuration sketch follows this list):
- Tokenization (parsing): Using configured tokenizer classes
- Filtering (transformation): Using configured filter classes
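As a minimal sketch of how these two steps are wired together in the schema (the field type name text_example is illustrative; the tokenizer and filter factories shown are standard Solr classes), an analyzer definition might look like this:

<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Step 1: tokenization (parsing) - the tokenizer breaks the character stream into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Step 2: filtering (transformation) - filters transform, add, or remove tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

Here the tokenizer runs first and the filters are then applied to its token stream in the order in which they are declared.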
We can also preprocess the character stream before tokenization with the help of CharFilters (we will see these later in the chapter). An analyzer knows which field it is configured for, but a tokenizer has no idea about the field. The job of the tokenizer is simply to read from a character stream, apply its own tokenization logic, and produce a sequence of tokens as a token stream.
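As a rough sketch of where such preprocessing sits (solr.HTMLStripCharFilterFactory is a standard Solr CharFilter; the field type name is again made up), a CharFilter is declared before the tokenizer and cleans the character stream before the tokenizer ever sees it:

<fieldType name="text_html_example" class="solr.TextField">
  <analyzer>
    <!-- Preprocessing: strip HTML markup from the character stream before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- The tokenizer reads the already-cleaned character stream; it knows nothing about the field -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>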
What is a tokenizer?
A tokenizer is a component provided by Solr that performs the tokenization process: it breaks a stream of text into tokens at some delimiter and generates a token stream. Tokenizers are configured in the managed-schema.xml file by the factory class of their Java implementation...
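For example, assuming a simple whitespace-based field type (the name text_ws_example is hypothetical; solr.WhitespaceTokenizerFactory is the standard factory for whitespace tokenization), the declaration could look like this:

<fieldType name="text_ws_example" class="solr.TextField">
  <analyzer>
    <!-- The factory class determines the tokenization behavior: this one splits the stream only at whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

With this configuration, an input such as "Learning Apache Solr" is broken into the tokens Learning, Apache, and Solr.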