Gensim fastText parameters
Gensim supports the same hyperparameters as the native fastText implementation. You should be able to set them as follows (a short training sketch that ties several of them together appears after the list):
- sentences: This can be a list of lists of tokens. In general, a streamed iterable of tokenized sentences is recommended, such as LineSentence from the word2vec module, as you have seen earlier. In the Facebook fastText library the corpus is instead given as a file path through the -input parameter.
- sg: Either 1 or 0. 1 trains a skip-gram model and 0 trains a CBOW model. In the Facebook fastText library the equivalent is choosing between the skipgram and cbow subcommands.
- size: The dimensionality of the word vectors, so it must be an integer. In line with the original implementation, the default is 100. This is similar to the -dim argument in the Facebook fastText implementation.
- window: The size of the context window considered around a word. This is the same as the -ws argument in the original implementation.
- alpha: The initial learning rate, as a float. It is the same parameter as -lr, which you saw in Chapter 2, Creating Models Using FastText Command Line.
- min_alpha: The minimum learning rate to which the learning rate will drop as training progresses.
- seed: This is for reproducibility. For seeding to be fully effective, the number of worker threads must also be set to 1.
- min_count: The minimum frequency of words in the documents; words below this threshold are discarded. Similar to the -minCount parameter in the command line.
- max_vocab_size: This limits RAM usage. If there are more unique words than this, the less frequent ones are pruned. Choose it based on the RAM you have: roughly every 10 million word types need about 1 GB of RAM, so with 2 GB of memory, max_vocab_size can be set to around 10 million * 2 = 20 million (20,000,000).
- sample: The threshold for downsampling very frequent words. Similar to the -t parameter in the fastText command line.
- workers: The number of threads used for training, similar to the -thread parameter in the fastText command.
- hs: Either 0 or 1. If this is 1, hierarchical softmax is used as the loss function.
- negative: If you want to use negative sampling as the loss function, set hs=0 and negative to a non-zero positive number. Note that only two loss functions are supported, hierarchical softmax and negative sampling; plain softmax is not supported. Together with hs, this is the equivalent of the -loss parameter in the fasttext command.
- cbow_mean: There is a difference from the fastText command here. The original implementation always takes the mean of the context vectors for CBOW, whereas here you can pass 0 to use the sum instead, or 1 (the default) to use the mean.
- hashfxn: The hash function used to randomly initialize the weights.
- iter: The number of iterations (epochs) over the corpus. This is the same as the -epoch parameter in the command line.
- trim_rule: A function that specifies whether certain words should be kept in the vocabulary or trimmed away.
- sorted_vocab: Accepted values are 1 or 0. If 1, the vocabulary is sorted by descending frequency before indexing.
- batch_words: The target size, in words, of the batches passed to the worker threads. The default value is 10,000. This is somewhat similar to -lrUpdateRate in the command line, since the number of batches determines when the weights are updated.
- min_n and max_n: The minimum and maximum length of the character n-grams, corresponding to the -minn and -maxn arguments in the command line.
- word_ngrams: Either 1 or 0. If 1, the word vectors are enriched with subword (character n-gram) information during training; 0 is equivalent to plain word2vec.
- bucket: The character n-grams are hashed into a fixed number of buckets. By default, 2 million buckets are used.
- callbacks: A list of callback functions to be executed at specific stages of the training process.
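To see how these parameters fit together, here is a minimal training sketch. It assumes a gensim 3.x release, where the parameter names above still apply (newer releases renamed some of them, for example size to vector_size and iter to epochs), and a hypothetical tokenized corpus at data/corpus.txt with one sentence per line:

```python
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Stream tokenized sentences from disk instead of loading them into memory.
# 'data/corpus.txt' is a placeholder path: one whitespace-tokenized sentence per line.
sentences = LineSentence('data/corpus.txt')

# Train a skip-gram fastText model with negative sampling.
model = FastText(
    sentences,
    sg=1,             # 1 = skip-gram, 0 = CBOW
    size=100,         # vector dimensionality, like -dim
    window=5,         # context window, like -ws
    alpha=0.025,      # initial learning rate, like -lr
    min_count=5,      # discard rarer words, like -minCount
    workers=4,        # training threads, like -thread
    hs=0,             # disable hierarchical softmax ...
    negative=5,       # ... and use negative sampling, like -loss ns
    iter=5,           # epochs over the corpus, like -epoch
    min_n=3,          # shortest character n-gram
    max_n=6,          # longest character n-gram
    bucket=2000000,   # hash buckets for the character n-grams
)

# Thanks to the subword information, vectors can be built even for unseen words.
print(model.wv.most_similar('computer', topn=5))
```

Because this sketch trains skip-gram with negative sampling, hs is set to 0 and negative to a positive value; switching to CBOW is just a matter of setting sg=0 and, if you prefer, cbow_mean=0 to sum rather than average the context vectors.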