Gensim fastText parameters
Gensim supports the same hyperparameters as the native fastText implementation. You should be able to set them as follows (a short training sketch that ties several of them together appears after the list):
- sentences: This can be a list of lists of tokens. In general, a streamed iterable of tokenized sentences is recommended, such as LineSentence from the word2vec module, as you have seen earlier. In the Facebook fastText library the corpus is instead given as a file path through the -input parameter.
- sg: Either 1 or 0. 1 trains a skip-gram model and 0 trains a CBOW model. In the Facebook fastText library the equivalent is choosing between the skipgram and cbow subcommands.
- size: The dimensionality of the word vectors, so it must be an integer. In line with the original implementation, the default is 100. This is similar to the -dim argument in the Facebook fastText implementation.
- window: The size of the context window considered around a word. This is the same as the -ws argument in the original implementation.
- alpha: The initial learning rate, as a float. It is the same parameter as -lr, which you saw in Chapter 2, Creating Models Using FastText Command Line.
- min_alpha: The minimum learning rate to which the learning rate will drop as training progresses.
- seed: This is for reproducibility. For seeding to be fully effective, the number of worker threads must also be set to 1.
- min_count: The minimum frequency of words in the documents; words below this threshold are discarded. Similar to the -minCount parameter in the command line.
- max_vocab_size: This limits RAM usage. If there are more unique words than this, the less frequent ones are pruned. Choose it based on the RAM you have: roughly every 10 million word types need about 1 GB of RAM, so with 2 GB of memory, max_vocab_size can be set to around 10 million * 2 = 20 million (20,000,000).
- sample: The threshold for downsampling very frequent words. Similar to the -t parameter in the fastText command line.
- workers: The number of threads used for training, similar to the -thread parameter in the fastText command.
- hs: Either 0 or 1. If this is 1, hierarchical softmax is used as the loss function.
- negative: If you want to use negative sampling as the loss function, set hs=0 and negative to a non-zero positive number. Note that only two loss functions are supported, hierarchical softmax and negative sampling; plain softmax is not supported. Together with hs, this is the equivalent of the -loss parameter in the fasttext command.
- cbow_mean: There is a difference from the fastText command here. The original implementation always takes the mean of the context vectors for CBOW, whereas here you can pass 0 to use the sum instead, or 1 (the default) to use the mean.
- hashfxn: The hash function used to randomly initialize the weights.
- iter: The number of iterations (epochs) over the corpus. This is the same as the -epoch parameter in the command line.
- trim_rule: A function that specifies whether certain words should be kept in the vocabulary or trimmed away.
- sorted_vocab: Accepted values are 1 or 0. If 1, the vocabulary is sorted by descending frequency before indexing.
- batch_words: The target size, in words, of the batches passed to the worker threads. The default value is 10,000. This is somewhat similar to -lrUpdateRate in the command line, since the number of batches determines when the weights are updated.
- min_n and max_n: The minimum and maximum length of the character n-grams, corresponding to the -minn and -maxn arguments in the command line.
- word_ngrams: Either 1 or 0. If 1, the word vectors are enriched with subword (character n-gram) information during training; 0 is equivalent to plain word2vec.
- bucket: The character n-grams are hashed into a fixed number of buckets. By default, 2 million buckets are used.
- callbacks: A list of callback functions to be executed at specific stages of the training process.
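To see how these parameters fit together, here is a minimal training sketch. It assumes a gensim 3.x release, where the parameter names above still apply (newer releases renamed some of them, for example size to vector_size and iter to epochs), and a hypothetical tokenized corpus at data/corpus.txt with one sentence per line:

```python
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Stream tokenized sentences from disk instead of loading them into memory.
# 'data/corpus.txt' is a placeholder path: one whitespace-tokenized sentence per line.
sentences = LineSentence('data/corpus.txt')

# Train a skip-gram fastText model with negative sampling.
model = FastText(
    sentences,
    sg=1,             # 1 = skip-gram, 0 = CBOW
    size=100,         # vector dimensionality, like -dim
    window=5,         # context window, like -ws
    alpha=0.025,      # initial learning rate, like -lr
    min_count=5,      # discard rarer words, like -minCount
    workers=4,        # training threads, like -thread
    hs=0,             # disable hierarchical softmax ...
    negative=5,       # ... and use negative sampling, like -loss ns
    iter=5,           # epochs over the corpus, like -epoch
    min_n=3,          # shortest character n-gram
    max_n=6,          # longest character n-gram
    bucket=2000000,   # hash buckets for the character n-grams
)

# Thanks to the subword information, vectors can be built even for unseen words.
print(model.wv.most_similar('computer', topn=5))
```

Because this sketch trains skip-gram with negative sampling, hs is set to 0 and negative to a positive value; switching to CBOW is just a matter of setting sg=0 and, if you prefer, cbow_mean=0 to sum rather than average the context vectors.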