fastText Quick Start Guide

Gensim fastText parameters


Gensim supports the same hyperparameters as the native fastText implementation. You should be able to set them as follows (a short training sketch follows the list):

  • sentences: This can be a list of lists of tokens. In general, a stream of tokens is recommended, such as LineSentence from the word2vec module, as you have seen earlier. In the Facebook fastText library, the corpus is instead passed as a file path through the -input parameter.
  • sg: Either 1 or 0. 1 trains a skip-gram model, and 0 trains a CBOW model. In the Facebook fastText library, the equivalents are the skipgram and cbow subcommands.
  • size: The dimensionality of the word vectors; it must be an integer. In line with the original implementation, the default is 100. This corresponds to the -dim argument in the Facebook fastText implementation.
  • window: The window size considered around a word. This is the same as the -ws argument in the original implementation.
  • alpha: This is the initial learning rate and is a float. It is the same parameter as -lr, which you saw in Chapter 2, Creating Models Using FastText Command Line.
  • min_alpha: The minimum learning rate to which the learning rate drops as training progresses.
  • seed: This is for reproducibility. For seeding to work, the number of worker threads also needs to be 1.
  • min_count: The minimum frequency of words in the documents; words below this threshold are discarded. Similar to the -minCount parameter in the command line.
  • max_vocab_size: This limits RAM usage during vocabulary building; if there are more unique words than this, the less frequent ones are pruned. Decide this based on the RAM you have: every 10 million word types need about 1 GB, so with 2 GB of memory, max_vocab_size should be around 10 million * 2 = 20 million (20,000,000).
  • sample: For downsampling of high-frequency words. Similar to the -t parameter in the fastText command line.
  • workers: The number of threads used for training, similar to the -thread parameter in the fastText command line.
  • hs: Either 0 or 1. If this is 1, then hierarchical softmax will be used as the loss function. 
  • negative: If you want to use negative sampling as the loss function, set hs=0 and negative to a non-zero positive number. Note that only two loss functions are supported: hierarchical softmax and negative sampling. Plain softmax is not supported. This parameter, along with hs, is the equivalent of the -loss parameter in the fastText command.
  • cbow_mean: There is a difference from the fastText command here. The original implementation always takes the mean of the context vectors for CBOW. Here, you can pass 0 to use the sum of the vectors instead, or 1 to use the mean.
  • hashfxn: Hash function for randomly initializing the weights.
  • iter: Number of iterations or epochs over the samples. This is the same as the -epoch parameter in the command line.
  • trim_rule: Function to specify if certain words should be kept in the vocabulary or trimmed away.
  • sorted_vocab: Accepted values are 1 or 0. If 1, the vocabulary is sorted by descending frequency before indexing.
  • batch_words: This is the target size (in words) of the batches passed to the worker threads. The default value is 10000. This is somewhat similar to -lrUpdateRate in the command line, as the number of batches determines when the weights are updated.
  • min_n and max_n: Minimum and maximum length of the character n-grams. 
  • word_ngrams: If 1, the word vectors are enriched with subword (character n-gram) information during training; if 0, training is equivalent to plain word2vec.
  • bucket: The character n-grams are hashed into a fixed number of buckets. By default, a bucket size of 2 million is used.
  • callbacks: A list of callback functions to be executed at specific stages of the training process.
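Putting a few of these together, here is a minimal sketch of training a Gensim fastText model on a plain-text corpus. The file name data.txt is a placeholder for a corpus with one sentence per line, and the hyperparameter values shown are illustrative defaults rather than tuned settings. The parameter names size and iter are those of the Gensim release contemporary with this book; newer Gensim versions renamed them to vector_size and epochs.

from gensim.models.fasttext import FastText
from gensim.models.word2vec import LineSentence

# data.txt is a hypothetical corpus file with one sentence per line;
# LineSentence streams it as lists of tokens instead of loading it into memory.
sentences = LineSentence('data.txt')

model = FastText(
    sentences,
    sg=1,            # 1 = skip-gram, 0 = CBOW
    size=100,        # vector dimensionality, the -dim equivalent
    window=5,        # context window, the -ws equivalent
    alpha=0.025,     # initial learning rate, the -lr equivalent
    min_count=5,     # discard rarer words, the -minCount equivalent
    hs=0,            # together with negative, the -loss equivalent
    negative=5,      # number of negative samples
    min_n=3,         # shortest character n-gram
    max_n=6,         # longest character n-gram
    bucket=2000000,  # hash buckets for the character n-grams
    iter=5,          # epochs, the -epoch equivalent
    workers=4,       # training threads, the -thread equivalent
)

# Thanks to the character n-grams, vectors can be built even for words
# that never appeared in the training corpus.
print(model.wv['modeling'])
print(model.wv.most_similar('modeling', topn=5))

Once training finishes, model.wv behaves like the word2vec keyed vectors you used earlier, except that it can also compose a vector for an out-of-vocabulary word from its character n-grams.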