Scikit-learn and fastText
In this section, we look at how to integrate fastText into your statistical models. The most widely used library for statistical machine learning is scikit-learn, so we will focus on that.
A large part of scikit-learn's appeal is that its API is simple and uniform. The typical flow is:
- Convert your data into matrix format.
- Create an instance of the predictor class.
- Using the instance, run the fit method on the data.
- Once the model is fitted, run predict on it.
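The four steps above can be sketched as follows. This is a minimal illustration rather than part of the original walkthrough: the toy data and the choice of LogisticRegression as the predictor are assumptions made only to show the flow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Step 1: data in matrix format (rows are samples, columns are features).
# The values below are made-up toy data.
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array([0, 0, 1, 1])

# Step 2: create an instance of the predictor class.
clf = LogisticRegression()

# Step 3: run fit on the training data.
clf.fit(X_train, y_train)

# Step 4: run predict on new samples with the same number of columns.
X_new = np.array([[0.1, 0.1], [0.95, 0.9]])
predictions = clf.predict(X_new)
print(predictions)
```

Any other scikit-learn predictor could be substituted in step 2 without changing steps 3 and 4, which is what makes the API uniform.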
This means that you can create a custom classifier by defining the fit and predict methods.
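As a rough illustration of that idea, here is a toy classifier that only defines fit and predict. The class name NearestCentroidClassifier and its logic are hypothetical, chosen for brevity, and use NumPy only.

```python
import numpy as np


class NearestCentroidClassifier:
    """Toy classifier: predicts the class whose training mean is closest."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        # Remember the distinct labels and the mean feature vector of each.
        self.classes_ = np.unique(y)
        self.centroids_ = np.array(
            [X[y == label].mean(axis=0) for label in self.classes_]
        )
        return self  # returning self allows chaining, as scikit-learn does

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise Euclidean distances, shape (n_samples, n_classes).
        diffs = X[:, None, :] - self.centroids_[None, :, :]
        distances = np.linalg.norm(diffs, axis=2)
        # Pick the label of the nearest centroid for every sample.
        return self.classes_[distances.argmin(axis=1)]


model = NearestCentroidClassifier().fit(
    [[0, 0], [0, 1], [5, 5], [6, 5]], ["a", "a", "b", "b"]
)
labels = model.predict([[0.2, 0.4], [5.5, 5.1]])
print(labels)
```

To use such a class with scikit-learn utilities like Pipeline or GridSearchCV, you would normally also inherit from sklearn.base.BaseEstimator and ClassifierMixin, which this sketch leaves out.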
Custom classifiers for fastText
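Since we are interested in combining fastText word vectors with linear classifiers, we cannot pass all the word vectors of a document to the classifier as they are, because it expects a single fixed-length vector per document. We therefore need a way to reduce them to one vector. In this case, let's go with the mean: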
class MeanEmbeddingVectorizer(object):
    def __init__(self, ft_wv):
        self...
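To make the averaging idea concrete, here is a minimal, self-contained sketch that does not depend on the listing above. It is one possible shape for such a vectorizer, not necessarily the one used here. The class name MeanEmbeddingSketch, the dim parameter, and the toy_vectors dictionary are hypothetical, and the word vectors are assumed to support dictionary-style lookup (word in vectors, vectors[word]).

```python
import numpy as np


class MeanEmbeddingSketch:
    """Turns each tokenized document into the mean of its word vectors."""

    def __init__(self, word_vectors, dim):
        # word_vectors: assumed dict-like mapping from token to 1-D array.
        # dim: length of each word vector (used for the all-unknown case).
        self.word_vectors = word_vectors
        self.dim = dim

    def fit(self, X, y=None):
        # Nothing to learn; present so the class follows the fit/transform flow.
        return self

    def transform(self, X):
        rows = []
        for tokens in X:
            known = [self.word_vectors[t] for t in tokens if t in self.word_vectors]
            # Average the known vectors; fall back to zeros if none are known.
            rows.append(np.mean(known, axis=0) if known else np.zeros(self.dim))
        return np.array(rows)


# Made-up 2-dimensional vectors standing in for real fastText vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
    "movie": np.array([0.5, 0.5]),
}
vectorizer = MeanEmbeddingSketch(toy_vectors, dim=2)
features = vectorizer.fit([]).transform(
    [["good", "movie"], ["bad", "movie"], ["unknown"]]
)
print(features)
```

The resulting matrix has one row per document and one column per embedding dimension, which is the matrix format a linear classifier's fit and predict methods expect.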