Remember our fake task: predicting a word within the same context (window) from the input word. This means that we are not performing binary classification (the example used in the introductory section about neural networks), but rather a classification with N classes, N being the number of words in the corpus. Hence, the output layer contains N neurons whose values are either 0 or 1: if the value of the neuron at index i is 1, word i is the chosen word, meaning the network predicts that word i appears within the same context as the input word. In practice, neural networks use a softmax classifier, where the softmax function gives the probability of a word being in the same context as the input word:
softmax(i) = \frac{e^{o_i}}{\sum_{j=1}^{N} e^{o_j}}
In this formula, o_i is the output of the ith neuron in the output layer, and the sum in the denominator runs over all N output neurons.
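To make the formula concrete, here is a minimal sketch in Python (using NumPy, with a made-up five-word vocabulary and arbitrary output-layer values, both of which are assumptions for illustration only) that turns the raw outputs o_i into the probabilities described above:

```python
import numpy as np

def softmax(o):
    """Convert raw output-layer values o into probabilities."""
    e = np.exp(o - np.max(o))  # subtract the max for numerical stability
    return e / e.sum()

# Toy example: a hypothetical five-word vocabulary and made-up output values o_i.
vocab = ["cat", "sat", "on", "the", "mat"]
outputs = np.array([1.2, 0.3, -0.5, 2.0, 0.1])

probs = softmax(outputs)
for word, p in zip(vocab, probs):
    print(f"P({word} in context) = {p:.3f}")
# The probabilities sum to 1; the largest one marks the word the network
# predicts as most likely to appear in the same context as the input word.
```

Note that this only illustrates how the output layer is interpreted; a real skip-gram implementation would compute o from learned input and output weight matrices.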
This is the basic idea behind skip-gram. In practice, some care needs to be taken when implementing such a model:
- Certain words, such as the, will be over...