Recall that our fake task is to predict the words that fall within a window of a given size around an input word. The target variable is therefore also a word, and the training set is made up of pairs of words that appear within the same context. Context here is defined by the window size: the larger the window, the bigger our dataset, since we can create more pairs. However, a larger window also brings more irrelevant data into the training set, because it pairs words that appear quite far from each other.
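The pairing procedure can be sketched in a few lines of Python. This is only a minimal illustration, assuming the text has already been split into a list of tokens; the function name skipgram_pairs and the example sentence are hypothetical and do not come from the text.

```python
def skipgram_pairs(tokens, window_size):
    """Return (input, target) pairs for each word and its neighbours
    located at most window_size positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window so it never runs past the edges of the text.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is never paired with itself
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical example sentence (an assumption, not the quotation from the text).
tokens = "the quick brown fox jumps".split()

for window_size in (1, 2):
    pairs = skipgram_pairs(tokens, window_size)
    print(window_size, len(pairs), pairs[:2])
```

On this five-word sentence, a window size of 1 yields 8 pairs and a window size of 2 yields 14, which shows how the dataset grows with the window. The first word has no neighbours on its left, so it is only paired with the words to its right.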
Let's consider again the quotation from Sherlock Holmes:
Considering a window size of 2, we create the dataset in the following way (see the following diagram). Starting from the first word as the input, we pair it with every word within a 2-word window. Since this is the first word of the text, we can only move toward the right, where we encounter two words: "is" and "a". Hence, we create...