The goal of embedding is to encode each word as a vector V(word) with dimension d ≪ N, in such a way that the word's meaning is preserved. So, starting with the one-hot encoded matrix representing the words, we want to end up with a matrix of size N×d, where d is small, as illustrated in the following diagram:
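To make this concrete, here is a minimal sketch, with an illustrative vocabulary size N and embedding size d and a randomly initialized (rather than trained) embedding matrix, showing that multiplying a one-hot vector by an N×d matrix is the same as looking up a single row of that matrix:

```python
import numpy as np

# Illustrative sizes; a real vocabulary and embedding dimension will differ.
N, d = 10_000, 50
embedding_matrix = np.random.randn(N, d).astype(np.float32)  # learned from data in practice

word_index = 42                       # hypothetical index of some word in the vocabulary
one_hot = np.zeros(N, dtype=np.float32)
one_hot[word_index] = 1.0

# Multiplying the one-hot vector by the N x d matrix selects exactly one row...
v_word = one_hot @ embedding_matrix

# ...which is the same as a direct row lookup, so in practice embeddings are
# implemented as table lookups rather than actual matrix multiplications.
assert np.allclose(v_word, embedding_matrix[word_index])
print(v_word.shape)  # (50,)
```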
However, so far all we have done is reduce the number of features. What it actually means to preserve a word's meaning is illustrated in the following diagram:
This diagram shows four words – child, young, old, and elderly – each represented by a two-dimensional vector. These vectors are related to each other as in the following equation:
V(child) - V(young) + V(old) ≈ V(elderly)
This means that if you take the vector representing the word child and subtract from it the vector representing the word young, you get an intermediate vector representing a person with no age attribute. If you then add to this intermediate vector the vector representation of the word old, you obtain a vector close to the one representing elderly.
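As a small illustration of this kind of vector arithmetic, the following sketch uses hand-picked two-dimensional toy vectors (not learned embeddings) and cosine similarity to check that V(child) - V(young) + V(old) lands closest to V(elderly):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors chosen by hand for illustration; real embeddings are learned
# from data, and the analogy relation only ever holds approximately.
V = {
    "child":   np.array([0.9,  0.8]),
    "young":   np.array([0.1,  0.9]),
    "old":     np.array([0.1, -0.9]),
    "elderly": np.array([0.9, -0.8]),
}

# V(child) - V(young) + V(old) should point roughly in the direction of V(elderly).
query = V["child"] - V["young"] + V["old"]

# Rank every word in the (tiny) vocabulary by similarity to the query vector.
for word, vec in sorted(V.items(), key=lambda kv: cosine(query, kv[1]), reverse=True):
    print(f"{word:8s} {cosine(query, vec):+.3f}")
# "elderly" comes out on top, matching V(child) - V(young) + V(old) ≈ V(elderly)
```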