In a text corpus, each observation corresponds to a word and we need to find and quantify characteristics for each observation. It makes sense to create features so that words representing similar concepts will have similar features. As a first example, let's consider some characters from the Star Wars movies.
We can try to find a list of characteristics for each character. Here, three characteristics are identified: gender (0 for female, 1 for male), species (1 for human, 2 for alien, 3 for robot), and goodness (0 for bad, 1 for good). Using these three characteristics, we can build the following table:
| Character | Gender | Species | Goodness |
|---|---|---|---|
| Darth Vader | 1 | 1 | 0 |
| Yoda | 1 | 2 | 1 |
| Princess Leia | 0 | 1 | 1 |
| R2D2 | 1 | 3 | 1 |
Darth Vader is a man on the Dark Side of the Force (the bad people in the saga), while Leia and Yoda are on the Light Side (the good people). Darth Vader and Leia are both human, whereas R2D2 is a robot and Yoda belongs to an alien species.
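To make the idea of "similar features" concrete, the table can be treated as a set of feature vectors and compared numerically. The snippet below is only a minimal sketch: the choice of Euclidean distance and the helper names `distance` and `closest` are assumptions made for illustration, not something the text prescribes. Note also that the species codes are arbitrary labels, so distances along that axis are indicative at best.

```python
import math

# Feature vectors copied from the table: (gender, species, goodness).
features = {
    "Darth Vader": (1, 1, 0),
    "Yoda": (1, 2, 1),
    "Princess Leia": (0, 1, 1),
    "R2D2": (1, 3, 1),
}


def distance(a, b):
    """Euclidean distance between two characters' feature vectors.

    Euclidean distance is an illustrative assumption; other measures
    could be used just as well.
    """
    return math.dist(features[a], features[b])


def closest(name):
    """Return the other character whose features are nearest to `name`.

    Ties are broken by the order in which characters appear in `features`.
    """
    others = [c for c in features if c != name]
    return min(others, key=lambda c: distance(name, c))


if __name__ == "__main__":
    for name in features:
        print(name, "->", closest(name))
```

With this encoding, Yoda and R2D2 differ only in species, so they end up closest to each other, while Darth Vader and R2D2 differ in both species and goodness and are further apart.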
We can conduct the same exercise...