Tokenizing sentences
Before defining and feeding data into an LSTM network it is important that the data is converted into a form which can be understood by the neural network. Computers understand everything in binary code (0s and 1s) and therefore, the textual or data in string format needs to be converted into one hot encoded variables.
Getting ready
For understanding how one hot encoding works, visit the following links:
- https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python
- https://www.ritchieng.com/machinelearning-one-hot-encoding/
- https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
How to do it...
After the going through the previous section you should be able to clean the entire corpus and split up sentences. The next steps which involve one hot encoding...