Chapter 7: Processing Human Language
Activity 19: Predicting Sentiments of Movie Reviews
Solution:
- Read the IMDB movie review dataset using pandas in Python:
import pandas as pd
data = pd.read_csv('../../chapter 7/data/movie_reviews.csv', encoding='latin-1')
- Convert the reviews to lowercase to reduce the number of unique words:
data.text = data.text.str.lower()
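To see why lowercasing shrinks the vocabulary, the following is a minimal sketch in plain Python. The two sample reviews and the helper name count_unique_words are made up for illustration and are not part of the dataset or the original solution. It applies the same per-string lowercasing that str.lower() performs on the text column.

```python
def count_unique_words(texts):
    """Count distinct whitespace-separated tokens across all texts."""
    return len({word for text in texts for word in text.split()})

# Made-up sample reviews (hypothetical, not taken from the IMDB dataset).
reviews = ["Great movie Great cast", "great Movie but weak plot"]

# Before lowercasing, "Great"/"great" and "movie"/"Movie" count as different words.
unique_before = count_unique_words(reviews)

# After lowercasing, the case variants collapse into a single token each.
unique_after = count_unique_words([review.lower() for review in reviews])

print(unique_before, unique_after)  # prints: 8 6
```

On these samples the vocabulary drops from 8 to 6 distinct tokens, because the case variants of "great" and "movie" merge.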
Note
Keep in mind that "Hello" and "hello" are not the same to a computer.
- Clean the reviews using RegEx with the clean_str function:
import re
def clean_str(string):
    string = re.sub(r"https?\://\S+", '', string)
    string = re.sub(r'\<a href', ' ', string)
    string = re.sub(r'&', '', string)
    string = re.sub(r'<br />', ' ', string)
    string...