Chapter 7: Topic Modeling
Activity 15: Loading and Cleaning Twitter Data
Solution:
- Import the necessary libraries:
import langdetect
import matplotlib.pyplot
import nltk
import numpy
import pandas
import pyLDAvis
import pyLDAvis.sklearn
import regex
import sklearn
- Load the LA Times health Twitter data (latimeshealth.txt) from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-Python/tree/master/Lesson07/Activity15-Activity17:
Note
Pay close attention to the delimiter (it is neither a comma nor a tab) and double-check the header status.
path = '<Path>/latimeshealth.txt'
df = pandas.read_csv(path, sep="|", header=None)
df.columns = ["id", "datetime", "tweettext"]
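The delimiter and header choices above can be checked before loading. The following is a minimal sketch, not part of the original solution: the two sample rows are invented stand-ins in the same three-field layout, and count_delimiters is a hypothetical helper introduced only for illustration. It counts candidate delimiters per line and then loads the sample with the same read_csv arguments used in this step.

```python
import io

import pandas

# Hypothetical sample rows in a three-field layout; these are not real tweets.
sample = (
    "1001|Mon Jan 05 10:00:00 +0000 2015|Flu season tips http://example.com/a\n"
    "1002|Mon Jan 05 11:30:00 +0000 2015|New study on sleep, diet and exercise\n"
)


def count_delimiters(text, candidates=(",", "\t", "|", ";")):
    """Return, for each candidate delimiter, its count on every line."""
    lines = text.splitlines()
    return {c: [line.count(c) for line in lines] for c in candidates}


counts = count_delimiters(sample)

# A plausible delimiter occurs on every line, the same number of times.
consistent = [
    c for c, per_line in counts.items()
    if min(per_line) > 0 and len(set(per_line)) == 1
]
print("Consistent delimiters:", consistent)

# header=None keeps the first row as data instead of using it as column names.
df_sample = pandas.read_csv(io.StringIO(sample), sep="|", header=None)
df_sample.columns = ["id", "datetime", "tweettext"]
print(df_sample)
```

On real data, the same check would be run on the first few lines read from the file. A count that varies between lines, as the comma does here, suggests the character is part of the text rather than a field separator.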
- Run a quick exploratory analysis to ascertain the data size and structure:
def dataframe_quick_look(df, nrows):
    print("SHAPE:\n{shape}\n".format(shape=df.shape))
    print("COLUMN NAMES:\n{names}\n".format(names=df.columns...
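Because the listing above is cut off, here is a separate, self-contained sketch of what a quick-look helper of this kind might do. It is an illustration under assumptions, not a reconstruction of the original function: the name quick_look_sketch, the returned dictionary, and the two-row demo frame are all hypothetical.

```python
import pandas


def quick_look_sketch(df, nrows):
    """Print the shape, column names, and first nrows rows; also return them."""
    summary = {
        "shape": df.shape,
        "columns": list(df.columns),
        "head": df.head(nrows),
    }
    print("SHAPE:\n{shape}\n".format(shape=summary["shape"]))
    print("COLUMN NAMES:\n{names}\n".format(names=summary["columns"]))
    print("HEAD:\n{head}\n".format(head=summary["head"]))
    return summary


# Hypothetical two-row frame standing in for the loaded tweet data.
demo = pandas.DataFrame(
    {
        "id": [1001, 1002],
        "datetime": ["Mon Jan 05 10:00:00 +0000 2015", "Mon Jan 05 11:30:00 +0000 2015"],
        "tweettext": ["Flu season tips", "New study on sleep"],
    }
)
result = quick_look_sketch(demo, 1)
```

Returning the summary alongside printing it is a design choice made here so the values can be checked programmatically; a print-only helper would serve equally well for interactive exploration.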