Identifying spam in YouTube video comments using RNNs
As a first example, we will look into the problem of identifying spam in YouTube video comments. The complete Jupyter Notebook for this example is available at Chapter05/02_example.ipynb in this book's code repository. The data contains comments with binary labels specifying whether each comment is genuine or spam. The code that follows loads the comments from several CSV files into a single pandas DataFrame:
import pandas as pd

# Load each per-video CSV file into its own DataFrame
comments_df_list = []
comments_file = ['data/Youtube01-Psy.csv', 'data/Youtube02-KatyPerry.csv', 'data/Youtube03-LMFAO.csv',
                 'data/Youtube04-Eminem.csv', 'data/Youtube05-Shakira.csv']
for f in comments_file:
    df = pd.read_csv(f, header=0)
    comments_df_list.append(df)
# Combine all files into a single DataFrame
comments_df = pd.concat(comments_df_list)
# Shuffle the rows by sampling 100% of them without replacement
comments_df = comments_df.sample(frac=1.0)
print(comments_df.shape)
comments_df.head(5)
The following output shows a sample of the YouTube comments with the various fields:
| | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS |
|---|---|---|---|---|---|
| 102 | z12dfr5irwr5chwm3232gvnq2laqcdezn04... |
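The load, concatenate, and shuffle pattern used above can be tried without the data files. The following is a minimal sketch under stated assumptions, not the notebook's code: the two in-memory CSV strings and their rows are invented for illustration (only the column names come from the dataset), `load_comments` is a hypothetical helper, and the label convention of 1 for spam and 0 for genuine is assumed. Unlike the original snippet, it passes `ignore_index=True` and a fixed `random_state` so that the index is unique and the shuffle is reproducible.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two of the per-video CSV files. The rows are
# invented for illustration; only the column names come from the dataset.
CSV_A = """COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
a1,user_one,2014-01-01T00:00:00,Check out my channel for free gifts,1
a2,user_two,2014-01-02T00:00:00,This song never gets old,0
"""
CSV_B = """COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
b1,user_three,2014-02-01T00:00:00,Subscribe to me and win a phone,1
b2,user_four,2014-02-02T00:00:00,Great video,0
b3,user_five,2014-02-03T00:00:00,Love the choreography,0
"""


def load_comments(sources, seed=42):
    """Read each CSV source, stack the rows, and shuffle them reproducibly."""
    frames = [pd.read_csv(src, header=0) for src in sources]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    combined = pd.concat(frames, ignore_index=True)
    # frac=1.0 samples every row exactly once, which amounts to a shuffle.
    shuffled = combined.sample(frac=1.0, random_state=seed)
    return shuffled.reset_index(drop=True)


demo_df = load_comments([io.StringIO(CSV_A), io.StringIO(CSV_B)])
print(demo_df.shape)
# Number of comments per label (assumed: 0 = genuine, 1 = spam).
print(demo_df["CLASS"].value_counts().sort_index())
```

Checking the label counts right after loading is a quick way to see whether the two classes are roughly balanced before any model is trained.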