Building a vocabulary for our captions
The next step is to preprocess our caption data and build a vocabulary, or metadata dictionary, for the captions. We start by reading in our training dataset records and writing a function to preprocess the text captions:
import pandas as pd

train_df = pd.read_csv('image_train_dataset.tsv', delimiter='\t')
total_samples = train_df.shape[0]
total_samples

35000

# function to pre-process text captions
def preprocess_captions(caption_list):
    pc = []
    for caption in caption_list:
        # lowercase and strip surrounding whitespace
        caption = caption.strip().lower()
        # remove punctuation and normalize special characters
        caption = caption.replace('.', '').replace(',', '').replace("'", "").replace('"', '')
        caption = caption.replace('&', 'and').replace('(', '').replace(')', '').replace('-', ' ')
        # collapse multiple spaces into a single space
        caption = ' '.join(caption.split())
        # add start and end tokens to mark caption boundaries
        caption = '<START> ' + caption + ' <END>'
        pc.append(caption)
    return pc
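As a quick sanity check, we can run this function over a few raw captions and inspect the output. The following is a minimal sketch; it assumes the caption text lives in a column named caption in train_df, which may differ in your dataset:

# hypothetical usage: the column name 'caption' is an assumption
sample_captions = train_df['caption'].tolist()[:3]
processed = preprocess_captions(sample_captions)
for raw, clean in zip(sample_captions, processed):
    print(raw, '->', clean)

# illustrative pattern: a caption like "A man riding a wave."
# becomes "<START> a man riding a wave <END>"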
We will now...