Starting with basic feature engineering
Before we start coding, we need to load the dataset into Python and make sure that all the packages the project depends on are available. The following packages must be installed on our system (the latest versions should suffice; no specific package version is required):
NumPy
pandas
fuzzywuzzy
python-Levenshtein
scikit-learn
gensim
pyemd
NLTK
As we will be using each one of these packages in the project, we will provide specific instructions and tips to install them.
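Before installing anything, it can help to see which of the listed packages are already importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` mapping are hypothetical names introduced for illustration. Note that the import name can differ from the install name (for example, scikit-learn is imported as `sklearn` and python-Levenshtein as `Levenshtein`).

```python
import importlib.util

# Install name -> import name (hypothetical mapping for this illustration).
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "fuzzywuzzy": "fuzzywuzzy",
    "python-Levenshtein": "Levenshtein",
    "scikit-learn": "sklearn",
    "gensim": "gensim",
    "pyemd": "pyemd",
    "nltk": "nltk",
}


def missing_packages(required=REQUIRED):
    """Return the install names whose top-level module cannot be found."""
    return [
        install_name
        for install_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Still to install:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Any names the check reports can then be installed with pip as described in the rest of this section.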
For all dataset operations, we will be using pandas (and NumPy will come in handy, too). To install numpy and pandas:
pip install numpy
pip install pandas
The dataset can be loaded into memory easily by using pandas and a specialized data structure, the pandas dataframe (we expect the dataset to be in the same directory as your script or Jupyter notebook):
import pandas as pd
import numpy as np
data = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')
data = data.drop(['id', 'qid1', 'qid2...
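The loading step can also be wrapped in a small reusable function. What follows is a minimal sketch under stated assumptions, not the book's exact code: the helper name `load_question_pairs` is hypothetical, the default list of dropped columns is limited to the identifiers visible in the listing, and the remaining column names in the in-memory sample (`question1`, `question2`, `is_duplicate`) are assumed for illustration. The sample stands in for the real TSV file so that the snippet runs without the dataset present.

```python
import io

import pandas as pd


def load_question_pairs(source, drop_columns=("id", "qid1", "qid2")):
    """Read a tab-separated file (path or file-like object) into a DataFrame.

    drop_columns is an assumption for this sketch; errors="ignore" skips
    any listed column that is not present in the file.
    """
    data = pd.read_csv(source, sep="\t")
    return data.drop(columns=list(drop_columns), errors="ignore")


# Tiny in-memory stand-in for the real file (assumed column layout).
sample = io.StringIO(
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tWhat is AI?\tWhat is artificial intelligence?\t1\n"
)
sample_data = load_question_pairs(sample)
print(sample_data.columns.tolist())
```

With the real dataset in the working directory, the same function could be called as `load_question_pairs('quora_duplicate_questions.tsv')`.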