Once the dataset size is known, the next most important thing to establish is whether the dataset contains a label for each observation. The label is the value the algorithm should predict for your problem. It can be a text or integer class for a classification task, or a real number for a regression problem. If the dataset contains labels, we are in the context of supervised learning. Otherwise, we have two choices: either rely on unsupervised techniques or try to obtain labels from other sources.
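The check described above can be sketched with pandas. This is only an illustration under stated assumptions: the DataFrame below is made-up toy data rather than the real dataset, the followers feature is hypothetical, and contributed_to_neo4j is the label column name used for the problem in this chapter.

```python
import pandas as pd

# Toy data, invented for illustration only; it is not the chapter's dataset.
# "followers" is a hypothetical feature column.
data = pd.DataFrame({
    "followers": [10, 250, 3, 42],
    "contributed_to_neo4j": [0, 1, 0, 0],
})

label_column = "contributed_to_neo4j"

# The dataset is usable for supervised learning only if the label column
# exists and every observation has a value in it.
has_labels = bool(
    label_column in data.columns and data[label_column].notna().all()
)

if has_labels:
    # Count how many observations fall into each class.
    class_counts = data[label_column].value_counts()
    print(class_counts)
else:
    print("No labels: use unsupervised techniques or look for labels elsewhere.")
```

A small number of distinct integer or text values in the label column suggests a classification task, while a continuous numeric column suggests regression.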
Remember, our problem consists of determining whether a user contributed to Neo4j or not. Our dataset does have labels, in the column named contributed_to_neo4j, so we are dealing with a supervised classification problem. We can check the distribution of this variable with seaborn, a Python visualization package built on top of the long-established matplotlib package and designed for data analysis. As an example, a single line of code (apart from the import!) is required to draw a bar...