Submission and leaderboard dynamics
At first glance, the way Kaggle works seems simple: the test set is hidden from participants; you fit your model, and if your model is the best at predicting on the test set, you score high and possibly win. Unfortunately, this description renders the inner workings of Kaggle competitions in an overly simplistic way, because it doesn’t take into account the dynamics of the direct and indirect interactions between competitors, or the nuances of the problem you are facing and of its training and test sets.
A more comprehensive description of how Kaggle works is given by David Donoho, professor of statistics at Stanford University (https://web.stanford.edu/dept/statistics/cgi-bin/donoho/), in his paper “50 Years of Data Science”. The paper first appeared in the Journal of Computational and Graphical Statistics and was subsequently posted on the MIT Computer Science and Artificial Intelligence Laboratory website (see http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf). Professor Donoho does not refer to Kaggle specifically, but to all data science competition platforms in general. Quoting computational linguist Mark Liberman, he describes data science competitions and platforms as part of a Common Task Framework (CTF) paradigm that has been silently and steadily advancing data science in many fields over the last decades. He states that a CTF can work incredibly well at improving the solution of a problem in data science from an empirical point of view, and he cites as successful examples the Netflix competition and many DARPA competitions that have reshaped the best-in-class solutions for problems in many fields.
A CTF is composed of “ingredients” and a “secret sauce” (see paragraph 6 of Donoho, David. "50 years of data science." Journal of Computational and Graphical Statistics 26.4 (2017): 745-766.). The ingredients are simply:
- A publicly available dataset and a related prediction task
- A set of competitors who share the common task of producing the best prediction for it
- A system for scoring the participants’ predictions fairly and objectively, without providing overly specific hints about the solution (or at least limiting them)
The system works best if the task is well defined and the data is of good quality. In the long run, the performance of solutions improves by small gains until it reaches an asymptote. The process can be sped up by allowing a certain amount of sharing among participants (as happens on Kaggle by means of discussions, and by sharing Kernel notebooks and extra data through Datasets). Notice that competitive pressure alone, without any degree of sharing among participants, doesn’t stop the improvement of the solution; it just makes it slower.
This is because the secret sauce in the CTF paradigm is the competition itself: in the framework of a practical problem whose empirical performance has to be improved, competition always leads to the emergence of new benchmarks, new data and modeling solutions, and in general to an improved application of machine learning to the problem that is the object of the competition. A competition can therefore provide a new way to solve a prediction problem, new approaches to feature engineering, and new algorithmic or modeling solutions. For instance, deep learning did not simply emerge from academic research; it first gained a great boost because of successful competitions that demonstrated its efficacy (we have already mentioned, for instance, the Merck competition won by Geoffrey Hinton’s team: https://www.kaggle.com/c/MerckActivity/overview/winners).
Coupled with the open-source software movement, which allows everyone access to powerful analytical tools (such as Scikit-learn, TensorFlow, or PyTorch), the CTF paradigm brings even better results because all competitors are on the same starting line. On the other hand, when the solution to a competition relies on specialized or costly hardware, this can limit the achievable results, because it can prevent competitors without access to such resources from properly participating and contributing, either directly to the solution or indirectly by exercising competitive pressure on the other participants. Understandably, that is why Kaggle also started offering free cloud services to competition participants (the Kernels): they can flatten some differences in hardware-intensive competitions (as most deep learning ones are) and increase the overall competitive pressure.
There are, however, things that can go wrong and lead to a suboptimal result in a competition:
- Leakage from the data
- Probing from the leaderboard (the scoring system)
- Overfitting and consequent shake-up
- Private sharing
You have leakage from data when part of the solution can be traced back in the data itself: for instance, when certain variables are posterior to the target variable (so they reveal something about it), or when the ordering of the training and test examples or some identifier is evocative of the solution. Such solution leakage, sometimes called “golden features” by competitors (because getting a hint of such nuances in the data can turn into gold, that is, prizes, for the participants), invariably leads to a solution that is not reusable. It also implies a suboptimal result for the sponsor, who at least will have learned something about the leaking features that affect solutions to their problem.
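As a quick sanity check for the most obvious forms of leakage, something along these lines can help. This is a minimal sketch in Python, assuming a pandas DataFrame loaded from a hypothetical train.csv with a numeric target column; the file and column names are illustrative, not from any specific competition:

```python
import pandas as pd

# Hypothetical file and column names: adapt them to the actual competition data.
train = pd.read_csv("train.csv")
target = train.pop("target")  # assuming a numeric (or binary 0/1) target

# Symptom 1: row order correlates with the target, which can happen when the
# data was sorted by time or by outcome before the train/test split was made.
row_order = pd.Series(range(len(train)), index=train.index)
print("row order vs target:", row_order.corr(target, method="spearman"))

# Symptom 2: identifier-like numeric columns correlate with the target,
# suggesting the ID itself encodes something about the solution.
for col in train.select_dtypes("number").columns:
    if train[col].nunique() > 0.9 * len(train):  # looks like an identifier
        corr = train[col].corr(target, method="spearman")
        print(f"{col}: Spearman correlation with target = {corr:.3f}")
```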
Another problem is the possibility of probing a solution from the leaderboard. In this situation, you can snoop the solution through repeated submission trials against the leaderboard. Again, in this case the solution is completely unusable in different circumstances. A clear example of this happened in the competition Don’t Overfit II (https://www.kaggle.com/c/dont-overfit-ii). In this competition, the winning participant, Zachary Mayer, submitted every individual variable as a single submission, gaining information on the possible weight of each variable, which allowed him to estimate the correct coefficients for his model (you can read Zach’s detailed solution here: https://www.kaggle.com/c/dont-overfit-ii/discussion/91766). Generally, time series problems (or other problems where there are systematic shifts in the test data) may be seriously affected by probing, since it can help competitors to define some kind of post-processing (such as multiplying their predictions by a constant) that is most suitable for scoring high on that specific test set.
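To see why such probing works, consider a simplified simulation; this is not Zach’s actual code, just an illustration assuming an AUC metric and a toy dataset. The public score of a submission that consists of a single feature is simply that feature’s univariate AUC against the hidden labels, so every probe tells you how predictive that feature is and in which direction:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

# Toy stand-in for a hidden test set; in a real competition y_test is unknown
# and the score below would come back as public leaderboard feedback.
X_test, y_test = make_classification(
    n_samples=250, n_features=20, n_informative=5, random_state=0
)

# "Submit" each feature on its own and record the score it obtains.
probe_scores = {
    f"feature_{i}": roc_auc_score(y_test, X_test[:, i])
    for i in range(X_test.shape[1])
}

# Scores far from 0.5 reveal which features are predictive and in which
# direction, which is enough to hand-tune coefficients for a linear model.
ranked = sorted(probe_scores.items(), key=lambda kv: abs(kv[1] - 0.5), reverse=True)
for name, score in ranked[:5]:
    print(f"{name}: leaderboard AUC = {score:.3f}")
```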
Another form of leaderboard snooping happens when participants rely more on the feedback from the public leaderboard than on their own tests. Sometimes this turns into a complete failure of the competition, with a wild shake-up, that is, a complete and unpredictable reshuffling of the competitors’ positions on the final leaderboard. The winning solutions, in such a case, may (but not always) turn out to be not so optimal for the problem, or even sometimes simply dictated by chance. This has led competitors to analyze precisely whether the training set is much different from the test set they have to predict. Such analysis, called adversarial testing, can provide insight into whether to rely on the leaderboard and whether there are features that are distributed so differently between the training and test sets that they are better avoided completely (for an example, you can have a look at this Kernel by Bojan Tunguz: https://www.kaggle.com/tunguz/adversarial-ieee). Another form of defense against leaderboard overfitting is choosing safe strategies that avoid basing your final submissions too much on leaderboard results. For instance, since each participant is allowed to choose two solutions for the final evaluation, a good strategy is to select the one that performs best on the leaderboard and the one that performs best on your own cross-validation tests.
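The following is a minimal sketch of adversarial testing in Python, in the same spirit as the Kernel referenced above but not taken from it; it assumes hypothetical train.csv and test.csv files that share mostly numeric feature columns. The idea is to train a classifier to distinguish training rows from test rows: if it cannot (AUC near 0.5), the two sets look alike; if it can (AUC near 1.0), they differ, and the feature importances point at the columns you may want to drop:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Hypothetical file names; train and test are assumed to share feature columns.
train = pd.read_csv("train.csv").drop(columns=["target"])
test = pd.read_csv("test.csv")

# Label every row by its origin (0 = train, 1 = test) and try to tell them apart.
data = pd.concat([train, test], axis=0).select_dtypes("number").fillna(-999)
origin = np.r_[np.zeros(len(train)), np.ones(len(test))]

clf = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
preds = cross_val_predict(clf, data, origin, cv=5, method="predict_proba")[:, 1]

# An AUC close to 0.5 means train and test look alike, so leaderboard feedback
# is a reasonable guide; an AUC close to 1.0 means they differ, so it is safer
# to trust local validation and to inspect which features drive the difference.
print("Adversarial validation AUC:", roc_auc_score(origin, preds))

clf.fit(data, origin)
suspicious = pd.Series(clf.feature_importances_, index=data.columns).nlargest(10)
print(suspicious)
```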
In order to avoid problems with leaderboard probing and overfitting, Kaggle has recently introduced different innovations based on Code Competitions, where the evaluation is split into two distinct stages: there is a test set for the public leaderboard (the leaderboard you follow during the competition) and a completely held-out test set for the final private leaderboard. In this way, participants are blind to the actual data their solutions will be evaluated against, and they are pushed to rely more on their own tests and on building a general solution rather than one specific to the test set.
Finally, another possible distortion of a competition is due to private sharing (sharing ideas and solutions within a closed circle of participants) and other illicit moves, such as playing with multiple accounts or playing in multiple teams and funneling ideas from one to another. All such actions create an asymmetry of information between participants that is favorable to a few and detrimental to most. Again, the resulting solution may be affected, because sharing has been imperfect during the competition and fewer teams have been able to exercise full competitive pressure. Moreover, if such situations become evident to participants (for instance, see https://www.kaggle.com/c/ashrae-energy-prediction/discussion/122503), it can lead to distrust and less involvement in the competition or in subsequent competitions.