The rise of data science competition platforms
Commonly, problems in competitive programming range from combinatorics and number theory to graph theory, algorithmic game theory, computational geometry, string analysis, and data structures. Recently, problems related to artificial intelligence have also successfully emerged, in particular after the launch of the KDD Cup, a contest in knowledge discovery and data mining held by the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining during its annual conference.
The first KDD Cup, held in 1997, involved a problem on direct marketing for lift curve optimization, and it started a long series of competitions (you can find the archives containing datasets, instructions, and winners at https://www.kdd.org/kdd-cup) that continues to this day (here is the latest available at the time of writing: https://www.kdd.org/kdd2020/kdd-cup). KDD Cups proved quite effective in establishing best practices: many published papers describe the winning solutions and techniques, and the shared competition datasets have been useful to many practitioners for experimentation, education, and benchmarking.
The experience of competitive programming and the KDD Cups together gave rise to data science competition platforms: platforms where companies can host data science challenges that are hard to solve and could benefit from a crowdsourcing approach. In fact, since there is no golden approach that works for all problems in data science, many problems require a time-consuming strategy of trying everything you can try.
In the long run, no algorithm can beat all the others on all problems; each machine learning algorithm performs well only if its hypothesis space comprises the solution. Yet you cannot know that beforehand, so you have to try and test in order to be sure you are doing the right thing. You can consult the no free lunch theorem for a theoretical explanation of this practical truth; here is a complete article on the topic from Analytics India Magazine: https://analyticsindiamag.com/what-are-the-no-free-lunch-theorems-in-data-science/.
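To make this "try and test" approach concrete, here is a minimal sketch (the choice of scikit-learn, the synthetic dataset, and the candidate models are all illustrative assumptions, not a prescribed recipe): since no single algorithm is guaranteed to win on every problem, we simply cross-validate several candidates and let the data decide.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# A synthetic stand-in for whatever problem you are actually working on
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Several hypothesis spaces to explore; none is guaranteed to be the best one
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1_000),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm_rbf": SVC(),
    "k_nearest_neighbors": KNeighborsClassifier(),
}

for name, model in candidates.items():
    # 5-fold cross-validation gives an honest estimate of each model's quality
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:>22}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On a different dataset the ranking of these models may well change, which is exactly the point the no free lunch theorem makes.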
Crowdsourcing proves ideal in such conditions, when you need to test algorithms and data transformations extensively in order to find the best possible combinations, but you lack the manpower and computing power to do so. That is why, for instance, governments and companies resort to competitions in order to advance in certain fields. On the government side, we can cite DARPA and its many competitions on self-driving cars, robotic operations, machine translation, speaker identification, fingerprint recognition, information retrieval, OCR, automatic target recognition, and many others. On the business side, we can cite a company such as Netflix, which entrusted to a competition the task of improving its algorithm for predicting user movie selection.
The Netflix competition was based on the idea of improving existing collaborative filtering, whose purpose is simply to predict the potential rating a user would give a film, based solely on the ratings that the user gave to other films, without knowing specifically who the user is or what the films are. Since no user descriptions or movie titles or descriptions were available (all were replaced by identity codes), the competition required participants to develop smart ways of using the available past ratings. The grand prize of USD 1,000,000 was to be awarded only if the solution improved the existing Netflix algorithm, Cinematch, above a certain threshold.

The competition ran from 2006 to 2009 and was won in the end by a team formed by the fusion of teams that had previously competed against each other (a team from Commendo Research & Consulting GmbH, with Andreas Töscher and Michael Jahrer, quite renowned also in Kaggle competitions, two researchers from AT&T Labs, and two others from Yahoo!). In the end, winning the competition required so much computation power and so much ensembling of different solutions that teams were forced to merge in order to keep pace. This situation was also reflected in the actual usage of the solution by Netflix, which preferred not to implement it, but simply to take the most interesting insights from it in order to improve its existing Cinematch algorithm (you can read more about it in this Wired article: https://www.wired.com/2012/04/netflix-prize-costs/).

What mattered most in the end was not the solution per se, which was quickly superseded by Netflix's change in business from DVDs to online streaming. The real benefit for both the participants (who gained a huge reputation in collaborative filtering) and the company (which could transfer its improved recommendation knowledge to its new business) was the insight gained from participating in the competition (and we can anticipate that knowledge will also be the leitmotif of much of this book).
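To make the collaborative filtering setup described above more concrete, here is a minimal sketch in the spirit of the Netflix Prize data: users and films are just integer IDs, and the only signal available is the matrix of past ratings. The tiny ratings matrix and the plain matrix factorization trained by stochastic gradient descent below are illustrative assumptions, not the winning solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows are anonymous users, columns are anonymous films; 0 means "not rated"
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items = ratings.shape
n_factors = 2

# Latent factors for users (P) and films (Q), learned from the ratings alone
P = rng.normal(scale=0.1, size=(n_users, n_factors))
Q = rng.normal(scale=0.1, size=(n_items, n_factors))

lr, reg = 0.01, 0.02
for epoch in range(500):
    for u in range(n_users):
        for i in range(n_items):
            if ratings[u, i] > 0:  # only observed ratings drive the updates
                err = ratings[u, i] - P[u] @ Q[i]
                P[u] += lr * (err * Q[i] - reg * P[u])
                Q[i] += lr * (err * P[u] - reg * Q[i])

# Predicted rating of user 0 for the film they have not yet rated (item 2)
print(f"Predicted rating: {P[0] @ Q[2]:.2f}")
```

The prediction uses nothing but the pattern of past ratings, which is exactly the constraint the Netflix competition imposed; the winning solutions ensembled hundreds of far more sophisticated models built on the same principle.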