Using PySpark
Apache Spark is currently one of the most popular projects for distributed computing. Developed in Scala, Spark reached its 1.0 release in 2014; it integrates with HDFS and provides several advantages and improvements over the Hadoop MapReduce framework.
Unlike Hadoop MapReduce, Spark is designed to process data interactively and provides APIs for the Java, Scala, and Python programming languages. Because of its different architecture, in particular the fact that Spark keeps intermediate results in memory, it is generally much faster than Hadoop MapReduce.
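To illustrate this interactive style, the following minimal sketch (assuming a working local PySpark installation, and using SparkSession, the standard entry point since Spark 2.x) caches a small dataset in memory and reuses it across two actions:

from pyspark.sql import SparkSession

# Start a local Spark session (the entry point since Spark 2.x).
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext

# Distribute a small collection and keep it in memory after the first action.
numbers = sc.parallelize(range(1_000_000)).cache()

# Both actions reuse the cached data instead of recomputing it from scratch.
print(numbers.count())                      # 1000000
print(numbers.map(lambda x: x * 2).sum())   # 999999000000

Because the data stays cached in memory, the second action does not recompute the distributed collection, which is the key reason for Spark's speed advantage in iterative and interactive workloads.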
Setting up Spark and PySpark
Setting up PySpark from scratch requires installing the Java and Scala runtimes, building the project from source, and configuring Python and Jupyter Notebook to work alongside the Spark installation. An easier and less error-prone way to set up PySpark is to use an already configured Spark environment made available through a Docker container.
Note
Docker can be downloaded at...
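Once a container providing Spark and Jupyter is running (community images such as jupyter/pyspark-notebook are one common choice, though any preconfigured image will do), a few lines of Python are enough to confirm that PySpark is usable. The snippet below is a minimal sketch assuming the container exposes a standard PySpark installation:

from pyspark.sql import SparkSession

# Connect to the Spark installation shipped with the container.
spark = SparkSession.builder.appName("sanity-check").getOrCreate()

# Report the Spark version and run a trivial distributed computation.
print("Spark version:", spark.version)
print(spark.sparkContext.parallelize([1, 2, 3, 4]).sum())   # 10

spark.stop()

If the version is printed and the sum evaluates to 10, the containerized Spark installation is working and ready for the examples in the rest of this section.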