Chapter 8. Distributed Processing
In the last chapter, we introduced the concept of parallel processing and learned how to leverage multicore processors and GPUs. Now we can step up our game a bit and turn our attention to distributed processing, which involves executing tasks across multiple machines to solve a given problem.
In this chapter, we will illustrate the challenges, use cases, and examples of how to run code on a cluster of computers. Python offers easy-to-use and reliable packages for distributed processing, which will allow us to implement scalable and fault-tolerant code with relative ease.
The list of topics for this chapter is as follows:
- Distributed computing and the MapReduce model
- Directed Acyclic Graphs with Dask
- Writing parallel code with Dask's `array`, `Bag`, and `DataFrame` data structures
- Distributing parallel algorithms with Dask Distributed
- An introduction to PySpark
- Spark's Resilient Distributed Datasets and DataFrames
- Scientific computing with `mpi4py`