Performing out-of-core computations on large arrays with Dask
Dask is a parallel computing library that offers not only a general framework for distributing complex computations on many nodes, but also a set of convenient high-level APIs to deal with out-of-core computations on large arrays. Dask provides data structures resembling NumPy arrays (dask.array) and pandas DataFrames (dask.dataframe) that efficiently scale to huge datasets. The core idea of Dask is to split a large array into smaller arrays (chunks).
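As a quick illustration of the chunking idea (the array and chunk shapes here are chosen for illustration, not taken from this recipe), a Dask array can be declared with an explicit chunk shape, and reductions are then computed one chunk at a time:

>>> import dask.array as da

    # A large logical array backed by a 4 x 4 grid of small NumPy chunks.
    x = da.ones((10000, 10000), chunks=(2500, 2500))
    x.numblocks
(4, 4)
>>> # Operations are lazy: sum() builds a task graph that compute()
    # executes chunk by chunk, so the full array never needs to fit
    # in memory at once.
    x.sum().compute()
100000000.0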
In this recipe, we illustrate the basic principles of dask.array.
Getting ready
Dask should already be installed in Anaconda, but you can always install it manually with conda install dask. You also need memory_profiler, which you can install with conda install memory_profiler.
How to do it...
Let's import the libraries:
>>> import numpy as np
    import dask.array as da
    import memory_profiler
>>> %load_ext memory_profiler
We initialize a large 10,000 x 10,000 array...
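As a sketch of what such an initialization might look like (the chunk shape of 1,000 x 1,000 is an assumption on our part), dask.array's random module accepts a chunks argument alongside the array size:

>>> # Assumed chunk shape for illustration: a 10 x 10 grid of
    # 1000 x 1000 blocks. Nothing is allocated yet; y is a lazy
    # task graph over 100 chunks.
    y = da.random.normal(size=(10000, 10000), chunks=(1000, 1000))
    y.numblocks
(10, 10)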