Caching
Caching enables Spark to persist data across operations. It is one of the most important techniques in Spark for speeding up computations, particularly iterative ones, where the same data is reused many times.
Caching works by storing as much of the RDD as possible in memory. If there is not enough memory, currently stored data is evicted according to an LRU (least recently used) policy. If the data being cached is larger than the available memory, performance degrades because disk is used instead of memory.
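As a minimal sketch of the benefit for iterative work, consider an RDD that several actions reuse. The data here is hypothetical, an existing SparkContext named sc is assumed, and the cache() call is described just below:

```scala
// Assuming an existing SparkContext named `sc`; the data is hypothetical.
val squares = sc.parallelize(1 to 1000000)
  .map(n => n.toLong * n)
  .cache() // mark the RDD for in-memory storage (see below)

// The first action materializes `squares` and stores its partitions in
// memory; the second action reads them back instead of recomputing them.
val total = squares.sum()
val count = squares.count()
```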
You can mark an RDD as cached using either persist() or cache().
Note
cache() is simply a synonym for persist(MEMORY_ONLY).
persist can use memory, disk, or both:
persist(newLevel: StorageLevel)
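A short sketch of both calls (again assuming an existing SparkContext named sc; the RDD is hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.parallelize(Seq("spark", "caching", "spark"))

words.cache()      // equivalent to persist(StorageLevel.MEMORY_ONLY)
words.unpersist()  // a storage level cannot be changed while one is assigned

// Explicit level: keep what fits in memory, spill the rest to disk.
words.persist(StorageLevel.MEMORY_AND_DISK)
```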
The following are the possible storage levels:
| Storage Level | Meaning |
| --- | --- |
| MEMORY_ONLY | Stores RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. |
| MEMORY_AND_DISK | Stores RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, stores the partitions that don't fit on disk, and reads them from there when they're needed. |
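To check which level an RDD has been assigned, you can inspect its getStorageLevel; a small sketch with hypothetical data:

```scala
import org.apache.spark.storage.StorageLevel

// Assuming an existing SparkContext named `sc`.
val rdd = sc.parallelize(1 to 100).cache()

// cache() assigns the default MEMORY_ONLY level.
println(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY) // true
```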