Caching
Caching enables Spark to persist data across operations. It is one of the most important techniques in Spark for speeding up computations, particularly iterative ones, where the same data is reused many times.
Caching works by storing as much of the RDD as possible in memory. If there is not enough memory, currently stored data is evicted according to an LRU (least recently used) policy. If the data being cached is larger than the available memory, performance degrades because disk is used instead of memory.
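As a minimal sketch of the benefit for iterative work, consider an RDD that several actions reuse. The data here is hypothetical, an existing SparkContext named sc is assumed, and the cache() call is described just below:

```scala
// Assuming an existing SparkContext named `sc`; the data is hypothetical.
val squares = sc.parallelize(1 to 1000000)
  .map(n => n.toLong * n)
  .cache() // mark the RDD for in-memory storage (see below)

// The first action materializes `squares` and stores its partitions in
// memory; the second action reads them back instead of recomputing them.
val total = squares.sum()
val count = squares.count()
```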
You can mark an RDD as cached using either persist() or cache().
Note
cache() is simply a synonym for persist(MEMORY_ONLY).
persist can use memory, disk, or both:
persist(newLevel: StorageLevel)
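A short sketch of both calls (again assuming an existing SparkContext named sc; the RDD is hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.parallelize(Seq("spark", "caching", "spark"))

words.cache()      // equivalent to persist(StorageLevel.MEMORY_ONLY)
words.unpersist()  // a storage level cannot be changed while one is assigned

// Explicit level: keep what fits in memory, spill the rest to disk.
words.persist(StorageLevel.MEMORY_AND_DISK)
```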
The following are the possible storage levels:
| Storage Level | Meaning |
| --- | --- |
| MEMORY_ONLY | Stores RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. |
| MEMORY_AND_DISK | Stores RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, stores the partitions that don't fit on disk, and reads them from there when they're needed. |
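To check which level an RDD has been assigned, you can inspect its getStorageLevel; a small sketch with hypothetical data:

```scala
import org.apache.spark.storage.StorageLevel

// Assuming an existing SparkContext named `sc`.
val rdd = sc.parallelize(1 to 100).cache()

// cache() assigns the default MEMORY_ONLY level.
println(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY) // true
```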