Preface
The continued growth in data, coupled with the need to make increasingly complex decisions against that data, is creating massive hurdles that prevent organizations from deriving insights in a timely manner using traditional analytical approaches. The field of big data has become so closely tied to distributed processing frameworks that its scope is often defined by what those frameworks can handle. Whether you're scrutinizing the clickstream from millions of visitors to optimize online ad placements, or sifting through billions of transactions to identify signs of fraud, the need for advanced analytics, such as machine learning and graph processing, to automatically glean insights from enormous volumes of data is more evident than ever.
Apache Spark, the de facto standard for big data processing, analytics, and data science across academia and industry, provides both machine learning and graph processing libraries, allowing companies to tackle complex problems easily with the power of highly scalable clusters of computers. Spark's promise is to take this a little further and make writing distributed programs in Scala feel like writing regular programs. Spark gives ETL pipelines huge boosts in performance and eases some of the pain that feeds the MapReduce programmer's daily chant of despair to the Hadoop gods.
In this book, we use Spark and Scala to perform state-of-the-art advanced data analytics with machine learning, graph processing, streaming, and SQL, drawing on Spark's MLlib, ML, SQL, GraphX, and other libraries.
We start with Scala, then move to Spark, and finally cover some advanced topics for big data analytics with Spark and Scala. In the appendix, we will see how to extend your Scala knowledge to SparkR, PySpark, Apache Zeppelin, and in-memory Alluxio. This book isn't meant to be read from cover to cover. Skip to a chapter that looks like something you're trying to accomplish or that simply ignites your interest.
Happy reading!
What this book covers
Chapter 1, Introduction to Scala, lays the groundwork for learning big data analytics using the Scala-based APIs of Spark. Spark itself is written in Scala, so as a starting point we give a brief introduction to Scala: the basic aspects of its history, its purposes, and how to install it on Windows, Linux, and Mac OS. After that, the Scala web framework will be discussed in brief. Then, we will provide a comparative analysis of Java and Scala. Finally, we will dive into Scala programming to get started with Scala.
Chapter 2, Object-Oriented Scala, says that the object-oriented programming (OOP) paradigm provides a whole new layer of abstraction. In short, this chapter discusses some of the greatest strengths of OOP languages: discoverability, modularity, and extensibility. In particular, we will see how to deal with variables in Scala; methods, classes, and objects in Scala; packages and package objects; traits and trait linearization; and Java interoperability.
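As a small preview of the trait linearization mentioned for Chapter 2, here is a minimal sketch using only the Scala standard library. All trait and class names (Greeter, Polite, Loud, and so on) are hypothetical and chosen purely for illustration; they do not come from the book's code bundle.

```scala
// Hypothetical names throughout; a sketch of how mixin order affects behavior.
trait Greeter {
  def greet(name: String): String = s"Hello, $name"
}

trait Polite extends Greeter {
  // super refers to the next trait in the linearization order
  override def greet(name: String): String =
    super.greet(name) + ", nice to meet you"
}

trait Loud extends Greeter {
  override def greet(name: String): String =
    super.greet(name).toUpperCase
}

// Linearization runs right to left: Loud wraps Polite, which wraps Greeter
class PoliteThenLoud extends Greeter with Polite with Loud

// Swapping the mixin order changes which override runs last
class LoudThenPolite extends Greeter with Loud with Polite

object LinearizationDemo {
  def main(args: Array[String]): Unit = {
    println(new PoliteThenLoud().greet("Ada")) // HELLO, ADA, NICE TO MEET YOU
    println(new LoudThenPolite().greet("Ada")) // HELLO, ADA, nice to meet you
  }
}
```

The two classes mix in the same traits, yet produce different results because the rightmost trait is consulted first and each super call moves one step to the left.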
Chapter 3, Functional Programming Concepts, showcases the functional programming concepts in Scala. More specifically, we will learn several topics, such as why Scala is an arsenal for the data scientist, why it is important to learn the Spark paradigm, pure functions, and higher-order functions (HOFs). A real-life use case using HOFs will be shown too. Then, we will see how to handle exceptions in higher-order functions outside of collections using the standard library of Scala. Finally, we will look at how functional Scala affects an object's mutability.
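To give a flavor of the ideas listed for Chapter 3, the following is a minimal sketch of a pure function, a higher-order function, and exception handling with scala.util.Try from the standard library. The object and method names (HofDemo, square, applyTwice, safeDivide) are assumptions made for this illustration, not the chapter's actual examples.

```scala
import scala.util.{Failure, Success, Try}

object HofDemo {
  // Pure function: the result depends only on the argument, with no side effects
  def square(x: Int): Int = x * x

  // Higher-order function: takes another function as a parameter
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Try captures a thrown exception as a value instead of letting it propagate
  def safeDivide(a: Int, b: Int): Try[Int] = Try(a / b)

  def main(args: Array[String]): Unit = {
    println(applyTwice(square, 3))     // 81
    println(List(1, 2, 3).map(square)) // List(1, 4, 9)

    safeDivide(10, 0) match {
      case Success(v) => println(s"result: $v")
      case Failure(e) => println(s"failed: ${e.getClass.getSimpleName}")
    }
  }
}
```

Because safeDivide returns a Try, the caller decides how to handle failure through pattern matching rather than a try/catch block.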
Chapter 4, Collection APIs, introduces one of the features that attracts most Scala users: the Collections API. It is very powerful and flexible, and offers a large number of operations. We will also demonstrate the capabilities of the Scala Collection API and how it can be used to accommodate different types of data and solve a wide range of different problems. In this chapter, we will cover Scala collection APIs, types and hierarchy, some performance characteristics, Java interoperability, and Scala implicits.
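As a brief taste of the collection operations and implicits covered in Chapter 4, here is a minimal sketch that relies only on the standard library. The sample data, the threshold of 5, and the names CollectionsDemo, RichCount, and isFrequent are all hypothetical.

```scala
object CollectionsDemo {
  // Hypothetical sample data: (word, count) pairs
  val counts: List[(String, Int)] =
    List("spark" -> 3, "scala" -> 5, "spark" -> 2, "hadoop" -> 1)

  // groupBy followed by map: total count per word
  val totals: Map[String, Int] =
    counts.groupBy(_._1).map { case (word, pairs) =>
      word -> pairs.map(_._2).sum
    }

  // Implicit class: adds a method to Int without modifying Int itself
  implicit class RichCount(n: Int) {
    def isFrequent: Boolean = n >= 5 // threshold chosen arbitrarily
  }

  // filter uses the implicitly added method; sorting makes the order deterministic
  val frequent: List[String] =
    totals.filter { case (_, n) => n.isFrequent }.keys.toList.sorted

  def main(args: Array[String]): Unit = {
    println(totals("spark")) // 5
    println(frequent)        // List(scala, spark)
  }
}
```

The same groupBy, map, and filter vocabulary reappears later in Spark's own APIs, which is one reason the collections are worth learning first.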
Chapter 5, Tackle Big Data - Spark Comes to the Party, outlines data analysis and big data; we see the challenges that big data poses, how they are dealt with by distributed computing, and the approaches suggested by functional programming. We introduce Google's MapReduce, Apache Hadoop, and finally, Apache Spark, and see how they embraced this approach and these techniques. We will look into the evolution of Apache Spark: why Apache Spark was created in the first place and the value it can bring to the challenges of big data analytics and processing.
Chapter 6, Start Working with Spark - REPL and RDDs, covers how Spark works; then, we introduce RDDs, the basic abstractions behind Apache Spark, and see that they are simply distributed collections exposing Scala-like APIs. We will look at the deployment options for Apache Spark and run it locally as a Spark shell. We will learn the internals of Apache Spark, what RDDs are, DAGs and lineages of RDDs, Transformations, and Actions.
Chapter 7, Special RDD Operations, focuses on how RDDs can be tailored to meet different needs, and how these RDDs provide new functionalities (and dangers!). Moreover, we investigate other useful objects that Spark provides, such as broadcast variables and accumulators. We will also learn about aggregation techniques and shuffling.
Chapter 8, Introduce a Little Structure - SparkSQL, teaches how to use Spark for the analysis of structured data as a higher-level abstraction of RDDs and how Spark SQL's APIs make querying structured data simple yet robust. Moreover, we introduce datasets and look at the differences between datasets, DataFrames, and RDDs. We will also learn to use join operations and window functions to do complex data analysis using DataFrame APIs.
Chapter 9, Stream Me Up, Scotty - Spark Streaming, takes you through Spark Streaming and how we can take advantage of it to process streams of data using the Spark API. Moreover, in this chapter, the reader will learn various ways of processing real-time streams of data using a practical example to consume and process tweets from Twitter. We will look at integration with Apache Kafka to do real-time processing. We will also look at structured streaming, which can provide real-time queries to your applications.
Chapter 10, Everything is Connected - GraphX, shows how many real-world problems can be modeled (and resolved) using graphs. We will look at graph theory using Facebook as an example, Apache Spark's graph processing library GraphX, VertexRDD and EdgeRDD, graph operators, aggregateMessages, TriangleCounting, the Pregel API, and use cases such as the PageRank algorithm.
Chapter 11, Learning Machine Learning - Spark MLlib and ML, provides a conceptual introduction to statistical machine learning. We will focus on Spark's machine learning APIs, called Spark MLlib and ML. We will then discuss how to solve classification tasks using decision tree and random forest algorithms, and regression problems using the linear regression algorithm. We will also show how we can benefit from using one-hot encoding and dimensionality reduction algorithms in feature extraction before training a classification model. In later sections, we will show a step-by-step example of developing a collaborative filtering-based movie recommendation system.
Chapter 12, My Name is Bayes, Naive Bayes, states that machine learning in big data is a radical combination that has created great impact in the field of research, in both academia and industry. Big data imposes great challenges on ML, data analytics tools, and algorithms to find the real value. However, making a future prediction based on these huge datasets has never been easy. Considering this challenge, in this chapter, we will dive deeper into ML and find out how to use a simple yet powerful method to build a scalable classification model and concepts such as multinomial classification, Bayesian inference, Naive Bayes, decision trees, and a comparative analysis of Naive Bayes versus decision trees.
Chapter 13, Time to Put Some Order - Cluster Your Data with Spark MLlib, gets you started on how Spark works in cluster mode with its underlying architecture. In previous chapters, we saw how to develop practical applications using different Spark APIs. Finally, we will see how to deploy a full Spark application on a cluster, be it with a pre-existing Hadoop installation or without.
Chapter 14, Text Analytics Using Spark ML, outlines the wonderful field of text analytics using Spark ML. Text analytics is a wide area in machine learning and is useful in many use cases, such as sentiment analysis, chat bots, email spam detection, natural language processing, and many more. We will learn how to use Spark for text analysis with a focus on use cases of text classification using a set of 10,000 samples of Twitter data. We will also look at LDA, a popular technique to generate topics from documents without knowing much about the actual text, and will implement text classification on Twitter data to see how it all comes together.
Chapter 15, Spark Tuning, digs deeper into Apache Spark internals and says that while Spark is great at making us feel as if we are using just another Scala collection, we shouldn't forget that Spark actually runs in a distributed system. Therefore, throughout this chapter, we will cover how to monitor Spark jobs, Spark configuration, common mistakes in Spark app development, and some optimization techniques.
Chapter 16, Time to Go to ClusterLand - Deploying Spark on a Cluster, explores how Spark works in cluster mode with its underlying architecture. We will see Spark architecture in a cluster, the Spark ecosystem and cluster management, and how to deploy Spark on standalone, Mesos, YARN, and AWS clusters. We will also see how to deploy your app on a cloud-based AWS cluster.
Chapter 17, Testing and Debugging Spark, explains how difficult it can be to test an application if it is distributed; then, we see some ways to tackle this. We will cover how to do testing in a distributed environment, and testing and debugging Spark applications.
Chapter 18, PySpark & SparkR, covers the other two popular APIs for writing Spark code using Python and R, that is, PySpark and SparkR. In particular, we will cover how to get started with PySpark and how to interact with DataFrame APIs and UDFs with PySpark, and then we will do some data analytics using PySpark. The second part of this chapter covers how to get started with SparkR. We will also see how to do data processing and manipulation, how to work with RDDs and DataFrames using SparkR, and finally, some data visualization using SparkR.
Chapter 19, Advanced Machine Learning Best Practices, provides theoretical and practical aspects of some advanced topics of machine learning with Spark. We will see how to tune machine learning models for optimized performance using grid search, cross-validation, and hyperparameter tuning. In a later section, we will cover how to develop a scalable recommendation system using ALS, which is an example of a model-based recommendation algorithm. Finally, a topic modeling application will be demonstrated as a text clustering technique.
Appendix A, Accelerating Spark with Alluxio, shows how to use Alluxio with Spark to increase the speed of processing. Alluxio is an open source distributed memory storage system useful for increasing the speed of many applications across platforms, including Apache Spark. We will explore the possibilities of using Alluxio and how Alluxio integration will provide greater performance without the need to cache the data in memory every time we run a Spark job.
Appendix B, Interactive Data Analytics with Apache Zeppelin, says that from a data science perspective, interactive visualization of your data analysis is also important. Apache Zeppelin is a web-based notebook for interactive and large-scale data analytics with multiple backends and interpreters. In this appendix, we will discuss how to use Apache Zeppelin for large-scale data analytics using Spark as the interpreter in the backend.
Chapter 19 and Appendices are not present in the book but are available for download at the following link: https://www.packtpub.com/sites/default/files/downloads/ScalaandSparkforBigDataAnalytics_OnlineChapter_Appendices.pdf.
What you need for this book
All the examples have been implemented using Python versions 2.7 and 3.5 on 64-bit Ubuntu Linux, including the TensorFlow library version 1.0.1. However, in the book, we show only source code that is compatible with Python 2.7. Source code that is compatible with Python 3.5+ can be downloaded from the Packt repository. You will also need the following software (preferably the latest versions):
- Spark 2.0.0 (or higher)
- Hadoop 2.7 (or higher)
- Java (JDK and JRE) 1.7+/1.8+
- Scala 2.11.x (or higher)
- Python 2.7+/3.4+
- R 3.1+ and RStudio 1.0.143 (or higher)
- Eclipse Mars, Oxygen, or Luna (latest)
- Maven Eclipse plugin (2.9 or higher)
- Maven compiler plugin for Eclipse (2.3.2 or higher)
- Maven assembly plugin for Eclipse (2.4.1 or higher)
Operating system: Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, and CentOS). For Ubuntu specifically, a complete 14.04 (LTS) 64-bit (or later) installation is recommended, either native or running under VMware Player 12 or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).
Hardware configuration: Processor Core i3, Core i5 (recommended), or Core i7 (to get the best results). Multicore processing will provide faster data processing and scalability. You will need at least 8-16 GB RAM (recommended) for standalone mode and at least 32 GB RAM for a single VM, and more for a cluster. You will also need enough storage for running heavy jobs (depending on the dataset size you will be handling), preferably at least 50 GB of free disk storage (for standalone mode and for an SQL warehouse).
Who this book is for
Anyone who wishes to learn how to perform data analysis by harnessing the power of Spark will find this book extremely useful. No knowledge of Spark or Scala is assumed, although prior programming experience (especially with other JVM languages) will help you pick up the concepts more quickly. Scala has seen a steady rise in adoption over the past few years, especially in the fields of data science and analytics. Going hand in hand with Scala is Apache Spark, which is programmed in Scala and is widely used in the field of analytics. This book will help you leverage the power of both these tools to make sense of big data.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The next lines of code read the link and assign it to the BeautifulSoup function."
A block of code is set as follows:
package com.chapter11.SparkMachineLearning
import org.apache.spark.mllib.feature.StandardScalerModel
import org.apache.spark.mllib.linalg.{ Vector, Vectors }
import org.apache.spark.sql.{ DataFrame }
import org.apache.spark.sql.SparkSession
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
val spark = SparkSession
.builder
.master("local[*]")
.config("spark.sql.warehouse.dir", "E:/Exp/")
.config("spark.kryoserializer.buffer.max", "1024m")
.appName("OneVsRestExample")
.getOrCreate()
Any command-line input or output is written as follows:
$ ./bin/spark-submit --class com.chapter11.RandomForestDemo \
  --master spark://ip-172-31-21-153.us-west-2.compute:7077 \
  --executor-memory 2G \
  --total-executor-cores 2 \
  file:///home/KMeans-0.0.1-SNAPSHOT.jar \
  file:///home/mnist.bz2
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."
Note
Warnings or important notes appear like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:
- Log in or register to our website using your e-mail address and password.
- Hover the mouse pointer on the SUPPORT tab at the top.
- Click on Code Downloads & Errata.
- Enter the name of the book in the Search box.
- Select the book for which you're looking to download the code files.
- Choose from the drop-down menu where you purchased this book from.
- Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
- WinRAR / 7-Zip for Windows
- Zipeg / iZip / UnRarX for Mac
- 7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Scala-and-Spark-for-Big-Data-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/ScalaandSparkforBigDataAnalytics_ColorImages.pdf
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.