DataFrame API and SQL API
A DataFrame can be created in several ways:
- By executing SQL queries
- Loading external data such as Parquet, JSON, CSV, text, Hive, JDBC, and so on
- Converting RDDs to DataFrames (see the sketch after this list)
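As a quick illustration of the last approach, an RDD of tuples can be converted with toDF. This is a minimal sketch; the sample rows and the names rowsRDD and fromRDD are our own, and a SparkSession named spark (as provided by spark-shell) is assumed:
scala> import spark.implicits._ // brings the RDD-to-DataFrame conversions into scope
scala> val rowsRDD = spark.sparkContext.parallelize(Seq(("Alabama", 2010, 4785492), ("Alaska", 2010, 714031))) // hypothetical sample rows
scala> val fromRDD = rowsRDD.toDF("State", "Year", "Population") // column names supplied explicitly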
A DataFrame can also be created by loading a CSV file. We will look at a CSV file, statesPopulation.csv, and load it as a DataFrame.
The CSV file contains US state populations for the years 2010 to 2016, in the following format:
State      | Year | Population
Alabama    | 2010 | 4785492
Alaska     | 2010 | 714031
Arizona    | 2010 | 6408312
Arkansas   | 2010 | 2921995
California | 2010 | 37332685
Since the CSV file has a header, we can quickly load it into a DataFrame and let Spark infer the schema.
scala> val statesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")
statesDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]
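Schema inference costs an extra pass over the data, so the schema can also be supplied explicitly. The following is a sketch against the same file; the names statesSchema and statesWithSchemaDF are our own:
scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
scala> val statesSchema = StructType(Seq( // explicit schema instead of inference
     |   StructField("State", StringType, nullable = true),
     |   StructField("Year", IntegerType, nullable = true),
     |   StructField("Population", IntegerType, nullable = true)))
scala> val statesWithSchemaDF = spark.read.option("header", "true").schema(statesSchema).csv("statesPopulation.csv")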
Once the DataFrame is loaded, we can examine its schema:
scala> statesDF.printSchema
root
|-- State: string (nullable = true)
|-- Year: integer (nullable = true)
|-- Population: integer (nullable = true)
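The first creation method listed above, executing SQL queries, works against this same DataFrame once it is registered as a temporary view. A minimal sketch follows; the view name states and the query itself are our own choices:
scala> statesDF.createOrReplaceTempView("states") // register the DataFrame so SQL can reference it
scala> val populations2010DF = spark.sql("SELECT State, Population FROM states WHERE Year = 2010") // spark.sql returns a new DataFrame
scala> populations2010DF.show(5)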