Packt+ | Advance your knowledge in tech

0

All Products

Best Sellers

New Releases

Books

Videos

Audiobooks

Free Learning

Apache Spark 2.x Cookbook

You're reading from Apache Spark 2.x Cookbook Over 70 cloud-ready recipes for distributed Big Data processing and analytics

Product type Paperback

Published in May 2017

Publisher

ISBN-13 9781787127265

Length 294 pages

Edition 1st Edition

Languages

Scala

Tools

Apache Spark

Concepts

Big Data

Author (1):

Yadav

View More author details

Table of Contents (19) Chapters

Title Page

Credits

About the Author

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

1. Getting Started with Apache Spark FREE CHAPTER

2. Developing Applications with Spark

3. Spark SQL

4. Working with External Data Sources

5. Spark Streaming

6. Getting Started with Machine Learning

7. Supervised Learning with MLlib — Regression

8. Supervised Learning with MLlib — Classification

9. Unsupervised Learning

10. Recommendations Using Collaborative Filtering

11. Graph Processing Using GraphX and GraphFrames

12. Optimizations and Performance Tuning

Analyzing nested structures

There is a reason why the nested structures recipe is right after that of joins. Nested structures have traditionally been associated with web-based applications and hyper-scale companies. The most common format of nested structures is JSON. JSON inherited nested structures from XML, which JSON made irrelevant.

Getting ready

The power of nested structures goes far beyond traditional use cases, though. It has been very difficult to represent hierarchical data in highly normalized databases. Data needs to be joined across tables as needed. This does provide us with flexibility. Let's understand it with the example we covered in the previous recipe. In the Yelp dataset, a user reviews a business, which is represented by yelp_academic_dataset_review.json. In reality, a user reviews multiple businesses and a business is reviewed by multiple users. One would argue that it represents standard NxN relationships between entities, so what's the big deal here? The challenges...

The rest of the chapter is locked

Register for a free Packt account to unlock a world of extra content!

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at ₹800/month. Cancel anytime

Authors (1)

Yadav

Yadav

Rishi Yadav has 19 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data and public cloud trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He earned his bachelor's degree from the prestigious Indian Institute of Technology, Delhi, in 1998. About 12 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data. InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest growing companies for 6 years in a row. InfoObjects has also been named the best place to work in the Bay Area in 2014 and 2015. Rishi is an open source contributor and active blogger. This book is dedicated to my parents, Ganesh and Bhagwati Yadav; I would not be where I am without their unconditional support, trust, and providing me the freedom to choose a path of my own. Special thanks go to my life partner, Anjali, for providing immense support and putting up with my long, arduous hours (yet again).Our 9-year-old son, Vedant, and niece, Kashmira, were the unrelenting force behind keeping me and the book on track. Big thanks to InfoObjects' CTO and my business partner, Sudhir Jangir, for providing valuable feedback and also contributing with recipes on enterprise security, a topic he is passionate about; to our SVP, Bart Hickenlooper, for taking the charge in leading the company to the next level; to Tanmoy Chowdhury and Neeraj Gupta for their valuable advice; to Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and to our internal review team (especially Rakesh Chandran) for ironing out the kinks. I would also like to thank Marcel Izumi for, as always, providing creative visuals. I cannot miss thanking our dog, Sparky, for giving me company on my long nights out. Last but not least, special thanks to our valuable clients, partners, and employees, who have made InfoObjects the best place to work at and, needless to say, an immensely successful organization.

See other products by Yadav

Other recommended products

Related to this chapter

Learning Spark SQL

Learning Spark SQL

In the past year, Apache Spark has been increasingly adopted for development of distributed applications. Spark SQL APIs provides an optimized interface that helps developers build such applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. Understanding the design and implementation best practices for Spark SQL API based applications before you start your project will help you avoid these problems and ensure that your project is a success. Learning Spark SQL gives an insight into the engineering practices used to design and build real-world Spark-based applications. The hands-on examples will give you the required confidence to work on any future projects you encounter in Spark SQL.

Sep 2017 15h 4m

Apache Spark Quick Start Guide

Apache Spark Quick Start Guide

Apache Spark is a ?exible in-memory framework that allows processing of both batch and real-time data. Its unified engine has made it quite popular for big data use cases. This book will help you to quickly get started with Apache Spark 2.0 and write efficient big data applications for a variety of use cases.

Hands-On Data Analysis with Scala

Hands-On Data Analysis with Scala

This book will help you perform effective data analysis with Scala using practical examples. You will come across different challenges and their effective solutions for a variety of data processing tasks - be it data exploration, data manipulation, or real-time data analysis using Apache Spark.

May 2019 9h 56m

Scala and Spark for Big Data Analytics

Scala and Spark for Big Data Analytics

Over the last few years, Scala has been adopted increasingly, especially in the field of data science and analytics, along with Apache Spark, which is built on Scala and is widely used in the field of analytics. With this book, you'll learn how to leverage the power of both Scala and Spark to make sense of big data.

Jul 2017 26h 32m

Learning Apache Spark 2

Learning Apache Spark 2

Apache Spark is one of the most popular Big Data processing frameworks today, delivering speed, accuracy and real-time results – all in one solution. With this book, you will delve into the world of Apache Spark and learn about the new features introduced in Spark 2, along with the architecture and the associated concepts. A comprehensive guide to Apache Spark 2 for beginners, this book covers everything you need to know to get up and running with Big Data processing, machine learning and stream processing with Apache Spark, and allows you to easily understand each of these concepts through real-world examples.

Mar 2017 11h 52m

Machine Learning with Scala Quick Start Guide

Machine Learning with Scala Quick Start Guide

Scala as a programming language is a highly scalable integration of object-oriented and functional programming, which makes it easy to build scalable and complex big data applications. This book is a handy guide for machine learning developers and data scientists who want to train effective machine learning models using this popular language.

Apr 2019 7h 20m

Mastering Apache Spark 2.x

Mastering Apache Spark 2.x

Apache Spark is an in-memory cluster-based parallel processing system that provides a wide range of functionality like graph processing, machine learning, stream processing and more. This book will familiarize you with the newest features in Apache Spark 2.x, and take you through an exciting journey of complex Big Data processing, analytics, streaming analytics as well as advanced machine learning with Apache Spark. During the course of the book, you will leverage different functionalities and modules of Apache Spark such as Spark SQL, Spark MLlib, Spark Streaming, SparkML and more, to build efficient data processing solutions. By the end of this book, you will have all the necessary knowledge to use Apache Spark effectively in your day to day tasks.

Jul 2017 11h 48m

Hands-On Big Data Analytics with PySpark

Hands-On Big Data Analytics with PySpark

In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. You will also learn how to use Spark and its Python API to create performant analytics with large-scale data.

Apache Spark 2.x for Java Developers

Apache Spark 2.x for Java Developers

Apache Spark is the buzzword in the big data industry right now, especially with the increasing need for real-time streaming and data processing. While Spark is built on Scala, the Spark Java API exposes all the Spark features available in the Scala version for Java developers. This book will show you how you can implement various functionalities of the Apache Spark framework in Java, without stepping out of your comfort zone.

Jul 2017 11h 40m

Learning Apache Flink

Learning Apache Flink

Feb 2017 9h 20m

PySpark Cookbook

PySpark Cookbook

This cookbook presents recipes on leveraging the power of Python and putting it to use in the Apache Spark ecosystem. By the end of this book, you will be able to solve any problem associated with building effective, data-intensive applications and performing machine learning and structured streaming using PySpark.

Jun 2018 11h 0m

Learning PySpark

Learning PySpark

This book will get you to grips with the Spark Python API. You'll explore how Python can be used with Spark to build scalable and reliable data-intensive applications.

Personalised recommendations for you

Based on your interests and search pattern

Mathematics of Machine Learning

Mathematics of Machine Learning

Deepen your theoretical knowledge and enhance your ability to solve complex machine learning problems with structured guidance. Gain the confidence to engage with advanced ML literature and tailor algorithms to meet your project requirements.

May 2025 24h 20m

Generative AI with Python and PyTorch

Generative AI with Python and PyTorch

Learn how to create images and text using VAEs, GANs, LSTMs, and transformers. Implement applications in natural language processing and computer vision through practical tutorials.

Mar 2025 15h 0m

Practical Generative AI with ChatGPT

Practical Generative AI with ChatGPT

This book helps you unlock ChatGPT's potential to make your working life better. From prompt engineering to creating custom GPTs, you'll enhance your productivity, creativity, and efficiency with practical insights and advanced techniques.

Apr 2025 12h 52m

Generative AI with LangChain

Generative AI with LangChain

Gain a solid foundation in LangChain, agentic AI, and LangGraph, and learn to build production-ready systems with multi-agent architectures, advanced RAG pipelines, Tree of Thought reasoning, agent handoffs, and fine-grained error handling.

May 2025 15h 52m

Architecting Power BI Solutions in Microsoft Fabric

Architecting Power BI Solutions in Microsoft Fabric

Power BI provides several options to solve common data problems, and designing the correct solution for each scenario can be a daunting task. This book makes it easier by guiding you through designing optimal solutions using Power BI.

Apr 2025 14h 16m

Microsoft Identity and Access Administrator SC-300 Exam Guide

Microsoft Identity and Access Administrator SC-300 Exam Guide

This comprehensive guide covers key topics such as Microsoft Entra ID implementation, authentication and access management, external user management, and hybrid identity solutions, providing practical insights and techniques for SC-300 exam success.

Mar 2025 19h 48m

LLM Design Patterns

LLM Design Patterns

This book helps you gain practical skills to develop and deploy LLMs. You'll learn data prep, training, pruning, quantization, and evaluation, as well as explore RAG, advanced prompting, and optimization to build robust, scalable language models.

May 2025 17h 48m

Tableau Cookbook for Experienced Professionals

Tableau Cookbook for Experienced Professionals

Advance your Tableau knowledge beyond the basics, streamline dashboard performance, tackle advanced geospatial challenges, and unlock API potential while fortifying your corporate data infrastructure with proven best practices.

Apr 2025 12h 24m

Time Series Analysis with Spark

Time Series Analysis with Spark

This book offers a complete guide to time series analysis with Apache Spark and Databricks, covering essential concepts and advanced techniques including Generative AI to equip readers with skills for real-world challenges across industries.

Mar 2025 9h 56m

Hands-On Artificial Intelligence for IoT

Hands-On Artificial Intelligence for IoT

Transform IoT systems with the power of artificial intelligence using this hands-on guide. Dive into practical techniques and expert insights to innovate and optimize your IoT devices, making them smarter and more efficient.

May 2025 15h 44m