Real-Time Big Data Analytics

You're reading from Real-Time Big Data Analytics: Design, process, and analyze large sets of complex data in real time

Product type: Paperback
Published: Feb 2016
Publisher: Packt Publishing
ISBN-13: 9781784391409
Length: 326 pages
Edition: 1st Edition
Author: Shilpi Saxena
Table of Contents (17 chapters)

Real-Time Big Data Analytics
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Preface
1. Introducing the Big Data Technology Landscape and Analytics Platform
2. Getting Acquainted with Storm
3. Processing Data with Storm
4. Introduction to Trident and Optimizing Storm Performance
5. Getting Acquainted with Kinesis
6. Getting Acquainted with Spark
7. Programming with RDDs
8. SQL Query Engine for Spark – Spark SQL
9. Analysis of Streaming Data Using Spark Streaming
10. Introducing Lambda Architecture
Index

Index

A

  • Abstract Syntax Tree / The Catalyst optimizer
  • advanced data sources
    • reference link / The components of Spark Streaming
  • Amazon Kinesis
    • about / Benefits and use cases of Amazon Kinesis
    • managed service / Benefits and use cases of Amazon Kinesis
    • disruptive innovation / Benefits and use cases of Amazon Kinesis
    • benefits / Benefits and use cases of Amazon Kinesis
    • telecommunication / Benefits and use cases of Amazon Kinesis
    • healthcare / Benefits and use cases of Amazon Kinesis
    • automotive / Benefits and use cases of Amazon Kinesis
  • Amazon S3
    • reference link / Executing Spark Streaming applications on Apache Mesos
  • Analytical Engine / Solution implementation
  • anchoring / The concept of anchoring and reliability
  • annotations, org.apache.spark.annotation
    • DeveloperAPI / Spark packaging structure and core APIs
    • Experimental / Spark packaging structure and core APIs
    • AlphaComponent / Spark packaging structure and core APIs
  • Apache Cassandra 2.1.7
    • reference link / Configuring Apache Cassandra and Spark
  • Apache Flume
    • URL / The technology matrix for Lambda Architecture
  • Apache Hadoop
    • URL / The emergence of Spark SQL
  • Apache Kafka
    • URL / The technology matrix for Lambda Architecture
  • Apache Mesos
    • about / The Spark execution model – master-worker view, Executing Spark Streaming applications on Apache Mesos
    • URL / The Spark execution model – master-worker view
    • reference link / Executing Spark Streaming applications on Apache Mesos
    • Spark Streaming applications, executing on / Executing Spark Streaming applications on Apache Mesos
  • Apache Sqoop
    • URL / The technology matrix for Lambda Architecture
  • ApplicationMaster (AM) / Executing Spark Streaming applications on Yarn
  • application master (AM) / The Spark execution model – master-worker view
  • architectural overview, Kinesis
    • about / Architectural overview of Kinesis
    • Amazon Kinesis, benefits / Benefits and use cases of Amazon Kinesis
    • high-level architecture / High-level architecture
    • components / Components of Kinesis
  • auto learning synchronization mechanism / Solution implementation
  • Avro
    • reference / Schema evolution/merging
  • AWS SDK / Components of Kinesis
  • Azure Table Storage (ATS) / Distributed databases (NoSQL)

B

  • batch data processing
    • about / Batch data processing
    • use cases / Batch data processing
    • challenges / Batch data processing
  • batch duration
    • about / High-level architecture
  • batching
    • about / Batching
    • count-based batching / Batching
    • time-based batching / Batching
  • batch mode / The emergence of Spark SQL
  • batch processing
    • in distributed mode, about / Batch processing in distributed mode
    • in distributed mode, code, pushing to data / Push code to data
  • Big Data
    • about / Big Data – a phenomenon
    • dimensional paradigm / The Big Data dimensional paradigm
    • infrastructure / The Big Data infrastructure
  • Big Data analytics architecture
    • about / The Big Data analytics architecture
    • business solution, building / Building business solutions
    • data processing / Dataset processing
    • solution implementation / Solution implementation
    • presentation / Presentation
  • Big Data ecosystem
    • about / The Big Data ecosystem
    • components / Components of the Big Data ecosystem
  • Big Data problem statements, Lambda Architecture
    • Volume / Layers/components of Lambda Architecture
    • reference link / Layers/components of Lambda Architecture
    • Velocity / Layers/components of Lambda Architecture
    • Variety / Layers/components of Lambda Architecture
  • bolts
    • about / Bolts
    • declareStream() / Bolts
    • emit() / Bolts
    • InputDeclarer / Bolts
    • execute() / Bolts
    • IRichBolt / Bolts
    • IBasicBolt / Bolts
    • tasks / Tasks
    • workers / Workers
  • Business Intelligence (BI) / The Big Data ecosystem

C

  • Call Data Record (CDR) / The telecoms or cellular arena
  • CAS (content-addressed storage) / Producers
  • cascading / Components of the Big Data ecosystem
  • Cassandra Core driver
    • reference link / Configuring Apache Cassandra and Spark
  • Cassandra Query Language (CQL) / Configuring Apache Cassandra and Spark
  • Catalyst optimizer
    • about / The Catalyst optimizer
    • phases / The Catalyst optimizer
  • challenges, batch data processing
    • large data / Batch data processing
    • distributed processing / Batch data processing
    • SLAs / Batch data processing
    • fault tolerant / Batch data processing
  • challenges, in selecting technology for data consumption layer
    • highly available / The technology matrix for Lambda Architecture
    • fault tolerance / The technology matrix for Lambda Architecture
    • reliability / The technology matrix for Lambda Architecture
    • performance efficient / The technology matrix for Lambda Architecture
    • extendable and flexible / The technology matrix for Lambda Architecture
  • challenges, real-time data processing
    • strict SLAs / Real-time data processing
    • recovering from failures / Real-time data processing
    • scalable / Real-time data processing
    • all in-memory / Real-time data processing
    • asynchronous / Real-time data processing
  • cluster manager
    • about / The Spark execution model – master-worker view
  • cluster managers, for Spark streaming
    • about / Cluster managers for Spark Streaming
  • Coda Hale metrics library
    • reference link / Monitoring Spark Streaming applications
  • Complex Event Processing (CEP) / Real-time processing
  • components, Big Data ecosystem
    • about / Components of the Big Data ecosystem
  • components, Kinesis
    • about / Components of Kinesis
    • data sources / Components of Kinesis
    • producers / Components of Kinesis
    • consumers / Components of Kinesis
    • AWS SDK / Components of Kinesis
    • KPL / Components of Kinesis
    • KCL / Components of Kinesis
    • Kinesis streams / Components of Kinesis
    • shards / Components of Kinesis
    • partition keys / Components of Kinesis
    • sequence numbers / Components of Kinesis
  • components, Spark SQL
    • DataFrame API / The DataFrame API
    • Catalyst optimizer / The Catalyst optimizer
    • SQL/Hive contexts / SQL and Hive contexts
  • components, Spark Streaming
    • about / The components of Spark Streaming
    • input data streams / The components of Spark Streaming
    • Spark streaming job / The components of Spark Streaming
    • Spark core engine / The components of Spark Streaming
    • output data streams / The components of Spark Streaming
  • components/layers, Lambda Architecture
    • data sources / Layers/components of Lambda Architecture
    • data consumption layer / Layers/components of Lambda Architecture
    • batch layer / Layers/components of Lambda Architecture
    • real-time layers / Layers/components of Lambda Architecture
    • serving layers / Layers/components of Lambda Architecture
  • ConnectionProvider interface
    • about / Storm's JDBC persistence framework
  • consumer group
    • about / Getting to know more about Kafka
  • cost-based optimization / The Catalyst optimizer
  • CQLSH
    • about / Configuring Apache Cassandra and Spark
  • custom connectors
    • reference link / The components of Spark Streaming

D

  • Dashboard/Workbench / Solution implementation
  • Data as a Service (DaaS) / The Big Data ecosystem
  • DataFrame API
    • about / The DataFrame API
    • DataFrames and RDD / DataFrames and RDD
    • user-defined functions / User-defined functions
    • DataFrames and SQL / DataFrames and SQL
  • DataFrames
    • about / Spark extensions/libraries
  • Data Lineage
    • about / Understanding Spark transformations and actions
  • data mining
    • about / When to use Spark – practical use cases
    • reference link / When to use Spark – practical use cases
  • data processing
    • reliability / Reliability of data processing
    • anchoring / The concept of anchoring and reliability
    • Storm acking framework / The Storm acking framework
  • dependencies
    • about / Understanding Spark transformations and actions
  • deployment
    • about / Deployment and monitoring
  • dimensional paradigm, Big Data
    • about / The Big Data dimensional paradigm
    • volume / The Big Data dimensional paradigm
    • velocity / The Big Data dimensional paradigm
    • variety / The Big Data dimensional paradigm
    • veracity / The Big Data dimensional paradigm
    • value / The Big Data dimensional paradigm
  • directed acyclic graph (DAG)
    • about / Understanding Spark transformations and actions, Spark packaging structure and core APIs
    • reference link / Spark packaging structure and core APIs
    / Partitioning and parallelism
  • distributed batch processing
    • about / Distributed batch processing
  • distributed computing
    • reference link / The technology matrix for Lambda Architecture
  • distributed databases (NoSQL)
    • about / Distributed databases (NoSQL)
  • DoubleRDDFunctions.scala
    • about / RDD APIs
    • reference link / RDD APIs
  • DStreams
    • about / High-level architecture, The components of Spark Streaming
    • URL / The components of Spark Streaming
  • duplication / Distributed databases (NoSQL)
  • DynamoDB
    • reference / Components of Kinesis

E

  • Eclipse
    • installing / Eclipse
  • Eclipse Luna (4.4)
    • download link / Eclipse
  • electronic publishing
    • reference link / Real-time data processing
  • electronic trading platform
    • reference link / Real-time data processing
  • ETL (Extract Transform Load) / Dataset processing
  • extensibility
    • reference link / The need for Lambda Architecture
  • extensions/libraries, Spark
    • Spark Streaming / Spark extensions/libraries
    • MLlib / Spark extensions/libraries
    • GraphX / Spark extensions/libraries
    • Spark SQL / Spark extensions/libraries
    • SparkR / Spark extensions/libraries

F

  • fastutil library
    • URL / Memory tuning
  • fault tolerance
    • reference link / The need for Lambda Architecture
  • fault tolerant
    • reference link / Batch data processing
  • features, Lambda Architecture
    • scalable / The need for Lambda Architecture
    • resilient to failures / The need for Lambda Architecture
    • low latency / The need for Lambda Architecture
    • extensible / The need for Lambda Architecture
    • maintenance / The need for Lambda Architecture
  • features, resilient distributed datasets (RDD)
    • fault tolerance / Fault tolerance
    • storage / Storage
    • persistence / Persistence
    • shuffling / Shuffling
  • features, Spark
    • data storage / Apache Spark – a one-stop solution
    • use cases / Apache Spark – a one-stop solution
    • fault-tolerance / Apache Spark – a one-stop solution
    • programming languages / Apache Spark – a one-stop solution
    • hardware / Apache Spark – a one-stop solution
    • management / Apache Spark – a one-stop solution
    • deployment / Apache Spark – a one-stop solution
    • efficiency / Apache Spark – a one-stop solution
    • distributed caching / Apache Spark – a one-stop solution
    • ease of use / Apache Spark – a one-stop solution
    • high-level operations / Apache Spark – a one-stop solution
    • API and extension / Apache Spark – a one-stop solution
    • security / Apache Spark – a one-stop solution
  • fence instruction / Memory and cache
  • filtering step / Dataset processing
  • Flume
    • URL / The components of Spark Streaming
  • functionalities, RDD API
    • partitions / Understanding Spark transformations and actions
    • splits / Understanding Spark transformations and actions
    • dependencies / Understanding Spark transformations and actions
    • partitioner / Understanding Spark transformations and actions
    • location of splits / Understanding Spark transformations and actions
  • functions, resilient distributed datasets (RDD)
    • saveAsTextFile(path) / Storage
    • saveAsSequenceFile(path) / Storage
    • saveAsObjectFile(path) / Storage

G

  • GraphX
    • about / When to use Spark – practical use cases, Spark extensions/libraries
    • reference link / When to use Spark – practical use cases, Spark extensions/libraries

H

  • Hadoop / The Big Data infrastructure
    • reference link / Apache Spark – a one-stop solution
  • Hadoop 2.0
    • URL / The Spark execution model – master-worker view
  • Hadoop 2.4.0 distribution
    • URL, for downloading / Programming Spark transformations and actions
  • Hadoop ecosystem
    • key technologies / The Big Data infrastructure
  • HadoopRDD
    • about / RDD APIs
    • reference link / RDD APIs
  • HDFS
    • about / Components of the Big Data ecosystem
  • high-level architecture, Kinesis
    • about / High-level architecture
  • high-level architecture, of SQL Streaming Crime Analyzer
    • crime producer / The high-level architecture of our job
    • stream consumer / The high-level architecture of our job
    • Stream to DataFrame transformer / The high-level architecture of our job
  • high-level architecture, Spark
    • about / High-level architecture
    • physical machines / High-level architecture
    • data storage layer / High-level architecture
    • resource manager / High-level architecture
    • Spark core libraries / High-level architecture
    • Spark extensions/libraries / High-level architecture
  • high level architecture, Lambda
    • data source / high-level architecture
    • custom producer / high-level architecture
    • real-time layer / high-level architecture
    • batch layers / high-level architecture
    • serving layers / high-level architecture
  • high level architecture, of Spark Streaming / High-level architecture
  • Hive / Components of the Big Data ecosystem
    • URL / The emergence of Spark SQL
  • HiveQL
    • reference / Working with Hive tables
  • Hive tables
    • working with / Working with Hive tables

I

  • Infrastructure as a Service (IaaS) / The Big Data ecosystem
  • input data streams
    • about / The components of Spark Streaming
    • basic data sources / The components of Spark Streaming
    • advance data sources / The components of Spark Streaming
  • input sources, Storm
    • about / Storm input sources, Other sources for input to Storm
    • Kafka / Meet Kafka, Kafka as an input source
    • file / A file as an input source
    • socket / A socket as an input source
  • installing
    • Spark / Spark
    • Java / Java
    • Scala / Scala
    • Eclipse / Eclipse
  • integration / Dataset processing
  • inter-worker communication
    • about / Storm internal message processing
    • workers, executing on same node / Storm internal message processing
    • workers, executing across nodes / Storm internal message processing
  • Internet of Things (IoT)
    • about / Real-time data processing
    • reference link / Real-time data processing
  • intra-worker communication
    • about / Storm internal message processing

J

  • Java
    • installing / Java
    • Spark job, coding in / Coding a Spark job in Java
    • Spark Streaming job, writing in / Writing our Spark Streaming job in Java
  • JdbcMapper interface
    • about / Storm's JDBC persistence framework
  • JdbcRDD
    • about / RDD APIs
    • reference link / RDD APIs
  • Joins
    • about / Joins

K

  • Kafka
    • about / Meet Kafka, Getting to know more about Kafka
    • cluster / Meet Kafka
    • components / Meet Kafka
    • reference / Meet Kafka
    • Time to live (TTL) / Getting to know more about Kafka
    • topics / Getting to know more about Kafka
    • consumers / Getting to know more about Kafka
    • offset / Getting to know more about Kafka
    • URL / The components of Spark Streaming
  • Key Performance Indicators (KPIs)
    • about / Batch data processing
  • key technologies, Hadoop ecosystem
    • about / The Big Data infrastructure
    • Hadoop / The Big Data infrastructure
    • NoSQL / The Big Data infrastructure
    • MPP / The Big Data infrastructure
  • Kinesis
    • architectural overview / Architectural overview of Kinesis
    • URL / The components of Spark Streaming
  • Kinesis Client Library (KCL)
    • about / Components of Kinesis
  • Kinesis Producer Library (KPL)
    • about / Components of Kinesis
    • retry mechanism / Components of Kinesis
    • batching of records / Components of Kinesis
    • aggregation / Components of Kinesis
    • deaggregation / Components of Kinesis
    • monitoring / Components of Kinesis
  • Kinesis streaming service
    • creating / Creating a Kinesis streaming service
    • AWS Kinesis, accessing / Access to AWS Kinesis
    • development environment, configuring / Configuring the development environment
    • Kinesis streams, creating / Creating Kinesis streams
    • Kinesis stream producers, creating / Creating Kinesis stream producers
    • Kinesis stream consumers, creating / Creating Kinesis stream consumers
    • crime alerts, generating / Generating and consuming crime alerts
    • crime alerts, consuming / Generating and consuming crime alerts
  • Kinesis stream producers
    • sample dataset / Creating Kinesis stream producers
    • use case / Creating Kinesis stream producers
  • Kryo
    • URL / Handling persistence in Spark
  • Kryo documentation
    • reference / Serialization
  • Kryo serialization
    • reference / Serialization
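
As context for the entries above, Kryo is enabled through Spark's serializer configuration property; the following is a minimal illustrative sketch (the application name is a placeholder, not from the book's text):

    import org.apache.spark.SparkConf

    // Illustrative sketch only: switch Spark's default Java serialization to Kryo.
    val conf = new SparkConf()
      .setAppName("KryoSketch") // placeholder name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
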

L

  • Lambda Architecture
    • about / What is Lambda Architecture
    • need for / The need for Lambda Architecture
    • features / The need for Lambda Architecture
    • components/layers / Layers/components of Lambda Architecture
    • Big Data problem statements / Layers/components of Lambda Architecture
    • technology matrix / The technology matrix for Lambda Architecture
    • realization / Realization of Lambda Architecture
  • least-recently-used (LRU) / Handling persistence in Spark
  • LMAX
    • about / Understanding LMAX
    • memory / Memory and cache
    • cache / Memory and cache
    • ring buffer / Ring buffer – the heart of the disruptor
  • LMAX Disruptor / Storm internal message processing
  • log analysis
    • reference link / Batch data processing
  • Logstash
    • URL / The technology matrix for Lambda Architecture

M

  • MapReduce / Components of the Big Data ecosystem
    • URL / The emergence of Spark SQL
  • Massively Parallel Processing (MPP) / The Big Data infrastructure
  • membar / Memory and cache
  • memory barrier / Memory and cache
  • memory fence / Memory and cache
  • memory tuning
    • about / Memory tuning
    • garbage collection / Memory tuning
    • object sizes / Memory tuning
    • executor memory / Memory tuning
  • Mesos
    • URL / High-level architecture
  • Message Passing Interface (MPI) / Batch processing in distributed mode
  • microbatches / High-level architecture
  • MLlib
    • reference link / When to use Spark – practical use cases, Spark extensions/libraries
    • about / When to use Spark – practical use cases, Spark extensions/libraries
  • modes, YARN
    • YARN client mode / The Spark execution model – master-worker view
    • YARN cluster mode / The Spark execution model – master-worker view
  • monitoring
    • about / Deployment and monitoring
  • MultiLangDaemon interface / Components of Kinesis

N

  • near real-time (NRT) systems
    • about / Real-time data processing
  • Netty
    • about / Netty
  • NewHadoopRDD
    • reference link / RDD APIs
  • Nimbus
    • about / A Storm cluster
    / Optimizing Storm performance
  • node manager (NM) / The Spark execution model – master-worker view
  • NodeManager (NM) / Executing Spark Streaming applications on Yarn
  • NoSQL / The Big Data infrastructure
  • NoSQL databases
    • advantages / Advantages of NoSQL databases
    • choice / Choosing a NoSQL database
  • NoSQL databases, distinguishing
    • key-value store / Distributed databases (NoSQL)
    • column store / Distributed databases (NoSQL)
    • wide column store / Distributed databases (NoSQL)
    • document database / Distributed databases (NoSQL)
    • graph database / Distributed databases (NoSQL)

O

  • operations, RDD API
    • reference link / RDD APIs
  • Oracle Java 7
    • download link / Java
  • OrderedRDDFunctions
    • about / RDD APIs
    • reference link / RDD APIs
  • org.apache.spark.streaming.dstream.DStream.scala / Spark Streaming APIs
  • org.apache.spark.streaming.flume.*
    • reference link / Spark Streaming APIs
  • org.apache.spark.streaming.kafka.*
    • reference link / Spark Streaming APIs
  • org.apache.spark.streaming.kinesis.*
    • reference link / Spark Streaming APIs
  • org.apache.spark.streaming.StreamingContext / Spark Streaming APIs
  • org.apache.spark.streaming.twitter.*
    • reference link / Spark Streaming APIs
  • org.apache.spark.streaming.zeromq.*
    • reference link / Spark Streaming APIs
  • Illinois Uniform Crime Reporting (IUCR) / Programming Spark transformations and actions
  • output data streams
    • about / The components of Spark Streaming
  • output operations, DStreams
    • print() / Spark Streaming operations
    • saveAsTextFiles(prefix, suffix) / Spark Streaming operations
    • saveAsObjectFiles(prefix, suffix) / Spark Streaming operations
    • saveAsHadoopFiles(prefix, suffix) / Spark Streaming operations
    • foreachRDD(func) / Spark Streaming operations
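
For orientation, here is a minimal illustrative Scala sketch of a Spark Streaming job that applies a few of the output operations listed above; the socket host/port and output prefix are placeholder assumptions, not values from the book:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative sketch only; host, port, and paths are placeholders.
    val conf = new SparkConf().setAppName("OutputOpsSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)

    lines.print()                                 // print()
    lines.saveAsTextFiles("/tmp/lines", "txt")    // saveAsTextFiles(prefix, suffix)
    lines.foreachRDD(rdd => println("batch size = " + rdd.count()))  // foreachRDD(func)

    ssc.start()
    ssc.awaitTermination()
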

P

  • packaging structure, Spark Streaming
    • about / The packaging structure of Spark Streaming
    • Spark Streaming APIs / Spark Streaming APIs
    • Spark Streaming operations / Spark Streaming operations
  • PairRDDFunctions
    • about / RDD APIs
    • reference link / RDD APIs
  • Parquet
    • about / Working with Parquet
    • working with / Working with Parquet
    • URL / Working with Parquet
    • data, persisting in HDFS / Persisting Parquet data in HDFS
  • partitioner
    • about / Understanding Spark transformations and actions
  • partitioning
    • about / Partitioning and schema evolution or merging , Partitioning
    • reference / Partitioning
  • partition keys
    • about / Components of Kinesis
  • partitions
    • about / Understanding Spark transformations and actions
    / Partitioning and parallelism
  • performance tuning
    • about / Performance tuning and best practices
    • partitioning / Partitioning and parallelism
    • parallelism / Partitioning and parallelism
    • serialization / Serialization
    • caching / Caching
    • memory tuning / Memory tuning
  • persistence
    • handling, in Spark / Handling persistence in Spark
  • phases, Spark SQL
    • analysis / The Catalyst optimizer
    • logical optimization / The Catalyst optimizer
    • physical planning / The Catalyst optimizer
    • code generation / The Catalyst optimizer
  • Pig / Components of the Big Data ecosystem
    • URL / The emergence of Spark SQL
  • practical use cases, Spark
    • batch processing / When to use Spark – practical use cases
    • streaming / When to use Spark – practical use cases
    • data mining / When to use Spark – practical use cases
    • MLlib / When to use Spark – practical use cases
    • graph computing / When to use Spark – practical use cases
    • GraphX / When to use Spark – practical use cases
    • interactive analysis / When to use Spark – practical use cases
  • ProtocolBuffer
    • reference / Schema evolution/merging
  • pub-sub
    • about / Getting to know more about Kafka

Q

  • quasiquotes
    • reference / The Catalyst optimizer
  • queue
    • about / Getting to know more about Kafka

R

  • RabbitMQ
    • URL / The technology matrix for Lambda Architecture
  • RandomSentenceSpout
    • about / How and when to use Storm
  • RDD
    • converting, to DataFrames / Converting RDDs to DataFrames
    • automated process / Converting RDDs to DataFrames, Automated process
    • manual process / Converting RDDs to DataFrames, The manual process
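
As a quick illustration of the automated (reflection-based) conversion referenced above, a hedged Scala sketch for Spark 1.x; the case class, data, and table name are placeholder assumptions:

    import org.apache.spark.sql.SQLContext

    // Illustrative sketch only; `sc` is assumed to be an existing SparkContext.
    case class Person(name: String, age: Int)  // hypothetical schema

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // RDD of case-class instances converted to a DataFrame via toDF()
    val peopleDF = sc.parallelize(Seq(Person("Ann", 31), Person("Bob", 25))).toDF()
    peopleDF.registerTempTable("people")  // placeholder table name
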
  • RDD.scala
    • about / RDD APIs
  • RDD action operations
    • about / RDD action operations
    • reduce(func) / RDD action operations
    • collect() / RDD action operations
    • count() / RDD action operations
    • countApproxDistinct(relativeSD: Double = 0.05) / RDD action operations
    • countByKey() / RDD action operations
    • first() / RDD action operations
    • take(n) / RDD action operations
    • takeSample(withReplacement, num, [seed]) / RDD action operations
    • takeOrdered(num: Int) / RDD action operations
    • saveAsTextFile(path: String) / RDD action operations
    • saveAsSequenceFile(path: String) / RDD action operations
    • saveAsObjectFile(path: String) / RDD action operations
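
A minimal illustrative Scala sketch of a few of the action operations listed above, assuming an existing SparkContext named sc; the output path is a placeholder:

    // Illustrative sketch only; `sc` is an existing SparkContext.
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

    val sum      = nums.reduce(_ + _)   // reduce(func)  => 15
    val all      = nums.collect()       // collect()     => Array(1, 2, 3, 4, 5)
    val n        = nums.count()         // count()       => 5
    val firstTwo = nums.take(2)         // take(n)       => Array(1, 2)

    nums.saveAsTextFile("/tmp/nums")    // saveAsTextFile(path: String), placeholder path
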
  • RDD API
    • functionalities / Understanding Spark transformations and actions
  • RDD APIs
    • about / RDD APIs
    • RDD.scala / RDD APIs
    • DoubleRDDFunctions.scala / RDD APIs
    • HadoopRDD / RDD APIs
    • JdbcRDD / RDD APIs
    • PairRDDFunctions / RDD APIs
    • OrderedRDDFunctions / RDD APIs
    • SequenceFileRDDFunctions / RDD APIs
  • RDD transformation operations
    • about / RDD transformation operations
    • filter(filterFunc) / RDD transformation operations
    • map(mapFunc) / RDD transformation operations
    • flatMap(flatMapFunc) / RDD transformation operations
    • mapPartitions(mapPartFunc, preservePartitioning) / RDD transformation operations
    • distinct() / RDD transformation operations
    • union(otherDataset) / RDD transformation operations
    • intersection(otherDataset) / RDD transformation operations
    • groupByKey([numTasks]) / RDD transformation operations
    • reduceByKey(func, [numTasks]) / RDD transformation operations
    • coalesce(numPartitions) / RDD transformation operations
    • sortBy (f,[ascending], [numTasks]) / RDD transformation operations
    • sortByKey([ascending], [numTasks]) / RDD transformation operations
    • repartition(numPartitions) / RDD transformation operations
    • join(otherDataset, [numTasks]) / RDD transformation operations
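
To show how the transformation operations above chain together, an illustrative word-count sketch; `sc` and the input path are assumptions, not taken from the book:

    // Illustrative sketch only; `sc` is an existing SparkContext, path is a placeholder.
    val lines = sc.textFile("/tmp/input.txt")

    val wordCounts = lines
      .flatMap(line => line.split("\\s+"))   // flatMap(flatMapFunc)
      .filter(word => word.nonEmpty)         // filter(filterFunc)
      .map(word => (word, 1))                // map(mapFunc)
      .reduceByKey(_ + _)                    // reduceByKey(func, [numTasks])

    wordCounts.sortByKey().collect().foreach(println)  // sortByKey([ascending], [numTasks])
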
  • real-time (RT) systems / Real-time data processing
  • real-time data processing
    • about / Real-time data processing
    • use cases / Real-time data processing
    • challenges / Real-time data processing
  • real-time processing
    • about / Real-time processing
    • telecom or cellular arena / The telecoms or cellular arena
    • transportation and logistics / Transportation and logistics
    • connected vehicle / The connected vehicle
    • financial sector / The financial sector
  • realization, of Lambda Architecture
    • about / Realization of Lambda Architecture
    • high level architecture / high-level architecture
    • Apache Cassandra, configuring / Configuring Apache Cassandra and Spark
    • Spark, configuring / Configuring Apache Cassandra and Spark
    • custom producer, coding / Coding the custom producer
    • real-time layers, coding / Coding the real-time layer
    • batch layers, coding / Coding the batch layer
    • serving layers, coding / Coding the serving layer
    • layers, executing / Executing all the layers
  • Redshift
    • reference / Components of Kinesis
  • reduce functionality
    • reference link / RDD action operations
  • Relational Database Management Systems (RDBMS) / The emergence of Spark SQL
  • relaxed SLAs
    • about / Batch data processing
  • replication / Distributed databases (NoSQL)
  • resilient distributed datasets (RDD)
    • about / The architecture of Spark, The Spark execution model – master-worker view, Resilient distributed datasets (RDD)
    • features / RDD – by definition
    • functions / Storage
  • Resilient Distributed Datasets (RDD)
    • about / Understanding Spark transformations and actions
  • Resilient Distributed Datasets (RDDs)
    • reference link / Shuffling
    • about / High-level architecture
  • resource manager
    • about / The Spark execution model – master-worker view
  • resource manager (RM) / The Spark execution model – master-worker view
  • ResourceManager (RM) / Executing Spark Streaming applications on Yarn
  • resource managers, Spark
    • Apache Mesos / The Spark execution model – master-worker view
    • Hadoop YARN / The Spark execution model – master-worker view
    • standalone mode / The Spark execution model – master-worker view
    • local mode / The Spark execution model – master-worker view
  • ring buffer
    • about / Ring buffer – the heart of the disruptor
    • producers / Producers
    • consumers / Consumers
  • rule-based optimizations / The Catalyst optimizer

S

  • S3
    • reference / Components of Kinesis
  • Scala
    • reference link / Spark packaging structure and core APIs
    • installing / Scala
    • Spark job, coding in / Coding a Spark job in Scala
    • Spark Streaming job, writing in / Writing our Spark Streaming job in Scala
  • Scala 2.10.5 compressed tarball
    • download link / Scala
  • Scala APIs, by Spark Core
    • org.apache.spark / Spark packaging structure and core APIs
    • org.apache.spark.SparkContext / Spark packaging structure and core APIs
    • org.apache.spark.rdd.RDD.scala / Spark packaging structure and core APIs
    • org.apache.spark.annotation / Spark packaging structure and core APIs
    • org.apache.spark.broadcast / Spark packaging structure and core APIs
    • HttpBroadcast / Spark packaging structure and core APIs
    • TorrentBroadcast / Spark packaging structure and core APIs
    • org.apache.spark.io / Spark packaging structure and core APIs
    • org.apache.spark.scheduler / Spark packaging structure and core APIs
    • org.apache.spark.storage / Spark packaging structure and core APIs
    • org.apache.spark.util / Spark packaging structure and core APIs
  • scalability
    • reference link / Batch data processing, The need for Lambda Architecture
  • schema evolution
    • about / Schema evolution/merging
  • schema merging
    • about / Schema evolution/merging
  • SequenceFileRDDFunctions
    • about / RDD APIs
    • reference link / RDD APIs
  • serialization process
    • URL / Handling persistence in Spark
  • shards
    • about / Components of Kinesis
    • for reads / Components of Kinesis
    • for writes / Components of Kinesis
  • single point of failure (SPOF) / The need for Lambda Architecture
  • SLAs
    • about / Batch data processing
  • smart traversing
    • about / Ring buffer – the heart of the disruptor
  • software development kit (SDK) / Components of Kinesis
  • Spark
    • overview / An overview of Spark
    • about / Apache Spark – a one-stop solution
    • features / Apache Spark – a one-stop solution
    • practical use cases / When to use Spark – practical use cases
    • packaging structure / Spark packaging structure and core APIs
    • core APIs / Spark packaging structure and core APIs
    • hardware requisites / Hardware requirements
    • installing / Spark
    • persistence handling / Handling persistence in Spark
    • storage levels / Handling persistence in Spark
  • Spark-Cassandra connector
    • reference link / Configuring Apache Cassandra and Spark
  • Spark-Cassandra Java library
    • reference link / Configuring Apache Cassandra and Spark
  • Spark 1.4.0
    • download link / Configuring Apache Cassandra and Spark
  • Spark actions
    • about / Understanding Spark transformations and actions
    • programming / Programming Spark transformations and actions
  • Spark architecture
    • about / The architecture of Spark
    • high-level architecture / High-level architecture
  • Spark cluster
    • configuring / Configuring the Spark cluster
  • Spark compressed tarball
    • download link / Spark
  • Spark Core
    • about / Spark packaging structure and core APIs
  • Spark core engine
    • about / The components of Spark Streaming
  • Spark driver
    • about / The Spark execution model – master-worker view
  • Spark execution model
    • about / Spark packaging structure and core APIs
  • Spark extensions
    • about / Spark packaging structure and core APIs
  • Spark framework
    • error / Working with Parquet
    • overwrite / Working with Parquet
    • append / Working with Parquet
    • ignore / Working with Parquet
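
The four save modes listed above correspond to Spark SQL's SaveMode when writing a DataFrame out as Parquet; a minimal illustrative sketch in which `df` and the output path are placeholder assumptions:

    import org.apache.spark.sql.SaveMode

    // Illustrative sketch only; `df` is an existing DataFrame, path is a placeholder.
    df.write
      .mode(SaveMode.Overwrite)        // alternatives: ErrorIfExists, Append, Ignore
      .parquet("/tmp/output.parquet")
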
  • Spark job
    • coding, in Scala / Coding a Spark job in Scala
    • coding, in Java / Coding a Spark job in Java
  • Spark master
    • about / The Spark execution model – master-worker view
  • Spark packages
    • reference link / Spark extensions/libraries
  • SparkR
    • about / Spark extensions/libraries
    • reference link / Spark extensions/libraries
  • Spark SQL
    • reference link / Spark extensions/libraries
    • phases / The Catalyst optimizer
  • SPARK SQL
    • architecture / The architecture of Spark SQL
    • emergence / The emergence of Spark SQL
    • about / The emergence of Spark SQL
    • features / The emergence of Spark SQL
    • components / The components of Spark SQL
    • DataFrame API / The components of Spark SQL
    • catalyst optimizer / The components of Spark SQL
  • Spark SQL job
    • coding / Coding our first Spark SQL job
    • reference / Coding our first Spark SQL job
    • coding, in Scala / Coding a Spark SQL job in Scala
    • coding, in Java / Coding a Spark SQL job in Java
  • Spark Steaming job
    • coding / Coding our first Spark Streaming job
  • Spark Streaming
    • reference link / When to use Spark – practical use cases, Spark extensions/libraries
    • about / Spark extensions/libraries
    • high level architecture / High-level architecture
    • components / The components of Spark Streaming
    • packaging structure / The packaging structure of Spark Streaming
  • Spark Streaming APIs
    • about / Spark Streaming APIs
    • reference link / Spark Streaming APIs
  • Spark Streaming applications
    • executing, on YARN / Executing Spark Streaming applications on Yarn
    • executing, on Apache Mesos / Executing Spark Streaming applications on Apache Mesos
    • monitoring / Monitoring Spark Streaming applications
    • reference link / Monitoring Spark Streaming applications
  • Spark Streaming job
    • writing, in Scala / Writing our Spark Streaming job in Scala
    • writing, in Java / Writing our Spark Streaming job in Java
    • executing / Executing our Spark Streaming job
  • Spark streaming job
    • about / The components of Spark Streaming
    • data receiver / The components of Spark Streaming
    • batches / The components of Spark Streaming
    • DStreams / The components of Spark Streaming
    • streaming contexts / The components of Spark Streaming
  • Spark Streaming operations
    • about / Spark Streaming operations
  • Spark transformation
    • about / Understanding Spark transformations and actions
    • programming / Programming Spark transformations and actions
  • Spark UI
    • workers / Configuring the Spark cluster
    • running applications / Configuring the Spark cluster
    • completed application / Configuring the Spark cluster
  • Spark worker/executors
    • about / The Spark execution model – master-worker view
  • speed layers
    • about / Layers/components of Lambda Architecture
  • splits
    • about / Understanding Spark transformations and actions
  • spout collector / The concept of anchoring and reliability
  • SQL Streaming Crime Analyzer
    • high-level architecture / The high-level architecture of our job
    • crime producer, coding / Coding the crime producer
    • stream consumer, coding / Coding the stream consumer and transformer
    • stream transformer, coding / Coding the stream consumer and transformer
    • executing / Executing the SQL Streaming Crime Analyzer
  • standalone resource manager
    • about / Configuring the Spark cluster
  • StorageLevel class
    • reference link / Persistence
  • storage levels, Spark
    • StorageLevel.MEMORY_ONLY / Handling persistence in Spark
    • StorageLevel.MEMORY_ONLY_SER / Handling persistence in Spark
    • StorageLevel.MEMORY_AND_DISK / Handling persistence in Spark
    • StorageLevel.MEMORY_AND_DISK_SER / Handling persistence in Spark
    • StorageLevel.DISK_ONLY / Handling persistence in Spark
    • StorageLevel.MEMORY_ONLY_2, MEMORY_AND_DISK_2 / Handling persistence in Spark
    • StorageLevel.OFF_HEAP / Handling persistence in Spark
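
For reference, a minimal illustrative sketch of applying one of the storage levels listed above; `sc` and the input path are assumptions:

    import org.apache.spark.storage.StorageLevel

    // Illustrative sketch only; `sc` is an existing SparkContext, path is a placeholder.
    val cached = sc.textFile("/tmp/events.log").persist(StorageLevel.MEMORY_AND_DISK_SER)

    cached.count()      // the first action computes and caches the partitions
    cached.unpersist()  // release the cached partitions when no longer needed
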
  • Storm
    • about / Real-time processing
    • overview / An overview of Storm
    • journey / The journey of Storm
    • performance / The journey of Storm
    • scalability / The journey of Storm
    • fail safe / The journey of Storm
    • reliability / The journey of Storm
    • easy / The journey of Storm
    • open source / The journey of Storm
    • abstractions / Storm abstractions
    • architecture / Storm architecture and its components
    • components / Storm architecture and its components
    • local mode / Storm architecture and its components
    • distributed mode / Storm architecture and its components
    • reference / Storm architecture and its components
    • using / How and when to use Storm
    • input sources / Storm input sources
    • performance, optimizing / Optimizing Storm performance
    • reference link / Apache Spark – a one-stop solution
  • Storm abstractions
    • stream / Streams
    • topology / Topology
    • spout / Spouts
    • bolts / Bolts
  • Storm acking framework
    • about / The Storm acking framework
  • Storm cluster
    • about / A Storm cluster
    • Nimbus / A Storm cluster
    • Supervisors / A Storm cluster
    • UI / A Storm cluster
  • Storm internal message processing
    • about / Storm internal message processing
    • inter-worker communication / Storm internal message processing
    • intra-worker communication / Storm internal message processing
  • Storm internals
    • about / Storm internals
    • Storm parallelism / Storm parallelism
    • Storm internal message processing / Storm internal message processing
  • Storm internode communication
    • about / Storm internode communication
    • ZeroMQ / ZeroMQ
    • Netty / Netty
  • Storm parallelism
    • about / Storm parallelism
    • worker process / Storm parallelism
    • executors / Storm parallelism
    • tasks / Storm parallelism
  • Storm persistence
    • about / Storm persistence
    • JDBC persistence framework / Storm's JDBC persistence framework
  • Storm simple patterns
    • about / Storm simple patterns
    • Joins / Joins
    • batching / Batching
  • Storm UI
    • about / Understanding the Storm UI
    • landing page / Storm UI landing page
    • topology home page / Topology home page
  • StreamingContext
    • URL / The components of Spark Streaming
  • streaming data
    • querying / Querying streaming data in real time
  • stream producer
    • creating / Creating a stream producer
  • Supervisors
    • about / A Storm cluster
    • workers / A Storm cluster
    • executors / A Storm cluster
    • tasks / A Storm cluster
    / Optimizing Storm performance

T

  • Tachyon
    • URL / Apache Spark – a one-stop solution, Handling persistence in Spark
  • TextInputFormat
    • reference link / Understanding Spark transformations and actions
  • Thrift
    • reference / Schema evolution/merging
  • transformation / Dataset processing
  • transformation operations, on input streams
    • reference link / Spark Streaming operations
  • transformation operations, on streaming data
    • windowing operations / Spark Streaming operations
    • transform operations / Spark Streaming operations
    • updateStateByKey Operation / Spark Streaming operations
    • output operations / Spark Streaming operations
  • Trident
    • working with / Working with Trident
    • transactions / Transactions
    • topology / Trident topology
    • operations / Trident operations
  • Trident operations
    • about / Trident operations
    • merging / Merging and joining
    • joining / Merging and joining
    • filter / Filter, Function
    • aggregation / Aggregation
    • grouping / Grouping
    • state maintenance / State maintenance
  • Trident topology
    • about / Trident topology
    • Trident tuples / Trident tuples
    • Trident spout / Trident spout
  • troubleshooting tips
    • about / Troubleshooting – tips and tricks
    • port numbers, used by Spark / Port numbers used by Spark
    • classpath issues / Classpath issues – class not found exception
    • other common exceptions / Other common exceptions

U

  • use cases, for batch data processing
    • log analysis/analytics / Batch data processing
    • predictive maintenance / Batch data processing
    • faster claim processing / Batch data processing
    • pricing analytics / Batch data processing
  • use cases, real-time data processing
    • Internet of Things (IoT) / Real-time data processing
    • online trading systems / Real-time data processing
    • online publishing / Real-time data processing
    • assembly lines / Real-time data processing
    • online gaming systems / Real-time data processing

W

  • WordCountTopology
    • about / How and when to use Storm
  • Write Ahead Logs (WAL) / The technology matrix for Lambda Architecture

Y

  • YARN
    • URL / High-level architecture
    • modes / The Spark execution model – master-worker view
    • Spark Streaming applications, executing on / Executing Spark Streaming applications on Yarn
    • reference link / Executing Spark Streaming applications on Yarn
  • YARN client mode / The Spark execution model – master-worker view
  • YARN cluster mode / The Spark execution model – master-worker view
  • Yet Another Resource Negotiator (YARN) / Batch processing in distributed mode

Z

  • ZeroMQ
    • about / ZeroMQ
    • Storm ZeroMQ configurations / Storm ZeroMQ configurations
  • ZooKeeper
    • about / A Storm cluster
    / Optimizing Storm performance
  • Zookeeper cluster
    • about / A Zookeeper cluster