Big Data Analytics with Java: Data analysis, visualization & machine learning techniques

Product type Paperback

Published in Jul 2017

Publisher Packt

ISBN-13 9781787288980

Length 418 pages

Edition 1st Edition

Languages

Java

Tools

Apache Spark

Concepts

Big Data

Author (1):

RAJAT MEHTA

Table of Contents (21) Chapters

Big Data Analytics with Java

Credits

About the Author

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

1. Big Data Analytics with Java

2. First Steps in Data Analysis

3. Data Visualization

4. Basics of Machine Learning

5. Regression on Big Data

6. Naive Bayes and Sentiment Analysis

7. Decision Trees

8. Ensembling on Big Data

9. Recommendation Systems

10. Clustering and Customer Segmentation on Big Data

11. Massive Graphs on Big Data

12. Real-Time Analytics on Big Data

13. Deep Learning Using Big Data

Index

A

Activation Function / Perceptron
advanced visualization technique
- about / Advanced visualization technique
- prefuse / Prefuse
- IVTK Graph toolkit / IVTK Graph toolkit
Alternating Least Square (ALS) / Alternating least square – collaborative filtering
Apache Kafka
- about / Apache Kafka
- IoT sensors, integration / Apache Kafka
- social media real-time analytics / Apache Kafka
- healthcare analytics / Apache Kafka
- log analytics / Apache Kafka
- risk aggregation, in finance / Apache Kafka
Apache Spark
- about / Apache Spark
- concepts / Concepts
- transformations / Transformations
- actions / Actions
- Spark Java API / Spark Java API
- samples, Java 8 used / Spark samples using Java 8
- data, loading / Loading data
- data operations / Data operations – cleansing and munging
- data, analyzing / Analyzing data – count, projection, grouping, aggregation, and max/min
- common transformations, on Spark RDDs / Analyzing data – count, projection, grouping, aggregation, and max/min
- actions, on RDDs / Actions on RDDs
- paired RDDs / Paired RDDs
- data, saving / Saving data
- results, collecting / Collecting and printing results
- results, printing / Collecting and printing results
- programs, executing on Hadoop / Executing Spark programs on Hadoop
- subprojects / Apache Spark sub-projects
- machine learning modules / Spark machine learning modules
- Apache Mahout / Mahout – a popular Java ML library
- Deeplearning4j / Deeplearning4j – a deep learning library
- Apriori algorithm, implementation / Implementation of the Apriori algorithm in Apache Spark
- FP-Growth algorithm, executing / Running FP-Growth on Apache Spark
Apache Spark, machine learning modules
- MLlib Java API / MLlib Java API
- machine learning libraries / Other machine learning libraries
Apache Spark machine learning API
- about / The new Spark ML API
- machine learning algorithms / The new Spark ML API
- features handling tools / The new Spark ML API
- model selection / The new Spark ML API
- tuning tools / The new Spark ML API
- utility methods / The new Spark ML API
Apriori algorithm
- implementation, in Apache Spark / Implementation of the Apriori algorithm in Apache Spark
- using / Implementation of the Apriori algorithm in Apache Spark
- disadvantages / Implementation of the Apriori algorithm in Apache Spark
artificial neural network / Introduction to neural networks

B

bagging / Bagging
bag of words / Bag of words
bar chart
- about / Bar charts
- dataset, creating / Bar charts
base project setup / Base project setup
- default Kafka configurations, used / Base project setup
- Maven Java project, for Spark Streaming / Base project setup
Bayes theorem / Bayes theorem
big data
- Analytical products / Basics of Hadoop – a Java sub-project
- Batch products / Basics of Hadoop – a Java sub-project
- Streamlining / Basics of Hadoop – a Java sub-project
- Machine learning libraries / Basics of Hadoop – a Java sub-project
- NoSQL / Basics of Hadoop – a Java sub-project
- Search / Basics of Hadoop – a Java sub-project
bidirected graph / Refresher on graphs
big data
- data analytics on / Why data analytics on big data?
- for data analytics / Big data for analytics
- bigger pay package, for Java developers / Big data – a bigger pay package for Java developers
- Hadoop, basics / Basics of Hadoop – a Java sub-project
big data stack
- HDFS / Basics of Hadoop – a Java sub-project
- Spark / Basics of Hadoop – a Java sub-project
- Impala / Basics of Hadoop – a Java sub-project
- MapReduce / Basics of Hadoop – a Java sub-project
- Sqoop / Basics of Hadoop – a Java sub-project
- Oozie / Basics of Hadoop – a Java sub-project
- Flume / Basics of Hadoop – a Java sub-project
- Kafka / Basics of Hadoop – a Java sub-project
- Yarn / Basics of Hadoop – a Java sub-project
binary classification dataset / What are the feature types that can be extracted from the datasets?
boosting / Boosting
bootstrapping / Bagging
box plots / Box plots

C

charts
- used, in big data analytics / Using charts in big data analytics
- for initial data exploration / Using charts in big data analytics
- for data visualization and reporting / Using charts in big data analytics
clustering
- about / Clustering
- customer segmentation / Clustering
- search engines / Clustering
- data exploration / Clustering
- epidemic breakout zones, finding / Clustering
- biology / Clustering
- news categorization / Clustering
- news, summarization / Clustering
- types / Types of clustering
- hierarchical clustering / Hierarchical clustering
- K-means clustering / K-means clustering
- k-means clustering, bisecting / Bisecting k-means clustering
- for customer segmentation / Clustering for customer segmentation
clustering algorithm
- changing / Changing the clustering algorithm
code
- diving / Diving into the code:
cold start problem / Content-based recommendation systems
collaborative recommendation systems
- about / Collaborative recommendation systems
- advantages / Advantages
- disadvantages / Disadvantages
- collaborative filtering / Alternating least square – collaborative filtering
common transformations, on Spark RDDs
- Filter / Analyzing data – count, projection, grouping, aggregation, and max/min
- Map / Analyzing data – count, projection, grouping, aggregation, and max/min
- FlatMap / Analyzing data – count, projection, grouping, aggregation, and max/min
- other transformations / Analyzing data – count, projection, grouping, aggregation, and max/min
Conditional-FP tree / Efficient market basket analysis using FP-Growth algorithm
Conditional FP Tree / Efficient market basket analysis using FP-Growth algorithm
Conditional Pattern / Efficient market basket analysis using FP-Growth algorithm
Conditional Patterns Base / Efficient market basket analysis using FP-Growth algorithm
conditional probability / Conditional probability
content-based recommendation systems
- about / Content-based recommendation systems
- Euclidean Distance / Content-based recommendation systems
- Pearson Correlation / Content-based recommendation systems
- dataset / Dataset
- content-based recommender, on MovieLens dataset / Content-based recommender on MovieLens dataset
- collaborative recommendation systems / Collaborative recommendation systems
content-based recommender
- on MovieLens dataset / Content-based recommender on MovieLens dataset
context
- building / Building SparkConf and context
customer segmentation / Customer segmentation
- clustering / Clustering for customer segmentation

D

data
- cleaning / Data cleaning and munging, Cleaning and munging the data
- munging / Data cleaning and munging, Cleaning and munging the data
- unwanted data, filtering / Data cleaning and munging
- missing data, handling / Data cleaning and munging
- incomplete data, handling / Data cleaning and munging
- discarding / Data cleaning and munging
- constant value, filling / Data cleaning and munging
- average value, populating / Data cleaning and munging
- nearest neighbor approach / Data cleaning and munging
- converting, to proper format / Data cleaning and munging
- basic analysis, with Spark SQL / Basic analysis of data with Spark SQL
- parsing / Load and parse data
- loading / Load and parse data
- Spark-SQL way / Analyzing data – the Spark-SQL way
- Spark SQL, for data exploration and analytics / Spark SQL for data exploration and analytics
- Apriori algorithm / Market basket analysis – Apriori algorithm
- Full Apriori algorithm / Full Apriori algorithm
- preparing / Preparing the data
- formatting / Formatting the data
- storing / Storing the data
data analytics
- on big data / Why data analytics on big data?
- distributed computing, on Hadoop / Distributed computing on Hadoop
- HDFS concepts / HDFS concepts
- Apache Spark / Apache Spark
data exploration / Data exploration, Data exploration
- of text data / Data exploration of text data
dataframe / Dataframe and datasets
DataNode / Main components of HDFS
dataset / Dataset, Dataset
- URL, for downloading / All India seasonal and annual average temperature series dataset
- fields / All India seasonal and annual average temperature series dataset
- data / All India seasonal and annual average temperature series dataset
- reference link / Predicting house prices using linear regression
- data, munging / Data cleaning and munging
- full batch approach / Accuracy of multi-layer perceptrons
- partial batch approach / Accuracy of multi-layer perceptrons
dataset, linear regression
- data, cleaning / Data cleaning and munging
- exploring / Exploring the dataset
- number of rows / Exploring the dataset
- average price per zipcode, sorting by highest on top / Exploring the dataset
- linear regression model, executing / Running and testing the linear regression model
- linear regression model, testing / Running and testing the linear regression model
dataset, logistic regression
- data, cleaning / Data cleaning and munging
- data, munging / Data cleaning and munging
- data, missing / Data cleaning and munging
- categorical data / Data cleaning and munging
- data exploration / Data exploration
- executing / Running and testing the logistic regression model
- testing / Running and testing the logistic regression model
dataset object / Training and testing the model
datasets / Datasets, Dataframe and datasets
- airports dataset / Datasets
- routes dataset / Datasets
- airlines dataset / Datasets
datasets splitting
- features selected / Choosing the best features for splitting the datasets
- Gini Impurity / Choosing the best features for splitting the datasets
data transfer techniques
- Flume / Getting and preparing data in Hadoop
- FTP / Getting and preparing data in Hadoop
- Kafka / Getting and preparing data in Hadoop
- HBase / Getting and preparing data in Hadoop
- Hive / Getting and preparing data in Hadoop
- Impala / Getting and preparing data in Hadoop
data visualization
- with Java JFreeChart / Data visualization with Java JFreeChart
- charts, used in big data analytics / Using charts in big data analytics
decision tree
- about / What is a decision tree?
- for classification / What is a decision tree?
- for regression / What is a decision tree?
- building / Building a decision tree
- datasets splitting, features selected / Choosing the best features for splitting the datasets
- advantages / Advantages of using decision trees
- disadvantages / Disadvantages of using decision trees
- dataset / Dataset
- data exploration / Data exploration
- data, cleaning / Cleaning and munging the data
- data, munging / Cleaning and munging the data
- model, training / Training and testing the model
- model, testing / Training and testing the model
deep learning
- about / Deep learning
- advantages / Advantages and use cases of deep learning
- use cases / Advantages and use cases of deep learning
- no feature engineering required / Advantages and use cases of deep learning
- accuracy / Advantages and use cases of deep learning
- information / More information on deep learning
deeplearning4j / Deeplearning4j
- references / Deeplearning4j
Deeplearning4j
- about / Deeplearning4j – a deep learning library
- data, compressing / Compressing data
- Avro / Avro and Parquet
- Parquet / Avro and Parquet
distributed computing
- on Hadoop / Distributed computing on Hadoop

E

edges / Refresher on graphs
efficient market basket analysis
- FP-Growth algorithm, used / Efficient market basket analysis using FP-Growth algorithm
ensembling
- about / Ensembling
- voting / Ensembling
- averaging / Ensembling
- machine learning algorithm, used / Ensembling
- types / Types of ensembling
- bagging / Bagging
- boosting / Boosting
- advantages / Advantages and disadvantages of ensembling
- disadvantages / Advantages and disadvantages of ensembling
- random forest / Random forests
- Gradient boosted trees (GBTs) / Gradient boosted trees (GBTs)

F

feature selection
- filter methods / How do you select the best features to train your models?
- Pearson correlation / How do you select the best features to train your models?
- chi-square / How do you select the best features to train your models?
- wrapper method / How do you select the best features to train your models?
- forward selection / How do you select the best features to train your models?
- backward elimination / How do you select the best features to train your models?
- embedded method / How do you select the best features to train your models?
FP-Growth algorithm
- used, for efficient market basket analysis / Efficient market basket analysis using FP-Growth algorithm
- transaction dataset / Efficient market basket analysis using FP-Growth algorithm
- frequency of items, calculating / Efficient market basket analysis using FP-Growth algorithm
- priority, assigning to items / Efficient market basket analysis using FP-Growth algorithm
- array items, by priority / Efficient market basket analysis using FP-Growth algorithm
- FP-Tree, building / Efficient market basket analysis using FP-Growth algorithm
- frequent patterns, identifying from FP-Tree / Efficient market basket analysis using FP-Growth algorithm
- conditional patterns, mining / Efficient market basket analysis using FP-Growth algorithm
- conditional patterns, from leaf node Diapers / Efficient market basket analysis using FP-Growth algorithm
- executing, on Apache Spark / Running FP-Growth on Apache Spark
Frequent Item sets / Efficient market basket analysis using FP-Growth algorithm
Frequent Pattern Mining
- reference link / Running FP-Growth on Apache Spark
Full Apriori algorithm
- about / Full Apriori algorithm
- dataset / Full Apriori algorithm
- apriori implementation / Full Apriori algorithm

G

Gradient boosted trees (GBTs)
- about / Advantages and disadvantages of ensembling, Gradient boosted trees (GBTs)
- dataset, used / Classification problem and dataset used
- issues, classifying / Classification problem and dataset used
- data exploration / Data exploration
- random forest model, training / Training and testing our random forest model
- random forest model, testing / Training and testing our random forest model
- gradient boosted tree model, testing / Training and testing our gradient boosted tree model
- gradient boosted tree model, training / Training and testing our gradient boosted tree model
graph analytics
- about / Graph analytics
- path analytics / Graph analytics
- connectivity analytics / Graph analytics
- community analytics / Graph analytics
- centrality analytics / Graph analytics
- GraphFrames / GraphFrames
- GraphFrames, used for building a graph / Building a graph using GraphFrames
- on airports / Graph analytics on airports and their flights
- on flights / Graph analytics on airports and their flights
- datasets / Datasets
- on flights data / Graph analytics on flights data
graphs
- refresher / Refresher on graphs
- representing / Representing graphs
- adjacency matrix / Representing graphs
- adjacency list / Representing graphs
- common terminology / Common terminology on graphs
- common algorithms / Common algorithms on graphs
- plotting / Plotting graphs
graphs, common algorithms
- breadth first search / Common algorithms on graphs
- depth first search / Common algorithms on graphs
- Dijkstra shortest path / Common algorithms on graphs
- PageRank algorithm / Common algorithms on graphs
graphs, common terminology
- vertices / Common terminology on graphs
- edges / Common terminology on graphs
- degrees / Common terminology on graphs
- indegrees / Common terminology on graphs
- outdegrees / Common terminology on graphs
GraphStream library
- reference link / Plotting graphs

H

Hadoop
- basics / Basics of Hadoop – a Java sub-project
- features / Basics of Hadoop – a Java sub-project
- distributed computing on / Distributed computing on Hadoop
- core / Distributed computing on Hadoop
- HDFS / Distributed computing on Hadoop
Hadoop Distributed File System (HDFS) / Real-time SQL queries using Impala
- about / Distributed computing on Hadoop
- Open Source / Design and architecture of HDFS
- Immense scalability, for amount of data / Design and architecture of HDFS
- failover support / Design and architecture of HDFS
- fault tolerance / Design and architecture of HDFS
- data locality / Design and architecture of HDFS
- NameNode / Main components of HDFS
- DataNode / Main components of HDFS
handwritten digit recognition
- using CNN / Hand written digit recognizition using CNN
HBase / Real-time data processing
HDFS concepts
- about / HDFS concepts
- architecture / Design and architecture of HDFS
- design / Design and architecture of HDFS
- components / Main components of HDFS
- simple commands / HDFS simple commands
hierarchical clustering / Hierarchical clustering
histogram
- about / Histograms
- using / When would you use a histogram?
- creating, JFreeChart used / How to make histograms using JFreeChart?
human neuron
- dendrite / Introduction to neural networks
- cell body / Introduction to neural networks
- axon terminal / Introduction to neural networks
hyperplane / Scatter plots, What is simple linear regression?

I

Impala
- used, for real-time SQL queries / Real-time SQL queries using Impala
- advantages / Real-time SQL queries using Impala
- flight delay analysis / Flight delay analysis using Impala
- Apache Kafka / Apache Kafka
- Spark Streaming / Spark Streaming, Typical uses of Spark Streaming
- trending videos / Trending videos
Iris dataset
- reference link / Flower species classification using multi-Layer perceptrons
IVTK Graph toolkit
- about / IVTK Graph toolkit
- other libraries / Other libraries

J

JFreeChart API
- dataset loading, Apache Spark used / Simple single Time Series chart
- chart object, creating / Simple single Time Series chart
- dataset object, filling / Bar charts
- chart component, creating / Bar charts

K

k-means clustering
- bisecting / Bisecting k-means clustering
K-means clustering / K-means clustering

L

linear regression
- about / Linear regression
- using / Where is linear regression used?
- used, for predicting house prices / Predicting house prices using linear regression
- dataset / Dataset
line charts / Line charts
logistic regression
- about / Logistic regression
- mathematical functions, used / Which mathematical functions does logistic regression use?
- Gradient ascent or descent / Which mathematical functions does logistic regression use?
- Stochastic gradient descent / Which mathematical functions does logistic regression use?
- used for / Where is logistic regression used?
- heart disease, predicting / Where is logistic regression used?
- dataset / Dataset

M

machine learning
- about / What is machine learning?
- example / Real-life examples of machine learning
- at Netflix / Real-life examples of machine learning
- spam filter / Real-life examples of machine learning
- handwriting detection, on cheques submitted via ATMs / Real-life examples of machine learning
- type / Type of machine learning
- supervised learning / Type of machine learning
- unsupervised learning / Type of machine learning
- semi-supervised learning / Type of machine learning
- supervised learning, case study / A small sample case study of supervised and unsupervised learning
- unsupervised learning, case study / A small sample case study of supervised and unsupervised learning
- issues / Steps for machine learning problems
- model, selecting / Choosing the machine learning model
- training/test set / Choosing the machine learning model
- cross validation / Choosing the machine learning model
- features extracted from datasets / What are the feature types that can be extracted from the datasets?
- categorical features / What are the feature types that can be extracted from the datasets?
- numerical features / What are the feature types that can be extracted from the datasets?
- text features / What are the feature types that can be extracted from the datasets?
- features, selecting to train models / How do you select the best features to train your models?
- analytics, executing on big data / How do you run machine learning analytics on big data?
- data, preparing in Hadoop / Getting and preparing data in Hadoop
- data, obtaining in Hadoop / Getting and preparing data in Hadoop
- models, storing on big data / Training and storing models on big data
- models, training on big data / Training and storing models on big data
- Apache Spark machine learning API / Apache Spark machine learning API
massive graphs
- on big data / Massive graphs on big data
- graph analytics / Graph analytics
- graph analytics, on airports / Graph analytics on airports and their flights
maths stats
- min / Box plots
- max / Box plots
- mean / Box plots
- median / Box plots
- lower quartile / Box plots
- upper quartile / Box plots
- outliers / Box plots
mean squared error (MSE) / Bisecting k-means clustering
median value / Box plots
MNIST database
- reference link / Hand written digit recognizition using CNN
model
- selecting / Training and storing models on big data
- training / Training and storing models on big data, Training and testing the model
- storing / Training and storing models on big data
- testing / Training and testing the model
multi-layer perceptron
- about / Multi-layer perceptrons
- accuracy / Accuracy of multi-layer perceptrons
- used, for flower species classification / Flower species classification using multi-Layer perceptrons
multiple linear regression / What is simple linear regression?

N

N-grams
- about / N-grams
- examples / N-grams
NameNode / Main components of HDFS
Natural Language Processing (NLP) / What are the feature types that can be extracted from the datasets?, Concepts for sentimental analysis
Naïve Bayes algorithm
- about / Naive Bayes algorithm
- advantages / Advantages of Naive Bayes
- disadvantages / Disadvantages of Naive Bayes
neural networks / Introduction to neural networks

O

OpenFlights airports database
- reference link / Datasets

P

paired RDDs
- about / Paired RDDs
- transformations / Transformations on paired RDDs
perceptron
- about / Perceptron
- issues / Problems with perceptrons
- Logical AND / Problems with perceptrons
- Logical OR / Problems with perceptrons
- sigmoid neuron / Sigmoid neuron
- multi-layer perceptron / Multi-layer perceptrons
PFP / Running FP-Growth on Apache Spark
prefuse
- about / Prefuse
- reference link / Prefuse

R

random forest / Random forests
real-time analytics
- about / Real-time analytics
- fraud analytics / Real-time analytics
- sensor data analysis (Internet of Things) / Real-time analytics
- recommendations, giving to users / Real-time analytics
- in healthcare / Real-time analytics
- ad-processing / Real-time analytics
- big data stack / Big data stack for real-time analytics
real-time data ingestion / Real-time data ingestion and storage
- Apache Kafka / Real-time data ingestion and storage
- Apache Flume / Real-time data ingestion and storage
- HBase / Real-time data ingestion and storage
- Cassandra / Real-time data ingestion and storage
real-time data processing / Real-time data processing
- Spark Streaming / Real-time data processing
- Storm / Real-time data processing
real-time SQL queries
- on big data / Real-time SQL queries on big data
- Impala / Real-time SQL queries on big data
- Apache Drill / Real-time SQL queries on big data
- Impala, used / Real-time SQL queries using Impala
real-time storage / Real-time data ingestion and storage
Recency, Frequency, and Monetary (RFM) / Customer segmentation
recommendation system
- about / Recommendation systems and their types
- types / Recommendation systems and their types
- content-based recommendation systems / Content-based recommendation systems
Resilient Distributed Dataset (RDD) / Concepts, Dataframe and datasets

S

scatter plots / Scatter plots
sentimental analysis
- about / Sentimental analysis
- concepts / Concepts for sentimental analysis
- tokenization / Tokenization
- stemming / Stemming
- N-grams / N-grams
- term presence / Term presence and Term Frequency
- term frequency / Term presence and Term Frequency
- Term Frequency and Inverse Document Frequency (TF-IDF) / TF-IDF
- bag of words / Bag of words
- dataset / Dataset
- text data, data exploration / Data exploration of text data
- on dataset / Sentimental analysis on this dataset
sigmoid neuron / Sigmoid neuron
simple linear regression / Linear regression, What is simple linear regression?
smoothing factor / Disadvantages of Naive Bayes
SOLR / Real-time data processing
SPAM Detector Model / Type of machine learning
SparkConf
- building / Building SparkConf and context
Spark ML / Apache Spark machine learning API
Spark SQL
- used, for basic analysis on data / Basic analysis of data with Spark SQL
- SparkConf, building / Building SparkConf and context
- context, building / Building SparkConf and context
- dataframe / Dataframe and datasets
- datasets / Dataframe and datasets
- data, loading / Load and parse data
- data, parsing / Load and parse data
Spark Streaming
- about / Spark Streaming, Typical uses of Spark Streaming
- use cases / Typical uses of Spark Streaming
- data collection, in real time / Typical uses of Spark Streaming
- storage, in real time / Typical uses of Spark Streaming
- predictive analytics, in real time / Typical uses of Spark Streaming
- windowed calculations / Typical uses of Spark Streaming
- cumulative calculations / Typical uses of Spark Streaming
- base project setup / Base project setup
stemming / Stemming
stop words removal / Stop words removal
Storm / Spark Streaming
sum of mean squared errors (SMEs) / Bisecting k-means clustering
supervised learning
- about / Type of machine learning
- classification / Type of machine learning
- regression / Type of machine learning
Support Vector Machine (SVM) / SVM or Support Vector Machine

T

tendency / Content-based recommendation systems
term frequency
- about / Term presence and Term Frequency
- example / Term presence and Term Frequency
Term Frequency and Inverse Document Frequency (TF-IDF) / TF-IDF
- about / TF-IDF
- term frequency / TF-IDF
- inverse document frequency / TF-IDF
TimeSeries chart
- about / Time Series chart
- All India seasonal / All India seasonal and annual average temperature series dataset
- annual average temperature series dataset / All India seasonal and annual average temperature series dataset
- simple single TimeSeries chart / Simple single Time Series chart
- multiple TimeSeries, on single chart window / Multiple Time Series on a single chart window
tokenization
- about / Tokenization
- regular expression, used / Tokenization
- pre-trained model, used / Tokenization
- stop words removal / Stop words removal
trending videos
- about / Trending videos
- sentiment analysis, at real time / Sentiment analysis in real time