Era of Big Data
In this article by Rajesh Nadipalli, the author of HDInsight Essentials Second Edition, we will take a look at the concept of Big Data and how to tame it using HDInsight. We live in a digital era and are always connected with friends and family through social media and smartphones. In 2014, about 5,700 tweets were sent and 800 Facebook links were shared every second, and the digital universe grew by about 1.7 MB per minute for every person on earth (source: IDC 2014 report). This scale of data sharing and storage is unprecedented and contributes to what is known as Big Data. The following infographic shows the details of our current use of the top social media sites (source: https://leveragenewagemedia.com/).

Another contributor to Big Data is the smart, connected device: smartphones, appliances, cars, sensors, and pretty much everything we use today that is connected to the Internet. These devices, which will soon number in the trillions, continuously collect data and communicate with each other about their environment to make intelligent decisions and help us live better. This digitization of the world has added to the exponential growth of Big Data.
According to the 2014 IDC digital universe report, this growth trend will continue, with the digital universe doubling in size every two years. About 4.4 zettabytes were created in 2013, and the forecast for 2020 is 44 zettabytes, that is, 44 trillion gigabytes (source: http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm).
Business value of Big Data
While we generated 4.4 zettabytes of data in 2013, only 5 percent of it was actually analyzed, and this is the real opportunity of Big Data. The IDC report forecasts that by 2020, with smarter sensors and devices, we will analyze over 35 percent of the generated data. This data will drive new consumer and business behavior, creating trillions of dollars of opportunity for IT vendors and for the organizations analyzing this data.
Let's take a look at some real use cases that have benefited from Big Data:
- IT systems in all major banks are constantly monitoring fraudulent activities and alerting customers within milliseconds. These systems apply complex business rules and analyze the historical data, geography, type of vendor, and other parameters based on the customer to get accurate results.
- Commercial drones are transforming agriculture by analyzing real-time aerial images and identifying problem areas. These drones are cheaper and more efficient than satellite imagery, as they fly under the clouds and can be used at any time. They identify irrigation issues related to water, pests, or fungal infections, thereby increasing crop productivity and quality. These drones are equipped with technology to capture high-quality images every second and transfer them to a cloud-hosted Big Data system for further processing (reference: http://www.technologyreview.com/featuredstory/526491/agricultural-drones/).
- Developers of the blockbuster Halo 4 game were tasked with analyzing player preferences and supporting an online tournament in the cloud. The game attracted over 4 million players in the first five days after its launch. The development team also had to design a solution that kept track of the leaderboard for the global Halo 4 Infinity challenge, which was open to all players. The team chose the Azure HDInsight service to analyze the massive amounts of unstructured data in a distributed manner. The results from HDInsight were reported using Microsoft SQL Server PowerPivot and SharePoint, and the business was extremely happy with the response times for their queries, which were a few hours or less (source: http://www.microsoft.com/casestudies/Windows-Azure/343-Industries/343-Industries-Gets-New-User-Insights-from-Big-Data-in-the-Cloud/710000002102).
Hadoop Concepts
Apache Hadoop is the leading open source Big Data platform that can store and analyze massive amounts of structured and unstructured data efficiently and can be hosted on low-cost commodity hardware. Other technologies complement Hadoop under the Big Data umbrella, such as MongoDB (a document-oriented NoSQL database), Cassandra (a wide-column NoSQL store), and VoltDB (an in-memory database). This section describes Apache Hadoop core concepts and its ecosystem.
A brief history of Hadoop
Doug Cutting created Hadoop and named it after his son's stuffed yellow elephant; the name has no other meaning. The initial version of Hadoop was launched in 2004 as the Nutch Distributed Filesystem. In February 2006, Apache Hadoop was officially started as a standalone project developing MapReduce and HDFS. By 2008, Yahoo! had adopted Hadoop as the engine of its web search, running on a cluster of around 10,000 cores. In the same year, Hadoop graduated to a top-level Apache project, confirming its success. In 2012, Hadoop 2.x was launched with YARN, enabling Hadoop to take on various types of workloads.
Today, Hadoop is known to just about every IT architect and business executive as an open source Big Data platform and is used across all industries and sizes of organizations.
Core components
In this section, we will explore what Hadoop actually consists of. At the basic level, Hadoop comprises four layers:
- Hadoop Common: A set of common libraries and utilities used by Hadoop modules.
- Hadoop Distributed File System (HDFS): A scalable and fault-tolerant distributed filesystem for data in any form. HDFS can be installed on commodity hardware and replicates data three times (which is configurable) to make the filesystem robust and tolerant of partial hardware failures.
- Yet Another Resource Negotiator (YARN): From Hadoop 2.0, YARN is the cluster management layer to handle various workloads on the cluster.
- MapReduce: MapReduce is a framework that allows parallel processing of data in Hadoop. MapReduce breaks a job into smaller tasks and distributes the load to the servers that hold the relevant data. The design principle is "move code, not data", which makes this framework efficient as it reduces the network and disk I/O required to move data around.
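The map, shuffle, and reduce phases described above can be sketched in a few lines of Python. This is a minimal, single-process illustration of the programming model using the classic word-count example, not Hadoop's actual Java API; in a real cluster, each phase runs in parallel across many worker nodes:

```python
from collections import defaultdict

# Map phase: each mapper emits (key, value) pairs from its input split.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: the framework groups all emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each reducer aggregates the values for its keys.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insights", "data drives decisions"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'insights': 1, 'drives': 1, 'decisions': 1}
```

In Hadoop proper, the input splits, the shuffle, and the distribution of reducers are all handled by the framework; the developer supplies only the map and reduce functions.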
The following diagram shows you the high-level Hadoop 2.0 core components:

The preceding diagram shows you the components that form the basic Hadoop framework. In the past few years, a vast array of new components has emerged in the Hadoop ecosystem that take advantage of YARN, making Hadoop faster, better, and suitable for various types of workloads. The following diagram shows you the Hadoop framework with these new components:

Hadoop cluster layout
Each Hadoop cluster has two types of machines, which are as follows:
- Master nodes: This includes HDFS Name Node, HDFS Secondary Name Node, and YARN Resource Manager.
- Worker nodes: This includes HDFS Data Nodes and YARN Node Managers. The data nodes and node managers are colocated for optimal data locality and performance.
A network switch interconnects the master and worker nodes.
It is recommended that you have separate servers for each of the master nodes; however, it is possible to deploy all the master nodes onto a single server for development or testing workloads.
The following diagram shows you the typical cluster layout:

Let's review the key functions of the master and worker nodes:
- Name node: This is the master of the distributed filesystem and maintains the metadata: the listing of all files and the location of each block of every file stored across the various worker nodes. Without the name node, HDFS is not accessible. From Hadoop 2.0 onwards, name node HA (High Availability) can be configured with active and standby servers.
- Secondary name node: This is an assistant to the name node. It communicates only with the name node to take snapshots of the HDFS metadata at intervals configured at the cluster level.
- YARN resource manager: This server is a scheduler that allocates the available resources in the cluster among the competing applications.
- Worker nodes: The Hadoop cluster will have several worker nodes, each handling two types of functions: HDFS Data Node and YARN Node Manager. It is typical for each worker node to handle both functions for optimal data locality. This means that processing happens on data that is local to the node, following the principle "move code, not data".
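The "move code, not data" principle can be illustrated with a small scheduling sketch. Note that the block-to-node map and the preference order here are simplified assumptions for illustration, not YARN's actual scheduler logic:

```python
# Hypothetical mapping of HDFS blocks to the worker nodes holding a replica
# (HDFS keeps three replicas of each block by default).
block_locations = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-d", "node-e"],
}

def schedule_task(block, free_nodes):
    """Prefer a free node that already stores the block (data-local execution);
    otherwise fall back to any free node, which requires a network transfer."""
    for node in block_locations.get(block, []):
        if node in free_nodes:
            return node, "data-local"
    return free_nodes[0], "remote-read"

print(schedule_task("block-1", ["node-c", "node-d"]))  # ('node-c', 'data-local')
print(schedule_task("block-2", ["node-a", "node-f"]))  # ('node-a', 'remote-read')
```

Because the task for block-1 can run on node-c, which already holds a replica, no data needs to cross the network; only the (much smaller) task code is shipped to the node.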
HDInsight Overview
HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on the Azure HDInsight cloud service (PaaS). It is a 100 percent Apache Hadoop-based service in the cloud. HDInsight was developed in partnership with Hortonworks, and it lets enterprises harness the power of Hadoop on Windows servers and on the Windows Azure cloud service.
The following are the key differentiators of the HDInsight distribution:
- Enterprise-ready Hadoop: HDInsight is backed by Microsoft support and runs on standard Windows servers. IT teams can leverage Hadoop as a Platform as a Service (PaaS), reducing the operations overhead.
- Analytics using Excel: With Excel integration, your business users can visualize and analyze Hadoop data in compelling new ways with an easy-to-use, familiar tool. The Excel add-ons Power BI, PowerPivot, Power Query, and Power Map integrate with HDInsight.
- Develop in your favorite language: HDInsight has powerful programming extensions for languages including .NET, C#, Java, and more.
- Scale using the cloud offering: The Azure HDInsight service enables customers to scale quickly as per project needs and provides a seamless interface between HDFS and Azure Blob storage.
- Connect an on-premises Hadoop cluster with the cloud: With HDInsight, you can move Hadoop data from an on-site data center to the Azure cloud for backup, dev/test, and cloud-bursting scenarios.
- Includes NoSQL transactional capabilities: HDInsight also includes Apache HBase, a columnar NoSQL database that runs on top of Hadoop and supports large-scale online transaction processing (OLTP).
- HDInsight Emulator: The HDInsight Emulator tool provides a local development environment for Azure HDInsight without the need for a cloud subscription. It can be installed using the Microsoft Web Platform Installer.
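The seamless interface between HDFS and Azure Blob storage mentioned above is exposed through the `wasb://` filesystem scheme. As a hedged illustration (the storage account name `mystorageacct` and container name `mycontainer` are hypothetical), an HDInsight cluster's `core-site.xml` maps the default filesystem to a Blob container along the following lines:

```xml
<!-- core-site.xml fragment: point the default filesystem at Azure Blob storage -->
<property>
  <name>fs.defaultFS</name>
  <value>wasb://mycontainer@mystorageacct.blob.core.windows.net</value>
</property>
<!-- Credentials for the storage account (placeholder value shown) -->
<property>
  <name>fs.azure.account.key.mystorageacct.blob.core.windows.net</name>
  <value>YOUR_STORAGE_ACCOUNT_KEY</value>
</property>
```

With this in place, standard Hadoop tools and jobs can read and write paths such as `wasb:///data/input` as if they were ordinary HDFS paths, while the data itself lives durably in Blob storage independent of the cluster's lifetime.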
Summary
We live in a connected digital era and are witnessing unprecedented growth of data. Organizations that are able to analyze Big Data are demonstrating significant returns on investment by detecting fraud, improving operations, and reducing analysis time with a scale-out architecture. Apache Hadoop is the leading open source Big Data platform, with strong and diverse ecosystem projects that enable organizations to build a modern data architecture. At its core, Hadoop has two key components: the Hadoop Distributed File System, also known as HDFS, and a cluster resource manager known as YARN. YARN has enabled Hadoop to become a true multi-use data platform that can handle batch processing, real-time streaming, interactive SQL, and other workloads.
Microsoft HDInsight is an enterprise-ready distribution of Hadoop in the cloud, developed in partnership with Hortonworks. The key benefits of HDInsight include scaling up or down as required, analysis using Excel, connecting an on-premises Hadoop cluster with the cloud, flexible programming, and support for NoSQL transactional databases.