Working with Incanter Datasets

Packt
04 Feb 2015
28 min read
In this article by Eric Rochester, author of the book Clojure Data Analysis Cookbook, Second Edition, we will cover the following recipes:

Loading Incanter's sample datasets
Loading Clojure data structures into datasets
Viewing datasets interactively with view
Converting datasets to matrices
Using infix formulas in Incanter
Selecting columns with $
Selecting rows with $
Filtering datasets with $where
Grouping data with $group-by
Saving datasets to CSV and JSON
Projecting from multiple datasets with $join

(For more resources related to this topic, see here.)

Introduction

Incanter combines the power to do statistics using a fully-featured statistical language such as R (http://www.r-project.org/) with the ease and joy of Clojure. Incanter's core data structure is the dataset, so we'll spend some time in this article looking at how to use datasets effectively. While learning basic tools in this manner is often not the most exciting way to spend your time, it can still be incredibly useful.

At its most fundamental level, an Incanter dataset is a table of rows. Each row has the same set of columns, much like a spreadsheet. The data in each cell of an Incanter dataset can be a string or a numeric value. However, some operations require the data to be numeric only. First you'll learn how to populate and view datasets, then you'll learn different ways to query and project the parts of the dataset that you're interested in onto a new dataset. Finally, we'll take a look at how to save datasets and merge multiple datasets together.

Loading Incanter's sample datasets

Incanter comes with a set of default datasets that are useful for exploring Incanter's functions. I haven't made use of them in this book, since there is so much data available in other places, but they're a great way to get a feel for what you can do with Incanter. Some of these datasets—for instance, the Iris dataset—are widely used to teach and test statistical algorithms. It contains the species and the petal and sepal dimensions for 150 irises, 50 of each species. This is the dataset that we'll access today. In this recipe, we'll load a dataset and see what it contains.

Getting ready

We'll need to include Incanter in our Leiningen project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

We'll also need to include the right Incanter namespaces in our script or REPL:

(use '(incanter core datasets))

How to do it…

Once the namespaces are available, we can access the datasets easily:

user=> (def iris (get-dataset :iris))
#'user/iris
user=> (col-names iris)
[:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species]
user=> (nrow iris)
150
user=> (set ($ :Species iris))
#{"versicolor" "virginica" "setosa"}

How it works…

We use the get-dataset function to access the built-in datasets. In this case, we're loading Fisher's Iris dataset, sometimes called Anderson's dataset. This is a multivariate dataset for discriminant analysis. It gives petal and sepal measurements for 150 different irises of three different species.

Incanter's sample datasets cover a wide variety of topics—from U.S. arrests to plant growth and ultrasonic calibration. They can be used to test different algorithms and analyses and to work with different types of data. By the way, the names of the functions should be familiar to you if you've previously used R. Incanter often uses the names of R's functions instead of the Clojure names for the same functions. For example, the preceding code sample used nrow instead of count.
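To make that naming convention a little more concrete, here is a short REPL sketch that is not part of the original recipe; it assumes the same namespaces and the iris dataset defined above, and uses ncol and dim, which are incanter.core's counterparts of the R functions of the same names:

;; Assumes (use '(incanter core datasets)) and the iris dataset from above.
user=> (ncol iris)   ; number of columns, R-style
5
user=> (dim iris)    ; [rows columns] in one call
[150 5]
user=> (count (col-names iris))
5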
There's more...

Incanter's API documentation for get-dataset (http://liebke.github.com/incanter/datasets-api.html#incanter.datasets/get-dataset) lists more sample datasets, and you can refer to it for the latest information about the data that Incanter bundles.

Loading Clojure data structures into datasets

While they are good for learning, Incanter's built-in datasets probably won't be that useful for your work (unless you work with irises). Other recipes cover ways to get data from CSV files and other sources into Incanter. Incanter also accepts native Clojure data structures in a number of formats. We'll take a look at a couple of these in this recipe.

Getting ready

We'll just need Incanter listed in our project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

We'll also need to include this in our script or REPL:

(use 'incanter.core)

How to do it…

The primary function used to convert data into a dataset is to-dataset. While it can convert single, scalar values into a dataset, we'll start with slightly more complicated inputs. Generally, you'll be working with at least a matrix. If you pass this to to-dataset, what do you get?

user=> (def matrix-set (to-dataset [[1 2 3] [4 5 6]]))
#'user/matrix-set
user=> (nrow matrix-set)
2
user=> (col-names matrix-set)
[:col-0 :col-1 :col-2]

All the data's here, but it can be labeled in a better way. Does to-dataset handle maps?

user=> (def map-set (to-dataset {:a 1, :b 2, :c 3}))
#'user/map-set
user=> (nrow map-set)
1
user=> (col-names map-set)
[:a :c :b]

So, map keys become the column labels. That's much more intuitive. Let's throw a sequence of maps at it:

user=> (def maps-set (to-dataset [{:a 1, :b 2, :c 3},
                                  {:a 4, :b 5, :c 6}]))
#'user/maps-set
user=> (nrow maps-set)
2
user=> (col-names maps-set)
[:a :c :b]

This is much more useful. We can also create a dataset by passing the column vector and the row matrix separately to dataset:

user=> (def matrix-set-2
         (dataset [:a :b :c]
                  [[1 2 3] [4 5 6]]))
#'user/matrix-set-2
user=> (nrow matrix-set-2)
2
user=> (col-names matrix-set-2)
[:c :b :a]

How it works…

The to-dataset function looks at the input and tries to process it intelligently. If given a sequence of maps, the column names are taken from the keys of the first map in the sequence. Ultimately, it uses the dataset constructor to create the dataset. The dataset constructor itself requires the data to be passed in as a column vector and a row matrix. When the data is already in this format, or when we need the most control—to rename the columns, for instance—we can use dataset directly.

Viewing datasets interactively with view

Being able to interact with our data programmatically is important, but sometimes it's also helpful to be able to look at it. This can be especially useful when you do data exploration.

Getting ready

We'll need to have Incanter in our project.clj file and script or REPL, so we'll use the same setup as we did for the Loading Incanter's sample datasets recipe, as follows. We'll also use the Iris dataset from that recipe.

(use '(incanter core datasets))

How to do it…

Incanter makes this very easy.
Let's take a look at just how simple it is: First, we need to load the dataset, as follows: user=> (def iris (get-dataset :iris)) #'user/iris Then we just call view on the dataset: user=> (view iris) This function returns the Swing window frame, which contains our data, as shown in the following screenshot. This window should also be open on your desktop, although for me, it's usually hiding behind another window: How it works… Incanter's view function takes any object and tries to display it graphically. In this case, it simply displays the raw data as a table. Converting datasets to matrices Although datasets are often convenient, many times we'll want to treat our data as a matrix from linear algebra. In Incanter, matrices store a table of doubles. This provides good performance in a compact data structure. Moreover, we'll need matrices many times because some of Incanter's functions, such as trans, only operate on a matrix. Plus, it implements Clojure's ISeq interface, so interacting with matrices is also convenient. Getting ready For this recipe, we'll need the Incanter libraries, so we'll use this project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll use the core and io namespaces, so we'll load these into our script or REPL: (use '(incanter core io)) This line binds the file name to the identifier data-file: (def data-file "data/all_160_in_51.P35.csv") How to do it… For this recipe, we'll create a dataset, convert it to a matrix, and then perform some operations on it: First, we need to read the data into a dataset, as follows: (def va-data (read-dataset data-file :header true)) Then, in order to convert it to a matrix, we just pass it to the to-matrix function. Before we do this, we'll pull out a few of the columns since matrixes can only contain floating-point numbers: (def va-matrix    (to-matrix ($ [:POP100 :HU100 :P035001] va-data))) Now that it's a matrix, we can treat it like a sequence of rows. Here, we pass it to first in order to get the first row, take in order to get a subset of the matrix, and count in order to get the number of rows in the matrix: user=> (first va-matrix) A 1x3 matrix ------------- 8.19e+03 4.27e+03 2.06e+03   user=> (count va-matrix) 591 We can also use Incanter's matrix operators to get the sum of each column, for instance. The plus function takes each row and sums each column separately: user=> (reduce plus va-matrix) A 1x3 matrix ------------- 5.43e+06 2.26e+06 1.33e+06 How it works… The to-matrix function takes a dataset of floating-point values and returns a compact matrix. Matrices are used by many of Incanter's more sophisticated analysis functions, as they're easy to work with. There's more… In this recipe, we saw the plus matrix operator. Incanter defines a full suite of these. You can learn more about matrices and see what operators are available at https://github.com/liebke/incanter/wiki/matrices. Using infix formulas in Incanter There's a lot to like about lisp: macros, the simple syntax, and the rapid development cycle. Most of the time, it is fine if you treat math operators as functions and use prefix notations, which is a consistent, function-first syntax. This allows you to treat math operators in the same way as everything else so that you can pass them to reduce, or anything else you want to do. However, we're not taught to read math expressions using prefix notations (with the operator first). 
And especially when formulas get even a little complicated, tracing out exactly what's happening can get hairy.

Getting ready

For this recipe we'll just need Incanter in our project.clj file, so we'll use the dependencies statement—as well as the use statement—from the Loading Clojure data structures into datasets recipe. For data, we'll use the matrix that we created in the Converting datasets to matrices recipe.

How to do it…

Incanter has a macro that converts a standard math notation to a lisp notation. We'll explore that in this recipe:

The $= macro changes its contents to use an infix notation, which is what we're used to from math class:

user=> ($= 7 * 4)
28
user=> ($= 7 * 4 + 3)
31

We can also work on whole matrices or just parts of matrices. In this example, we perform a scalar multiplication of the matrix:

user=> ($= va-matrix * 4)
A 591x3 matrix
---------------
3.28e+04 1.71e+04 8.22e+03
2.08e+03 9.16e+02 4.68e+02
1.19e+03 6.52e+02 3.08e+02
...
1.41e+03 7.32e+02 3.72e+02
1.31e+04 6.64e+03 3.49e+03
3.02e+04 9.60e+03 6.90e+03

user=> ($= (first va-matrix) * 4)
A 1x3 matrix
-------------
3.28e+04 1.71e+04 8.22e+03

Using this, we can build complex expressions, such as this expression that takes the mean of the values in the first row of the matrix:

user=> ($= (sum (first va-matrix)) /
           (count (first va-matrix)))
4839.333333333333

Or we can build expressions that take the mean of each column, as follows:

user=> ($= (reduce plus va-matrix) / (count va-matrix))
A 1x3 matrix
-------------
9.19e+03 3.83e+03 2.25e+03

How it works…

Any time you're working with macros and you wonder how they work, you can always get at their output expressions easily, so you can see what the computer is actually executing. The tool to do this is macroexpand-1. This expands the macro one step and returns the result. Its sibling function, macroexpand, expands the expression until there is no macro expression left. Usually, this is more than we want, so we just use macroexpand-1. Let's see what these macros expand into:

user=> (macroexpand-1 '($= 7 * 4))
(incanter.core/mult 7 4)
user=> (macroexpand-1 '($= 7 * 4 + 3))
(incanter.core/plus (incanter.core/mult 7 4) 3)
user=> (macroexpand-1 '($= 3 + 7 * 4))
(incanter.core/plus 3 (incanter.core/mult 7 4))

Here, we can see that the expression doesn't expand into Clojure's * or + functions, but uses Incanter's matrix functions, mult and plus, instead. This allows it to handle a variety of input types, including matrices, intelligently. Otherwise, it switches around the expressions the way we'd expect. Also, we can see by comparing the last two lines of code that it even handles operator precedence correctly.

Selecting columns with $

Often, you need to cut the data to make it more useful. One common transformation is to pull out all the values from one or more columns into a new dataset. This can be useful for generating summary statistics or aggregating the values of some columns. The Incanter macro $ slices out parts of a dataset. In this recipe, we'll see this in action.

Getting ready

For this recipe, we'll need to have Incanter listed in our project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])

We'll also need to include these libraries in our script or REPL:

(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clojure.string :as str]
         '[incanter.core :as i])

Moreover, we'll need some data.
This time, we'll use some country data from the World Bank. Point your browser to http://data.worldbank.org/country and select a country. I picked China. Under World Development Indicators, there is a button labeled Download Data. Click on this button and select CSV. This will download a ZIP file. I extracted its contents into the data/chn directory in my project. I bound the filename for the primary data file to the data-file name. How to do it… We'll use the $ macro in several different ways to get different results. First, however, we'll need to load the data into a dataset, which we'll do in steps 1 and 2: Before we start, we'll need a couple of utilities that load the data file into a sequence of maps and makes a dataset out of those: (defn with-header [coll] (let [headers (map #(keyword (str/replace % space -))                      (first coll))]    (map (partial zipmap headers) (next coll))))   (defn read-country-data [filename] (with-open [r (io/reader filename)]    (i/to-dataset      (doall (with-header                (drop 2 (csv/read-csv r))))))) Now, using these functions, we can load the data: user=> (def chn-data (read-country-data data-file)) We can select columns to be pulled out from the dataset by passing the column names or numbers to the $ macro. It returns a sequence of the values in the column: user=> (i/$ :Indicator-Code chn-data) ("AG.AGR.TRAC.NO" "AG.CON.FERT.PT.ZS" "AG.CON.FERT.ZS" … We can select more than one column by listing all of them in a vector. This time, the results are in a dataset: user=> (i/$ [:Indicator-Code :1992] chn-data)   |           :Indicator-Code |               :1992 | |---------------------------+---------------------| |           AG.AGR.TRAC.NO |             770629 | |         AG.CON.FERT.PT.ZS |                     | |           AG.CON.FERT.ZS |                     | |           AG.LND.AGRI.K2 |             5159980 | … We can list as many columns as we want, although the formatting might suffer: user=> (i/$ [:Indicator-Code :1992 :2002] chn-data)   |           :Indicator-Code |               :1992 |               :2002 | |---------------------------+---------------------+---------------------| |           AG.AGR.TRAC.NO |            770629 |                     | |         AG.CON.FERT.PT.ZS |                     |     122.73027213719 | |           AG.CON.FERT.ZS |                     |   373.087159048868 | |           AG.LND.AGRI.K2 |             5159980 |             5231970 | … How it works… The $ function is just a wrapper over Incanter's sel function. It provides a good way to slice columns out of the dataset, so we can focus only on the data that actually pertains to our analysis. There's more… The indicator codes for this dataset are a little cryptic. However, the code descriptions are in the dataset too: user=> (i/$ [0 1 2] [:Indicator-Code :Indicator-Name] chn-data)   |   :Indicator-Code |                                               :Indicator-Name | |-------------------+---------------------------------------------------------------| |   AG.AGR.TRAC.NO |                             Agricultural machinery, tractors | | AG.CON.FERT.PT.ZS |           Fertilizer consumption (% of fertilizer production) | |   AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) | … See also… For information on how to pull out specific rows, see the next recipe, Selecting rows with $. Selecting rows with $ The Incanter macro $ also pulls rows out of a dataset. In this recipe, we'll see this in action. 
Getting ready

For this recipe, we'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe.

How to do it…

Similar to how we use $ in order to select columns, there are several ways in which we can use it to select rows, shown as follows:

We can create a sequence of the values of one row using $, passing it the index of the row we want as well as :all for the columns:

user=> (i/$ 0 :all chn-data)
("AG.AGR.TRAC.NO" "684290" "738526" "52661" "" "880859" "" "" "" "59657" "847916" "862078" "891170" "235524" "126440" "469106" "282282" "817857" "125442" "703117" "CHN" "66290" "705723" "824113" "" "151281" "669675" "861364" "559638" "191220" "180772" "73021" "858031" "734325" "Agricultural machinery, tractors" "100432" "" "796867" "" "China" "" "" "155602" "" "" "770629" "747900" "346786" "" "398946" "876470" "" "795713" "" "55360" "685202" "989139" "798506" "")

We can also pull out a dataset containing multiple rows by passing more than one index into $ with a vector (there's a lot of data, even for three rows, so I won't show it here):

(i/$ (range 3) :all chn-data)

We can also combine the two ways to slice data in order to pull out specific columns and rows. We can either pull out a single row or multiple rows:

user=> (i/$ 0 [:Indicator-Code :1992] chn-data)
("AG.AGR.TRAC.NO" "770629")
user=> (i/$ (range 3) [:Indicator-Code :1992] chn-data)

|   :Indicator-Code |  :1992 |
|-------------------+--------|
|   AG.AGR.TRAC.NO  | 770629 |
| AG.CON.FERT.PT.ZS |        |
|   AG.CON.FERT.ZS  |        |

How it works…

The $ macro is the workhorse used to slice rows and project (or select) columns from datasets. When it's called with two indexing parameters, the first is the row or rows and the second is the column or columns.

Filtering datasets with $where

While we can filter datasets before we import them into Incanter, Incanter makes it easy to filter and create new datasets from the existing ones. We'll take a look at its query language in this recipe.

Getting ready

We'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe.

How to do it…

Once we have the data, we query it using the $where function:

For example, this creates a dataset with a row for the percentage of China's total land area that is used for agriculture:

user=> (def land-use
         (i/$where {:Indicator-Code "AG.LND.AGRI.ZS"}
                   chn-data))
user=> (i/nrow land-use)
1
user=> (i/$ [:Indicator-Code :2000] land-use)
("AG.LND.AGRI.ZS" "56.2891584865366")

The queries can be more complicated too. This expression picks out the data that exists for 1962 by filtering out any empty strings in that column:

user=> (i/$ (range 5) [:Indicator-Code :1962]
         (i/$where {:1962 {:ne ""}} chn-data))

|   :Indicator-Code |             :1962 |
|-------------------+-------------------|
|   AG.AGR.TRAC.NO  |             55360 |
|   AG.LND.AGRI.K2  |           3460010 |
|   AG.LND.AGRI.ZS  |  37.0949187612906 |
|   AG.LND.ARBL.HA  |         103100000 |
| AG.LND.ARBL.HA.PC | 0.154858284392508 |

Incanter's query language is even more powerful than this, but these examples should show you the basic structure and give you an idea of the possibilities.

How it works…

To better understand how to use $where, let's break apart the last example:

(i/$where {:1962 {:ne ""}} chn-data)

The query is expressed as a hashmap from fields to values. As we saw in the first example, the value can be a raw value, either a literal or an expression. Here, {:ne ""} tests for inequality.
(i/$where {:1962 {:ne ""}} chn-data)

Each test pair is associated with a field in another hashmap. In this example, both of the hashmaps shown contain only one key-value pair. However, they might contain multiple pairs, which will all be ANDed together.

Incanter supports a number of test operators. The basic boolean tests are :$gt (greater than), :$lt (less than), :$gte (greater than or equal to), :$lte (less than or equal to), :$eq (equal to), and :$ne (not equal). There are also some operators that take sets as parameters: :$in and :$nin (not in). The last operator—:$fn—is interesting. It allows you to use any predicate function. For example, this will randomly select approximately half of the dataset:

(def random-half
  (i/$where {:Indicator-Code {:$fn (fn [_] (< (rand) 0.5))}}
            chn-data))

There's more…

For full details of the query language, see the documentation for incanter.core/query-dataset (http://liebke.github.com/incanter/core-api.html#incanter.core/query-dataset).

Grouping data with $group-by

Datasets often come with an inherent structure. Two or more rows might have the same value in one column, and we might want to leverage that by grouping those rows together in our analysis.

Getting ready

First, we'll need to declare a dependency on Incanter in the project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])

Next, we'll include Incanter core and io in our script or REPL:

(require '[incanter.core :as i]
         '[incanter.io :as i-io])

For data, we'll use the census race data for all the states. You can download it from http://www.ericrochester.com/clj-data-analysis/data/all_160.P3.csv. These lines will load the data into the race-data name:

(def data-file "data/all_160.P3.csv")
(def race-data (i-io/read-dataset data-file :header true))

How to do it…

Incanter lets you group rows for further analysis, or to summarize them, with the $group-by function. All you need to do is pass the data to $group-by with the column or function to group on:

(def by-state (i/$group-by :STATE race-data))

How it works…

This function returns a map where each key is a map of the fields and values represented by that grouping. For example, this is how the keys look:

user=> (take 5 (keys by-state))
({:STATE 29} {:STATE 28} {:STATE 31} {:STATE 30} {:STATE 25})

We can get the data for Virginia back out by querying the group map for state 51:

user=> (i/$ (range 3) [:GEOID :STATE :NAME :POP100]
            (by-state {:STATE 51}))

|  :GEOID | :STATE |         :NAME | :POP100 |
|---------+--------+---------------+---------|
| 5100148 |     51 | Abingdon town |    8191 |
| 5100180 |     51 |  Accomac town |     519 |
| 5100724 |     51 |  Alberta town |     298 |

Saving datasets to CSV and JSON

Once you've done the work of slicing, dicing, cleaning, and aggregating your datasets, you might want to save them. Incanter by itself doesn't have a good way to do this. However, with the help of some Clojure libraries, it's not difficult at all.
Getting ready We'll need to include a number of dependencies in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                 [org.clojure/data.csv "0.1.2"]                 [org.clojure/data.json "0.2.5"]]) We'll also need to include these libraries in our script or REPL: (require '[incanter.core :as i]          '[incanter.io :as i-io]          '[clojure.data.csv :as csv]          '[clojure.data.json :as json]          '[clojure.java.io :as io]) Also, we'll use the same data that we introduced in the Selecting columns with $ recipe. How to do it… This process is really as simple as getting the data and saving it. We'll pull out the data for the year 2000 from the larger dataset. We'll use this subset of the data in both the formats here: (def data2000 (i/$ [:Indicator-Code :Indicator-Name :2000] chn-data)) Saving data as CSV To save a dataset as a CSV, all in one statement, open a file and use clojure.data.csv/write-csv to write the column names and data to it: (with-open [f-out (io/writer "data/chn-2000.csv")] (csv/write-csv f-out [(map name (i/col-names data2000))]) (csv/write-csv f-out (i/to-list data2000))) Saving data as JSON To save a dataset as JSON, open a file and use clojure.data.json/write to serialize the file: (with-open [f-out (io/writer "data/chn-2000.json")] (json/write (:rows data2000) f-out)) How it works… For CSV and JSON, as well as many other data formats, the process is very similar. Get the data, open the file, and serialize data into it. There will be differences in how the output function wants the data (to-list or :rows), and there will be differences in how the output function is called (for instance, whether the file handle is the first or second argument). But generally, outputting datasets will be very similar and relatively simple. Projecting from multiple datasets with $join So far, we've been focusing on splitting up datasets, on dividing them into groups of rows or groups of columns with functions and macros such as $ or $where. However, sometimes we'd like to move in the other direction. We might have two related datasets and want to join them together to make a larger one. For example, we might want to join crime data to census data, or take any two related datasets that come from separate sources and analyze them together. Getting ready First, we'll need to include these dependencies in our project.clj file: (defproject inc-dsets "0.1.0" :dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                  [org.clojure/data.csv "0.1.2"]]) We'll use these statements for inclusions: (require '[clojure.java.io :as io]          '[clojure.data.csv :as csv]          '[clojure.string :as str]          '[incanter.core :as i]) For our data file, we'll use the same data that we introduced in the Selecting columns with $ recipe: China's development dataset from the World Bank. How to do it… In this recipe, we'll take a look at how to join two datasets using Incanter: To begin with, we'll load the data from the data/chn/chn_Country_en_csv_v2.csv file. We'll use the with-header and read-country-data functions that were defined in the Selecting columns with $ recipe: (def data-file "data/chn/chn_Country_en_csv_v2.csv") (def chn-data (read-country-data data-file)) Currently, the data for each row contains the data for one indicator across many years. 
However, for some analyses, it will be more helpful to have each row contain the data for one indicator for one year. To do this, let's first pull out the data from two years into separate datasets. Note that for the second dataset, we'll only include a column to match the first dataset (:Indicator-Code) and the data column (:2000):

(def chn-1990
  (i/$ [:Indicator-Code :Indicator-Name :1990] chn-data))
(def chn-2000
  (i/$ [:Indicator-Code :2000] chn-data))

Now, we'll join these datasets back together. This is contrived, but it's easy to see how we would do this in a more meaningful example. For example, we might want to join the datasets from two different countries:

(def chn-decade
  (i/$join [:Indicator-Code :Indicator-Code]
           chn-1990 chn-2000))

From this point on, we can use chn-decade just as we use any other Incanter dataset.

How it works…

Let's take a look at this in more detail:

(i/$join [:Indicator-Code :Indicator-Code] chn-1990 chn-2000)

The pair of column keywords in a vector ([:Indicator-Code :Indicator-Code]) are the keys that the datasets will be joined on. In this case, the :Indicator-Code column from both datasets is used, but the keys can be different for the two datasets. The first column that is listed will be from the first dataset (chn-1990), and the second column that is listed will be from the second dataset (chn-2000). This returns a new dataset. Each row of this new dataset is a superset of the corresponding rows from the two input datasets.

Summary

In this article, we have covered the basics of working with Incanter datasets. Datasets are the core data structures used by Incanter, and understanding them is necessary in order to use Incanter effectively. A short sketch that pulls several of these recipes together follows the resource list below.

Resources for Article:

Further resources on this subject: The Hunt for Data [article] Limits of Game Data Analysis [article] Clojure for Domain-specific Languages - Design Concepts with Clojure [article]
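As a closing illustration (not part of the original article), here is a short, self-contained sketch that combines several of the recipes above—loading a sample dataset, filtering it with $where, projecting columns with $, and writing the result out with clojure.data.csv, following the same pattern as the Saving datasets to CSV and JSON recipe. The output path data/setosa.csv is just an illustrative choice:

(require '[incanter.core :as i]
         '[incanter.datasets :as datasets]
         '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Load the bundled Iris data, keep only the setosa rows,
;; and project two of the measurement columns.
(def iris (datasets/get-dataset :iris))
(def setosa
  (i/$ [:Petal.Length :Petal.Width]
       (i/$where {:Species "setosa"} iris)))

;; Write a header row followed by the data rows, as in the CSV recipe above.
(with-open [f-out (io/writer "data/setosa.csv")]
  (csv/write-csv f-out [(map name (i/col-names setosa))])
  (csv/write-csv f-out (i/to-list setosa)))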

In the Cloud

Packt
22 Jan 2015
14 min read
This article by Rafał Kuć, author of the book Solr Cookbook - Third Edition, covers the cloud side of Solr—SolrCloud: setting up collections, replica configuration, distributed indexing and searching, as well as aliasing and shard manipulation. We will also learn how to create a cluster. (For more resources related to this topic, see here.)

Creating a new SolrCloud cluster

Imagine a situation where one day you have to set up a distributed cluster with the use of Solr. The amount of data is just too much for a single server to handle. Of course, you can just set up a second server or go for another master server with another set of data. But before Solr 4.0, you would have to take care of the data distribution yourself. In addition to this, you would also have to take care of setting up replication, data duplication, and so on. With SolrCloud you don't have to do this—you can just set up a new cluster, and this article will show you how to do that.

Getting ready

You'll need a ZooKeeper cluster set up and ready for production use.

How to do it...

Let's assume that we want to create a cluster that will have four Solr servers. We also would like to have our data divided between the four Solr servers in such a way that we have the original data on two machines, and in addition to this, we would also have a copy of each shard available in case something happens with one of the Solr instances. I also assume that we already have our ZooKeeper cluster set up, ready, and available at the address 192.168.1.10 on the 9983 port. For this article, we will set up four SolrCloud nodes on the same physical machine:

We will start by running an empty Solr server (without any configuration) on port 8983. We do this by running the following command (for Solr 4.x):

java -DzkHost=192.168.1.10:9983 -jar start.jar

For Solr 5, we will run the following command:

bin/solr -c -z 192.168.1.10:9983

Now we start another three nodes, each on a different port (note that different Solr instances can run on the same port, but they should be installed on different machines). We do this by running one command for each installed Solr server (for Solr 4.x):

java -Djetty.port=6983 -DzkHost=192.168.1.10:9983 -jar start.jar
java -Djetty.port=4983 -DzkHost=192.168.1.10:9983 -jar start.jar
java -Djetty.port=2983 -DzkHost=192.168.1.10:9983 -jar start.jar

For Solr 5, the commands will be as follows:

bin/solr -c -p 6983 -z 192.168.1.10:9983
bin/solr -c -p 4983 -z 192.168.1.10:9983
bin/solr -c -p 2983 -z 192.168.1.10:9983

Now we need to upload our collection configuration to ZooKeeper. Assuming that we have our configuration in /home/conf/solrconfiguration/conf, we will run the following command from the home directory of the Solr server that runs first (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory):

./zkcli.sh -cmd upconfig -zkhost 192.168.1.10:9983 -confdir /home/conf/solrconfiguration/conf/ -confname collection1

Now we can create our collection using the following command:

curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=collection1'

If we now go to http://localhost:8983/solr/#/~cloud, we will see the cluster view. As we can see, Solr has created a new collection with a proper deployment. Let's now see how it works.

How it works...
We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collection, because we didn't create them. For Solr 4.x, we started by running Solr and telling it that we want it to run in SolrCloud mode. We did that by specifying the -DzkHost property and setting its value to the IP address of our ZooKeeper instance. Of course, in the production environment, you would point Solr to a cluster of ZooKeeper nodes—this is done using the same property, but the IP addresses are separated using the comma character. For Solr 5, we used the solr script provided in the bin directory. By adding the -c switch, we told Solr that we want it to run in the SolrCloud mode. The -z switch works exactly the same as the -DzkHost property for Solr 4.x—it allows you to specify the ZooKeeper host that should be used. Of course, the other three Solr nodes run exactly in the same manner. For Solr 4.x, we add the -DzkHost property that points Solr to our ZooKeeper. Because we are running all the four nodes on the same physical machine, we needed to specify the -Djetty.port property, because we can run only a single Solr server on a single port. For Solr 5, we use the -z property of the bin/solr script and we use the -p property to specify the port on which Solr should start. The next step is to upload the collection configuration to ZooKeeper. We do this because Solr will fetch this configuration from ZooKeeper when you will request the collection creation. To upload the configuration, we use the zkcli.sh script provided with the Solr distribution. We use the upconfig command (the -cmd switch), which means that we want to upload the configuration. We specify the ZooKeeper host using the -zkHost property. After that, we can say which directory our configuration is stored (the -confdir switch). The directory should contain all the needed configuration files such as schema.xml, solrconfig.xml, and so on. Finally, we specify the name under which we want to store our configuration using the -confname switch. After we have our configuration in ZooKeeper, we can create the collection. We do this by running a command to the Collections API that is available at the /admin/collections endpoint. First, we tell Solr that we want to create the collection (action=CREATE) and that we want our collection to be named firstCollection (name=firstCollection). Remember that the collection names are case sensitive, so firstCollection and firstcollection are two different collections. We specify that we want our collection to be built of two primary shards (numShards=2) and we want each shard to be present in two copies (replicationFactor=2). This means that we will have a primary shard and a single replica. Finally, we specify which configuration should be used to create the collection by specifying the collection.configName property. As we can see in the cloud, a view of our cluster has been created and spread across all the nodes. There's more... There are a few things that I would like to mention—the possibility of running a Zookeeper server embedded into Apache Solr and specifying the Solr server name. Starting an embedded ZooKeeper server You can also start an embedded Zookeeper server shipped with Solr for your test environment. In order to do this, you should pass the -DzkRun parameter instead of -DzkHost=192.168.0.10:9983, but only in the command that sends our configuration to the Zookeeper cluster. 
So the final command for Solr 4.x should look similar to this: java -DzkRun -jar start.jar In Solr 5.0, the same command will be as follows: bin/solr start -c By default, ZooKeeper will start on the port higher by 1,000 to the one Solr is started at. So if you are running your Solr instance on 8983, ZooKeeper will be available at 9983. The thing to remember is that the embedded ZooKeeper should only be used for development purposes and only one node should start it. Specifying the Solr server name Solr needs each instance of SolrCloud to have a name. By default, that name is set using the IP address or the hostname, appended with the port the Solr instance is running on, and the _solr postfix. For example, if our node is running on 192.168.56.1 and port 8983, it will be called 192.168.56.1:8983_solr. Of course, Solr allows you to change that behavior by specifying the hostname. To do this, start using the -Dhost property or add the host property to solr.xml. For example, if we would like one of our nodes to have the name of server1, we can run the following command to start Solr: java -DzkHost=192.168.1.10:9983 -Dhost=server1 -jar start.jar In Solr 5.0, the same command would be: bin/solr start -c -h server1 Setting up multiple collections on a single cluster Having a single collection inside the cluster is nice, but there are multiple use cases when we want to have more than a single collection running on the same cluster. For example, we might want users and books in different collections or logs from each day to be only stored inside a single collection. This article will show you how to create multiple collections on the same cluster. Getting ready This article will show you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10 and is listening on the 2181 port and that we already have four SolrCloud nodes running as a cluster. How to do it... As we already have all the prerequisites, such as ZooKeeper and Solr up and running, we need to upload our configuration files to ZooKeeper to be able to create collections: Assuming that we have our configurations in /home/conf/firstcollection/conf and /home/conf/secondcollection/conf, we will run the following commands from the home directory of the first run Solr server to upload the configuration to ZooKeeper (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory): ./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/firstcollection/conf/ -confname firstcollection./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/secondcollection/conf/ -confname secondcollection We have pushed our configurations into Zookeeper, so now we can create the collections we want. In order to do this, we use the following commands: curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=firstcollection'curl 'localhost:8983/solr/admin/collections?action=CREATE&name=secondcollection&numShards=4&replicationFactor=1&collection.configName=secondcollection' Now, just to test whether everything went well, we will go to http://localhost:8983/solr/#/~cloud. As the result, we will see the following cluster topology: As we can see, both the collections were created the way we wanted. Now let's see how that happened. How it works... We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collections, because we didn't create them. 
We also assumed that we have our SolrCloud cluster configured and started. We start by uploading two configurations to ZooKeeper, one called firstcollection and the other called secondcollection. After that we are ready to create our collections. We start by creating the collection named firstCollection that is built of two primary shards and one replica. The second collection, called secondcollection is built of four primary shards and it doesn't have any replicas. We can see that easily in the cloud view of the deployment. The firstCollection collection has two shards—shard1 and shard2. Each of the shard has two physical copies—one green (which means active) and one with a black dot, which is the primary shard. The secondcollection collection is built of four physical shards—each shard has a black dot near its name, which means that they are primary shards. Splitting shards Imagine a situation where you reach a limit of your current deployment—the number of shards is just not enough. For example, the indexing throughput is lower and lower, because the disks are not able to keep up. Of course, one of the possible solutions is to spread the index across more shards; however, you already have a collection and you want to keep the data and reindexing is not an option, because you don't have the original data. Solr can help you with such situations by allowing splitting shards of already created collections. This article will show you how to do it. Getting ready This article will show you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10 and is listening on port 2181 and that we already have four SolrCloud nodes running as a cluster. How to do it... Let's assume that we already have a SolrCloud cluster up and running and it has one collection called books. So our cloud view (which is available at http://localhost:8983/solr/#/~cloud) looks as follows: We have four nodes and we don't utilize them fully. We can say that these two nodes in which we have our shards are almost fully utilized. What we can do is create a new collection and reindex the data or we can split shards of the already created collection. Let's go with the second option: We start by splitting the first shard. It is as easy as running the following command: curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard1' After this, we can split the second shard by running a similar command to the one we just used: curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard2' Let's take a look at the cluster cloud view now (which is available at http://localhost:8983/solr/#/~cloud): As we can see, both shards were split—shard1 was divided into shard1_0 and shard1_1 and shard2 was divided into shard2_0 and shard2_1. Of course, the data was copied as well, so everything is ready. However, the last step should be to delete the original shards. Solr doesn't delete them, because sometimes applications use shard names to connect to a given shard. However, in our case, we can delete them by running the following commands: curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard1' curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard2' Now if we would again look at the cloud view of the cluster, we will see the following: How it works... 
We start with a simple collection called books that is built of two primary shards and no replicas. This is the collection whose shards we will try to divide without stopping Solr.

Splitting shards is very easy. We just need to run a simple command in the Collections API (the /admin/collections endpoint) and specify that we want to split a shard (action=SPLITSHARD). We also need to provide additional information, such as which collection we are interested in (the collection parameter) and which shard we want to split (the shard parameter). You can see the name of the shard by looking at the cloud view or by reading the cluster state from ZooKeeper. After sending the command, Solr might force us to wait for a substantial amount of time—shard splitting takes time, especially on large collections. Of course, we can run the same command for the second shard as well.

Finally, we end up with six shards—four new and two old ones. The original shard will still contain data, but it will start to re-route requests to the newly created shards. The data was split evenly between the new shards. The old shards were left, although they are marked as inactive and they won't have any more data indexed to them. Because we don't need them, we can just delete them using the action=DELETESHARD command sent to the same Collections API. Similar to the split shard command, we need to specify the collection name and the name of the shard we want to delete. After we delete the initial shards, our cluster view shows only four shards, which is what we were aiming at. We can now spread the shards across the cluster.

Summary

In this article, we learned how to set up multiple collections and how to increase the number of collections in a cluster. We also worked through a way to split shards.

Resources for Article:

Further resources on this subject: Tuning Solr JVM and Container [Article] Apache Solr PHP Integration [Article] Administrating Solr [Article]

Taming Big Data using HDInsight

Packt
22 Jan 2015
10 min read
(For more resources related to this topic, see here.) Era of Big Data In this article by Rajesh Nadipalli, the author of HDInsight Essentials Second Edition, we will take a look at the concept of Big Data and how to tame it using HDInsight. We live in a digital era and are always connected with friends and family using social media and smartphones. In 2014, every second, about 5,700 tweets were sent and 800 links were shared using Facebook, and the digital universe was about 1.7 MB per minute for every person on earth (source: IDC 2014 report). This amount of data sharing and storing is unprecedented and is contributing to what is known as Big Data. The following infographic shows you the details of our current use of the top social media sites (source: https://leveragenewagemedia.com/). Another contributor to Big Data are the smart, connected devices such as smartphones, appliances, cars, sensors, and pretty much everything that we use today and is connected to the Internet. These devices, which will soon be in trillions, continuously collect data and communicate with each other about their environment to make intelligent decisions and help us live better. This digitization of the world has added to the exponential growth of Big Data. According to the 2014 IDC digital universe report, the growth trend will continue and double in size every two years. In 2013, about 4.4 zettabytes were created and in 2020, the forecast is 44 zettabytes, which is 44 trillion gigabytes, (source: http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm). Business value of Big Data While we generated 4.4 zettabytes of data in 2013, only 5 percent of it was actually analyzed, and this is the real opportunity of Big Data. The IDC report forecasts that by 2020, we will analyze over 35 percent of the generated data by making smarter sensors and devices. This data will drive new consumer and business behavior that will drive trillions of dollars in opportunity for IT vendors and organizations analyzing this data. Let's take a look at some real use cases that have benefited from Big Data: IT systems in all major banks are constantly monitoring fraudulent activities and alerting customers within milliseconds. These systems apply complex business rules and analyze the historical data, geography, type of vendor, and other parameters based on the customer to get accurate results. Commercial drones are transforming agriculture by analyzing real-time aerial images and identifying the problem areas. These drones are cheaper and efficient than satellite imagery, as they fly under the clouds and can be used anytime. They identify the irrigation issues related to water, pests, or fungal infections thereby increasing the crop productivity and quality. These drones are equipped with technology to capture high-quality images every second and transfer them to a cloud-hosted Big Data system for further processing (reference: http://www.technologyreview.com/featuredstory/526491/agricultural-drones/). Developers of the blockbuster Halo 4 game were tasked to analyze player preferences and support an online tournament in the cloud. The game attracted over 4 million players in its first five days after its launch. The development team had to also design a solution that kept track of a leader board for the global Halo 4 Infinity challenge, which was open to all the players. The development team chose the Azure HDInsight service to analyze the massive amounts of unstructured data in a distributed manner. 
The results from HDInsight were reported using Microsoft SQL Server PowerPivot and SharePoint, and the business was extremely happy with the response times for their queries, which were a few hours or less (source: http://www.microsoft.com/casestudies/Windows-Azure/343-Industries/343-Industries-Gets-New-User-Insights-from-Big-Data-in-the-Cloud/710000002102).

Hadoop Concepts

Apache Hadoop is the leading open source Big Data platform that can store and analyze massive amounts of structured and unstructured data efficiently and can be hosted on low-cost commodity hardware. There are other technologies that complement Hadoop under the Big Data umbrella, such as MongoDB (a NoSQL database), Cassandra (a document database), and VoltDB (an in-memory database). This section describes Apache Hadoop's core concepts and its ecosystem.

A brief history of Hadoop

Doug Cutting created Hadoop and named it after his kid's stuffed yellow elephant; the name has no real meaning. In 2004, the initial version of Hadoop was launched as the Nutch Distributed Filesystem. In February 2006, the Apache Hadoop project was officially started as a standalone development for MapReduce and HDFS. By 2008, Yahoo had adopted Hadoop as the engine of its web search with a cluster size of around 10,000. In the same year, Hadoop graduated to a top-level Apache project, confirming its success. In 2012, Hadoop 2.x was launched with YARN, enabling Hadoop to take on various types of workloads. Today, Hadoop is known by just about every IT architect and business executive as an open source Big Data platform and is used across all industries and sizes of organizations.

Core components

In this section, we will explore what Hadoop is actually made of. At the basic level, Hadoop consists of four layers:

Hadoop Common: A set of common libraries and utilities used by Hadoop modules.

Hadoop Distributed File System (HDFS): A scalable and fault-tolerant distributed filesystem for data in any form. HDFS can be installed on commodity hardware and replicates data three times (which is configurable) to make the filesystem robust and tolerant of partial hardware failures.

Yet Another Resource Negotiator (YARN): From Hadoop 2.0, YARN is the cluster management layer that handles various workloads on the cluster.

MapReduce: MapReduce is a framework that allows parallel processing of data in Hadoop. MapReduce breaks a job into smaller tasks and distributes the load to servers that have the relevant data. The design model is "move code and not data", making this framework efficient as it reduces the network and disk I/O required to move the data.

These are the components that form the basic Hadoop framework. In the past few years, a vast array of new components has emerged in the Hadoop ecosystem; these take advantage of YARN, making Hadoop faster, better, and suitable for various types of workloads.

Hadoop cluster layout

Each Hadoop cluster has two types of machines, which are as follows:

Master nodes: This includes the HDFS Name Node, the HDFS Secondary Name Node, and the YARN Resource Manager.

Worker nodes: This includes the HDFS Data Nodes and YARN Node Managers. The data nodes and node managers are colocated for optimal data locality and performance.

A network switch interconnects the master and worker nodes.
It is recommended that you have separate servers for each of the master nodes; however, it is possible to deploy all the master nodes onto a single server for development or testing workloads. Let's review the key functions of the master and worker nodes:

Name node: This is the master for the distributed filesystem and maintains the metadata. This metadata holds the listing of all the files and the location of each block of a file, which are stored across the various slaves. Without a name node, HDFS is not accessible. From Hadoop 2.0 onwards, name node HA (High Availability) can be configured with active and standby servers.

Secondary name node: This is an assistant to the name node. It communicates only with the name node to take snapshots of the HDFS metadata at intervals that are configured at the cluster level.

YARN resource manager: This server is a scheduler that allocates the available resources in the cluster among the competing applications.

Worker nodes: The Hadoop cluster will have several worker nodes that handle two types of functions—HDFS Data Node and YARN Node Manager. It is typical that each worker node handles both functions for optimal data locality. This means that processing happens on the data that is local to the node, following the principle "move code and not data".

HDInsight Overview

HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on the Azure HDInsight cloud service (PaaS). It is a 100 percent Apache Hadoop-based service in the cloud. HDInsight was developed through the partnership of Hortonworks and Microsoft. Enterprises can now harness the power of Hadoop on Windows servers and the Windows Azure cloud service. The following are the key differentiators for an HDInsight distribution:

Enterprise-ready Hadoop: HDInsight is backed by Microsoft support, and runs on standard Windows servers. IT teams can leverage Hadoop with the Platform as a Service (PaaS), reducing the operations overhead.

Analytics using Excel: With Excel integration, your business users can visualize and analyze Hadoop data in compelling new ways with an easy-to-use, familiar tool. The Excel add-ons PowerBI, PowerPivot, Power Query, and Power Map integrate with HDInsight.

Develop in your favorite language: HDInsight has powerful programming extensions for languages, including .NET, C#, Java, and more.

Scale using the cloud offering: The Azure HDInsight service enables customers to scale quickly as per the project needs and has a seamless interface between HDFS and Azure Blob storage.

Connect an on-premises Hadoop cluster with the cloud: With HDInsight, you can move Hadoop data from an on-site data center to the Azure cloud for backup, dev/test, and cloud bursting scenarios.

Includes NoSQL transactional capabilities: HDInsight also includes Apache HBase, a columnar NoSQL database that runs on top of Hadoop and allows large online transactional processing (OLTP).

HDInsight Emulator: The HDInsight Emulator tool provides a local development environment for Azure HDInsight without the need for a cloud subscription. This can be installed using the Microsoft Web Platform Installer.
Summary

We live in a connected digital era and are witnessing unprecedented growth of data. Organizations that are able to analyze Big Data are demonstrating significant return on investment by detecting fraud, improving operations, and reducing the time to analyze data with a scale-out architecture. Apache Hadoop is the leading open source Big Data platform, with strong and diverse ecosystem projects that enable organizations to build a modern data architecture. At its core, Hadoop has two key components: the Hadoop Distributed File System, also known as HDFS, and a cluster resource manager known as YARN. YARN has enabled Hadoop to be a true multi-use data platform that can handle batch processing, real-time streaming, interactive SQL, and more.

Microsoft HDInsight is an enterprise-ready distribution of Hadoop on the cloud that has been developed through the partnership of Hortonworks and Microsoft. The key benefits of HDInsight include scaling up/down as required, analysis using Excel, connecting an on-premises Hadoop cluster with the cloud, and flexible programming and support for NoSQL transactional databases.

Resources for Article:

Further resources on this subject: Hadoop and HDInsight in a Heartbeat [article] Sizing and Configuring your Hadoop Cluster [article] Introducing Kafka [article]

Highcharts Configurations

Packt
21 Jan 2015
53 min read
This article is written by Joe Kuan, the author of Learning Highcharts 4. All Highcharts graphs share the same configuration structure and it is crucial for us to become familiar with the core components. However, it is not possible to go through all the configurations within the book. In this article, we will explore the functional properties that are most used and demonstrate them with examples. We will learn how Highcharts manages layout, and then explore how to configure axes, specify single series and multiple series data, followed by looking at formatting and styling tool tips in both JavaScript and HTML. After that, we will get to know how to polish our charts with various types of animations and apply color gradients. Finally, we will explore the drilldown interactive feature. In this article, we will cover the following topics: Understanding Highcharts layout Framing the chart with axes (For more resources related to this topic, see here.) Configuration structure In the Highcharts configuration object, the components at the top level represent the skeleton structure of a chart. The following is a list of the major components that are covered in this article: chart: This has configurations for the top-level chart properties such as layouts, dimensions, events, animations, and user interactions series: This is an array of series objects (consisting of data and specific options) for single and multiple series, where the series data can be specified in a number of ways xAxis/yAxis/zAxis: This has configurations for all the axis properties such as labels, styles, range, intervals, plotlines, plot bands, and backgrounds tooltip: This has the layout and format style configurations for the series data tool tips drilldown: This has configurations for drilldown series and the ID field associated with the main series title/subtitle: This has the layout and style configurations for the chart title and subtitle legend: This has the layout and format style configurations for the chart legend plotOptions: This contains all the plotting options, such as display, animation, and user interactions, for common series and specific series types exporting: This has configurations that control the layout and the function of print and export features For reference information concerning all configurations, go to http://api.highcharts.com. Understanding Highcharts' layout Before we start to learn how Highcharts layout works, it is imperative that we understand some basic concepts first. First, set a border around the plot area. To do that we can set the options of plotBorderWidth and plotBorderColor in the chart section, as follows:         chart: {                renderTo: 'container',                type: 'spline',                plotBorderWidth: 1,                plotBorderColor: '#3F4044'        }, The second border is set around the Highcharts container. Next, we extend the preceding chart section with additional settings:         chart: {                renderTo: 'container',                ....                borderColor: '#a1a1a1',                borderWidth: 2,                borderRadius: 3        }, This sets the container border color with a width of 2 pixels and corner radius of 3 pixels. As we can see, there is a border around the container and this is the boundary that the Highcharts display cannot exceed: By default, Highcharts displays have three different areas: spacing, labeling, and plot area. The plot area is the area inside the inner rectangle that contains all the plot graphics. 
The labeling area is the area where labels such as title, subtitle, axis title, legend, and credits go, around the plot area, so that it is between the edge of the plot area and the inner edge of the spacing area. The spacing area is the area between the container border and the outer edge of the labeling area. The following screenshot shows three different kinds of areas. A gray dotted line is inserted to illustrate the boundary between the spacing and labeling areas. Each chart label position can be operated in one of the following two layouts: Automatic layout: Highcharts automatically adjusts the plot area size based on the labels' positions in the labeling area, so the plot area does not overlap with the label element at all. Automatic layout is the simplest way to configure, but has less control. This is the default way of positioning the chart elements. Fixed layout: There is no concept of labeling area. The chart label is specified in a fixed location so that it has a floating effect on the plot area. In other words, the plot area side does not automatically adjust itself to the adjacent label position. This gives the user full control of exactly how to display the chart. The spacing area controls the offset of the Highcharts display on each side. As long as the chart margins are not defined, increasing or decreasing the spacing area has a global effect on the plot area measurements in both automatic and fixed layouts. Chart margins and spacing settings In this section, we will see how chart margins and spacing settings have an effect on the overall layout. Chart margins can be configured with the properties margin, marginTop, marginLeft, marginRight, and marginBottom, and they are not enabled by default. Setting chart margins has a global effect on the plot area, so that none of the label positions or chart spacing configurations can affect the plot area size. Hence, all the chart elements are in a fixed layout mode with respect to the plot area. The margin option is an array of four margin values covered for each direction, the same as in CSS, starting from north and going clockwise. Also, the margin option has a lower precedence than any of the directional margin options, regardless of their order in the chart section. Spacing configurations are enabled by default with a fixed value on each side. These can be configured in the chart section with the property names spacing, spacingTop, spacingLeft, spacingBottom, and spacingRight. In this example, we are going to increase or decrease the margin or spacing property on each side of the chart and observe the effect. The following are the chart settings:             chart: {                renderTo: 'container',                type: ...                marginTop: 10,                marginRight: 0,                spacingLeft: 30,                spacingBottom: 0            }, The following screenshot shows what the chart looks like: The marginTop property fixes the plot area's top border 10 pixels away from the container border. It also changes the top border into fixed layout for any label elements, so the chart title and subtitle float on top of the plot area. The spacingLeft property increases the spacing area on the left-hand side, so it pushes the y axis title further in. As it is in automatic layout (without declaring marginLeft), it also pushes the plot area's west border in. Setting marginRight to 0 will override all the default spacing on the chart's right-hand side and change it to fixed layout mode. 
Finally, setting spacingBottom to 0 makes the legend touch the lower bar of the container, so it also stretches the plot area downwards. This is because the bottom edge is still in automatic layout even though spacingBottom is set to 0. Chart label properties Chart labels such as xAxis.title, yAxis.title, legend, title, subtitle, and credits share common property names, as follows: align: This is for the horizontal alignment of the label. Possible keywords are 'left', 'center', and 'right'. As for the axis title, it is 'low', 'middle', and 'high'. floating: This is to give the label position a floating effect on the plot area. Setting this to true will cause the label position to have no effect on the adjacent plot area's boundary. margin: This is the margin setting between the label and the side of the plot area adjacent to it. Only certain label types have this setting. verticalAlign: This is for the vertical alignment of the label. The keywords are 'top', 'middle', and 'bottom'. x: This is for horizontal positioning in relation to alignment. y: This is for vertical positioning in relation to alignment. As for the labels' x and y positioning, they are not used for absolute positioning within the chart. They are designed for fine adjustment with the label alignment. The following diagram shows the coordinate directions, where the center represents the label location: We can experiment with these properties with a simple example of the align and y position settings, by placing both title and subtitle next to each other. The title is shifted to the left with align set to 'left', whereas the subtitle alignment is set to 'right'. In order to make both titles appear on the same line, we change the subtitle's y position to 15, which is the same as the title's default y value:  title: {     text: 'Web browsers ...',     align: 'left' }, subtitle: {     text: 'From 2008 to present',     align: 'right',     y: 15 }, The following is a screenshot showing both titles aligned on the same line: In the following subsections, we will experiment with how changes in alignment for each label element affect the layout behavior of the plot area. Title and subtitle alignments Title and subtitle have the same layout properties, and the only differences are that the default values and title have the margin setting. Specifying verticalAlign for any value changes from the default automatic layout to fixed layout (it internally switches floating to true). However, manually setting the subtitle's floating property to false does not switch back to automatic layout. The following is an example of title in automatic layout and subtitle in fixed layout:     title: {       text: 'Web browsers statistics'    },    subtitle: {       text: 'From 2008 to present',       verticalAlign: 'top',       y: 60       }, The verticalAlign property for the subtitle is set to 'top', which switches the layout into fixed layout, and the y offset is increased to 60. The y offset pushes the subtitle's position further down. Due to the fact that the plot area is not in an automatic layout relationship to the subtitle anymore, the top border of the plot area goes above the subtitle. However, the plot area is still in automatic layout towards the title, so the title is still above the plot area: Legend alignment Legends show different behavior for the verticalAlign and align properties. Apart from setting the alignment to 'center', all other settings in verticalAlign and align remain in automatic positioning. 
The following is an example of a legend located on the right-hand side of the chart. The verticalAlign property is switched to the middle of the chart, where the horizontal align is set to 'right':           legend: {                align: 'right',                verticalAlign: 'middle',                layout: 'vertical'          }, The layout property is assigned to 'vertical' so that it causes the items inside the legend box to be displayed in a vertical manner. As we can see, the plot area is automatically resized for the legend box: Note that the border decoration around the legend box is disabled in the newer version. To display a round border around the legend box, we can add the borderWidth and borderRadius options using the following:           legend: {                align: 'right',                verticalAlign: 'middle',                layout: 'vertical',                borderWidth: 1,                borderRadius: 3          }, Here is the legend box with a round corner border: Axis title alignment Axis titles do not use verticalAlign. Instead, they use the align setting, which is either 'low', 'middle', or 'high'. The title's margin value is the distance between the axis title and the axis line. The following is an example of showing the y-axis title rotated horizontally instead of vertically (which it is by default) and displayed on the top of the axis line instead of next to it. We also use the y property to fine-tune the title location:             yAxis: {                title: {                    text: 'Percentage %',                    rotation: 0,                    y: -15,                    margin: -70,                    align: 'high'                },                min: 0            }, The following is a screenshot of the upper-left corner of the chart showing that the title is aligned horizontally at the top of the y axis. Alternatively, we can use the offset option instead of margin to achieve the same result. Credits alignment Credits is a bit different from other label elements. It only supports the align, verticalAlign, x, and y properties in the credits.position property (shorthand for credits: { position: … }), and is also not affected by any spacing setting. Suppose we have a graph without a legend and we have to move the credits to the lower-left area of the chart, the following code snippet shows how to do it:             legend: {                enabled: false            },            credits: {                position: {                   align: 'left'                },                text: 'Joe Kuan',                href: 'http://joekuan.wordpress.com'            }, However, the credits text is off the edge of the chart, as shown in the following screenshot: Even if we move the credits label to the right with x positioning, the label is still a bit too close to the x axis interval label. We can introduce extra spacingBottom to put a gap between both labels, as follows:             chart: {                   spacingBottom: 30,                    ....            },            credits: {                position: {                   align: 'left',                   x: 20,                   y: -7                },            },            .... The following is a screenshot of the credits with the final adjustments: Experimenting with an automatic layout In this section, we will examine the automatic layout feature in more detail. 
For the sake of simplifying the example, we will start with only the chart title and without any chart spacing settings:      chart: {         renderTo: 'container',         // border and plotBorder settings         borderWidth: 2,         .....     },     title: {            text: 'Web browsers statistics',     }, From the preceding example, the chart title should appear as expected between the container and the plot area's borders: The space between the title and the top border of the container has the default setting spacingTop for the spacing area (a default value of 10-pixels high). The gap between the title and the top border of the plot area is the default setting for title.margin, which is 15-pixels high. By setting spacingTop in the chart section to 0, the chart title moves up next to the container top border. Hence the size of the plot area is automatically expanded upwards, as follows: Then, we set title.margin to 0; the plot area border moves further up, hence the height of the plot area increases further, as follows: As you may notice, there is still a gap of a few pixels between the top border and the chart title. This is actually due to the default value of the title's y position setting, which is 15 pixels, large enough for the default title font size. The following is the chart configuration for setting all the spaces between the container and the plot area to 0: chart: {     renderTo: 'container',     // border and plotBorder settings     .....     spacingTop: 0},title: {     text: null,     margin: 0,     y: 0} If we set title.y to 0, all the gap between the top edge of the plot area and the top container edge closes up. The following is the final screenshot of the upper-left corner of the chart, to show the effect. The chart title is not visible anymore as it has been shifted above the container: Interestingly, if we work backwards to the first example, the default distance between the top of the plot area and the top of the container is calculated as: spacingTop + title.margin + title.y = 10 + 15 + 15 = 40 Therefore, changing any of these three variables will automatically adjust the plot area from the top container bar. Each of these offset variables actually has its own purpose in the automatic layout. Spacing is for the gap between the container and the chart content; thus, if we want to display a chart nicely spaced with other elements on a web page, spacing elements should be used. Equally, if we want to use a specific font size for the label elements, we should consider adjusting the y offset. Hence, the labels are still maintained at a distance and do not interfere with other components in the chart. Experimenting with a fixed layout In the preceding section, we have learned how the plot area dynamically adjusted itself. In this section, we will see how we can manually position the chart labels. First, we will start with the example code from the beginning of the Experimenting with automatic layout section and set the chart title's verticalAlign to 'bottom', as follows: chart: {    renderTo: 'container',    // border and plotBorder settings    .....},title: {    text: 'Web browsers statistics',    verticalAlign: 'bottom'}, The chart title is moved to the bottom of the chart, next to the lower border of the container. 
Notice that this setting has changed the title into floating mode; more importantly, the legend still remains in the default automatic layout of the plot area: Be aware that we haven't specified spacingBottom, which has a default value of 15 pixels in height when applied to the chart. This means that there should be a gap between the title and the container bottom border, but none is shown. This is because the title.y position has a default value of 15 pixels in relation to spacing. According to the diagram in the Chart label properties section, this positive y value pushes the title towards the bottom border; this compensates for the space created by spacingBottom. Let's make a bigger change to the y offset position this time to show that verticalAlign is floating on top of the plot area:  title: {     text: 'Web browsers statistics',     verticalAlign: 'bottom',     y: -90 }, The negative y value moves the title up, as shown here: Now the title is overlapping the plot area. To demonstrate that the legend is still in automatic layout with regard to the plot area, here we change the legend's y position and the margin settings, which is the distance from the axis label:                legend: {                   margin: 70,                   y: -10               }, This has pushed up the bottom side of the plot area. However, the chart title still remains in fixed layout and its position within the chart hasn't been changed at all after applying the new legend setting, as shown in the following screenshot: By now, we should have a better understanding of how to position label elements, and their layout policy relating to the plot area. Framing the chart with axes In this section, we are going to look into the configuration of axes in Highcharts in terms of their functional area. We will start off with a plain line graph and gradually apply more options to the chart to demonstrate the effects. Accessing the axis data type There are two ways to specify data for a chart: categories and series data. For displaying intervals with specific names, we should use the categories field that expects an array of strings. Each entry in the categories array is then associated with the series data array. Alternatively, the axis interval values are embedded inside the series data array. Then, Highcharts extracts the series data for both axes, interprets the data type, and formats and labels the values appropriately. 
The following is a straightforward example showing the use of categories:     chart: {        renderTo: 'container',        height: 250,        spacingRight: 20    },    title: {        text: 'Market Data: Nasdaq 100'    },    subtitle: {        text: 'May 11, 2012'    },    xAxis: {        categories: [ '9:30 am', '10:00 am', '10:30 am',                       '11:00 am', '11:30 am', '12:00 pm',                       '12:30 pm', '1:00 pm', '1:30 pm',                       '2:00 pm', '2:30 pm', '3:00 pm',                       '3:30 pm', '4:00 pm' ],         labels: {             step: 3         }     },     yAxis: {         title: {             text: null         }     },     legend: {         enabled: false     },     credits: {         enabled: false     },     series: [{         name: 'Nasdaq',         color: '#4572A7',         data: [ 2606.01, 2622.08, 2636.03, 2637.78, 2639.15,                 2637.09, 2633.38, 2632.23, 2632.33, 2632.59,                 2630.34, 2626.89, 2624.59, 2615.98 ]     }] The preceding code snippet produces a graph that looks like the following screenshot: The first name in the categories field corresponds to the first value, 9:30 am, 2606.01, in the series data array, and so on. Alternatively, we can specify the time values inside the series data and use the type property of the x axis to format the time. The type property supports 'linear' (default), 'logarithmic', or 'datetime'. The 'datetime' setting automatically interprets the time in the series data into human-readable form. Moreover, we can use the dateTimeLabelFormats property to predefine the custom format for the time unit. The option can also accept multiple time unit formats. This is for when we don't know in advance how long the time span is in the series data, so each unit in the resulting graph can be per hour, per day, and so on. The following example shows how the graph is specified with predefined hourly and minute formats. The syntax of the format string is based on the PHP strftime function:     xAxis: {         type: 'datetime',          // Format 24 hour time to AM/PM          dateTimeLabelFormats: {                hour: '%I:%M %P',              minute: '%I %M'          }               },     series: [{         name: 'Nasdaq',         color: '#4572A7',         data: [ [ Date.UTC(2012, 4, 11, 9, 30), 2606.01 ],                  [ Date.UTC(2012, 4, 11, 10), 2622.08 ],                   [ Date.UTC(2012, 4, 11, 10, 30), 2636.03 ],                  .....                ]     }] Note that the x axis is in the 12-hour time format, as shown in the following screenshot: Instead, we can define the format handler for the xAxis.labels.formatter property to achieve a similar effect. Highcharts provides a utility routine, Highcharts.dateFormat, that converts the timestamp in milliseconds to a readable format. In the following code snippet, we define the formatter function using dateFormat and this.value. The keyword this is the axis's interval object, whereas this.value is the UTC time value for the instance of the interval:     xAxis: {         type: 'datetime',         labels: {             formatter: function() {                 return Highcharts.dateFormat('%I:%M %P', this.value);             }         }     }, Since the time values of our data points are in fixed intervals, they can also be arranged in a cut-down version. 
All we need is to define the starting point of time, pointStart, and the regular interval between them, pointInterval, in milliseconds: series: [{     name: 'Nasdaq',     color: '#4572A7',     pointStart: Date.UTC(2012, 4, 11, 9, 30),     pointInterval: 30 * 60 * 1000,     data: [ 2606.01, 2622.08, 2636.03, 2637.78,             2639.15, 2637.09, 2633.38, 2632.23,             2632.33, 2632.59, 2630.34, 2626.89,             2624.59, 2615.98 ] }] Adjusting intervals and background We have learned how to use axis categories and series data arrays in the last section. In this section, we will see how to format interval lines and the background style to produce a graph with more clarity. We will continue from the previous example. First, let's create some interval lines along the y axis. In the chart, the interval is automatically set to 20. However, it would be clearer to double the number of interval lines. To do that, simply assign the tickInterval value to 10. Then, we use minorTickInterval to put another line in between the intervals to indicate a semi-interval. In order to distinguish between interval and semi-interval lines, we set the semi-interval lines, minorGridLineDashStyle, to a dashed and dotted style. There are nearly a dozen line style settings available in Highcharts, from 'Solid' to 'LongDashDotDot'. Readers can refer to the online manual for possible values. The following is the first step to create the new settings:             yAxis: {                 title: {                     text: null                 },                 tickInterval: 10,                 minorTickInterval: 5,                 minorGridLineColor: '#ADADAD',                 minorGridLineDashStyle: 'dashdot'            } The interval lines should look like the following screenshot: To make the graph even more presentable, we add a striping effect with shading using alternateGridColor. Then, we change the interval line color, gridLineColor, to a similar range with the stripes. The following code snippet is added into the yAxis configuration:                 gridLineColor: '#8AB8E6',                 alternateGridColor: {                     linearGradient: {                         x1: 0, y1: 1,                         x2: 1, y2: 1                     },                     stops: [ [0, '#FAFCFF' ],                              [0.5, '#F5FAFF'] ,                              [0.8, '#E0F0FF'] ,                              [1, '#D6EBFF'] ]                   } The following is the graph with the new shading background: The next step is to apply a more professional look to the y axis line. We are going to draw a line on the y axis with the lineWidth property, and add some measurement marks along the interval lines with the following code snippet:                  lineWidth: 2,                  lineColor: '#92A8CD',                  tickWidth: 3,                  tickLength: 6,                  tickColor: '#92A8CD',                  minorTickLength: 3,                  minorTickWidth: 1,                  minorTickColor: '#D8D8D8' The tickWidth and tickLength properties add the effect of little marks at the start of each interval line. We apply the same color on both the interval mark and the axis line. Then we add the ticks minorTickLength and minorTickWidth into the semi-interval lines in a smaller size. 
This gives a nice measurement mark effect along the axis, as shown in the following screenshot: Now, we apply a similar polish to the xAxis configuration, as follows:            xAxis: {                type: 'datetime',                labels: {                    formatter: function() {                        return Highcharts.dateFormat('%I:%M %P', this.value);                    },                },                gridLineDashStyle: 'dot',                gridLineWidth: 1,                tickInterval: 60 * 60 * 1000,                lineWidth: 2,                lineColor: '#92A8CD',                tickWidth: 3,                tickLength: 6,                tickColor: '#92A8CD',            }, We set the x axis interval lines to the hourly format and switch the line style to a dotted line. Then, we apply the same color, thickness, and interval ticks as on the y axis. The following is the resulting screenshot: However, there are some defects along the x axis line. To begin with, the meeting point between the x axis and y axis lines does not align properly. Secondly, the interval labels at the x axis are touching the interval ticks. Finally, part of the first data point is covered by the y-axis line. The following is an enlarged screenshot showing the issues: There are two ways to resolve the axis line alignment problem, as follows: Shift the plot area 1 pixel away from the x axis. This can be achieved by setting the offset property of xAxis to 1. Increase the x-axis line width to 3 pixels, which is the same width as the y-axis tick interval. As for the x-axis label, we can simply solve the problem by introducing the y offset value into the labels setting. Finally, to avoid the first data point touching the y-axis line, we can impose minPadding on the x axis. What this does is to add padding space at the minimum value of the axis, the first point. The minPadding value is based on the ratio of the graph width. In this case, setting the property to 0.02 is equivalent to shifting along the x axis 5 pixels to the right (250 px * 0.02). The following are the additional settings to improve the chart:     xAxis: {         ....         labels: {                formatter: ...,                y: 17         },         .....         minPadding: 0.02,         offset: 1     } The following screenshot shows that the issues have been addressed: As we can see, Highcharts has a comprehensive set of configurable variables with great flexibility. Using plot lines and plot bands In this section, we are going to see how we can use Highcharts to place lines or bands along the axis. We will continue with the example from the previous section. Let's draw a couple of lines to indicate the day's highest and lowest index points on the y axis. The plotLines field accepts an array of object configurations for each plot line. There are no width and color default values for plotLines, so we need to specify them explicitly in order to see the line. The following is the code snippet for the plot lines:       yAxis: {               ... 
,               plotLines: [{                    value: 2606.01,                    width: 2,                    color: '#821740',                    label: {                        text: 'Lowest: 2606.01',                        style: {                            color: '#898989'                        }                    }               }, {                    value: 2639.15,                    width: 2,                    color: '#4A9338',                    label: {                        text: 'Highest: 2639.15',                        style: {                            color: '#898989'                        }                    }               }]         } The following screenshot shows what it should look like: We can improve the look of the chart slightly. First, the text label for the top plot line should not be next to the highest point. Second, the label for the bottom line should be remotely covered by the series and interval lines, as follows: To resolve these issues, we can assign the plot line's zIndex to 1, which brings the text label above the interval lines. We also set the x position of the label to shift the text next to the point. The following are the new changes:              plotLines: [{                    ... ,                    label: {                        ... ,                        x: 25                    },                    zIndex: 1                    }, {                    ... ,                    label: {                        ... ,                        x: 130                    },                    zIndex: 1               }] The following graph shows the label has been moved away from the plot line and over the interval line: Now, we are going to change the preceding example with a plot band area that shows the index change between the market's opening and closing values. The plot band configuration is very similar to plot lines, except that it uses the to and from properties, and the color property accepts gradient settings or color code. We create a plot band with a triangle text symbol and values to signify a positive close. Instead of using the x and y properties to fine-tune label position, we use the align option to adjust the text to the center of the plot area (replace the plotLines setting from the above example):               plotBands: [{                    from: 2606.01,                    to: 2615.98,                    label: {                        text: '▲ 9.97 (0.38%)',                        align: 'center',                        style: {                            color: '#007A3D'                        }                    },                    zIndex: 1,                    color: {                        linearGradient: {                            x1: 0, y1: 1,                            x2: 1, y2: 1                        },                        stops: [ [0, '#EBFAEB' ],                                 [0.5, '#C2F0C2'] ,                                 [0.8, '#ADEBAD'] ,                                 [1, '#99E699']                        ]                    }               }] The triangle is an alt-code character; hold down the left Alt key and enter 30 in the number keypad. See http://www.alt-codes.net for more details. This produces a chart with a green plot band highlighting a positive close in the market, as shown in the following screenshot: Extending to multiple axes Previously, we ran through most of the axis configurations. 
Here, we explore how we can use multiple axes, which are just an array of objects containing axis configurations. Continuing from the previous stock market example, suppose we now want to include another market index, Dow Jones, along with Nasdaq. However, both indices are different in nature, so their value ranges are vastly different. First, let's examine the outcome by displaying both indices with the common y axis. We change the title, remove the fixed interval setting on the y axis, and include data for another series:             chart: ... ,             title: {                 text: 'Market Data: Nasdaq & Dow Jones'             },             subtitle: ... ,             xAxis: ... ,             credits: ... ,             yAxis: {                 title: {                     text: null                 },                 minorGridLineColor: '#D8D8D8',                 minorGridLineDashStyle: 'dashdot',                 gridLineColor: '#8AB8E6',                 alternateGridColor: {                     linearGradient: {                         x1: 0, y1: 1,                         x2: 1, y2: 1                     },                     stops: [ [0, '#FAFCFF' ],                              [0.5, '#F5FAFF'] ,                              [0.8, '#E0F0FF'] ,                              [1, '#D6EBFF'] ]                 },                 lineWidth: 2,                 lineColor: '#92A8CD',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#92A8CD',                 minorTickLength: 3,                 minorTickWidth: 1,                 minorTickColor: '#D8D8D8'             },             series: [{               name: 'Nasdaq',               color: '#4572A7',               data: [ [ Date.UTC(2012, 4, 11, 9, 30), 2606.01 ],                          [ Date.UTC(2012, 4, 11, 10), 2622.08 ],                           [ Date.UTC(2012, 4, 11, 10, 30), 2636.03 ],                          ...                        ]             }, {               name: 'Dow Jones',               color: '#AA4643',               data: [ [ Date.UTC(2012, 4, 11, 9, 30), 12598.32 ],                          [ Date.UTC(2012, 4, 11, 10), 12538.61 ],                           [ Date.UTC(2012, 4, 11, 10, 30), 12549.89 ],                          ...                        ]             }] The following is the chart showing both market indices: As expected, the index changes that occur during the day have been normalized by the vast differences in value. Both lines look roughly straight, which falsely implies that the indices have hardly changed. Let us now explore putting both indices onto separate y axes. We should remove any background decoration on the y axis, because we now have a different range of data shared on the same background. The following is the new setup for yAxis:            yAxis: [{                  title: {                     text: 'Nasdaq'                 },               }, {                 title: {                     text: 'Dow Jones'                 },                 opposite: true             }], Now yAxis is an array of axis configurations. The first entry in the array is for Nasdaq and the second is for Dow Jones. This time, we display the axis title to distinguish between them. The opposite property is to put the Dow Jones y axis onto the other side of the graph for clarity. Otherwise, both y axes appear on the left-hand side. 
The next step is to align indices from the y-axis array to the series data array, as follows:             series: [{                 name: 'Nasdaq',                 color: '#4572A7',                 yAxis: 0,                 data: [ ... ]             }, {                 name: 'Dow Jones',                 color: '#AA4643',                 yAxis: 1,                 data: [ ... ]             }]          We can clearly see the movement of the indices in the new graph, as follows: Moreover, we can improve the final view by color-matching the series to the axis lines. The Highcharts.getOptions().colors property contains a list of default colors for the series, so we use the first two entries for our indices. Another improvement is to set maxPadding for the x axis, because the new y-axis line covers parts of the data points at the high end of the x axis:             xAxis: {                 ... ,                 minPadding: 0.02,                 maxPadding: 0.02                 },             yAxis: [{                 title: {                     text: 'Nasdaq'                 },                 lineWidth: 2,                 lineColor: '#4572A7',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#4572A7'             }, {                 title: {                     text: 'Dow Jones'                 },                 opposite: true,                 lineWidth: 2,                 lineColor: '#AA4643',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#AA4643'             }], The following screenshot shows the improved look of the chart: We can extend the preceding example and have more than a couple of axes, simply by adding entries into the yAxis and series arrays, and mapping both together. The following screenshot shows a 4-axis line graph: Summary In this article, major configuration components were discussed and experimented with, and examples shown. By now, we should be comfortable with what we have covered already and ready to plot some of the basic graphs with more elaborate styles. Resources for Article: Further resources on this subject: Theming with Highcharts [article] Integrating with other Frameworks [article] Highcharts [article]

Evolution of Hadoop

Packt
29 Dec 2014
12 min read
 In this article by Sandeep Karanth, author of the book Mastering Hadoop, we will see about the Hadoop's timeline, Hadoop 2.X and Hadoop YARN. Hadoop's timeline The following figure gives a timeline view of the major releases and milestones of Apache Hadoop. The project has been there for 8 years, but the last 4 years has seen Hadoop make giant strides in big data processing. In January 2010, Google was awarded a patent for the MapReduce technology. This technology was licensed to the Apache Software Foundation 4 months later, a shot in the arm for Hadoop. With legal complications out of the way, enterprises—small, medium, and large—were ready to embrace Hadoop. Since then, Hadoop has come up with a number of major enhancements and releases. It has given rise to businesses selling Hadoop distributions, support, training, and other services. Hadoop 1.0 releases, referred to as 1.X in this book, saw the inception and evolution of Hadoop as a pure MapReduce job-processing framework. It has exceeded its expectations with a wide adoption of massive data processing. The stable 1.X release at this point of time is 1.2.1, which includes features such as append and security. Hadoop 1.X tried to stay flexible by making changes, such as HDFS append, to support online systems such as HBase. Meanwhile, big data applications evolved in range beyond MapReduce computation models. The flexibility of Hadoop 1.X releases had been stretched; it was no longer possible to widen its net to cater to the variety of applications without architectural changes. Hadoop 2.0 releases, referred to as 2.X in this book, came into existence in 2013. This release family has major changes to widen the range of applications Hadoop can solve. These releases can even increase efficiencies and mileage derived from existing Hadoop clusters in enterprises. Clearly, Hadoop is moving fast beyond MapReduce to stay as the leader in massive scale data processing with the challenge of being backward compatible. It is becoming a generic cluster-computing and storage platform from being only a MapReduce-specific framework. Hadoop 2.X The extensive success of Hadoop 1.X in organizations also led to the understanding of its limitations, which are as follows: Hadoop gives unprecedented access to cluster computational resources to every individual in an organization. The MapReduce programming model is simple and supports a develop once deploy at any scale paradigm. This leads to users exploiting Hadoop for data processing jobs where MapReduce is not a good fit, for example, web servers being deployed in long-running map jobs. MapReduce is not known to be affable for iterative algorithms. Hacks were developed to make Hadoop run iterative algorithms. These hacks posed severe challenges to cluster resource utilization and capacity planning. Hadoop 1.X has a centralized job flow control. Centralized systems are hard to scale as they are the single point of load lifting. JobTracker failure means that all the jobs in the system have to be restarted, exerting extreme pressure on a centralized component. Integration of Hadoop with other kinds of clusters is difficult with this model. The early releases in Hadoop 1.X had a single NameNode that stored all the metadata about the HDFS directories and files. The data on the entire cluster hinged on this single point of failure. Subsequent releases had a cold standby in the form of a secondary NameNode. 
The secondary NameNode merged the edit logs and NameNode image files, periodically bringing in two benefits. One, the primary NameNode startup time was reduced as the NameNode did not have to do the entire merge on startup. Two, the secondary NameNode acted as a replica that could minimize data loss on NameNode disasters. However, the secondary NameNode (secondary NameNode is not a backup node for NameNode) was still not a hot standby, leading to high failover and recovery times and affecting cluster availability. Hadoop 1.X is mainly a Unix-based massive data processing framework. Native support on machines running Microsoft Windows Server is not possible. With Microsoft entering cloud computing and big data analytics in a big way, coupled with existing heavy Windows Server investments in the industry, it's very important for Hadoop to enter the Microsoft Windows landscape as well. Hadoop's success comes mainly from enterprise play. Adoption of Hadoop mainly comes from the availability of enterprise features. Though Hadoop 1.X tries to support some of them, such as security, there is a list of other features that are badly needed by the enterprise. Yet Another Resource Negotiator (YARN) In Hadoop 1.X, resource allocation and job execution were the responsibilities of JobTracker. Since the computing model was closely tied to the resources in the cluster, MapReduce was the only supported model. This tight coupling led to developers force-fitting other paradigms, leading to unintended use of MapReduce. The primary goal of YARN is to separate concerns relating to resource management and application execution. By separating these functions, other application paradigms can be added onboard a Hadoop computing cluster. Improvements in interoperability and support for diverse applications lead to efficient and effective utilization of resources. It integrates well with the existing infrastructure in an enterprise. Achieving loose coupling between resource management and job management should not be at the cost of loss in backward compatibility. For almost 6 years, Hadoop has been the leading software to crunch massive datasets in a parallel and distributed fashion. This means huge investments in development; testing and deployment were already in place. YARN maintains backward compatibility with Hadoop 1.X (hadoop-0.20.205+) APIs. An older MapReduce program can continue execution in YARN with no code changes. However, recompiling the older code is mandatory. Architecture overview The following figure lays out the architecture of YARN. YARN abstracts out resource management functions to a platform layer called ResourceManager (RM). There is a per-cluster RM that primarily keeps track of cluster resource usage and activity. It is also responsible for allocation of resources and resolving contentions among resource seekers in the cluster. RM uses a generalized resource model and is agnostic to application-specific resource needs. For example, RM need not know the resources corresponding to a single Map or Reduce slot. Planning and executing a single job is the responsibility of ApplicationMaster (AM). There is an AM instance per running application. For example, there is an AM for each MapReduce job. It has to request for resources from the RM, use them to execute the job, and work around failures, if any. The general cluster layout has RM running as a daemon on a dedicated machine with a global view of the cluster and its resources. 
Being a global entity, RM can ensure fairness depending on the resource utilization of the cluster resources. When requested for resources, RM allocates them dynamically as a node-specific bundle called a container. For example, 2 CPUs and 4 GB of RAM on a particular node can be specified as a container. Every node in the cluster runs a daemon called NodeManager (NM). RM uses NM as its node local assistant. NMs are used for container management functions, such as starting and releasing containers, tracking local resource usage, and fault reporting. NMs send heartbeats to RM. The RM view of the system is the aggregate of the views reported by each NM. Jobs are submitted directly to RMs. Based on resource availability, jobs are scheduled to run by RMs. The metadata of the jobs are stored in persistent storage to recover from RM crashes. When a job is scheduled, RM allocates a container for the AM of the job on a node in the cluster. AM then takes over orchestrating the specifics of the job. These specifics include requesting resources, managing task execution, optimizations, and handling tasks or job failures. AM can be written in any language, and different versions of AM can execute independently on a cluster. An AM resource request contains specifications about the locality and the kind of resource expected by it. RM puts in its best effort to satisfy AM's needs based on policies and availability of resources. When a container is available for use by AM, it can launch application-specific code in this container. The container is free to communicate with its AM. RM is agnostic to this communication. Storage layer enhancements A number of storage layer enhancements were undertaken in the Hadoop 2.X releases. The number one goal of the enhancements was to make Hadoop enterprise ready. High availability NameNode is a directory service for Hadoop and contains metadata pertaining to the files within cluster storage. Hadoop 1.X had a secondary Namenode, a cold standby that needed minutes to come up. Hadoop 2.X provides features to have a hot standby of NameNode. On the failure of an active NameNode, the standby can become the active Namenode in a matter of minutes. There is no data loss or loss of NameNode service availability. With hot standbys, automated failover becomes easier too. The key to keep the standby in a hot state is to keep its data as current as possible with respect to the active Namenode. This is achieved by reading the edit logs of the active NameNode and applying it onto itself with very low latency. The sharing of edit logs can be done using the following two methods: A shared NFS storage directory between the active and standby NameNodes: the active writes the logs to the shared location. The standby monitors the shared directory and pulls in the changes. A quorum of Journal Nodes: the active NameNode presents its edits to a subset of journal daemons that record this information. The standby node constantly monitors these journal daemons for updates and syncs the state with itself. The following figure shows the high availability architecture using a quorum of Journal Nodes. The data nodes themselves send block reports directly to both the active and standby NameNodes: Zookeeper or any other High Availability monitoring service can be used to track NameNode failures. With the assistance of Zookeeper, failover procedures to promote the hot standby as the active NameNode can be triggered. 
HDFS Federation Similar to what YARN did to Hadoop's computation layer, a more generalized storage model has been implemented in Hadoop 2.X. The block storage layer has been generalized and separated out from the filesystem layer. This separation has given an opening for other storage services to be integrated into a Hadoop cluster. Previously, HDFS and the block storage layer were tightly coupled. One use case that has come forth from this generalized storage model is HDFS Federation. Federation allows multiple HDFS namespaces to use the same underlying storage. Federated NameNodes provide isolation at the filesystem level. HDFS snapshots Snapshots are point-in-time, read-only images of the entire or a particular subset of a filesystem. Snapshots are taken for three general reasons: Protection against user errors Backup Disaster recovery Snapshotting is implemented only on NameNode. It does not involve copying data from the data nodes. It is a persistent copy of the block list and file size. The process of taking a snapshot is almost instantaneous and does not affect the performance of NameNode. Other enhancements There are a number of other enhancements in Hadoop 2.X, which are as follows: The wire protocol for RPCs within Hadoop is now based on Protocol Buffers. Previously, Java serialization via Writables was used. This improvement not only eases maintaining backward compatibility, but also aids in rolling the upgrades of different cluster components. RPCs allow for client-side retries as well. HDFS in Hadoop 1.X was agnostic about the type of storage being used. Mechanical or SSD drives were treated uniformly. The user did not have any control on data placement. Hadoop 2.X releases in 2014 are aware of the type of storage and expose this information to applications as well. Applications can use this to optimize their data fetch and placement strategies. HDFS append support has been brought into Hadoop 2.X. HDFS access in Hadoop 1.X releases has been through HDFS clients. In Hadoop 2.X, support for NFSv3 has been brought into the NFS gateway component. Clients can now mount HDFS onto their compatible local filesystem, allowing them to download and upload files directly to and from HDFS. Appends to files are allowed, but random writes are not. A number of I/O improvements have been brought into Hadoop. For example, in Hadoop 1.X, clients collocated with data nodes had to read data via TCP sockets. However, with short-circuit local reads, clients can directly read off the data nodes. This particular interface also supports zero-copy reads. The CRC checksum that is calculated for reads and writes of data has been optimized using the Intel SSE4.2 CRC32 instruction. Support enhancements Hadoop is also widening its application net by supporting other platforms and frameworks. One dimension we saw was onboarding of other computational models with YARN or other storage systems with the Block Storage layer. The other enhancements are as follows: Hadoop 2.X supports Microsoft Windows natively. This translates to a huge opportunity to penetrate the Microsoft Windows server land for massive data processing. This was partially possible because of the use of the highly portable Java programming language for Hadoop development. The other critical enhancement was the generalization of compute and storage management to include Microsoft Windows. As part of Platform-as-a-Service offerings, cloud vendors give out on-demand Hadoop as a service. 
OpenStack support in Hadoop 2.X makes it conducive for deployment in elastic and virtualized cloud environments. Summary In this article, we saw the evolution of Hadoop and some of its milestones and releases. We went into depth on Hadoop 2.X and the changes it brings into Hadoop. The key takeaways from this article are: In over 6 years of its existence, Hadoop has become the number one choice as a framework for massively parallel and distributed computing. The community has been shaping Hadoop to gear up for enterprise use. In 1.X releases, HDFS append and security, were the key features that made Hadoop enterprise-friendly. Hadoop's storage layer was enhanced in 2.X to separate the filesystem from the block storage service. This enables features such as supporting multiple namespaces and integration with other filesystems. 2.X shows improvements in Hadoop storage availability and snapshotting. Resources for Article: Further resources on this subject: Securing the Hadoop Ecosystem [article] Sizing and Configuring your Hadoop Cluster [article] HDFS and MapReduce [article]

Creating a Map

Packt
29 Dec 2014
11 min read
In this article by Thomas Newton and Oscar Villarreal, authors of the book Learning D3.js Mapping, we will cover the following topics through a series of experiments: Foundation – creating your basic map Experiment 1 – adjusting the bounding box Experiment 2 – creating choropleths Experiment 3 – adding click events to our visualization (For more resources related to this topic, see here.) Foundation – creating your basic map In this section, we will walk through the basics of creating a standard map. Let's walk through the code to get a step-by-step explanation of how to create this map. The width and height can be anything you want. Depending on where your map will be visualized (cellphones, tablets, or desktops), you might want to consider providing a different width and height: var height = 600; var width = 900; The next variable defines a projection algorithm that allows you to go from a cartographic space (latitude and longitude) to a Cartesian space (x,y)—basically a mapping of latitude and longitude to coordinates. You can think of a projection as a way to map the three-dimensional globe to a flat plane. There are many kinds of projections, but geo.mercator is normally the default value you will use: var projection = d3.geo.mercator(); var mexico = void 0; If you were making a map of the USA, you could use a better projection called albersUsa. This is to better position Alaska and Hawaii. By creating a geo.mercator projection, Alaska would render proportionate to its size, rivaling that of the entire US. The albersUsa projection grabs Alaska, makes it smaller, and puts it at the bottom of the visualization. The following screenshot is of geo.mercator:   This following screenshot is of geo.albersUsa:   The D3 library currently contains nine built-in projection algorithms. An overview of each one can be viewed at https://github.com/mbostock/d3/wiki/Geo-Projections. Next, we will assign the projection to our geo.path function. This is a special D3 function that will map the JSON-formatted geographic data into SVG paths. The data format that the geo.path function requires is named GeoJSON: var path = d3.geo.path().projection(projection); var svg = d3.select("#map")    .append("svg")    .attr("width", width)    .attr("height", height); Including the dataset The necessary data has been provided for you within the data folder with the filename geo-data.json: d3.json('geo-data.json', function(data) { console.log('mexico', data); We get the data from an AJAX call. After the data has been collected, we want to draw only those parts of the data that we are interested in. In addition, we want to automatically scale the map to fit the defined height and width of our visualization. If you look at the console, you'll see that "mexico" has an objects property. Nested inside the objects property is MEX_adm1. This stands for the administrative areas of Mexico. It is important to understand the geographic data you are using, because other data sources might have different names for the administrative areas property:   Notice that the MEX_adm1 property contains a geometries array with 32 elements. Each of these elements represents a state in Mexico. Use this data to draw the D3 visualization. var states = topojson.feature(data, data.objects.MEX_adm1); Here, we pass all of the administrative areas to the topojson.feature function in order to extract and create an array of GeoJSON objects. The preceding states variable now contains the features property. 
This features array is a list of 32 GeoJSON elements, each representing the geographic boundaries of a state in Mexico. We will set an initial scale and translation to 1 and 0,0 respectively: // Setup the scale and translate projection.scale(1).translate([0, 0]); This algorithm is quite useful. The bounding box is a spherical box that returns a two-dimensional array of min/max coordinates, inclusive of the geographic data passed: var b = path.bounds(states); To quote the D3 documentation: "The bounding box is represented by a two-dimensional array: [[left, bottom], [right, top]], where left is the minimum longitude, bottom is the minimum latitude, right is maximum longitude, and top is the maximum latitude." This is very helpful if you want to programmatically set the scale and translation of the map. In this case, we want the entire country to fit in our height and width, so we determine the bounding box of every state in the country of Mexico. The scale is calculated by taking the longest geographic edge of our bounding box and dividing it by the number of pixels of this edge in the visualization: var s = .95 / Math.max((b[1][0] - b[0][0]) / width, (b[1][1] - b[0][1]) / height); This can be calculated by first computing the scale of the width, then the scale of the height, and, finally, taking the larger of the two. All of the logic is compressed into the single line given earlier. The three steps are explained in the following image:   The value 95 adjusts the scale, because we are giving the map a bit of a breather on the edges in order to not have the paths intersect the edges of the SVG container item, basically reducing the scale by 5 percent. Now, we have an accurate scale of our map, given our set width and height. var t = [(width - s * (b[1][0] + b[0][0])) / 2, (height - s * (b[1][1] + b[0][1])) / 2]; When we scale in SVG, it scales all the attributes (even x and y). In order to return the map to the center of the screen, we will use the translate function. The translate function receives an array with two parameters: the amount to translate in x, and the amount to translate in y. We will calculate x by finding the center (topRight – topLeft)/2 and multiplying it by the scale. The result is then subtracted from the width of the SVG element. Our y translation is calculated similarly but using the bottomRight – bottomLeft values divided by 2, multiplied by the scale, then subtracted from the height. Finally, we will reset the projection to use our new scale and translation: projection.scale(s).translate(t); Here, we will create a map variable that will group all of the following SVG elements into a <g> SVG tag. This will allow us to apply styles and better contain all of the proceeding paths' elements: var map = svg.append('g').attr('class', 'boundary'); Finally, we are back to the classic D3 enter, update, and exit pattern. We have our data, the list of Mexico states, and we will join this data to the path SVG element:    mexico = map.selectAll('path').data(states.features);      //Enter    mexico.enter()        .append('path')        .attr('d', path); The enter section and the corresponding path functions are executed on every data element in the array. As a refresher, each element in the array represents a state in Mexico. The path function has been set up to correctly draw the outline of each state as well as scale and translate it to fit in our SVG container. Congratulations! You have created your first map! 
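Before moving on to the experiments, it can help to see the fit-to-viewport steps gathered in one place. The following is only a consolidated sketch of the code walked through above (D3 v3 API); the helper name fitProjection is ours, and it assumes the projection, path, width, height, and states variables defined earlier:

// Sketch only: recompute scale and translate so a feature collection fills ~95% of the SVG
var fitProjection = function(featureCollection) {
    // Reset so that bounds are computed against an unscaled, untranslated projection
    projection.scale(1).translate([0, 0]);
    var b = path.bounds(featureCollection); // bounding box of the projected features
    var s = .95 / Math.max((b[1][0] - b[0][0]) / width,
                           (b[1][1] - b[0][1]) / height);
    var t = [(width - s * (b[1][0] + b[0][0])) / 2,
             (height - s * (b[1][1] + b[0][1])) / 2];
    // Apply the computed scale and translation
    projection.scale(s).translate(t);
};

fitProjection(states);                 // fit the whole country
// fitProjection(states.features[5]); // or a single state, as in Experiment 1

Wrapping the logic in a function like this makes Experiment 1 a one-line change rather than an edit to the setup code.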
Experiment 1 – adjusting the bounding box Now that we have our foundation, let's start with our first experiment. For this experiment, we will manually zoom in to a state of Mexico using what we learned in the previous section. For this experiment, we will modify one line of code: var b = path.bounds(states.features[5]); Here, we are telling the calculation to create a boundary based on the sixth element of the features array instead of every state in the country of Mexico. The boundaries data will now run through the rest of the scaling and translation algorithms to adjust the map to the one shown in the following screenshot:   We have basically reduced the min/max of the boundary box to include the geographic coordinates for one state in Mexico (see the next screenshot), and D3 has scaled and translated this information for us automatically:   This can be very useful in situations where you might not have the data that you need in isolation from the surrounding areas. Hence, you can always zoom in to your geography of interest and isolate it from the rest. Experiment 2 – creating choropleths One of the most common uses of D3.js maps is to make choropleths. This visualization gives you the ability to discern between regions, giving them a different color. Normally, this color is associated with some other value, for instance, levels of influenza or a company's sales. Choropleths are very easy to make in D3.js. In this experiment, we will create a quick choropleth based on the index value of the state in the array of all the states. We will only need to modify two lines of code in the update section of our D3 code. Right after the enter section, add the following two lines: //Update var color = d3.scale.linear().domain([0,33]).range(['red',   'yellow']); mexico.attr('fill', function(d,i) {return color(i)}); The color variable uses another valuable D3 function named scale. Scales are extremely powerful when creating visualizations in D3; much more detail on scales can be found at https://github.com/mbostock/d3/wiki/Scales. For now, let's describe what this scale defines. Here, we created a new function called color. This color function looks for any number between 0 and 33 in an input domain. D3 linearly maps these input values to a color between red and yellow in the output range. D3 has included the capability to automatically map colors in a linear range to a gradient. This means that executing the new function, color, with 0 will return the color red, color(15) will return an orange color, and color(33) will return yellow. Now, in the update section, we will set the fill property of the path to the new color function. This will provide a linear scale of colors and use the index value i to determine what color should be returned. If the color was determined by a different value of the datum, for instance, d.sales, then you would have a choropleth where the colors actually represent sales. The preceding code should render something as follows: Experiment 3 – adding click events to our visualization We've seen how to make a map and set different colors to the different regions of this map. Next, we will add a little bit of interactivity. This will illustrate a simple reference to bind click events to maps. First, we need a quick reference to each state in the country. 
To accomplish this, we will create a new function called geoID right below the mexico variable: var height = 600; var width = 900; var projection = d3.geo.mercator(); var mexico = void 0;   var geoID = function(d) {    return "c" + d.properties.ID_1; }; This function takes in a state data element and generates a new selectable ID based on the ID_1 property found in the data. The ID_1 property contains a unique numeric value for every state in the array. If we insert this as an id attribute into the DOM, then we would create a quick and easy way to select each state in the country. The following is the geoID function, creating another function called click: var click = function(d) {    mexico.attr('fill-opacity', 0.2); // Another update!    d3.select('#' + geoID(d)).attr('fill-opacity', 1); }; This method makes it easy to separate what the click is doing. The click method receives the datum and changes the fill opacity value of all the states to 0.2. This is done so that when you click on one state and then on the other, the previous state does not maintain the clicked style. Notice that the function call is iterating through all the elements of the DOM, using the D3 update pattern. After making all the states transparent, we will set a fill-opacity of 1 for the given clicked item. This removes all the transparent styling from the selected state. Notice that we are reusing the geoID function that we created earlier to quickly find the state element in the DOM. Next, let's update the enter method to bind our new click method to every new DOM element that enter appends: //Enter mexico.enter()      .append('path')      .attr('d', path)      .attr('id', geoID)      .on("click", click); We also added an attribute called id; this inserts the results of the geoID function into the id attribute. Again, this makes it very easy to find the clicked state. The code should produce a map as follows. Check it out and make sure that you click on any of the states. You will see its color turn a little brighter than the surrounding states. Summary You learned how to build many different kinds of maps that cover different kinds of needs. Choropleths and data visualizations on maps are some of the most common geographic-based data representations that you will come across. Resources for Article: Further resources on this subject: Using Canvas and D3 [article] Interacting with your Visualization [article] Simple graphs with d3.js [article]

Cassandra High Availability: Replication

Packt
24 Dec 2014
5 min read
This article by Robbie Strickland, the author of Cassandra High Availability, describes the data replication architecture used in Cassandra. Replication is perhaps the most critical feature of a distributed data store, as it would otherwise be impossible to make any sort of availability guarantee in the face of a node failure. As you already know, Cassandra employs a sophisticated replication system that allows fine-grained control over replica placement and consistency guarantees. In this article, we'll explore Cassandra's replication mechanism in depth. Let's start with the basics: how Cassandra determines the number of replicas to be created and where to locate them in the cluster. We'll begin the discussion with a feature that you'll encounter the very first time you create a keyspace: the replication factor. (For more resources related to this topic, see here.) The replication factor On the surface, setting the replication factor seems to be a fundamentally straightforward idea. You configure Cassandra with the number of replicas you want to maintain (during keyspace creation), and the system dutifully performs the replication for you, thus protecting you when something goes wrong. So by defining a replication factor of three, you will end up with a total of three copies of the data. There are a number of variables in this equation. Let's start with the basic mechanics of setting the replication factor. Replication strategies One thing you'll quickly notice is that the semantics to set the replication factor depend on the replication strategy you choose. The replication strategy tells Cassandra exactly how you want replicas to be placed in the cluster. There are two strategies available: SimpleStrategy: This strategy is used for single data center deployments. It is fine to use this for testing, development, or simple clusters, but discouraged if you ever intend to expand to multiple data centers (including virtual data centers such as those used to separate analysis workloads). NetworkTopologyStrategy: This strategy is used when you have multiple data centers, or if you think you might have multiple data centers in the future. In other words, you should use this strategy for your production cluster. SimpleStrategy As a way of introducing this concept, we'll start with an example using SimpleStrategy. The following Cassandra Query Language (CQL) block will allow us to create a keyspace called AddressBook with three replicas: CREATE KEYSPACE AddressBookWITH REPLICATION = {   'class' : 'SimpleStrategy',   'replication_factor' : 3}; The data is assigned to a node via a hash algorithm, resulting in each node owning a range of data. Let's take another look at the placement of our example data on the cluster. Remember the keys are first names, and we determined the hash using the Murmur3 hash algorithm. The primary replica for each key is assigned to a node based on its hashed value. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). While using SimpleStrategy, Cassandra will locate the first replica on the owner node (the one determined by the hash algorithm), then walk the ring in a clockwise direction to place each additional replica, as follows: Additional replicas are placed in adjacent nodes when using manually assigned tokens In the preceding diagram, the keys in bold represent the primary replicas (the ones placed on the owner nodes), with subsequent replicas placed in adjacent nodes, moving clockwise from the primary. 
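As an aside that is not part of the original text, the replication settings of an existing keyspace can be inspected and changed from cqlsh; the following is a hedged sketch reusing the AddressBook keyspace created above. Note that DESCRIBE is a cqlsh shell command rather than CQL proper, and that raising the replication factor on a live cluster also requires a repair so the new replicas actually receive data:

-- Sketch only: inspect and adjust replication for the AddressBook keyspace
DESCRIBE KEYSPACE AddressBook;

ALTER KEYSPACE AddressBook
WITH REPLICATION = {
   'class' : 'SimpleStrategy',
   'replication_factor' : 3
};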
Although each node owns a set of keys based on its token range(s), there is no concept of a master replica. In Cassandra, unlike many other database designs, every replica is equal. This means reads and writes can be made to any node that holds a replica of the requested key. If you have a small cluster where all nodes reside in a single rack inside one data center, SimpleStrategy will do the job. This makes it the right choice for local installations, development clusters, and other similar simple environments where expansion is unlikely, because there is no need to configure a snitch (which will be covered later in this section). For production clusters, however, it is highly recommended that you use NetworkTopologyStrategy instead. This strategy provides a number of important features for more complex installations where availability and performance are paramount.
NetworkTopologyStrategy
When it's time to deploy your live cluster, NetworkTopologyStrategy offers two additional properties that make it more suitable for this purpose:
Rack awareness: Unlike SimpleStrategy, which places replicas naively, this feature attempts to ensure that replicas are placed in different racks, thus preventing service interruption or data loss due to failures of switches, power, cooling, and other similar events that tend to affect single racks of machines.
Configurable snitches: A snitch helps Cassandra to understand the topology of the cluster. There are a number of snitch options for any type of network configuration.
Here's a basic example of a keyspace using NetworkTopologyStrategy:
CREATE KEYSPACE AddressBook
WITH REPLICATION = {
   'class' : 'NetworkTopologyStrategy',
   'dc1' : 3,
   'dc2' : 2
};
In this example, we're telling Cassandra to place three replicas in a data center called dc1 and two replicas in a second data center called dc2.
Summary
In this article, we introduced the foundational concepts of replication and consistency. In our discussion, we outlined the importance of the relationship between replication factor and consistency level, and their impact on performance, data consistency, and availability. By now, you should be able to make sound decisions specific to your use cases. This article might serve as a handy reference in the future, as it can be challenging to keep all these details in mind. Resources for Article: Further resources on this subject: An overview of architecture and modeling in Cassandra [Article] Basic Concepts and Architecture of Cassandra [Article] About Cassandra [Article]

Analyzing Data

Packt
24 Dec 2014
13 min read
In this article by Amarpreet Singh Bassan and Debarchan Sarkar, authors of Mastering SQL Server 2014 Data Mining, we will begin our discussion with an introduction to the data mining life cycle, and this article will focus on its first three stages. You are expected to have basic understanding of the Microsoft business intelligence stack and familiarity of terms such as extract, transform, and load (ETL), data warehouse, and so on. (For more resources related to this topic, see here.) Data mining life cycle Before going into further details, it is important to understand the various stages of the data mining life cycle. The data mining life cycle can be broadly classified into the following steps: Understanding the business requirement. Understanding the data. Preparing the data for the analysis. Preparing the data mining models. Evaluating the results of the analysis prepared with the models. Deploying the models to the SQL Server Analysis Services Server. Repeating steps 1 to 6 in case the business requirement changes. Let's look at each of these stages in detail. The first and foremost task that needs to be well defined even before beginning the mining process is to identify the goals. This is a crucial part of the data mining exercise and you need to understand the following questions: What and whom are we targeting? What is the outcome we are targeting? What is the time frame for which we have the data and what is the target time period that our data is going to forecast? What would the success measures look like? Let's define a classic problem and understand more about the preceding questions. We can use them to discuss how to extract the information rather than spending our time on defining the schema. Consider an instance where you are a salesman for the AdventureWorks Cycle company, and you need to make predictions that could be used in marketing the products. The problem sounds simple and straightforward, but any serious data miner would immediately come up with many questions. Why? The answer lies in the exactness of the information being searched for. Let's discuss this in detail. The problem statement comprises the words predictions and marketing. When we talk about predictions, there are several insights that we seek, namely: What is it that we are predicting? (for example: customers, product sales, and so on) What is the time period of the data that we are selecting for prediction? What time period are we going to have the prediction for? What is the expected outcome of the prediction exercise? From the marketing point of view, several follow-up questions that must be answered are as follows: What is our target for marketing, a new product or an older product? Is our marketing strategy product centric or customer centric? Are we going to market our product irrespective of the customer classification, or are we marketing our product according to customer classification? On what timeline in the past is our marketing going to be based on? We might observe that there are many questions that overlap the two categories and therefore, there is an opportunity to consolidate the questions and classify them as follows: What is the population that we are targeting? What are the factors that we will actually be looking at? What is the time period of the past data that we will be looking at? What is the time period in the future that we will be considering the data mining results for? Let's throw some light on these aspects based on the AdventureWorks example. 
We will get answers to the preceding questions and arrive at a more refined problem statement. What is the population that we are targeting? The target population might be classified according to the following aspects: Age Salary Number of kids What are the factors that we are actually looking at? They might be classified as follows: Geographical location: The people living in hilly areas would prefer All Terrain Bikes (ATB) and the population on plains would prefer daily commute bikes. Household: The people living in posh areas would look for bikes with the latest gears and also look for accessories that are state of the art, whereas people in the suburban areas would mostly look for budgetary bikes. Affinity of components: The people who tend to buy bikes would also buy some accessories. What is the time period of the past data that we would be looking at? Usually, the data that we get is quite huge and often consists of the information that we might very adequately label as noise. In order to sieve effective information, we will have to determine exactly how much into the past we should look; for example, we can look at the data for the past year, past two years, or past five years. We also need to decide the future data that we will consider the data mining results for. We might be looking at predicting our market strategy for an upcoming festive season or throughout the year. We need to be aware that market trends change and so does people's needs and requirements. So we need to keep a time frame to refresh our findings to an optimal; for example, the predictions from the past 5 years data can be valid for the upcoming 2 or 3 years depending upon the results that we get. Now that we have taken a closer look into the problem, let's redefine the problem more accurately. AdventureWorks has several stores in various locations and based on the location, we would like to get an insight on the following: Which products should be stocked where? Which products should be stocked together? How much of the products should be stocked? What is the trend of sales for a new product in an area? It is not necessary that we will get answers to all the detailed questions but even if we keep looking for the answers to these questions, there would be several insights that we will get, which will help us make better business decisions. Staging data In this phase, we collect data from all the sources and dump them into a common repository, which can be any database system such as SQL Server, Oracle, and so on. Usually, an organization might have various applications to keep track of the data from various departments, and it is quite possible that all these applications might use a different database system to store the data. Thus, the staging phase is characterized by dumping the data from all the other data storage systems to a centralized repository. Extract, transform, and load This term is most common when we talk about data warehouse. As it is clear, ETL has the following three parts: Extract: The data is extracted from a different source database and other databases that might contain the information that we seek Transform: Some transformation is applied to the data to fit the operational needs, such as cleaning, calculation, removing duplicates, reformatting, and so on Load: The transformed data is loaded into the destination data store database We usually believe that the ETL is only required till we load the data onto the data warehouse but this is not true. 
ETL can be used anywhere that we feel the need to do some transformation of data as shown in the following figure: Data warehouse As evident from the preceding figure, the next stage is the data warehouse. The AdventureWorksDW database is the outcome of the ETL applied to the staging database, which is AdventureWorks. We will now discuss the concepts of data warehousing and some best practices and then relate to these concepts with the help of AdventureWorksDW database. Measures and dimensions There are a few common terminologies you will encounter as you enter the world of data warehousing. They are as follows: Measure: Any business entity that can be aggregated or whose values can be ascertained in a numerical value is termed as measure, for example, sales, number of products, and so on Dimension: This is any business entity that lends some meaning to the measures, for example, in an organization, the quantity of goods sold is a measure but the month is a dimension Schema A schema, basically, determines the relationship of the various entities with each other. There are essentially two types of schema, namely: Star schema: This is a relationship where the measures have a direct relationship with the dimensions. Let's look at an instance wherein a seller has several stores that sell several products. The relationship of the tables based on the star schema will be as shown in the following screenshot: Snowflake schema: This is a relationship wherein the measures may have a direct and indirect relationship with the dimensions. We will be designing a snowflake schema if we want a more detailed drill down of the data. Snowflake schema usually would involve hierarchies, as shown in the following screenshot: Data mart While a data warehouse is a more organization-wide repository of data, extracting data from such a huge repository might well be an uphill task. We segregate the data according to the department or the specialty that the data belongs to, so that we have much smaller sections of the data to work with and extract information from. We call these smaller data warehouses data marts. Let's consider the sales for AdventureWorks cycles. To make any predictions on the sales of AdventureWorks, we will have to group all the tables associated with the sales together in a data mart. Based on the AdventureWorks database, we have the following table in the AdventureWorks sales data mart. 
The Internet sales facts table has the following data: [ProductKey][OrderDateKey][DueDateKey][ShipDateKey][CustomerKey][PromotionKey][CurrencyKey][SalesTerritoryKey][SalesOrderNumber][SalesOrderLineNumber][RevisionNumber][OrderQuantity][UnitPrice][ExtendedAmount][UnitPriceDiscountPct][DiscountAmount][ProductStandardCost][TotalProductCost][SalesAmount][TaxAmt][Freight][CarrierTrackingNumber][CustomerPONumber][OrderDate][DueDate][ShipDate] From the preceding column, we can easily identify that if we need to separate the tables to perform the sales analysis alone, we can safely include the following: Product: This provides the following data: [ProductKey][ListPrice] Date: This provides the following data: [DateKey] Customer: This provides the following data: [CustomerKey] Currency: This provides the following data: [CurrencyKey] Sales territory: This provides the following data: [SalesTerritoryKey] The preceding data will provide the relevant dimensions and the facts that are already contained in the FactInternetSales table and hence, we can easily perform all the analysis pertaining to the sales of the organization. Refreshing data Based on the nature of the business and the requirements of the analysis, refreshing of data can be done either in parts wherein new or incremental data is added to the tables, or we can refresh the entire data wherein the tables are cleaned and filled with new data, which consists of the old and new data. Let's discuss the preceding points in the context of the AdventureWorks database. We will take the employee table to begin with. The following is the list of columns in the employee table: [BusinessEntityID],[NationalIDNumber],[LoginID],[OrganizationNode],[OrganizationLevel],[JobTitle],[BirthDate],[MaritalStatus],[Gender],[HireDate],[SalariedFlag],[VacationHours],[SickLeaveHours],[CurrentFlag],[rowguid],[ModifiedDate] Considering an organization in the real world, we do not have a large number of employees leaving and joining the organization. So, it will not really make sense to have a procedure in place to reload the dimensions, prior to SQL 2008. When it comes to managing the changes in the dimensions table, Slowly Changing Dimensions (SCD) is worth a mention. We will briefly look at the SCD here. There are three types of SCD, namely: Type 1: The older values are overwritten by new values Type 2: A new row specifying the present value for the dimension is inserted Type 3: The column specifying TimeStamp from which the new value is effective is updated Let's take the example of HireDate as a method of keeping track of the incremental loading. We will also have to maintain a small table that will keep a track of the data that is loaded from the employee table. So, we create a table as follows: Create table employee_load_status(HireDate DateTime,LoadStatus varchar); The following script will load the employee table from the AdventureWorks database to the DimEmployee table in the AdventureWorksDW database: With employee_loaded_date(HireDate) as(select ISNULL(Max(HireDate),to_date('01-01-1900','MM-DD-YYYY')) fromemployee_load_status where LoadStatus='success'Union AllSelect ISNULL(min(HireDate),to_date('01-01-1900','MM-DD-YYYY')) fromemployee_load_status where LoadStatus='failed')Insert into DimEmployee select * from employee where HireDate>=(select Min(HireDate) from employee_loaded_date); This will reload all the data from the date of the first failure till the present day. A similar procedure can be followed to load the fact table but there is a catch. 
If we look at the sales table in the AdventureWorks database, we see the following columns:
[BusinessEntityID],[TerritoryID],[SalesQuota],[Bonus],[CommissionPct],[SalesYTD],[SalesLastYear],[rowguid],[ModifiedDate]
The SalesYTD column might change with every passing day, so do we perform a full load every day or do we perform an incremental load based on date? This will depend upon the procedure used to load the data in the sales table and the ModifiedDate column. Assuming the ModifiedDate column reflects the date on which the load was performed, we also see that there is no table in the AdventureWorksDW database that will use the SalesYTD field directly. We will have to apply some transformation to get the values of OrderQuantity, DateOfShipment, and so on. Let's look at this with a simpler example. Consider that we have the following sales table:

Name     SalesAmount   Date
Rama     1000          11-02-2014
Shyama   2000          11-02-2014

Consider that we have the following fact table:

id   SalesAmount   Datekey

We will have to decide whether to apply an incremental load or a complete reload of the table based on our end needs. The entries for the incremental load will look like this:

id   SalesAmount   Datekey
Ra   1000          11-02-2014
Sh   2000          11-02-2014
Ra   4000          12-02-2014
Sh   5000          13-02-2014

A complete reload will appear as shown here:

id   TotalSalesAmount   Datekey
Ra   5000               12-02-2014
Sh   7000               13-02-2014

Notice how the SalesAmount column changes to TotalSalesAmount depending on the load criteria.
Summary
In this article, we've covered the first three steps of any data mining process. We've considered the reasons why we would want to undertake a data mining activity and identified the goal we have in mind. We then looked at staging the data and cleansing it. Resources for Article: Further resources on this subject: Hadoop and SQL [Article] SQL Server Analysis Services – Administering and Monitoring Analysis Services [Article] SQL Server Integration Services (SSIS) [Article]

Hadoop and SQL

Packt
23 Dec 2014
61 min read
In this article by Garry Turkington and Gabriele Modena, the author of the book Learning Hadoop 2. MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. However, it does require a different mindset and some training and experience on the model of breaking processing analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This article will explore the other most common abstraction implemented atop Hadoop SQL. (For more resources related to this topic, see here.) In this article, we will cover the following topics: What the use cases for SQL on Hadoop are and why it is so popular HiveQL, the SQL dialect introduced by Apache Hive Using HiveQL to perform SQL-like analysis of the Twitter dataset How HiveQL can approximate common features of relational databases such as joins and views How HiveQL allows the incorporation of user-defined functions into its queries How SQL on Hadoop complements Pig Other SQL-on-Hadoop products such as Impala and how they differ from Hive Why SQL on Hadoop Until now, we saw how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL. Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop. Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user. This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive. The combination of these attributes is that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of user-defined functions, enabling the base SQL dialect to be customized with business-specific functionality. Other SQL-on-Hadoop solutions Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this article, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala as they have been the most successful. While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. 
We'll generally be looking at aspects of the feature set that is common to both, but if you use both products, it's important to read the latest release notes to understand the differences. Prerequisites Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this article. We'll create a modified version of a former Pig script as the main functionality for this. The script in this article assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows: -- load JSON data tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad'); -- Tweets tweets_tsv = foreach tweets { generate    (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,    (chararray)$0#'id_str', (chararray)$0#'text' as text,    (chararray)$0#'in_reply_to', (boolean)$0#'retweeted' as is_retweeted, (chararray)$0#'user'#'id_str' as user_id, (chararray)$0#'place'#'id' as place_id; } store tweets_tsv into '$outputDir/tweets' using PigStorage('u0001'); -- Places needed_fields = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,      (chararray)$0#'id_str' as id_str, $0#'place' as place; } place_fields = foreach needed_fields { generate    (chararray)place#'id' as place_id,    (chararray)place#'country_code' as co,    (chararray)place#'country' as country,    (chararray)place#'name' as place_name,    (chararray)place#'full_name' as place_full_name,    (chararray)place#'place_type' as place_type; } filtered_places = filter place_fields by co != ''; unique_places = distinct filtered_places; store unique_places into '$outputDir/places' using PigStorage('u0001');   -- Users users = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)$0#'id_str' as id_str, $0#'user' as user; } user_fields = foreach users {    generate    (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)user#'id_str' as user_id, (chararray)user#'location' as user_location, (chararray)user#'name' as user_name, (chararray)user#'description' as user_description, (int)user#'followers_count' as followers_count, (int)user#'friends_count' as friends_count, (int)user#'favourites_count' as favourites_count, (chararray)user#'screen_name' as screen_name, (int)user#'listed_count' as listed_count;   } unique_users = distinct user_fields; store unique_users into '$outputDir/users' using PigStorage('u0001'); Have a look at the following code: $ pig –f extract_for_hive.pig –param inputDir=<json input> -param outputDir=<output path> The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to unicode value U0001, or you can also use Ctrl +C + A. This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields. Overview of Hive We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. 
In this example, and in the remainder of the article, we will assume that queries are typed into the shell that can be invoked by executing the hive command. Even though the classic CLI tool for Hive was the tool with the same name, it is specifically called hive (all lowercase); recently a client called Beeline also became available and will likely be the preferred CLI client in the near future. When importing any new data into Hive, there is generally a three-stage process, as follows: Create the specification of the table into which the data is to be imported Import the data into the created table Execute HiveQL queries against the table Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this article, but if you need a refresher, there are numerous good online learning resources. Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored: CREATE table tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; The statement creates a new table tweet defined by a list of names for columns in the dataset and their data type. We specify that fields are delimited by a tab character t and that the format used to store data is TEXTFILE. Data can be imported from a location in HDFS tweets/ into hive using the LOAD DATA statement: LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets; By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later. Once data has been imported into Hive, we can run queries against it. For instance: SELECT COUNT(*) FROM tweets; The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase. If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times. The nature of Hive tables Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point. 
Neither the CREATE TABLE nor the LOAD DATA statement truly creates concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored. This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.
Hive architecture
Until version 2, Hadoop was primarily a batch system. MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later. Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2. Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password. HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2. HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2. Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively. In the examples we saw before and in the remainder of this article, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons (Beeline is relatively new), both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command:
$ beeline -u jdbc:hive2://
Data types
HiveQL supports many of the common data types provided by standard database systems.
These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories: Numeric: tinyint, smallint, int, bigint, float, double, and decimal Date and time: timestamp and date String: string, varchar, and char Collections: array, map, struct, and uniontype Misc: boolean, binary, and NULL DDL statements HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. The following SHOW [DATABASES, TABLES, VIEWS] statement displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use: CREATE DATABASE twitter; SHOW databases; USE twitter; SHOW TABLES; The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception. Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally. The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code: CREATE EXTERNAL TABLE tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/tweets'; This table will be created in metastore, but the data will not be copied into the /user/hive/warehouse directory. Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse. The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we might want to create a view to isolate retweets from other messages, as follows: CREATE VIEW retweets COMMENT 'Tweets that have been retweeted' AS SELECT * FROM tweets WHERE retweeted = true; Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views. 
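As a quick, illustrative check (not part of the original text), the view can be queried just like a table; the exact counts will of course depend on your own tweet sample:

-- Sketch only: querying the retweets view defined above
SELECT COUNT(*) FROM retweets;

SELECT user_id, COUNT(*) AS cnt
FROM retweets
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 5;

Because Hive views are purely logical, both statements are expanded at query time to run against the underlying tweets table.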
The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected. Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns. When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table, which didn't exist in older files, then, while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.
File formats and storage
The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH. Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause. Hive currently uses the following FileFormat classes to read and write HDFS files:
TextInputFormat and HiveIgnoreKeyTextOutputFormat: These will read/write data in plain text file format
SequenceFileInputFormat and SequenceFileOutputFormat: These classes read/write data in the Hadoop SequenceFile format
Additionally, the following SerDe classes can be used to serialize and deserialize data:
MetadataTypedColumnsetSerDe: This will read/write delimited records such as CSV or tab-separated records
ThriftSerDe and DynamicSerDe: These will read/write Thrift objects
JSON
As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe JSON SerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules. We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde. As with any third-party module, we load the SerDe JAR into Hive with the following code:
ADD JAR json-serde-1.3-jar-with-dependencies.jar;
Then, we issue the usual create statement, as follows:
CREATE EXTERNAL TABLE tweets (
   contributors string,
   coordinates struct <
     coordinates: array <float>,
     type: string>,
   created_at string,
   entities struct <
     hashtags: array <struct <
           indices: array <tinyint>,
           text: string>>,
…
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'tweets';
With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe').
In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary. In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code: SELECT user.screen_name, user.description FROM tweets_json LIMIT 10; Avro AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose. This dataset was created using Pig's AvroStorage class, which generated the following schema: { "type":"record", "name":"record", "fields": [    {"name":"topic","type":["null","int"]},    {"name":"source","type":["null","int"]},    {"name":"rank","type":["null","float"]} ] } The structure is quite self-explanatory. The table structure is captured in an Avro record, which contains header information (a name and optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string. For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and this is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema. With this definition, we can now create a Hive table that uses this schema for its table specification, as follows: CREATE EXTERNAL TABLE tweets_pagerank ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{    "type":"record",    "name":"record",    "fields": [        {"name":"topic","type":["null","int"]},        {"name":"source","type":["null","int"]},        {"name":"rank","type":["null","float"]}    ] }') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '${data}/ch5-pagerank'; Then, look at the following table definition from within Hive (note also that HCatalog): DESCRIBE tweets_pagerank; OK topic                 int                   from deserializer   source               int                   from deserializer   rank                 float                 from deserializer In the ddl, we told Hive that data is stored in the Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro embedded in avro.schema.literal. Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. 
Create the preceding schema in a file called pagerank.avsc—this is the standard file extension for Avro schemas. Then place it on HDFS; we want to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc'). If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, for example, on the Cloudera CDH5 VM: ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar; We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank: SELECT source, topic from tweets_pagerank WHERE rank >= 0.9; We will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations. Columnar stores Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats. If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in SequenceFile, each full row and all its columns will be read from the disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest. Traditional relational databases also store data on a row basis, and a type of database called columnarchanged this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries. Queries Tables can be queried using the familiar SELECT … FROM statement. The WHERE statement allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset: SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10 The following are the top 10 most prolific users in the dataset: NULL 7091 1332188053 4 959468857 3 1367752118 3 362562944 3 58646041 3 2375296688 3 1468188529 3 37114209 3 2385040940 3 This allows us to identify the number of tweets, 7,091, with no user object. We can improve the readability of the hive output by setting the following code: SET hive.cli.print.header=true; This will instruct hive, though not beeline, to print column names as part of the output. You can add the command to the .hiverc file usually found in the root of the executing user's home directory to have it apply to all hive CLI sessions. HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into hive using external tables. 
We first create a user table to store user data, as follows: CREATE EXTERNAL TABLE user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/users'; We then create a place table to store location data, as follows: CREATE EXTERNAL TABLE place ( place_id string, country_code string, country string, `name` string, full_name string, place_type string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/places'; We can use the JOIN operator to display the names of the 10 most prolific users, as follows: SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt FROM tweets JOIN user ON user.user_id = tweets.user_id GROUP BY tweets.user_id, user.user_id, user.name ORDER BY cnt DESC LIMIT 10; Only equality, outer, and left (semi) joins are supported in Hive. Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets tables. Alternatively, we can rewrite the previous query as follows: SELECT tweets.user_id, u.name, COUNT(*) AS cnt FROM tweets join (SELECT user_id, name FROM user GROUP BY user_id, name) u ON u.user_id = tweets.user_id GROUP BY tweets.user_id, u.name ORDER BY cnt DESC LIMIT 10;   Instead of directly joining the user table, we execute a subquery, as follows: SELECT user_id, name FROM user GROUP BY user_id, name; The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries. Historically, only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also. HiveQL is an ever-evolving rich language, a full exposition of which is beyond the scope of this article. A description of its query and ddl capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual. Structuring Hive tables for given workloads Often Hive isn't used in isolation, instead tables are created with particular workloads in mind or with needs invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios. Partitioning a table With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning. When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS. It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps, the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered. 
Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows:

CREATE TABLE partitioned_user (
created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint
) PARTITIONED BY (created_at_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

To load data into a partition, we can explicitly give a value for the partition in which to insert the data, as follows:

INSERT INTO TABLE partitioned_user
PARTITION( created_at_date = '2014-01-01')
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count
FROM user;

This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=5000;

The first two statements enable all partitions (the nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node. We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:

INSERT INTO TABLE partitioned_user
PARTITION( created_at_date )
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date
FROM user;

Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause. Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table, as in the preceding code. We use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string. Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01. If data is added directly to the filesystem, for instance, by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:

ALTER TABLE <table_name> ADD PARTITION (<partition_spec>) LOCATION '<location>';

Using the MSCK REPAIR TABLE <table_name>; statement, all metadata for all partitions not currently present in the metastore will be added. On EMR, this is equivalent to executing the following code:

ALTER TABLE <table_name> RECOVER PARTITIONS;

Notice that both statements will also work with EXTERNAL tables. In the following article, we will see how this pattern can be exploited to create flexible and interoperable pipelines.
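Whichever of the preceding repair statements is used, it can be worth confirming afterwards what the metastore now knows about. SHOW PARTITIONS lists the partitions Hive is aware of for a table; for our example table it would be used as follows, with each returned line in the key=value form described above (for example, created_at_date=2014-04-01):

SHOW PARTITIONS partitioned_user;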
Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table. Normally a statement of the following form will replace all the data for the destination table:

INSERT OVERWRITE TABLE <table> …

If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched. If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement. We can also add conditions to the SELECT statement. Say, for example, we only wanted to update data for a certain month:

INSERT OVERWRITE TABLE partitioned_user
PARTITION (created_at_date)
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date
FROM user
WHERE to_date(created_at) BETWEEN '2014-03-01' AND '2014-03-31';

Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure. Let's create bucketed versions of our tweets and user tables; note the additional CLUSTERED BY and SORTED BY clauses in the following CREATE TABLE statements:

CREATE TABLE bucketed_tweets (
tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string
) PARTITIONED BY (created_at string)
CLUSTERED BY(user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

CREATE TABLE bucketed_user (
user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint
) PARTITIONED BY (created_at string)
CLUSTERED BY(user_id) SORTED BY(name) INTO 64 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned. Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:

SET hive.enforce.bucketing=true;

Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table. When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause. One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example.
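To picture the allocation itself, the following query is only an illustrative sketch, not Hive's internal implementation: it uses the built-in hash() and pmod() functions to show roughly which of the 64 buckets a given user_id would map to (the exact hash Hive applies internally may differ).

SELECT user_id, pmod(hash(user_id), 64) AS likely_bucket
FROM user
LIMIT 10;

Because both bucketed tables use the same clustering column and bucket count, matching user_id values land in correspondingly numbered buckets, which is exactly what the bucket map join described next exploits.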
So, for example, any query of the following form would be vastly improved:

SET hive.optimize.bucketmapjoin=true;
SELECT …
FROM bucketed_user u
JOIN bucketed_tweets t
ON u.user_id = t.user_id;

With the join being performed on the column used to bucket the tables, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. When determining which rows to match, only the rows in the corresponding bucket need to be compared, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.

Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size. For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:

SELECT max(friends_count)
FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);

In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function. Though it works, this is highly inefficient, as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. The following code is representative of this case:

SELECT MAX(friends_count)
FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);

In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.
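Tying this back to the earlier caveat about counts: a sampled aggregate such as a row count only estimates the table total once it is scaled by the inverse of the sampling fraction. The following is a sketch of such an estimate, assuming the buckets are reasonably uniform in size; range-style aggregates such as the MAX query above need no scaling.

SELECT COUNT(*) * 32 AS estimated_total_users
FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);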
A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:

TABLESAMPLE(0.5 PERCENT)
TABLESAMPLE(1G)
TABLESAMPLE(100 ROWS)

If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input formats and file formats that are supported.

Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:

$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql

We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example:

$ cat show_tables2.hql
show tables like '${hiveconf:TABLENAME}';
$ hive -hiveconf TABLENAME=user -f show_tables2.hql

The variable can also be set within the Hive script or an interactive session:

SET TABLENAME='user';

The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:

$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql

Or we can set the variable interactively:

SET hivevar:TABLENAME='user';

Hive and Amazon Web Services

With ElasticMapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster.

Hive and S3

It is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS. We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:

$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/

We first need to specify the access key and secret access key that can access the bucket. This can be done in three ways:

Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI
Set the same values in hive-site.xml, though note this limits use of S3 to a single set of credentials
Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path>

Then we can create a table referencing this data, as follows:

CREATE TABLE remote_tweets (
created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string
) CLUSTERED BY(user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://<bucket-name>/tweets';

This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing. In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, or = characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/.
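As a concrete illustration of the first of the three options listed above, the credentials can be set for the current Hive session before the remote table is queried. The property names are the s3n ones given above; the values shown here are placeholders that must be replaced with your own keys:

SET fs.s3n.awsAccessKeyId=<your access key>;
SET fs.s3n.awsSecretAccessKey=<your secret access key>;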
In theory, you can just leave the data in the external table and refer to it when needed to avoid WAN data transfer latencies (and costs), even though it often makes sense to pull the data into a local table and do future processing from there. If the table is partitioned, then you might find yourself retrieving a new partition each day, for example.

Hive on ElasticMapReduce

On one level, using Hive within Amazon ElasticMapReduce is just the same as everything discussed in this article. You can create a persistent cluster, log in to the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data. Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And, not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless. It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services. The pattern mentioned earlier, where new data is added to a new partition directory for a table each day, has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows:

ALTER TABLE <table-name> RECOVER PARTITIONS;

Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation. In particular, the integration points with other AWS services are an area of rapid growth.

Extending HiveQL

The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions, characterized by the number of rows they take as input and produce as output:

User Defined Functions (UDFs): These are simple functions that act on one row at a time and return a single value.
User Defined Aggregate Functions (UDAFs): These functions take multiple rows as input and produce a single output value. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on).
User Defined Table Functions (UDTFs): These take a single row as input and generate a logical table comprised of multiple rows; Hive's built-in explode() function is an example.

These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities. Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that read and return basic writable types. A richer API, which provides support for data types other than writables, is available in the org.apache.hadoop.hive.ql.udf.generic.GenericUDF class. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string to ID function similar to the one we used in Iterative Computation with Spark to map hashtags to integers in Pig.
Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows: public class StringToInt extends UDF {    public Integer evaluate(Text input) {        if (input == null)            return null;            String str = input.toString();          return str.hashCode();    } } The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/ch7/udf/ com/learninghadoop2/hive/udf/StringToInt.java. A more robust hash function should be used in production. We compile the class and archive it into a JAR file, as follows: $ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/udf/StringToInt.java $ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class Before being able to use it, a UDF must be registered in Hive with the following commands: ADD JAR myudfs-hive.jar; CREATE TEMPORARY FUNCTION string_to_int AS 'com.learninghadoop2.hive.udf.StringToInt'; The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers as a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed. As of Hive 0.13, it is possible to create permanent functions whose definition is kept in the metastore using CREATE FUNCTION … . Once registered, StringToInt can be used in a Hive query just as any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying the regexp_extract function. Then, we use string_to_int to map each tag to a numerical ID: SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id FROM    (        SELECT regexp_extract(text,            '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') as hashtag        FROM tweets        GROUP BY regexp_extract(text,        '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') ) unique_hashtags GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag); We can use the preceding query to create a lookup table, as follows: CREATE TABLE lookuptable (tag string, tag_id bigint); INSERT OVERWRITE TABLE lookuptable SELECT unique_hashtags.hashtag,    string_to_int(unique_hashtags.hashtag) as tag_id FROM (    SELECT regexp_extract(text,        '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag          FROM tweets          GROUP BY regexp_extract(text,            '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)')    ) unique_hashtags GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag); Programmatic interfaces In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for odbc was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC. JDBC A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example MySQL). The following is a sample Hive client program using JDBC APIs. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/ com/learninghadoop2/hive/client/HiveJdbcClient.java. 
public class HiveJdbcClient {      private static String driverName = " org.apache.hive.jdbc.HiveDriver";           // connection string      public static String URL = "jdbc:hive2://localhost:10000";        // Show all tables in the default database      public static String QUERY = "show tables";        public static void main(String[] args) throws SQLException {          try {                Class.forName (driverName);          }          catch (ClassNotFoundException e) {                e.printStackTrace();                System.exit(1);          }          Connection con = DriverManager.getConnection (URL);          Statement stmt = con.createStatement();                   ResultSet resultSet = stmt.executeQuery(QUERY);          while (resultSet.next()) {                System.out.println(resultSet.getString(1));          }    } } The URL part is the JDBC URI that describes the connection end point. The format for establishing a remote connection is jdbc:hive2:<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port jdbc:hive2://. The hive and hive2 part are the drivers to be used when connecting to HiveServer and HiveServer2. The QUERY statement contains the HiveQL query to be executed. Hive's JDBC interface exposes only the default database. In order to access other databases, you need to reference them explicitly in the underlying queries using the <database>.<table> notation. First we load the HiveServer2 JDBC driver org.apache.hive.jdbc.HiveDriver. Use org.apache.hadoop.hive.jdbc.HiveDriver to connect to HiveServer. Then, like with any other JDBC program, we establish a connection to URL and use it to instantiate a Statement class. We execute QUERY, with no authentication, and store the output dataset into the ResultSet object. Finally, we scan resultSet and print its content to the command line. Compile and execute the example with the following commands: $ javac HiveJdbcClient.java $ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar: com.learninghadoop2.hive.client.HiveJdbcClient Thrift Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/ com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, the client won't work with HiveServer2. In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client. 
A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:

TSocket transport = new TSocket(host, port);
transport.setTimeout(TIMEOUT);
transport.open();
TBinaryProtocol protocol = new TBinaryProtocol(transport);
client = new ThriftHive.Client(protocol);

Finally, we call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:

public static void main(String[] args) throws Exception {
    Client client = getClient("localhost", 11111);
    client.execute("show tables");
    List<String> results = client.fetchAll();
    for (String result : results) {
        System.out.println(result);
    }
}

Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command:

$ sudo hive --service hiveserver -p 11111

Compile and execute the HiveThriftClient.java example with the following commands:

$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/client/HiveThriftClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*: com.learninghadoop2.hive.client.HiveThriftClient

Stinger initiative

Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries. These perceived limitations were less due to Hive itself and more a consequence of how the translation of SQL queries into the MapReduce model has much built-in inefficiency when compared to other ways of implementing a SQL query. In particular, with very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Processing - MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule tasks on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between the tasks in the graph. Consider the following query, which we will use to compare execution on the MapReduce framework and on Tez:

SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b
ON (a.place_id = b.place_id)
GROUP BY a.country;

The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez:

Hive on MapReduce versus Tez

In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk as well as data from table b. The combined dataset is then passed to the reducer where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of Map tasks that read data from the disk; the grouping and joining are then pipelined across the reducers without further intermediate writes to disk.
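If you want to see how Hive plans a given query on your own cluster, you can prefix the statement with the EXPLAIN keyword, which prints the execution plan instead of running the query; the exact output varies with the Hive version and the configured execution engine. For example:

EXPLAIN SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b
ON (a.place_id = b.place_id)
GROUP BY a.country;

Comparing this output under different execution engines is a simple way to see the contrast described above for yourself.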
Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support. Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Also, Tez is offered as an execution framework in addition to a more efficient MapReduce-based implementation atop YARN. With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps. To take advantage of the Tez framework, there is a new Hive variable setting, as follows: set hive.execution.engine=tez; This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.incubator.apache.org or in several distributions, though at the time of writing, not Cloudera, due to its support of Impala. The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive using Tez. Impala Hive is not the only product providing the SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala). Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative. Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf) which was first openly described by a paper published in 2009. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time. Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries. The architecture of Impala The basic architecture has three main components; the Impala daemons, the state store, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture. The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient. When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system. 
This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the overall result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today; MPP is the term used for this type of shared-nothing, scale-out architecture. As the cluster runs, the state store ensures that each impalad process is aware of all the others and provides a view of the overall cluster health.

Co-existing with Hive

Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support. Impala supports the metastore mechanism used by Hive as a store of the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala as it will access the same metastore and therefore provide access to the same tables available in Hive. But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other; they might show very different performance characteristics (more on this later), or they might actually give different results. This last point might become apparent when using data types such as float and double that are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++). As of version 1.2, Impala supports UDFs written in both C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala.

A different philosophy

When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed of thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion without having to wait for minutes at a time for each query to complete. It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time. The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy as it relies on in-memory processing to achieve much of its performance. If a query requires more data to be held in memory than is available on the executing nodes, then that query will simply fail in versions of Impala before 2.0. Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling in the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads.
Impala is still developing at a fast pace and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows. It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was lead by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up-to-date. Conversely, if you use another distribution, you might get the latest Hive release, but that might either have an older Impala or, as is currently the case, you might have to download and install it yourself. A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks. Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in say the Hortonworks distribution, where the ORC file format is preferred. These themes are a little concerning since, although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk that your choice of distribution might have a larger impact on the tools and file formats that will be fully supported, unlike in the past. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs. Drill, Tajo, and beyond You should also consider that SQL on Hadoop no longer only refers to Hive or Impala. Apache Drill (http://drill.incubator.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering. Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data. With an architecture similar to that of Impala, it offers a much richer system with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base but has been used by certain companies very successfully for a significant length of time, and might be worth considering if you need a fuller data warehousing solution. Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around—something else might. Summary In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. 
Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world. HiveQL is an implementation of SQL on Hadoop and was the primary focus of this article. In regard to HiveQL and its implementations, we covered the following topics: How HiveQL provides a logical model atop data stored in HDFS in contrast to relational databases where the table structure is enforced in advance How HiveQL supports many standard SQL data types and commands including joins and views The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms How HiveQL offers the ability to extend its core set of operators with user-defined code and how this contrasts to the Pig UDF mechanism The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill and how each of these focuses on specific areas in which to excel With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next article, we'll take a slight step up the abstraction hierarchy and look at how to manage the life cycle of this enormous data asset. Resources for Article: Further resources on this subject: Big Data Analysis [Article] Understanding MapReduce [Article] Amazon DynamoDB - Modelling relationships, Error handling [Article]

Evolving the data model

Packt
19 Dec 2014
11 min read
In this article by C. Y. Kan, author of the book Cassandra Data Modeling and Analysis, we will see the techniques of how to evolve an existing Cassandra data model in detail. Meanwhile, the techniques of modeling by query will be demonstrated as well. (For more resources related to this topic, see here.) The Stock Screener Application is good enough to retrieve and analyze a single stock at one time. However, scanning just a single stock looks very limited in practical use. A slight improvement can be made here; it can handle a bunch of stocks instead of one. This bunch of stocks will be stored as Watch List in the Cassandra database. Accordingly, the Stock Screener Application will be modified to analyze the stocks in the Watch List, and therefore it will produce alerts for each of the stocks being watched based on the same screening rule. For the produced alerts, saving them in Cassandra will be beneficial for backtesting trading strategies and continuous improvement of the Stock Screener Application. They can be reviewed from time to time without having to review them on the fly. Backtesting is a jargon used to refer to testing a trading strategy, investment strategy, or a predictive model using existing historical data. It is also a special type of cross-validation applied to time series data. In addition, when the number of the stocks in the Watch List grows to a few hundred, it will be difficult for a user of the Stock Screener Application to recall what the stocks are by simply referring to their stock codes. Hence, it would be nice to have the name of the stocks added to the produced alerts to make them more descriptive and user-friendly. Finally, we might have an interest in finding out how many alerts were generated on a particular stock over a specified period of time and how many alerts were generated on a particular date. We will use CQL to write queries to answer these two questions. By doing so, the modeling by query technique can be demonstrated. The enhancement approach The enhancement approach consists of four change requests in total. First, we will conduct changes in the data model and then the code will be enhanced to provide the new features. Afterwards, we will test run the enhanced Stock Screener Application again. The parts of the Stock Screener Application that require modifications are highlighted in the following figure. It is remarkable that two new components are added to the Stock Screener Application. The first component, Watch List, governs Data Mapper and Archiver to collect stock quote data of those stocks in the Watch List from Yahoo! Finance. The second component is Query. It provides two Queries on Alert List for backtesting purposes: Watch List Watch List is a very simple table that merely stores the stock code of its constituents. It is rather intuitive for a relational database developer to define the stock code as the primary key, isn't it? Nevertheless, remember that in Cassandra, the primary key is used to determine the node that stores the row. As Watch List is expected to not be a very long list, it would be more appropriate to put all of its rows on the same node for faster retrieval. But how can we do that? We can create an additional column, say watch_list_code, for this particular purpose. The new table is called watchlist and will be created in the packtcdma keyspace. 
The CQL statement is shown in chapter06_001.py: # -*- coding: utf-8 -*- # program: chapter06_001.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create watchlist def create_watchlist(ss):    ## create watchlist table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS watchlist (' +                'watch_list_code varchar,' +                'symbol varchar,' +                'PRIMARY KEY (watch_list_code, symbol))')       ## insert AAPL, AMZN, and GS into watchlist    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'AAPL')")    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'AMZN')")    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'GS')") ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create watchlist table create_watchlist(session) ## close Cassandra connection cluster.shutdown() The create_watchlist function creates the table. Note that the watchlist table has a compound primary key made of watch_list_code and symbol. A Watch List called WS01 is also created, which contains three stocks, AAPL, AMZN, and GS. Alert List It is produced by a Python program and enumerates the date when the close price was above its 10-day SMA, that is, the signal and the close price at that time. Note that there were no stock code and stock name. We will create a table called alertlist to store the alerts with the code and name of the stock. The inclusion of the stock name is to meet the requirement of making the Stock Screener Application more user-friendly. Also, remember that joins are not allowed and denormalization is really the best practice in Cassandra. This means that we do not mind repeatedly storing (duplicating) the stock name in the tables that will be queried. A rule of thumb is one table for one query; as simple as that. The alertlist table is created by the CQL statement, as shown in chapter06_002.py: # -*- coding: utf-8 -*- # program: chapter06_002.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create alertlist def create_alertlist(ss):    ## execute CQL statement to create alertlist table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS alertlist (' +                'symbol varchar,' +                'price_time timestamp,' +                'stock_name varchar,' +                'signal_price float,' +                'PRIMARY KEY (symbol, price_time))') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create alertlist table create_alertlist(session) ## close Cassandra connection cluster.shutdown() The primary key is also a compound primary key that consists of symbol and price_time. Adding the descriptive stock name Until now, the packtcdma keyspace has three tables, which are alertlist, quote, and watchlist. To add the descriptive stock name, one can think of only adding a column of stock name to alertlist only. As seen in the previous section, this has been done. So, do we need to add a column for quote and watchlist? It is, in fact, a design decision that depends on whether these two tables will be serving user queries. 
What a user query means is that the table will be used to retrieve rows for a query raised by a user. If a user wants to know the close price of Apple Inc. on June 30, 2014, it is a user query. On the other hand, if the Stock Screener Application uses a query to retrieve rows for its internal processing, it is not a user query. Therefore, if we want quote and watchlist to return rows for user queries, they need the stock name column; otherwise, they do not need it. The watchlist table is only for internal use by the current design, and so it need not have the stock name column. Of course, if in future, the Stock Screener Application allows a user to maintain Watch List, the stock name should also be added to the watchlist table. However, for quote, it is a bit tricky. As the stock name should be retrieved from the Data Feed Provider, which is Yahoo! Finance in our case, the most suitable time to get it is when the corresponding stock quote data is retrieved. Hence, a new column called stock_name is added to quote, as shown in chapter06_003.py: # -*- coding: utf-8 -*- # program: chapter06_003.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to add stock_name column def add_stockname_to_quote(ss):    ## add stock_name to quote    ss.execute('ALTER TABLE quote ' +                'ADD stock_name varchar') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## add stock_name column add_stockname_to_quote(session) ## close Cassandra connection cluster.shutdown() It is quite self-explanatory. Here, we use the ALTER TABLE statement to add the stock_name column of the varchar data type to quote. Queries on alerts As mentioned previously, we are interested in two questions: How many alerts were generated on a stock over a specified period of time? How many alerts were generated on a particular date? For the first question, alertlist is sufficient to provide an answer. However, alertlist cannot answer the second question because its primary key is composed of symbol and price_time. We need to create another table specifically for that question. This is an example of modeling by query. Basically, the structure of the new table for the second question should resemble the structure of alertlist. We give that table a name, alert_by_date, and create it as shown in chapter06_004.py: # -*- coding: utf-8 -*- # program: chapter06_004.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create alert_by_date table def create_alertbydate(ss):    ## create alert_by_date table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS alert_by_date (' +               'symbol varchar,' +                'price_time timestamp,' +                'stock_name varchar,' +                'signal_price float,' +                'PRIMARY KEY (price_time, symbol))') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create alert_by_date table create_alertbydate(session) ## close Cassandra connection cluster.shutdown() When compared to alertlist in chapter06_002.py, alert_by_date only swaps the order of the columns in the compound primary key. One might think that a secondary index can be created on alertlist to achieve the same effect. 
Nonetheless, in Cassandra, a secondary index cannot be created on columns that are already engaged in the primary key. Always be aware of this constraint. We now finish the modifications on the data model. It is time for us to enhance the application logic in the next section. Summary This article extends the Stock Screener Application by a number of enhancements. We made changes to the data model to demonstrate the modeling by query techniques and how denormalization can help us achieve a high-performance application. Resources for Article: Further resources on this subject: An overview of architecture and modeling in Cassandra [Article] About Cassandra [Article] Basic Concepts and Architecture of Cassandra [Article]

Supervised learning

Packt
19 Dec 2014
50 min read
In this article by Dan Toomey, author of the book R for Data Science, we will learn about the supervised learning, which involves the use of a target variable and a number of predictor variables that are put into a model to enable the system to predict the target. This is also known as predictive modeling. (For more resources related to this topic, see here.) As mentioned, in supervised learning we have a target variable and a number of possible predictor variables. The objective is to associate the predictor variables in such a way so as to accurately predict the target variable. We are using some portion of observed data to learn how our model behaves and then testing that model on the remaining observations for accuracy. We will go over the following supervised learning techniques: Decision trees Regression Neural networks Instance based learning (k-NN) Ensemble learning Support vector machines Bayesian learning Bayesian inference Random forests Decision tree For decision tree machine learning, we develop a logic tree that can be used to predict our target value based on a number of predictor variables. The tree has logical points, such as if the month is December, follow the tree logic to the left; otherwise, follow the tree logic to the right. The last leaf of the tree has a predicted value. For this example, we will use the weather data in the rattle package. We will develop a decision tree to be used to determine whether it will rain tomorrow or not based on several variables. Let's load the rattle package as follows: > library(rattle) We can see a summary of the weather data. This shows that we have some real data over a year from Australia: > summary(weather)      Date                     Location     MinTemp     Min.   :2007-11-01   Canberra     :366   Min.   :-5.300 1st Qu.:2008-01-31   Adelaide     : 0   1st Qu.: 2.300 Median :2008-05-01   Albany       : 0   Median : 7.450 Mean   :2008-05-01   Albury       : 0   Mean   : 7.266 3rd Qu.:2008-07-31   AliceSprings : 0   3rd Qu.:12.500 Max.   :2008-10-31   BadgerysCreek: 0   Max.   :20.900                      (Other)     : 0                      MaxTemp         Rainfall       Evaporation       Sunshine     Min.   : 7.60   Min.   : 0.000   Min.  : 0.200   Min.   : 0.000 1st Qu.:15.03   1st Qu.: 0.000   1st Qu.: 2.200   1st Qu.: 5.950 Median :19.65   Median : 0.000   Median : 4.200   Median : 8.600 Mean   :20.55   Mean   : 1.428   Mean   : 4.522   Mean   : 7.909 3rd Qu.:25.50   3rd Qu.: 0.200   3rd Qu.: 6.400   3rd Qu.:10.500 Max.   :35.80   Max.   :39.800   Max.   :13.800   Max.   :13.600                                                    NA's   :3       WindGustDir   WindGustSpeed   WindDir9am   WindDir3pm NW     : 73   Min.   :13.00   SE     : 47   WNW   : 61 NNW   : 44   1st Qu.:31.00   SSE   : 40   NW     : 61 E     : 37   Median :39.00   NNW   : 36   NNW   : 47 WNW   : 35   Mean   :39.84   N     : 31   N     : 30 ENE   : 30   3rd Qu.:46.00   NW     : 30   ESE   : 27 (Other):144   Max.   :98.00   (Other):151   (Other):139 NA's   : 3   NA's   :2       NA's   : 31   NA's   : 1 WindSpeed9am     WindSpeed3pm   Humidity9am     Humidity3pm   Min.   : 0.000   Min.   : 0.00   Min.   :36.00   Min.   :13.00 1st Qu.: 6.000   1st Qu.:11.00   1st Qu.:64.00   1st Qu.:32.25 Median : 7.000   Median :17.00   Median :72.00   Median :43.00 Mean   : 9.652   Mean   :17.99   Mean   :72.04   Mean   :44.52 3rd Qu.:13.000   3rd Qu.:24.00   3rd Qu.:81.00   3rd Qu.:55.00 Max.   :41.000   Max.   :52.00   Max.   :99.00   Max.   
:96.00 NA's   :7                                                       Pressure9am     Pressure3pm       Cloud9am       Cloud3pm   Min.   : 996.5   Min.   : 996.8   Min.   :0.000   Min.   :0.000 1st Qu.:1015.4   1st Qu.:1012.8   1st Qu.:1.000   1st Qu.:1.000 Median :1020.1   Median :1017.4   Median :3.500   Median :4.000 Mean   :1019.7   Mean   :1016.8   Mean   :3.891   Mean   :4.025 3rd Qu.:1024.5   3rd Qu.:1021.5   3rd Qu.:7.000   3rd Qu.:7.000 Max.   :1035.7   Max.   :1033.2   Max.   :8.000   Max.   :8.000 Temp9am         Temp3pm         RainToday RISK_MM Min.   : 0.100   Min.   : 5.10   No :300   Min.   : 0.000 1st Qu.: 7.625   1st Qu.:14.15   Yes: 66   1st Qu.: 0.000 Median :12.550   Median :18.55             Median : 0.000 Mean   :12.358   Mean   :19.23             Mean   : 1.428 3rd Qu.:17.000   3rd Qu.:24.00             3rd Qu.: 0.200 Max.   :24.700   Max.   :34.50           Max.   :39.800                                                            RainTomorrow No :300     Yes: 66       We will be using the rpart function to develop a decision tree. The rpart function looks like this: rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...) The various parameters of the rpart function are described in the following table: Parameter Description formula This is the formula used for the prediction. data This is the data matrix. weights These are the optional weights to be applied. subset This is the optional subset of rows of data to be used. na.action This specifies the action to be taken when y, the target value, is missing. method This is the method to be used to interpret the data. It should be one of these: anova, poisson, class, or exp. If not specified, the algorithm decides based on the layout of the data. … These are the additional parameters to be used to control the behavior of the algorithm.  
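As a quick illustration of how these pieces fit together, the following sketch fits a classification tree on a random training portion of the weather data and checks it against the held-out rows. The 75/25 split, the seed, and the choice of three predictors are assumptions made for this sketch only; the actual model for this article is built in the next step.

# A minimal rpart workflow sketch (assumes library(rattle) has made the weather data available)
library(rpart)
set.seed(42)                                   # arbitrary seed, just for a reproducible split
train_rows <- sample(nrow(weather), floor(0.75 * nrow(weather)))
train <- weather[train_rows, ]
test  <- weather[-train_rows, ]
# Fit a classification tree on the training portion only
tree <- rpart(RainTomorrow ~ Humidity3pm + Pressure3pm + Sunshine,
              data = train, method = "class")
# Predict the held-out rows and tabulate predictions against the actual outcomes
pred <- predict(tree, newdata = test, type = "class")
table(pred, test$RainTomorrow)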
Let's create a subset as follows: > weather2 <- subset(weather,select=-c(RISK_MM)) > install.packages("rpart") >library(rpart) > model <- rpart(formula=RainTomorrow ~ .,data=weather2, method="class") > summary(model) Call: rpart(formula = RainTomorrow ~ ., data = weather2, method = "class") n= 366   CPn split       rel error     xerror   xstd 1 0.19696970     0 1.0000000 1.0000000 0.1114418 2 0.09090909      1 0.8030303 0.9696970 0.1101055 3 0.01515152     2 0.7121212 1.0151515 0.1120956 4 0.01000000     7 0.6363636 0.9090909 0.1073129   Variable importance Humidity3pm WindGustSpeed     Sunshine WindSpeed3pm       Temp3pm            24           14          12             8             6 Pressure3pm       MaxTemp       MinTemp   Pressure9am       Temp9am            6             5             4             4             4 Evaporation         Date   Humidity9am     Cloud3pm     Cloud9am             3             3             2             2             1      Rainfall            1 Node number 1: 366 observations,   complexity param=0.1969697 predicted class=No   expected loss=0.1803279 P(node) =1    class counts:   300   66    probabilities: 0.820 0.180 left son=2 (339 obs) right son=3 (27 obs) Primary splits:    Humidity3pm < 71.5   to the left, improve=18.31013, (0 missing)    Pressure3pm < 1011.9 to the right, improve=17.35280, (0 missing)    Cloud3pm   < 6.5     to the left, improve=16.14203, (0 missing)    Sunshine   < 6.45   to the right, improve=15.36364, (3 missing)    Pressure9am < 1016.35 to the right, improve=12.69048, (0 missing) Surrogate splits:    Sunshine < 0.45   to the right, agree=0.945, adj=0.259, (0 split) (many more)… As you can tell, the model is complicated. The summary shows the progression of the model development using more and more of the data to fine-tune the tree. We will be using the rpart.plot package to display the decision tree in a readable manner as follows: > library(rpart.plot) > fancyRpartPlot(model,main="Rain Tomorrow",sub="Chapter 12") This is the output of the fancyRpartPlot function Now, we can follow the logic of the decision tree easily. For example, if the humidity is over 72, we are predicting it will rain. Regression We can use a regression to predict our target value by producing a regression model from our predictor variables. We will be using the forest fire data from http://archive.ics.uci.edu. We will load the data and get the following summary: > forestfires <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv") > summary(forestfires)        X               Y           month     day         FFMC     Min.   :1.000   Min.   :2.0   aug   :184   fri:85 Min.   :18.70 1st Qu.:3.000   1st Qu.:4.0   sep   :172   mon:74   1st Qu.:90.20 Median :4.000   Median :4.0   mar   : 54   sat:84   Median :91.60 Mean   :4.669   Mean   :4.3   jul   : 32   sun:95   Mean   :90.64 3rd Qu.:7.000   3rd Qu.:5.0  feb   : 20   thu:61   3rd Qu.:92.90 Max.   :9.000   Max.   :9.0   jun   : 17   tue:64   Max.   :96.20                                (Other): 38   wed:54                      DMC             DC             ISI             temp     Min.   : 1.1   Min.   : 7.9   Min.   : 0.000   Min.   : 2.20 1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500   1st Qu.:15.50 Median :108.3   Median :664.2   Median : 8.400   Median :19.30 Mean   :110.9   Mean   :547.9   Mean   : 9.022   Mean   :18.89 3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800   3rd Qu.:22.80 Max.   :291.3   Max.   :860.6   Max.   :56.100   Max.   
:33.30
       RH             wind           rain             area
 Min.   : 15.00   Min.   :0.400   Min.   :0.00000   Min.   :   0.00
 1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000   1st Qu.:   0.00
 Median : 42.00   Median :4.000   Median :0.00000   Median :   0.52
 Mean   : 44.29   Mean   :4.018   Mean   :0.02166   Mean   : 12.85
 3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000   3rd Qu.:   6.57
 Max.   :100.00   Max.   :9.400   Max.   :6.40000   Max.   :1090.84
I will just use the month, temperature, wind, and rain data to come up with a model of the area (size) of the fires using the lm function. The lm function looks like this:
lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)
The various parameters of the lm function are described in the following table:
Parameter Description
formula This is the formula to be used for the model
data This is the dataset
subset This is the subset of the dataset to be used
weights These are the weights to apply to factors
… These are the additional parameters to be added to the function
Let's fit the model as follows:
> model <- lm(formula = area ~ month + temp + wind + rain, data=forestfires)
Looking at the generated model, we see the following output:
> summary(model)
Call:
lm(formula = area ~ month + temp + wind + rain, data = forestfires)
Residuals:
   Min     1Q Median     3Q     Max
-33.20 -14.93   -9.10   -1.66 1063.59
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.390     24.532 -0.709   0.4787
monthaug     -10.342     22.761 -0.454   0.6498
monthdec     11.534     30.896   0.373   0.7091
monthfeb       2.607     25.796   0.101   0.9196
monthjan       5.988     50.493   0.119   0.9056
monthjul     -8.822    25.068 -0.352   0.7251
monthjun     -15.469     26.974 -0.573   0.5666
monthmar     -6.630     23.057 -0.288   0.7738
monthmay       6.603     50.053   0.132   0.8951
monthnov     -8.244     67.451 -0.122   0.9028
monthoct     -8.268    27.237 -0.304   0.7616
monthsep     -1.070     22.488 -0.048   0.9621
temp           1.569     0.673   2.332   0.0201 *
wind           1.581     1.711   0.924   0.3557
rain         -3.179     9.595 -0.331   0.7406
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 63.99 on 502 degrees of freedom
Multiple R-squared: 0.01692, Adjusted R-squared: -0.0105
F-statistic: 0.617 on 14 and 502 DF, p-value: 0.8518
Looking at the coefficients, temperature is the only predictor that is significant at the 5 percent level (p = 0.0201); none of the month dummy variables, and neither wind nor rain, shows a significant effect. The overall fit is also weak: the multiple R-squared is only about 0.017 and the F-statistic's p-value is 0.85, so these variables explain very little of the variation in burned area. Note that the model treats the month data as categorical. If we redevelop the model without temperature, the fit gets even worse (notice that the multiple R-squared value drops to 0.006 from 0.017), as shown here:
> model <- lm(formula = area ~ month + wind + rain, data=forestfires)
> summary(model)

Call:
lm(formula = area ~ month + wind + rain, data = forestfires)

Residuals:
   Min     1Q Median     3Q     Max
-22.17 -14.39 -10.46   -3.87 1072.43

Coefficients:
           Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.0126   22.8496   0.176   0.861
monthaug     4.3132   21.9724   0.196   0.844
monthdec     1.3259   30.7188   0.043   0.966
monthfeb     -1.6631   25.8441 -0.064   0.949
monthjan     -6.1034   50.4475 -0.121   0.904
monthjul     6.4648   24.3021   0.266   0.790
monthjun     -2.4944   26.5099 -0.094   0.925
monthmar     -4.8431   23.1458 -0.209   0.834
monthmay     10.5754   50.2441   0.210   0.833
monthnov     -8.7169   67.7479 -0.129   0.898
monthoct     -0.9917   27.1767 -0.036   0.971
monthsep     10.2110   22.0579   0.463   0.644
wind         1.0454     1.7026   0.614   0.540
rain         -1.8504     9.6207 -0.192   0.848

Residual standard error: 64.27 on 503 degrees of freedom
Multiple R-squared: 0.006269, Adjusted R-squared: -0.01941
F-statistic: 0.2441 on 13 and 503 DF, p-value: 0.9971
From the results, we can see an R-squared close to 0 and a model p-value of almost 1; in other words, this reduced model explains essentially none of the variation in fire size, which confirms that temperature was the most useful of these predictors. If you plot the model, you will get a series of graphs. The plot of the residuals versus fitted values is the most revealing, as shown in the following graph:
> plot(model)
The few very large positive residuals correspond to the handful of very large fires that the model fails to capture, which is consistent with the low R-squared:
Neural network
In a neural network, it is assumed that there is a complex relationship between the predictor variables and the target variable. The network allows the expression of each of these relationships. For this model, we will use the liver disorder data from http://archive.ics.uci.edu. The data has a few hundred observations from patients with liver disorders. The variables are various measures of blood for each patient as shown here:
> bupa <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data")
> colnames(bupa) <- c("mcv","alkphos","alamine","aspartate","glutamyl","drinks","selector")
> summary(bupa)
      mcv           alkphos         alamine
 Min.   : 65.00   Min.   : 23.00   Min.   : 4.00
 1st Qu.: 87.00   1st Qu.: 57.00   1st Qu.: 19.00
 Median : 90.00   Median : 67.00   Median : 26.00
 Mean   : 90.17   Mean   : 69.81   Mean   : 30.36
 3rd Qu.: 93.00   3rd Qu.: 80.00   3rd Qu.: 34.00
 Max.   :103.00   Max.   :138.00   Max.   :155.00
   aspartate       glutamyl         drinks
 Min.   : 5.00   Min.   : 5.00   Min.   : 0.000
 1st Qu.:19.00   1st Qu.: 15.00   1st Qu.: 0.500
 Median :23.00   Median : 24.50   Median : 3.000
 Mean   :24.64   Mean   : 38.31   Mean   : 3.465
 3rd Qu.:27.00   3rd Qu.: 46.25   3rd Qu.: 6.000
 Max.   :82.00   Max.   :297.00   Max.   :20.000
   selector
 Min.   :1.000
 1st Qu.:1.000
 Median :2.000
 Mean   :1.581
 3rd Qu.:2.000
 Max.   :2.000
We generate a neural network using the neuralnet function (from the neuralnet package). The neuralnet function looks like this:
neuralnet(formula, data, hidden = 1, threshold = 0.01,
          stepmax = 1e+05, rep = 1, startweights = NULL,
          learningrate.limit = NULL,
          learningrate.factor = list(minus = 0.5, plus = 1.2),
          learningrate=NULL, lifesign = "none",
          lifesign.step = 1000, algorithm = "rprop+",
          err.fct = "sse", act.fct = "logistic",
          linear.output = TRUE, exclude = NULL,
          constant.weights = NULL, likelihood = FALSE)
The various parameters of the neuralnet function are described in the following table:
Parameter Description
formula This is the formula to converge.
data This is the data matrix of predictor values.
hidden This is the number of hidden neurons in each layer.
stepmax This is the maximum number of steps in each repetition. The default is 1e+05.
rep This is the number of repetitions. Let's generate the neural network as follows: > nn <- neuralnet(selector~mcv+alkphos+alamine+aspartate+glutamyl+drinks, data=bupa, linear.output=FALSE, hidden=2) We can see how the model was developed via the result.matrix variable in the following output: > nn$result.matrix                                      1 error                 100.005904355153 reached.threshold       0.005904330743 steps                 43.000000000000 Intercept.to.1layhid1   0.880621509705 mcv.to.1layhid1       -0.496298308044 alkphos.to.1layhid1     2.294158313786 alamine.to.1layhid1     1.593035613921 aspartate.to.1layhid1 -0.407602506759 glutamyl.to.1layhid1   -0.257862634340 drinks.to.1layhid1     -0.421390527261 Intercept.to.1layhid2   0.806928998059 mcv.to.1layhid2       -0.531926150470 alkphos.to.1layhid2     0.554627946150 alamine.to.1layhid2     1.589755874579 aspartate.to.1layhid2 -0.182482440722 glutamyl.to.1layhid2   1.806513419058 drinks.to.1layhid2     0.215346602241 Intercept.to.selector   4.485455617018 1layhid.1.to.selector   3.328527160621 1layhid.2.to.selector   2.616395644587 The process took 43 steps to come up with the neural network once the threshold was under 0.01 (0.005 in this case). You can see the relationships between the predictor values. Looking at the network developed, we can see the hidden layers of relationship among the predictor variables. For example, sometimes mcv combines at one ratio and on other times at another ratio, depending on its value. Let's load the neural network as follows: > plot(nn) Instance-based learning R programming has a nearest neighbor algorithm (k-NN). The k-NN algorithm takes the predictor values and organizes them so that a new observation is applied to the organization developed and the algorithm selects the result (prediction) that is most applicable based on nearness of the predictor values in the new observation. The nearest neighbor function is knn. The knn function call looks like this: knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE) The various parameters of the knn function are described in the following table: Parameter Description train This is the training data. test This is the test data. cl This is the factor of true classifications. k This is the Number of neighbors to consider. l This is the minimum vote for a decision. prob This is a Boolean flag to return proportion of winning votes. use.all This is a Boolean variable for tie handling. TRUE means use all votes of max distance I am using the auto MPG dataset in the example of using knn. First, we load the dataset : > data <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", na.string="?") > colnames(data) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model.year","origin","car.name") > summary(data)      mpg         cylinders     displacement     horsepower Min.   : 9.00  Min.   :3.000   Min.   : 68.0   150   : 22 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90     : 20 Median :23.00   Median :4.000   Median :148.5   88     : 19 Mean   :23.51   Mean   :5.455   Mean   :193.4   110   : 18 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100   : 17 Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14                                                  (Other):288      weight     acceleration     model.year       origin     Min.   :1613   Min. : 8.00   Min.   :70.00   Min.   
:1.000 1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000 Median :2804   Median :15.50   Median :76.00   Median :1.000 Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573 3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000                                                                           car.name ford pinto   : 6 amc matador   : 5 ford maverick : 5 toyota corolla: 5 amc gremlin   : 4 amc hornet   : 4 (Other)       :369   There are close to 400 observations in the dataset. We need to split the data into a training set and a test set. We will use 75 percent for training. We use the createDataPartition function in the caret package to select the training rows. Then, we create a test dataset and a training dataset using the partitions as follows: > library(caret) > training <- createDataPartition(data$mpg, p=0.75, list=FALSE) > trainingData <- data[training,] > testData <- data[-training,] > model <- knn(train=trainingData, test=testData, cl=trainingData$mpg) NAs introduced by coercion The error message means that some numbers in the dataset have a bad format. The bad numbers were automatically converted to NA values. Then the inclusion of the NA values caused the function to fail, as NA values are not expected in this function call. First, there are some missing items in the dataset loaded. We need to eliminate those NA values as follows: > completedata <- data[complete.cases(data),] After looking over the data several times, I guessed that the car name fields were being parsed as numerical data when there was a number in the name, such as Buick Skylark 320. I removed the car name column from the test and we end up with the following valid results; > drops <- c("car.name") > completeData2 <- completedata[,!(names(completedata) %in% drops)] > training <- createDataPartition(completeData2$mpg, p=0.75, list=FALSE) > trainingData <- completeData2[training,] > testData <- completeData2[-training,] > model <- knn(train=trainingData, test=testData, cl=trainingData$mpg) We can see the results of the model by plotting using the following command. However, the graph doesn't give us much information to work on. > plot(model) We can use a different kknn function to compare our model with the test data. I like this version a little better as you can plainly specify the formula for the model. Let's use the kknn function as follows: > library(kknn) > model <- kknn(formula = formula(mpg~.), train = trainingData, test = testData, k = 3, distance = 1) > fit <- fitted(model) > plot(testData$mpg, fit) > abline(a=0, b=1, col=3) I added a simple slope to highlight how well the model fits the training data. It looks like as we progress to higher MPG values, our model has a higher degree of variance. I think that means we are missing predictor variables, especially for the later model, high MPG series of cars. That would make sense as government mandate and consumer demand for high efficiency vehicles changed the mpg for vehicles. Here is the graph generated by the previous code: Ensemble learning Ensemble learning is the process of using multiple learning methods to obtain better predictions. For example, we could use a regression and k-NN, combine the results, and end up with a better prediction. We could average the results of both or provide heavier weight towards one or another of the algorithms, whichever appears to be a better predictor. 
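To make this concrete, here is a small sketch that reuses the trainingData and testData objects from the k-NN example above, averages the predictions of a linear model and a k-NN model, and compares the error of each approach on the held-out data. The equal 0.5/0.5 weights and the rmse helper function are arbitrary choices made for this sketch; they are not part of the original example.

library(kknn)
# Two individual models built from the same training data
lm_fit  <- lm(mpg ~ ., data = trainingData)
knn_fit <- kknn(formula = mpg ~ ., train = trainingData, test = testData, k = 3)
lm_pred  <- predict(lm_fit, newdata = testData)
knn_pred <- fitted(knn_fit)
# Combine the two sets of predictions with a simple (arbitrary) 50/50 weighting
ensemble_pred <- 0.5 * lm_pred + 0.5 * knn_pred
# Compare root mean squared error on the held-out data
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
c(lm = rmse(testData$mpg, lm_pred),
  knn = rmse(testData$mpg, knn_pred),
  ensemble = rmse(testData$mpg, ensemble_pred))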
Support vector machines We covered support vector machines (SVM), but I will run through an example here. As a reminder, SVM is concerned with binary data. We will use the spam dataset from Hewlett Packard (part of the kernlab package). First, let's load the data as follows: > library(kernlab) > data("spam") > summary(spam)      make           address           all             num3d         Min.   :0.0000   Min.   : 0.000   Min.   :0.0000   Min.   : 0.00000 1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 0.00000 Median :0.0000   Median : 0.000   Median :0.0000   Median : 0.00000 Mean   :0.1046   Mean   : 0.213   Mean   :0.2807   Mean   : 0.06542 3rd Qu.:0.0000   3rd Qu.: 0.000   3rd Qu.:0.4200   3rd Qu.: 0.00000 Max.   :4.5400   Max.   :14.280   Max.   :5.1000   Max.   :42.81000 … There are 58 variables with close to 5000 observations, as shown here: > table(spam$type) nonspam   spam    2788   1813 Now, we break up the data into a training set and a test set as follows: > index <- 1:nrow(spam) > testindex <- sample(index, trunc(length(index)/3)) > testset <- spam[testindex,] > trainingset <- spam[-testindex,] Now, we can produce our SVM model using the svm function. The svm function looks like this: svm(formula, data = NULL, ..., subset, na.action =na.omit, scale = TRUE) The various parameters of the svm function are described in the following table: Parameter Description formula This is the formula model data This is the dataset subset This is the subset of the dataset to be used na.action This contains what action to take with NA values scale This determines whether to scale the data Let's use the svm function to produce a SVM model as follows: > library(e1071) > model <- svm(type ~ ., data = trainingset, method = "C-classification", kernel = "radial", cost = 10, gamma = 0.1) > summary(model) Call: svm(formula = type ~ ., data = trainingset, method = "C-classification",    kernel = "radial", cost = 10, gamma = 0.1) Parameters:    SVM-Type: C-classification SVM-Kernel: radial        cost: 10      gamma: 0.1 Number of Support Vectors: 1555 ( 645 910 ) Number of Classes: 2 Levels: nonspam spam We can test the model against our test dataset and look at the results as follows: > pred <- predict(model, testset) > table(pred, testset$type) pred     nonspam spam nonspam     891 104 spam         38 500 Note, the e1071 package is not compatible with the current version of R. Given its usefulness I would expect the package to be updated to support the user base. So, using SVM, we have a 90 percent ((891+500) / (891+104+38+500)) accuracy rate of prediction. Bayesian learning With Bayesian learning, we have an initial premise in a model that is adjusted with new information. We can use the MCMCregress method in the MCMCpack package to use Bayesian regression on learning data and apply the model against test data. Let's load the MCMCpack package as follows: > install.packages("MCMCpack") > library(MCMCpack) We are going to be using the transplant data on transplants available at http://lib.stat.cmu.edu/datasets/stanford. (The dataset on the site is part of the web page, so I copied into a local CSV file.) The data shows expected transplant success factor, the actual transplant success factor, and the number of transplants over a time period. So, there is a good progression over time as to the success of the program. We can read the dataset as follows: > transplants <- read.csv("transplant.csv") > summary(transplants)    expected         actual       transplants   Min.   : 0.057   Min.   
: 0.000   Min.   : 1.00
 1st Qu.: 0.722   1st Qu.: 0.500   1st Qu.: 9.00
 Median : 1.654   Median : 2.000   Median : 18.00
 Mean   : 2.379   Mean   : 2.382   Mean   : 27.83
 3rd Qu.: 3.402   3rd Qu.: 3.000   3rd Qu.: 40.00
 Max.   :12.131   Max.   :18.000   Max.   :152.00
We use Bayesian regression against the data; note that we are modifying the model as we progress with new information using the MCMCregress function. The MCMCregress function looks like this:
MCMCregress(formula, data = NULL, burnin = 1000, mcmc = 10000,
   thin = 1, verbose = 0, seed = NA, beta.start = NA,
   b0 = 0, B0 = 0, c0 = 0.001, d0 = 0.001, sigma.mu = NA, sigma.var = NA,
   marginal.likelihood = c("none", "Laplace", "Chib95"), ...)
The various parameters of the MCMCregress function are described in the following table:
Parameter Description
formula This is the formula of the model
data This is the dataset to be used for the model
… These are the additional parameters for the function
Let's use Bayesian regression against the data as follows:
> model <- MCMCregress(expected ~ actual + transplants, data=transplants)
> summary(model)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:
               Mean     SD Naive SE Time-series SE
(Intercept) 0.00484 0.08394 0.0008394     0.0008388
actual     0.03413 0.03214 0.0003214     0.0003214
transplants 0.08238 0.00336 0.0000336     0.0000336
sigma2     0.44583 0.05698 0.0005698     0.0005857
2. Quantiles for each variable:
               2.5%     25%     50%     75%   97.5%
(Intercept) -0.15666 -0.05216 0.004786 0.06092 0.16939
actual     -0.02841 0.01257 0.034432 0.05541 0.09706
transplants 0.07574 0.08012 0.082393 0.08464 0.08890
sigma2       0.34777 0.40543 0.441132 0.48005 0.57228
The plot of the model shows the posterior distribution of each coefficient, as shown in the following graph. Look at this in contrast to a simple regression, which produces a single point estimate for each coefficient.
> plot(model)
Random forests
Random forests is an algorithm that constructs a multitude of decision trees over the data and combines their outputs (a majority vote for classification, an average for regression) to produce the final result. We can use the randomForest function in the randomForest package for this. The randomForest function looks like this:
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
The various parameters of the randomForest function are described in the following table:
Parameter Description
formula This is the formula of the model
data This is the dataset to be used
subset This is the subset of the dataset to be used
na.action This is the action to take with NA values
For an example of random forest, we will use the spam data, as in the section Support vector machines. First, let's install and load the package as follows:
> install.packages("randomForest")
> library(randomForest)
Now, we will generate the model with the following command (this may take a while):
> fit <- randomForest(type ~ ., data=spam)
Let's look at the results to see how it went:
> fit
Call:
 randomForest(formula = type ~ ., data = spam)
               Type of random forest: classification
                     Number of trees: 500
No.
of variables tried at each split: 7        OOB estimate of error rate: 4.48% Confusion matrix:         nonspam spam class.error nonspam   2713   75 0.02690100 spam       131 1682 0.07225593 We can look at the relative importance of the data variables in the final model, as shown here: > head(importance(fit))        MeanDecreaseGini make           7.967392 address       12.654775 all           25.116662 num3d           1.729008 our           67.365754 over           17.579765 Ordering the data shows a couple of the factors to be critical to the determination. For example, the presence of the exclamation character in the e-mail is shown as a dominant indicator of spam mail: charExclamation   256.584207 charDollar       200.3655348 remove           168.7962949 free              142.8084662 capitalAve       137.1152451 capitalLong       120.1520829 your             116.6134519 Unsupervised learning With unsupervised learning, we do not have a target variable. We have a number of predictor variables that we look into to determine if there is a pattern. We will go over the following unsupervised learning techniques: Cluster analysis Density estimation Expectation-maximization algorithm Hidden Markov models Blind signal separation Cluster analysis Cluster analysis is the process of organizing data into groups (clusters) that are similar to each other. For our example, we will use the wheat seed data available at http://www.uci.edu, as shown here: > wheat <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", sep="t") Let's look at the raw data: > head(wheat) X15.26 X14.84 X0.871 X5.763 X3.312 X2.221 X5.22 X1 1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1 2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 1 3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 1 4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 1 5 14.38 14.21 0.8951 5.386 3.312 2.462 4.956 1 6 14.69 14.49 0.8799 5.563 3.259 3.586 5.219 1 We need to apply column names so we can see the data better: > colnames(wheat) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "undefined") > head(wheat)    area perimeter compactness length width asymmetry groove undefined 1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956         1 2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825         1 3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805         1 4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175         1 5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956         1 6 14.69     14.49     0.8799 5.563 3.259     3.586 5.219         1 The last column is not defined in the data description, so I am removing it: > wheat <- subset(wheat, select = -c(undefined) ) > head(wheat)    area perimeter compactness length width asymmetry groove 1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956 2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825 3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805 4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175 5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956 6 14.69    14.49     0.8799 5.563 3.259     3.586 5.219 Now, we can finally produce the cluster using the kmeans function. 
The kmeans function looks like this: kmeans(x, centers, iter.max = 10, nstart = 1,        algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",                      "MacQueen"), trace=FALSE) The various parameters of the kmeans function are described in the following table: Parameter Description x This is the dataset centers This is the number of centers to coerce data towards … These are the additional parameters of the function Let's produce the cluster using the kmeans function: > fit <- kmeans(wheat, 5) Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1) Unfortunately, there are some rows with missing data, so let's fix this using the following command: > wheat <- wheat[complete.cases(wheat),] Let's look at the data to get some idea of the factors using the following command: > plot(wheat) If we try looking at five clusters, we end up with a fairly good set of clusters with an 85 percent fit, as shown here: > fit <- kmeans(wheat, 5) > fit K-means clustering with 5 clusters of sizes 29, 33, 56, 69, 15 Cluster means:      area perimeter compactness   length   width asymmetry   groove 1 16.45345 15.35310   0.8768000 5.882655 3.462517 3.913207 5.707655 2 18.95455 16.38879   0.8868000 6.247485 3.744697 2.723545 6.119455 3 14.10536 14.20143   0.8777750 5.480214 3.210554 2.368075 5.070000 4 11.94870 13.27000   0.8516652 5.229304 2.870101 4.910145 5.093333 5 19.58333 16.64600   0.8877267 6.315867 3.835067 5.081533 6.144400 Clustering vector: ... Within cluster sum of squares by cluster: [1] 48.36785 30.16164 121.63840 160.96148 25.81297 (between_SS / total_SS = 85.4 %) If we push to 10 clusters, the performance increases to 92 percent. Density estimation Density estimation is used to provide an estimate of the probability density function of a random variable. For this example, we will use sunspot data from Vincent arlbuck site. Not clear if sunspots are truly random. Let's load our data as follows: > sunspots <- read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.month.csv") > summary(sunspots)        X             time     sunspot.month   Min.   :   1   Min.   :1749   Min.   : 0.00 1st Qu.: 795   1st Qu.:1815   1st Qu.: 15.70 Median :1589   Median :1881   Median : 42.00 Mean   :1589   Mean   :1881   Mean   : 51.96 3rd Qu.:2383   3rd Qu.:1948   3rd Qu.: 76.40 Max.   :3177   Max.   :2014   Max.   :253.80 > head(sunspots) X     time sunspot.month 1 1 1749.000         58.0 2 2 1749.083         62.6 3 3 1749.167         70.0 4 4 1749.250         55.7 5 5 1749.333         85.0 6 6 1749.417        83.5 We will now estimate the density using the following command: > d <- density(sunspots$sunspot.month) > d Call: density.default(x = sunspots$sunspot.month) Data: sunspots$sunspot.month (3177 obs.); Bandwidth 'bw' = 7.916        x               y           Min.   :-23.75   Min.   :1.810e-07 1st Qu.: 51.58   1st Qu.:1.586e-04 Median :126.90   Median :1.635e-03 Mean   :126.90   Mean   :3.316e-03 3rd Qu.:202.22   3rd Qu.:5.714e-03 Max.   :277.55   Max.   :1.248e-02 A plot is very useful for this function, so let's generate one using the following command: > plot(d) It is interesting to see such a wide variation; maybe the data is pretty random after all. We can use the density to estimate additional periods as follows: > N<-1000 > sunspots.new <- rnorm(N, sample(sunspots$sunspot.month, size=N, replace=TRUE)) > lines(density(sunspots.new), col="blue") It looks like our density estimate is very accurate. 
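One way to go beyond the visual impression is to compare the simulated values against the original observations with a two-sample Kolmogorov-Smirnov test, and to overlay the estimated density on a histogram of the data. This is a quick check added here as a sketch; it is not part of the original example, and the ties in the sunspot counts will cause ks.test to issue a warning:

# Formal comparison of the simulated sample with the observed monthly counts
ks.test(sunspots$sunspot.month, sunspots.new)
# Visual check: histogram of the observed data with the estimated density overlaid
hist(sunspots$sunspot.month, breaks = 50, freq = FALSE,
     main = "Monthly sunspot numbers", xlab = "Sunspots per month")
lines(d, col = "red")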
Expectation-maximization Expectation-maximization (EM) is an unsupervised clustering approach that adjusts the data for optimal values. When using EM, we have to have some preconception of the shape of the data/model that will be targeted. This example reiterates the example on the Wikipedia page, with comments. The example tries to model the iris species from the other data points. Let's load the data as shown here: > iris <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data") > colnames(iris) <- c("SepalLength","SepalWidth","PetalLength","PetalWidth","Species") > modelName = "EEE" Each observation has sepal length, width, petal length, width, and species, as shown here: > head(iris) SepalLength SepalWidth PetalLength PetalWidth     Species 1         5.1       3.5         1.4       0.2 Iris-setosa 2         4.9       3.0         1.4       0.2 Iris-setosa 3         4.7       3.2         1.3       0.2 Iris-setosa 4         4.6       3.1         1.5       0.2 Iris-setosa 5         5.0       3.6         1.4       0.2 Iris-setosa 6         5.4       3.9         1.7       0.4 Iris-setosa We are estimating the species from the other points, so let's separate the data as follows: > data = iris[,-5] > z = unmap(iris[,5]) Let's set up our mstep for EM, given the data, categorical data (z) relating to each data point, and our model type name: > msEst <- mstep(modelName, data, z) We use the parameters defined in the mstep to produce our model, as shown here: > em(modelName, data, msEst$parameters) $z                [,1]         [,2]         [,3] [1,] 1.000000e+00 4.304299e-22 1.699870e-42 … [150,] 8.611281e-34 9.361398e-03 9.906386e-01 $parameters$pro [1] 0.3333333 0.3294048 0.3372619 $parameters$mean              [,1]     [,2]     [,3] SepalLength 5.006 5.941844 6.574697 SepalWidth 3.418 2.761270 2.980150 PetalLength 1.464 4.257977 5.538926 PetalWidth 0.244 1.319109 2.024576 $parameters$variance$d [1] 4 $parameters$variance$G [1] 3 $parameters$variance$sigma , , 1            SepalLength SepalWidth PetalLength PetalWidth SepalLength 0.26381739 0.09030470 0.16940062 0.03937152 SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280 PetalLength 0.16940062 0.05133876 0.18624355 0.04183377 PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165 , , 2 , , 3 … (there was little difference in the 3 sigma values) Covariance $parameters$variance$Sigma            SepalLength SepalWidth PetalLength PetalWidth SepalLength 0.26381739 0.09030470 0.16940062 0.03937152 SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280 PetalLength 0.16940062 0.05133876 0.18624355 0.04183377 PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165 $parameters$variance$cholSigma             SepalLength SepalWidth PetalLength PetalWidth SepalLength -0.5136316 -0.1758161 -0.32980960 -0.07665323 SepalWidth   0.0000000 0.2856706 -0.02326832 0.06072001 PetalLength   0.0000000 0.0000000 -0.27735855 -0.06477412 PetalWidth   0.0000000 0.0000000 0.00000000 0.16168899 attr(,"info") iterations       error 4.000000e+00 1.525131e-06 There is quite a lot of output from the em function. The highlights for me were the three sigma ranges were the same and the error from the function was very small. So, I think we have a very good estimation of species using just the four data points. Hidden Markov models The hidden Markov models (HMM) is the idea of observing data assuming it has been produced by a Markov model. The problem is to discover what that model is. I am using the Python example on Wikipedia for HMM. 
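The initHMM and forward functions used below come, to the best of my knowledge, from the HMM package on CRAN, so it needs to be installed and loaded first:

# The HMM functions used in this section live in the HMM package
install.packages("HMM")
library(HMM)

The same package also provides a viterbi function if you want to recover the most likely sequence of hidden states rather than the forward probabilities.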
For an HMM, we need states (assumed to be hidden from observer), symbols, transition matrix between states, emission (output) states, and probabilities for all. The Python information presented is as follows: states = ('Rainy', 'Sunny') observations = ('walk', 'shop', 'clean') start_probability = {'Rainy': 0.6, 'Sunny': 0.4} transition_probability = {    'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},    } emission_probability = {    'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},    'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},    } trans <- matrix(c('Rainy', : {'Rainy': 0.7, 'Sunny': 0.3},    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},    } We convert these to use in R for the initHmm function by using the following command: > hmm <- initHMM(c("Rainy","Sunny"), c('walk', 'shop', 'clean'), c(.6,.4), matrix(c(.7,.3,.4,.6),2), matrix(c(.1,.4,.5,.6,.3,.1),3)) > hmm $States [1] "Rainy" "Sunny" $Symbols [1] "walk" "shop" "clean" $startProbs Rainy Sunny 0.6   0.4 $transProbs        to from   Rainy Sunny Rainy   0.7   0.4 Sunny   0.3   0.6 $emissionProbs        symbols states walk shop clean Rainy 0.1 0.5   0.3 Sunny 0.4 0.6   0.1 The model is really a placeholder for all of the setup information needed for HMM. We can then use the model to predict based on observations, as follows: > future <- forward(hmm, c("walk","shop","clean")) > future        index states         1         2         3 Rainy -2.813411 -3.101093 -4.139551 Sunny -1.832581 -2.631089 -5.096193 The result is a matrix of probabilities. For example, it is more likely to be Sunny when we observe walk. Blind signal separation Blind signal separation is the process of identifying sources of signals from a mixed signal. Primary component analysis is one method of doing this. An example is a cocktail party where you are trying to listen to one speaker. For this example, I am using the decathlon dataset in the FactoMineR package, as shown here: > library(FactoMineR) > data(decathlon) Let's look at the data to get some idea of what is available: > summary(decathlon) 100m           Long.jump     Shot.put       High.jump Min.   :10.44   Min.   :6.61   Min.   :12.68   Min.   :1.850 1st Qu.:10.85   1st Qu.:7.03   1st Qu.:13.88   1st Qu.:1.920 Median :10.98   Median :7.30   Median :14.57   Median :1.950 Mean   :11.00   Mean   :7.26   Mean   :14.48   Mean   :1.977 3rd Qu.:11.14   3rd Qu.:7.48   3rd Qu.:14.97   3rd Qu.:2.040 Max.   :11.64   Max.   :7.96   Max.   :16.36   Max.   :2.150 400m           110m.hurdle       Discus       Pole.vault   Min.   :46.81   Min.   :13.97   Min.   :37.92   Min.   :4.200 1st Qu.:48.93   1st Qu.:14.21   1st Qu.:41.90   1st Qu.:4.500 Median :49.40   Median :14.48   Median :44.41   Median :4.800 Mean   :49.62   Mean   :14.61 Mean   :44.33   Mean   :4.762 3rd Qu.:50.30   3rd Qu.:14.98   3rd Qu.:46.07   3rd Qu.:4.920 Max.   :53.20   Max.   :15.67   Max.   :51.65   Max.   :5.400 Javeline       1500m           Rank           Points   Min.   :50.31   Min.   :262.1   Min.   : 1.00   Min.   :7313 1st Qu.:55.27   1st Qu.:271.0   1st Qu.: 6.00   1st Qu.:7802 Median :58.36   Median :278.1   Median :11.00   Median :8021 Mean   :58.32   Mean   :279.0   Mean   :12.12   Mean   :8005 3rd Qu.:60.89   3rd Qu.:285.1   3rd Qu.:18.00   3rd Qu.:8122 Max.   :70.52   Max.   :317.0   Max.   :28.00   Max.   
:8893    Competition Decastar:13 OlympicG:28 The output looks like performance data from a series of events at a track meet: > head(decathlon)        100m   Long.jump Shot.put High.jump 400m 110m.hurdle Discus SEBRLE 11.04     7.58   14.83     2.07 49.81       14.69 43.75 CLAY   10.76     7.40   14.26     1.86 49.37       14.05 50.72 KARPOV 11.02     7.30   14.77     2.04 48.37       14.09 48.95 BERNARD 11.02     7.23   14.25     1.92 48.93       14.99 40.87 YURKOV 11.34     7.09   15.19     2.10 50.42       15.31 46.26 WARNERS 11.11     7.60   14.31     1.98 48.68       14.23 41.10        Pole.vault Javeline 1500m Rank Points Competition SEBRLE       5.02   63.19 291.7   1   8217   Decastar CLAY         4.92   60.15 301.5   2   8122   Decastar KARPOV       4.92   50.31 300.2   3   8099   Decastar BERNARD       5.32   62.77 280.1   4   8067   Decastar YURKOV       4.72   63.44 276.4   5   8036   Decastar WARNERS       4.92   51.77 278.1   6   8030   Decastar Further, this is performance of specific individuals in track meets. We run the PCA function by passing the dataset to use, whether to scale the data or not, and the type of graphs: > res.pca = PCA(decathlon[,1:10], scale.unit=TRUE, ncp=5, graph=T) This produces two graphs: Individual factors map Variables factor map The individual factors map lays out the performance of the individuals. For example, we see Karpov who is high in both dimensions versus Bourginon who is performing badly (on the left in the following chart): The variables factor map shows the correlation of performance between events. For example, doing well in the 400 meters run is negatively correlated with the performance in the long jump; if you did well in one, you likely did well in the other as well. Here is the variables factor map of our data: Questions Factual Which supervised learning technique(s) do you lean towards as your "go to" solution? Why are the density plots for Bayesian results off-center? When, how, and why? How would you decide on the number of clusters to use? Find a good rule of thumb to decide the number of hidden layers in a neural net. Challenges Investigate other blind signal separation techniques, such as ICA. Use other methods, such as poisson, in the rpart function (especially if you have a natural occurring dataset). Summary In this article, we looked into various methods of machine learning, including both supervised and unsupervised learning. With supervised learning, we have a target variable we are trying to estimate. With unsupervised, we only have a possible set of predictor variables and are looking for patterns. In supervised learning, we looked into using a number of methods, including decision trees, regression, neural networks, support vector machines, and Bayesian learning. In unsupervised learning, we used cluster analysis, density estimation, hidden Markov models, and blind signal separation. Resources for Article: Further resources on this subject: Machine Learning in Bioinformatics [article] Data visualization [article] Introduction to S4 Classes [article]

Packt
19 Dec 2014
9 min read
Save for later

Navigation Mesh Generation

In this article by Curtis Bennett and Dan Violet Sagmiller, authors of the book Unity AI Programming Essentials, we will learn about navigation meshes in Unity. Navigation mesh generation controls how AI characters are able to travel around a game level and is one of the most important topics in game AI. In this article, we will provide an overview of navigation meshes and look at the algorithm for generating them. Then, we'll look at different options of customizing our navigation meshes better. To do this, we will be using RAIN 2.1.5, a popular AI plugin for Unity by Rival Theory, available for free at http://rivaltheory.com/rain/download/. In this article, you will learn about: How navigation mesh generation works and the algorithm behind it Advanced options for customizing navigation meshes Creating advanced navigation meshes with RAIN (For more resources related to this topic, see here.) An overview of a navigation mesh To use navigation meshes, also referred to as NavMeshes, effectively the first things we need to know are what exactly navigation meshes are and how they are created. A navigation mesh is a definition of the area an AI character could travel to in a level. It is a mesh, but it is not intended to be rendered or seen by the player, instead it is used by the AI system. A NavMesh usually does not cover all the area in a level (if it did we wouldn't need one) since it's just the area a character can walk. The mesh is also almost always a simplified version of the geometry. For instance, you could have a cave floor in a game with thousands of polygons along the bottom showing different details in the rock, but for the navigation mesh the areas would just be a handful of very large polys giving a simplified view of the level. The purpose of navigation mesh is to provide this simplified representation to the rest of the AI system a way to find a path between two points on a level for a character. This is its purpose; let's discuss how they are created. It used to be a common practice in the games industry to create navigation meshes manually. A designer or artist would take the completed level geometry and create one using standard polygon mesh modelling tools and save it out. As you might imagine, this allowed for nice, custom, efficient meshes, but was also a big time sink, since every time the level changed the navigation mesh would need to be manually edited and updated. In recent years, there has been more research in automatic navigation mesh generation. There are many approaches to automatic navigation mesh generation, but the most popular is Recast, originally developed and designed by Mikko Monomen. Recast takes in level geometry and a set of parameters defining the character, such as the size of the character and how big of steps it can take, and then does a multipass approach to filter and create the final NavMesh. The most important phase of this is voxelizing the level based on an inputted cell size. This means the level geometry is divided into voxels (cubes) creating a version of the level geometry where everything is partitioned into different boxes called cells. Then the geometry in each of these cells is analyzed and simplified based on its intersection with the sides of the boxes and is culled based on things such as the slope of the geometry or how big a step height is between geometry. This simplified geometry is then merged and triangulated to make a final navigation mesh that can be used by the AI system. 
The source code and more information on the original C++ implementation of Recast is available at https://github.com/memononen/recastnavigation. Advanced NavMesh parameters Now that we understand how navigation mesh generations works, let's look at the different parameters you can set to generate them in more detail. We'll look at how to do these with RAIN: Open Unity and create a new scene and a floor and some blocks for walls. Download RAIN from http://rivaltheory.com/rain/download/ and import it into your scene. Then go to RAIN | Create Navigation Mesh. Also right-click on the RAIN menu and choose Show Advanced Settings. The setup should look something like the following screenshot: Now let's look at some of the important parameters: Size: This is the overall size of the navigation mesh. You'll want the navigation mesh to cover your entire level and use this parameter instead of trying to scale up the navigation mesh through the Scale transform in the Inspector window. For our demo here, set the Size parameter to 20. Walkable Radius: This is an important parameter to define the character size of the mesh. Remember, each mesh will be matched to the size of a particular character, and this is the radius of the character. You can visualize the radius for a character by adding a Unity Sphere Collider script to your object (by going to Component | Physics | Sphere Collider) and adjusting the radius of the collider. Cell Size: This is also a very important parameter. During the voxel step of the Recast algorithm, this sets the size of the cubes to inspect the geometry. The smaller the size, the more detailed and finer mesh, but longer the processing time for Recast. A large cell size makes computation fast but loses detail. For example, here is a NavMesh from our demo with a cell size of 0.01: You can see the finer detail here. Here is the navigation mesh generated with a cell size of 0.1: Note the difference between the two screenshots. In the former, walking through the two walls lower down in our picture is possible, but in the latter with a larger cell size, there is no path even though the character radius is the same. Problems like this become greater with larger cell sizes. The following is a navigation mesh with a cell size of 1: As you can see, the detail becomes jumbled and the mesh itself becomes unusable. With such differing results, the big question is how large should a cell size be for a level? The answer is that it depends on the required result. However, one important consideration is that as the processing time to generate one is done during development and not at runtime even if it takes several minutes to generate a good mesh, it can be worth it to get a good result in the game. Setting a small cell size on a large level can cause mesh processing to take a significant amount of time and consume a lot of memory. It is a good practice to save the scene before attempting to generate a complex navigation mesh. The Size, Walkable Radius, and Cell Size parameters are the most important parameters when generating the navigation mesh, but there are more that are used to customize the mesh further: Max Slope: This is the largest slope that a character can walk on. This is how much a piece of geometry that is tilted can still be walked on. If you take the wall and rotate it, you can see it is walkable: The preceding is a screenshot of a walkable object with slope. Step Height: This is how high a character can step from one object to another. 
For example, if you have steps between two blocks, as shown in the following screenshot, this would define how far in height the blocks can be apart and whether the area is still considered walkable: This is a screenshot of the navigation mesh with step height set to connect adjacent blocks. Walkable Height: This is the vertical height that is needed for the character to walk. For example, in the previous illustration, the second block is not walkable underneath because of the walkable height. If you raise it to a least one unit off the ground and set the walkable height to 1, the area underneath would become walkable:   You can see a screenshot of the navigation mesh with walkable height set to allow going under the higher block. These are the most important parameters. There are some other parameters related to the visualization and to cull objects. We will look at culling more in the next section. Culling areas Being able to set up areas as walkable or not is an important part of creating a level. To demo this, let's divide the level into two parts and create a bridge between the two. Take our demo and duplicate the floor and pull it down. Then transform one of the walls to a bridge. Then, add two other pieces of geometry to mark areas that are dangerous to walk on, like lava. Here is an example setup: This is a basic scene with a bridge to cross. If you recreate the navigation mesh now, all of the geometry will be covered and the bridge won't be recognized. To fix this, you can create a new tag called Lava and tag the geometry under the bridge with it. Then, in the navigation meshes' RAIN component, add Lava to the unwalkable tags. If you then regenerate the mesh, only the bridge is walkable. This is a screenshot of a navigation mesh areas under bridge culled: Using layers and the walkable tag you can customize navigation meshes. Summary Navigation meshes are an important part of game AI. In this article, we looked at the different parameters to customize navigation meshes. We looked at things such as setting the character size and walkable slopes and discussed the importance of the cell size parameter. We then saw how to customize our mesh by tagging different areas as not walkable. This should be a good start for designing navigation meshes for your games. Resources for Article: Further resources on this subject: Components in Unity [article] Enemy and Friendly AIs [article] Introduction to AI [article]

Packt
17 Dec 2014
24 min read
Save for later

Mastering Splunk: Lookups

In this article, by James Miller, author of the book Mastering Splunk, we will discuss Splunk lookups and workflows. The topics that will be covered in this article are as follows: The value of a lookup Design lookups File lookups Script lookups (For more resources related to this topic, see here.) Lookups Machines constantly generate data, usually in a raw form that is most efficient for processing by machines, but not easily understood by "human" data consumers. Splunk has the ability to identify unique identifiers and/or result or status codes within the data. This gives you the ability to enhance the readability of the data by adding descriptions or names as new search result fields. These fields contain information from an external source such as a static table (a CSV file) or the dynamic result of a Python command or a Python-based script. Splunk's lookups can use information within returned events or time information to determine how to add other fields from your previously defined external data sources. To illustrate, here is an example of a Splunk static lookup that: Uses the Business Unit value in an event Matches this value with the organization's business unit name in a CSV file Adds the definition to the event (as the Business Unit Name field) So, if you have an event where the Business Unit value is equal to 999999, the lookup will add the Business Unit Name value as Corporate Office to that event. More sophisticated lookups can: Populate a static lookup table from the results of a report. Use a Python script (rather than a lookup table) to define a field. For example, a lookup can use a script to return a server name when given an IP address. Perform a time-based lookup if your lookup table includes a field value that represents time. Let's take a look at an example of a search pipeline that creates a table based on IBM Cognos TM1 file extractions: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200" as "Activity" | eval RFCST= round(FCST) |Table Month, "Business Unit", RFCST The following table shows the results generated:   Now, add the lookup command to our search pipeline to have Splunk convert Business Unit into Business Unit Name: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200"as "Activity" | eval RFCST= round(FCST) |lookup BUtoBUName BU as "Business Unit" OUTPUT BUName as "Business Unit Name" | Table Month, "Business Unit", "Business Unit Name", RFCST The lookup command in our Splunk search pipeline will now add Business Unit Name in the results table:   Configuring a simple field lookup In this section, we will configure a simple Splunk lookup. Defining lookups in Splunk Web You can set up a lookup using the Lookups page (in Splunk Web) or by configuring stanzas in the props.conf and transforms.conf files. Let's take the easier approach first and use the Splunk Web interface. Before we begin, we need to establish our lookup table that will be in the form of an industry standard comma separated file (CSV). Our example is one that converts business unit codes to a more user-friendly business unit name. 
For example, we have the following information: Business unit code Business unit name 999999 Corporate office VA0133SPS001 South-western VA0133NLR001 North-east 685470NLR001 Mid-west In the events data, only business unit codes are included. In an effort to make our Splunk search results more readable, we want to add the business unit name to our results table. To do this, we've converted our information (shown in the preceding table) to a CSV file (named BUtoBUName.csv):   For this example, we've kept our lookup table simple, but lookup tables (files) can be as complex as you need them to be. They can have numerous fields (columns) in them. A Splunk lookup table has a few requirements, as follows: A table must contain a minimum of two columns Each of the columns in the table can have duplicate values You should use (plain) ASCII text and not non-UTF-8 characters Now, from Splunk Web, we can click on Settings and then select Lookups:   From the Lookups page, we can select Lookup table files:   From the Lookup table files page, we can add our new lookup file (BUtoBUName.csv):   By clicking on the New button, we see the Add new page where we can set up our file by doing the following: Select a Destination app (this is a drop-down list and you should select Search). Enter (or browse to) our file under Upload a lookup file. Provide a Destination filename. Then, we click on Save:   Once you click on Save, you should receive the Successfully saved "BUtoBUName" in search" message:   In the previous screenshot, the lookup file is saved by default as private. You will need to adjust permissions to allow other Splunk users to use it. Going back to the Lookups page, we can select Lookup definitions to see the Lookup definitions page:   In the Lookup definitions page, we can click on New to visit the Add new page (shown in the following screenshot) and set up our definition as follows: Destination app: The lookup will be part of the Splunk search app Name: Our file is BUtoBUName Type: Here, we will select File-based Lookup file: The filename is ButoBUName.csv, which we uploaded without the .csv suffix Again, we should see the Successfully saved "BUtoBUName" in search message:   Now, our lookup is ready to be used: Automatic lookups Rather than having to code for a lookup in each of your Splunk searches, you have the ability to configure automatic lookups for a particular source type. To do this from Splunk Web, we can click on Settings and then select Lookups:   From the Lookups page, click on Automatic lookups:   In the Automatic lookups page, click on New:   In the Add New page, we will fill in the required information to set up our lookup: Destination app: For this field, some options are framework, launcher, learned, search, and splunk_datapreview (for our example, select search). Name: This provide a user-friendly name that describes this automatic lookup. Lookup table: This is the name of the lookup table you defined with a CSV file (discussed earlier in this article). Apply to: This is the type that you want this automatic lookup to apply to. The options are sourcetype, source, or host (I've picked sourcetype). Named: This is the name of the type you picked under Apply to. I want my automatic search to apply for all searches with the sourcetype of csv. Lookup input fields: This is simple in my example. In my lookup table, the field to be searched on will be BU and the = field value will be the field in the event results that I am converting; in my case, it was the field 650693NLR001. 
- Lookup output fields: This will be the field in the lookup table that I am using to convert to, which in my example is BUName, and I want to call it Business Unit Name, so this becomes the = field value.
- Overwrite field values: This is a checkbox where you can tell Splunk to overwrite existing values in your output fields (I checked it).

The Add new page

The Splunk Add new page (shown in the following screenshot) is where you enter the lookup information (detailed in the previous section):

Once you have entered your automatic lookup information, you can click on Save and you will receive the Successfully saved "Business Unit to Business Unit Name" in search message:

Now, we can use the lookup in a search. For example, you can run a search with sourcetype=csv, as follows:

sourcetype=csv 2014 "Current Forecast" "Direct" "513500" | rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200" as "Activity" | eval RFCST= round(FCST) | Table "Business Unit", "Business Unit Name", Month, RFCST

Notice in the following screenshot that Business Unit Name is converted to the user-friendly values from our lookup table, and we didn't have to add the lookup command to our search pipeline:

Configuration files

In addition to using the Splunk web interface, you can define and configure lookups using the following files:

- props.conf
- transforms.conf

To set up a lookup with these files (rather than using Splunk web), we can perform the following steps:

Edit transforms.conf to define the lookup table. The first step is to edit the transforms.conf configuration file to add the new lookup reference. Although the file exists in the Splunk default folder ($SPLUNK_HOME/etc/system/default), you should edit the file in $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/ (if the file doesn't exist here, create it).

Whenever you edit a Splunk .conf file, always edit a local version, keeping the original (system directory version) intact.

In the current version of Splunk, there are two types of lookup tables: static and external. Static lookups use CSV files, and external (which are dynamic) lookups use Python scripting. You have to decide if your lookup will be static (in a file) or dynamic (using script commands). If you are using a file, you'll use filename; if you are going to use a script, you use external_cmd (both will be set in the transforms.conf file). You can also limit the number of matching entries to apply to an event by setting the max_matches option (this tells Splunk to use the first <integer> entries, in file order). I've decided to leave the default for max_matches, so my transforms.conf file looks like the following:

[butobugroup]
filename = butobugroup.csv

Next, and this step is optional, edit props.conf to apply your lookup table automatically. For both static and external lookups, you stipulate the fields you want to match in the configuration file and the output from the lookup table that you defined in your transforms.conf file. It is okay to have multiple field lookups defined in one source lookup definition, but each lookup should have its own unique lookup name; for example, if you have multiple tables, you can name them LOOKUP-table01, LOOKUP-table02, and so on, or something perhaps more easily understood.
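For instance, a single source type stanza in props.conf carrying two lookups under this naming convention might look like the following sketch; the class names LOOKUP-table01 and LOOKUP-table02 are arbitrary labels chosen here for illustration, and the field mappings simply reuse the BUtoBUName and butobugroup tables from this article:

[csv]
LOOKUP-table01 = BUtoBUName BU AS 650693NLR001 OUTPUT BUName
LOOKUP-table02 = butobugroup bu AS 650693NLR001 OUTPUT bugroup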
If you add a lookup to your props.conf file, this lookup is automatically applied to all events from searches that have matching source types (again, as mentioned earlier, if your automatic lookup is very slow, it will also impact the speed of your searches).

Restart Splunk to see your changes.

Implementing a lookup using configuration files – an example

To illustrate the use of configuration files in order to implement an automatic lookup, let's use a simple example. Once again, we want to convert a field from a unique identification code for an organization's business unit to a more user-friendly descriptive name called BU Group. What we will do is match the field bu in a lookup table butobugroup.csv with a field in our events. Then, add the bugroup (description) to the returned events. The following shows the contents of the butobugroup.csv file:

bu, bugroup
999999, leadership-group
VA0133SPS001, executive-group
650914FAC002, technology-group

You can put this file into $SPLUNK_HOME/etc/apps/<app_name>/lookups/ and carry out the following steps:

1. Put the butobugroup.csv file into $SPLUNK_HOME/etc/apps/search/lookups/, since we are using the search app.
2. As we mentioned earlier, we edit the transforms.conf file located at either $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/. We add the following two lines:

[butobugroup]
filename = butobugroup.csv

3. Next, as mentioned earlier in this article, we edit the props.conf file located at either $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/. Here, we add the following two lines:

[csv]
LOOKUP-check = butobugroup bu AS 650693NLR001 OUTPUT bugroup

4. Restart the Splunk server. You can (assuming you are logged in as an admin or have admin privileges) restart the Splunk server through the web interface by going to Settings, then selecting System and finally Server controls.

Now, you can run a search for sourcetype=csv (as shown here):

sourcetype=csv 2014 "Current Forecast" "Direct" "513500" | rename May as "Month", 650693NLR001 as "Business Unit" 100000 as "FCST" | eval RFCST= round(FCST) | Table "Business Unit", "Business Unit Name", bugroup, Month, RFCST

You will see that the field bugroup can be returned as part of your event results:

Populating lookup tables

Of course, you can create CSV files from external systems (or perhaps even manually), but from time to time, you might have the opportunity to create lookup CSV files (tables) from event data using Splunk. A handy command to accomplish this is outputcsv (which is covered in detail later in this article). The following is a simple example of creating a CSV file from Splunk event data that can be used for a lookup table:

sourcetype=csv "Current Forecast" "Direct" | rename 650693NLR001 as "Business Unit" | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master

The results are shown in the following screenshot:

Of course, the output table isn't quite usable, since the results have duplicates.
Therefore, we can rewrite the Splunk search pipeline, introducing the dedup command (as shown here):

sourcetype=csv "Current Forecast" "Direct" | rename 650693NLR001 as "Business Unit" | dedup "Business Unit" | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master

Then, we can examine the results (now with more desirable results):

Handling duplicates with dedup

This command allows us to set the number of duplicate events to be kept based on the values of a field (in other words, we can use this command to drop duplicates from our event results for a selected field). The event returned for the dedup field will be the first event found (if you provide a number directly after the dedup command, it will be interpreted as the number of duplicate events to keep; if you don't specify a number, dedup keeps only the first occurring event and removes all consecutive duplicates).

The dedup command also lets you sort by a field or a list of fields. This will remove all the duplicates and then sort the results based on the specified sort-by field. Adding a sort in conjunction with the dedup command can affect the performance, as Splunk performs the dedup operation and then sorts the results as a final step. Here is a search command using dedup:

sourcetype=csv "Current Forecast" "Direct" | rename 650693NLR001 as "Business Unit" | dedup "Business Unit" sortby bugroup | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master

The result of the preceding command is shown in the following screenshot:

Now, we have our CSV lookup file (outputcsv splunk_master) generated and ready to be used:

Look for your generated output file in $SPLUNK_HOME/var/run/splunk.

Dynamic lookups

With a Splunk static lookup, your search reads through a file (a table) that was created or updated prior to executing the search. With dynamic lookups, the file is created at the time the search executes. This is possible because Splunk has the ability to execute an external command or script as part of your Splunk search. At the time of writing this book, Splunk only directly supports Python scripts for external lookups.

If you are not familiar with Python, its implementation began in 1989, and it is a widely used general-purpose, high-level programming language, which is often used as a scripting language (but is also used in a wide range of non-scripting contexts).

Keep in mind that any external resources (such as a file) or scripts that you want to use with your lookup will need to be copied to a location where Splunk can find them. These locations are:

- $SPLUNK_HOME/etc/apps/<app_name>/bin
- $SPLUNK_HOME/etc/searchscripts

The following sections describe the process of using the dynamic lookup example script that ships with Splunk (external_lookup.py).

Using Splunk Web

Just like with static lookups, Splunk makes it easy to define a dynamic or external lookup using the Splunk web interface. First, click on Settings and then select Lookups:

On the Lookups page, we can select Lookup table files to define a CSV file that contains the input file for our Python script. In the Add new page, we enter the following information:

- Destination app: For this field, select Search
- Upload a lookup file: Here, you can browse to the filename (my filename is dnsLookup.csv)
- Destination filename: Here, enter dnslookup

The Add new page is shown in the following screenshot:

Now, click on Save.
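Before defining the lookup itself, it may help to see what this input file can look like. The following is a minimal sketch of a dnsLookup.csv file; at its simplest, the file only needs a header row naming the two fields the script works with, and the host row shown here is a purely hypothetical placeholder:

host,ip
webserver01.example.com,

The actual requirements for this file are described next.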
The lookup file (shown in the following screenshot) is a text CSV file that needs to (at a minimum) contain the two field names that the Python (py) script accepts as arguments, in this case, host and ip. As mentioned earlier, this file needs to be copied to $SPLUNK_HOME/etc/apps/<app_name>/bin.

Next, from the Lookups page, select Lookup definitions and then click on New. This is where you define your external lookup. Enter the following information:

- Type: For this, select External (as this lookup will run an external script)
- Command: For this, enter external_lookup.py host ip (this is the name of the py script and its two arguments)
- Supported fields: For this, enter host, ip (this indicates the two script input field names)

The following screenshot describes a new lookup definition:

Now, click on Save.

Using configuration files instead of Splunk Web

Again, just like with static lookups in Splunk, dynamic lookups can also be configured in the Splunk transforms.conf file:

[myLookup]
external_cmd = external_lookup.py host ip
external_type = python
fields_list = host, ip
max_matches = 200

Let's learn more about the terms here:

- [myLookup]: This is the report stanza.
- external_cmd: This is the actual runtime command definition. Here, it executes the Python (py) script external_lookup, which requires two arguments (or parameters), host and ip.
- external_type (optional): This indicates that this is a Python script. Although this is an optional entry in the transforms.conf file, it's a good habit to include this for readability and support.
- fields_list: This lists all the fields supported by the external command or script, delimited by a comma and space.

The next step is to modify the props.conf file, as follows:

[mylookup]
LOOKUP-rdns = dnslookup host ip OUTPUT ip

After updating the Splunk configuration files, you will need to restart Splunk.

External lookups

The external lookup example given uses a Python (py) script named external_lookup.py, which is a DNS lookup script that can return an IP address for a given host name or a host name for a provided IP address.

Explanation

The lookup table field in this example is named ip, so Splunk will mine all of the IP addresses found in the indexed logs' events and add the values of ip from the lookup table into the ip field in the search events. We can notice the following:

- If you look at the py script, you will notice that the example uses an MS Windows supported socket.gethostbyname_ex(host) function
- The host field has the same name in the lookup table and the events, so you don't need to do anything else

Consider the following search command:

sourcetype=tm1* | lookup dnslookup host | table host, ip

When you run this command, Splunk uses the lookup table to pass the values for the host field as a CSV file (the text CSV file we looked at earlier) into the external command script. The py script then outputs the results (with both the host and ip fields populated) and returns them to Splunk, which populates the ip field in a result table:

Output of the py script with both the host and ip fields populated

Time-based lookups

If your lookup table has a field value that represents time, you can use the time field to set up a Splunk fields lookup. As mentioned earlier, the Splunk transforms.conf file can be modified to add a lookup stanza.
For example, the following screenshot shows a file named MasteringDCHP.csv:

You can add the following code to the transforms.conf file:

[MasteringDCHP]
filename = MasteringDCHP.csv
time_field = TimeStamp
time_format = %d/%m/%y %H:%M:%S %p
max_offset_secs = <integer>
min_offset_secs = <integer>

The file parameters are defined as follows:

- [MasteringDCHP]: This is the report stanza
- filename: This is the name of the CSV file to be used as the lookup table
- time_field: This is the field in the file that contains the time information and is to be used as the timestamp
- time_format: This indicates what format the time field is in
- max_offset_secs and min_offset_secs: These indicate the minimum/maximum amount of offset time for an event to occur after a lookup entry

Be careful with the preceding values; the offset relates to the timestamp in your lookup (CSV) file. Setting a tight (small) offset range might reduce the effectiveness of your lookup results!

The last step will be to restart Splunk.

An easier way to create a time-based lookup

Again, it's a lot easier to use the Splunk Web interface to set up our lookup. Here is the step-by-step process:

1. From Settings, select Lookups, and then Lookup table files.
2. In the Lookup table files page, click on New, configure our lookup file, and then click on Save. You should receive the Successfully saved "MasterDHCP" in search message.
3. Next, select Lookup definitions and from this page, click on New.
4. In the Add new page, we define our lookup table with the following information:
- Destination app: For this, select search from the drop-down list
- Name: For this, enter MasterDHCP (this is the name you'll use in your lookup)
- Type: For this, select File-based (as this lookup table definition is a CSV file)
- Lookup file: For this, select the name of the file to be used from the drop-down list (ours is MasteringDCHP)
- Configure time-based lookup: Check this checkbox
- Name of time field: For this, enter TimeStamp (this is the field name in our file that contains the time information)
- Time format: For this, enter the string to describe to Splunk the format of our time field (our field uses this format: %d%m%y %H%M%S)
5. You can leave the rest blank and click on Save. You should receive the Successfully saved "MasterDHCP" in search message.

Now, we are ready to try our search:

sourcetype=dh* | Lookup MasterDHCP IP as "IP" | table DHCPTimeStamp, IP, UserId | sort UserId

The following screenshot shows the output:

Seeing double?

Lookup table definitions are indicated with the attribute LOOKUP-<class> in the Splunk configuration file, props.conf, or in the web interface under Settings | Lookups | Lookup definitions. If you use the Splunk Web interface (which we've demonstrated throughout this article) to set up or define your lookup table definitions, Splunk will prevent you from creating duplicate table names, as shown in the following screenshot:

However, if you define your lookups using the configuration settings, it is important to try and keep your table definition names unique. If you do give the same name to multiple lookups, the following rules apply:

- If you have defined lookups with the same stanza (that is, using the same host, source, or source type), the first defined lookup in the configuration file wins and overrides all others.
- If lookups have different stanzas but overlapping events, the following logic is used by Splunk: events that match the host get the host lookup, events that match the sourcetype get the sourcetype lookup, and events that match both only get the host lookup.

It is a proven practice recommendation to make sure that all of your lookup stanzas have unique names.

Command roundup

This section lists several important Splunk commands you will use when working with lookups.

The lookup command

The Splunk lookup command is used to manually invoke field lookups using a Splunk lookup table that was previously defined. You can use Splunk Web (or the transforms.conf file) to define your lookups. If you do not specify OUTPUT or OUTPUTNEW, all fields in the lookup table (excluding the lookup match field) will be used by Splunk as output fields. Conversely, if OUTPUT is specified, the output lookup fields will overwrite existing fields, and if OUTPUTNEW is specified, the lookup will not be performed for events in which the output fields already exist.

For example, if you have a lookup table specified as iptousername with (at least) two fields, IP and UserId, then for each event, Splunk will look up the value of the field IP in the table and, for any entries that match, the value of the UserId field in the lookup table will be written to the field user_name in the event. The query is as follows:

... | lookup iptousername IP as "IP" OUTPUT UserId as user_name

Always strive to perform lookups after any reporting commands in your search pipeline, so that the lookup only needs to match the results of the reporting command and not every individual event.

The inputlookup and outputlookup commands

The inputlookup command allows you to load search results from a specified static lookup table. It reads in a specified CSV filename (or a table name as specified by the stanza name in transforms.conf). If the append=t (that is, true) option is added, the data from the lookup file is appended to the current set of results (instead of replacing it). The outputlookup command then lets us write the resulting events to a specified static lookup table (as long as this output lookup table is defined). So, here is an example of reading in the MasterDHCP lookup table (as specified in transforms.conf) and writing these event results to the lookup table definition NewMasterDHCP:

| inputlookup MasterDHCP | outputlookup NewMasterDHCP

After running the preceding command, we can see the following output:

Note that we can add the append=t option to the search in the following fashion:

| inputlookup MasterDHCP.csv | inputlookup NewMasterDHCP.csv append=t

The inputcsv and outputcsv commands

The inputcsv command is similar to the inputlookup command in that it loads search results, but this command loads from a specified CSV file. The filename must refer to a relative path in $SPLUNK_HOME/var/run/splunk, and if the specified file does not exist and the filename did not have an extension, then a filename with a .csv extension is assumed. The outputcsv command lets us write our result events to a CSV file.
Here is an example where we read in a CSV file named splunk_master.csv, search for the text phrase FPM, and then write any matching events to a CSV file named FPMBU.csv: | inputcsv splunk_master.csv | search "Business Unit Name"="FPM" | outputcsv FPMBU.csv The following screenshot shows the results from the preceding search command:   The following screenshot shows the resulting file generated as a result of the preceding command:   Here is another example where we read in the same CSV file (splunk_master.csv) and write out only events from 51 to 500: | inputcsv splunk_master start=50 max=500 Events are numbered starting with zero as the first entry (rather than 1). Summary In this article, we defined Splunk lookups and discussed their value. We also went through the two types of lookups, static and dynamic, and saw detailed, working examples of each. Various Splunk commands typically used with the lookup functionality were also presented. Resources for Article: Further resources on this subject: Working with Apps in Splunk [article] Processing Tweets with Apache Hive [article] Indexes [article]
Adding Graded Activities

Packt
16 Dec 2014
9 min read
This article by Rebecca Barrington, author of Moodle Gradebook Second Edition, teaches you how to add assignments and set up how they will be graded, including how to use our custom scales and add outcomes for grading.

(For more resources related to this topic, see here.)

As with all content within Moodle, we need to select Turn editing on within the course in order to be able to add resources and activities. All graded activities are added through the Add an activity or resource text available within each section of a Moodle course. This text can be found in the bottom right of each section after editing has been turned on.

There are a number of items that can be graded and will appear within the Gradebook. Assignments are the most feature-rich of all the graded activities and have many options available in order to customize how assessments can be graded. They can be used to provide assessment information for students, store grades, and provide feedback. When setting up the assignment, we can choose for students to submit their work electronically (either through file submission or online text), or we can review the assessment offline and use only the grade and feedback features of the assignment.

Adding assignments

There are many options within the assignments, and throughout this article we will set up a number of different assignments and you'll learn about some of their most useful features and options. Let's have a go at creating a range of assignments that are ready for grading.

Creating an assignment with a scale

The first assignment that we will add will make use of the PMD scale:

1. Click on the Turn editing on button.
2. Click on Add an activity or resource.
3. Click on Assignment and then click on Add.
4. In the Assignment name box, type in the name of the assignment (such as Task 1).
5. In the Description box, provide some assignment details.
6. In the Availability section, we need to disable the date options. We will not make use of these options, but they can be very useful. To disable the options, click on the tick next to the Enable text. However, details of these options have been provided for future reference.

The Allow submissions from section is mostly relevant when the assignment will be submitted electronically, as students won't be able to submit their work until the date and time indicated here.

The Due date section can be used to indicate when the assignment needs to be submitted by. If students electronically submit their assignment after the date and time indicated here, the submission date and time will be shown in red in order to notify the teacher that it was submitted past the due date.

The Cut off date section enables teachers to set an extension period after the due date where late submissions will continue to be accepted.

7. In the Submission types section, ensure that the File submissions checkbox is enabled by adding a tick there. This will enable students to submit their assignment electronically. There are additional options that we can choose as well. With Maximum number of uploaded files, we can indicate how many files a student can upload. Keep this as 1. We can also determine the Maximum submission size option for each file using the drop-down list shown in the following screenshot:

8. Within the Feedback types section, ensure that all options under the Feedback types section are selected.

Feedback comments enables us to provide written feedback along with the grade.

Feedback files enables us to upload a file in order to provide feedback to a student.
Offline grading worksheet will provide us with the option to download a .csv file that contains core information about the assignment, and this can be used to add grades and feedback while working offline. This completed .csv file can be uploaded and the grades will be added to the assignments within the Gradebook.

In the Submission settings section, we have options related to how students will submit their assignment and how they will reattempt submission if required.

If Require students click submit button is left as No, students will upload their assignment and it will be available to the teacher for grading. If this option is changed to Yes, students can upload their assignment, but the teacher will see that it is in draft form. Students will click on Submit to indicate that it is ready to be graded.

Require that students accept the submission statement will provide students with a statement that they need to agree to when they submit their assignment. The default statement is This assignment is my own work, except where I have acknowledged the use of works of other people.

The submission statement can be changed by a site administrator by navigating to Site administration | Plugins | Activity modules | Assignment settings.

The Attempts reopened drop-down list provides options for the status of the assignment after it has been graded. Students will only be able to resubmit their work when it is open. Therefore, this setting will control when and if students are able to submit another version of their assignment. The options available to us are:

- Never: This option should be selected if students will not be able to submit another piece of work.
- Manually: This will enable anyone who has the role of a teacher to choose to reopen a submission that enables a student to submit their work again.
- Automatically until pass: This option works when a pass grade is set within the Gradebook. After grading, if the student is awarded the minimum pass grade or higher, the submission will remain closed in order to prevent any changes to the submission. However, if the assignment is graded lower than the assigned pass grade, the submission will automatically reopen in order to enable the student to submit the assignment again.
- Maximum attempts: The maximum attempts allowed for this assignment will limit the number of times an assignment is reopened. For example, if this option is set to 3, then a student will only be able to submit their assignment three times. After they have submitted their assignment for a third time, they will not be allowed to submit it again. The default is unlimited, but it can be changed by clicking on the drop-down list.

9. In the Submission settings section, ensure that the options for Require students click submit button and Require that students accept the submission statement are set to Yes. Also, change the Attempts reopened to Automatically until passed.
10. Within the Grade section, navigate to Grade | Type | Scale and choose the PMD scale.
11. Select Use marking workflow by changing the drop-down list to Yes.

Use marking workflow is a new feature of Moodle 2.6 that enables the grading process to go through a range of stages in order to indicate that the marking is in progress or is complete, is being reviewed, or is ready for release to students.

12. Click on Save and return to course.

Creating an online assignment with a number grade

The next assignment that we will create will have an online text option and a maximum grade of 20.
The following steps show you how to create an online assignment with a number grade:

1. Enable editing by clicking on Turn editing on.
2. Click on Add an activity or resource.
3. Click on Assignment and then click on Add.
4. In the Assignment name box, type in the name of the assignment (such as Task 2).
5. In the Description box, provide the assignment details.
6. In the Submission types section, ensure that Online text has a tick next to it. This will enable students to type directly into Moodle. When choosing this option, we can also set a maximum word limit by clicking on the tick box next to the Enable text. After enabling this option, we can add a number to the textbox. For this assignment, enable a word limit of 200 words.
7. When using online text submission, we have an additional feedback option within the Feedback types section. Under the Comment inline text, click on No and switch to Yes to enable yourself to add written feedback for students within the written text submitted by students.
8. In the Submission settings section, ensure that the options for Require students click submit button and Require that students accept the submission statement are set to Yes. Also, change Attempts reopened to Automatically until passed.
9. Within the Grades section, navigate to Grade | Type | Point and ensure that Maximum points is set to 20.
10. Click on Save and return to course.

Creating an assignment including outcomes

The next assignment that we will create will add some of the Outcomes:

1. Enable editing by clicking on Turn editing on.
2. Click on Add an activity or resource.
3. Click on Assignment and then click on Add.
4. In the Assignment name box, type in the name of the assignment (such as Task 3).
5. In the Description box, provide the assignment details.
6. In the Submission types box, ensure that Online text and File submissions are selected. Set Maximum number of uploaded files to 2.
7. In the Submission settings section, ensure that the options for Require students to click submit button and Require that students accept the submission statement are amended to Yes. Change Attempts reopened to Manually.
8. Within the Grades section, navigate to Grade | Type | Point and ensure that Maximum points is set to 100.
9. In the Outcomes section, choose the outcomes as Evidence provided and Criteria 1 met.
10. Scroll to the bottom of the screen and click on Save and return to course.

Summary

In this article, we added a range of assignments that made use of number and scale grades as well as added outcomes to an assignment.

Resources for Article:

Further resources on this subject:

Moodle for Online Communities [article]
What's New in Moodle 2.0 [article]
Moodle 2.0: What's New in Add a Resource [article]

Ridge Regression

Packt
16 Dec 2014
9 min read
In this article by Patrick R. Nicolas, the author of the book Scala for Machine Learning, we will cover the basics of ridge regression. The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being the one commonly used. The problem of overfitting can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.

(For more resources related to this topic, see here.)

Ln roughness penalty

Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regressive classifier) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many feature variables with relatively large weights. This process is known as shrinkage. Practically, shrinkage consists of adding a function with the model parameters as an argument to the loss function:

$$\mathcal{L}(w) = \sum_{i=1}^{n} \left(y_i - f(x_i \mid w)\right)^2 + \lambda\, J(w)$$

The penalty function is completely independent from the training set {x, y}. The penalty term is usually expressed as a power function of the norm of the model parameters (or weights) wd. For a model of D dimensions, the generic Lp-norm is defined as follows:

$$\lVert w \rVert_p = \left(\sum_{d=1}^{D} \lvert w_d \rvert^p\right)^{1/p}$$

Notation

Regularization applies to the parameters or weights associated with an observation. In order to be consistent with our notation, w0 being the intercept value, the regularization applies to the parameters w1, ..., wd.

The two most commonly used penalty functions for regularization are L1 and L2.

Regularization in machine learning

The regularization technique is not specific to linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as a support vector machine or a feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS.

The L1 regularization applied to linear regression is known as the Lasso regularization. The ridge regression is a linear regression that uses the L2 regularization penalty.

You may wonder which regularization makes sense for a given training set. In a nutshell, L2 and L1 regularizations differ in terms of computation efficiency, estimation, and feature selection (refer to the 13.3 L1 regularization: basics section in the book Machine Learning: A Probabilistic Perspective, and the Feature selection, L1 vs. L2 regularization, and rotational invariance paper available at http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf). The various differences between the two regularizations are as follows:

- Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For large non-sparse datasets, L2 has a smaller estimation error than L1.
- Feature selection: L1 is more effective in reducing the regression weights for features with high values than L2. Therefore, L1 is a reliable feature selection tool.
- Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model) for the same reason it is more appropriate for selecting features.
- Computation: L2 is conducive to a more efficient computation model. The summation of the loss function and the L2 penalty ||w||^2 is a continuous and differentiable function for which the first and second derivatives can be computed (convex minimization). The L1 term is the summation of |wi|, and is therefore not differentiable.

Terminology

The ridge regression is sometimes called the penalized least squares regression.
The L2 regularization is also known as weight decay.

Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.

Ridge regression

The ridge regression is a multivariate linear regression with an L2-norm penalty term, and its loss function can be written as follows:

$$\mathcal{L}(w) = \sum_{i=1}^{n} \left(y_i - f(x_i \mid w)\right)^2 + \lambda \lVert w \rVert_2^2$$

The computation of the ridge regression parameters requires the resolution of a system of linear equations similar to the linear regression. The matrix representation of the ridge regression closed form is as follows:

$$\hat{w} = \left(X^T X + \lambda I\right)^{-1} X^T y$$

Here, I is the identity matrix, and the solution uses the QR decomposition, as shown here:

$$X^T X + \lambda I = Q\,R$$

Implementation

The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math library. The methods of RidgeRegression have the same signature as its ordinary least squares counterpart. However, the class has to inherit the abstract base class AbstractMultipleLinearRegression in the Apache Commons Math library and override the generation of the QR decomposition to include the penalty term, as shown in the following code:

class RidgeRegression[T <% Double](val xt: XTSeries[Array[T]],
                                   val y: DblVector,
                                   val lambda: Double)
      extends AbstractMultipleLinearRegression
      with PipeOperator[Array[T], Double] {

  private var qr: QRDecomposition = null
  private[this] val model: Option[RegressionModel] = …
  …
}

Besides the input time series xt and the labels y, the ridge regression requires the lambda factor of the L2 penalty term. The instantiation of the class trains the model. The steps to create the ridge regression model are as follows:

1. Extract the Q and R matrices for the input values, newXSampleData (line 1)
2. Compute the weights using the calculateBeta method defined in the base class (line 2)
3. Return the tuple of the regression weights, calculateBeta, and the residuals, calculateResiduals

private val model: Option[RegressionModel] = {
  this.newXSampleData(xt.toDblMatrix) //1
  newYSampleData(y)
  val _rss = calculateResiduals.toArray.map(x => x*x).sum
  val wRss = (calculateBeta.toArray, _rss) //2
  Some(RegressionModel(wRss._1, wRss._2))
}

The QR decomposition in the AbstractMultipleLinearRegression base class does not include the penalty term (line 3); the identity matrix with the lambda factor in the diagonal has to be added to the matrix to be decomposed (line 4).

override protected def newXSampleData(x: DblMatrix): Unit = {
  super.newXSampleData(x)  //3
  val xtx: RealMatrix = getX
  val nFeatures = xt(0).size
  Range(0, nFeatures).foreach(i =>
    xtx.setEntry(i, i, xtx.getEntry(i, i) + lambda))  //4
  qr = new QRDecomposition(xtx)
}

The regression weights are computed by resolving the system of linear equations using substitution on the QR matrices. It overrides the calculateBeta function from the base class:

override protected def calculateBeta: RealVector =
  qr.getSolver().solve(getY())

Test case

The objective of the test case is to identify the impact of the L2 penalization on the RSS value, and then compare the predicted values with the original values. Let's consider the first test case related to the regression on the daily price variation of the Copper ETF (symbol: CU) using the stock daily volatility and volume as features.
The implementation of the extraction of observations is identical to that of the least squares regression:

val src = DataSource(path, true, true, 1)
val price = src |> YahooFinancials.adjClose
val volatility = src |> YahooFinancials.volatility
val volume = src |> YahooFinancials.volume //1

val _price = price.get.toArray
val deltaPrice = XTSeries[Double](_price
                    .drop(1)
                    .zip(_price.take(_price.size - 1))
                    .map(z => z._1 - z._2)) //2
val data = volatility.get
                    .zip(volume.get)
                    .map(z => Array[Double](z._1, z._2)) //3
val features = XTSeries[DblVector](data.take(data.size - 1))
val regression = new RidgeRegression[Double](features, deltaPrice, lambda) //4

regression.rss match {
  case Some(rss) => Display.show(rss, logger) //5
  ….

The observed data, the ETF daily price, and the features (volatility and volume) are extracted from the source src (line 1). The daily price change, deltaPrice, is computed using a combination of the Scala take and drop methods (line 2). The features vector is created by zipping volatility and volume (line 3). The model is created by instantiating the RidgeRegression class (line 4). The RSS value, rss, is finally displayed (line 5).

The RSS value, rss, is plotted for different values of lambda <= 1.0 in the following graph:

Graph of RSS versus Lambda for Copper ETF

The residual sum of squares decreases as λ increases. The curve seems to be reaching a minimum around λ = 1. The case of λ = 0 corresponds to the least squares regression.

Next, let's plot the RSS value for λ varying between 1 and 100:

Graph of RSS versus large values of Lambda for Copper ETF

This time around, RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings (refer to Lecture 5: Model selection and assessment, a lecture by H. Bravo and R. Irizarry from the Department of Computer Science, University of Maryland, in 2010, available at http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/selection.pdf). As λ increases, the overfitting gets more expensive, and therefore, the RSS value increases.

The regression weights can simply be output as follows:

regression.weights.get

Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):

Graph of ridge regression on Copper ETF price variation with variable Lambda

The original price variation of the Copper ETF, Δ = price(t+1) - price(t), is plotted as λ = 0. The predicted values for λ = 0.8 are very similar to the original data; they follow the pattern of the original data with a reduction of the large variations (peaks and troughs). The predicted values for λ = 5 correspond to a smoothed dataset. The pattern of the original data is preserved but the magnitude of the price variation is significantly reduced.

The reader is invited to apply the more elaborate K-fold validation routine and compute precision, recall, and the F1 measure to confirm the findings.

Summary

The ridge regression is a powerful alternative to the more common least squares regression because it reduces the risk of overfitting. Contrary to the Naïve Bayes classifiers, it does not require conditional independence of the model features.

Resources for Article:

Further resources on this subject:

Differences in style between Java and Scala code [Article]
Dependency Management in SBT [Article]
Introduction to MapReduce [Article]