
How-To Tutorials - Data


Convolutional Neural Networks with Reinforcement Learning

Packt
06 Apr 2017
9 min read
In this article by Antonio Gulli and Sujit Pal, the authors of the book Deep Learning with Keras, we will learn about reinforcement learning, or more specifically deep reinforcement learning, that is, the application of deep neural networks to reinforcement learning. We will also see how convolutional neural networks leverage spatial information, which makes them very well suited for classifying images.

Deep convolutional neural network

A Deep Convolutional Neural Network (DCNN) consists of many neural network layers. Two different types of layers, convolutional and pooling, are typically alternated. The depth of each filter increases from left to right in the network. The last stage is typically made of one or more fully connected layers, as shown here:

There are three key intuitions behind ConvNets:

Local receptive fields
Shared weights
Pooling

Let's review them together.

Local receptive fields

If we want to preserve the spatial information, then it is convenient to represent each image with a matrix of pixels. A simple way to encode the local structure is then to connect a submatrix of adjacent input neurons to one single hidden neuron belonging to the next layer. That single hidden neuron represents one local receptive field. Note that this operation is named convolution, and it gives this type of network its name. Of course, we can encode more information by having overlapping submatrices. For instance, let's suppose that the size of each single submatrix is 5 x 5 and that those submatrices are used with MNIST images of 28 x 28 pixels; then we will be able to generate 23 x 23 local receptive field neurons in the next hidden layer. In fact, it is possible to slide the submatrices by only 23 positions before touching the borders of the images. In Keras, the size of each single submatrix is called the stride length, and this is a hyper-parameter that can be fine-tuned during the construction of our nets. Let's define the feature map from one layer to another layer. Of course, we can have multiple feature maps that learn independently from each hidden layer. For instance, we can start with 28 x 28 input neurons for processing MNIST images, and then generate k feature maps of size 23 x 23 neurons each (again with a stride of 5 x 5) in the next hidden layer.

Shared weights and bias

Let's suppose that we want to move away from the raw pixel representation by gaining the ability to detect the same feature independently of the location where it is placed in the input image. A simple intuition is to use the same set of weights and bias for all the neurons in the hidden layers. In this way, each layer will learn a set of position-independent latent features derived from the image. Assuming that the input image has shape (256, 256) on 3 channels with tf (TensorFlow) ordering, it is represented as (256, 256, 3). Note that with th (Theano) ordering the channels dimension (the depth) is at index 1, while in tf mode it is at index 3. In Keras, if we want to add a convolutional layer with an output dimensionality of 32 and a filter extension of 3 x 3, we will write:

model = Sequential()
model.add(Convolution2D(32, 3, 3, input_shape=(256, 256, 3)))

This means that we are applying a 3 x 3 convolution on a 256 x 256 image with 3 input channels (or input filters), resulting in 32 output channels (or output filters).
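As a quick sanity check of the layer just described, here is a minimal sketch using the same Keras 1.x-style API as the snippet above; the added Activation layer and the use of model.summary() are illustrative additions, not code from the book:

from keras.models import Sequential
from keras.layers import Convolution2D, Activation

model = Sequential()
# 32 output filters, each 3 x 3, applied to a 256 x 256 RGB input (tf ordering)
model.add(Convolution2D(32, 3, 3, input_shape=(256, 256, 3)))
model.add(Activation('relu'))

# Prints each layer with its output shape; with the default 'valid' border mode
# and tf dim ordering this reports (None, 254, 254, 32) for the convolution
model.summary()

With a TensorFlow backend, the channel dimension stays last, matching the (256, 256, 3) input shape discussed above.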
An example of convolution is provided in the following diagram:

Pooling layers

Let's suppose that we want to summarize the output of a feature map. Again, we can use the spatial contiguity of the output produced by a single feature map and aggregate the values of a submatrix into one single output value that synthetically describes the meaning associated with that physical region.

Max pooling

One easy and common choice is the so-called max pooling operator, which simply outputs the maximum activation observed in the region. In Keras, if we want to define a max pooling layer of size 2 x 2, we will write:

model.add(MaxPooling2D(pool_size = (2, 2)))

An example of the max pooling operation is given in the following diagram:

Average pooling

Another choice is average pooling, which simply aggregates a region into the average of the activations observed in that region. Keras implements a large number of pooling layers, and a complete list is available online. In short, all the pooling operations are nothing more than a summary operation on a given region.

Reinforcement learning

Our objective is to build a neural network to play the game of catch. Each game starts with a ball being dropped from a random position at the top of the screen. The objective is to move a paddle at the bottom of the screen, using the left and right arrow keys, to catch the ball by the time it reaches the bottom. As games go, this is quite simple. At any point in time, the state of this game is given by the (x, y) coordinates of the ball and the paddle. Most arcade games tend to have many more moving parts, so a general solution is to provide the entire current game screen image as the state. The following diagram shows four consecutive screenshots of our catch game:

Astute readers might note that our problem could be modeled as a classification problem, where the input to the network is the game screen images and the output is one of three actions: move left, stay, or move right. However, this would require us to provide the network with training examples, possibly from recordings of games played by experts. An alternative and simpler approach might be to build a network and have it play the game repeatedly, giving it feedback based on whether it succeeds in catching the ball or not. This approach is also more intuitive, and is closer to the way humans and animals learn.

The most common way to represent such a problem is through a Markov Decision Process (MDP). Our game is the environment within which the agent is trying to learn. The state of the environment at time step t is given by st (and contains the location of the ball and the paddle). The agent can perform certain actions at (such as moving the paddle left or right). These actions can sometimes result in a reward rt, which can be positive or negative (such as an increase or decrease in the score). Actions change the environment and can lead to a new state st+1, where the agent can perform another action at+1, and so on. The set of states, actions, and rewards, together with the rules for transitioning from one state to another, make up a Markov decision process. A single game is one episode of this process, and is represented by a finite sequence of states, actions, and rewards:

s0, a0, r1, s1, a1, r2, ..., sn-1, an-1, rn, sn

Since this is a Markov decision process, the probability of state st+1 depends only on the current state st and action at.

Maximizing future rewards

As an agent, our objective is to maximize the total reward from each game.
The total reward can be represented as follows:

R = r1 + r2 + ... + rn

In order to maximize the total reward, the agent should try to maximize the total reward from any time point t in the game. The total reward at time step t is given by Rt and is represented as:

Rt = rt + rt+1 + ... + rn

However, it is harder to predict the value of the rewards the further we go into the future. In order to take this into consideration, our agent should try to maximize the total discounted future reward at time t instead. This is done by discounting the reward at each future time step by a factor γ over the previous time step. If γ is 0, then our network does not consider future rewards at all, and if γ is 1, then our network is completely deterministic. A good value for γ is around 0.9. Factoring the equation allows us to express the total discounted future reward at a given time step recursively, as the sum of the current reward and the total discounted future reward at the next time step:

Rt = rt + γ rt+1 + γ^2 rt+2 + ... = rt + γ Rt+1

Q-learning

Deep reinforcement learning utilizes a model-free reinforcement learning technique called Q-learning. Q-learning can be used to find an optimal action for any given state in a finite Markov decision process. Q-learning tries to maximize the value of the Q-function, which represents the maximum discounted future reward when we perform action a in state s:

Q(st, at) = max Rt+1

Once we know the Q-function, the optimal action a at a state s is the one with the highest Q-value. We can then define a policy π(s) that gives us the optimal action at any state:

π(s) = argmaxa Q(s, a)

We can define the Q-function for a transition point (st, at, rt, st+1) in terms of the Q-function at the next point (st+1, at+1, rt+1, st+2), similar to how we did with the total discounted future reward:

Q(st, at) = rt + γ maxat+1 Q(st+1, at+1)

This equation is known as the Bellman equation. The Q-function can be approximated using the Bellman equation. You can think of the Q-function as a lookup table (called a Q-table) where the states (denoted by s) are rows and actions (denoted by a) are columns, and the elements (denoted by Q(s, a)) are the rewards that you get if you are in the state given by the row and take the action given by the column. The best action to take at any state is the one with the highest reward. We start by randomly initializing the Q-table, then carry out random actions and observe the rewards to update the Q-table iteratively, according to the following algorithm:

initialize Q-table Q
observe initial state s
repeat
    select and carry out action a
    observe reward r and move to new state s'
    Q(s, a) = Q(s, a) + α(r + γ maxa' Q(s', a') - Q(s, a))
    s = s'
until game over

You will realize that the algorithm is basically doing stochastic gradient descent on the Bellman equation, backpropagating the reward through the state space (or episode) and averaging over many trials (or epochs). Here, α is the learning rate, which determines how much of the difference between the previous Q-value and the discounted new maximum Q-value should be incorporated.

Summary

We have seen the application of deep neural networks to reinforcement learning. We have also seen convolutional neural networks and how they are well suited for classifying images.
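To make the Q-table update above concrete, here is a minimal, framework-free Python sketch of tabular Q-learning. This is an illustrative toy rather than the book's catch-game code: the tiny environment in step(), the state and action counts, and the hyper-parameter values are all assumptions made for the example.

import random

N_STATES, N_ACTIONS = 5, 3            # hypothetical tiny problem
alpha, gamma, epsilon = 0.1, 0.9, 0.1

# Q-table: rows are states, columns are actions, initialized to zero
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    """Stand-in environment: returns (next_state, reward, done)."""
    next_state = random.randrange(N_STATES)       # assumed random transition
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(1000):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection: mostly exploit, sometimes explore
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
        s_next, r, done = step(s, a)
        # The update rule from the algorithm above
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

After enough episodes, choosing the highest-valued action in each row of Q approximates the optimal policy π(s) described earlier.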


WebLogic Server

Packt
15 Mar 2017
24 min read
In this article by Adrian Ward, Christian Screen, and Haroun Khan, the authors of the book Oracle Business Intelligence Enterprise Edition 12c - Second Edition, we will talk in a little more detail about the enterprise application server that is at the core of Oracle Fusion Middleware: WebLogic. Oracle WebLogic Server is a scalable, enterprise-ready Java Platform Enterprise Edition (Java EE) application server. Its infrastructure supports the deployment of many types of distributed applications. It is also an ideal foundation for building service-oriented architecture (SOA) applications. You can already see why BEA was a perfect acquisition for Oracle years ago. Or, more to the point, a perfect core for Fusion Middleware.

The WebLogic Server is a robust application in itself. In Oracle BI 12c, the WebLogic Server is crucial to the overall implementation, not just during installation but throughout the Oracle BI 12c lifecycle, which now takes advantage of the WebLogic Management Framework. Learning the management components of WebLogic Server that ultimately control the Oracle BI components is critical to the success of an implementation. These management areas within the WebLogic Server are referred to as the WebLogic Administration Server, WebLogic Managed Server(s), and the WebLogic Node Manager.

A Few WebLogic Server Nuances

Before we move on to a description of each of those areas within WebLogic, it is also important to understand that the WebLogic Server software used for the installation of the Oracle BI product suite carries a limited license. Although the software itself is the full enterprise version and carries full functionality, the license that ships with Oracle BI 12c is not a full enterprise license for WebLogic Server, so your organization cannot use it to spin off other siloed JEE deployments on other non-OBIEE servers. Two other nuances are worth noting:

Clustered from the installation: The WebLogic Server license provided with out-of-the-box Oracle BI 12c does not allow for horizontal scale-out. An enterprise WebLogic Server license needs to be obtained for this advanced functionality.

Contains an embedded web/HTTP server, not Oracle HTTP Server (OHS): WebLogic Server does not ship with a separate HTTP server. The Oracle BI Enterprise Deployment Guide (available on oracle.com) discusses separating the application tier from the web/HTTP tier, suggesting Oracle HTTP Server.

These items are simply a few nuances of the product suite in relation to Oracle BI 12c. Most software products contain a short list such as this one. However, once you understand the nuances, it will be easier to ensure that you have a more successful implementation. It also allows your team to be as prepared in advance as possible. Be sure to consult your Oracle sales representative to assist with licensing concerns. Despite these nuances, we highly recommend that, in order to learn more about the installation features, configuration options, administration, and maintenance of WebLogic, you research it not only in relation to Oracle BI, but also in its standalone form. That is to say, there is much more information at large on the topic of WebLogic Server itself than on WebLogic Server as it relates to Oracle BI. Understanding this approach to self-education or web searching should give you more efficient results.

WebLogic Domain

The highest unit of management for controlling a WebLogic Server installation is called a domain.
A domain is a logically related group of WebLogic Server resources that you manage as a unit. A domain always includes, and is centrally managed by, one Administration Server. Additional WebLogic Server instances, which are controlled by the Administration Server for the domain, are called Managed Servers. The configuration for all the servers in the domain is stored in the configuration repository, the config.xml file, which resides on the machine hosting the Administration Server. Upon installing and configuring Oracle BI 12c, the domain bi is established within the WebLogic Server. This is the recommended domain name for each Oracle BI 12c implementation and should not be modified. The domain path for the bi domain may appear as ORACLE_HOME/user_projects/domains/bi. This directory for the bi domain is also referred to as the DOMAIN_HOME or BI_DOMAIN folder.

WebLogic Administration Server

The WebLogic Server is an enterprise software suite that manages a myriad of application server components, mainly focusing on Java technology. It also comprises many ancillary components, which enable the software to scale well and make it a good choice for distributed environments and high availability. Clearly, it is good enough to be at the core of Oracle Fusion Middleware. One of the most crucial components of WebLogic Server is the WebLogic Administration Server. When installing the WebLogic Server software, the Administration Server is automatically installed with it. The Administration Server not only controls all subsequent WebLogic Server instances, called Managed Servers, but also controls such aspects as authentication-provider security (for example, LDAP) and other application-server-related configurations. WebLogic Server installs on the operating system and ultimately runs as a service on that machine. The WebLogic Server can be managed in several ways. The two main methods are via the Graphical User Interface (GUI) web application called the WebLogic Administration Console, or via the command line using the WebLogic Scripting Tool (WLST). You access the Administration Console from any networked machine using a web-based client (that is, a web browser) that can communicate with the Administration Server through the network and/or firewall. The WebLogic Administration Server and the WebLogic Server are basically synonymous: if the WebLogic Server is not running, the WebLogic Administration Console will be unavailable as well.

WebLogic Managed Server

Web applications, Enterprise JavaBeans (EJB), and other resources are deployed onto one or more Managed Servers in a WebLogic domain. A Managed Server is an instance of a WebLogic Server in a WebLogic Server domain. Each WebLogic Server domain has at least one instance, which acts as the Administration Server just discussed. One Administration Server per domain must exist, but one or more Managed Servers may exist in the WebLogic Server domain. In a production deployment, Oracle BI is deployed into its own Managed Server. The Oracle BI installer installs two WebLogic Server instances, the Admin Server and a Managed Server, bi_server1. Oracle BI is deployed into the Managed Server bi_server1, which is configured by default to resolve to port 19502; the Admin Server resolves to port 19500. Historically, these have been port 9704 for the Oracle BI Managed Server and port 7001 for the Admin Server.
When administering the WebLogic Server via the Administration Console, the WebLogic Administration Server instance appears in the same list of servers, which also includes any Managed Servers. As a best practice, the WebLogic Administration Server should be used for configuration and management of the WebLogic Server only, and should not contain any additionally deployed applications, EJBs, and so on. One thing to note is that the Enterprise Manager Fusion Control is actually a JEE application deployed to the Administration Server instance, which is why its web client is accessible under the same port as the Admin Server. It is not necessarily a native application deployment to the core WebLogic Server, but gets deployed and configured automatically during the Oracle BI installation and configuration process. In the Deployments page within the Administration Console, you will find a deployment named em.

WebLogic Node Manager

The general idea behind Node Manager is that it takes on somewhat of a middle-man role. That is to say, the Node Manager provides a communication tunnel between the WebLogic Administration Server and any Managed Servers configured within the WebLogic domain. When the WebLogic Server environment is contained on a single physical server, it may be difficult to recognize the need for a Node Manager. It is very necessary, however, and Node Manager lifecycle management will have to be part of any start-up and shutdown scripts you ultimately create for Oracle BI. Node Manager's real power comes into play when Oracle BI is scaled out horizontally across one or more physical servers. Each scaled-out deployment of WebLogic Server will contain a Node Manager. If the Node Manager is not running on the server on which the Managed Server is deployed, then the core Administration Server will not be able to issue start or stop commands to that server. As such, if the Node Manager is down, communication with the overall cluster will be affected. The following figure shows how machines A, B, and C are physically separated, each containing a Node Manager. You can see that the Administration Server communicates with the Node Managers, and not with the Managed Servers directly:

System tools controlled by WebLogic

We briefly discussed the WebLogic Administration Console, which controls the administrative configuration of the WebLogic Server domain. This includes the components managed within it, such as security, deployed applications, and so on. The other management tool, which provides control of the deployed Oracle BI application's ancillary deployments, libraries, and several other configurations, is called the Enterprise Manager Fusion Middleware Control. This seems to be a long name for a single web-based tool. As such, the name is often shortened to "Fusion Control" or "Enterprise Manager". Referring to either abbreviated title in the context of Oracle BI should ensure that fellow Oracle BI teammates understand what you mean.

Security

It would be difficult to discuss the overall architecture of Oracle BI without at least giving some mention to how the basics of security, authentication, and authorization are applied. By default, installing Oracle WebLogic Server provides a default Lightweight Directory Access Protocol (LDAP) server, referred to as the WebLogic Server Embedded LDAP server. This is a standards-compliant LDAP system, which acts as the default authentication method for out-of-the-box Oracle BI.
Integration of secondary LDAP providers, such as Oracle Internet Directory (OID) or Microsoft Active Directory (MSAD), is crucial to leveraging most organizations' identity-management systems. The combination of multiple authentication providers is possible; in fact, it is commonplace. For example, a configuration may need users that exist in both the Embedded LDAP server and MSAD to authenticate and have access to Oracle BI. Potentially, another set of users may be stored in a relational database repository, or a set of relational database tables may control the authorization that users have in relation to the Oracle BI system. WebLogic Server provides configuration opportunities for each of these scenarios.

Oracle BI security incorporates the Fusion Middleware security model, Oracle Platform Security Services (OPSS). This has a positive influence on managing all aspects of Oracle BI, as it provides a very granular level of authorization and a large number of authentication and authorization integration mechanisms. OPSS also introduces to Oracle BI the concept of managing privileges by application role instead of directly by user or group. It abides by open standards to integrate with security mechanisms that are growing in popularity, such as the Security Assertion Markup Language (SAML) 2.0. Other well-known single sign-on mechanisms, such as SiteMinder and Oracle Access Manager, already have pre-configured integration points within Oracle BI Fusion Control.

Oracle BI 12c and Oracle BI 11g security is managed differently than in the legacy Oracle BI 10g versions. Oracle BI 12c no longer has backward compatibility with the legacy Oracle BI 10g security model, and the focus should be on following the new security configuration best practices of Oracle BI 12c:

An Oracle BI best practice is to manage security by Application Roles.

Understanding the differences between the Identity Store, Credential Store, and Policy Store is critical for advanced security configuration and maintenance.

As of Oracle BI 12c, the OPSS metadata is stored in a relational repository, which is installed as part of the RCU-schemas installation process that takes place prior to executing the Oracle BI 12c installation on the application server.

Managing by Application Roles

In Oracle BI 11g, the default security model is the Oracle Fusion Middleware security model, which has a very broad scope. A universal Information Technology security-administration best practice is to set permissions or privileges for a specific point of access on a group, and not on individual users. The same idea applies here, except that there is another, enterprise-level unit of user (and even group) aggregation, called an Application Role. Application Roles can contain other application roles, groups, or individual users. Access privileges to a certain object, such as a folder, web page, or column, should always be assigned to an application role. Application roles for Oracle BI can be managed in the Oracle Enterprise Manager Fusion Middleware Control interface. They can also be scripted using the WLST command-line interface.

Security Providers

Fusion Middleware security can seem complex at first, but knowing the correct terminology and understanding how the most important components communicate with each other and with the application at large is extremely important as it relates to security management.
Oracle BI uses three main repositories for accessing authentication and authorization information, all of which are explained in the following sections.

Identity Store

The Identity Store is the authentication provider, which may also provide authorization metadata. A simple mnemonic here is that this store tells Oracle BI how to "identify" any users attempting to access the system. An example of creating an Identity Store would be to configure an LDAP system, such as Oracle Internet Directory or Microsoft Active Directory, to reference users within an organization. These LDAP configurations are referred to as Authentication Providers.

Credential Store

The credential store is ultimately for advanced Oracle configurations. You may touch upon this when establishing an enterprise Oracle BI deployment, but not much thereafter, unless you are integrating the Oracle BI Action Framework or something equally complex. Ultimately, the credential store does exactly what its name implies: it stores credentials. Specifically, it is used to store the credentials of other applications, which the core application (that is, Oracle BI) may access at a later time without having to re-enter said credentials. An example of this would be integrating Oracle BI with the Oracle Enterprise Performance Management (EPM) suite. In this example, let's pretend there is an internal requirement at Company XYZ for users to access an Oracle BI dashboard. Upon viewing said dashboard, if a report with discrepancies is viewed, the user requires the ability to click on a link, which opens an Oracle EPM Financial Report containing more details about the concern. If not all users accessing the Oracle BI dashboard have credentials to access the Oracle EPM environment directly, how could they open and view the report without being prompted for credentials? The answer is that the credential store would be configured with the credentials of a central user having access to the Oracle EPM environment. This central user's credentials (encrypted, of course) are passed along with the dashboard viewer's request and, presto, access!

Policy Store

The policy store is quite unique to Fusion Middleware security and leverages a security standard referred to as XACML, which ultimately provides granular access and privilege control for an enterprise application. This is one of the reasons why managing by Application Roles becomes so important. It is the individual Application Roles that are assigned the policies defining access to information within Oracle BI. Stated another way, application privileges, such as the ability to administer the Oracle BI RPD, are assigned to a particular application role, and these associations are defined in the policy store. The following figure shows how each area of security management is controlled:

These three types of security providers within Oracle Fusion Middleware are integral to the Oracle BI architecture. Further recommended research on this topic would be to look at Oracle Fusion Middleware Security, OPSS, and the Application Development Framework (ADF).

System Requirements

The first thing to recognize about infrastructure requirements prior to deploying Oracle BI 12c is that its memory and processor requirements have increased since previous versions. The Java application server, WebLogic Server, installs with the full version of its software (though under a limited/restricted license, as already discussed). A multitude of additional Java libraries and applications are also deployed.
Be prepared for a recommended minimum of 8 to 16 GB of Random Access Memory (RAM) for an enterprise deployment, and a minimum of 6 to 8 GB of RAM for a workstation deployment.

Client tools

Oracle BI 12c has a separate client tools installation that requires Microsoft Windows XP or a more recent version of the Windows Operating System (OS). The Oracle BI 12c client tools provide the majority of the client-to-server management capabilities required for normal day-to-day maintenance of the Oracle BI repository and related artefacts. The client-tools installation is usually reserved for Oracle BI developers who architect and maintain the Oracle BI metadata repository, better known as the RPD, a name which stems from its binary file extension (.rpd). The Oracle BI 12c client-tools installation provides each workstation with the Administration Tool, the Job Manager, and all command-line Application Programming Interface (API) executables. In Oracle BI 12c, a 64-bit Windows OS is a requirement for installing the Oracle BI Development Client tools. It has been observed that, with some initial releases of the Oracle BI 12c client tools, ODBC DSN connectivity does not work on Windows Server 2012. Therefore, using Windows Server 2012 as a development environment will be ineffective if you attempt to open the Administration Tool and connect to the OBIEE server in online mode.

Multi-User Development Environment

One of the key features when developing with Oracle BI is the ability for multiple metadata developers to develop simultaneously. Although the use of the term "simultaneously" can vary among technical communities, concurrent development within the Oracle BI suite requires Oracle BI's Multi-User Development (MUD) environment configuration, or some other process developed by third-party Oracle partners. The MUD configuration itself is fairly straightforward and ultimately relies on the Oracle BI administrator's ability to divide metadata modeling responsibilities into projects. Projects, which are usually defined and delineated by logical fact table definitions, can be assigned to one or more metadata developers.

In previous versions of Oracle BI, a metadata developer could install the entire Oracle BI product suite on an up-to-date laptop or commodity desktop workstation and successfully develop, test, and deploy an Oracle BI metadata model. The system requirements of Oracle BI 12c demand a significant amount of processor and RAM capacity in order to perform development efforts on a standard-issue workstation or laptop. If an organization currently leverages the Oracle BI multi-user development environment, or plans to with the current release, this raises a couple of questions: How do we get our developers the best environment suitable for developing our metadata? Do we need to procure new hardware?

Microsoft Windows is a requirement for the Oracle BI client tools. However, the Oracle BI client tools do not include the server component of the Oracle BI environment; they only allow for connecting from the developer's workstation to the Oracle BI server instance. In a multi-user development environment, this poses a serious problem, as only one metadata repository (RPD) can exist on any one Oracle BI server instance at any given time.
If two developers are working from their respective workstations at the same time and wish to see their latest modifications published in a rapid application development (RAD) cycle, this type of iterative effort fails, as one developer's published changes will overwrite the other's in real time. To resolve the issue, there are two recommended solutions.

The first is an obvious localized solution. This solution merely upgrades the Oracle BI developers' workstations or laptops to comply with the minimum requirements for installing the full Oracle BI environment on those machines. This upgrade should be both memory- (RAM) and processor- (MHz) centric: 16 GB+ RAM and a dual-core processor are recommended, and a 64-bit operating system kernel is required. Without an upgraded workstation from which to work, Oracle BI metadata developers will sit at a disadvantage for general iterative metadata development, and will be especially disadvantaged when interfacing within a multi-user development environment.

The second solution is one that takes advantage of virtual machine (VM) capacity within the organization. Virtual machines have become a staple within most Information Technology departments, as they are versatile and allow for speedy provisioning of server environments. For this scenario, it is recommended to create a virtual-machine template of an Oracle BI environment from which to duplicate and "stand up" individual virtual machine images for each metadata developer on the Oracle BI development team. This effectively provides each metadata developer with their own Oracle BI development server, which contains the fully deployed Oracle BI server environment. Each developer then has the ability to develop and test iteratively by connecting to their assigned virtual server, without fear that their efforts will conflict with another developer's. The following figure illustrates how an Oracle BI MUD environment can leverage either upgraded developer-workstation hardware or VM images to facilitate development:

This article does not cover the installation, configuration, or best practices for developing in a MUD environment. However, the Oracle BI development team deserves a lot of credit for documenting these processes in unprecedented detail. The Oracle BI 11g MUD documentation provides a case study that conveys best practices for managing a complex Oracle BI development lifecycle. When you are ready to deploy a MUD environment, it is highly recommended that you peruse this documentation first. The information in this section merely seeks to convey best practice in setting up a developer's workstation when using MUD.

Certifications matrix

Oracle BI 12c largely complies with the overall Fusion Middleware infrastructure. This common foundation allows for a centralized model to communicate with operating systems, web servers, and other ancillary components that are compliant. Oracle does a good job of updating a certification matrix for each Fusion Middleware application suite per respective product release. The certification matrix for Oracle BI 12c can be found on the Oracle website at the following locations: http://www.oracle.com/technetwork/middleware/fusion-middleware/documentation/fmw-1221certmatrix-2739738.xlsx and http://www.oracle.com/technetwork/middleware/ias/downloads/fusion-certification-100350.html. The certification matrix document is usually provided in Microsoft Excel format and should be referenced before any project or deployment of Oracle BI begins.
This will ensure that infrastructure components such as the selected operating system, web server, web browsers, LDAP server, and so on will actually work when integrated with the product suite.

Scaling out Oracle BI 12c

There are several reasons why an organization may wish to expand its Oracle BI footprint. These can range anywhere from requiring a highly available environment to sustaining high levels of concurrent usage over time. The total number of end users, the number of concurrent end users, the volume of queries, the size of the underlying data warehouse, and cross-network latency are further factors to consider. Scaling out an environment has the potential to solve performance issues and stabilize the environment. When scoping out the infrastructure for an Oracle BI deployment, there are several crucial decisions to be made. These decisions can be greatly assisted by preparing properly, using Oracle's recommended guides for clustering and deploying Oracle BI on an enterprise scale.

Pre-Configuration Run-Down

Configuring the Oracle BI product suite, specifically when scaling out or setting up high availability (HA), takes preparation. Proactively taking steps to understand what it takes to correctly establish or pre-configure the infrastructure required to support any level of fault tolerance and high availability is critical. Even if the decision to scale out from the initial Oracle BI deployment hasn't been made, proper planning is recommended if the potential exists.

Shared Storage

We would be remiss not to highlight one of the most important concepts of scaling out Oracle BI, specifically for high availability: shared storage. The idea of shared storage is that, in a fault-tolerant environment, there are binary files and other configuration metadata that need to be shared across the nodes. If these common elements were not shared and one node were to fail, there would be a potential loss of data. Most important is that, in a highly available Oracle BI environment, there can be only one WebLogic Administration Server running for that environment at any one time. An HA configuration makes one Administration Server active while the other is passive. If the appropriate pre-configuration steps for shared storage (as well as the other items in the high-availability guide) are not properly completed, one should not expect accurate results from the environment.

OBIEE 12c requires you to modify the Singleton Data Directory (SDD) of your Oracle BI configuration, found at ORACLE_HOME/user_projects/domains/bi/data, so that the files within that path are moved to a shared storage location mounted on the scaled-out servers on which the HA configuration will be implemented. To change this, you need to modify the ORACLE_HOME/user_projects/domains/bi/config/fmwconfig/bienv/core/bi-environment.xml file to set the path of the bi:singleton-data-directory element to the full path of the shared mounted file location that contains a copy of the bidata folder, which will be referenced by one or more scaled-out HA Oracle BI 12c servers.
For example, change the XML file element:

<bi:singleton-data-directory>/oraclehome/user_projects/domains/bi/bidata/</bi:singleton-data-directory>

to reflect a shared NAS or SAN mount whose folder names and structure are in line with the IT team's standard naming conventions, where the /bidata folder is the folder from the main Oracle BI 12c instance that gets copied to the shared directory:

<bi:singleton-data-directory>/mount02/obiee_shared_settings/bidata/</bi:singleton-data-directory>

Clustering

A major benefit of Oracle BI's ability to leverage WebLogic Server as the Java application server tier is that, per the default installation, Oracle BI is established in a clustered architecture. There is no additional configuration necessary to set this architecture in motion. Clearly, installing Oracle BI on a single server only provides a single server with which to interface; however, upon doing so, Oracle BI is installed into a single-node clustered-application-server environment. Additional clustered nodes of Oracle BI can then be configured to establish and expand the cluster, either horizontally or vertically.

Vertical vs Horizontal

With respect to the enterprise architecture and infrastructure of the Oracle BI environment, a clustered environment can be expanded in one of two ways: horizontally (scale-out) and vertically (scale-up). A horizontal expansion is the typical expansion type when clustering. It is represented by installing and configuring the application on a separate physical server, with reference to the main server application. A vertical expansion is usually represented by expanding the application on the same physical server on which the main server application resides. A horizontally expanded system can then, additionally, be vertically expanded. There are benefits to both scaling options. The decision to scale the system one way or the other is usually predicated on the cost of additional physical servers, server limitations, peripherals such as memory or processors, or an increase in usage activity by end users. Some considerations that may be used to assess which approach is best for your specific implementation are as follows:

Load-balancing capabilities and the need for an active-active versus active-passive architecture
The need for failover or high availability
The cost of processor and memory enhancements versus the cost of new servers
The anticipated increase in concurrent user queries
The realized decrease in performance due to an increase in user activity

Oracle BI Server (System Component) Cluster Controller

When discussing scaling out the Oracle BI Server cluster, it is a common mistake to confuse WebLogic Server application clustering with the Oracle BI Server Cluster Controller. Currently, Oracle BI can only have a single metadata repository (RPD) reference associated with an Oracle BI Server deployment instance at any single point in time. Because of this, the Oracle BI Server engine leverages a failover concept to ensure that some level of high availability exists when the environment is scaled. In an Oracle BI scaled-out clustered environment, a secondary node, which has an instance of Oracle BI installed, will contain a secondary Oracle BI Server engine. From the main Oracle BI Managed Server containing the primary Oracle BI Server instance, the secondary Oracle BI Server instance gets established as the failover server engine using the Oracle BI Server Cluster Controller. This configuration takes place in the Enterprise Manager Fusion Control console.
Based on this configuration, the scaled-out Oracle BI Server engine acts in an active-passive mode. That is to say, when the main Oracle BI Server engine instance fails, the secondary, or passive, Oracle BI Server engine becomes active to route requests and field queries.

Summary

This article serves as a useful introduction for the beginner to what WebLogic Server is.
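As a small illustration of the Node Manager lifecycle management mentioned earlier, here is a hedged WLST (Jython) sketch of how a start-up script might ask Node Manager to start the Oracle BI managed server. The credentials, host, port, domain directory, and Node Manager type below are placeholders that will differ in your environment, so treat this as an outline rather than a tested script:

# Run with the WLST shell, for example:
#   $ORACLE_HOME/oracle_common/common/bin/wlst.sh start_bi.py   (path indicative)

nm_user    = 'weblogic'                                  # placeholder credentials
nm_pass    = 'welcome1'                                  # placeholder credentials
nm_host    = 'localhost'
nm_port    = '9506'                                      # placeholder Node Manager port
domain     = 'bi'
domain_dir = '/u01/app/oracle/user_projects/domains/bi'  # placeholder path

# Connect to the Node Manager process on this machine
nmConnect(nm_user, nm_pass, nm_host, nm_port, domain, domain_dir, 'SSL')

# Ask Node Manager to start the bi_server1 managed server, then disconnect
nmStart('bi_server1')
nmDisconnect()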


Learn from Data

Packt
09 Mar 2017
6 min read
In this article by Rushdi Shams, the author of the book Java Data Science Cookbook, we will cover recipes that use machine learning techniques to learn patterns from data. These patterns are at the centre of attention for at least three key machine-learning tasks: classification, regression, and clustering. Classification is the task of predicting a value from a nominal class. In contrast to classification, regression models attempt to predict a value from a numeric class.

Generating linear regression models

Most linear regression modelling follows a general pattern: there will be many independent variables that collectively produce a result, which is the dependent variable. For instance, we can generate a regression model to predict the price of a house based on different attributes/features of the house (mostly numeric, real values), such as its size in square feet, the number of bedrooms, the number of washrooms, the importance of its location, and so on. In this recipe, we will use Weka's Linear Regression classifier to generate a regression model.

Getting ready

In order to perform the recipes in this section, we will require the following:

To download Weka, go to http://www.cs.waikato.ac.nz/ml/weka/downloading.html, where you will find download options for Windows, Mac, and other operating systems such as Linux. Read through the options carefully and download the appropriate version. At the time of writing, 3.9.0 was the latest developer version; as the author already had a version 1.8 JVM installed on his 64-bit Windows machine, he chose to download the self-extracting executable for 64-bit Windows without a Java Virtual Machine (JVM).

After the download is complete, double-click on the executable file and follow the on-screen instructions. You need to install the full version of Weka.

Once the installation is done, do not run the software. Instead, go to the directory where you installed it and find the Java archive file for Weka (weka.jar). Add this file to your Eclipse project as an external library.

If you need to download older versions of Weka for some reason, all of them can be found at https://sourceforge.net/projects/weka/files/. Please note that many of the methods from old versions may be deprecated and therefore no longer supported.

How to do it…

In this recipe, the linear regression model we will be creating is based on the cpu.arff dataset, which can be found in the data directory of the Weka installation directory.

Our code will have two instance variables: the first will contain the data instances of the cpu.arff file, and the second will be our linear regression classifier.

Instances cpu = null;
LinearRegression lReg;

Next, we will create a method to load the ARFF file and assign the last attribute of the ARFF file as its class attribute.

public void loadArff(String arffInput){
    DataSource source = null;
    try {
        source = new DataSource(arffInput);
        cpu = source.getDataSet();
        // The last attribute is the class (target) attribute
        cpu.setClassIndex(cpu.numAttributes() - 1);
    } catch (Exception e1) {
    }
}

We will then create a method to build the linear regression model. To do so, we simply need to call the buildClassifier() method of our linear regression variable. The model can be sent directly as a parameter to System.out.println().
public void buildRegression(){
    lReg = new LinearRegression();
    try {
        lReg.buildClassifier(cpu);
    } catch (Exception e) {
    }
    System.out.println(lReg);
}

The complete code for the recipe is as follows:

import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaLinearRegressionTest {

    Instances cpu = null;
    LinearRegression lReg;

    public void loadArff(String arffInput){
        DataSource source = null;
        try {
            source = new DataSource(arffInput);
            cpu = source.getDataSet();
            cpu.setClassIndex(cpu.numAttributes() - 1);
        } catch (Exception e1) {
        }
    }

    public void buildRegression(){
        lReg = new LinearRegression();
        try {
            lReg.buildClassifier(cpu);
        } catch (Exception e) {
        }
        System.out.println(lReg);
    }

    public static void main(String[] args) throws Exception{
        WekaLinearRegressionTest test = new WekaLinearRegressionTest();
        test.loadArff("path to the cpu.arff file");
        test.buildRegression();
    }
}

The output of the code is as follows:

Linear Regression Model

class = 0.0491 * MYCT + 0.0152 * MMIN + 0.0056 * MMAX + 0.6298 * CACH + 1.4599 * CHMAX + -56.075

Generating logistic regression models

Weka has a class named Logistic that can be used for building and using a multinomial logistic regression model with a ridge estimator. Although original logistic regression does not deal with instance weights, the algorithm in Weka has been modified to handle them. In this recipe, we will use Weka to generate a logistic regression model on the iris dataset.

How to do it…

We will be generating a logistic regression model from the iris dataset, which can be found in the data directory of the Weka installation folder.

Our code will have two instance variables: one will contain the data instances of the iris dataset, and the other will be the logistic regression classifier.

Instances iris = null;
Logistic logReg;

We will be using a method to load and read the dataset as well as assign its class attribute (the last attribute of the iris.arff file):

public void loadArff(String arffInput){
    DataSource source = null;
    try {
        source = new DataSource(arffInput);
        iris = source.getDataSet();
        iris.setClassIndex(iris.numAttributes() - 1);
    } catch (Exception e1) {
    }
}

Next, we will create the most important method of our recipe, which builds a logistic regression classifier from the iris dataset:

public void buildRegression(){
    logReg = new Logistic();
    try {
        logReg.buildClassifier(iris);
    } catch (Exception e) {
    }
    System.out.println(logReg);
}

The complete executable code for the recipe is as follows:

import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaLogisticRegressionTest {

    Instances iris = null;
    Logistic logReg;

    public void loadArff(String arffInput){
        DataSource source = null;
        try {
            source = new DataSource(arffInput);
            iris = source.getDataSet();
            iris.setClassIndex(iris.numAttributes() - 1);
        } catch (Exception e1) {
        }
    }

    public void buildRegression(){
        logReg = new Logistic();
        try {
            logReg.buildClassifier(iris);
        } catch (Exception e) {
        }
        System.out.println(logReg);
    }

    public static void main(String[] args) throws Exception{
        WekaLogisticRegressionTest test = new WekaLogisticRegressionTest();
        test.loadArff("path to the iris.arff file");
        test.buildRegression();
    }
}

The output of the code is as follows:

Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
                          Class
Variable             Iris-setosa    Iris-versicolor
===================================================
sepallength              21.8065             2.4652
sepalwidth                4.5648             6.6809
petallength             -26.3083            -9.4293
petalwidth               -43.887           -18.2859
Intercept                 8.1743             42.637

Odds Ratios...
                          Class
Variable             Iris-setosa    Iris-versicolor
===================================================
sepallength      2954196659.8892            11.7653
sepalwidth               96.0426           797.0304
petallength                    0             0.0001
petalwidth                     0                  0

The interpretation of the results of this recipe is beyond the scope of this article. Interested readers are encouraged to see the Stack Overflow discussion at http://stackoverflow.com/questions/19136213/how-to-interpret-weka-logistic-regression-output.

Summary

In this article, we have covered recipes that use machine learning techniques to learn patterns from data. These patterns are at the centre of attention for at least three key machine-learning tasks: classification, regression, and clustering. Classification is the task of predicting a value from a nominal class.


Reading the Fine Manual

Packt
09 Mar 2017
34 min read
In this article by Simon Riggs, Gabriele Bartolini, Hannu Krosing, and Gianni Ciolli, the authors of the book PostgreSQL Administration Cookbook - Third Edition, you will learn the following recipes:

Reading The Fine Manual (RTFM)
Planning a new database
Changing parameters in your programs
Finding the current configuration settings
Which parameters are at nondefault settings?
Updating the parameter file
Setting parameters for particular groups of users
The basic server configuration checklist
Adding an external module to PostgreSQL
Using an installed module
Managing installed extensions

I get asked many questions about parameter settings in PostgreSQL. Everybody's busy, and most people want a 5-minute tour of how things work. That's exactly what a Cookbook does, so we'll do our best. Some people believe that there are magical parameter settings that will improve their performance, and spend hours combing the pages of books to glean insights. Others feel comfortable because they have found some website somewhere that "explains everything", and they "know" they have their database configured OK. For the most part, the settings are easy to understand. Finding the best setting can be difficult, and the optimal setting may change over time in some cases. This article is mostly about knowing how, when, and where to change parameter settings.

Reading The Fine Manual (RTFM)

RTFM is often used rudely to mean "don't bother me, I'm busy", or it is used as a stronger form of abuse. The strange thing is that asking you to read a manual is most often very good advice. Don't flame the advisors back; take the advice! The most important point to remember is that you should refer to a manual whose release version matches that of the server on which you are operating. The PostgreSQL manual is very well written and comprehensive in its coverage of specific topics. However, one of its main failings is that the "documents" aren't organized in a way that helps somebody who is trying to learn PostgreSQL. They are organized from the perspective of people checking specific technical points so that they can decide whether their difficulty is a user error or not. It sometimes answers "What?" but seldom "Why?" or "How?" I've helped write sections of the PostgreSQL documents, so I'm not embarrassed to steer you towards reading them. There are, nonetheless, many things to read here that are useful.

How to do it…

The main documents for each PostgreSQL release are available at http://www.postgresql.org/docs/manuals/. The most frequently accessed parts of the documents are as follows:

SQL command reference, as well as client and server tools reference: http://www.postgresql.org/docs/current/interactive/reference.html
Configuration: http://www.postgresql.org/docs/current/interactive/runtime-config.html
Functions: http://www.postgresql.org/docs/current/interactive/functions.html

You can also grab yourself a PDF version of the manual, which can allow easier searching in some cases. Don't print it! The documents are more than 2000 pages of A4-sized sheets.

How it works…

The PostgreSQL documents are written in SGML, which is similar to, but not the same as, XML. These files are then processed to generate HTML files, PDFs, and so on. This ensures that all the formats have exactly the same content. Then, you can choose the format you prefer, and you can even compile it in other formats such as EPUB, INFO, and so on.
Moreover, the PostgreSQL manual is actually a subset of the PostgreSQL source code, so it evolves together with the software. It is written by the same people who make PostgreSQL. Even more reasons to read it!

There's more…

More information is also available at http://wiki.postgresql.org. Many distributions offer packages that install static versions of the HTML documentation. For example, on Debian and Ubuntu, the docs for the most recent stable PostgreSQL version are named postgresql-9.6-docs (unsurprisingly).

Planning a new database

Planning a new database can be a daunting task. It's easy to get overwhelmed by it, so here we present some planning ideas. It's also easy to charge headlong at the task, thinking that whatever you know is all you'll ever need to consider.

Getting ready

You are ready. Don't wait to be told what to do. If you haven't been told what the requirements are, then write down what you think they are, clearly labeling them as "assumptions" rather than "requirements"; we mustn't confuse the two things. Iterate until you get some agreement, and then build a prototype.

How to do it…

Write a document that covers the following items:

Database design: plan your database design
Calculate the initial database sizing
Transaction analysis: how will we access the database?
Look at the most frequent access paths
What are the requirements for response times?
Hardware configuration
Initial performance thoughts: will all of the data fit into RAM?
Choose the operating system and filesystem type
How do we partition the disk?
Localization plan: decide server encoding, locale, and time zone
Access and security plan: identify client systems and specify required drivers
Create roles according to a plan for access control
Specify pg_hba.conf
Maintenance plan: who will keep it working? How?
Availability plan: consider the availability requirements, including checkpoint_timeout
Plan your backup mechanism and test it
High-availability plan: decide which form of replication you'll need, if any

How it works…

One of the most important reasons for planning your database ahead of time is that retrofitting some things is difficult. This is especially true of server encoding and locale, which can cause much downtime and exertion if we need to change them later. Security is also much more difficult to set up after the system is live.

There's more…

Planning always helps. You may know what you're doing, but others may not. Tell everybody what you're going to do before you do it to avoid wasting time. If you're not sure yet, then build a prototype to help you decide. Approach the administration framework as if it were a development task. Make a list of things you don't know yet, and work through them one by one. This is deliberately a very short recipe. Everybody has their own way of doing things, and it's very important not to be too prescriptive about how to do things. If you already have a plan, great! If you don't, think about what you need to do, make a checklist, and then do it.

Changing parameters in your programs

PostgreSQL allows you to set some parameter settings for each session or transaction.

How to do it…

You can change the value of a setting during your session, like this:

SET work_mem = '16MB';

This value will then be used for every future transaction.
You can also change it only for the duration of the "current transaction":

SET LOCAL work_mem = '16MB';

The setting will last until you issue this command:

RESET work_mem;

Alternatively, you can issue the following command:

RESET ALL;

SET and RESET commands are SQL commands that can be issued from any interface. They apply only to PostgreSQL server parameters, but this does not mean that they affect the entire server. In fact, the parameters you can change with SET and RESET apply only to the current session. Also, note that there may be other parameters, such as JDBC driver parameters, that cannot be set in this way.

How it works…

Suppose you change the value of a setting during your session, for example, by issuing this command:

SET work_mem = '16MB';

Then, the following will show up in the pg_settings catalog view:

postgres=# SELECT name, setting, reset_val, source
FROM pg_settings WHERE source = 'session';

   name   | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 16384   | 1024      | session

This remains the case until you issue this command:

RESET work_mem;

After issuing it, the setting returns to reset_val and the source returns to default:

   name   | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 1024    | 1024      | default

There's more…

You can change the value of a setting during your transaction as well, like this:

SET LOCAL work_mem = '16MB';

Then, this will show up in the pg_settings catalog view:

postgres=# SELECT name, setting, reset_val, source
FROM pg_settings WHERE source = 'session';

   name   | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 1024    | 1024      | session

Huh? What happened to your parameter setting? The SET LOCAL command takes effect only for the transaction in which it was executed, which was just the SET LOCAL command in our case. We need to execute it inside a transaction block to be able to see the setting take hold, as follows:

BEGIN;
SET LOCAL work_mem = '16MB';

Here is what shows up in the pg_settings catalog view:

postgres=# SELECT name, setting, reset_val, source
FROM pg_settings WHERE source = 'session';

   name   | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 16384   | 1024      | session

You should also note that the value of source is session rather than transaction, as you might have been expecting.

Finding the current configuration settings

At some point, it will occur to you to ask, "What are the current configuration settings?" Most settings can be changed in more than one way, and some ways do not affect all users or all sessions, so it is quite possible to get confused.

How to do it…

Your first thought is probably to look in postgresql.conf, which is the configuration file, described in detail in the Updating the parameter file recipe. That works, but only as long as there is only one parameter file. If there are two, then maybe you're reading the wrong file! (How do you know?) So, the cautious and accurate way is not to trust a text file, but to trust the server itself. Moreover, you learned in the previous recipe, Changing parameters in your programs, that each parameter has a scope that determines when it can be set. Some parameters can be set through postgresql.conf, but others can be changed afterwards. So, the current value of configuration settings may have been subsequently changed.
We can use the SHOW command like this:

postgres=# SHOW work_mem;

Its output is as follows:

 work_mem
----------
 1MB
(1 row)

However, remember that it reports the current setting at the time it is run, and that can be changed in many places. Another way of finding the current settings is to access a PostgreSQL catalog view named pg_settings:

postgres=# \x
Expanded display is on.
postgres=# SELECT * FROM pg_settings WHERE name = 'work_mem';
-[ RECORD 1 ]--------------------------------------------------------
name       | work_mem
setting    | 1024
unit       | kB
category   | Resource Usage / Memory
short_desc | Sets the maximum memory to be used for query workspaces.
extra_desc | This much memory can be used by each internal sort operation and hash table before switching to temporary disk files.
context    | user
vartype    | integer
source     | default
min_val    | 64
max_val    | 2147483647
enumvals   |
boot_val   | 1024
reset_val  | 1024
sourcefile |
sourceline |

Thus, you can use the SHOW command to retrieve the value for a setting, or you can access the full details via the catalog table.

There's more…

The actual location of each configuration file can be asked of the PostgreSQL server directly, as shown in this example:

postgres=# SHOW config_file;

This returns the following output:

               config_file
------------------------------------------
 /etc/postgresql/9.4/main/postgresql.conf
(1 row)

The other configuration files can be located by querying similar variables, hba_file and ident_file.

How it works…

Each parameter setting is cached within each session, so we get fast and easy access to the parameter settings. Remember that the values displayed are not necessarily settings for the server as a whole. Many of those parameters will be specific to the current session. That's different from what you experience with many other database systems, and it is also very useful.

Which parameters are at nondefault settings?

Often, we need to check which parameters have been changed or whether our changes have correctly taken effect. In the previous two recipes, we have seen that parameters can be changed in several ways, and with different scope. You learned how to inspect the value of one parameter or get the full list of parameters. In this recipe, we will show you how to use SQL capabilities to list only those parameters whose value in the current session differs from the system-wide default value. This list is valuable for several reasons. First, it includes only a few of the 200-plus available parameters, so it is more immediate. Also, it is difficult to remember all our past actions, especially in the middle of a long or complicated session. Version 9.4 introduces the ALTER SYSTEM syntax, which we will describe in the next recipe, Updating the parameter file. From the viewpoint of this recipe, its behavior is quite different from all other setting-related commands; you run it from within your session and it changes the default value, but not the value in your session.
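As a quick illustration of that last point, here is a minimal sketch (the parameter and value are only examples, and ALTER SYSTEM is covered properly in the next recipe):

ALTER SYSTEM SET work_mem = '32MB';  -- written to postgresql.auto.conf, not applied yet
SELECT pg_reload_conf();             -- ask the server to reread its configuration files
SHOW work_mem;                       -- a value you have explicitly SET in this session still wins

After the reload, the new value becomes the server-wide default for sessions that have not set the parameter themselves, but it does not override a value you have already set explicitly in your current session.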
How to do it… We write a SQL query that lists all parameter values, excluding those whose current value is either the default or set from a configuration file: postgres=# SELECT name, source, setting FROM pg_settings WHERE source != 'default' AND source != 'override' ORDER by 2, 1; The output is as follows: name | source | setting ----------------------------+----------------------+----------------- application_name | client | psql client_encoding | client | UTF8 DateStyle | configuration file | ISO, DMY default_text_search_config | configuration file | pg_catalog.english dynamic_shared_memory_type | configuration file | posix lc_messages | configuration file | en_GB.UTF-8 lc_monetary | configuration file | en_GB.UTF-8 lc_numeric | configuration file | en_GB.UTF-8 lc_time | configuration file | en_GB.UTF-8 log_timezone | configuration file | Europe/Rome max_connections | configuration file | 100 port | configuration file | 5460 shared_buffers | configuration file | 16384 TimeZone | configuration file | Europe/Rome max_stack_depth | environment variable | 2048 How it works… You can see from pg_settings which parameters have nondefault values and what the source of the current value is. The SHOW command doesn't tell you whether a parameter is set at a nondefault value. It just tells you the value, which isn't of much help if you're trying to understand what is set and why. If the source is a configuration file, then the sourcefile and sourceline columns are also set. These can be useful in understanding where the configuration came from. There's more… The setting column of pg_settings shows the current value, but you can also look at boot_val and reset_val. The boot_val parameter shows the value assigned when the PostgreSQL database cluster was initialized (initdb), while reset_val shows the value that the parameter will return to if you issue the RESET command. The max_stack_depth parameter is an exception because pg_settings says it is set by the environment variable, though it is actually set by ulimit -s on Linux and Unix systems. The max_stack_depth parameter just needs to be set directly on Windows. The time zone settings are also picked up from the OS environment, so you shouldn't need to set those directly. In older releases, pg_settings showed them as command-line settings. From version 9.1 onwards, they are written to postgresql.conf when the data directory is initialized, so they show up as configuration files. Updating the parameter file The parameter file is the main location for defining parameter values for the PostgreSQL server. All the parameters can be set in the parameter file, which is known as postgresql.conf. There are also two other parameter files: pg_hba.conf and pg_ident.conf. Both of these relate to connections and security. Getting ready First, locate postgresql.conf, as described earlier. How to do it… Some of the parameters take effect only when the server is first started. A typical example might be shared_buffers, which defines the size of the shared memory cache. Many of the parameters can be changed while the server is still running. After changing the required parameters, we issue a reload operation to the server, forcing PostgreSQL to reread the postgresql.conf file (and all other configuration files). There is a number of ways to do that, depending on your distribution and OS. The most common is to issue the following command: pg_ctl reload with the same OS user that runs the PostgreSQL server process. 
This assumes the default data directory; otherwise, you have to specify the correct data directory with the -D option. As noted earlier, Debian and Ubuntu have a different multiversion architecture, so you should issue the following command instead:

pg_ctlcluster 9.6 main reload

On modern distributions you can also use systemd, as follows:

sudo systemctl reload postgresql@9.6-main

Some other parameters require a restart of the server for changes to take effect, for instance, max_connections, listen_addresses, and so on. The syntax is very similar to a reload operation, as shown here:

pg_ctl restart

For Debian and Ubuntu, use this command:

pg_ctlcluster 9.6 main restart

and with systemd:

sudo systemctl restart postgresql@9.6-main

The postgresql.conf file is a normal text file that can be simply edited. Most of the parameters are listed in the file, so you can just search for them and then insert the desired value in the right place.

How it works…

If you set the same parameter twice in different parts of the file, the last setting is what applies. This can cause lots of confusion if you add settings to the bottom of the file, so you are advised against doing that. The best practice is to either leave the file as it is and edit the values, or to start with a blank file and include only the values that you wish to change. I personally prefer a file with only the nondefault values. That makes it easier to see what's happening. Whichever method you use, you are strongly advised to keep all the previous versions of your .conf files. You can do this by copying, or you can use a version control system such as Git or SVN.

There's more…

The postgresql.conf file also supports an include directive. This allows the postgresql.conf file to reference other files, which can then reference other files, and so on. That may help you organize your parameter settings better, if you don't make it too complicated. If you are working with PostgreSQL version 9.4 or later, you can change the values stored in the parameter files directly from your session, with syntax such as the following:

ALTER SYSTEM SET shared_buffers = '1GB';

This command will not actually edit postgresql.conf. Instead, it writes the new setting to another file named postgresql.auto.conf. The effect is equivalent, albeit achieved in a safer way: the original configuration file is never written to, so it cannot be damaged in the event of a crash. If you mess up with too many ALTER SYSTEM commands, you can always delete postgresql.auto.conf manually and reload the configuration, or restart PostgreSQL, depending on which parameters you had changed.

Setting parameters for particular groups of users

PostgreSQL supports a variety of ways of defining parameter settings for various user groups. This is very convenient, especially to manage user groups that have different requirements.

How to do it…

For all users in the saas database, use the following command:

ALTER DATABASE saas SET configuration_parameter = value1;

For a user named simon connected to any database, use this:

ALTER ROLE simon SET configuration_parameter = value2;

Alternatively, you can set a parameter for a user only when connected to a specific database, as follows:

ALTER ROLE simon IN DATABASE saas SET configuration_parameter = value3;

The user won't know that these have been executed specifically for them. These are default settings, and in most cases, they can be overridden if the user requires nondefault values.
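If you later want to review which of these per-database and per-role defaults have been stored, one way to do so (a sketch, not part of the recipe itself) is to query the pg_db_role_setting catalog, where PostgreSQL records them:

SELECT d.datname, r.rolname, s.setconfig
FROM pg_db_role_setting s
LEFT JOIN pg_database d ON d.oid = s.setdatabase
LEFT JOIN pg_roles r ON r.oid = s.setrole;

A NULL datname or rolname means that the setting applies to all databases or to all roles, respectively. In psql, the \drds meta-command shows the same information.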
How it works… You can set parameters for each of the following: Database User (which is named role by PostgreSQL) Database/user combination Each of the parameter defaults is overridden by the one below it. In the preceding three SQL statements: If user hannu connects to the saas database, then value1 will apply If user simon connects to a database other than saas, then value2 will apply If user simon connects to the saas database, then value3 will apply PostgreSQL implements this in exactly the same way as if the user had manually issued the equivalent SET statements immediately after connecting. The basic server configuration checklist PostgreSQL arrives configured for use on a shared system, though many people want to run dedicated database systems. The PostgreSQL project wishes to ensure that PostgreSQL will play nicely with other server software, and will not assume that it has access to the full server resources. If you, as the system administrator, know that there is no other important server software running on this system, then you can crank up the values much higher. Getting ready Before we start, we need to know two sets of information: We need to know the size of the physical RAM that will be dedicated to PostgreSQL We need to know something about the types of applications for which we will use PostgreSQL How to do it… If your database is larger than 32 MB, then you'll probably benefit from increasing shared_buffers. You can increase this to much larger values, but remember that running out of memory induces many problems. For instance, PostgreSQL is able to store information to the disk when the available memory is too small, and it employs sophisticated algorithms to treat each case differently and to place each piece of data either in the disk or in the memory, depending on each use case. On the other hand, overstating the amount of available memory confuses such abilities and results in suboptimal behavior. For instance, if the memory is swapped to disk, then PostgreSQL will inefficiently treat all data as if it were the RAM. Another unfortunate circumstance is when the Linux Out-Of-Memory (OOM) killer terminates one of the various processes spawned by the PostgreSQL server. So, it's better to be conservative. It is good practice to set a low value in your postgresql.conf and increment slowly to ensure that you get the benefits from each change. If you increase shared_buffers and you're running on a non-Windows server, you will almost certainly need to increase the value of the SHMMAX OS parameter (and on some platforms, other parameters as well). On Linux, Mac OS, and FreeBSD, you will need to either edit the /etc/sysctl.conf file or use sysctl -w with the following values: For Linux, use kernel.shmmax=value For Mac OS, use kern.sysv.shmmax=value For FreeBSD, use kern.ipc.shmmax=value There's more… For more information, you can refer to http://www.postgresql.org/docs/9.6/static/kernel-resources.html#SYSVIPC. For example, on Linux, add the following line to /etc/sysctl.conf: kernel.shmmax=value Don't worry about setting effective_cache_size. It is much less important a parameter than you might think. There is no need for too much fuss selecting the value. If you're doing heavy write activity, then you may want to set wal_buffers to a much higher value than the default. 
In fact wal_buffers is set automatically from the value of shared_buffers, following a rule that fits most cases; however, it is always possible to specify an explicit value that overrides the computation for the very few cases where the rule is not good enough. If you're doing heavy write activity and/or large data loads, you may want to set checkpoint_segments higher than the default to avoid wasting I/O in excessively frequent checkpoints. If your database has many large queries, you may wish to set work_mem to a value higher than the default. However, remember that such a limit applies separately to each node in the query plan, so there is a real risk of overallocating memory, with all the problems discussed earlier. Ensure that autovacuum is turned on, unless you have a very good reason to turn it off—most people don't. Leave the settings as they are for now. Don't fuss too much about getting the settings right. You can change most of them later, so you can take an iterative approach to improving things.  And remember, don't touch the fsync parameter. It's keeping you safe. Adding an external module to PostgreSQL Another strength of PostgreSQL is its extensibility. Extensibility was one of the original design goals, going back to the late 1980s. Now, in PostgreSQL 9.6, there are many additional modules that plug into the core PostgreSQL server. There are many kinds of additional module offerings, such as the following: Additional functions Additional data types Additional operators Additional indexes Getting ready First, you'll need to select an appropriate module to install. The walk towards a complete, automated package management system for PostgreSQL is not over yet, so you need to look in more than one place for the available modules, such as the following: Contrib: The PostgreSQL "core" includes many functions. There is also an official section for add-in modules, known as "contrib" modules. They are always available for your database server, but are not automatically enabled in every database because not all users might need them. On PostgreSQL version 9.6, we will have more than 40 such modules. These are documented at http://www.postgresql.org/docs/9.6/static/contrib.html. PGXN: This is the PostgreSQL Extension Network, a central distribution system dedicated to sharing PostgreSQL extensions. The website started in 2010, as a repository dedicated to the sharing of extension files. Separate projects: These are large external projects, such as PostGIS, offering extensive and complex PostgreSQL modules. For more information, take a look at http://www.postgis.org/. How to do it… There are several ways to make additional modules available for your database server, as follows: Using a software installer Installing from PGXN Installing from a manually downloaded package Installing from source code Often, a particular module will be available in more than one way, and users are free to choose their favorite, exactly like PostgreSQL itself, which can be downloaded and installed through many different procedures. Installing modules using a software installer Certain modules are available exactly like any other software packages that you may want to install in your server. All main Linux distributions provide packages for the most popular modules, such as PostGIS, SkyTools, procedural languages other than those distributed with core, and so on. 
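For example, on Debian or Ubuntu the contrib modules for a given server version usually ship as a single package, so making them available is typically one command away (the exact package name depends on your distribution and PostgreSQL version; this one is only indicative):

sudo apt-get install postgresql-contrib-9.6

After that, the modules are present on the server, but they still need to be enabled in each database where you want to use them, as described later in the Using an installed module recipe.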
In some cases, modules can be added during installation if you're using a standalone installer application, for example, the OneClick installer, or tools such as rpm, apt-get, and YaST on Linux distributions. The same procedure can also be followed after the PostgreSQL installation, when the need for a certain module arrives. We will actually describe this case, which is way more common. For example, let's say that you need to manage a collection of Debian package files, and that one of your tasks is to be able to pick the latest version of one of them. You start by building a database that records all the package files. Clearly, you need to store the version number of each package. However, Debian version numbers are much more complex than what we usually call "numbers". For instance, on my Debian laptop, I currently have version 9.2.18-1.pgdg80+1 of the PostgreSQL client package. Despite being complicated, that string follows a clearly defined specification, which includes many bits of information, including how to compare two versions to establish which of them is older. Since this recipe discussed extending PostgreSQL with custom data types and operators, you might have already guessed that I will now consider a custom data type for Debian version numbers that is capable of tasks such as understanding the Debian version number format, sorting version numbers, choosing the latest version number in a given group, and so on. It turns out that somebody else already did all the work of creating the required PostgreSQL data type, endowed with all the useful accessories: comparison operators, input/output functions, support for indexes, and maximum/minimum aggregates. All of this has been packaged as a PostgreSQL extension, as well as a Debian package (not a big surprise), so it is just a matter of installing the postgresql-9.2-debversion package with a Debian tool such as apt-get, aptitude, or synaptic. On my laptop, that boils down to the command line: apt-get install postgresql-9.2-debversion This will download the required package and unpack all the files in the right locations, making them available to my PostgreSQL server. Installing modules from PGXN The PostgreSQL Extension Network, PGXN for short, is a website (http://pgxn.org) launched in late 2010 with the purpose of providing "a central distribution system for open source PostgreSQL extension libraries". Anybody can register and upload their own module, packaged as an extension archive. The website allows browsing available extensions and their versions, either via a search interface or from a directory of package and user names. The simple way is to use a command-line utility, called pgxnclient. It can be easily installed in most systems; see the PGXN website on how to do so. Its purpose is to interact with PGXN and take care of administrative tasks, such as browsing available extensions, downloading the package, compiling the source code, installing files in the proper place, and removing installed package files. Alternatively, you can download the extension files from the website and place them in the right place by following the installation instructions. PGXN is different from official repositories because it serves another purpose. Official repositories usually contain only seasoned extensions because they accept new software only after a certain amount of evaluation and testing. 
On the other hand, anybody can ask for a PGXN account and upload their own extensions, so there is no filter except requiring that the extension has an open source license and a few files that any extension must have. Installing modules from a manually downloaded package You might have to install a module that is correctly packaged for your system but is not available from the official package archives. For instance, it could be the case that the module has not been accepted in the official repository yet, or you could have repackaged a bespoke version of that module with some custom tweaks, which are so specific that they will never become official. Whatever the case, you will have to follow the installation procedure for standalone packages specific to your system. Here is an example with the Oracle compatibility module, described at http://postgres.cz/wiki/Oracle_functionality_(en): First, we get the package, say for PostgreSQL 8.4 on a 64-bit architecture, from http://pgfoundry.org/frs/download.php/2414/orafce-3.0.1-1.pg84.rhel5.x86_64.rpm. Then, we install the package in the standard way: rpm -ivh orafce-3.0.1-1.pg84.rhel5.x86_64.rpm If all the dependencies are met, we are done. I mentioned dependencies because that's one more potential problem in installing packages that are not officially part of the installed distribution—you can no longer assume that all software version numbers have been tested, all requirements are available, and there are no conflicts. If you get error messages that indicate problems in these areas, you may have to solve them yourself, by manually installing missing packages and/or uninstalling conflicting packages. Installing modules from source code In many cases, useful modules may not have full packaging. In these cases, you may need to install the module manually. This isn't very hard and it's a useful exercise that helps you understand what happens. Each module will have different installation requirements. There are generally two aspects of installing a module. They are as follows: Building the libraries (only for modules that have libraries) Installing the module files in the appropriate locations You need to follow the instructions for the specific module in order to build the libraries, if any are required. Installation will then be straightforward, and usually there will be a suitably prepared configuration file for the make utility so that you just need to type the following command: make install Each file will be copied to the right directory. Remember that you normally need to be a system superuser in order to install files on system directories. Once a library file is in the directory expected by the PostgreSQL server, it will be loaded automatically as soon as requested by a function. Modules such as auto_explain do not provide any additional user-defined function, so they won't be auto-loaded; that needs to be done manually by a superuser with a LOAD statement. How it works… PostgreSQL can dynamically load libraries in the following ways: Using the explicit LOAD command in a session Using the shared_preload_libraries parameter in postgresql.conf at server start At session start, using the local_preload_libraries parameter for a specific user, as set using ALTER ROLE PostgreSQL functions and objects can reference code in these libraries, allowing extensions to be bound tightly to the running server process. 
The tight binding makes this method suitable for use even in very high-performance applications, and there's no significant difference between additionally supplied features and native features.

Using an installed module

In this recipe, we will explain how to enable an installed module so that it can be used in a particular database. The additional types, functions, and so on will exist only in those databases where we have carried out this step. Although most modules require this procedure, there are actually a couple of notable exceptions. For instance, the auto_explain module mentioned earlier, which is shipped together with PostgreSQL, does not create any function, type, or operator. To use it, you must load its object file using the LOAD command. From that moment, all statements longer than a configurable threshold will be logged together with their execution plan. In the rest of this recipe, we will cover all the other modules. They do not require a LOAD statement because PostgreSQL can automatically load the relevant libraries when they are required. As mentioned in the previous recipe, Adding an external module to PostgreSQL, specially packaged modules are called extensions in PostgreSQL. They can be managed with dedicated SQL commands.

Getting ready

Suppose you have chosen to install a certain module among those available for your system (see the previous recipe, Adding an external module to PostgreSQL); all you need to know is the extension name.

How to do it…

Each extension has a unique name, so it is just a matter of issuing the following command:

CREATE EXTENSION myextname;

This will automatically create all the required objects inside the current database. For security reasons, you need to do so as a database superuser. For instance, if you want to install the dblink extension, type this:

CREATE EXTENSION dblink;

How it works…

When you issue a CREATE EXTENSION command, the database server looks for a file named EXTNAME.control in the SHAREDIR/extension directory. That file tells PostgreSQL some properties of the extension, including a description, some installation information, and the default version number of the extension (which is unrelated to the PostgreSQL version number). Then, a creation script is executed in a single transaction, so if it fails, the database is unchanged. The database server also notes in a catalog table the extension name and all the objects that belong to it.

Managing installed extensions

In the last two recipes, we showed you how to install external modules in PostgreSQL to augment its capabilities. In this recipe, we will show you some more capabilities offered by the extension infrastructure.

How to do it…

First, we list all available extensions:

postgres=# \x on
Expanded display is on.
postgres=# SELECT *
postgres-# FROM pg_available_extensions
postgres-# ORDER BY name;
-[ RECORD 1 ]-----+--------------------------------------------------
name              | adminpack
default_version   | 1.0
installed_version |
comment           | administrative functions for PostgreSQL
-[ RECORD 2 ]-----+--------------------------------------------------
name              | autoinc
default_version   | 1.0
installed_version |
comment           | functions for autoincrementing fields
(...)
In particular, if the dblink extension is installed, then we see a record like this:

-[ RECORD 10 ]----+--------------------------------------------------
name              | dblink
default_version   | 1.0
installed_version | 1.0
comment           | connect to other PostgreSQL databases from within a database

Now, we can list all the objects in the dblink extension, as follows:

postgres=# \x off
Expanded display is off.
postgres=# \dx+ dblink
Objects in extension "dblink"
                          Object Description
---------------------------------------------------------------------
 function dblink_build_sql_delete(text,int2vector,integer,text[])
 function dblink_build_sql_insert(text,int2vector,integer,text[],text[])
 function dblink_build_sql_update(text,int2vector,integer,text[],text[])
 function dblink_cancel_query(text)
 function dblink_close(text)
 function dblink_close(text,boolean)
 function dblink_close(text,text)
(...)

Objects created as part of extensions are not special in any way, except that you can't drop them individually. This is done to protect you from mistakes:

postgres=# DROP FUNCTION dblink_close(text);
ERROR: cannot drop function dblink_close(text) because extension dblink requires it
HINT: You can drop extension dblink instead.

Extensions might have dependencies too. The cube and earthdistance contrib extensions provide a good example, since the latter depends on the former:

postgres=# CREATE EXTENSION earthdistance;
ERROR: required extension "cube" is not installed
postgres=# CREATE EXTENSION cube;
CREATE EXTENSION
postgres=# CREATE EXTENSION earthdistance;
CREATE EXTENSION

As you can reasonably expect, dependencies are taken into account when dropping objects, just like for other objects:

postgres=# DROP EXTENSION cube;
ERROR: cannot drop extension cube because other objects depend on it
DETAIL: extension earthdistance depends on extension cube
HINT: Use DROP ... CASCADE to drop the dependent objects too.
postgres=# DROP EXTENSION cube CASCADE;
NOTICE: drop cascades to extension earthdistance
DROP EXTENSION

How it works…

The pg_available_extensions system view shows one row for each extension control file in the SHAREDIR/extension directory (see the Using an installed module recipe). The pg_extension catalog table records only the extensions that have actually been created. The psql command-line utility provides the \dx meta-command to examine extensions. It supports an optional plus sign (+) to control verbosity and an optional pattern for the extension name to restrict its range. Consider the following command:

\dx+ db*

This will list all extensions whose name starts with db, together with all their objects. The CREATE EXTENSION command creates all objects belonging to a given extension, and then records the dependency of each object on the extension in pg_depend. That's how PostgreSQL can ensure that you cannot drop one such object without dropping its extension. The extension control file admits an optional line, requires, that names one or more extensions on which the current one depends. The implementation of dependencies is still quite simple. For instance, there is no way to specify a dependency on a specific version number of other extensions, and there is no command that installs one extension and all its prerequisites. As a general PostgreSQL rule, the CASCADE keyword tells the DROP command to delete all the objects that depend on cube, which is the earthdistance extension in this example.

There's more…

Another system view, pg_available_extension_versions, shows all the versions available for each extension.
It can be valuable when there are multiple versions of the same extension available at the same time, for example, when making preparations for an extension upgrade. When a more recent version of an already installed extension becomes available to the database server, for instance because of a distribution upgrade that installs updated package files, the superuser can perform an upgrade by issuing the following command:

ALTER EXTENSION myext UPDATE TO '1.1';

This assumes that the author of the extension taught it how to perform the upgrade. Starting from version 9.6, the CASCADE option is also accepted by the CREATE EXTENSION syntax, with the meaning of "issue CREATE EXTENSION recursively to cover all dependencies". So, instead of creating extension cube before creating extension earthdistance, you could have just issued the following command:

postgres=# CREATE EXTENSION earthdistance CASCADE;
NOTICE: installing required extension "cube"
CREATE EXTENSION

Remember that CREATE EXTENSION … CASCADE will only work if all the extensions it tries to install have already been placed in the appropriate location.

Summary

In this article, we covered reading the manual (RTFM), planning a new database, changing parameters in your programs, finding the current configuration settings and the parameters at nondefault values, updating the parameter file, setting parameters for particular groups of users, the basic server configuration checklist, and adding, using, and managing installed modules and extensions.

Resources for Article:

Further resources on this subject:

PostgreSQL in Action [article]

PostgreSQL Cookbook - High Availability and Replication [article]

Running a PostgreSQL Database Server [article]

What is D3.js?

Packt
08 Mar 2017
13 min read
In this article by Ændrew H. Rininsland, the author of the book Learning D3.JS 4.x Data Visualization, we'll see what is new in D3 v4 and get started with Node and Git on the command line. (For more resources related to this topic, see here.) D3 (Data-Driven Documents), developed by Mike Bostock and the D3 community since 2011, is the successor to Bostock's earlier Protovis library. It allows pixel-perfect rendering of data by abstracting the calculation of things such as scales and axes into an easy-to-use domain-specific language (DSL), and uses idioms that should be immediately familiar to anyone with experience of using the popular jQuery JavaScript library. Much like jQuery, in D3, you operate on elements by selecting them and then manipulating via a chain of modifier functions. Especially within the context of data visualization, this declarative approach makes using it easier and more enjoyable than a lot of other tools out there. The official website, https://d3js.org/, features many great examples that show off the power of D3, but understanding them is tricky to start with. After finishing this book, you should be able to understand D3 well enough to figure out the examples, tweaking them to fit your needs. If you want to follow the development of D3 more closely, check out the source code hosted on GitHub at https://github.com/d3. The fine-grained control and its elegance make D3 one of the most powerful open source visualization libraries out there. This also means that it's not very suitable for simple jobs such as drawing a line chart or two-in that case you might want to use a library designed for charting. Many use D3 internally anyway. For a massive list, visit https://github.com/sorrycc/awesome-javascript#data-visualization. D3 is ultimately based around functional programming principles, which is currently experience a renaissance in the JavaScript community. This book really isn't about functional programming, but a lot of what we'll be doing will seem really familiar if you've ever used functional programming principles before. What happened to all the classes?! The second edition of this book contained quite a number of examples using the new class feature that is new in ES2015. The revised examples in this edition all use factory functions instead, and the class keyword never appears. Why is this, exactly? ES2015 classes are essentially just syntactic sugaring for factory functions. By this I mean that they ultimately compile down to that anyway. While classes can provide a certain level of organization to a complex piece of code, they ultimately hide what is going on underneath it all. Not only that, using OO paradigms like classes are effectively avoiding one of the most powerful and elegant aspects of JavaScript as a language, which is its focus on first-class functions and objects. Your code will be simpler and more elegant using functional paradigms than OO, and you'll find it less difficult to read examples in the D3 community, which almost never use classes. There are many, much more comprehensive arguments against using classes than I'm able to make here. For one of the best, please read Eric Elliott's excellent "The Two Pillars of JavaScript" pieces, at medium.com/javascript-scene/the-two-pillars-of-javascript-ee6f3281e7f3. What's new in D3 v4? One of the key changes to D3 since the last edition of this book is the release of version 4. Among its many changes, the most significant is a complete overhaul of the D3 namespace. 
This means that none of the examples in this book will work with D3 3.x, and the examples from the last book will not work with D3 4.x. This is quite possibly the cruelest thing Mr. Bostock could ever do to educational authors such as myself (I am totally joking here). Kidding aside, it also means many of the "block" examples in the D3 community are out-of-date and may appear rather odd if this book is your first encounter with the library. For this reason, it is very important to note the version of D3 an example uses - if it uses 3.x, it might be worth searching for a 4.x example just to prevent this cognitive dissonance. Related to this is how D3 has been broken up from a single library into many smaller libraries. There are two approaches you can take: you can use D3 as a single library in much the same way as version 3, or you can selectively use individual components of D3 in your project. This book takes the latter route, even if it does take a bit more effort - the benefit is primarily in that you'll have a better idea of how D3 is organized as a library and it reduces the size of the final bundle people who view your graphics will have to download. What's ES2017? One of the main changes to this book since the first edition is the emphasis on modern JavaScript; in this case, ES2017. Formerly known as ES6 (Harmony), it pushes the JavaScript language's features forward significantly, allowing for new usage patterns that simplify code readability and increase expressiveness. If you've written JavaScript before and the examples in this article look pretty confusing, it means you're probably familiar with the older, more common ES5 syntax. But don't sweat! It really doesn't take too long to get the hang of the new syntax, and I will try to explain the new language features as we encounter them. Although it might seem a somewhat steep learning curve at the start, by the end, you'll have improved your ability to write code quite substantially and will be on the cutting edge of contemporary JavaScript development. For a really good rundown of all the new toys you have with ES2016, check out this nice guide by the folks at Babel.js, which we will use extensively throughout this book: https://babeljs.io/docs/learn-es2015/. Before I go any further, let me clear some confusion about what ES2017 actually is. Initially, the ECMAScript (or ES for short) standards were incremented by cardinal numbers, for instance, ES4, ES5, ES6, and ES7. However, with ES6, they changed this so that a new standard is released every year in order to keep pace with modern development trends, and thus we refer to the year (2017) now. The big release was ES2015, which more or less maps to ES6. ES2016 was ratified in June 2016, and builds on the previous year's standard, while adding a few fixes and two new features. ES2017 is currently in the draft stage, which means proposals for new features are being considered and developed until it is ratified sometime in 2017. As a result of this book being written while these features are in draft, they may not actually make it into ES2017 and thus need to wait until a later standard to be officially added to the language. You don't really need to worry about any of this, however, because we use Babel.js to transpile everything down to ES5 anyway, so it runs the same in Node.js and in the browser. 
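To give a flavour of the syntax, here is a tiny sketch (not taken from the book's repository) that uses a few ES2015 features, namely const, arrow functions, default and destructured parameters, and template literals, to build the kind of factory function discussed earlier:

// A chart "factory": state lives in the closure instead of on a class instance
const chartFactory = ({ width = 400, height = 200 } = {}) => {
  let data = [];

  const chart = {
    data(newData) {
      data = newData;
      return chart; // returning the object allows chaining, much like D3's own APIs
    },
    describe() {
      return `A ${width}x${height} chart with ${data.length} data points`;
    },
  };

  return chart;
};

const chart = chartFactory({ width: 600 });
console.log(chart.data([1, 2, 3]).describe()); // "A 600x200 chart with 3 data points"

None of this requires D3; it is only meant to show the style of JavaScript the examples are written in.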
I try to refer to the relevant spec where a feature is added when I introduce it for the sake of accuracy (for instance, modules are an ES2015 feature), but when I refer to JavaScript, I mean all modern JavaScript, regardless of which ECMAScript spec it originated in.

Getting started with Node and Git on the command line

I will try not to be too opinionated in this book about which editor or operating system you should use to work through it (though I am using Atom on Mac OS X), but you are going to need a few prerequisites to start. The first is Node.js. Node is widely used for web development nowadays, and it's actually just JavaScript that can be run on the command line. Later on in this book, I'll show you how to write a server application in Node, but for now, let's just concentrate on getting it and npm (the brilliant and amazing package manager that Node uses) installed. If you're on Windows or Mac OS X without Homebrew, use the installer at https://nodejs.org/en/. If you're on Mac OS X and are using Homebrew, I would recommend installing "n" instead, which allows you to easily switch between versions of Node:

$ brew install n
$ n latest

Regardless of how you do it, once you finish, verify by running the following lines:

$ node --version
$ npm --version

If it displays the versions of node and npm, it means you're good to go. I'm using 6.5.0 and 3.10.3, respectively, though yours might be slightly different; the key is making sure node is at least version 6.0.0. If it says something similar to Command not found, double-check whether you've installed everything correctly, and verify that Node.js is in your $PATH environment variable. In the last edition of this book, we did a bunch of annoying stuff with Webpack and Babel and it was a bit too configuration-heavy to adequately explain. This time around we're using the lovely jspm for everything, which handles all the finicky annoying stuff for us. Install it now, using npm:

npm install -g jspm@beta jspm-server

This installs the most up-to-date beta version of jspm and the jspm development server. We don't need Webpack this time around because Rollup (which is used to bundle D3 itself) is used to bundle our projects, and jspm handles our Babel config for us. How helpful! Next, you'll want to clone the book's repository from GitHub. Change to your project directory and type this:

$ git clone https://github.com/aendrew/learning-d3-v4
$ cd learning-d3-v4

This will clone the development environment and all the samples into the learning-d3-v4/ directory, as well as switch you into it. Another option is to fork the repository on GitHub and then clone your fork instead of mine as was just shown. This will allow you to easily publish your work on the cloud, enabling you to more easily seek support, display finished projects on GitHub Pages, and even submit suggestions and amendments to the parent project. This will help us improve this book for future editions. To do this, fork aendrew/learning-d3-v4 by clicking the "fork" button on GitHub, and replace aendrew in the preceding code snippet with your GitHub username. To switch between branches, type the following command:

$ git checkout <branch name>

Replace <branch name> with the name of the branch you want to switch to. Stay at master for now, though. To get back to it, type this line:

$ git stash save && git checkout master

The master branch is where you'll do a lot of your coding as you work through this book.
It includes a prebuilt config.js file (used by jspm to manage dependencies), which we'll use to aid our development over the course of this book. We still need to install our dependencies, so let's do that now: $ npm install All of the source code that you'll be working on is in the lib/ folder. You'll notice it contains a just a main.js file; almost always, we'll be working in main.js, as index.html is just a minimal container to display our work in. This is it in its entirety, and it's the last time we'll look at any HTML in this book: <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>Learning D3</title> </head> <body> <script src="jspm_packages/system.js"></script> <script src="config.js"></script> <script> System.import('lib/main.js'); </script> </body> </html> There's also an empty stylesheet in styles/index.css, which we'll add to in a bit. To get things rolling, start the development server by typing the following line: $ npm start This starts up the jspm development server, which will transform our new-fangled ES2017 JavaScript into backwards-compatible ES5, which can easily be loaded by most browsers. Instead of loading in a compiled bundle, we use SystemJS directly and load in main.js. When we're ready for production, we'll use jspm bundle to create an optimized JS payload. Now point Chrome (or whatever, I'm not fussy - so long as it's not Internet Explorer!) to localhost:8080 and fire up the developer console ( Ctrl +  Shift + J for Linux and Windows and option + command + J for Mac). You should see a blank website and a blank JavaScript console with a Command Prompt waiting for some code: A quick Chrome Developer Tools primer Chrome Developer Tools are indispensable to web development. Most modern browsers have something similar, but to keep this book shorter, we'll stick to just Chrome here for the sake of simplicity. Feel free to use a different browser. Firefox's Developer Edition is particularly nice, and - yeah yeah, I hear you guys at the back; Opera is good too! We are mostly going to use the Elements and Console tabs, Elements to inspect the DOM and Console to play with JavaScript code and look for any problems. The other six tabs come in handy for large projects: The Network tab will let you know how long files are taking to load and help you inspect the Ajax requests. The Profiles tab will help you profile JavaScript for performance. The Resources tab is good for inspecting client-side data. Timeline and Audits are useful when you have a global variable that is leaking memory and you're trying to work out exactly why your library is suddenly causing Chrome to use 500 MB of RAM. While I've used these in D3 development, they're probably more useful when building large web applications with frameworks such as React and Angular. The main one you want to focus on, however, is Sources, which shows all the source code files that have been pulled in by the webpage. Not only is this useful in determining whether your code is actually loading, it contains a fully functional JavaScript debugger, which few mortals dare to use. While explaining how to debug code is kind of boring and not at the level of this article, learning to use breakpoints instead of perpetually using console.log to figure out what your code is doing is a skill that will take you far in the years to come. 
For a good overview, visit https://developers.google.com/web/tools/chrome-devtools/debug/breakpoints/step-code?hl=en Most of what you'll do with Developer Tools, however, is look at the CSS inspector at the right-hand side of the Elements tab. It can tell you what CSS rules are impacting the styling of an element, which is very good for hunting rogue rules that are messing things up. You can also edit the CSS and immediately see the results, as follows: Summary In this article, you learned what D3 is and took a glance at the core philosophy behind how it works. You also set up your computer for prototyping of ideas and to play with visualizations. Resources for Article: Further resources on this subject: Learning D3.js Mapping [article] Integrating a D3.js visualization into a simple AngularJS application [article] Simple graphs with d3.js [article]

Data Pipelines

Packt
03 Mar 2017
17 min read
In this article by Andrew Morgan, Antoine Amend, Matthew Hallett, and David George, the authors of the book Mastering Spark for Data Science, readers will learn how to construct a content register and use it to track all input loaded to the system, and to deliver metrics on ingestion pipelines, so that these flows can be reliably run as an automated, lights-out process.

In this article we will cover the following topics:

Welcome the GDELT Dataset

Data Pipelines

Universal Ingestion Framework

Real-time monitoring for new data

Receiving Streaming Data via Kafka

Registering new content and vaulting for tracking purposes

Visualization of content metrics in Kibana - to monitor ingestion processes & data health

(For more resources related to this topic, see here.)

Data Pipelines

Even with the most basic of analytics, we always require some data. In fact, finding the right data is probably among the hardest problems to solve in data science (but that's a whole topic for another book!). We have already seen that the way in which we obtain our data can be as simple or complicated as is needed. In practice, we can break this decision into two distinct areas: ad-hoc and scheduled.

Ad-hoc data acquisition is the most common method during prototyping and small-scale analytics, as it usually doesn't require any additional software to implement - the user requires some data and simply downloads it from source as and when required. This method is often a matter of clicking on a web link and storing the data somewhere convenient, although the data may still need to be versioned and secure.

Scheduled data acquisition is used in more controlled environments for large-scale and production analytics; there is also an excellent case for ingesting a dataset into a data lake for possible future use. With Internet of Things (IoT) on the increase, huge volumes of data are being produced, and in many cases, if the data is not ingested now, it is lost forever. Much of this data may not have an immediate or apparent use today, but could do in the future; so the mindset is to gather all of the data in case it is needed and delete it later when sure it is not.

It's clear we need a flexible approach to data acquisition that supports a variety of procurement options.

Universal Ingestion Framework

There are many ways to approach data acquisition, ranging from home-grown bash scripts through to high-end commercial tools. The aim of this section is to introduce a highly flexible framework that we can use for small-scale data ingest, and then grow as our requirements change - all the way through to a full, corporately managed workflow if needed - and that framework will be built using Apache NiFi. NiFi enables us to build large-scale integrated data pipelines that move data around the planet. In addition, it's also incredibly flexible and easy to build simple pipelines - usually quicker even than using Bash or any other traditional scripting method.

If an ad-hoc approach is taken to source the same dataset on a number of occasions, then some serious thought should be given as to whether it falls into the scheduled category, or at least whether a more robust storage and versioning setup should be introduced.
We have chosen to use Apache NiFi as it offers a solution that provides the ability to create many, varied complexity pipelines that can be scaled to truly Big Data and IoT levels, and it also provides a great drag & drop interface (using what’s known as flow-based programming[1]). With patterns, templates and modules for workflow production, it automatically takes care of many of the complex features that traditionally plague developers such as multi-threading, connection management and scalable processing. For our purposes it will enable us to quickly build simple pipelines for prototyping, and scale these to full production where required. It’s pretty well documented and easy to get running https://nifi.apache.org/download.html, it runs in a browser and looks like this: https://en.wikipedia.org/wiki/Flow-based_programming We leave the installation of NiFi as an exercise for the reader - which we would encourage you to do - as we will be using it in the following section. Introducing the GDELT News Stream Hopefully, we have NiFi up and running now and can start to ingest some data. So let’s start with some global news media data from GDELT. Here’s our brief, taken from the GDELT website http://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/: “Within 15 minutes of GDELT monitoring a news report breaking anywhere the world, it has translated it, processed it to identify all events, counts, quotes, people, organizations, locations, themes, emotions, relevant imagery, video, and embedded social media posts, placed it into global context, and made all of this available via a live open metadata firehose enabling open research on the planet itself. [As] the single largest deployment in the world of sentiment analysis, we hope that by bringing together so many emotional and thematic dimensions crossing so many languages and disciplines, and applying all of it in realtime to breaking news from across the planet, that this will spur an entirely new era in how we think about emotion and the ways in which it can help us better understand how we contextualize, interpret, respond to, and understand global events.” In order to start consuming this open data, we’ll need to hook into that metadata firehose and ingest the news streams onto our platform.  How do we do this?  Let’s start by finding out what data is available. Discover GDELT Real-time GDELT publish a list of the latest files on their website - this list is updated every 15 minutes. In NiFi, we can setup a dataflow that will poll the GDELT website, source a file from this list and save it to HDFS so we can use it later. Inside the NiFi dataflow designer, create a HTTP connector by dragging a processor onto the canvas and selecting GetHTTP. To configure this processor, you’ll need to enter the URL of the file list as: http://data.gdeltproject.org/gdeltv2/lastupdate.txt And also provide a temporary filename for the file list you will download. In the example below, we’ve used the NiFi’s expression language to generate a universally unique key so that files are not overwritten (UUID()). It’s worth noting that with this type of processor (GetHTTP), NiFi supports a number of scheduling and timing options for the polling and retrieval. For now, we’re just going to use the default options and let NiFi manage the polling intervals for us. An example of latest file list from GDELT is shown below. Next, we will parse the URL of the GKG news stream so that we can fetch it in a moment. 
Create a Regular Expression parser by dragging a processor onto the canvas and selecting ExtractText. Now position the new processor underneath the existing one and drag a line from the top processor to the bottom one. Finish by selecting the success relationship in the connection dialog that pops up. This is shown in the example below. Next, let's configure the ExtractText processor to use a regular expression that matches only the relevant text of the file list, for example:

([^ ]*gkg.csv.*)

From this regular expression, NiFi will create a new property (in this case, called url) associated with the flow design, which will take on a new value as each particular instance goes through the flow. It can even be configured to support multiple threads. Again, this example is shown below. It's worth noting here that while this is a fairly specific example, the technique is deliberately general purpose and can be used in many situations.

Our First GDELT Feed

Now that we have the URL of the GKG feed, we fetch it by configuring an InvokeHTTP processor to use the url property we previously created as its remote endpoint, and dragging the line as before. All that remains is to decompress the zipped content with an UnpackContent processor (using the basic zip format) and save it to HDFS using a PutHDFS processor, like so:

Improving with Publish and Subscribe

So far, this flow looks very "point-to-point", meaning that if we were to introduce a new consumer of data, for example, a Spark Streaming job, the flow must be changed. For example, the flow design might have to change to look like this: If we add yet another, the flow must change again. In fact, each time we add a new consumer, the flow gets a little more complicated, particularly when all the error handling is added. This is clearly not always desirable, as introducing or removing consumers (or producers) of data might be something we want to do often, even frequently. Plus, it's also a good idea to try to keep your flows as simple and reusable as possible. Therefore, for a more flexible pattern, instead of writing directly to HDFS, we can publish to Apache Kafka. This gives us the ability to add and remove consumers at any time without changing the data ingestion pipeline. We can also still write to HDFS from Kafka if needed, possibly even by designing a separate NiFi flow, or connect directly to Kafka using Spark Streaming. To do this, we create a Kafka writer by dragging a processor onto the canvas and selecting PutKafka. We now have a simple flow that continuously polls for an available file list, routinely retrieving the latest copy of a new stream over the web as it becomes available, decompressing the content and streaming it record-by-record into Kafka, a durable, fault-tolerant, distributed message queue, for processing by Spark Streaming or storage in HDFS. And what's more, without writing a single line of bash!

Content Registry

We have seen in this article that data ingestion is an area that is often overlooked, and that its importance cannot be overstated. At this point we have a pipeline that enables us to ingest data from a source, schedule that ingest, and direct the data to our repository of choice. But the story does not end there. Now we have the data, we need to fulfil our data management responsibilities. Enter the content registry. We're going to build an index of metadata related to that data we have ingested.
The data itself will still be directed to storage (HDFS, in our example) but, in addition, we will store metadata about the data, so that we can track what we've received and understand basic information about it, such as when we received it, where it came from, how big it is, what type it is, and so on.

Choices and More Choices

The choice of which technology we use to store this metadata is, as we have seen, one based upon knowledge and experience. For metadata indexing, we will require at least the following attributes:

Easily searchable
Scalable
Parallel write ability
Redundancy

There are many ways to meet these requirements; for example, we could write the metadata to Parquet, store it in HDFS, and search it using Spark SQL. However, here we will use Elasticsearch as it meets the requirements a little better, most notably because it facilitates low latency queries of our metadata over a REST API - very useful for creating dashboards. In fact, Elasticsearch has the advantage of integrating directly with Kibana, meaning it can quickly produce rich visualizations of our content registry. For this reason, we will proceed with Elasticsearch in mind.

Going with the Flow

Using our current NiFi pipeline flow, let's fork the output from "Fetch GKG files from URL" to add an additional set of steps to allow us to capture and store this metadata in Elasticsearch. These are:

Replace the flow content with our metadata model
Capture the metadata
Store directly in Elasticsearch

Here's what this looks like in NiFi:

Metadata Model

So, the first step here is to define our metadata model. And there are many areas we could consider, but let's select a set that helps tackle a few key points from earlier discussions. This will provide a good basis upon which further data can be added in the future, if required. So, let's keep it simple and use the following three attributes:

File size
Date ingested
File name

These will provide basic registration of received files. Next, inside the NiFi flow, we'll need to replace the actual data content with this new metadata model. An easy way to do this is to create a JSON template file from our model. We'll save it to local disk and use it inside a FetchFile processor to replace the flow's content with this skeleton object. This template will look something like:

{
  "FileSize": SIZE,
  "FileName": "FILENAME",
  "IngestedDate": "DATE"
}

Note the use of placeholder names (SIZE, FILENAME, DATE) in place of the attribute values. These will be substituted, one by one, by a sequence of ReplaceText processors that swap the placeholder names for an appropriate flow attribute using regular expressions provided by the NiFi Expression Language, for example DATE becomes ${now()}. The last step is to output the new metadata payload to Elasticsearch. Once again, NiFi comes ready with a processor for this: the PutElasticsearch processor. An example metadata entry in Elasticsearch:

{
  "_index": "gkg",
  "_type": "files",
  "_id": "AVZHCvGIV6x-JwdgvCzW",
  "_score": 1,
  "_source": {
    "FileSize": 11279827,
    "FileName": "20150218233000.gkg.csv.zip",
    "IngestedDate": "2016-08-01T17:43:00+01:00"
  }
}

Now that we have added the ability to collect and interrogate metadata, we have access to more statistics that can be used for analysis. This includes:

Time-based analysis, for example, file sizes over time
Loss of data, for example, are there data "holes" in the timeline?

If there is a particular analytic that is required, the NiFi metadata component can be adjusted to provide the relevant data points. A short illustration of querying these statistics directly follows below.
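To make the "file sizes over time" idea concrete, here is a minimal sketch of asking the metadata index that question directly over Elasticsearch's REST API from Python, outside NiFi and Kibana. It assumes Elasticsearch is reachable on localhost:9200, that the gkg index, files type, and field names match those shown above, and that your Elasticsearch version still accepts the classic interval parameter for date histograms; treat it as an illustration rather than production code.

import requests

# Aggregate total ingested bytes per day across the registered files
query = {
    "size": 0,
    "aggs": {
        "sizes_over_time": {
            "date_histogram": {"field": "IngestedDate", "interval": "day"},
            "aggs": {"total_bytes": {"sum": {"field": "FileSize"}}}
        }
    }
}

response = requests.post("http://localhost:9200/gkg/files/_search", json=query)
response.raise_for_status()

for bucket in response.json()["aggregations"]["sizes_over_time"]["buckets"]:
    # key_as_string is the day; total_bytes.value is the summed FileSize
    print(bucket["key_as_string"], int(bucket["total_bytes"]["value"]))

A day with no buckets, or a bucket with an unexpectedly small total, is a quick hint that there may be a "hole" in the timeline worth investigating.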
Indeed, an analytic could be built to look at historical data and update the index accordingly if the metadata does not exist in current data.

Kibana Dashboard

We have mentioned Kibana a number of times in this article. Now that we have an index of metadata in Elasticsearch, we can use the tool to visualize some analytics. The purpose of this brief section is to demonstrate that we can immediately start to model and visualize our data. In this simple example we have completed the following steps:

Added the Elasticsearch index for our GDELT metadata to the "Settings" tab
Selected "file size" under the "Discover" tab
Selected Visualize for "file size"
Changed the Aggregation field to "Range"
Entered values for the ranges

The resultant graph displays the file size distribution. From here we are free to create new visualizations or even a fully featured dashboard that can be used to monitor the status of our file ingest. By increasing the variety of metadata written to Elasticsearch from NiFi, we can make more fields available in Kibana and even start our data science journey right here with some ingest-based actionable insights. Now that we have a fully functioning data pipeline delivering us real-time feeds of data, how do we ensure data quality of the payload we are receiving? Let's take a look at the options.

Quality Assurance

With an initial data ingestion capability implemented, and data streaming onto your platform, you will need to decide how much quality assurance is required at the front door. It's perfectly viable to start with no initial quality controls and build them up over time (retrospectively scanning historical data as time and resources allow). However, it may be prudent to install a basic level of verification to begin with; for example, basic checks such as file integrity, parity checking, completeness, checksums, type checking, field counting, overdue files, security field pre-population, denormalization, and so on. You should take care that your up-front checks do not take too long. Depending on the intensity of your examinations and the size of your data, it's not uncommon to encounter a situation where there is not enough time to perform all processing before the next dataset arrives. You will always need to monitor your cluster resources and calculate the most efficient use of time. Here are some examples of the type of rough capacity planning calculation you can perform:

Example 1: Basic Quality Checking, No Contending Users

Data is ingested every 15 minutes and takes 1 minute to pull from the source
Quality checking (integrity, field count, field pre-population) takes 4 minutes
There are no other users on the compute cluster
There are 10 minutes of resources available for other tasks

As there are no other users on the cluster, this is satisfactory - no action needs to be taken.

Example 2: Advanced Quality Checking, No Contending Users

Data is ingested every 15 minutes and takes 1 minute to pull from the source
Quality checking (integrity, field count, field pre-population, denormalization, sub dataset building) takes 13 minutes
There are no other users on the compute cluster
There is only 1 minute of resource available for other tasks
We probably need to consider one or more of the following:

Configuring a resource scheduling policy
Reducing the amount of data ingested
Reducing the amount of processing we undertake
Adding additional compute resources to the cluster

Example 3: Basic Quality Checking, 50% Utility Due to Contending Users

Data is ingested every 15 minutes and takes 1 minute to pull from the source
Quality checking (integrity, field count, field pre-population) takes 4 minutes (at 100% utility)
There are other users on the compute cluster
There are 6 minutes of resources available for other tasks (15 - 1 - (4 * (100 / 50)))

Since there are other users, there is a danger that, at least some of the time, we will not be able to complete our processing and a backlog of jobs will occur. When you run into timing issues, you have a number of options available to you in order to circumvent any backlog:

Negotiating sole use of the resources at certain times
Configuring a resource scheduling policy, including:
YARN Fair Scheduler: allows you to define queues with differing priorities and target your Spark jobs by setting the spark.yarn.queue property on start-up so your job always takes precedence
Dynamic Resource Allocation: allows concurrently running jobs to automatically scale to match their utilization
Spark Scheduler Pool: allows you to define queues when sharing a SparkContext using a multithreading model, and target your Spark jobs by setting the spark.scheduler.pool property per execution thread so your thread takes precedence
Running processing jobs overnight when the cluster is quiet

In any case, you will eventually get a good idea of how the various parts of your jobs perform and will then be in a position to calculate what changes could be made to improve efficiency. There's always the option of throwing more resources at the problem, especially when using a cloud provider, but we would certainly encourage the intelligent use of existing resources - this is far more scalable, cheaper, and builds data expertise.

Summary

In this article we walked through the full setup of an Apache NiFi GDELT ingest pipeline, complete with metadata forks and a brief introduction to visualizing the resultant data. This section is particularly important as GDELT is used extensively throughout the book and the NiFi method is a highly effective way to source data in a scalable and modular way.

Resources for Article: Further resources on this subject: Integration with Continuous Delivery [article] Amazon Web Services [article] AWS Fundamentals [article]

The NumPy array object

Packt
03 Mar 2017
18 min read
In this article by Armando Fandango, author of the book Python Data Analysis - Second Edition, we discuss how NumPy provides a multidimensional array object called ndarray. NumPy arrays are typed arrays of fixed size. Python lists are heterogeneous and thus elements of a list may contain any object type, while NumPy arrays are homogeneous and can contain objects of only one type. An ndarray consists of two parts, which are as follows:

The actual data that is stored in a contiguous block of memory
The metadata describing the actual data

Since the actual data is stored in a contiguous block of memory, loading a large dataset as an ndarray depends on the availability of a large enough contiguous block of memory. Most of the array methods and functions in NumPy leave the actual data unaffected and only modify the metadata. Actually, we made a one-dimensional array that held a set of numbers. The ndarray can have more than a single dimension.

(For more resources related to this topic, see here.)

Advantages of NumPy arrays

The NumPy array is, in general, homogeneous (there is a particular record array type that is heterogeneous)—the items in the array have to be of the same type. The advantage is that if we know that the items in an array are of the same type, it is easy to ascertain the storage size needed for the array. NumPy arrays can execute vectorized operations, processing a complete array, in contrast to Python lists, where you usually have to loop through the list and execute the operation on each element. NumPy arrays are indexed from 0, just like lists in Python. NumPy utilizes an optimized C API to make the array operations particularly quick. We will make an array with the arange() subroutine again. You will see snippets from Jupyter Notebook sessions where NumPy is already imported with the instruction import numpy as np. Here's how to get the data type of an array:

In: a = np.arange(5)
In: a.dtype
Out: dtype('int64')

The data type of the array a is int64 (at least on my computer), but you may get int32 as the output if you are using 32-bit Python. In both cases, we are dealing with integers (64 bit or 32 bit). Besides the data type of an array, it is crucial to know its shape. A vector is commonly used in mathematics, but most of the time we need higher-dimensional objects. Let's find out the shape of the vector we produced a few minutes ago:

In: a
Out: array([0, 1, 2, 3, 4])
In: a.shape
Out: (5,)

As you can see, the vector has five components with values ranging from 0 to 4. The shape property of the array is a tuple; in this instance, a tuple of 1 element, which holds the length in each dimension.

Creating a multidimensional array

Now that we know how to create a vector, we are set to create a multidimensional NumPy array. After we produce the matrix, we will again need to show its shape, as demonstrated in the following code snippets: Create a multidimensional array as follows:

In: m = np.array([np.arange(2), np.arange(2)])
In: m
Out: array([[0, 1], [0, 1]])

We can show the array shape as follows:

In: m.shape
Out: (2, 2)

We made a 2 x 2 array with the arange() subroutine. The array() function creates an array from an object that you pass to it. The object has to be array-like, for example, a Python list. In the previous example, we passed a list of arrays. The object is the only required parameter of the array() function. NumPy functions tend to have a heap of optional arguments with predefined default options.
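As a quick illustration of those optional arguments, the following short sketch builds the same 2 x 2 array while explicitly overriding the default data type; the chosen dtype here is arbitrary and only for demonstration.

import numpy as np

# Same nested construction as above, but forcing single-precision floats
m = np.array([np.arange(2), np.arange(2)], dtype=np.float32)
print(m)        # [[0. 1.]
                #  [0. 1.]]
print(m.shape)  # (2, 2)
print(m.dtype)  # float32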
Selecting NumPy array elements

From time to time, we will wish to select a specific constituent of an array. We will take a look at how to do this, but to kick off, let's make a 2 x 2 matrix again:

In: a = np.array([[1,2],[3,4]])
In: a
Out: array([[1, 2], [3, 4]])

The matrix was made this time by giving the array() function a list of lists. We will now choose each item of the matrix one at a time, as shown in the following code snippet. Recall that the index numbers begin from 0:

In: a[0,0]
Out: 1
In: a[0,1]
Out: 2
In: a[1,0]
Out: 3
In: a[1,1]
Out: 4

As you can see, choosing elements of an array is fairly simple. For the array a, we just employ the notation a[m,n], where m and n are the indices of the item in the array. Have a look at the following figure for your reference:

NumPy numerical types

Python has an integer type, a float type, and a complex type; nonetheless, this is not sufficient for scientific calculations. In practice, we still demand more data types with varying precisions and, consequently, different storage sizes of the type. For this reason, NumPy has many more data types. The bulk of the NumPy mathematical types ends with a number. This number designates the count of bits related to the type. The following table (adapted from the NumPy user guide) presents an overview of NumPy numerical types:

bool: Boolean (True or False) stored as a bit
inti: Platform integer (normally either int32 or int64)
int8: Byte (-128 to 127)
int16: Integer (-32768 to 32767)
int32: Integer (-2 ** 31 to 2 ** 31 -1)
int64: Integer (-2 ** 63 to 2 ** 63 -1)
uint8: Unsigned integer (0 to 255)
uint16: Unsigned integer (0 to 65535)
uint32: Unsigned integer (0 to 2 ** 32 - 1)
uint64: Unsigned integer (0 to 2 ** 64 - 1)
float16: Half precision float: sign bit, 5 bits exponent, and 10 bits mantissa
float32: Single precision float: sign bit, 8 bits exponent, and 23 bits mantissa
float64 or float: Double precision float: sign bit, 11 bits exponent, and 52 bits mantissa
complex64: Complex number, represented by two 32-bit floats (real and imaginary components)
complex128 or complex: Complex number, represented by two 64-bit floats (real and imaginary components)

For each data type, there exists a matching conversion function:

In: np.float64(42)
Out: 42.0
In: np.int8(42.0)
Out: 42
In: np.bool(42)
Out: True
In: np.bool(0)
Out: False
In: np.bool(42.0)
Out: True
In: np.float(True)
Out: 1.0
In: np.float(False)
Out: 0.0

Many functions have a data type argument, which is frequently optional:

In: np.arange(7, dtype=np.uint16)
Out: array([0, 1, 2, 3, 4, 5, 6], dtype=uint16)

It is important to be aware that you are not allowed to change a complex number into an integer. Attempting to do that sparks off a TypeError:

In: np.int(42.0 + 1.j)
Traceback (most recent call last):
<ipython-input-24-5c1cd108488d> in <module>()
----> 1 np.int(42.0 + 1.j)
TypeError: can't convert complex to int

The same goes for conversion of a complex number into a floating-point number. By the way, the j component is the imaginary coefficient of a complex number. Even so, you can convert a floating-point number to a complex number, for example, complex(1.0). The real and imaginary pieces of a complex number can be pulled out with the real() and imag() functions, respectively.

Data type objects

Data type objects are instances of the numpy.dtype class. Once again, arrays have a data type. To be exact, each element in a NumPy array has the same data type. The data type object can tell you the size of the data in bytes.
The size in bytes is given by the itemsize property of the dtype class:

In: a.dtype.itemsize
Out: 8

Character codes

Character codes are included for backward compatibility with Numeric. Numeric is the predecessor of NumPy. Its use is not recommended, but the code is supplied here because it pops up in various locations. You should use the dtype object instead. The following table lists several different data types and the character codes related to them:

integer: i
Unsigned integer: u
Single precision float: f
Double precision float: d
bool: b
complex: D
string: S
unicode: U
Void: V

Take a look at the following code to produce an array of single precision floats:

In: np.arange(7, dtype='f')
Out: array([ 0., 1., 2., 3., 4., 5., 6.], dtype=float32)

Likewise, the following code creates an array of complex numbers:

In: np.arange(7, dtype='D')
Out: array([ 0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j])

The dtype constructors

We have a variety of means to create data types. Take the case of floating-point data (have a look at dtypeconstructors.py in this book's code bundle): We can use the general Python float, as shown in the following lines of code:

In: np.dtype(float)
Out: dtype('float64')

We can specify a single precision float with a character code:

In: np.dtype('f')
Out: dtype('float32')

We can use a double precision float with a character code:

In: np.dtype('d')
Out: dtype('float64')

We can pass the dtype constructor a two-character code. The first character stands for the type; the second character is a number specifying the number of bytes in the type (the numbers 2, 4, and 8 correspond to floats of 16, 32, and 64 bits, respectively):

In: np.dtype('f8')
Out: dtype('float64')

A (truncated) list of all the full data type codes can be found by applying sctypeDict.keys():

In: np.sctypeDict.keys()
Out: dict_keys(['?', 0, 'byte', 'b', 1, 'ubyte', 'B', 2, 'short', 'h', 3, 'ushort', 'H', 4, 'i', 5, 'uint', 'I', 6, 'intp', 'p', 7, 'uintp', 'P', 8, 'long', 'l', 'L', 'longlong', 'q', 9, 'ulonglong', 'Q', 10, 'half', 'e', 23, 'f', 11, 'double', 'd', 12, 'longdouble', 'g', 13, 'cfloat', 'F', 14, 'cdouble', 'D', 15, 'clongdouble', 'G', 16, 'O', 17, 'S', 18, 'unicode', 'U', 19, 'void', 'V', 20, 'M', 21, 'm', 22, 'bool8', 'Bool', 'b1', 'float16', 'Float16', 'f2', 'float32', 'Float32', 'f4', 'float64', ' Float64', 'f8', 'float128', 'Float128', 'f16', 'complex64', 'Complex32', 'c8', 'complex128', 'Complex64', 'c16', 'complex256', 'Complex128', 'c32', 'object0', 'Object0', 'bytes0', 'Bytes0', 'str0', 'Str0', 'void0', 'Void0', 'datetime64', 'Datetime64', 'M8', 'timedelta64', 'Timedelta64', 'm8', 'int64', 'uint64', 'Int64', 'UInt64', 'i8', 'u8', 'int32', 'uint32', 'Int32', 'UInt32', 'i4', 'u4', 'int16', 'uint16', 'Int16', 'UInt16', 'i2', 'u2', 'int8', 'uint8', 'Int8', 'UInt8', 'i1', 'u1', 'complex_', 'int0', 'uint0', 'single', 'csingle', 'singlecomplex', 'float_', 'intc', 'uintc', 'int_', 'longfloat', 'clongfloat', 'longcomplex', 'bool_', 'unicode_', 'object_', 'bytes_', 'str_', 'string_', 'int', 'float', 'complex', 'bool', 'object', 'str', 'bytes', 'a'])

The dtype attributes

The dtype class has a number of useful properties.
For instance, we can get information about the character code of a data type through the properties of dtype:

In: t = np.dtype('Float64')
In: t.char
Out: 'd'

The type attribute corresponds to the type of object of the array elements:

In: t.type
Out: numpy.float64

The str attribute of dtype gives a string representation of a data type. It begins with a character representing endianness, if appropriate, then a character code, succeeded by a number corresponding to the number of bytes that each array item needs. Endianness, here, entails the way bytes are ordered inside a 32- or 64-bit word. In the big-endian order, the most significant byte is stored first, indicated by >. In the little-endian order, the least significant byte is stored first, indicated by <, as exemplified in the following lines of code:

In: t.str
Out: '<f8'

One-dimensional slicing and indexing

Slicing of one-dimensional NumPy arrays works just like the slicing of standard Python lists. Let's define an array containing the numbers 0, 1, 2, and so on up to and including 8. We can select a part of the array from indexes 3 to 7, which extracts the elements of the array 3 through 6:

In: a = np.arange(9)
In: a[3:7]
Out: array([3, 4, 5, 6])

We can choose elements from indexes 0 to 7 with an increment of 2:

In: a[:7:2]
Out: array([0, 2, 4, 6])

Just as in Python, we can use negative indices and reverse the array:

In: a[::-1]
Out: array([8, 7, 6, 5, 4, 3, 2, 1, 0])

Manipulating array shapes

We have already learned about the reshape() function. Another repeating chore is the flattening of arrays. Flattening in this setting entails transforming a multidimensional array into a one-dimensional array. Let us create an array b that we shall use for practicing further examples:

In: b = np.arange(24).reshape(2,3,4)
In: print(b)
Out: [[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]

We can manipulate array shapes using the following functions:

Ravel: We can accomplish this with the ravel() function as follows:

In: b
Out: array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]])
In: b.ravel()
Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])

Flatten: The appropriately named function, flatten(), does the same as ravel(). However, flatten() always allocates new memory, whereas ravel gives back a view of the array. This means that we can directly manipulate the array as follows:

In: b.flatten()
Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])

Setting the shape with a tuple: Besides the reshape() function, we can also define the shape straightaway with a tuple, which is exhibited as follows:

In: b.shape = (6,4)
In: b
Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]])

As you can understand, the preceding code alters the array immediately. Now, we have a 6 x 4 array.

Transpose: In linear algebra, it is common to transpose matrices. Transposing is a way to transform data. For a two-dimensional table, transposing means that rows become columns and columns become rows.
We can do this too by using the following code:

In: b.transpose()
Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]])

Resize: The resize() method works just like the reshape() method, but modifies the array it operates on:

In: b.resize((2,12))
In: b
Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])

Stacking arrays

Arrays can be stacked horizontally, depth-wise, or vertically. We can use, for this goal, the vstack(), dstack(), hstack(), column_stack(), row_stack(), and concatenate() functions. To start with, let's set up some arrays:

In: a = np.arange(9).reshape(3,3)
In: a
Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
In: b = 2 * a
In: b
Out: array([[ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]])

As mentioned previously, we can stack arrays using the following techniques:

Horizontal stacking: Beginning with horizontal stacking, we will shape a tuple of ndarrays and hand it to the hstack() function to stack the arrays. This is shown as follows:

In: np.hstack((a, b))
Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]])

We can attain the same thing with the concatenate() function, which is shown as follows:

In: np.concatenate((a, b), axis=1)
Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]])

The following diagram depicts horizontal stacking:

Vertical stacking: With vertical stacking, a tuple is formed again. This time it is given to the vstack() function to stack the arrays. This can be seen as follows:

In: np.vstack((a, b))
Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]])

The concatenate() function gives the same outcome with the axis parameter fixed to 0. This is the default value for the axis parameter, as portrayed in the following code:

In: np.concatenate((a, b), axis=0)
Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]])

Refer to the following figure for vertical stacking:

Depth stacking: To boot, there is the depth-wise stacking employing dstack() and a tuple, of course. This entails stacking a list of arrays along the third axis (depth). For example, we could stack 2D arrays of image data on top of each other as follows:

In: np.dstack((a, b))
Out: array([[[ 0, 0], [ 1, 2], [ 2, 4]], [[ 3, 6], [ 4, 8], [ 5, 10]], [[ 6, 12], [ 7, 14], [ 8, 16]]])

Column stacking: The column_stack() function stacks 1D arrays column-wise. This is shown as follows:

In: oned = np.arange(2)
In: oned
Out: array([0, 1])
In: twice_oned = 2 * oned
In: twice_oned
Out: array([0, 2])
In: np.column_stack((oned, twice_oned))
Out: array([[0, 0], [1, 2]])

2D arrays are stacked the way the hstack() function stacks them, as demonstrated in the following lines of code:

In: np.column_stack((a, b))
Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]])
In: np.column_stack((a, b)) == np.hstack((a, b))
Out: array([[ True, True, True, True, True, True], [ True, True, True, True, True, True], [ True, True, True, True, True, True]], dtype=bool)

Yes, you guessed it right! We compared two arrays with the == operator.

Row stacking: NumPy, naturally, also has a function that does row-wise stacking.
It is named row_stack() and, for 1D arrays, it just stacks the arrays in rows into a 2D array:

In: np.row_stack((oned, twice_oned))
Out: array([[0, 1], [0, 2]])

The row_stack() function results for 2D arrays are equal to the vstack() function results:

In: np.row_stack((a, b))
Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]])
In: np.row_stack((a,b)) == np.vstack((a, b))
Out: array([[ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True]], dtype=bool)

Splitting NumPy arrays

Arrays can be split vertically, horizontally, or depth-wise. The functions involved are hsplit(), vsplit(), dsplit(), and split(). We can split arrays either into arrays of the same shape or indicate the location after which the split should happen. Let's look at each of the functions in detail:

Horizontal splitting: The following code splits a 3 x 3 array on its horizontal axis into three parts of the same size and shape (see splitting.py in this book's code bundle):

In: a
Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
In: np.hsplit(a, 3)
Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])]

Compare this with a call of the split() function, with an additional argument, axis=1:

In: np.split(a, 3, axis=1)
Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])]

Vertical splitting: vsplit() splits along the vertical axis:

In: np.vsplit(a, 3)
Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])]

The split() function, with axis=0, also splits along the vertical axis:

In: np.split(a, 3, axis=0)
Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])]

Depth-wise splitting: The dsplit() function, unsurprisingly, splits depth-wise. We will require an array of rank 3 to begin with:

In: c = np.arange(27).reshape(3, 3, 3)
In: c
Out: array([[[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8]], [[ 9, 10, 11], [12, 13, 14], [15, 16, 17]], [[18, 19, 20], [21, 22, 23], [24, 25, 26]]])
In: np.dsplit(c, 3)
Out: [array([[[ 0], [ 3], [ 6]], [[ 9], [12], [15]], [[18], [21], [24]]]), array([[[ 1], [ 4], [ 7]], [[10], [13], [16]], [[19], [22], [25]]]), array([[[ 2], [ 5], [ 8]], [[11], [14], [17]], [[20], [23], [26]]])]

NumPy array attributes

Let's learn more about the NumPy array attributes with the help of an example. Let us create an array b that we shall use for practicing further examples:

In: b = np.arange(24).reshape(2, 12)
In: b
Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])

Besides the shape and dtype attributes, ndarray has a number of other properties, as shown in the following list:

ndim gives the number of dimensions, as shown in the following code snippet:

In: b.ndim
Out: 2

size holds the count of elements. This is shown as follows:

In: b.size
Out: 24

itemsize returns the count of bytes for each element in the array, as shown in the following code snippet:

In: b.itemsize
Out: 8

If you require the full count of bytes the array needs, you can have a look at nbytes.
This is just a product of the itemsize and size properties:

In: b.nbytes
Out: 192
In: b.size * b.itemsize
Out: 192

The T property has the same result as the transpose() function, which is shown as follows:

In: b.resize(6,4)
In: b
Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]])
In: b.T
Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]])

If the array has a rank of less than 2, we will just get a view of the array:

In: b.ndim
Out: 1
In: b.T
Out: array([0, 1, 2, 3, 4])

Complex numbers in NumPy are represented by j. For instance, we can produce an array with complex numbers as follows:

In: b = np.array([1.j + 1, 2.j + 3])
In: b
Out: array([ 1.+1.j, 3.+2.j])

The real property returns to us the real part of the array, or the array itself if it only holds real numbers:

In: b.real
Out: array([ 1., 3.])

The imag property holds the imaginary part of the array:

In: b.imag
Out: array([ 1., 2.])

If the array holds complex numbers, then the data type will automatically be complex as well:

In: b.dtype
Out: dtype('complex128')
In: b.dtype.str
Out: '<c16'

The flat property gives back a numpy.flatiter object. This is the only means to get a flatiter object; we do not have access to a flatiter constructor. The flat iterator enables us to loop through an array as if it were a flat array, as shown in the following code snippet:

In: b = np.arange(4).reshape(2,2)
In: b
Out: array([[0, 1], [2, 3]])
In: f = b.flat
In: f
Out: <numpy.flatiter object at 0x103013e00>
In: for item in f: print(item)
Out: 0 1 2 3

It is possible to straightaway obtain an element with the flatiter object:

In: b.flat[2]
Out: 2

Also, you can obtain multiple elements as follows:

In: b.flat[[1,3]]
Out: array([1, 3])

The flat property can be set. Setting the value of the flat property leads to overwriting the values of the entire array:

In: b.flat = 7
In: b
Out: array([[7, 7], [7, 7]])

We can also overwrite the values of selected elements as follows:

In: b.flat[[1,3]] = 1
In: b
Out: array([[7, 1], [7, 1]])

The next diagram illustrates various properties of ndarray:

Converting arrays

We can convert a NumPy array to a Python list with the tolist() function. The following is a brief explanation:

Convert to a list:

In: b
Out: array([ 1.+1.j, 3.+2.j])
In: b.tolist()
Out: [(1+1j), (3+2j)]

The astype() function transforms the array to an array of the specified data type:

In: b
Out: array([ 1.+1.j, 3.+2.j])
In: b.astype(int)
/usr/local/lib/python3.5/site-packages/ipykernel/__main__.py:1: ComplexWarning: Casting complex values to real discards the imaginary part …
Out: array([1, 3])
In: b.astype('complex')
Out: array([ 1.+1.j, 3.+2.j])

We are dropping off the imaginary part when casting from the complex type to int. The astype() function takes the name of a data type as a string too. The preceding code won't display a warning this time because we used the right data type.

Summary

In this article, we found out a heap about the NumPy basics: data types and arrays. Arrays have various properties that describe them. You learned that one of these properties is the data type, which, in NumPy, is represented by a full-fledged object. NumPy arrays can be sliced and indexed in an effective way, compared to standard Python lists. NumPy arrays have the extra ability to work with multiple dimensions. The shape of an array can be modified in multiple ways, such as stacking, resizing, reshaping, and splitting.
Resources for Article: Further resources on this subject: Big Data Analytics [article] Python Data Science Up and Running [article] R and its Diverse Possibilities [article]

Understanding Spark RDD

Packt
01 Mar 2017
17 min read
In this article by Asif Abbasi, author of the book Learning Apache Spark 2.0, we will understand the Spark RDD. Along with that we will learn how to construct RDDs, operations on RDDs, passing functions to Spark in Scala, Java, and Python, and transformations such as map, filter, flatMap, and sample.

(For more resources related to this topic, see here.)

What is an RDD?

What's in a name might be true for a rose, but perhaps not for a Resilient Distributed Dataset (RDD), whose name in essence describes what it is. RDDs are basically datasets that are distributed across a cluster (remember that the Spark framework is inherently based on an MPP architecture), and they provide resilience (automatic failover) by nature. Before we go into any further detail, let's try to understand this a little bit, and again we are trying to be as abstract as possible. Let us assume that you have sensor data from aircraft and you want to analyze the data irrespective of its size and locality. For example, an Airbus A350 has roughly 6000 sensors across the entire plane and generates 2.5 TB of data per day, while the newer model expected to launch in 2020 will generate roughly 7.5 TB per day. From a data engineering point of view, it might be important to understand the data pipeline, but from an analyst's and a data scientist's point of view, your major concern is to analyze the data irrespective of the size and number of nodes across which it has been stored. This is where the neatness of the RDD concept comes into play, where the sensor data can be encapsulated as an RDD, and any transformation/action that you perform on the RDD applies across the entire dataset. Six months' worth of data for an A350 would be approximately 450 TB, and would need to sit across multiple machines. For the sake of discussion, we assume that you are working on a cluster of four worker machines. Your data would be partitioned across the workers as follows:

Figure 2-1: RDD split across a cluster

The figure basically explains that an RDD is a distributed collection of the data, and the framework distributes the data across the cluster. Data distribution across a set of machines brings its own set of challenges, including recovering from node failures. RDDs are resilient as they can be recomputed from the RDD lineage graph, which is basically a graph of all the parent RDDs of the RDD. In addition to the resilience, distribution, and representing a dataset, an RDD has various other distinguishing qualities:

In memory: An RDD is a memory-resident collection of objects. We'll look at options where an RDD can be stored in memory, on disk, or both. However, the execution speed of Spark stems from the fact that the data is in memory, and is not fetched from disk for each operation.

Partitioned: A partition is a division of a logical dataset or constituent elements into independent parts. Partitioning is a de facto performance optimization technique in distributed systems to achieve minimal network traffic, a killer for high performance workloads. The objective of partitioning in key-value oriented data is to collocate similar ranges of keys and, in effect, minimize shuffling. Data inside an RDD is split into partitions across various nodes of the cluster.

Typed: Data in an RDD is strongly typed. When you create an RDD, all the elements are typed depending on the data type.

Lazy evaluation: The transformations in Spark are lazy, which means data inside an RDD is not available until you perform an action.
You can, however, make the data available at any time using a count() action on the RDD. We'll discuss this later and the benefits associated with it.

Immutable: An RDD once created cannot be changed. It can, however, be transformed into a new RDD by performing a set of transformations on it.

Parallel: An RDD is operated on in parallel. Since the data is spread across a cluster in various partitions, each partition is operated on in parallel.

Cacheable: Since RDDs are lazily evaluated, any action on an RDD will cause the RDD to re-evaluate all the transformations that led to its creation. This is generally not a desirable behavior on large datasets, and hence Spark allows the option to persist the data in memory or on disk.

A typical Spark program flow with an RDD includes:

Creation of an RDD from a data source.
A set of transformations, for example, filter, map, join, and so on.
Persisting the RDD to avoid re-execution.
Calling actions on the RDD to start performing parallel operations across the cluster.

This is depicted in the following figure:

Figure 2-2: Typical Spark RDD flow

Operations on RDD

Two major operation types can be performed on an RDD. They are called:

Transformations
Actions

Transformations

Transformations are operations that create a new dataset, as RDDs are immutable. They are used to transform data from one form to another, which could result in amplification of the data, reduction of the data, or a totally different shape altogether. These operations do not return any value back to the driver program, and are hence lazily evaluated, which is one of the main benefits of Spark. An example of a transformation would be a map function that will pass through each element of the RDD and return a totally new RDD representing the results of application of the function on the original dataset.

Actions

Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy, which essentially means that Spark remembers all the transformations carried out on an RDD, and applies them in the most optimal fashion when an action is called. For example, you might have a 1 TB dataset, which you pass through a set of map functions by applying various transformations. Finally, you apply the reduce action on the dataset. Apache Spark will return only a final dataset, which might be a few MBs, rather than the entire 1 TB dataset of mapped intermediate results. You should, however, remember to persist intermediate results; otherwise Spark will recompute the entire RDD graph each time an action is called. The persist() method on an RDD should help you avoid recomputation and saving intermediate results. We'll look at this in more detail later. Let's illustrate the work of transformations and actions with a simple example. In this specific example, we'll be using a flatMap() transformation and a count action. We'll use the README.md file from the local filesystem as an example. We'll give a line-by-line explanation of the Scala example, and then provide code for Python and Java. As always, you must try this example with your own piece of text and investigate the results:

//Loading the README.md file
val dataFile = sc.textFile("README.md")

Now that the data has been loaded, we'll need to run a transformation.
Since we know that each line of the text is loaded as a separate element, we'll need to run a flatMap transformation and separate out individual words as separate elements, for which we'll use the split function and use space as a delimiter:

//Separate out a list of words from individual RDD elements
val words = dataFile.flatMap(line => line.split(" "))

Remember that until this point, while you seem to have applied a transformation function, nothing has been executed and all the transformations have been added to the logical plan. Also note that the transformation function returns a new RDD. We can then call the count() action on the words RDD, to perform the computation, which then results in fetching of data from the file to create an RDD, before applying the transformation function specified. You might note that we have actually passed a function to Spark:

//Count the total number of words
words.count()

Upon calling the count() action the RDD is evaluated, and the results are sent back to the driver program. This is very neat and especially useful during big data applications. If you are Python savvy, you may want to run the following code in PySpark. You should note that lambda functions are passed to the Spark framework:

//Loading data file, applying transformations and action
dataFile = sc.textFile("README.md")
words = dataFile.flatMap(lambda line: line.split(" "))
words.count()

Programming the same functionality in Java is also quite straightforward and will look pretty similar to the program in Scala:

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
long wordCount = words.count();

This might look like a simple program, but behind the scenes it is taking the line.split(" ") function and applying it to all the partitions in the cluster in parallel. The framework provides this simplicity and does all the background work of coordination to schedule it across the cluster, and get the results back.

Passing functions to Spark (Scala)

As you have seen in the previous example, passing functions is a critical functionality provided by Spark. From a user's point of view you would pass the function in your driver program, and Spark would figure out the location of the data partitions across the cluster memory, running it in parallel. The exact syntax of passing functions differs by the programming language. Since Spark has been written in Scala, we'll discuss Scala first. In Scala, the recommended ways to pass functions to the Spark framework are as follows:

Anonymous functions
Static singleton methods

Anonymous functions

Anonymous functions are used for short pieces of code. They are also referred to as lambda expressions, and are a cool and elegant feature of the programming language. The reason they are called anonymous functions is because you can give any name to the input argument and the result would be the same. For example, the following code examples would produce the same output:

val words = dataFile.map(line => line.split(" "))
val words = dataFile.map(anyline => anyline.split(" "))
val words = dataFile.map(_.split(" "))

Figure 2-11: Passing anonymous functions to Spark in Scala

Static singleton functions

While anonymous functions are really helpful for short snippets of code, they are not very helpful when you want to request the framework for a complex data manipulation. Static singleton functions come to the rescue with their own nuances, which we will discuss in this section.
In software engineering, the Singleton pattern is a design pattern that restricts instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system. Static methods belong to the class and not an instance of it. They usually take input from the parameters, perform actions on it, and return the result.

Figure 2-12: Passing static singleton functions to Spark in Scala

Static singleton is the preferred way to pass functions, as technically you can create a class and call a method on the class instance. For example:

class UtilFunctions {
  def split(inputParam: String): Array[String] = { inputParam.split(" ") }
  def operate(rdd: RDD[String]): RDD[String] = { rdd.flatMap(split) }
}

You can send a method in a class, but that has performance implications as the entire object would be sent along with the method.

Passing functions to Spark (Java)

In Java, to create a function you will have to implement the interfaces available in the org.apache.spark.api.java.function package. There are two popular ways to create such functions:

Implement the interface in your own class, and pass the instance to Spark.
Starting with Java 8, you can use lambda expressions to pass off the functions to the Spark framework.

Let's reimplement the preceding word count examples in Java:

Figure 2-13: Code example of Java implementation of word count (inline functions)

If you belong to a group of programmers who feel that writing inline functions makes the code complex and unreadable (a lot of people do agree with that assertion), you may want to create separate functions and call them as follows:

Figure 2-14: Code example of Java implementation of word count

Passing functions to Spark (Python)

Python provides a simple way to pass functions to Spark. The Spark programming guide available at spark.apache.org suggests there are three recommended ways to do this:

Lambda expressions: the ideal way for short functions that can be written inside a single expression
Local defs inside the function calling into Spark, for longer code
Top-level functions in a module

While we have already looked at the lambda functions in some of the previous examples, let's look at local definitions of the functions. Our example stays the same, which is we are trying to count the total number of words in a text file in Spark:

def splitter(lineOfText):
    words = lineOfText.split(" ")
    return len(words)

def aggregate(numWordsLine1, numWordsLineNext):
    totalWords = numWordsLine1 + numWordsLineNext
    return totalWords

Let's see the working code example:

Figure 2-15: Code example of Python word count (local definition of functions)

Here's another way to implement this by defining the functions as a part of a UtilFunctions class, and referencing them within your map and reduce functions:

Figure 2-16: Code example of Python word count (Utility class)

You may want to be a bit cheeky here and try to add a countWords() to the UtilFunctions, so that it takes an RDD as input, and returns the total number of words. This method has potential performance implications as the whole object will need to be sent to the cluster. Let's see how this can be implemented and the results in the following screenshot:

Figure 2-17: Code example of Python word count (Utility class - 2)

This can be avoided by making a copy of the referenced data field in a local object, rather than accessing it externally. A minimal sketch of this pattern follows.
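The sketch below is illustrative only; the class and field names are hypothetical and not from the book's code bundle. Referencing self inside the passed lambda drags the whole object across the cluster, while copying the field into a local variable first means only that value is captured by the closure.

class WordUtils(object):
    def __init__(self, delimiter=" "):
        self.delimiter = delimiter

    def count_words_naive(self, rdd):
        # self.delimiter is referenced inside the lambda, so the entire
        # WordUtils instance is serialized and shipped with every task
        return rdd.map(lambda line: len(line.split(self.delimiter))).sum()

    def count_words_better(self, rdd):
        # copy the field into a local variable first; only this string is
        # captured by the closure, not the whole object
        delimiter = self.delimiter
        return rdd.map(lambda line: len(line.split(delimiter))).sum()

Used against the earlier example, something like WordUtils().count_words_better(sc.textFile("README.md")) would return the total word count without shipping the helper object to the executors.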
Now that we have had a look at how to pass functions to Spark, and have already looked at some of the transformations and actions in the previous examples, including map, flatMap, and reduce, let's look at the most common transformations and actions used in Spark. The list is not exhaustive, and you can find more examples in the Apache Spark documentation in the programming guide. If you would like to get a comprehensive list of all the available functions, you might want to check the following API docs:

Scala: RDD http://bit.ly/2bfyoTo, PairRDD http://bit.ly/2bfzgah
Python: RDD http://bit.ly/2bfyURl, PairRDD N/A
Java: RDD http://bit.ly/2bfyRov, PairRDD http://bit.ly/2bfyOsH
R: RDD http://bit.ly/2bfyrOZ, PairRDD N/A

Table 2.1 – RDD and PairRDD API references

Transformations

The following table shows the most common transformations:

map(func)                                  coalesce(numPartitions)
filter(func)                               repartition(numPartitions)
flatMap(func)                              repartitionAndSortWithinPartitions(partitioner)
mapPartitions(func)                        join(otherDataset, [numTasks])
mapPartitionsWithIndex(func)               cogroup(otherDataset, [numTasks])
sample(withReplacement, fraction, seed)    cartesian(otherDataset)

Map(func)

The map transformation is the most commonly used and the simplest of transformations on an RDD. The map transformation applies the function passed in the arguments to each of the elements of the source RDD. In the previous examples, we have seen the usage of the map() transformation where we passed the split() function to the input RDD.

Figure 2-18: Operation of a map() function

We'll not give further examples of the map() function as we have already seen plenty of them previously.

Filter(func)

Filter, as the name implies, filters the input RDD and creates a new dataset that satisfies the predicate passed as an argument.

Example 2-1: Scala filtering example:

val dataFile = sc.textFile("README.md")
val linesWithApache = dataFile.filter(line => line.contains("Apache"))

Example 2-2: Python filtering example:

dataFile = sc.textFile("README.md")
linesWithApache = dataFile.filter(lambda line: "Apache" in line)

Example 2-3: Java filtering example:

JavaRDD<String> dataFile = sc.textFile("README.md");
JavaRDD<String> linesWithApache = dataFile.filter(line -> line.contains("Apache"));

flatMap(func)

The flatMap transformation is similar to map, but it offers a bit more flexibility. From the perspective of similarity to a map function, it operates on all the elements of the RDD, but the flexibility stems from its ability to handle functions that return a sequence rather than a single item. As you saw in the preceding examples, we used flatMap to flatten the result of the split(" ") function, which returns a flattened structure rather than an RDD of string arrays.

Figure 2-19: Operational details of the flatMap() transformation

Let's look at the flatMap example in Scala.

Example 2-4: The flatMap() example in Scala:

val favMovies = sc.parallelize(List("Pulp Fiction","Requiem for a dream","A clockwork Orange"))
favMovies.flatMap(movieTitle => movieTitle.split(" ")).collect()

A flatMap in the Python API would produce similar results.

Example 2-5: The flatMap() example in Python:

movies = sc.parallelize(["Pulp Fiction","Requiem for a dream","A clockwork Orange"])
movies.flatMap(lambda movieTitle: movieTitle.split(" ")).collect()

The flatMap example in Java is a bit long-winded, but it essentially produces the same results.
Example 2-6: The flatMap() example in Java:

JavaRDD<String> movies = sc.parallelize(
    Arrays.asList("Pulp Fiction","Requiem for a dream","A clockwork Orange"));
JavaRDD<String> movieName = movies.flatMap(
    new FlatMapFunction<String,String>() {
        public Iterator<String> call(String movie) {
            return Arrays.asList(movie.split(" ")).iterator();
        }
    });

Sample(withReplacement, fraction, seed)

Sampling is an important component of any data analysis and it can have a significant impact on the quality of your results/findings. Spark provides an easy way to sample RDDs for your calculations, if you would prefer to quickly test your hypothesis on a subset of data before running it on the full dataset. Here is a quick overview of the parameters that are passed to the method:

withReplacement: A Boolean (True/False) that indicates if elements can be sampled multiple times (replaced when sampled out). Sampling with replacement means that the two sample values are independent. In practical terms this means that if we draw two samples with replacement, what we get on the first one doesn't affect what we get on the second draw, and hence the covariance between the two samples is zero. If we are sampling without replacement, the two samples aren't independent. Practically this means what we got on the first draw affects what we get on the second one, and hence the covariance between the two isn't zero.

fraction: Indicates the expected size of the sample as a fraction of the RDD's size. The fraction must be between 0 and 1. For example, if you want to draw a 5% sample, you can choose 0.05 as the fraction.

seed: The seed used for the random number generator.

Let's look at the sampling example in Scala.

Example 2-7: The sample() example in Scala:

val data = sc.parallelize(
    List(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
data.sample(true,0.1,12345).collect()

The sampling example in Python looks similar to the one in Scala.

Example 2-8: The sample() example in Python:

data = sc.parallelize(
    [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
data.sample(True,0.1,12345).collect()

In Java, our sampling example returns an RDD of integers.

Example 2-9: The sample() example in Java:

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(
    1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20));
nums.sample(true,0.1,12345).collect();

References

https://spark.apache.org/docs/latest/programming-guide.html
http://www.purplemath.com/modules/numbprop.htm

Summary

We have gone through everything from the concept of creating an RDD to manipulating data within the RDD. We've looked at the transformations and actions available to an RDD, and walked through various code examples to explain the differences between transformations and actions.

Resources for Article: Further resources on this subject: Getting Started with Apache Spark [article] Getting Started with Apache Spark DataFrames [article] Sabermetrics with Apache Spark [article]


Review of SQL Server Features for Developers

Packt
13 Feb 2017
43 min read
In this article by Dejan Sarka, Miloš Radivojević, and William Durkin, the authors of the book, SQL Server 2016 Developer's Guide explains that before dwelling into the new features in SQL Server 2016, let's make a quick recapitulation of the SQL Server features for developers available already in the previous versions of SQL Server. Recapitulating the most important features with help you remember what you already have in your development toolbox and also understanding the need and the benefits of the new or improved features in SQL Server 2016. The recapitulation starts with the mighty T-SQL SELECT statement. Besides the basic clauses, advanced techniques like window functions, common table expressions, and APPLY operator are explained. Then you will pass quickly through creating and altering database objects, including tables and programmable objects, like triggers, views, user-defined functions, and stored procedures. You will also review the data modification language statements. Of course, errors might appear, so you have to know how to handle them. In addition, data integrity rules might require that two or more statements are executed as an atomic, indivisible block. You can achieve this with help of transactions. The last section of this article deals with parts of SQL Server Database Engine marketed with a common name "Beyond Relational". This is nothing beyond the Relational Model, the "beyond relational" is really just a marketing term. Nevertheless, you will review the following: How SQL Server supports spatial data How you can enhance the T-SQL language with Common Language Runtime (CLR) elements written is some .NET language like Visual C# How SQL Server supports XML data The code in this article uses the WideWorldImportersDW demo database. In order to test the code, this database must be present in your SQL Server instance you are using for testing, and you must also have SQL Server Management Studio (SSMS) as the client tool. This article will cover the following points: Core Transact-SQL SELECT statement elements Advanced SELECT techniques Error handling Using transactions Spatial data XML support in SQL Server (For more resources related to this topic, see here.) The Mighty Transact-SQL SELECT You probably already know that the most important SQL statement is the mighty SELECT statement you use to retrieve data from your databases. Every database developer knows the basic clauses and their usage: SELECT to define the columns returned, or a projection of all table columns FROM to list the tables used in the query and how they are associated, or joined WHERE to filter the data to return only the rows that satisfy the condition in the predicate GROUP BY to define the groups over which the data is aggregated HAVING to filter the data after the grouping with conditions that refer to aggregations ORDER BY to sort the rows returned to the client application Besides these basic clauses, SELECT offers a variety of advanced possibilities as well. These advanced techniques are unfortunately less exploited by developers, although they are really powerful and efficient. Therefore, I urge you to review them and potentially use them in your applications. 
The advanced query techniques presented here include: Queries inside queries, or shortly subqueries Window functions TOP and OFFSET...FETCH expressions APPLY operator Common tables expressions, or CTEs Core Transact-SQL SELECT Statement Elements Let us start with the most simple concept of SQL which every Tom, Dick, and Harry is aware of! The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the select clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set. The following code switches to the WideWorldImportersDW database context and selects all data from the Dimension.Customer table. USE WideWorldImportersDW; SELECT * FROM Dimension.Customer; The code returns 403 rows, all customers with all columns. Using SELECT * is not recommended in production. Such queries can return an unexpected result when the table structure changes, and is also not suitable for good optimization. Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only four columns from the table. SELECT [Customer Key], [WWI Customer ID], [Customer], [Buying Group] FROM Dimension.Customer; Below is the shortened result, limited to the first three rows only. Customer Key WWI Customer ID Customer Buying Group ------------ --------------- ----------------------------- ------------- 0 0 Unknown N/A 1 1 Tailspin Toys (Head Office) Tailspin Toys 2 2 Tailspin Toys (Sylvanite, MT) Tailspin Toys You can see that the column names in the WideWorldImportersDW database include spaces. Names that include spaces are called delimited identifiers. In order to make SQL Server properly understand them as column names, you must enclose delimited identifiers in square parentheses. However, if you prefer to have names without spaces, or is you use computed expressions in the column list, you can add column aliases. The following query returns completely the same data as the previous one, just with columns renamed by aliases to avoid delimited names. SELECT [Customer Key] AS CustomerKey, [WWI Customer ID] AS CustomerId, [Customer], [Buying Group] AS BuyingGroup FROM Dimension.Customer; You might have noticed in the result set returned from the last two queries that there is also a row in the table for an unknown customer. You can filter this row with the WHERE clause. SELECT [Customer Key] AS CustomerKey, [WWI Customer ID] AS CustomerId, [Customer], [Buying Group] AS BuyingGroup FROM Dimension.Customer WHERE [Customer Key] <> 0; In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, like customers in the examples you have seen so far. In order to get result sets meaningful for the business your database supports, you most of the time need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. Rows returned are those for which the condition in the join predicate for the two tables joined evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false or NULL. For an inner join, the order of the tables involved in the join is not important. 
In the following example, you can see the Fact.Sale table joined with an inner join to the Dimension.Customer table.
SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId,
 c.[Customer], c.[Buying Group] AS BuyingGroup,
 f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit
FROM Fact.Sale AS f
 INNER JOIN Dimension.Customer AS c
  ON f.[Customer Key] = c.[Customer Key];
In the query, you can see that table aliases are used. If a column's name is unique across all tables in the query, then you can use it without the table name. If not, you need to use the table name in front of the column name, in the format table.column, to avoid ambiguous column names. In the previous query, the [Customer Key] column appears in both tables. Therefore, you need to precede this column name with the name of the table of its origin to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can't refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query.
The previous query returns 228,265 rows. It is always advisable to know at least approximately the number of rows your query should return. This number is the first control of the correctness of the result set, or, said differently, of whether the query is logically correct. The query returns the unknown customer and the orders associated with this customer, or, more precisely, associated with this placeholder for an unknown customer. Of course, you can use the WHERE clause to filter the rows in a query that joins multiple tables, just as you use it for a single-table query. The following query filters out the unknown customer rows.
SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId,
 c.[Customer], c.[Buying Group] AS BuyingGroup,
 f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit
FROM Fact.Sale AS f
 INNER JOIN Dimension.Customer AS c
  ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0;
The query returns 143,968 rows. You can see that a lot of sales are associated with the unknown customer. Of course, the Dimension.Customer table is not the only table you can join to the Fact.Sale table. The following query joins it to the Dimension.Date table. Again, the join performed is an inner join.
SELECT d.Date, f.[Total Excluding Tax], f.[Delivery Date Key]
FROM Fact.Sale AS f
 INNER JOIN Dimension.Date AS d
  ON f.[Delivery Date Key] = d.Date;
The query returns 227,981 rows. The query that joined the Fact.Sale table to the Dimension.Customer table returned 228,265 rows. It looks like not all Fact.Sale rows have a known delivery date, so not all rows can match the Dimension.Date table rows. You can use an outer join to check this. With an outer join, you preserve the rows from one or both tables, even if they don't have a match in the other table. The result set returned includes all of the matched rows, like you get from an inner join, plus the preserved rows.
Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN, then the rows from the left table are preserved. If you use RIGHT OUTER JOIN, then the rows from the right table are preserved. Of course, in both cases, the order of the tables involved in the join is important. With a FULL OUTER JOIN, you preserve the rows from both tables, and the order of the tables is not important. The following query preserves the rows from the Fact.Sale table, which is on the left side of the join to the Dimension.Date table. In addition, the query sorts the result set by the invoice date descending using the ORDER BY clause. SELECT d.Date, f.[Total Excluding Tax], f.[Delivery Date Key], f.[Invoice Date Key] FROM Fact.Sale AS f LEFT OUTER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date ORDER BY f.[Invoice Date Key] DESC; The query returns 228,265 rows. Here is the partial result of the query. Date Total Excluding Tax Delivery Date Key Invoice Date Key ---------- -------------------- ----------------- ---------------- NULL 180.00 NULL 2016-05-31 NULL 120.00 NULL 2016-05-31 NULL 160.00 NULL 2016-05-31 … … … … 2016-05-31 2565.00 2016-05-31 2016-05-30 2016-05-31 88.80 2016-05-31 2016-05-30 2016-05-31 50.00 2016-05-31 2016-05-30 For the last invoice date (2016-05-31), the delivery date is NULL. The NULL in the Date column form the Dimension.Date table is there because the data from this table is unknown for the rows with an unknown delivery date in the Fact.Sale table. Joining more than two tables is not tricky if all of the joins are inner joins. The order of joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don't control the join order with the outer joins, it might happen that a subsequent inner join filters out the preserved rows if an outer join. You can control the join order with parenthesis. The following query joins the Fact.Sale table with an inner join to the Dimension.Customer, Dimension.City, Dimension.[Stock Item], and Dimension.Employee tables, and with an left outer join to the Dimension.Date table. SELECT cu.[Customer Key] AS CustomerKey, cu.Customer, ci.[City Key] AS CityKey, ci.City, ci.[State Province] AS StateProvince, ci.[Sales Territory] AS SalesTeritory, d.Date, d.[Calendar Month Label] AS CalendarMonth, d.[Calendar Year] AS CalendarYear, s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product, s.Color, e.[Employee Key] AS EmployeeKey, e.Employee, f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit FROM (Fact.Sale AS f INNER JOIN Dimension.Customer AS cu ON f.[Customer Key] = cu.[Customer Key] INNER JOIN Dimension.City AS ci ON f.[City Key] = ci.[City Key] INNER JOIN Dimension.[Stock Item] AS s ON f.[Stock Item Key] = s.[Stock Item Key] INNER JOIN Dimension.Employee AS e ON f.[Salesperson Key] = e.[Employee Key]) LEFT OUTER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date; The query returns 228,265 rows. Note that with the usage of the parenthesis the order of joins is defined in the following way: Perform all inner joins, with an arbitrary order among them Execute the left outer join after all of the inner joins So far, I have tacitly assumed that the Fact.Sale table has 228,265 rows, and that the previous query needed only one outer join of the Fact.Sale table with the Dimension.Date to return all of the rows. It would be good to check this number in advance. 
You can check the number of rows by aggregating them using the COUNT(*) aggregate function. The following query introduces that function. SELECT COUNT(*) AS SalesCount FROM Fact.Sale; Now you can be sure that the Fact.Sale table has exactly 228,265 rows. Many times you need to aggregate data in groups. This is the point where the GROUP BY clause becomes handy. The following query aggregates the sales data for each customer. SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer; The query returns 402 rows, one for each known customer. In the SELECT clause, you can have only the columns used for grouping, or aggregated columns. You need to get a scalar, a single aggregated value for each row for each column not included in the GROUP BY list. Sometimes you need to filter aggregated data. For example, you might need to find only frequent customers, defined as customers with more than 400 rows in the Fact.Sale table. You can filter the result set on the aggregated data by using the HAVING clause, like the following query shows. SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer HAVING COUNT(*) > 400; The query returns 45 rows for 45 most frequent known customers. Note that you can't use column aliases from the SELECT clause in any other clause introduced in the previous query. The SELECT clause logically executes after all other clause from the query, and the aliases are not known yet. However, the ORDER BY clause executes after the SELECT clause, and therefore the columns aliases are already known and you can refer to them. The following query shows all of the basic SELECT statement clauses used together to aggregate the sales data over the known customers, filters the data to include the frequent customers only, and sorts the result set descending by the number of rows of each customer in the Fact.Sale table. SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer HAVING COUNT(*) > 400 ORDER BY InvoiceLinesCountDESC; The query returns 45 rows. Below is the shortened result set. Customer TotalQuantity TotalAmount SalesCount ------------------------------------- ------------- ------------ ----------- Tailspin Toys (Vidrine, LA) 18899 340163.80 455 Tailspin Toys (North Crows Nest, IN) 17684 313999.50 443 Tailspin Toys (Tolna, ND) 16240 294759.10 443 Advanced SELECT Techniques Aggregating data over the complete input rowset or aggregating in groups produces aggregated rows only – either one row for the whole input rowset or one row per group. Sometimes you need to return aggregates together with the detail data. One way to achieve this is by using subqueries, queries inside queries. The following query shows an example of using two subqueries in a single query. In the SELECT clause, a subquery that calculates the sum of quantity for each customer. It returns a scalar value. The subquery refers to the customer key from the outer query. 
The subquery can't execute without the outer query. This is a correlated subquery. There is another subquery in the FROM clause that calculates overall quantity for all customers. This query returns a table, although it is a table with a single row and single column. This query is a self-contained subquery, independent of the outer query. A subquery in the FROM clause is also called a derived table. Another type of join is used to add the overall total to each detail row. A cross join is a Cartesian product of two input rowsets—each row from one side is associated with every single row from the other side. No join condition is needed. A cross join can produce an unwanted huge result set. For example, if you cross join just a 1,000 rows from the left side of the join with 1,000 rows from the right side, you get 1,000,000 rows in the output. Therefore, typically you want to avoid a cross join in production. However, in the example in the following query, 143,968 from the left side rows is cross joined to a single row from the subquery, therefore producing 143,968 only. Effectively, this means that the overall total column is added to each detail row. SELECT c.Customer, f.Quantity, (SELECT SUM(f1.Quantity) FROM Fact.Sale AS f1 WHERE f1.[Customer Key] = c.[Customer Key]) AS TotalCustomerQuantity, f2.TotalQuantity FROM (Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]) CROSS JOIN (SELECT SUM(f2.Quantity) FROM Fact.Sale AS f2 WHERE f2.[Customer Key] <> 0) AS f2(TotalQuantity) WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, f.Quantity DESC; Here is an abbreviated output of the query. Customer Quantity TotalCustomerQuantity TotalQuantity ---------------------------- ----------- --------------------- ------------- Tailspin Toys (Absecon, NJ) 360 12415 5667611 Tailspin Toys (Absecon, NJ) 324 12415 5667611 Tailspin Toys (Absecon, NJ) 288 12415 5667611 In the previous example, the correlated subquery in the SELECT clause has to logically execute once per row of the outer query. The query was partially optimized by moving the self-contained subquery for the overall total in the FROM clause, where logically executes only once. Although SQL Server can many times optimize correlated subqueries and convert them to joins, there exist also a much better and more efficient way to achieve the same result as the previous query returned. You can do this by using the window functions. The following query is using the window aggregate function SUM to calculate the total over each customer and the overall total. The OVER clause defines the partitions, or the windows of the calculation. The first calculation is partitioned over each customer, meaning that the total quantity per customer is reset to zero for each new customer. The second calculation uses an OVER clause without specifying partitions, thus meaning the calculation is done over all input rowset. This query produces exactly the same result as the previous one/ SELECT c.Customer, f.Quantity, SUM(f.Quantity) OVER(PARTITION BY c.Customer) AS TotalCustomerQuantity, SUM(f.Quantity) OVER() AS TotalQuantity FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, f.Quantity DESC; You can use many other functions for window calculations. For example, you can use the ranking functions, like ROW_NUMBER(), to calculate some rank in the window or in the overall rowset. However, rank can be defined only over some order of the calculation. 
You can specify the order of the calculation in the ORDER BY sub-clause inside the OVER clause. Please note that this ORDER BY clause defines only the logical order of the calculation, and not the order of the rows returned. A stand-alone, outer ORDER BY at the end of the query defines the order of the result. The following query calculates a sequential number, the row number of each row in the output, for each detail row of the input rowset. The row number is calculated once in partitions for each customer and once ever the whole input rowset. Logical order of calculation is over quantity descending, meaning that row number 1 gets the largest quantity, either the largest for each customer or the largest in the whole input rowset. SELECT c.Customer, f.Quantity, ROW_NUMBER() OVER(PARTITION BY c.Customer ORDER BY f.Quantity DESC) AS CustomerOrderPosition, ROW_NUMBER() OVER(ORDER BY f.Quantity DESC) AS TotalOrderPosition FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, f.Quantity DESC; The query produces the following result, abbreviated to couple of rows only again. Customer Quantity CustomerOrderPosition TotalOrderPosition ----------------------------- ----------- --------------------- -------------------- Tailspin Toys (Absecon, NJ) 360 1 129 Tailspin Toys (Absecon, NJ) 324 2 162 Tailspin Toys (Absecon, NJ) 288 3 374 … … … … Tailspin Toys (Aceitunas, PR) 288 1 392 Tailspin Toys (Aceitunas, PR) 250 4 1331 Tailspin Toys (Aceitunas, PR) 250 3 1315 Tailspin Toys (Aceitunas, PR) 250 2 1313 Tailspin Toys (Aceitunas, PR) 240 5 1478 Note the position, or the row number, for the second customer. The order does not look to be completely correct – it is 1, 4, 3, 2, 5, and not 1, 2, 3, 4, 5, like you might expect. This is due to repeating value for the second largest quantity, for the quantity 250. The quantity is not unique, and thus the order is not deterministic. The order of the result is defined over the quantity, and not over the row number. You can't know in advance which row will get which row number when the order of the calculation is not defined on unique values. Please also note that you might get a different order when you execute the same query on your SQL Server instance. Window functions are useful for some advanced calculations, like running totals and moving averages as well. However, the calculation of these values can't be performed over the complete partition. You can additionally frame the calculation to a subset of rows of each partition only. The following query calculates the running total of the quantity per customer (the column alias Q_RT in the query) ordered by the sale key and framed differently for each row. The frame is defined from the first row in the partition to the current row. Therefore, the running total is calculated over one row for the first row, over two rows for the second row, and so on. Additionally, the query calculates the moving average of the quantity (the column alias Q_MA in the query) for the last three rows. 
SELECT c.Customer, f.[Sale Key] AS SaleKey, f.Quantity, SUM(f.Quantity) OVER(PARTITION BY c.Customer ORDER BY [Sale Key] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Q_RT, AVG(f.Quantity) OVER(PARTITION BY c.Customer ORDER BY [Sale Key] ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Q_MA FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, f.[Sale Key]; The query returns the following (abbreviated) result. Customer SaleKey Quantity Q_RT Q_MA ---------------------------- -------- ----------- ----------- ----------- Tailspin Toys (Absecon, NJ) 2869 216 216 216 Tailspin Toys (Absecon, NJ) 2870 2 218 109 Tailspin Toys (Absecon, NJ) 2871 2 220 73 Let's find the top three orders by quantity for the Tailspin Toys (Aceitunas, PR) customer! You can do this by using the OFFSET…FETCH clause after the ORDER BY clause, like the following query shows. SELECT c.Customer, f.[Sale Key] AS SaleKey, f.Quantity FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)' ORDER BY f.Quantity DESC OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY; This is the complete result of the query. Customer SaleKey Quantity ------------------------------ -------- ----------- Tailspin Toys (Aceitunas, PR) 36964 288 Tailspin Toys (Aceitunas, PR) 126253 250 Tailspin Toys (Aceitunas, PR) 79272 250 But wait… Didn't the second largest quantity, the value 250, repeat three times? Which two rows were selected in the output? Again, because the calculation is done over a non-unique column, the result is somehow nondeterministic. SQL Server offers another possibility, the TOP clause. You can specify TOP n WITH TIES, meaning you can get all of the rows with ties on the last value in the output. However, this way you don't know the number of the rows in the output in advance. The following query shows this approach. SELECT TOP 3 WITH TIES c.Customer, f.[Sale Key] AS SaleKey, f.Quantity FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)' ORDER BY f.Quantity DESC; This is the complete result of the previous query – this time it is four rows. Customer SaleKey Quantity ------------------------------ -------- ----------- Tailspin Toys (Aceitunas, PR) 36964 288 Tailspin Toys (Aceitunas, PR) 223106 250 Tailspin Toys (Aceitunas, PR) 126253 250 Tailspin Toys (Aceitunas, PR) 79272 250 The next task is to get the top three orders by quantity for each customer. You need to perform the calculation for each customer. The APPLY Transact-SQL operator comes handy here. You use it in the FROM clause. You apply, or execute, a table expression defined on the right side of the operator once for each row of the input rowset from the left side of the operator. There are two flavors of this operator. The CROSS APPLY version filters out the rows from the left rowset if the tabular expression on the right side does not return any row. The OUTER APPLY version preserves the row from the left side, even is the tabular expression on the right side does not return any row, similarly as the LEFT OUTER JOIN does. Of course, columns for the preserved rows do not have known values from the right-side tabular expression. The following query uses the CROSS APPLY operator to calculate top three orders by quantity for each customer that actually does have some orders. 
SELECT c.Customer, t3.SaleKey, t3.Quantity FROM Dimension.Customer AS c CROSS APPLY (SELECT TOP(3) f.[Sale Key] AS SaleKey, f.Quantity FROM Fact.Sale AS f WHERE f.[Customer Key] = c.[Customer Key] ORDER BY f.Quantity DESC) AS t3 WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, t3.Quantity DESC; Below is the result of this query, shortened to first nine rows. Customer SaleKey Quantity ---------------------------------- -------- ----------- Tailspin Toys (Absecon, NJ) 5620 360 Tailspin Toys (Absecon, NJ) 114397 324 Tailspin Toys (Absecon, NJ) 82868 288 Tailspin Toys (Aceitunas, PR) 36964 288 Tailspin Toys (Aceitunas, PR) 126253 250 Tailspin Toys (Aceitunas, PR) 79272 250 Tailspin Toys (Airport Drive, MO) 43184 250 Tailspin Toys (Airport Drive, MO) 70842 240 Tailspin Toys (Airport Drive, MO) 630 225 For the final task in this section, assume that you need to calculate some statistics over totals of customers' orders. You need to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of total count of orders per customer. This means you need to calculate the totals over customers in advance, and then use aggregate functions AVG() and STDEV() on these aggregates. You could do aggregations over customers in advance in a derived table. However, there is another way to achieve this. You can define the derived table in advance, in the WITH clause of the SELECT statement. Such subquery is called a common table expression, or a CTE. CTEs are more readable than derived tables, and might be also more efficient. You could use the result of the same CTE multiple times in the outer query. If you use derived tables, then you need to define them multiple times if you want to use the multiple times in the outer query. The following query shows the usage of a CTE to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of total count of orders per customer. WITH CustomerSalesCTE AS ( SELECT c.Customer, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer ) SELECT ROUND(AVG(TotalAmount), 6) AS AvgAmountPerCustomer, ROUND(STDEV(TotalAmount), 6) AS StDevAmountPerCustomer, AVG(InvoiceLinesCount) AS AvgCountPerCustomer FROM CustomerSalesCTE; It returns the following result. AvgAmountPerCustomer StDevAmountPerCustomer AvgCountPerCustomer --------------------- ---------------------- ------------------- 270479.217661 38586.082621 358 Transactions and Error Handling In a real world application, errors always appear. Syntax or even logical errors can be in the code, the database design might be incorrect, there might even be a bug in the database management system you are using. Even is everything works correctly, you might get an error because the users insert wrong data. With Transact-SQL error handling you can catch such user errors and decide what to do upon them. Typically, you want to log the errors, inform the users about the errors, and sometimes even correct them in the error handling code. Error handling for user errors works on the statement level. If you send SQL Server a batch of two or more statements and the error is in the last statement, the previous statements execute successfully. This might not be what you desire. 
Many times you need to execute a batch of statements as a unit, and fail all of the statements if one of the statements fails. You can achieve this by using transactions. You will learn in this section about: Error handling Transaction management Error Handling You can see there is a need for error handling by producing an error. The following code tries to insert an order and a detail row for this order. EXEC dbo.InsertSimpleOrder @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE'; EXEC dbo.InsertSimpleOrderDetail @OrderId = 6, @ProductId = 2, @Quantity = 0; In SQL Server Management Studio, you can see that an error happened. You should get a message that the error 547 occurred, that The INSERT statement conflicted with the CHECK constraint. If you remember, in order details only rows where the value for the quantity is not equal to zero are allowed. The error occurred in the second statement, in the call of the procedure that inserts an order detail. The procedure that inserted an order executed without an error. Therefore, an order with id equal to six must be in the dbo. SimpleOrders table. The following code tries to insert order six again. EXEC dbo.InsertSimpleOrder @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE'; Of course, another error occurred. This time it should be error 2627, a violation of the PRIMARY KEY constraint. The values of the OrderId column must be unique. Let's check the state of the data after these successful and unsuccessful inserts. SELECT o.OrderId, o.OrderDate, o.Customer, od.ProductId, od.Quantity FROM dbo.SimpleOrderDetails AS od RIGHT OUTER JOIN dbo.SimpleOrders AS o ON od.OrderId = o.OrderId WHERE o.OrderId > 5 ORDER BY o.OrderId, od.ProductId; The previous query checks only orders and their associated details where the order id value is greater than five. The query returns the following result set. OrderId OrderDate Customer ProductId Quantity ----------- ---------- -------- ----------- ----------- 6 2016-07-06 CustE NULL NULL You can see that only the first insert of the order with the id 6 succeeded. The second insert of an order with the same id and the insert of the detail row for the order six did not succeed. You start handling errors by enclosing the statements in the batch you are executing in the BEGIN TRY … END TRY block. You can catch the errors in the BEGIN CATCH … END CATCH block. The BEGIN CATCH statement must be immediately after the END TRY statement. The control of the execution is passed from the try part to the catch part immediately after the first error occurs. In the catch part, you can decide how to handle the errors. If you want to log the data about the error or inform an end user about the details of the error, the following functions might be very handy: ERROR_NUMBER() – this function returns the number of the error. ERROR_SEVERITY() - it returns the severity level. The severity of the error indicates the type of problem encountered. Severity levels 11 to 16 can be corrected by the user. ERROR_STATE() – this function returns the error state number. Error state gives more details about a specific error. You might want to use this number together with the error number to search Microsoft knowledge base for the specific details of the error you encountered. ERROR_PROCEDURE() – it returns the name of the stored procedure or trigger where the error occurred, or NULL if the error did not occur within a stored procedure or trigger. ERROR_LINE() – it returns the line number at which the error occurred. 
This might be the line number in a routine if the error occurred within a stored procedure or trigger, or the line number in the batch. ERROR_MESSAGE() – this function returns the text of the error message. The following code uses the try…catch block to handle possible errors in the batch of the statements, and returns the information of the error using the above mentioned functions. Note that the error happens in the first statement of the batch. BEGIN TRY EXEC dbo.InsertSimpleOrder @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustF'; EXEC dbo.InsertSimpleOrderDetail @OrderId = 6, @ProductId = 2, @Quantity = 5; END TRY BEGIN CATCH SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage, ERROR_LINE() as ErrorLine; END CATCH There was a violation of the PRIMARY KEY constraint again, because the code tried to insert an order with id six again. The second statement would succeed if you would execute in its own batch, without error handling. However, because of the error handling, the control was passed to the catch block immediately after the error in the first statement, and the second statement never executed. You can check the data with the following query. SELECT o.OrderId, o.OrderDate, o.Customer, od.ProductId, od.Quantity FROM dbo.SimpleOrderDetails AS od RIGHT OUTER JOIN dbo.SimpleOrders AS o ON od.OrderId = o.OrderId WHERE o.OrderId > 5 ORDER BY o.OrderId, od.ProductId; The result set should be the same as the results set of the last check of the orders with id greater than five – a single order without details. The following code produces an error in the second statement. BEGIN TRY EXEC dbo.InsertSimpleOrder @OrderId = 7, @OrderDate = '20160706', @Customer = N'CustF'; EXEC dbo.InsertSimpleOrderDetail @OrderId = 7, @ProductId = 2, @Quantity = 0; END TRY BEGIN CATCH SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage, ERROR_LINE() as ErrorLine; END CATCH You can see that the insert of the order detail violates the CHECK constraint for the quantity. If you check the data with the same query as last two times again, you would see that there are orders with id six and seven in the data, both without order details. Using Transactions Your business logic might request that the insert of the first statement fails when the second statement fails. You might need to repeal the changes of the first statement on the failure of the second statement. You can define that a batch of statements executes as a unit by using transactions. The following code shows how to use transactions. Again, the second statement in the batch in the try block is the one that produces an error. BEGIN TRY BEGIN TRANSACTION EXEC dbo.InsertSimpleOrder @OrderId = 8, @OrderDate = '20160706', @Customer = N'CustG'; EXEC dbo.InsertSimpleOrderDetail @OrderId = 8, @ProductId = 2, @Quantity = 0; COMMIT TRANSACTION END TRY BEGIN CATCH SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage, ERROR_LINE() as ErrorLine; IF XACT_STATE() <> 0 ROLLBACK TRANSACTION; END CATCH You can check the data again. SELECT o.OrderId, o.OrderDate, o.Customer, od.ProductId, od.Quantity FROM dbo.SimpleOrderDetails AS od RIGHT OUTER JOIN dbo.SimpleOrders AS o ON od.OrderId = o.OrderId WHERE o.OrderId > 5 ORDER BY o.OrderId, od.ProductId; Here is the result of the check: OrderId OrderDate Customer ProductId Quantity ----------- ---------- -------- ----------- ----------- 6 2016-07-06 CustE NULL NULL 7 2016-07-06 CustF NULL NULL You can see that the order with id 8 does not exist in your data. 
Because of the insert of the detail row for this order failed, the insert of the order was rolled back as well. Note that in the catch block, the XACT_STATE() function was used to check whether the transaction still exists. If the transaction was rolled back automatically by SQL Server, then the ROLLBACK TRANSACTION would produce a new error. The following code drops the objects (in correct order, due to object contraints) created for the explanation of the DDL and DML statements, programmatic objects, error handling, and transactions. DROP FUNCTION dbo.Top2OrderDetails; DROP VIEW dbo.OrdersWithoutDetails; DROP PROCEDURE dbo.InsertSimpleOrderDetail; DROP PROCEDURE dbo.InsertSimpleOrder; DROP TABLE dbo.SimpleOrderDetails; DROP TABLE dbo.SimpleOrders; Beyond Relational The "beyond relational" is actually only a marketing term. The relational model, used in the relational database management system, is nowhere limited to specific data types, or specific languages only. However, with the term beyond relational, we typically mean specialized and complex data types that might include spatial and temporal data, XML or JSON data, and extending the capabilities of the Transact-SQL language with CLR languages like Visual C#, or statistical languages like R. SQL Server in versions before 2016 already supports some of the features mentioned. Here is a quick review of this support that includes: Spatial data CLR support XML data Defining Locations and Shapes with Spatial Data In modern applications, many times you want to show your data on a map, using the physical location. You might also want to show the shape of the objects that your data describes. You can use spatial data for tasks like these. You can represent the objects with points, lines, or polygons. From the simple shapes you can create complex geometrical objects or geographical objects, for example cities and roads. Spatial data appear in many contemporary database. Acquiring spatial data has become quite simple with the Global Positioning System (GPS) and other technologies. In addition, many software packages and database management systems help you working with spatial data. SQL Server supports two spatial data types, both implemented as .NET common language runtime (CLR) data types, from version 2008: The geometry type represents data in a Euclidean (flat) coordinate system. The geography type represents data in a round-earth coordinate system. We need two different spatial data types because of some important differences between them. These differences include units of measurement and orientation. In the planar, or flat-earth, system, you define the units of measurements. The length of a distance and the surface of an area are given in the same unit of measurement as you use for the coordinates of your coordinate system. You as the database developer know what the coordinates mean and what the unit of measure is. In geometry, the distance between the points described with the coordinates (1, 3) and (4, 7) is 5 units, regardless of the units used. You, as the database developer who created the database where you are storing this data, know the context. You know what these 5 units mean, is this 5 kilometers, or 5 inches. When talking about locations on earth, coordinates are given in degrees of latitude and longitude. This is the round-earth, or ellipsoidal system Lengths and areas are usually measured in the metric system, in meters and square meters. However, not everywhere in the world the metric system is used for the spatial data. 
The spatial reference identifier (SRID) of the geography instance defines the unit of measure. Therefore, whenever measuring some distance or area in the ellipsoidal system, you should always quote also the SRID used, which defines the units. In the planar system, the ring orientation of a polygon is not an important factor. For example, a polygon described by the points ((0, 0), (10, 0), (0, 5), (0, 0)) is the same as a polygon described by ((0, 0), (5, 0), (0, 10), (0, 0)). You can always rotate the coordinates appropriately to get the same feeling of the orientation. However, in geography, the orientation is needed to completely describe a polygon. Just think of the equator, which divides the earth in the two hemispheres. Is your spatial data describing the northern or southern hemisphere? The Wide World Importers data warehouse includes the city location in the Dimension.City table. The following query retrieves it for cities in the main part of the USA> SELECT City, [Sales Territory] AS SalesTerritory, Location AS LocationBinary, Location.ToString() AS LocationLongLat FROM Dimension.City WHERE [City Key] <> 0 AND [Sales Territory] NOT IN (N'External', N'Far West'); Here is the partial result of the query. City SalesTerritory LocationBinary LocationLongLat ------------ --------------- -------------------- ------------------------------- Carrollton Mideast 0xE6100000010C70... POINT (-78.651695 42.1083969) Carrollton Southeast 0xE6100000010C88... POINT (-76.5605078 36.9468152) Carrollton Great Lakes 0xE6100000010CDB... POINT (-90.4070632 39.3022693) You can see that the location is actually stored as a binary string. When you use the ToString() method of the location, you get the default string representation of the geographical point, which is the degrees of longitude and latitude. If SSMS, you send the results of the previous query to a grid, you get in the results pane also an additional representation for the spatial data. Click the Spatial results tab, and you can see the points represented in the longitude – latitude coordinate system, like you can see in the following figure. Figure 2-1: Spatial results showing customers' locations If you executed the query, you might have noticed that the spatial data representation control in SSMS has some limitations. It can show only 5,000 objects. The result displays only first 5,000 locations. Nevertheless, as you can see from the previous figure, this is enough to realize that these points form a contour of the main part of the USA. Therefore, the points represent the customers' locations for customers from USA. The following query gives you the details, like location and population, for Denver, Colorado. SELECT [City Key] AS CityKey, City, [State Province] AS State, [Latest Recorded Population] AS Population, Location.ToString() AS LocationLongLat FROM Dimension.City WHERE [City Key] = 114129 AND [Valid To] = '9999-12-31 23:59:59.9999999'; Spatial data types have many useful methods. For example, the STDistance() method returns the shortest line between two geography types. This is a close approximate to the geodesic distance, defined as the shortest route between two points on the Earth's surface. The following code calculates this distance between Denver, Colorado, and Seattle, Washington. 
DECLARE @g AS GEOGRAPHY; DECLARE @h AS GEOGRAPHY; DECLARE @unit AS NVARCHAR(50); SET @g = (SELECT Location FROM Dimension.City WHERE [City Key] = 114129); SET @h = (SELECT Location FROM Dimension.City WHERE [City Key] = 108657); SET @unit = (SELECT unit_of_measure FROM sys.spatial_reference_systems WHERE spatial_reference_id = @g.STSrid); SELECT FORMAT(@g.STDistance(@h), 'N', 'en-us') AS Distance, @unit AS Unit; The result of the previous batch is below. Distance Unit ------------- ------ 1,643,936.69 metre Note that the code uses the sys.spatial_reference_system catalog view to get the unit of measure for the distance of the SRID used to store the geographical instances of data. The unit is meter. You can see that the distance between Denver, Colorado, and Seattle, Washington, is more than 1,600 kilometers. The following query finds the major cities within a circle of 1,000 km around Denver, Colorado. Major cities are defined as the cities with population larger than 200,000. DECLARE @g AS GEOGRAPHY; SET @g = (SELECT Location FROM Dimension.City WHERE [City Key] = 114129); SELECT DISTINCT City, [State Province] AS State, FORMAT([Latest Recorded Population], '000,000') AS Population, FORMAT(@g.STDistance(Location), '000,000.00') AS Distance FROM Dimension.City WHERE Location.STIntersects(@g.STBuffer(1000000)) = 1 AND [Latest Recorded Population] > 200000 AND [City Key] <> 114129 AND [Valid To] = '9999-12-31 23:59:59.9999999' ORDER BY Distance; Here is the result abbreviated to the twelve closest cities to Denver, Colorado. City State Population Distance ----------------- ----------- ----------- ----------- Aurora Colorado 325,078 013,141.64 Colorado Springs Colorado 416,427 101,487.28 Albuquerque New Mexico 545,852 537,221.38 Wichita Kansas 382,368 702,553.01 Lincoln Nebraska 258,379 716,934.90 Lubbock Texas 229,573 738,625.38 Omaha Nebraska 408,958 784,842.10 Oklahoma City Oklahoma 579,999 809,747.65 Tulsa Oklahoma 391,906 882,203.51 El Paso Texas 649,121 895,789.96 Kansas City Missouri 459,787 898,397.45 Scottsdale Arizona 217,385 926,980.71 There are many more useful methods and properties implemented in the two spatial data types. In addition, you can improve the performance of spatial queries with help of specialized spatial indexes. Please refer to the MSDN article "Spatial Data (SQL Server)" at https://msdn.microsoft.com/en-us/library/bb933790.aspx for more details on the spatial data types, their methods, and spatial indexes. XML Support in SQL Server SQL Server in version 2005 also started to feature extended support for XML data inside the database engine, although some basic support was already included in version 2000. The support starts by generating XML data from tabular results. You can use the FOR XML clause of the SELECT statement for this task. The following query generates an XML document from the regular tabular result set by using the FOR XML clause with AUTO option, to generate element-centric XML instance, with namespace and inline schema included. SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId, c.[Customer], c.[Buying Group] AS BuyingGroup, f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit FROM Dimension.Customer AS c INNER JOIN Fact.Sale AS f ON c.[Customer Key] = f.[Customer Key] WHERE c.[Customer Key] IN (127, 128) FOR XML AUTO, ELEMENTS, ROOT('CustomersOrders'), XMLSCHEMA('CustomersOrdersSchema'); GO Here is the partial result of this query. 
First part of the result is the inline schema/ <CustomersOrders> <xsd:schema targetNamespace="CustomersOrdersSchema" … <xsd:import namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes" … <xsd:element name="c"> <xsd:complexType> <xsd:sequence> <xsd:element name="CustomerKey" type="sqltypes:int" /> <xsd:element name="CustomerId" type="sqltypes:int" /> <xsd:element name="Customer"> <xsd:simpleType> <xsd:restriction base="sqltypes:nvarchar" … <xsd:maxLength value="100" /> </xsd:restriction> </xsd:simpleType> </xsd:element> … </xsd:sequence> </xsd:complexType> </xsd:element> </xsd:schema> <c > <CustomerKey>127</CustomerKey> <CustomerId>127</CustomerId> <Customer>Tailspin Toys (Point Roberts, WA)</Customer> <BuyingGroup>Tailspin Toys</BuyingGroup> <f> <Quantity>3</Quantity> <Amount>48.00</Amount> <Profit>31.50</Profit> </f> <f> <Quantity>9</Quantity> <Amount>2160.00</Amount> <Profit>1363.50</Profit> </f> </c> <c > <CustomerKey>128</CustomerKey> <CustomerId>128</CustomerId> <Customer>Tailspin Toys (East Portal, CO)</Customer> <BuyingGroup>Tailspin Toys</BuyingGroup> <f> <Quantity>84</Quantity> <Amount>420.00</Amount> <Profit>294.00</Profit> </f> </c> … </CustomersOrders> You can also do the opposite process: convert XML to tables. Converting XML to relational tables is known as shredding XML. You can do this by using the nodes() method of the XML data type or with the OPENXML() rowset function. Inside SQL Server, you can also query the XML data from Transact-SQL to find specific elements, attributes, or XML fragments. XQuery is a standard language for browsing XML instances and returning XML, and is supported inside XML data type methods. You can store XML instances inside SQL Server database in a column of the XML data type. An XML data type includes five methods that accept XQuery as a parameter. The methods support querying (the query() method), retrieving atomic values (the value() method), existence checks (the exist() method), modifying sections within the XML data (the modify() method) as opposed to overriding the whole thing, and shredding XML data into multiple rows in a result set (the nodes() method). The following code creates a variable of the XML data type to store an XML instance in it. Then it uses the query() method to return XML fragments from the XML instance. This method accepts XQuery query as a parameter. The XQuery query uses the FLWOR expressions to define and shape the XML returned. DECLARE @x AS XML; SET @x = N' <CustomersOrders> <Customer custid="1"> <!-- Comment 111 --> <companyname>CustA</companyname> <Order orderid="1"> <orderdate>2016-07-01T00:00:00</orderdate> </Order> <Order orderid="9"> <orderdate>2016-07-03T00:00:00</orderdate> </Order> <Order orderid="12"> <orderdate>2016-07-12T00:00:00</orderdate> </Order> </Customer> <Customer custid="2"> <!-- Comment 222 --> <companyname>CustB</companyname> <Order orderid="3"> <orderdate>2016-07-01T00:00:00</orderdate> </Order> <Order orderid="10"> <orderdate>2016-07-05T00:00:00</orderdate> </Order> </Customer> </CustomersOrders>'; SELECT @x.query('for $i in CustomersOrders/Customer/Order let $j := $i/orderdate where $i/@orderid < 10900 order by ($j)[1] return <Order-orderid-element> <orderid>{data($i/@orderid)}</orderid> {$j} </Order-orderid-element>') AS [Filtered, sorted and reformatted orders with let clause]; Here is the result of the previous query. 
<Order-orderid-element> <orderid>1</orderid> <orderdate>2016-07-01T00:00:00</orderdate> </Order-orderid-element> <Order-orderid-element> <orderid>3</orderid> <orderdate>2016-07-01T00:00:00</orderdate> </Order-orderid-element> <Order-orderid-element> <orderid>9</orderid> <orderdate>2016-07-03T00:00:00</orderdate> </Order-orderid-element> <Order-orderid-element> <orderid>10</orderid> <orderdate>2016-07-05T00:00:00</orderdate> </Order-orderid-element> <Order-orderid-element> <orderid>12</orderid> <orderdate>2016-07-12T00:00:00</orderdate> </Order-orderid-element> Summary In this article, you got a review of the SQL Server features for developers that exists already in the previous versions. You can see that this support goes well beyond basic SQL statements, and also beyond pure Transact-SQL. Resources for Article: Further resources on this subject: Configuring a MySQL linked server on SQL Server 2008 [article] Basic Website using Node.js and MySQL database [article] Exception Handling in MySQL for Python [article]


CNN architecture

Packt
13 Feb 2017
14 min read
In this article by Giancarlo Zaccone, the author of the book Deep Learning with TensorFlow, we learn about multi-layer network the outputs of all neurons of the input layer would be connected to each neuron of the hidden layer (fully connected layer). (For more resources related to this topic, see here.) In CNN networks, instead, the connection scheme, that defines the convolutional layer that we are going to describe, is significantly different. As you can guess this is the main type of layer, the use of one or more of these layers in a convolutional neural network is indispensable. In a convolutional layer, each neuron is connected to a certain region of the input area called the receptive field. For example, using a 3×3 kernel filter, each neuron will have a bias and 9=3×3 weights connected to a single receptive field. Of course, to effectively recognize an image we need different kernel filters applied to the same receptive field, because each filter should recognize a different feature's image. The set of neurons that identify the same feature define a single feature map. The preceding figure shows a CNN architecture in action, the input image of 28×28 size will be analyzed by a convolutional layer composed of 32 feature map of 28×28 size. The figure also shows a receptive field and the kernel filter of 3×3 size. Figure: CNN in action A CNN may consist of several convolution layers connected in cascade. The output of each convolution layer is a set of feature maps (each generated by a single kernel filter), then all these matrices defines a new input that will be used by the next layer. CNNs also use pooling layers positioned immediately after the convolutional layers. A pooling layer divides a convolutional region in subregions and select a single representative value (max-pooling or average pooling) to reduce the computational time of subsequent layers and increase the robustness of the feature with respect its spatial position. The last hidden layer of a convolutional network is generally a fully connected network with softmax activation function for the output layer. A model for CNNs - LeNet Convolutional and max-pooling layers are at the heart of the LeNet family models. It is a family of multi-layered feedforward networks specialized on visual pattern recognition. While the exact details of the model will vary greatly, the following figure points out the graphical schema of a LeNet network: Figure: LeNet network In a LeNet model, the lower-layers are composed to alternating convolution and max-pooling, while the last layers are fully-connected and correspond to a traditional feed forward network (fully connected + softmax layer). The input to the first fully-connected layer is the set of all features maps at the layer below. From a TensorFlow implementation point of view, this means lower-layers operate on 4D tensors. These are then flattened to a 2D matrix to be compatible with a feed forward implementation. Build your first CNN In this section, we will learn how to build a CNN to classify images of the MNIST dataset. We will see that a simple softmax model provides about 92% classification accuracy for recognizing hand-written digits in the MNIST. Here we'll implement a CNN which has a classification accuracy of about 99%. The next figure shows how the data flow in the first two convolutional layer--the input image is processed in the first convolutional layer using the filter-weights. This results in 32 new images, one for each filter in the convolutional layer. 
The images are also dowsampled with the pooling operation so the image resolution is decreased from 28×28 to 14×14. These 32 smaller images are then processed in the second convolutional layer. We need filter-weights again for each of these 32 features, and we need filter-weights for each output channel of this layer. The images are again downsampled with a pooling operation so that the image resolution is decreased from 14×14 to 7×7. The total number of features for this convolutional layer is 64. Figure: Data flow of the first two convolutional layers The 64 resulting images are filtered again by a (3×3) third convolutional layer. We don't apply a pooling operation for this layer. The output of the second convolutional layer is 128 images of 7×7 pixels each. These are then flattened to a single vector of length 4×4×128, which is used as the input to a fully-connected layer with 128 neurons (or elements). This feeds into another fully-connected layer with 10 neurons, one for each of the classes, which is used to determine the class of the image, that is, which number is depicted in the following image: Figure: Data flow of the last three convolutional layers The convolutional filters are initially chosen at random. The error between the predicted and actual class of the input image is measured as the so-called cost function which generalize our network beyond training data. The optimizer then automatically propagates this error back through the convolutional network and updates the filter-weights to improve the classification error. This is done iteratively thousands of times until the classification error is sufficiently low. Now let's see in detail how to code our first CNN. Let's start by importing Tensorflow libraries for our implementation: import tensorflow as tf import numpy as np from tensorflow.examples.tutorials.mnist import input_data Set the following parameters, that indicate the number of samples to consider respectively for the training phase (128) and then in the test phase (256). batch_size = 128 test_size = 256 We define the following parameter, the value is 28 because a MNIST image, is 28 pixels in height and width: img_size = 28 And the number of classes; the value 10 means that we'll have one class for each of 10 digits: num_classes = 10 A placeholder variable, X, is defined for the input images. The data type for this tensor is set to float32 and the shape is set to [None, img_size, img_size, 1], where None means that the tensor may hold an arbitrary number of images: X = tf.placeholder("float", [None, img_size, img_size, 1]) Then we set another placeholder variable, Y, for the true labels associated with the images that were input data in the placeholder variable X. The shape of this placeholder variable is [None, num_classes] which means it may hold an arbitrary number of labels and each label is a vector of length num_classes which is 10 in this case. Y = tf.placeholder("float", [None, num_classes]) We collect the mnist data which will be copied into the data folder: mnist = mnist_data.read_data_sets("data/") We build the datasets for training (trX, trY) and testing the network (teX, teY). trX, trY, teX, teY = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels The trX and teX image sets must be reshaped according the input shape: trX = trX.reshape(-1, img_size, img_size, 1) teX = teX.reshape(-1, img_size, img_size, 1) We shall now proceed to define the network's weights. 
The init_weights function builds new variables in the given shape and initializes the network's weights with random values:

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

Each neuron of the first convolutional layer is convolved with a small subset of the input tensor, of dimension 3×3×1, while the value 32 is just the number of feature maps we are considering for this first layer. The weight w is then defined:

w = init_weights([3, 3, 1, 32])

The number of inputs is then increased to 32, which means that each neuron of the second convolutional layer is convolved with 3×3×32 neurons of the first convolutional layer. The w2 weight is:

w2 = init_weights([3, 3, 32, 64])

The value 64 represents the number of output features obtained. The third convolutional layer is convolved with 3×3×64 neurons of the previous layer, while 128 are the resulting features:

w3 = init_weights([3, 3, 64, 128])

The fourth layer is fully connected; it receives 128×4×4 inputs, while the output is equal to 625:

w4 = init_weights([128 * 4 * 4, 625])

The output layer receives 625 inputs, while the output is the number of classes:

w_o = init_weights([625, num_classes])

Note that these initializations are not actually done at this point; they are merely being defined in the TensorFlow graph. Two placeholder variables hold the dropout keep-probabilities for the convolutional and the fully connected layers:

p_keep_conv = tf.placeholder("float")
p_keep_hidden = tf.placeholder("float")

It's time to define the network model; as we did for the network's weights definition, it will be a function. It receives as input the X tensor, the weights tensors, and the dropout parameters for the convolutional and fully connected layers:

def model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden):

The tf.nn.conv2d() function executes the TensorFlow operation for convolution; note that the strides are set to 1 in all dimensions. Indeed, the first and last stride must always be 1, because the first is for the image-number and the last is for the input-channel. The padding parameter is set to 'SAME', which means the input image is padded with zeroes so that the size of the output is the same:

conv1 = tf.nn.conv2d(X, w, strides=[1, 1, 1, 1], padding='SAME')

Then we pass the conv1 layer to a relu layer. It calculates the max(x, 0) function for each input pixel x, adding some non-linearity to the formula and allowing us to learn more complicated functions:

conv1 = tf.nn.relu(conv1)

The resulting layer is then pooled by the tf.nn.max_pool operator:

conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

It is a 2×2 max-pooling, which means that we are considering 2×2 windows and selecting the largest value in each window. Then we move 2 pixels to the next window. We try to reduce overfitting via the tf.nn.dropout() function, passing the conv1 layer and the p_keep_conv probability value:

conv1 = tf.nn.dropout(conv1, p_keep_conv)

As you can note, the next two convolutional layers, conv2 and conv3, are defined in the same way as conv1:

conv2 = tf.nn.conv2d(conv1, w2, strides=[1, 1, 1, 1], padding='SAME')
conv2 = tf.nn.relu(conv2)
conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
conv2 = tf.nn.dropout(conv2, p_keep_conv)
conv3 = tf.nn.conv2d(conv2, w3, strides=[1, 1, 1, 1], padding='SAME')
conv3_a = tf.nn.relu(conv3)

Two fully-connected layers are added to the network.
The input of the first fully connected layer, FC_layer, is the output of the previous convolutional layer, pooled one last time and then flattened:

FC_layer = tf.nn.max_pool(conv3_a, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
FC_layer = tf.reshape(FC_layer, [-1, w4.get_shape().as_list()[0]])

A dropout function is again used to reduce overfitting:

FC_layer = tf.nn.dropout(FC_layer, p_keep_conv)

The output layer receives FC_layer and the w4 weight tensor as input. A relu and a dropout operator are applied, respectively:

output_layer = tf.nn.relu(tf.matmul(FC_layer, w4))
output_layer = tf.nn.dropout(output_layer, p_keep_hidden)

The result is a vector of length 10 for determining which of the 10 classes the input image belongs to:

result = tf.matmul(output_layer, w_o)
return result

Cross-entropy is the performance measure we use in this classifier. Cross-entropy is a continuous function that is always positive and is equal to zero if the predicted output exactly matches the desired output. The goal of this optimization is therefore to minimize the cross-entropy, so that it gets as close to zero as possible, by changing the variables of the network layers. TensorFlow has a built-in function for calculating the cross-entropy. Note that the function calculates the softmax internally, so we must use the output of the model, py_x, directly:

py_x = model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden)
Y_ = tf.nn.softmax_cross_entropy_with_logits(py_x, Y)

Now that we have defined the cross-entropy for each classified image, we have a measure of how well the model performs on each image individually. But to use the cross-entropy to guide the optimization of the network's variables we need a single scalar value, so we simply take the average of the cross-entropy for all the classified images:

cost = tf.reduce_mean(Y_)

To minimize the evaluated cost, we must define an optimizer. In this case, we adopt the RMSPropOptimizer, which is an advanced form of gradient descent. RMSPropOptimizer implements the RMSProp algorithm, an unpublished adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class. You can find Geoff Hinton's course at https://www.coursera.org/learn/neural-networks

RMSPropOptimizer also divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting the decay parameter to 0.9, while a good default value for the learning rate is 0.001:

optimizer = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost)

Basically, the common Stochastic Gradient Descent (SGD) algorithm has a problem in that learning rates must scale with 1/T to get convergence, where T is the iteration number. RMSProp tries to get around this by automatically adjusting the step size so that the step is on the same scale as the gradients: as the average gradient gets smaller, the coefficient in the SGD update gets bigger to compensate. An interesting reference about this algorithm can be found here: http://www.cs.toronto.edu/%7Etijmen/csc321/slides/lecture_slides_lec6.pdf

Finally, we define predict_op, which is the index with the largest value across dimensions from the output of the model:

predict_op = tf.argmax(py_x, 1)

Note that optimization is not performed at this point. Nothing is calculated at all; we'll just add the optimizer object to the TensorFlow graph for later execution.

We now come to defining the network's running session. There are 55,000 images in the training set, so it takes a long time to calculate the gradient of the model using all these images.
Therefore we'll use a small batch of images in each iteration of the optimizer. If your computer crashes or becomes very slow because you run out of RAM, then you may try to lower this number, but you may then need to perform more optimization iterations.

Now we can proceed to implement a TensorFlow session:

with tf.Session() as sess:
    tf.initialize_all_variables().run()
    for i in range(100):

We get a batch of training examples; the tensor training_batch now holds a subset of images and corresponding labels:

        training_batch = zip(range(0, len(trX), batch_size), range(batch_size, len(trX)+1, batch_size))

Put the batch into feed_dict with the proper names for the placeholder variables in the graph. We run the optimizer using this batch of training data; TensorFlow assigns the variables in the feed to the placeholder variables and then runs the optimizer:

        for start, end in training_batch:
            sess.run(optimizer, feed_dict={X: trX[start:end], Y: trY[start:end], p_keep_conv: 0.8, p_keep_hidden: 0.5})

At the same time we get a shuffled batch of test samples:

        test_indices = np.arange(len(teX))
        np.random.shuffle(test_indices)
        test_indices = test_indices[0:test_size]

For each iteration we display the accuracy evaluated on the batch set:

        print(i, np.mean(np.argmax(teY[test_indices], axis=1) == sess.run(predict_op, feed_dict={X: teX[test_indices], Y: teY[test_indices], p_keep_conv: 1.0, p_keep_hidden: 1.0})))

Training a network can take several hours, depending on the computational resources used. The results on my machine are as follows:

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully extracted to train-images-idx3-ubyte.mnist 9912422 bytes.
Loading data/train-images-idx3-ubyte.mnist
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully extracted to train-labels-idx1-ubyte.mnist 28881 bytes.
Loading data/train-labels-idx1-ubyte.mnist
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully extracted to t10k-images-idx3-ubyte.mnist 1648877 bytes.
Loading data/t10k-images-idx3-ubyte.mnist
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Successfully extracted to t10k-labels-idx1-ubyte.mnist 4542 bytes.
Loading data/t10k-labels-idx1-ubyte.mnist
(0, 0.95703125)
(1, 0.98046875)
(2, 0.9921875)
(3, 0.99609375)
(4, 0.99609375)
(5, 0.98828125)
(6, 0.99609375)
(7, 0.99609375)
(8, 0.98828125)
(9, 0.98046875)
(10, 0.99609375)
. . . . . .
(90, 1.0)
(91, 0.9921875)
(92, 0.9921875)
(93, 0.99609375)
(94, 1.0)
(95, 0.98828125)
(96, 0.98828125)
(97, 0.99609375)
(98, 1.0)
(99, 0.99609375)

After 10,000 iterations, the model has an accuracy of about 99%... not bad!!

Summary

In this article, we introduced Convolutional Neural Networks (CNNs). We have seen how the architecture of these networks makes them particularly suitable for image classification problems, making the training phase faster and the test phase more accurate. We therefore implemented an image classifier, testing it on the MNIST dataset, where we achieved 99% accuracy. Finally, we built a CNN to classify emotions starting from a dataset of images; we tested the network on a single image and we evaluated the limits and the goodness of our model.

Resources for Article:

Further resources on this subject:
Getting Started with Deep Learning [article]
Practical Applications of Deep Learning [article]
Deep learning in R [article]

Bitcoin

Packt
10 Feb 2017
13 min read
In this article by Imran Bashir, the author of the book Mastering Blockchain, we will look at bitcoin and its importance as an electronic cash system. (For more resources related to this topic, see here.)

Bitcoin is the first application of blockchain technology. In this article readers will be introduced to the bitcoin technology in detail. Bitcoin has started a revolution with the introduction of the very first fully decentralized digital currency, one that has been proven to be extremely secure and stable. This has also sparked great interest in academic and industrial research and introduced many new research areas.

Since its introduction in 2008, bitcoin has gained much popularity and is currently the most successful digital currency in the world, with billions of dollars invested in it. It is built on decades of research in the fields of cryptography, digital cash, and distributed computing. In the following section, a brief history is presented in order to provide the background required to understand the foundations behind the invention of bitcoin.

Digital currencies have been an active area of research for many decades. Early proposals to create digital cash go as far back as the early 1980s. In 1982, David Chaum proposed a scheme that used blind signatures to build untraceable digital currency. In this scheme, a bank would issue digital money by signing a blinded and random serial number presented to it by the user. The user could then use the digital token signed by the bank as currency. The limitation in this scheme is that the bank has to keep track of all used serial numbers. This is a central system by design and requires the users to trust it.

Later on, in 1990, David Chaum proposed a refined version named ecash that used not only blinded signatures but also some private identification data to craft a message that was then sent to the bank. This scheme allowed detection of double spending but did not prevent it. If the same token was used at two different locations, then the identity of the double spender would be revealed. ecash could only represent a fixed amount of money.

Adam Back's hashcash, introduced in 1997, was originally proposed to thwart email spam. The idea behind hashcash is to solve a computational puzzle that is easy to verify but comparatively difficult to compute. The idea is that for a single user and a single email the extra computational effort is not noticeable, but someone sending a large number of spam emails would be discouraged, as the time and resources required to run the spam campaign would increase substantially.

B-money was proposed by Wei Dai in 1998 and introduced the idea of using proof of work to create money. A major weakness in the system was that an adversary with higher computational power could generate unsolicited money without giving the network the chance to adjust to an appropriate difficulty level. The system lacked details on the consensus mechanism between nodes, and some security issues, such as Sybil attacks, were also not addressed. At the same time, Nick Szabo introduced the concept of bit gold, which was also based on a proof of work mechanism but had the same problems as b-money, with one exception: the network difficulty level was adjustable.

Tomas Sander and Amnon Ta-Shma introduced an ecash scheme in 1999 that for the first time used merkle trees to represent coins and zero knowledge proofs to prove possession of coins. In this scheme, a central bank was required that kept a record of all used serial numbers.
This scheme allowed users to be fully anonymous, albeit at some computational cost. RPOW (Reusable Proof of Work), introduced by Hal Finney in 2004, used the hashcash scheme by Adam Back as proof of the computational resources spent to create the money. This was also a central system that kept a central database to keep track of all used PoW tokens. It was an online system that used remote attestation, made possible by a trusted computing platform (TPM hardware).

All the above-mentioned schemes were intelligently designed but were weak in one aspect or another. In particular, all these schemes rely on a central server which is required to be trusted by the users.

Bitcoin

In 2008, the bitcoin paper Bitcoin: A Peer-to-Peer Electronic Cash System was written by Satoshi Nakamoto. The first key idea introduced in the paper is that it is a purely peer-to-peer electronic cash that does not need an intermediary bank to transfer payments between peers. Bitcoin is built on decades of cryptographic research, such as the research on merkle trees, hash functions, public key cryptography, and digital signatures. Moreover, ideas like bit gold, b-money, hashcash, and cryptographic time stamping have provided the foundations for the invention of bitcoin. All these technologies are cleverly combined in bitcoin to create the world's first decentralized currency. The key issue that has been addressed in bitcoin is an elegant solution to the Byzantine Generals problem, along with a practical solution to the double spend problem.

The value of bitcoin has increased significantly since 2011, as shown in the graph below:

Bitcoin price and volume since 2012 (on a logarithmic scale)

Regulation of bitcoin is a controversial subject, and as much as it is a libertarian's dream, law enforcement agencies and governments are proposing various regulations to control it, such as the BitLicense issued by New York's State Department of Financial Services. This is a license issued to businesses which perform activities related to virtual currencies.

The growth of bitcoin is also due to the so-called network effect. Also called demand-side economies of scale, it is a concept which basically means that the more users use the network, the more valuable it becomes. Over time, an exponential increase has been seen in bitcoin network growth. Even though the price of bitcoin is quite volatile, it has increased significantly over the last few years. Currently (at the time of writing) the bitcoin price is 815 GBP.

Bitcoin definition

Bitcoin can be defined in various ways: it's a protocol, a digital currency, and a platform. It is a combination of a peer-to-peer network, protocols, and software that facilitate the creation and usage of the digital currency named bitcoin. Note that Bitcoin with a capital B is used to refer to the Bitcoin protocol, whereas bitcoin with a lowercase b is used to refer to bitcoin, the currency. Nodes in this peer-to-peer network talk to each other using the Bitcoin protocol.

Decentralization of currency was made possible for the first time with the invention of Bitcoin. Moreover, the double spending problem was solved in an elegant and ingenious way in bitcoin. The double spending problem arises when, for example, a user sends the same coins to two different users at the same time and both are verified independently as valid transactions.

Keys and addresses

Elliptic curve cryptography is used to generate public and private key pairs in the Bitcoin network. The Bitcoin address is created by taking the public key corresponding to a private key and hashing it twice, first with the SHA256 algorithm and then with RIPEMD160.
The resultant 160-bit hash is then prefixed with a version number and finally encoded with the Base58Check encoding scheme. Bitcoin addresses are 26 to 35 characters long and begin with the digit 1 or 3. A typical bitcoin address looks like the string shown as follows:

1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt

This is also commonly encoded in a QR code for easy sharing. The QR code of the preceding address is as follows:

QR code of a bitcoin address 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt

There are currently two types of addresses, the commonly used P2PKH type and the P2SH type, starting with 1 and 3, respectively. In the early days, bitcoin used direct Pay-to-Pubkey, which is now superseded by P2PKH. However, direct Pay-to-Pubkey is still used in bitcoin for coinbase addresses. Addresses should not be used more than once; otherwise privacy and security issues can arise. Avoiding address reuse circumvents anonymity issues to some extent, but bitcoin has some other security issues as well, such as transaction malleability, which requires a different approach to resolve.

Private key and bitcoin address in a paper wallet, from bitaddress.org

Public keys in bitcoin

In public key cryptography, public keys are generated from private keys. Bitcoin uses ECC based on the SECP256K1 standard. A private key is randomly selected and is 256 bits in length. Public keys can be presented in uncompressed or compressed format. Public keys are basically x and y coordinates on an elliptic curve, and in uncompressed format they are presented with a prefix of 04 in hexadecimal format. The x and y coordinates are both 32 bytes in length. In total, the compressed public key is 33 bytes long, as compared to 65 bytes in the uncompressed format. The compressed version of a public key basically includes only the x part, since the y part can be derived from it. The bitcoin client initially used uncompressed keys, but starting from Bitcoin Core client 0.6, compressed keys are used as standard. Keys are identified by various prefixes, described as follows:

Uncompressed public keys use 0x04 as a prefix.
A compressed public key starts with 0x03 if the 32-byte y part of the public key is odd.
A compressed public key starts with 0x02 if the 32-byte y part of the public key is even.

A more mathematical description, and the reason why this works, is described later. If the ECC graph is visualized, it reveals that the y coordinate can be either below the x-axis or above the x-axis, and as the curve is symmetric, only its location in the prime field is required to be stored.

Private keys in bitcoin

Private keys are basically 256-bit numbers chosen in the range specified by the SECP256K1 ECDSA recommendation. Any randomly chosen 256-bit number from 0x1 to 0xFFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFE BAAE DCE6 AF48 A03B BFD2 5E8C D036 4140 is a valid private key. Private keys are usually encoded using the Wallet Import Format (WIF) in order to make them easier to copy and use. WIF can be converted to a private key and vice versa; the steps are described later. Also, the Mini Private Key Format is sometimes used to encode the key in under 30 characters, to allow storage where physical space is limited, for example, etching on physical coins or in damage-resistant QR codes. The Bitcoin Core client also allows encryption of the wallet that contains the private keys.

Bitcoin currency units

Bitcoin currency units are described as follows; the smallest bitcoin denomination is the satoshi.
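Before looking at Base58Check encoding more closely, the address derivation just described (SHA256, then RIPEMD160, a version prefix, and finally Base58Check) can be sketched in a few lines of Python 3. This is purely an illustrative sketch, not code from the book: it assumes the public key is supplied as a hex string and that your OpenSSL build exposes RIPEMD160 through hashlib:

import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check(payload):
    # append a 4-byte checksum: the first 4 bytes of SHA256(SHA256(payload))
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    data = payload + checksum
    # convert the byte string to a base-58 string
    n = int.from_bytes(data, "big")
    encoded = ""
    while n > 0:
        n, r = divmod(n, 58)
        encoded = BASE58_ALPHABET[r] + encoded
    # each leading zero byte is represented by a leading '1'
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + encoded

def public_key_to_address(pubkey_hex, version=b"\x00"):
    # hash160 = RIPEMD160(SHA256(public key)); version byte 0x00 yields a
    # P2PKH address starting with the digit 1
    sha = hashlib.sha256(bytes.fromhex(pubkey_hex)).digest()
    h160 = hashlib.new("ripemd160", sha).digest()
    return base58check(version + h160)

P2SH addresses use the version byte 0x05 instead, which is why they begin with the digit 3.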
Base58Check encoding

This encoding is used to limit the confusion between various characters, such as 0OIl, as they can look the same in different fonts. The encoding basically takes a binary byte array and converts it into a human-readable string. This string is composed by utilizing a set of 58 alphanumeric symbols. More explanation and logic can be found in the base58.h source file in the bitcoin source code:

Explanation from the bitcoin source code

Bitcoin addresses are encoded using Base58Check encoding.

Vanity addresses

As bitcoin addresses are based on base-58 encoding, it is possible to generate addresses that contain human-readable messages. An example is shown as follows:

Public address encoded in a QR code

Vanity addresses are generated by using a purely brute force method. An example is shown as follows:

Vanity address generated from https://bitcoinvanitygen.com/

Transactions

Transactions are at the core of the bitcoin ecosystem. Transactions can be as simple as just sending some bitcoins to a bitcoin address, or they can be quite complex, depending on the requirements. Each transaction is composed of at least one input and one output. Inputs can be thought of as coins being spent that were created in a previous transaction, and outputs as coins being created. If a transaction is minting new coins, then there is no input and therefore no signature is needed. If a transaction is to send coins to some other user (a bitcoin address), then it needs to be signed by the sender with their private key, and a reference is also required to the previous transaction to show the origin of the coins. Coins are, in fact, unspent transaction outputs represented in satoshis. Transactions are not encrypted and are publicly visible in the blockchain. Blocks are made up of transactions, and these can be viewed by using any online blockchain explorer.

Transaction life cycle

1. A user/sender sends a transaction using wallet software or some other interface.
2. The wallet software signs the transaction using the sender's private key.
3. The transaction is broadcast to the Bitcoin network using a flooding algorithm.
4. Mining nodes include this transaction in the next block to be mined.
5. Mining starts, and once a miner solves the Proof of Work problem, it broadcasts the newly mined block to the network. Proof of Work is explained in detail later.
6. The nodes verify the block and propagate it further, and confirmations start to generate.
7. Finally, the confirmations start to appear in the receiver's wallet, and after approximately six confirmations the transaction is considered finalized and confirmed. However, six is just a recommended number; the transaction can be considered final even after the first confirmation. The key idea behind waiting for six confirmations is that the probability of double spending is virtually eliminated after six confirmations.

Transaction structure

A transaction, at a high level, contains metadata, inputs, and outputs. Transactions are combined together to create a block. The transaction structure is shown in the following table:

Version number (4 bytes): Used to specify the rules to be used by the miners and nodes for transaction processing.
Input counter (1 to 9 bytes): The number of inputs included in the transaction.
List of inputs (variable): Specifies one or more transaction inputs. Each input is composed of several fields, including the previous transaction hash, the previous Txout-index, the Txin-script length, the Txin-script, and an optional sequence number. The first transaction in a block is also called the coinbase transaction.
Out-counter (1 to 9 bytes): A positive integer representing the number of outputs.
List of outputs (variable): The outputs included in the transaction.
lock_time (4 bytes): Defines the earliest time when a transaction becomes valid. It is either a Unix timestamp or a block number.

Metadata: This part of the transaction contains some values like the size of the transaction, the number of inputs and outputs, the hash of the transaction, and a lock_time field. Every transaction has a prefix specifying the version number.

Inputs: Generally, each input spends a previous output. Each output is considered a UTXO (unspent transaction output) until an input consumes it.

Outputs: Outputs have only two fields, and they contain instructions for sending bitcoins. The first field contains the amount in satoshis, whereas the second field is a locking script which contains the conditions that need to be met in order for the output to be spent. More information about transaction spending by using locking and unlocking scripts and producing outputs is discussed later.

Verification: Verification is performed by using Bitcoin's scripting language.

Summary

In this article, we learned about the importance of bitcoin as a digital currency, and how bitcoin keys and addresses are encoded using various encoding techniques.

Resources for Article:

Further resources on this subject:
Bitcoins – Pools and Mining [article]
Protecting Your Bitcoins [article]
FPGA Mining [article]

Context – Understanding your Data using R

Packt
07 Feb 2017
32 min read
In this article by James D Miller, the author of the book Big Data Visualization, we will explore the idea of adding context to the data you are working with. Specifically, we'll discuss the importance of establishing data context, as well as the practice of profiling your data for context discovery and how big data affects this effort.

The article is organized into the following main sections:

Adding Context
About R
R and Big Data
R Example 1 -- healthcare data
R Example 2 -- healthcare data

(For more resources related to this topic, see here.)

When writing a book, authors leave context clues for their readers. A context clue is a "source of information" about written content that may be difficult or unique, and that helps readers understand. This information offers insight into the content being read or consumed (an example might be: "It was an idyllic day; sunny, warm and perfect…"). With data, context clues should be developed through a process referred to as profiling (we'll discuss profiling in more detail later in this article), so that the data consumer can better understand the data when it is visualized. (Additionally, having context and perspective on the data you are working with is a vital step in determining what kind of data visualization should be created). Context or profiling examples might be calculating the average age of "patients" or subjects within the data, or "segmenting the data into time periods" (years or months, usually).

Another motive for adding context to data might be to gain a new perspective on the data. An example of this might be recognizing and examining a comparison present in the data. For example, body fat percentages of urban high school seniors could be compared to those of rural high school seniors.

Adding context to your data before creating visualizations can certainly make it (the data visualization) more relevant, but context still can't serve as a substitute for value. Before you consider any factors such as time of day, geographic location, or average age, first and foremost, your data visualization needs to benefit those who are going to consume it, so establishing appropriate context requirements will be critical. For data profiling (adding context), the rule is:

Before Context, Think → Value

Generally speaking, there are several visualization contextual categories which can be used to augment or increase the value and understanding of data for visualization. These include:

Definitions and explanations
Comparisons
Contrasts
Tendencies
Dispersion

Definitions and explanations

This is providing additional information or "attributes" about a data point. For example, if the data contains a field named "patient ID" and we come to know that records describe individual patients, we may choose to calculate and add each individual patient's BMI, or body mass index:

Comparisons

This is adding a comparable value to a particular data point. For example, you might compute and add a national ranking to each "total by state":

Contrasts

This is almost like adding an "opposite" to a data point to see if it perhaps determines a different perspective. An example might be reviewing average body weights for patients who consume alcoholic beverages versus those who do not consume alcoholic beverages:

Tendencies

These are the "typical" mathematical calculations (or summaries) on the data as a whole or by another category within the data, such as the mean, median, and mode.
For example, you might add a median heart rate for the age group each patient in the data is a member of:

Dispersion

Again, these are mathematical calculations (or summaries), such as range, variance, and standard deviation, but they describe the spread of a data set (or a group within the data) around its average. For example, you may want to add the "range" for a selected value, such as the minimum and maximum number of hospital stays found in the data for each patient age group:

The "art" of profiling data to add context and identify new and interesting perspectives for visualization is still ever evolving; no doubt there are additional contextual categories existing today that can be investigated as you continue your work with big data visualization projects.

Adding Context

So, how do we add context to data? Is it merely a matter of selecting Insert, then Data Context? No, it's not that easy (but it's not impossible either). Once you have identified (or "pulled together") your big data source (or at least a significant amount of data), how do you go from mountains of raw big data to summarizations that can be used as input to create valuable data visualizations, helping you to further analyze that data and support your conclusions? The answer is through data profiling.

Data profiling involves logically "getting to know" the data you think you may want to visualize – through query, experimentation, and review. Following the profiling process, you can then use the information you have collected to add context (and/or apply new "perspectives") to the data. Adding context to data requires the manipulation of that data, perhaps reformatting it, adding calculations, aggregations, or additional columns, re-ordering it, and so on. Finally, you will be ready to visualize (or "picture") your data. The complete profiling process is shown below:

Pull together (the data, or enough of the data)
Profile (the data, through query, experimentation, and review)
Add Perspective(s) (or context), and finally…
Picture (visualize) the data

About R

R is a language and environment that is easy to learn, very flexible in nature, and also very focused on statistical computing, making it great for manipulating, cleaning, summarizing, producing probability statistics, and so on (as well as actually creating visualizations with your data), so it's a great choice for the exercises required for profiling, establishing context, and identifying additional perspectives. In addition, here are a few more reasons to use R when profiling your big data:

R is used by a large number of academic statisticians – so it's a tool that is not "going away"
R is pretty much platform independent – what you develop will run almost anywhere
R has awesome help resources – just Google it; you'll see!

R and Big Data

Although R is free (open source), super flexible, and feature rich, you must keep in mind that R keeps everything in your machine's memory, and this can become problematic when you are working with big data (even with the low resource costs of today). Thankfully, though, there are various options and strategies to work with this limitation, such as employing a sort of "pseudo-sampling" technique, which we will expound on later in this article (as part of some of the examples provided). Additionally, R libraries have been developed that can leverage hard drive space (as a sort of virtual extension to your machine's memory), again touched on in this article's examples.
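To make the memory limitation concrete, here is a minimal base-R sketch (it is not one of this article's scripts) of a chunked read that keeps only a running summary in memory rather than the whole file. It reuses the sample survey file path from Example 1 below, assumes the fields contain no embedded commas, and tallies hospital visits by sex, the same count performed later in the article:

# --- open a connection to the survey file (path reused from Example 1 below)
con <- file("C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", open = "r")
chunkSize <- 100000                      # lines per read; tune to available memory
sexCounts <- c(Female = 0, Male = 0)     # running tally of hospital visits by sex
repeat {
  lines <- readLines(con, n = chunkSize)
  if (length(lines) == 0) break          # end of file reached
  # --- split each line on commas and keep just the 3rd field (sex)
  sex <- sapply(strsplit(lines, ","), `[`, 3)
  sexCounts["Female"] <- sexCounts["Female"] + sum(sex == "Female")
  sexCounts["Male"]   <- sexCounts["Male"]   + sum(sex == "Male")
}
close(con)
sexCounts

Only one chunk of raw text is ever held in memory at a time; the examples that follow assume a file small enough to load whole, but the same counting logic applies either way.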
Example 1

In this article's first example we'll use data collected from a theoretical hospital where, upon admission, patient medical history information is collected through an online survey. Information is also added to a "patients file" as treatment is provided. The file includes many fields, including:

Basic descriptive data for the patient, such as sex, date of birth, height, weight, blood type, and so on
Vital statistics, such as blood pressure, heart rate, and so on
Medical history, such as number of hospital visits, surgeries, major illnesses or conditions, whether currently under a doctor's care, and so on
Demographic statistics, such as occupation, home state, educational background, and so on

Some additional information is also collected in the file in an attempt to develop patient characteristics and habits, such as the number of times the patient included beef, pork, and fowl in their weekly diet, or whether they typically use a butter replacement product, and so on. Periodically, the data is "dumped" to text files, which are comma-delimited and contain the following fields (in this order):

Patientid, recorddate_month, recorddate_day, recorddate_year, sex, age, weight, height, no_hospital_visits, heartrate, state, relationship, Insured, Bloodtype, blood_pressure, Education, DOBMonth, DOBDay, DOBYear, current_smoker, current_drinker, currently_on_medications, known_allergies, currently_under_doctors_care, ever_operated_on, occupation, Heart_attack, Rheumatic_Fever, Heart_murmur, Diseases_of_the_arteries, Varicose_veins, Arthritis, abnormal_bloodsugar, Phlebitis, Dizziness_fainting, Epilepsy_seizures, Stroke, Diphtheria, Scarlet_Fever, Infectious_mononucleosis, Nervous_emotional_problems, Anemia, hyroid_problems, Pneumonia, Bronchitis, Asthma, Abnormal_chest_Xray, lung_disease, Injuries_back_arms_legs_joints_Broken_bones, Jaundice_gallbladder_problems, Father_alive, Father_current_age, Fathers_general_health, Fathers_reason_poor_health, Fathersdeceased_age_death, mother_alive, Mother_current_age, Mother_general_health, Mothers_reason_poor_health, Mothers_deceased_age_death, No_of_brothers, No_of_sisters, age_range, siblings_health_problems, Heart_attacks_under_50, Strokes_under_50, High_blood_pressure, Elevated_cholesterol, Diabetes, Asthma_hayfever, Congenital_heart_disease, Heart_operations, Glaucoma, ever_smoked_cigs, cigars_or_pipes, no_cigs_day, no_cigars_day, no_pipefuls_day, if_stopped_smoking_when_was_it, if_still_smoke_how_long_ago_start, target_weight, most_ever_weighed, 1_year_ago_weight, age_21_weight, No_of_meals_eatten_per_day, No_of_times_per_week_eat_beef, No_of_times_per_week_eat_pork, No_of_times_per_week_eat_fish, No_of_times_per_week_eat_fowl, No_of_times_per_week_eat_desserts, No_of_times_per_week_eat_fried_foods, No_servings_per_week_wholemilk, No_servings_per_week_2%_milk, No_servings_per_week_tea, No_servings_per_week_buttermilk, No_servings_per_week_1%_milk, No_servings_per_week_regular_or_diet_soda, No_servings_per_week_skim_milk, No_servings_per_week_coffee, No_servings_per_week_water, beer_intake, wine_intake, liquor_intake, use_butter, use_extra_sugar, use_extra_salt, different_diet_weekends, activity_level, sexually_active, vision_problems, wear_glasses

Following is the image showing a portion of the file (displayed in MS Windows Notepad):

Assuming we have been given no further information about the data, other than the provided field name list and the knowledge that the data is captured by hospital personnel upon patient admission, the next step would be to perform some sort of profiling of the data:
investigating to start understanding the data and then to start adding context and perspectives (so that ultimately we can create some visualizations). Initially, we start out by looking through the field or column names in our file, and some ideas start to come to mind. For example:

What is the data time frame we are dealing with? Using the field record date, can we establish a period of time (or time frame) for the data? (In other words, over what period of time was this data captured?)
Can we start "grouping the data" using fields such as sex, age, and state?

Eventually, what we should be asking is, "what can we learn from visualizing the data?" Perhaps:

What is the breakdown of those currently smoking by age group?
What is the ratio of those currently smoking to the number of hospital visits?
Do those patients currently under a doctor's care, on average, have better BMI ratios?
And so on.

Dig in with R

Using the power of R programming, we can run various queries on the data, noting that the results of those queries may spawn additional questions and queries and, eventually, yield data ready for visualizing. Let's start with a few simple profile queries.

I always start my data profiling by "time boxing" the data. The following R script (although, as mentioned earlier, there are many ways to accomplish the same objective) works well for this:

# --- read our file into a temporary R table
tmpRTable4TimeBox<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", sep=",")
# --- convert to an R data frame and filter it to just include
# --- the 2nd column or field of data
data.df <- data.frame(tmpRTable4TimeBox)
data.df <- data.df[,2]
# --- provides a sorted list of the years in the file
YearsInData = substr(substr(data.df[],(regexpr('/',data.df[])+1),11),( regexpr('/',substr(data.df[],(regexpr('/',data.df[])+1),11))+1),11)
# --- write a new file named ListofYears
write.csv(sort(unique(YearsInData)),file="C:/Big Data Visualization/Chapter 3/ListofYears.txt",quote = FALSE, row.names = FALSE)

The above simple R script provides a sorted list file (ListofYears.txt), shown below, containing the years found in the data we are profiling:

Now we can see that our patient survey data was collected during the years 1999 through 2016, and with this information we start to add context to (or gain a perspective on) our data. We could further time-box the data by perhaps breaking the years into months (we will do this later on in this article), but let's move on now to some basic "grouping profiling". Assuming that each record in our data represents a unique hospital visit, how can we determine the number of hospital visits (the number of records) by sex, age, and state?

Here I will point out that it may be worthwhile establishing the size (the number of rows or records; we already know the number of columns or fields) of the file you are working with. This is important, since the size of the data file will dictate the programming or scripting approach you will need to use during your profiling. Simple R functions valuable to know are nrow and head. These simple commands can be used to count the total rows in a file:

nrow(mydata)

Or to view the first number of rows of data:

head(mydata, n=10)

So, using R, one could write a script to load the data into a table, convert it to a data frame, and then read through all the records in the file and "count up" or "tally" the number of hospital visits (the number of records) for males and females.
Such logic is a snap to write:

# --- assuming tmpRTable holds the data already
datas.df<-data.frame(tmpRTable)
# --- initialize 2 counter variables
NumberMaleVisits <-0;NumberFemaleVisits <-0
# --- read through the data
for(i in 1:nrow(datas.df))
{
if (datas.df[i,3] == 'Male') {NumberMaleVisits <- NumberMaleVisits + 1}
if (datas.df[i,3] == 'Female') {NumberFemaleVisits <- NumberFemaleVisits + 1}
}
# --- show me the totals
NumberMaleVisits
NumberFemaleVisits

The previous script works, but in a big data scenario there is a more efficient way, since reading or "looping through" and counting each record will take far too long. Thankfully, R provides the table function, which can be used similarly to the SQL "group by" command. The following script assumes that our data is already in an R data frame (named datas.df), so, using the sequence number of the field in the file, if we want to see the number of hospital visits for males and the number of hospital visits for females we can write:

# --- using R table function as "group by" field number
# --- patient sex is the 3rd field in the file
table(datas.df[,3])

Following is the output generated from running the above script. Notice that R shows "sex" with a count of 1, since the script included the file's "header record" as a unique value:

We can also establish the number of hospital visits by state (state is the 9th field in the file):

table(datas.df[,9])

Age (or the fourth field in the file) can also be studied using the R functions sort and table:

sort(table(datas.df[,4]))

Note that since there are quite a few more values for age within the file, I've sorted the output using the R sort function.

Moving on now, let's see if there is a difference between the number of hospital visits for patients who are current smokers (field name current_smoker, which is field number 16 in the file) and those indicating that they are not current smokers. We can use the same R scripting logic:

sort(table(datas.df[16]))

Surprisingly (one might think), it appears from our profiling that those patients who currently do not smoke have had more hospital visits (113,681) than those who currently are smokers (12,561):

Another interesting R script to continue profiling our data might be:

table(datas.df[,3],datas.df[,16])

The above script again uses the R table function to group data, but shows how we can "group within a group"; in other words, using this script we can get totals for "current" and "non-current" smokers, grouped by sex. In the below image we see that the difference between female smokers and male smokers might be considered marginal:

So we see that by using the above simple R script examples, we've been able to add some context to our healthcare survey data. By reviewing the list of fields provided in the file we can come up with the R profiling queries shown (and many others) without much effort. We will continue with some more complex profiling in the next section, but for now, let's use R to create a few data visualizations based upon what we've learned so far through our profiling.

Going back to the number of hospital visits by sex, we can use the R function barplot to create a visualization of visits by sex. But first, a couple of "helpful hints" for creating the script. First, rather than using the table function, you can use the ftable function, which creates a "flat" version of the original function's output. This makes it easier to exclude the header record count of 1 that comes back from the table function.
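As a side note (a small sketch that is not part of the original scripts), the header record can also be kept out of the counts at load time by telling read.table that the first line of the file holds column names; the grouping can then refer to those names directly, assuming the header matches the field list given earlier:

# --- read the file treating the first line as column names
datas.df <- read.table(file = "C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", sep = ",", header = TRUE, stringsAsFactors = FALSE)
# --- the header line is no longer counted, and columns can be addressed by name
table(datas.df$sex)
table(datas.df$sex, datas.df$current_smoker)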
Next, we can leverage some additional arguments of the barplot function, like col, border, and names.arg, along with the title function, to make the visualization a little nicer to look at. Below is the script:

# --- use ftable function to drop out the header record
forChart<- ftable(datas.df[,3])
# --- create bar names
barnames<-c("Female","Male")
# --- use barplot to draw bar visual
barplot(forChart[2:3], col = "brown1", border = TRUE, names.arg = barnames)
# --- add a title
title(main = list("Hospital Visits by Sex", font = 4))

The script's output (our visualization) is shown below:

We could follow the same logic for creating a similar visualization of hospital visits by state:

st<-ftable(datas.df[,9])
barplot(st)
title(main = list("Hospital Visits by State", font = 2))

But the visualization generated isn't very clear:

One can always experiment a bit more with this data to make the visualization a little more interesting. Using the R functions substr and regexpr, we can create an R data frame that contains a record for each hospital visit by state within each year in the file. Then we can use the function plot (rather than barplot) to generate the visualization. Below is the R script:

# --- create a data frame from our original table file
datas.df <- data.frame(tmpRTable)
# --- create a filtered data frame of records from the file
# --- using the record year and state fields from the file
dats.df<-data.frame(substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),( regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11),datas.df[,9])
# --- plot to show a visualization
plot(sort(table(dats.df[2]),decreasing = TRUE),type="o", col="blue")
title(main = list("Hospital Visits by State (Highest to Lowest)", font = 2))

Here is the different (perhaps more interesting) version of the visualization generated by the previous script:

Another earlier perspective on the data was concerning age. We grouped the hospital visits by the age of the patients (using the R table function). Since there are many different patient ages, a common practice is to establish age ranges, such as the following:

21 and under
22 to 34
35 to 44
45 to 54
55 to 64
65 and over

To implement the previous age ranges, we need to organize the data, and could use the following R script:

# --- initialize age range counters
a1 <-0;a2 <-0;a3 <-0;a4 <-0;a5 <-0;a6 <-0
# --- read and count visits by age range
for(i in 2:nrow(datas.df))
{
if (as.numeric(datas.df[i,4]) < 22) {a1 <- a1 + 1}
if (as.numeric(datas.df[i,4]) > 21 & as.numeric(datas.df[i,4]) < 35) {a2 <- a2 + 1}
if (as.numeric(datas.df[i,4]) > 34 & as.numeric(datas.df[i,4]) < 45) {a3 <- a3 + 1}
if (as.numeric(datas.df[i,4]) > 44 & as.numeric(datas.df[i,4]) < 55) {a4 <- a4 + 1}
if (as.numeric(datas.df[i,4]) > 54 & as.numeric(datas.df[i,4]) < 65) {a5 <- a5 + 1}
if (as.numeric(datas.df[i,4]) > 64) {a6 <- a6 + 1}
}

Big Data Note: Looping or reading through each of the records in our file isn't very practical if there are a trillion records. Later in this article we'll use a much better approach, but for now we will assume a smaller file size for convenience.
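One loop-free alternative (shown here only as a sketch; it is not the approach used in the remainder of this article) is to let R bin the ages in a single vectorized step using the cut and table functions; a header record, if present, simply becomes NA and is dropped from the counts:

# --- bin the age field (column 4) into the ranges listed above, without a loop
ageRanges <- cut(as.numeric(datas.df[, 4]),
                 breaks = c(-Inf, 21, 34, 44, 54, 64, Inf),
                 labels = c("under 21", "22-34", "35-44", "45-54", "55-64", "65 & over"))
table(ageRanges)

The counters a1 through a6 produced by the loop above are what the pie chart in the next step uses.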
Once the counting script above has been run, we can use the R pie function and the following code to create our pie chart visualization:

# --- create Pie Chart
slices <- c(a1, a2, a3, a4, a5, a6)
lbls <- c("under 21", "22-34","35-44","45-54","55-64", "65 & over")
pie(slices, labels = lbls, main="Hospital Visits by Age Range")

Following is the generated visualization:

Finally, earlier in this section we looked at the values in field 16 of our file, which indicates whether the survey patient is a current smoker. We could build a simple visual showing the totals, but (again) the visualization isn't very interesting or all that informative. With some simple R scripts, we can proceed to create a visualization showing the number of hospital visits, year over year, by those patients that are current smokers.

First, we can "reformat" the data in our R data frame (named datas.df) to store only the year (of the record date) using the R function substr. This makes it a little easier to aggregate the data by year, as shown in the next steps. The R script using the substr function is shown below:

# --- redefine the record date field to hold just the record
# --- year value
datas.df[,2]<-substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),( regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11)

Next, we can create an R table named c to hold the record date year and the totals (of non-smokers and current smokers) for each year. Following is the R script used:

# --- create a table holding record year and total count for
# --- smokers and not smoking
c<-table(datas.df[,2],datas.df[,16])

Finally, we can use the R barplot function to create our visualization. Again, there is more than likely a cleverer way to set up the objects bars and lbls, but for now, I simply hand-coded the years' data I wanted to see in my visualization:

# --- set up the values to chart and the labels for each bar
# --- in the chart
bars<-c(c[2,3], c[3,3], c[4,3],c[5,3],c[6,3],c[7,3],c[8,3],c[9,3],c[10,3],c[11,3],c[12,3],c[13,3])
lbls<-c("99","00","01","02","03","04","05","06","07","08","09","10")

Now the R script to actually produce the bar chart visualization is shown below:

# --- create the bar chart
barplot(bars, names.arg=lbls, col="red")
title(main = list("Smoking Patients Year to Year", font = 2))

Below is the generated visualization:

Example 2

In the above examples, we've presented some pretty basic and straightforward data profiling exercises. Typically, once you've become somewhat familiar with your data, having added some context (through some basic profiling), one would extend the profiling process, trying to look at the data in additional ways using techniques such as those mentioned in the beginning of this article: defining new data points based upon the existing data, performing comparisons, looking at contrasts (between data points), identifying tendencies, and using dispersions to establish the variability of the data. Let's now review some of these options for extended profiling using simple examples, as well as the same source data as was used in the previous section's examples.

Definitions & Explanations

One method of extending your data profiling is to "add to" the existing data by creating additional definition or explanatory "attributes" (in other words, adding new fields to the file). This means that you use existing data points found in the data to create (hopefully new and interesting) perspectives on the data.
In the data used in this article, a thought-provoking example might be to use the existing patient information (such as the patient's weight and height) to calculate a new point of data: body mass index (BMI) information. A generally accepted formula for calculating a patient's body mass index is:

BMI = (weight (lbs.) / (height (in))^2) x 703

For example: (165 lbs.) / (70^2) x 703 = 23.67 BMI.

Using the above formula, we can use the following R script (assuming we've already loaded the R object named tmpRTable with our file data) to generate a new file of BMI percentages and state names:

j=1
for(i in 2:nrow(tmpRTable)) {
W<-as.numeric(as.character(tmpRTable[i,5]))
H<-as.numeric(as.character(tmpRTable[i,6]))
P<-(W/(H^2)*703)
datas2.df[j,1]<-format(P,digits=3)
datas2.df[j,2]<-tmpRTable[i,9]
j=j+1
}
write.csv(datas2.df[1:j-1,1:2],file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the generated file:

Now we have a new file of BMI percentages by state (one BMI record for each hospital visit in each state). Earlier in this article we touched on the concept of looping or reading through all of the records in a file or data source and creating counts based on various field or column values. Such logic works fine for medium or smaller files, but a much better approach (especially with big data files) would be to use the power of various R commands.

No Looping

Although the above-described R script does work, it requires looping through each record in our file, which is slow and inefficient to say the least. So, let's consider a better approach. Again, assuming we've already loaded the R object named tmpRTable with our data, the below R script can accomplish the same results (create the same file) in just 2 lines:

PDQ<-paste(format((as.numeric(as.character(tmpRTable[,5]))/(as.numeric(as.character(tmpRTable[,6]))^2)*703),digits=2),',',tmpRTable[,9],sep="")
write.csv(PDQ,file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE,row.names = FALSE)

We could now use this file (or one similar) as input to additional profiling exercises or to create a visualization, but let's move on.

Comparisons

Performing comparisons during data profiling can also add new and different perspectives to the data. Beyond simple record counts (like total smoking patients visiting a hospital versus total non-smoking patients visiting a hospital), one might want to compare the total number of hospital visits for each state to the average number of hospital visits for a state. This would require calculating the total number of hospital visits by state as well as the total number of hospital visits overall (then computing the average).
The following 2 lines of code use the R functions table and write.csv to create a list (a file) of the total number of hospital visits found for each state:

# --- calculates the number of hospital visits for each
# --- state (state ID is in field 9 of the file)
StateVisitCount<-table(datas.df[9])
# --- write out a csv file of counts by state
write.csv(StateVisitCount, file="C:/Big Data Visualization/Chapter 3/visitsByStateName.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the file that is generated:

The following R command can be used to calculate the average number of hospital visits per state, by using the nrow function to obtain a count of records in the data source and then dividing it by the number of states:

# --- calculate the average
averageVisits<-nrow(datas.df)/50

Going a bit further with this line of thinking, you might consider that the nine states the U.S. Census Bureau designates as the Northeast region are Connecticut, Maine, Massachusetts, New Hampshire, New York, New Jersey, Pennsylvania, Rhode Island, and Vermont. What is the total number of hospital visits recorded in our file for the Northeast region? R makes it simple with the subset function:

# --- use subset function and the "OR" operator to only have
# --- northeast region states in our list
NERVisits<-subset(tmpRTable, as.character(V9)=="Connecticut" | as.character(V9)=="Maine" | as.character(V9)=="Massachusetts" | as.character(V9)=="New Hampshire" | as.character(V9)=="New York" | as.character(V9)=="New Jersey" | as.character(V9)=="Pennsylvania" | as.character(V9)=="Rhode Island" | as.character(V9)=="Vermont")

Extending our scripting, we can add some additional queries to calculate the average number of hospital visits for the Northeast region and for the total country:

AvgNERVisits<-nrow(NERVisits)/9
averageVisits<-nrow(tmpRTable)/50

And let's add a visualization:

# --- the c object is the data for the barplot function to graph
c<-c(AvgNERVisits, averageVisits)
# --- use R barplot
barplot(c, ylim=c(0,3000), ylab="Average Visits", border="Black", names.arg = c("Northeast","all"))
title("Northeast Region vs Country")

The generated visualization is shown below:

Contrasts

The examination of contrasting data is another form of extending data profiling. For example, using this article's data, one could contrast the average body weight of patients that are under a doctor's care against the average body weight of patients that are not under a doctor's care (after calculating average body weights for each group).
To accomplish this, we can calculate the average weights for patients that fall into each category (those currently under a doctor's care and those not currently under a doctor's care), as well as for all patients, using the following R script:

# --- read in our entire file
tmpRTable<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt",sep=",")
# --- use the subset function to create the 2 groups we are
# --- interested in
UCare.sub<-subset(tmpRTable, V20=="Yes")
NUCare.sub<-subset(tmpRTable, V20=="No")
# --- use the mean function to get the average body weight of all patients
# --- in the file as well as for each of our separate groups
average_undercare<-mean(as.numeric(as.character(UCare.sub[,5])))
average_notundercare<-mean(as.numeric(as.character(NUCare.sub[,5])))
averageoverall<-mean(as.numeric(as.character(tmpRTable[2:nrow(tmpRTable),5])))
average_undercare;average_notundercare;averageoverall

In "short order", we can use R's ability to create subsets (using the subset function) of the data based upon values in a certain field (or column), then use the mean function to calculate the average patient weight for each group. The results from running the script (the calculated average weights) are shown below:

And if we use the calculated results to create a simple visualization:

# --- use R barplot to create the bar graph of
# --- average patient weight
c<-c(average_undercare, average_notundercare, averageoverall)
barplot(c, ylim=c(0,200), ylab="Patient Weight", border="Black", names.arg = c("under care","not under care", "all"), legend.text= c(format(c[1],digits=5),format(c[2],digits=5),format(c[3],digits=5)))
title("Average Patient Weight")

Tendencies

Identifying tendencies present within your data is also an interesting way of extending data profiling. For example, using this article's sample data, you might determine the number of servings of water consumed per week by each patient age group. Earlier in this section we created a simple R script to count visits by age group; it worked, but in a big data scenario this may not work. A better approach would be to categorize the data into the age groups (age is the fourth field or column in the file) using the following script:

# --- build subsets of each age group
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<65)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)

After we have our grouped data, we can calculate water consumption. For example, to count the total weekly servings of water (which is in field or column 96) for age group 1 we can use:

# --- field 96 in the file is the number of servings of water
# --- below line counts the total number of water servings for
# --- age group 1
sum(as.numeric(agegroup1[,96]))

Or the average number of servings of water for the same age group:

mean(as.numeric(agegroup1[,96]))

Take note that R requires the explicit conversion of the value of field 96 (even though it comes in the file as a number) to a number using the R function as.numeric. Now, let's create the visualization of this perspective of our data.
Below is the R script used to generate the visualization:

# --- group the data into age groups
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)
# --- calculate the averages by group
g1<-mean(as.numeric(agegroup1[,96]))
g2<-mean(as.numeric(agegroup2[,96]))
g3<-mean(as.numeric(agegroup3[,96]))
g4<-mean(as.numeric(agegroup4[,96]))
g5<-mean(as.numeric(agegroup5[,96]))
g6<-mean(as.numeric(agegroup6[,96]))
# --- create the visualization
barplot(c(g1,g2,g3,g4,g5,g6), axisnames=TRUE, names.arg = c("<21", "22-34", "35-44", "45-54", "55-64", ">65"))
title("Glasses of Water by Age Group")

The generated visualization is shown below:

Dispersion

Finally, dispersion is yet another method of extending data profiling. Dispersion measures how the selected elements behave with regard to some central tendency, usually the mean. For example, we might look at the total number of hospital visits for each age group, per calendar month, relative to the average number of hospital visits per month. For this example, we can use the R function subset in our scripts (to define the age groups and then group the hospital records by those age groups) as we did in the last example. Below is the script, showing the calculation for each group:

agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)

Remember, the previous scripts create subsets of the entire file (which we loaded into the object tmpRTable) and they contain all of the fields of the entire file. The agegroup1 group is partially displayed as follows:

Once we have our data categorized by age group (agegroup1 through agegroup6), we can then calculate a count of hospital stays by month for each group (shown in the following R commands). Note that the substr function is used to look at the month code (the first three characters of the record date) in the file, since we (for now) don't care about the year. The table function can then be used to create an array of counts by month.

az1<-table(substr(agegroup1[,2],1,3))
az2<-table(substr(agegroup2[,2],1,3))
az3<-table(substr(agegroup3[,2],1,3))
az4<-table(substr(agegroup4[,2],1,3))
az5<-table(substr(agegroup5[,2],1,3))
az6<-table(substr(agegroup6[,2],1,3))

Using the above month totals, we can then calculate the average number of hospital visits for each month using the R function mean; a compact way to do this for all twelve months at once is sketched below, followed by the explicit month-by-month version.
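Before writing each month out by hand, note that the twelve averages can also be computed in one pass. The following is a hedged sketch (it is not from the original article); it assumes the az1 through az6 tables built above and that every month abbreviation is present in each table:

# --- month abbreviations as they appear in the record dates
months <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
# --- collect the six age-group count tables in a list
groupCounts <- list(az1, az2, az3, az4, az5, az6)
# --- for each month, average the six per-group counts
monthlyAvg <- sapply(months, function(m) mean(sapply(groupCounts, function(g) as.numeric(g[m]))))
monthlyAvg

The same result is obtained by the explicit month-by-month code that follows.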
Written out explicitly, this is the mean of the monthly totals across ALL age groups; for January:

# --- wrap the six counts in c() so that mean() averages all of them
JanAvg<-mean(c(az1["Jan"], az2["Jan"], az3["Jan"], az4["Jan"], az5["Jan"], az6["Jan"]))

The same pattern can be repeated to calculate an average for each of the other months. Next we can calculate the totals for each month, for each age group:

Janag1<-az1["Jan"];Febag1<-az1["Feb"];Marag1<-az1["Mar"];Aprag1<-az1["Apr"];Mayag1<-az1["May"];Junag1<-az1["Jun"]
Julag1<-az1["Jul"];Augag1<-az1["Aug"];Sepag1<-az1["Sep"];Octag1<-az1["Oct"];Novag1<-az1["Nov"];Decag1<-az1["Dec"]

The following code "stacks" the totals so we can more easily visualize them later; we would have one line for each age group (that is, Group1Visits, Group2Visits, and so on):

Monthly_Visits<-c(JanAvg, FebAvg, MarAvg, AprAvg, MayAvg, JunAvg, JulAvg, AugAvg, SepAvg, OctAvg, NovAvg, DecAvg)
Group1Visits<-c(Janag1,Febag1,Marag1,Aprag1,Mayag1,Junag1,Julag1,Augag1,Sepag1,Octag1,Novag1,Decag1)
Group2Visits<-c(Janag2,Febag2,Marag2,Aprag2,Mayag2,Junag2,Julag2,Augag2,Sepag2,Octag2,Novag2,Decag2)

Finally, we can now create the visualization:

plot(Monthly_Visits, ylim=c(1000,4000))
lines(Group1Visits, type="b", col="red")
lines(Group2Visits, type="b", col="purple")
lines(Group3Visits, type="b", col="green")
lines(Group4Visits, type="b", col="yellow")
lines(Group5Visits, type="b", col="pink")
lines(Group6Visits, type="b", col="blue")
title("Hospital Visits", sub = "Month to Month", cex.main = 2, font.main= 4, col.main= "blue", cex.sub = 0.75, font.sub = 3, col.sub = "red")

and enjoy the generated output:

Summary

In this article we went over the idea and importance of establishing context, and of identifying perspectives on big data, using data profiling with R. Additionally, we introduced and explored the R programming language as an effective means to profile big data and used R in numerous illustrative examples. Once again, R is an extremely flexible and powerful tool that works well for data profiling, and the reader would be well served by researching and experimenting with the language's vast libraries available today, as we have only scratched the surface of the features currently available.

Resources for Article:
Further resources on this subject:
Introduction to R Programming Language and Statistical Environment [article]
Fast Data Manipulation with R [article]
DevOps Tools and Technologies [article]

Hierarchical Clustering

Packt
07 Feb 2017
6 min read
In this article by Atul Tripathi, author of the book Machine Learning Cookbook, we will cover hierarchical clustering with a World Bank sample dataset. (For more resources related to this topic, see here.)

Introduction

Hierarchical clustering is one of the most important methods in unsupervised learning. In hierarchical clustering, for a given set of data points, the output is produced in the form of a binary tree (dendrogram). In the binary tree, the leaves represent the data points while the internal nodes represent nested clusters of various sizes. Initially, each object is assigned to its own cluster. All the clusters are then evaluated against one another using a pairwise distance matrix constructed from the distance values. The pair of clusters with the shortest distance is identified, removed from the matrix, and merged together. The distances between the newly merged cluster and the remaining clusters are then evaluated and the distance matrix is updated. The process is repeated until the distance matrix is reduced to a single element.

Hierarchical clustering - World Bank sample dataset

One of the main goals for establishing the World Bank has been to fight and eliminate poverty. Continuously evolving and fine-tuning its policies in an ever-changing world has helped the institution pursue the goal of poverty elimination. Success in eliminating poverty is measured in terms of improvement in health, education, sanitation, infrastructure, and the other services needed to improve the lives of the poor. These development gains must be pursued in an environmentally, socially, and economically sustainable manner.

Getting ready

In order to perform hierarchical clustering, we shall be using a dataset collected from the World Bank.

Step 1 - Collecting and describing data

The dataset titled WBClust2013 shall be used. It is available in CSV format as WBClust2013.csv, and the dataset is in standard format. There are 80 rows of data and 14 variables. The numeric variables are:
new.forest
Rural
log.CO2
log.GNI
log.Energy.2011
LifeExp
Fertility
InfMort
log.Exports
log.Imports
CellPhone
RuralWater
Pop
The non-numeric variable is: Country

How to do it

Step 2 - exploring data

Version info: Code for this page was tested in R version 3.2.3 (2015-12-10).

Let's explore the data and understand the relationships among the variables. We'll begin by importing the CSV file named WBClust2013.csv, saving the data to the wbclust data frame:

> wbclust=read.csv("d:/WBClust2013.csv",header=T)

Next, we shall print the wbclust data frame. The head() function returns the first few rows of the wbclust data frame, which is passed as an input parameter:

> head(wbclust)

The results are as follows:

Step 3 - transforming data

Centering variables and creating z-scores are two common data analysis activities used to standardize data. The numeric variables mentioned above need to be converted to z-scores. The scale() function is a generic function whose default method centers and/or scales the columns of a numeric matrix. The wbclust data frame is passed to the scale function, with only the numeric fields considered. The result is then stored in wbnorm.

> wbnorm<-scale(wbclust[,2:13])
> wbnorm

The results are as follows:
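To see what scale() has done, it can help to compute the z-scores for a single column by hand and compare. This is a quick hedged sketch, not part of the original recipe; it assumes the wbclust and wbnorm objects created above and that the column was read in as numeric:

# --- z-scores by hand for the first numeric column (column 2 of wbclust)
x <- wbclust[, 2]
z <- (x - mean(x)) / sd(x)
# --- compare with the corresponding column produced by scale()
head(cbind(byHand = z, viaScale = wbnorm[, 1]))

The two columns should match, since scale() centers each column by its mean and divides it by its standard deviation.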
All data frames have a row names attribute. In order to retrieve or set the row or column names of a matrix-like object, the rownames() function is used. The first column of the wbclust data frame is passed to the rownames() function:

> rownames(wbnorm)=wbclust[,1]
> rownames(wbnorm)

The call to rownames(wbnorm) results in the display of the values from the first column. The results are as follows:

Step 4 - training and evaluating the model performance

The next step is about training the model. The first step is to calculate the distance matrix, for which the dist() function is used. Using the specified distance measure, distances between the rows of a data matrix are computed. The distance measure used can be Euclidean, maximum, Manhattan, Canberra, binary, or Minkowski; here the Euclidean distance is used. The Euclidean distance calculates the distance between two vectors as sqrt(sum((x_i - y_i)^2)). The result is then stored in a new object, dist1.

> dist1<-dist(wbnorm, method="euclidean")

The next step is to perform clustering using Ward's method. In order to perform cluster analysis on a set of dissimilarities of the n objects, the hclust() function is used. At the first stage, each of the objects is assigned to its own cluster. After that, at each stage the algorithm joins the two most similar clusters. This process continues until there is just a single cluster left. The hclust() function requires that we provide the data in the form of a distance matrix, so the dist1 object is passed. By default, the complete linkage method is used. There are multiple agglomeration methods that can be used, such as ward.D, ward.D2, single, complete, and average.

> clust1<-hclust(dist1,method="ward.D")
> clust1

Calling clust1 displays the agglomeration method used, the manner in which the distance was calculated, and the number of objects. The results are as follows:

Step 5 - plotting the model

The plot() function is a generic function for plotting R objects. Here, the plot() function is used to draw the dendrogram:

> plot(clust1,labels= wbclust$Country, cex=0.7, xlab="",ylab="Distance",main="Clustering for 80 Most Populous Countries")

The result is as follows:

The rect.hclust() function highlights the clusters and draws rectangles around the branches of the dendrogram. The dendrogram is first cut at a certain level, followed by drawing a rectangle around the selected branches. The clust1 object is passed to the function along with the number of clusters to be formed:

> rect.hclust(clust1,k=5)

The result is as follows:

The cutree() function cuts the tree into multiple groups on the basis of the desired number of groups or the cut height. Here, clust1 is passed to the function along with the desired number of groups:

> cuts=cutree(clust1,k=5)
> cuts

The result is as follows:

Finally, we get the list of countries in each group. The result is as follows:

Summary

In this article we covered hierarchical clustering by collecting the data, exploring its contents, and transforming it. We trained and evaluated the model using a distance matrix, and finally plotted the data as a dendrogram.

Resources for Article:
Further resources on this subject:
Supervised Machine Learning [article]
Specialized Machine Learning Topics [article]
Machine Learning Using Spark MLlib [article]

Classification using Convolutional Neural Networks

Mohammad Pezeshki
07 Feb 2017
5 min read
In this blog post, we begin with a simple classification task that the reader can readily relate to. The task is a binary classification of 25000 images of cats and dogs, divided into 20000 training, 2500 validation, and 2500 testing images. It seems reasonable to use the most promising model for object recognition, which is the convolutional neural network (CNN). As a result, we use a CNN as the baseline for the experiments, and throughout this post we will try to improve its performance using different techniques. So, in the next sections, we will first introduce the CNN and its architecture, and then we will explore techniques to boost performance and speed, namely Parametric ReLU and Batch Normalization. In this post, we will show the experimental results as we go through each technique. The complete code for the CNN is available online in the author's GitHub repository.

Convolutional Neural Networks

Convolutional neural networks can be seen as feedforward neural networks in which multiple copies of the same neuron are applied at different places. This means applying the same function to different patches of an image. Doing this means that we are explicitly imposing our knowledge about the data (images) onto the model structure. That's because we already know that natural image data is translation invariant, meaning that the probability distribution of pixels is the same across all images. This structure, which is followed by a non-linearity and a pooling and subsampling layer, makes CNNs powerful models, especially when dealing with images. Here's a graphical illustration of a CNN from Prof. Hugo Larochelle's course on Neural Networks, which is originally from Prof. Yann LeCun's paper on ConvNets. Implementing a CNN in a GPU-based framework such as Theano is also quite straightforward. So, we can create a layer like this: And then we can stack them on top of each other like this:

CNN Experiments

Armed with the CNN, we attacked the task using two baseline models: a relatively big and a relatively small model. In the figures below, you can see the number of layers, the filter sizes, pooling sizes, strides, and the number of fully connected layers. We trained both networks with a learning rate of 0.01 and a momentum of 0.9 on a GTX580 GPU. We also used early stopping. The small model can be trained in two hours and results in 81 percent accuracy on the validation set. The big model can be trained in 24 hours and results in 92 percent accuracy on the validation set.

Parametric ReLU

Parametric ReLU (aka Leaky ReLU) is an extension of the Rectified Linear Unit that allows the neuron to learn the slope of the activation function in the negative region. Unlike the actual Parametric ReLU paper by Microsoft Research, I used a different parameterization that forces the slope to be between 0 and 1 (one form consistent with this description is sketched at the end of this section). As shown in the figure below, when alpha is 0 the activation function is just linear; on the other hand, if alpha is 1 then the activation function is exactly the ReLU. Interestingly, although the number of trainable parameters is increased when using Parametric ReLU, it improves the model both in terms of accuracy and in terms of convergence speed. Using Parametric ReLU cuts the training time to about three quarters and increases the accuracy by around 1 percent. In Parametric ReLU, to make sure that alpha remains between 0 and 1, we set alpha = Sigmoid(beta) and optimize beta instead. In our experiments, we set the initial value of alpha to 0.5. After training, all alphas were between 0.5 and 0.8.
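The post does not spell the parameterization out in the text (it appears only in the figure), but one form consistent with the description above, offered here as a sketch rather than the author's exact formula, is:

f(x) = \max(0, x) + (1 - \alpha)\,\min(0, x), \qquad \alpha = \sigma(\beta) = \frac{1}{1 + e^{-\beta}}

With this form, \alpha = 0 gives the identity (linear) activation and \alpha = 1 gives the standard ReLU, and the trained values \alpha \in [0.5, 0.8] correspond to negative-region slopes 1 - \alpha between 0.2 and 0.5.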
That means that the model enjoys having a small gradient in the negative region. "Basically, even a small slope in the negative region of the activation function can help training a lot. Besides, it's important to let the model decide how much nonlinearity it needs."

Batch Normalization

Batch Normalization simply means normalizing the preactivations for each batch to have zero mean and unit variance. Based on a recent paper by Google, this normalization reduces a problem called Internal Covariate Shift and consequently makes learning much faster. The equations are as follows: Personally, I found this to be one of the most interesting and simplest techniques I have ever used. A very important point to keep in mind is to feed the whole validation set as a single batch at testing time, to have a more accurate (less biased) estimation of the mean and variance. "Batch Normalization, which means normalizing pre-activations for each batch to have zero mean and unit variance, can boost the results both in terms of accuracy and in terms of convergence speed."

Conclusion

All in all, we conclude this post with two finalized models. One of them can be trained in 10 epochs or, equivalently, 15 minutes, and can achieve 80 percent accuracy. The other model is a relatively large model. In this model, we did not use LDNN, but the other two techniques are used, and we achieved 94.5 percent accuracy.

About the Author

Mohammad Pezeshki is a PhD student in the MILA lab at the University of Montreal. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014. He then obtained his Master's in June 2016. His research interests lie in the fields of Artificial Intelligence, Machine Learning, Probabilistic Models and, specifically, Deep Learning.


Building A Search Geo Locator with Elasticsearch and Spark

Packt
31 Jan 2017
12 min read
In this article, Alberto Paro, the author of the book Elasticsearch 5.x Cookbook - Third Edition, discusses how to use and manage Elasticsearch, covering topics such as installation/setup, mapping management, indices management, queries, aggregations/analytics, scripting, building custom plugins, and integration with Python, Java, Scala, and some big data tools such as Apache Spark and Apache Pig. (For more resources related to this topic, see here.)

Background

Elasticsearch is a common answer to every need for search on data, and with its aggregation framework it can provide analytics in real time. Elasticsearch was one of the first products able to bring search to the big data world. Its cloud-native design, JSON as the standard format for both data and search, and its HTTP-based approach are just the solid foundations of this product. Elasticsearch solves a growing list of search, log analysis, and analytics challenges across virtually every industry. It's used by big companies such as LinkedIn, Wikipedia, Cisco, eBay, Facebook, and many others (source: https://www.elastic.co/use-cases). In this article, we will show how to easily build a simple search geolocator with Elasticsearch, using Apache Spark for ingestion.

Objective

In this article, we will develop a search geolocator application using the world geonames database. To make this happen, the following steps will be covered:
Data collection
Optimized index creation
Ingestion via Apache Spark
Searching for a location name
Searching for a city given a location position
Executing some analytics on the dataset
All the article code is available on GitHub at https://github.com/aparo/elasticsearch-geonames-locator. All the commands below need to be executed in the code directory on Linux/MacOS X. The requirements are a local Elasticsearch server instance, a working local Spark installation, and SBT installed (http://www.scala-sbt.org/).

Data collection

To populate our application we need a database of geo locations. One of the most famous and widely used datasets is the GeoNames geographical database, which is available for download free of charge under a Creative Commons Attribution license. It contains over 10 million geographical names and consists of over 9 million unique features, of which 2.8 million are populated places, plus 5.5 million alternate names. It can be easily downloaded from http://download.geonames.org/export/dump. The dump directory provides CSV files divided by country, but in our case we'll take the dump with all the countries, the allCountries.zip file. To download the data we can use wget:

wget http://download.geonames.org/export/dump/allCountries.zip

Then we need to unzip it and put it in the downloads folder:

unzip allCountries.zip
mv allCountries.txt downloads

The Geoname dump has the following fields:
1. geonameid: Unique ID for this geoname
2. name: The name of the geoname
3. asciiname: ASCII representation of the name
4. alternatenames: Other forms of this name, generally in several languages
5. latitude: Latitude in decimal degrees of the geoname
6. longitude: Longitude in decimal degrees of the geoname
7. fclass: Feature class, see http://www.geonames.org/export/codes.html
8. fcode: Feature code, see http://www.geonames.org/export/codes.html
9. country: ISO-3166 2-letter country code
10. cc2: Alternate country codes, comma separated, ISO-3166 2-letter country codes
11. admin1: FIPS code (subject to change to ISO code)
12. admin2: Code for the second administrative division, a county in the US
13. admin3: Code for the third level administrative division
14. admin4: Code for the fourth level administrative division
15. population: The population of the geoname
16. elevation: The elevation in meters of the geoname
17. gtopo30: Digital elevation model
18. timezone: The timezone of the geoname
19. moddate: The date of last change of this geoname
Table 1: Dataset characteristics

Optimized Index creation

Elasticsearch provides automatic schema inference for your data, but the inferred schema is not the best possible. Often you need to tune it for:
Removing fields that are not required
Managing geo fields
Optimizing string fields that are indexed twice, in their tokenized and keyword versions
Given the Geoname dataset, we will add a new field, location, that is a GeoPoint we will use in geo searches. Another important optimization for indexing is defining the correct number of shards. In this case we have only 11M records, so using only 2 shards is enough. The settings for creating our optimized index with mapping and shards are the following:

{ "mappings": { "geoname": { "properties": { "admin1": { "type": "keyword", "ignore_above": 256 }, "admin2": { "type": "keyword", "ignore_above": 256 }, "admin3": { "type": "keyword", "ignore_above": 256 }, "admin4": { "type": "keyword", "ignore_above": 256 }, "alternatenames": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "asciiname": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "cc2": { "type": "keyword", "ignore_above": 256 }, "country": { "type": "keyword", "ignore_above": 256 }, "elevation": { "type": "long" }, "fclass": { "type": "keyword", "ignore_above": 256 }, "fcode": { "type": "keyword", "ignore_above": 256 }, "geonameid": { "type": "long" }, "gtopo30": { "type": "long" }, "latitude": { "type": "float" }, "location": { "type": "geo_point" }, "longitude": { "type": "float" }, "moddate": { "type": "date" }, "name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "population": { "type": "long" }, "timezone": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } }, "settings": { "index": { "number_of_shards": "2", "number_of_replicas": "1" } } }

We can store the above JSON in a file called settings.json and create the index via the curl command:

curl -XPUT http://localhost:9200/geonames -d @settings.json

Now our index is created and ready to receive our documents.

Ingestion via Apache Spark

Apache Spark is very handy for processing CSV files and manipulating the data before saving it to storage, whether on disk or in NoSQL. Elasticsearch provides easy integration with Apache Spark, allowing a Spark RDD to be written to Elasticsearch with a single command.
We will build a Spark job called GeonameIngester that will execute the following steps:
Initialize the Spark job
Parse the CSV
Define our required structures and conversions
Populate our classes
Write the RDD to Elasticsearch
Execute the Spark job

Initialize the Spark Job

We need to import the required classes:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.elasticsearch.spark.rdd.EsSpark
import scala.util.Try

We define the GeonameIngester object and the SparkSession:

object GeonameIngester {
  def main(args: Array[String]) {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("GeonameIngester")
      .getOrCreate()

To easily serialize complex datatypes, we switch to the Kryo encoder:

import scala.reflect.ClassTag
implicit def kryoEncoder[A](implicit ct: ClassTag[A]) = org.apache.spark.sql.Encoders.kryo[A](ct)
import sparkSession.implicits._

Parse the CSV

For parsing the CSV, we need to define the Geoname schema to be used for reading:

val geonameSchema = StructType(Array(
  StructField("geonameid", IntegerType, false),
  StructField("name", StringType, false),
  StructField("asciiname", StringType, true),
  StructField("alternatenames", StringType, true),
  StructField("latitude", FloatType, true),
  StructField("longitude", FloatType, true),
  StructField("fclass", StringType, true),
  StructField("fcode", StringType, true),
  StructField("country", StringType, true),
  StructField("cc2", StringType, true),
  StructField("admin1", StringType, true),
  StructField("admin2", StringType, true),
  StructField("admin3", StringType, true),
  StructField("admin4", StringType, true),
  StructField("population", DoubleType, true), // Asia population overflows Integer
  StructField("elevation", IntegerType, true),
  StructField("gtopo30", IntegerType, true),
  StructField("timezone", StringType, true),
  StructField("moddate", DateType, true)))

Now we can read all the geonames from the CSV (note the tab delimiter, "\t") via:

val GEONAME_PATH = "downloads/allCountries.txt"
val geonames = sparkSession.sqlContext.read
  .option("header", false)
  .option("quote", "")
  .option("delimiter", "\t")
  .option("maxColumns", 22)
  .schema(geonameSchema)
  .csv(GEONAME_PATH)
  .cache()

Defining our required structures and conversions

The plain CSV data is not suitable for our advanced requirements, so we define new classes to store our Geoname data. We define a GeoPoint case class to store the geo point location of our geoname:

case class GeoPoint(lat: Double, lon: Double)

We also define our Geoname class with optional and list types:

case class Geoname(geonameid: Int, name: String, asciiname: String, alternatenames: List[String], latitude: Float, longitude: Float, location: GeoPoint, fclass: String, fcode: String, country: String, cc2: String, admin1: Option[String], admin2: Option[String], admin3: Option[String], admin4: Option[String], population: Double, elevation: Int, gtopo30: Int, timezone: String, moddate: String)

To reduce the boilerplate of the conversion, we define an implicit method that converts a String into an Option[String] depending on whether it is empty or null.
implicit def emptyToOption(value: String): Option[String] = {
  if (value == null) return None
  val clean = value.trim
  if (clean.isEmpty) {
    None
  } else {
    Some(clean)
  }
}

During processing, in case the population value is null, we need to fix this value and set it to 0; to do this we define a fixNullInt function:

def fixNullInt(value: Any): Int = {
  if (value == null) 0
  else {
    Try(value.asInstanceOf[Int]).toOption.getOrElse(0)
  }
}

Populating our classes

We can populate the records that we need to store in Elasticsearch via a map on the geonames DataFrame:

val records = geonames.map { row =>
  val id = row.getInt(0)
  val lat = row.getFloat(4)
  val lon = row.getFloat(5)
  Geoname(id, row.getString(1), row.getString(2),
    Option(row.getString(3)).map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList).getOrElse(Nil),
    lat, lon, GeoPoint(lat, lon),
    row.getString(6), row.getString(7), row.getString(8), row.getString(9),
    row.getString(10), row.getString(11), row.getString(12), row.getString(13),
    row.getDouble(14), fixNullInt(row.get(15)), row.getInt(16), row.getString(17),
    row.getDate(18).toString
  )
}

Writing the RDD in Elasticsearch

The final step is to store our newly built records in Elasticsearch via:

EsSpark.saveToEs(records.toJavaRDD, "geonames/geoname", Map("es.mapping.id" -> "geonameid"))

The value "geonames/geoname" is the index/type to be used to store the records in Elasticsearch. To maintain the same ID for the geonames in both the CSV and Elasticsearch, we pass an additional parameter, es.mapping.id, that refers to where to find the id to be used in Elasticsearch (geonameid in the above example).

Executing the Spark Job

To execute a Spark job, you need to build a JAR with all the required libraries and then execute it on Spark. The first step is done via the sbt assembly command, which will generate a fat JAR with only the required libraries. To submit the Spark job in the JAR, we can use the spark-submit command:

spark-submit --class GeonameIngester target/scala-2.11/elasticsearch-geonames-locator-assembly-1.0.jar

Now you need to wait (about 20 minutes on my machine) for Spark to send all the documents to Elasticsearch and for them to be indexed.

Searching for a location name

After having indexed all the geonames, you can search for them. In case we want to search for Moscow, we need a complex query because:
Cities in geonames are entities with fclass="P"
We want to skip unpopulated cities
We sort by population descending to have the most populated first
The city name can be in the name, alternatenames, or asciiname field
To achieve this kind of query in Elasticsearch, we can use a simple Boolean query with several should clauses to match the names and some filters to filter out unwanted results. We can execute it via curl:

curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{ "query": { "bool": { "minimum_should_match": 1, "should": [ { "term": { "name": "moscow"}}, { "term": { "alternatenames": "moscow"}}, { "term": { "asciiname": "moscow" }} ], "filter": [ { "term": { "fclass": "P" }}, { "range": { "population": {"gt": 0}}} ] } }, "sort": [ { "population": { "order": "desc"}}] }'
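For readers more comfortable in R (the language used in the earlier articles on this page), the same request can be issued with the httr package. This is a minimal sketch, not part of the original article; it assumes a local cluster on port 9200 and that the httr package is installed:

library(httr)

# the same bool query used in the curl example above, kept as a JSON string
queryBody <- '{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        { "term": { "name": "moscow" }},
        { "term": { "alternatenames": "moscow" }},
        { "term": { "asciiname": "moscow" }}
      ],
      "filter": [
        { "term": { "fclass": "P" }},
        { "range": { "population": { "gt": 0 }}}
      ]
    }
  },
  "sort": [ { "population": { "order": "desc" }}]
}'

# POST the query and parse the JSON response into an R list
res <- POST("http://localhost:9200/geonames/geoname/_search",
            body = queryBody, content_type_json())
hits <- content(res, as = "parsed")$hits$hits
# name and population of the top hit (assumes at least one document matched)
hits[[1]]$`_source`$name
hits[[1]]$`_source`$population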
The result will be similar to this one: { "took": 14, "timed_out": false, "_shards": { "total": 2, "successful": 2, "failed": 0 }, "hits": { "total": 9, "max_score": null, "hits": [ { "_index": "geonames", "_type": "geoname", "_id": "524901", "_score": null, "_source": { "name": "Moscow", "location": { "lat": 55.752220153808594, "lon": 37.61555862426758 }, "latitude": 55.75222, "population": 10381222, "moddate": "2016-04-13", "timezone": "Europe/Moscow", "alternatenames": [ "Gorad Maskva", "MOW", "Maeskuy", .... ], "country": "RU", "admin1": "48", "longitude": 37.61556, "admin3": null, "gtopo30": 144, "asciiname": "Moscow", "admin4": null, "elevation": 0, "admin2": null, "fcode": "PPLC", "fclass": "P", "geonameid": 524901, "cc2": null }, "sort": [ 10381222 ] }, Searching for cities given a location position We have processed the geoname so that in Elasticsearch, we were able to have a GeoPoint field. Elasticsearch GeoPoint field allows to enable search for a lot of geolocation queries. One of the most common search is to find cities near me via a Geo Distance Query. This can be achieved modifying the above search in curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{ "query": { "bool": { "filter": [ { "geo_distance" : { "distance" : "100km", "location" : { "lat" : 55.7522201, "lon" : 36.6155586 } } }, { "term": { "fclass": "P" }}, { "range": { "population": {"gt": 0}}} ] } }, "sort": [ { "population": { "order": "desc"}}] }' Executing an analytic on the dataset. Having indexed all the geonames, we can check the completes of our dataset and executing analytics on them. For example, it’s useful to check how many geonames there are for a single country and the feature class for every single top country to evaluate their distribution. This can be easily achieved using an Elasticsearch aggregation in a single query: curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d ' { "size": 0, "aggs": { "geoname_by_country": { "terms": { "field": "country", "size": 5 }, "aggs": { "feature_by_country": { "terms": { "field": "fclass", "size": 5 } } } } } }’ The result can be will be something similar: { "took": 477, "timed_out": false, "_shards": { "total": 2, "successful": 2, "failed": 0 }, "hits": { "total": 11301974, "max_score": 0, "hits": [ ] }, "aggregations": { "geoname_by_country": { "doc_count_error_upper_bound": 113415, "sum_other_doc_count": 6787106, "buckets": [ { "key": "US", "doc_count": 2229464, "feature_by_country": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 82076, "buckets": [ { "key": "S", "doc_count": 1140332 }, { "key": "H", "doc_count": 506875 }, { "key": "T", "doc_count": 225276 }, { "key": "P", "doc_count": 192697 }, { "key": "L", "doc_count": 79544 } ] } },…truncated… These are simple examples how to easy index and search data with Elasticsearch. Integrating Elasticsearch with Apache Spark it’s very trivial: the core of part is to design your index and your data model to efficiently use it. After having correct indexed your data to cover your use case, Elasticsearch is able to provides your result or analytics in few microseconds. Summary In this article, we learned how to easily build a simple search geolocator with Elasticsearch using Apache Spark for ingestion. Resources for Article: Further resources on this subject: Basic Operations of Elasticsearch [article] Extending ElasticSearch with Scripting [article] Integrating Elasticsearch with the Hadoop ecosystem [article]