
How-To Tutorials - Data

1204 Articles

How to share insights using Alteryx Server

Sunith Shetty
20 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Renato Baruti titled Learning Alteryx. In this book, you will learn how to implement efficient business intelligence solutions without writing a single line of code using Alteryx platform.[/box] In today’s tutorial, we will learn about Alteryx Server, an easiest and fastest way to deploy data intensive analytics across the organization. What is Alteryx Server? Alteryx Server provides a scalable platform for deploying and sharing analytics. This is an effective and secure establishment when deploying data rapidly. You can integrate Alteryx processes directly into other internal and external applications from the built in macros and APIs. Alteryx Server can help you speed up business decisions and enable you to get answers in hours, not weeks. You will learn about these powerful features that revolutionize data processing using Alteryx Server: Speed time-to-insight with highly scalable workloads Empower every employee to make data-driven decisions Reduce risk and downtime with analytic governance Before learning about these powerful features, let’s review the Server Structure illustration so you have a solid understanding of how the server functions: Enterprise scalability Enterprise Scalability allows you to scale your enterprise analytics that will speed time to insight. Alteryx Server will compute the data processing by scheduling and running workflows. This reliable server architecture will process data intensive workflows at your scalable fashion. Deploy Alteryx Server on a single machine or in a multi-node environment, allowing you to scale up the number of cores on your existing server or add additional server nodes for availability and improved performance as needed. Ultimate flexibility and scalability Highly complex analytics and large scale data can use a large amount of memory and processing that can take hours to run on analysts' desktops. This can lead to a delay in business answers and sharing those insights. In addition, less risk is associated with running jobs on Alteryx Server, due to system shutdowns and it being less compressible compared to running on desktop. Your IT professionals will install and maintain Alteryx Server; you can rest assured that critical workflow backups and software updates take place regularly. Alteryx Server provides a flexible server architecture with on-premise or cloud deployment to build out enterprise analytic practice for 15 users or 15,000 users. Alteryx Server can be scaled in three different ways: Scaling the Worker node for additional processing power: Increase the total number of workflows that can be processed at any given time by creating multiple Worker nodes. This will scale out the Workers. Scaling the Gallery node for additional web users: Add a load balancer to increase capacity and create multiple Gallery nodes to place behind a load balancer. This will be helpful if you have many Gallery users. Scaling the Database node for availability and redundancy: Create multiple Database nodes by scaling out the persistent databases. This is great for improving overall system performance and ensuring backups. 
More hardware may need to be added for the Alteryx Server components; sizing guidelines are available for each of them.

Scheduling and automating workflow execution to deliver data whenever and wherever you want

Maximize automation by using the built-in scheduling capabilities to schedule and run analytic workflows as needed, refresh datasets on a centralized server, and generate reports so everyone can access the data, anytime, anywhere. This lets you focus more time on analytic problems rather than keeping an eye on workflows running on the desktop; let the server manage the jobs on a schedule. You can schedule workflows, packages, or apps to run automatically through your company's Gallery or on a controller, and you can also schedule them on your own computer through Desktop Automation (Scheduler). To schedule a workflow, go to Options | Schedule Workflow; to view schedules, go to Options | View Schedules.

If you want to schedule to your company's Gallery, you first need to connect to it; add a Gallery if you aren't connected to one. To add a Gallery, select Options | Schedule Workflow | Add Gallery, type the URL path to your company's Gallery, and click Connect. The connection uses either built-in authentication (your Gallery email and password) or Windows authentication (your Windows user name).

You can also schedule a workflow to run on a controller, a machine that runs and manages schedules for your organization. Once the Alteryx Server administrator at your company sets up the controller, you need a token to connect to it. To add a controller, select Options | Schedule Workflow | Add Controller, then enter the server name and the controller token to connect.

Sharing and collaboration

Data analysts spend too much time customizing existing reports and rerunning workflows for different decision-makers instead of adding business value by working on new analytics projects. Alteryx Server lets you share macros and analytic applications, empowering business users to perform their own self-service analytics. You can easily share, collaborate on, and iterate workflows with analysts throughout your organization through integrated version control for published analytic applications. Administrators and authors of analytic applications can grant access to analytic workflows and specific apps within the Gallery to ensure that the right people have access to the analytics they need. Schedules appear on the Alteryx Analytics Gallery for easy sharing and collaboration.

Analytic governance

Alteryx Server provides a built-in secure repository and version control capabilities to enable effective collaboration, allowing you to store analytic applications in a centralized location and ensure users only access the data for which they have permissions. Permission types can be assigned to maintain secure access and sharing. The goal of managing multiple collaborating teams and deploying enterprise self-service analytics is to reduce downtime and risk while ensuring analytic and information governance. Many organizations have become accustomed to a data-driven culture, enabling every employee to use analytics and helping business users leverage the analytic tools available.
You can meet service-level agreements with detailed auditing, usage reporting, and logging tools, and your system administrators can rest assured that your data remains safe and secure. To summarize, we learned about Alteryx Server, which has powerful abilities to schedule and deploy workflows and share them with your team. We also explored how the scheduler is used to process workflows; it is especially helpful for running night jobs, since the server runs 24/7. To learn more about workflow optimization and efficient data preparation and blending, do check out the book Learning Alteryx.


What makes Hadoop so revolutionary?

Packt
20 Feb 2018
17 min read
In this article by Sourav Gulati and Sumit Kumar, authors of the book Apache Spark 2.x for Java Developers, we explain that, in the classical sense, Hadoop comprises two components: a storage layer called HDFS and a processing layer called MapReduce. Prior to Hadoop 2.x, resource management was handled by Hadoop's own MapReduce framework; that changed with the introduction of YARN. In Hadoop 2.0, YARN was introduced as the third component of Hadoop to manage the resources of the Hadoop cluster and make Hadoop more MapReduce-agnostic.

HDFS

The Hadoop Distributed File System, as the name suggests, is a distributed file system written in Java along the lines of the Google File System. In practice, HDFS closely resembles any other UNIX file system, with support for common file operations such as ls, cp, rm, du, cat, and so on. What makes HDFS stand out, despite its simplicity, is its mechanism for handling node failure in a Hadoop cluster without effectively changing the seek time for accessing stored files. An HDFS cluster consists of two major components: Data Nodes and a Name Node. HDFS has a unique way of storing data on HDFS clusters (cheap, networked commodity computers): it splits a regular file into smaller chunks called blocks and then makes a number of copies of those chunks according to the replication factor for that file. It then distributes the chunks across different Data Nodes of the cluster.

Name Node

The Name Node is responsible for managing the metadata of the HDFS cluster, such as the list of files and folders that exist in the cluster, the number of splits each file is divided into, and their replication and storage at different Data Nodes. It also maintains and manages the namespace and file permissions of all the files available in the HDFS cluster. Apart from this bookkeeping, the Name Node also has a supervisory role: it keeps a watch on the replication factor of all files and, if some block goes missing, issues commands to replicate the missing block of data. It also generates reports to ascertain cluster health. It is important to note that all the communication for supervisory tasks happens from Data Node to Name Node; that is, a Data Node sends reports (also known as block reports) to the Name Node, and the Name Node responds to them by issuing different commands or instructions as needed.

HDFS I/O

An HDFS read operation from a client involves the following steps:

1. The client requests the Name Node to determine where the actual data blocks are stored for a given file.
2. The Name Node obliges by providing the block IDs and the locations of the hosts (Data Nodes) where the data can be found.
3. The client contacts the Data Nodes with the respective block IDs and fetches the data from them while preserving the order of the block files.

An HDFS write operation from a client involves the following steps:

1. The client contacts the Name Node to update the namespace with the file name and to verify the necessary permissions.
2. If the file already exists, the Name Node throws an error; otherwise, it returns to the client an FSDataOutputStream that points to the data queue.
3. The data queue negotiates with the Name Node to allocate new blocks on suitable Data Nodes.
4. The data is then copied to one of those Data Nodes and, as per the replication strategy, further copied from that Data Node to the rest of the Data Nodes.

It is important to note that the data is never moved through the Name Node, as that would have caused a performance bottleneck.
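To make the block-and-replication idea concrete, here is a small, self-contained Python sketch added for this excerpt (it is not Hadoop code). It models how a file of a given size would be split into blocks and how replicas might be spread across Data Nodes; the 128 MB block size, replication factor of 3, and round-robin placement are illustrative assumptions, not HDFS's actual rack-aware placement policy.

import math

def plan_blocks(file_size_bytes,
                block_size=128 * 1024 * 1024,   # assumed HDFS-style block size
                replication=3,                  # assumed replication factor
                data_nodes=("dn1", "dn2", "dn3", "dn4")):
    """Return a naive block/replica plan for a file, mimicking HDFS-style splitting."""
    num_blocks = math.ceil(file_size_bytes / block_size)
    plan = []
    for i in range(num_blocks):
        size = min(block_size, file_size_bytes - i * block_size)
        # Naive round-robin replica placement (real HDFS considers racks and load)
        replicas = [data_nodes[(i + r) % len(data_nodes)] for r in range(replication)]
        plan.append({"block": i, "size_bytes": size, "replicas": replicas})
    return plan

if __name__ == "__main__":
    # A 300 MB file becomes three blocks: 128 MB, 128 MB, and 44 MB
    for block in plan_blocks(300 * 1024 * 1024):
        print(block)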
YARN

The simplest way to understand YARN (Yet Another Resource Negotiator) is to think of it as an operating system for a cluster: it provisions resources, schedules jobs, and handles node maintenance. With Hadoop 2.x, the MapReduce model of processing the data and managing the cluster (JobTracker/TaskTracker) was split. While data processing was still left to MapReduce, the cluster's resource allocation (or rather, scheduling) task was assigned to a new component called YARN. Another objective that YARN met was making MapReduce just one technique for processing data, rather than the only technology for processing data on HDFS, as was the case in Hadoop 1.x systems. This paradigm shift opened the floodgates for the development of interesting applications around Hadoop, and an ecosystem evolved that was no longer limited to the classical MapReduce processing model. It didn't take much time after that for Apache Spark to break the hegemony of classical MapReduce and become arguably the most popular processing framework for parallel computing, as far as active development and adoption are concerned. To serve multi-tenancy, fault tolerance, and resource isolation, YARN introduced the following components to manage the cluster seamlessly.

ResourceManager: It negotiates resources for different compute programs on a Hadoop cluster while guaranteeing resource isolation, data locality, fault tolerance, task prioritization, and effective cluster capacity utilization. A configurable scheduler gives the ResourceManager the flexibility to schedule and prioritize different applications as needed.

Tasks served by the RM while serving clients: Using a client or the APIs, a user can submit or terminate an application and can also gather statistics on submitted applications as well as cluster and queue information. The RM also gives admin tasks higher priority than any other task, so that clean-up or maintenance activities on the cluster, such as refreshing the node list or the queue configuration, can be performed.

Tasks served by the RM while serving cluster nodes: Provisioning and de-provisioning of nodes is an important task of the RM. Each node sends a heartbeat at a configured interval; if no heartbeat arrives within the expiry window (10 minutes by default), the node is treated as dead, and as a clean-up activity all of its supposedly running processes, including containers, are marked dead too.

Tasks served by the RM while serving the ApplicationMaster: The RM registers new AMs and terminates the successfully completed ones. Just as with cluster nodes, if the heartbeat of an AM is not received within a preconfigured duration (10 minutes by default), the AM is marked dead and all its associated containers are marked dead too. But since YARN is reliable as far as application execution is concerned, a new AM is scheduled to attempt another execution in a new container, until the configurable retry count (4 by default) is reached.

Scheduling and other miscellaneous tasks served by the RM: The RM maintains a list of running, submitted, and executed applications along with their statistics, such as execution time and status. The privileges of users as well as of applications are maintained and compared while serving various requests over an application's life cycle. The RM scheduler oversees resource allocation for applications, such as memory allocation; two common scheduling algorithms used in YARN are fair scheduling and capacity scheduling.

NodeManager: An NM exists on every node of the cluster, in a role similar to that of the slave nodes in a master-slave architecture.
When an NM starts, it sends information to the RM announcing the availability of its resources for upcoming jobs. From then on, the NM sends a periodic signal, also called a heartbeat, to the RM, informing it that the node is still alive in the cluster. The NM is primarily responsible for launching the containers requested by an AM with certain resource requirements, such as memory, disk, and so on. Once the containers are up and running, the NM keeps a watch not on the status of the container's task but on the resource utilization of the container, and it kills a container if it starts utilizing more resources than it has been provisioned for. Apart from managing the life cycle of containers, the NM also keeps the RM informed about the node's health.

ApplicationMaster: An AM is launched per submitted application and manages the life cycle of that application. The first and foremost task the AM performs is negotiating resources from the RM to launch task-specific containers on different nodes. Once the containers are launched, the AM keeps track of the status of all the containers' tasks. If any node goes down or a container gets killed (because of using excess resources or otherwise), the AM renegotiates resources from the RM and launches the pending tasks again. The AM also reports the status of the submitted application directly to the user and sends other statistics to the RM. The ApplicationMaster implementation is framework-specific; it is for this reason that the application/framework-specific code is transferred to the AM, and it is the AM that distributes it further across the cluster. This important feature also makes YARN technology-agnostic, as any framework can implement its own ApplicationMaster and then utilize the resources of a YARN cluster seamlessly.

Container: A container, in an abstract sense, is the set of minimal resources, such as CPU, RAM, disk I/O, disk space, and so on, required to run a task independently on a node. The first container after a job is submitted is launched by the RM to host the ApplicationMaster. It is the AM that then negotiates resources from the RM in the form of containers, which are then hosted on different nodes across the Hadoop cluster.

Process flow of application submission in YARN:

Step 1: Using a client or the APIs, the user submits the application, let's say a Spark job JAR. The ResourceManager, whose primary task is to gather and report all the applications running on the entire Hadoop cluster and the available resources on the respective Hadoop nodes, accepts the newly submitted task, depending on the privileges of the user submitting the job.

Step 2: The RM then delegates the task to the scheduler. The scheduler searches for a container that can host the application-specific ApplicationMaster. While the scheduler does take into consideration parameters such as availability of resources, task priority, and data locality before scheduling or launching an ApplicationMaster, it has no role in monitoring or restarting a failed job; it is the responsibility of the RM to keep track of an AM and restart it in a new container if it fails.

Step 3: Once the ApplicationMaster is launched, it becomes the prerogative of the AM to oversee the negotiation of resources with the RM for launching task-specific containers. Negotiation with the RM typically covers:

- The priority of the tasks at hand.
- The number of containers to be launched to complete the tasks.
- The resources needed to execute the tasks, that is, RAM and CPU (since Hadoop 3.x).
- The available nodes where job containers can be launched with the required resources.

Depending on the priority and availability of resources, the RM grants containers, each represented by a container ID and the hostname of the node on which it can be launched.

Step 4: The AM then requests the NMs of the respective hosts to launch the containers with the specific IDs and resource configurations. The NM launches the containers but keeps a watch on the resource usage of the task; if, for example, a container starts utilizing more resources than it has been provisioned for, that container is killed by the NM. This greatly improves the job isolation and fair sharing of resources that YARN guarantees, as otherwise it would have impacted the execution of other containers. However, it is important to note that the job status, and the application status as a whole, is managed by the AM; it falls to the AM to continuously monitor any delayed or dead containers while simultaneously negotiating with the RM to launch new containers to reassign the tasks of dead containers.

Step 5: The containers executing on different nodes send application-specific statistics to the AM at specific intervals.

Step 6: The AM also reports the status of the application directly to the client that submitted it, in our case a Spark job.

Step 7: The NM monitors the resources being utilized by all the containers on its node and keeps sending periodic updates to the RM.

Step 8: The AM sends periodic statistics, such as application status, task failures, and log information, to the RM.

Overview of MapReduce

Before delving deep into the MapReduce implementation in Hadoop, let's first understand MapReduce as a concept in parallel computing and why it is a preferred way of computing. MapReduce comprises two mutually exclusive but dependent phases, each capable of running on two different machines or nodes:

Map: In the Map phase, the transformation of data takes place. It splits data into key-value pairs by splitting it on a keyword. Suppose we have a text file and we want to do an analysis such as counting the total number of words, or the frequency with which each word occurs in the text file. This is the classical word count problem of MapReduce. To address it, we first have to identify the splitting keyword so that the data can be split and converted into key-value pairs. Let's begin with John Lennon's song Imagine.

Sample text: Imagine there's no heaven It's easy if you try No hell below us Above us only sky Imagine all the people living for today

After running the Map phase on the sampled text and splitting it over <space>, it gets converted into key-value pairs as follows: [<imagine, 1> <there's, 1> <no, 1> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <no, 1> <hell, 1> <below, 1> <us, 1> <above, 1> <us, 1> <only, 1> <sky, 1> <imagine, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>]

The key here represents the word and the value represents the count. Note that we have converted all the keys to lowercase to avoid any further complexity arising from matching case-sensitive keys.

Reduce: The Reduce phase deals with the aggregation of the Map phase's result, so all the key-value pairs are aggregated over the key. The Map output of the text gets aggregated as follows: [<imagine, 2> <there's, 1> <no, 2> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <hell, 1> <below, 1> <us, 2> <above, 1> <only, 1> <sky, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>]

As we can see, the Map and Reduce phases can each run exclusively and hence can use independent nodes in the cluster to process the data. This approach of separating tasks into smaller units called Map and Reduce has revolutionized general-purpose distributed/parallel computing, which we now know as MapReduce.
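To make the two phases concrete, here is a small pure-Python sketch added for this excerpt (it is an illustration of the concept, not Hadoop code). It mimics the Map, shuffle, and Reduce steps for the word-count example above.

from collections import defaultdict

sample_text = ("Imagine there's no heaven It's easy if you try "
               "No hell below us Above us only sky "
               "Imagine all the people living for today")

# Map phase: emit a <word, 1> pair for every word, lowercased
mapped = [(word.lower(), 1) for word in sample_text.split()]

# Shuffle: group the pairs by key (Hadoop does this between Map and Reduce)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key
reduced = {word: sum(counts) for word, counts in grouped.items()}

print(reduced["imagine"], reduced["us"], reduced["no"])  # prints: 2 2 2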
Apache Hadoop's MapReduce has been implemented pretty much the same way as discussed, except for extra machinery governing how the data from the Map phase of each node gets transferred to its designated Reduce phase node. Hadoop's implementation enriches the Map and Reduce phases by adding a few more concrete steps in between to make the process fault tolerant and truly distributed. We can describe MR jobs on YARN in five stages.

Job submission stage: When a client submits an MR job, the following things happen: the RM is asked for an application ID; the input data location is checked and, if present, the file split size is computed; and the job's output location is also verified. If all three conditions are met, the MR job JAR, along with its configuration and the details of the input splits, is copied to HDFS in a directory named after the application ID provided by the RM. The job is then submitted to the RM to launch a job-specific ApplicationMaster, MRAppMaster.

Map stage: Once the RM receives the client's request to launch MRAppMaster, a call is made to the YARN scheduler to assign a container. As per resource availability, the container is granted, and MRAppMaster is launched at the designated node with the provisioned resources. After this, MRAppMaster fetches the input split information from the HDFS path that was submitted by the client and computes the number of mapper tasks to launch based on the splits. Depending on the number of mappers, it also calculates the required number of reducers as per the configuration. If MRAppMaster finds the number of mappers and reducers and the size of the input files small enough to run in the same JVM, it goes ahead and does so; such tasks are called uber tasks. In other scenarios, however, MRAppMaster negotiates container resources from the RM for running these tasks, with mapper tasks given higher order and priority, since mapper tasks must finish before the sorting phase can start. Data locality is another concern for containers hosting mappers: data-local nodes are preferred over rack-local ones, with the least preference given to data hosted on a remote node. For the Reduce phase, no such data-locality preference exists for containers. Containers hosting the mapper function first copy the MapReduce JAR and configuration files locally and then launch a YarnChild class in the JVM. The mapper then starts reading the input files, processes them into key-value pairs, and writes the pairs to a circular buffer.

Shuffle and sort phase: Because the circular buffer has a size constraint, once it fills past a certain percentage (80 by default), a thread is spawned that spills the data from the buffer. Before the spilled data is copied to disk, it is first partitioned with respect to its reducer; the background thread then sorts the partitioned data on the key and, if a combiner is specified, combines the data too. This process optimizes the data once it is copied to its respective partition folder.
This process continues until all the data from the circular buffer has been written to disk. A background thread then checks whether the number of spilled files in each partition is within the range of a configurable parameter; otherwise, the files are merged and the combiner is run over them until the count falls within the limit of that parameter. A map task keeps updating its status to the ApplicationMaster throughout its life cycle, and it is only when 5 percent of the map tasks have completed that the reduce tasks start. An auxiliary service in the NodeManager serving the reduce task starts a Netty web server, which requests from MRAppMaster the mapper hosts holding the relevant partitioned files. All the partitioned files that pertain to a reducer are copied to its node in a similar fashion. Since multiple files are copied as data from the various nodes feeding that reduce node is collected, a background thread merges the sorted map files, sorts them again, and, if a combiner is configured, combines the result too.

Reduce stage: It is important to note that, at this stage, every input file for each reducer should already have been sorted by key; this is the assumption with which the reducer starts processing these records and converting the key-value pairs into an aggregated list. Once the reducer has processed the data, it writes the output to the output folder specified during job submission.

Clean-up stage: Each reducer sends periodic updates to MRAppMaster about task completion. Once the reduce task is over, the ApplicationMaster starts the clean-up activity: the submitted job's status is changed from running to successful, all the temporary and intermediate files and folders are deleted, and the application statistics are archived to the job history server.

Summary

In this article we saw what HDFS and YARN are, along with MapReduce, and we learned about the different functions of MapReduce and HDFS I/O.


Installing and Configuring X-pack on Elasticsearch and Kibana

Pravin Dhandre
20 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book provides detailed coverage on fundamentals of Elastic Stack, making it easy to search, analyze and visualize data across different sources in real-time.[/box] In this short tutorial, we will show step-by-step installation and configuration of X-pack components in Elastic Stack to extend the functionalities of Elasticsearch and Kibana. As X-Pack is an extension of Elastic Stack, prior to installing X-Pack, you need to have both Elasticsearch and Kibana installed. You must run the version of X-Pack that matches the version of Elasticsearch and Kibana. Installing X-Pack on Elasticsearch X-Pack is installed just like any plugin to extend Elasticsearch. These are the steps to install X-Pack in Elasticsearch: Navigate to the ES_HOME folder. Install X-Pack using the following command: $ ES_HOME> bin/elasticsearch-plugin install x-pack During installation, it will ask you to grant extra permissions to X-Pack, which are required by Watcher to send email alerts and also to enable Elasticsearch to launch the machine learning analytical engine. Specify y to continue the installation or N to abort the installation. You should get the following logs/prompts during installation: -> Downloading x-pack from elastic [=================================================] 100% @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: plugin requires additional permissions @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ * java.io.FilePermission .pipe* read,write * java.lang.RuntimePermissionaccessClassInPackage.com.sun.activation.registries * java.lang.RuntimePermission getClassLoader * java.lang.RuntimePermission setContextClassLoader * java.lang.RuntimePermission setFactory * java.net.SocketPermission * connect,accept,resolve * java.security.SecurityPermission createPolicy.JavaPolicy * java.security.SecurityPermission getPolicy * java.security.SecurityPermission putProviderProperty.BC * java.security.SecurityPermission setPolicy * java.util.PropertyPermission * read,write * java.util.PropertyPermission sun.nio.ch.bugLevel write See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html for descriptions of what these permissions allow and the associated Risks. Continue with installation? [y/N]y @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: plugin forks a native controller @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ This plugin launches a native controller that is not subject to the Java security manager nor to system call filters. Continue with installation? [y/N]y Elasticsearch keystore is required by plugin [x-pack], creating... -> Installed x-pack Restart Elasticsearch: $ ES_HOME> bin/elasticsearch Generate the passwords for the default/reserved users—elastic, kibana, and logstash_system—by executing this command: $ ES_HOME>bin/x-pack/setup-passwords interactive You should get the following logs/prompts to enter the password for the reserved/default users: Initiating the setup of reserved user elastic,kibana,logstash_system passwords. You will be prompted to enter passwords as the process progresses. 
Please confirm that you would like to continue [y/N]y
Enter password for [elastic]: elastic
Reenter password for [elastic]: elastic
Enter password for [kibana]: kibana
Reenter password for [kibana]: kibana
Enter password for [logstash_system]: logstash
Reenter password for [logstash_system]: logstash
Changed password for user [kibana]
Changed password for user [logstash_system]
Changed password for user [elastic]

Please make a note of the passwords set for the reserved/default users. You can choose any passwords you like; we have chosen elastic, kibana, and logstash for the elastic, kibana, and logstash_system users, respectively, and we will use them throughout this chapter.

To verify the X-Pack installation and the enforcement of security, point your web browser to http://localhost:9200/ to open Elasticsearch. You should be prompted to log in to Elasticsearch. To log in, you can use the built-in elastic user and the password elastic. Upon a successful login, you should see the following response:

{ name: "fwDdHSI", cluster_name: "elasticsearch", cluster_uuid: "08wSPsjSQCmeRaxF4iHizw", version: { number: "6.0.0", build_hash: "8f0685b", build_date: "2017-11-10T18:41:22.859Z", build_snapshot: false, lucene_version: "7.0.1", minimum_wire_compatibility_version: "5.6.0", minimum_index_compatibility_version: "5.0.0" }, tagline: "You Know, for Search" }

A typical Elasticsearch cluster is made up of multiple nodes, and X-Pack needs to be installed on each node belonging to the cluster. To skip the install prompts, use the --batch parameter during installation:

$ ES_HOME> bin/elasticsearch-plugin install x-pack --batch

Your installation of X-Pack will have created folders named x-pack in the bin, config, and plugins directories found under ES_HOME. We shall explore these in later sections of the chapter.

Installing X-Pack on Kibana

X-Pack is installed just like any other plugin that extends Kibana. The following are the steps to install X-Pack in Kibana:

1. Navigate to the KIBANA_HOME folder.
2. Install X-Pack using the following command:

$ KIBANA_HOME> bin/kibana-plugin install x-pack

You should get the following logs/prompts during installation:

Attempting to transfer from x-pack
Attempting to transfer from https://artifacts.elastic.co/downloads/kibana-plugins/x-pack/x-pack-6.0.0.zip
Transferring 120307264 bytes....................
Transfer complete
Retrieving metadata from plugin archive
Extracting plugin archive
Extraction complete
Optimizing and caching browser bundles...
Plugin installation complete

3. Add the following credentials in the kibana.yml file found under KIBANA_HOME/config and save it:

elasticsearch.username: "kibana"
elasticsearch.password: "kibana"

If you chose a different password for the kibana user during the password setup, use that value for the elasticsearch.password property.

4. Start Kibana:

$ KIBANA_HOME> bin/kibana

To verify the X-Pack installation, go to http://localhost:5601/ to open Kibana. You should be prompted to log in to Kibana; to log in, you can use the built-in elastic user and the password elastic. Your installation of X-Pack will have created a folder named x-pack in the plugins folder found under KIBANA_HOME. You can also optionally install X-Pack on Logstash; however, X-Pack currently supports only the monitoring of Logstash.

Uninstalling X-Pack

To uninstall X-Pack:

1. Stop Elasticsearch.
2. Remove X-Pack from Elasticsearch:

$ ES_HOME> bin/elasticsearch-plugin remove x-pack

3. Restart Elasticsearch and stop Kibana.
4. Remove X-Pack from Kibana:

$ KIBANA_HOME> bin/kibana-plugin remove x-pack

5. Restart Kibana.

Configuring X-Pack

X-Pack comes bundled with security, alerting, monitoring, reporting, machine learning, and graph capabilities. By default, all of these features are enabled, but you might not be interested in all of them; you can selectively enable and disable the features you need from the elasticsearch.yml and kibana.yml configuration files. Elasticsearch supports a set of feature settings in the elasticsearch.yml file, and Kibana supports corresponding settings in the kibana.yml file. If X-Pack is installed on Logstash, you can disable monitoring by setting the xpack.monitoring.enabled property to false in the logstash.yml configuration file.

With this, we have successfully explored how to install and configure the X-Pack components in order to bundle the different capabilities of X-Pack into one package with Elasticsearch and Kibana. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of the Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, and metrics.
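Earlier we verified the secured Elasticsearch endpoint from a browser; the same check can be scripted. Here is a small Python sketch added for illustration (not from the book) that calls http://localhost:9200/ with HTTP basic authentication using the requests library; it assumes a local node and that you kept the example passwords set above.

import requests

ES_URL = "http://localhost:9200/"
USER, PASSWORD = "elastic", "elastic"   # assumed values from the setup-passwords step

# Without credentials, the secured cluster should reject the request (HTTP 401)
print("Unauthenticated status:", requests.get(ES_URL).status_code)

# With basic auth, we should get the cluster info JSON shown earlier
response = requests.get(ES_URL, auth=(USER, PASSWORD))
response.raise_for_status()
info = response.json()
print(info["cluster_name"], info["version"]["number"])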


Implementing face detection using the Haar Cascades and AdaBoost algorithm

Sugandha Lahoti
20 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book serves as an effective guide to using ensemble techniques to enhance machine learning models.[/box] In today’s tutorial, we will learn how to apply the AdaBoost classifier in face detection using Haar cascades. Face detection using Haar cascades Object detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their paper Rapid Object Detection using a Boosted Cascade of Simple Features in 2001. It is a machine-learning-based approach where a cascade function is trained from a lot of positive and negative images. It is then used to detect objects in other images. Here, we will work with face detection. Initially, the algorithm needs a lot of positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from it. Features are nothing but numerical information extracted from the images that can be used to distinguish one image from another; for example, a histogram (distribution of intensity values) is one of the features that can be used to define several characteristics of an image even without looking at the image, such as dark or bright image, the intensity range of the image, contrast, and so on. We will use Haar features to detect faces in an image. Here is a figure showing different Haar features: These features are just like the convolution kernel; to know about convolution, you need to wait for the following chapters. For a basic understanding, convolutions can be described as in the following figure: So we can summarize convolution with these steps: Pick a pixel location from the image. Now crop a sub-image with the selected pixel as the center from the source image with the same size as the convolution kernel. Calculate an element-wise product between the values of the kernel and sub- image. Add the result of the product. Put the resultant value into the new image at the same place where you picked up the pixel location. Now we are going to do a similar kind of procedure, but with a slight difference for our images. Each feature of ours is a single value obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle. Now, all possible sizes and locations of each kernel are used to calculate plenty of features. (Just imagine how much computation it needs. Even a 24x24 window results in over 160,000 features.) For each feature calculation, we need to find the sum of the pixels under the white and black rectangles. To solve this, we will use the concept of integral image; we will discuss this concept very briefly here, as it's not a part of our context. Integral image Integral images are those images in which the pixel value at any (x,y) location is the sum of the all pixel values present before the current pixel. Its use can be understood by the following example: Image on the left and the integral image on the right. Let's see how this concept can help reduce computation time; let us assume a matrix A of size 5x5 representing an image, as shown here: Now, let's say we want to calculate the average intensity over the area highlighted: Region for addition Normally, you'd do the following: 9 + 1 + 2 + 6 + 0 + 5 + 3 + 6 + 5 = 37 37 / 9 = 4.11 This requires a total of 9 operations. 
Doing the same for 100 such operations would require: 100 * 9 = 900 operations. Now, let us first make a integral image of the preceding image: Making this image requires a total of 56 operations. Again, focus on the highlighted portion: To calculate the avg intensity, all you have to do is: (76 - 20) - (24 - 5) = 37 37 / 9 = 4.11 This required a total of 4 operations. To do this for 100 such operations, we would require: 56 + 100 * 4 = 456 operations. For just a hundred operations over a 5x5 matrix, using an integral image requires about 50% less computations. Imagine the difference it makes for large images and other such operations. Creation of an integral image changes other sum difference operations by almost O(1) time complexity, thereby decreasing the number of calculations. It simplifies the calculation of the sum of pixels—no matter how large the number of pixels—to an operation involving just four pixels. Nice, isn't it? It makes things superfast. However, among all of these features we calculated, most of them are irrelevant. For example, consider the following image. The top row shows two good features. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks. The second feature selected relies on the property that the eyes are darker than the bridge of the nose. But the same windows applying on cheeks or any other part is irrelevant. So how do we select the best features out of 160000+ features? It is achieved by AdaBoost. To do this, we apply each and every feature on all the training images. For each feature, it finds the best threshold that will classify the faces as positive and negative. Obviously, there will be errors or misclassifications. We select the features with the minimum error rate, which means they are the features that best classify the face and non-face images. Note: The process is not as simple as this. Each image is given an equal weight in the       beginning. After each classification, the weights of misclassified images are increased. Again, the same process is done. New error rates are calculated among the new weights. This process continues until the required accuracy or error rate is achieved or the required number of features is found. The final classifier is a weighted sum of these weak classifiers. It is called weak because it alone can't classify the image, but together with others, it forms a strong classifier. The paper says that even 200 features provide detection with 95% accuracy. Their final setup had around 6,000 features. (Imagine a reduction from 160,000+ to 6000 features. That is a big gain.) Face detection framework using the Haar cascade and AdaBoost algorithm So now, you take an image take each 24x24 window, apply 6,000 features to it, and check if it is a face or not. Wow! Wow! Isn't this a little inefficient and time consuming? Yes, it is. The authors of the algorithm have a good solution for that. In an image, most of the image region is non-face. So it is a better idea to have a simple method to verify that a window is not a face region. If it is not, discard it in a single shot. Don’t process it again. Instead, focus on the region where there can be a face. This way, we can find more time to check a possible face region. For this, they introduced the concept of a cascade of classifiers. 
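To ground the arithmetic above, here is a short NumPy sketch added for this excerpt (the 5x5 values are randomly generated and differ from the figures in the book). It shows that a region sum over an integral image needs only four lookups, matching the direct nine-element sum.

import numpy as np

# A small "image" with random intensities (values are illustrative)
rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(5, 5))

# Integral image: entry (y, x) holds the sum of all pixels above and to the left, inclusive
integral = image.cumsum(axis=0).cumsum(axis=1)

def region_sum(ii, top, left, bottom, right):
    """Sum of image[top:bottom+1, left:right+1] using only four integral-image lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

# The four-lookup result matches the direct nine-element sum over the 3x3 region
assert region_sum(integral, 1, 1, 3, 3) == image[1:4, 1:4].sum()
print(region_sum(integral, 1, 1, 3, 3) / 9)  # average intensity of the region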
However, among all the features we calculate, most are irrelevant. For example, consider two good features: the first focuses on the property that the region of the eyes is often darker than the region of the nose and cheeks, while the second relies on the property that the eyes are darker than the bridge of the nose. The same windows applied to the cheeks or any other part of the face are irrelevant. So how do we select the best features out of 160,000+? This is achieved by AdaBoost. To do this, we apply each and every feature to all the training images. For each feature, we find the best threshold that will classify the faces as positive and negative. Obviously, there will be errors or misclassifications. We select the features with the minimum error rate, which means they are the features that best separate the face and non-face images.

Note: The process is not quite as simple as this. Each image is given an equal weight at the beginning. After each classification, the weights of the misclassified images are increased, and the same process is repeated; new error rates are calculated with the new weights. This continues until the required accuracy or error rate is achieved, or the required number of features is found.

The final classifier is a weighted sum of these weak classifiers. They are called weak because each alone can't classify the image, but together they form a strong classifier. The paper says that even 200 features provide detection with 95% accuracy; the authors' final setup had around 6,000 features (imagine a reduction from 160,000+ to 6,000 features, which is a big gain).

Face detection framework using the Haar cascade and AdaBoost algorithm

So now, you take an image, take each 24x24 window, apply the 6,000 features to it, and check whether it is a face or not. Wow! Isn't this a little inefficient and time-consuming? Yes, it is, and the authors of the algorithm have a good solution for that. In an image, most of the region is non-face, so it is better to have a simple method to verify that a window is not a face region and, if it is not, discard it in a single shot and not process it again. Instead, focus on the regions where there can be a face; this way, more time can be spent checking possible face regions. For this, they introduced the concept of a cascade of classifiers. Instead of applying all 6,000 features to a window, the features are grouped into different stages of classifiers and applied one by one (normally, the first few stages contain very few features). If a window fails the first stage, it is discarded, and the remaining features are not considered for it. If it passes, the second stage of features is applied, and the process continues; a window that passes all stages is a face region. How cool is this plan! The authors' detector had 6,000+ features arranged in 38 stages, with 1, 10, 25, 25, and 50 features in the first five stages (the two features described above were actually obtained as the best two features from AdaBoost). According to the authors, on average, only 10 of the 6,000+ features are evaluated per subwindow. So this is a simple, intuitive explanation of how Viola-Jones face detection works; read the paper for more details. If you found this post useful, do check out the book Ensemble Machine Learning to learn about different machine learning aspects such as bagging, boosting, and stacking.
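In practice, you rarely train this pipeline from scratch: OpenCV ships a pretrained Viola-Jones frontal-face cascade. The following short Python sketch, added for illustration (not from the book), runs that detector on an image; the file name test.jpg is a placeholder, and it assumes a recent opencv-python package where cv2.data is available.

import cv2

# Load the pretrained frontal-face Haar cascade bundled with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# Read an image and convert it to grayscale (the detector works on intensity values)
image = cv2.imread("test.jpg")  # placeholder path, replace with your own image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Slide windows at multiple scales; each detection is (x, y, width, height)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around every detected face and save the result
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces_detected.jpg", image)
print("Detected", len(faces), "face(s)")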


CRUD (Create, Read, Update, and Delete) Operations with Elasticsearch

Pravin Dhandre
19 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book is for beginners who want to start performing distributed search analytics and visualization using core functionalities of Elasticsearch, Kibana and Logstash.[/box] In this tutorial, we will look at how to perform basic CRUD operations using Elasticsearch. Elasticsearch has a very well designed REST API, and the CRUD operations are targeted at documents. To understand how to perform CRUD operations, we will cover the following APIs. These APIs fall under the category of Document APIs that deal with documents: Index API Get API Update API Delete API Index API In Elasticsearch terminology, adding (or creating) a document into a type within an index of Elasticsearch is called an indexing operation. Essentially, it involves adding the document to the index by parsing all fields within the document and building the inverted index. This is why this operation is known as an indexing operation. There are two ways we can index a document: Indexing a document by providing an ID Indexing a document without providing an ID Indexing a document by providing an ID We have already seen this version of the indexing operation. The user can provide the ID of the document using the PUT method. The format of this request is PUT /<index>/<type>/<id>, with the JSON document as the body of the request: PUT /catalog/product/1 { "sku": "SP000001", "title": "Elasticsearch for Hadoop", "description": "Elasticsearch for Hadoop", "author": "Vishal Shukla", "ISBN": "1785288997", "price": 26.99 } Indexing a document without providing an ID If you don't want to control the ID generation for the documents, you can use the POST method. The format of this request is POST /<index>/<type>, with the JSON document as the body of the request: POST /catalog/product { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } The ID in this case will be generated by Elasticsearch. It is a hash string, as highlighted in the response: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true } As per pure REST conventions, POST is used for creating a new resource and PUT is used for updating an existing resource. Here, the usage of PUT is equivalent to saying I know the ID that I want to assign, so use this ID while indexing this document. Get API The Get API is useful for retrieving a document when you already know the ID of the document. It is essentially a get by primary key operation: GET /catalog/product/AVrASKqgaBGmnAMj1SBe The format of this request is GET /<index>/<type>/<id>. The response would be as Expected: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "found": true, "_source": { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } } Update API The Update API is useful for updating the existing document by ID. The format of an update request is POST <index>/<type>/<id>/_update with a JSON request as the body: POST /catalog/product/1/_update { "doc": { "price": "28.99" } } The properties specified under the "doc" element are merged into the existing document. 
The previous version of this document with ID 1 had a price of 26.99. This update operation just updates the price and leaves the other fields of the document unchanged. This type of update, in which "doc" is specified and used as a partial document to merge with an existing document, is one of several supported update types. The response to the update request is as follows:

{ "_index": "catalog", "_type": "product", "_id": "1", "_version": 2, "result": "updated", "_shards": { "total": 2, "successful": 1, "failed": 0 } }

Internally, Elasticsearch maintains the version of each document; whenever a document is updated, the version number is incremented. The partial update we have just seen works only if the document existed beforehand; if the document with the given ID did not exist, Elasticsearch returns an error saying that the document is missing. Let us now see how to do an upsert operation using the Update API. The term upsert loosely means update or insert, that is, update the document if it exists, otherwise insert a new document. The doc_as_upsert parameter checks whether the document with the given ID already exists and merges the provided doc with the existing document; if the document with the given ID doesn't exist, a new document is inserted with the given contents. The following example uses doc_as_upsert to merge into the document with ID 3, or to insert a new document if it doesn't exist:

POST /catalog/product/3/_update
{ "doc": { "author": "Albert Paro", "title": "Elasticsearch 5.0 Cookbook", "description": "Elasticsearch 5.0 Cookbook Third Edition", "price": "54.99" }, "doc_as_upsert": true }

We can also update the value of a field based on the existing value of that field or another field in the document. The following update uses an inline script to increase the price by two for a specific product:

POST /catalog/product/AVrASKqgaBGmnAMj1SBe/_update
{ "script": { "inline": "ctx._source.price += params.increment", "lang": "painless", "params": { "increment": 2 } } }

Scripting support allows the existing value to be read, incremented by a variable, and stored back in a single operation. The inline script used here is written in Painless, Elasticsearch's own scripting language; the syntax for incrementing an existing variable is similar to that of most other programming languages.

Delete API

The Delete API lets you delete a document by ID:

DELETE /catalog/product/AVrASKqgaBGmnAMj1SBe

The response to the delete operation is as follows:

{ "found": true, "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 4, "result": "deleted", "_shards": { "total": 2, "successful": 1, "failed": 0 } }

This is how basic CRUD operations are performed with Elasticsearch using the simple Document APIs. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 and start building end-to-end real-time data processing solutions for your enterprise analytics applications.
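If you prefer to drive these same Document APIs from code rather than a REST console, here is a small Python sketch added for illustration (not from the book). It replays the index, get, update, and delete calls above using the requests library; the URL and the assumption that the local Elasticsearch 6.0 node has no security enabled are illustrative, so adjust them to your setup.

import requests

BASE = "http://localhost:9200"  # assumed local node without security enabled

# Index API: create a document with an explicit ID (PUT /<index>/<type>/<id>)
doc = {"sku": "SP000001", "title": "Elasticsearch for Hadoop", "price": 26.99}
print(requests.put(f"{BASE}/catalog/product/1", json=doc).json()["result"])

# Get API: retrieve the document by ID
print(requests.get(f"{BASE}/catalog/product/1").json()["_source"])

# Update API: partial update that merges the "doc" fields into the document
update = {"doc": {"price": 28.99}}
print(requests.post(f"{BASE}/catalog/product/1/_update", json=update).json()["_version"])

# Delete API: remove the document by ID
print(requests.delete(f"{BASE}/catalog/product/1").json()["result"])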


How to Classify Digits using Keras and TensorFlow

Sugandha Lahoti
19 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book provides a practical approach to building efficient machine learning models using ensemble techniques with real-world use cases.[/box] Today we will look at how we can create, train, and test a neural network to perform digit classification using Keras and TensorFlow. This article uses MNIST dataset with images of handwritten digits.It contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset. There have been a number of scientific papers on attempts to achieve the lowest error rate. One paper, by using a hierarchical system of CNNs, manages to get an error rate on the MNIST database of 0.23 percent. The original creators of the database keep a list of some of the methods tested on it. In their original paper, they used a support vector machine to get an error rate of 0.8 percent. Images in the dataset look like this: So let's not waste our time and start implementing our very first neural network in Python. Let’s start the code by importing the supporting projects. # Imports for array-handling and plotting import numpy as np import matplotlib import matplotlib.pyplot as plt Keras already has the MNIST dataset as a sample dataset, so we can import it as it is. Generally, it downloads the data over the internet and stores it into the database. So, if your system does not have the dataset, Internet will be required to download it: # Keras imports for the dataset and building our neural network from keras.datasets import mnist Now, we will import the Sequential and load_model classes from the keras.model class. We are working with sequential networks as all layers will be in forward sequence only. We are not using any split in the layers. The Sequential class will create a sequential model by combining the layers sequentially. The load_model class will help us to load the trained model for testing and evaluation purposes: #Import Sequential and Load model for creating and loading model from keras.models import Sequential, load_model In the next line, we will call three types of layers from the keras library. Dense layer means a fully connected layer; that is, each neuron of current layer will have a connection to the each neuron of the previous as well as next layer. The dropout layer is for reducing overfitting in our model. It randomly selects some neurons and does not use them for training for that iteration. So there are less chances that two different neurons of the same layer learn the same features from the input. By doing this, it prevents redundancy and correlation between neurons in the network, which eventually helps prevent overfitting in the network. The activation layer applies the activation function to the output of the neuron. We will use rectified linear units (ReLU) and the softmax function as the activation layer. We will discuss their operation when we use them in network creation: #We will use Dense, Drop out and Activation layers from keras.layers.core import Dense, Dropout, Activation from keras.utils import np_utils So we will start with loading our dataset by mnist.load. It will give us training and testing input and output instances. 
Then, we will visualize some instances so that we know what kind of data we are dealing with. We will use matplotlib to plot them; as the images contain gray values, we can also easily plot a histogram of an image, which gives us the pixel intensity distribution:

# Let's start by loading our dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Plot the digits to verify
plt.figure()
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.tight_layout()
    plt.imshow(X_train[i], cmap='gray', interpolation='none')
    plt.title("Digit: {}".format(y_train[i]))
    plt.xticks([])
    plt.yticks([])
plt.show()

Next, let's analyze the histogram of an image:

# Let's analyze the histogram of the image
plt.figure()
plt.subplot(2,1,1)
plt.imshow(X_train[0], cmap='gray', interpolation='none')
plt.title("Digit: {}".format(y_train[0]))
plt.xticks([])
plt.yticks([])
plt.subplot(2,1,2)
plt.hist(X_train[0].reshape(784))
plt.title("Pixel Value Distribution")
plt.show()

# Print the shape before we reshape and normalize
print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)

Currently, this is the shape of the dataset we have:

X_train shape (60000, 28, 28)
y_train shape (60000,)
X_test shape (10000, 28, 28)
y_test shape (10000,)

As we are working with 2D images, we cannot train them directly with this neural network; there are different types of neural networks for training on 2D images, which we will discuss later. To remove this data compatibility issue, we reshape the input images into 1D vectors of 784 values (as the images have size 28x28). We have 60,000 such images in the training data and 10,000 in the testing data:

# As we have data in image form, convert it to row vectors
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

We then normalize the input data into the range of 0 to 1, which leads to faster convergence of the network. The purpose of normalizing data is to transform the dataset into a bounded range while preserving the relativity between the pixel values. There are various kinds of normalization techniques available, such as mean normalization, min-max normalization, and so on:

# Normalizing the data to between 0 and 1 to help with the training
X_train /= 255
X_test /= 255

# Print the final input shape ready for training
print("Train matrix shape", X_train.shape)
print("Test matrix shape", X_test.shape)

Let's print the shape of the data:

Train matrix shape (60000, 784)
Test matrix shape (10000, 784)

Now, our training set contains the output variable as discrete class values; say, for an image of the number eight, the output class value is eight. But our output neurons can only produce outputs in the range of zero to one. So, we need to convert the discrete output values into categorical (one-hot) values, so that eight can be represented as a vector of zeros and ones with a length equal to the number of classes.
For example, for the number eight, the output class vector should be: 8 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# One-hot encoding using keras' numpy-related utilities
n_classes = 10
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)

After one-hot encoding of our output, the variable's shape will be modified as:

Shape before one-hot encoding: (60000,)
Shape after one-hot encoding: (60000, 10)

So, you can see that we now have an output variable of 10 dimensions instead of 1. Now, we are ready to define our network parameters and layer architecture. We will start creating our network by creating a Sequential class object, model. We can add different layers to this model as we have done in the following code block. We will create a network with an input layer, two hidden layers, and one output layer. As the input layer is always our data layer, it doesn't have any learning parameters. For the hidden layers, we will use 512 neurons each. At the end, for a 10-dimensional output, we will use 10 neurons in the final layer:

# Here, we will create the model of our ANN
# Create a linear stack of layers with the sequential model
model = Sequential()
#Input layer with 512 neurons
model.add(Dense(512, input_shape=(784,)))
#We will use relu as the activation
model.add(Activation('relu'))
#Add dropout to prevent over-fitting
model.add(Dropout(0.2))
#Add a hidden layer with 512 neurons with relu activation
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
#This is our output layer with 10 neurons
model.add(Dense(10))
model.add(Activation('softmax'))

After defining the preceding structure, the model summary shows the shape of the data matrix in each layer, and it is quite intuitive. As the input of length 784 is fed into 512 neurons, the weight matrix between the input and the first hidden layer has the shape 784 x 512. It is calculated similarly for the other two layers. We have used two different kinds of activation functions here. The first one is ReLU and the second one is the softmax function. Let's take some time to discuss these two. ReLU prevents the output of a neuron from becoming negative. The expression for the ReLU function is f(x) = max(0, x), so if any neuron produces an output less than 0, it is converted to 0. In conditional form: f(x) = x if x > 0, and f(x) = 0 otherwise. You just need to know that ReLU is a slightly better activation function than sigmoid. The sigmoid function, sigma(x) = 1 / (1 + e^(-x)), produces an S-shaped curve that flattens out as it approaches its minimum (0) or maximum (1) values. At the time of gradient calculation, values in those saturated regions result in a very small gradient, which causes a very small change in the weight values; that change is not sufficient to optimize the cost function. As we move further backward during backpropagation, that small change becomes smaller still and almost reaches zero. This problem is known as the problem of vanishing gradients. So, in practical cases, we avoid sigmoid activation when our network has many stacked layers. The ReLU activation, in contrast, is a straight line for all positive inputs, so its gradient is a non-zero constant (equal to 1) whenever the output is non-zero. Thus, it helps prevent the problem of vanishing gradients.
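To see the vanishing gradient argument in numbers, here is a minimal NumPy sketch (the function names and the sample inputs are purely illustrative) that compares the derivative of the sigmoid with the derivative of ReLU at a few points:

import numpy as np

# Sigmoid and its derivative
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# ReLU derivative: 1 for positive inputs, 0 otherwise
def relu_grad(x):
    return (x > 0).astype(float)

xs = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
print("inputs:        ", xs)
print("sigmoid grads: ", np.round(sigmoid_grad(xs), 5))
print("relu grads:    ", relu_grad(xs))

For large positive or negative inputs the sigmoid derivative collapses towards zero, while the ReLU derivative stays at 1 for every positive input, which is exactly why deep, stacked layers train more reliably with ReLU.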
We have discussed the significance of the dropout layer earlier and I don’t think that it is further required. We are using 20% neuron dropout during the training time. We will not use the dropout layer during the testing time. Now, we are all set to train our very first ANN, but before starting training, we have to define the values of the network hyperparameters. We will use SGD using adaptive momentum. There are many algorithms to optimize the performance of the SGD algorithm. You just need to know that adaptive momentum is a better choice than simple gradient descent because it modifies the learning rate using previous errors created by the network. So, there are less chances of getting trapped at the local minima or missing the global minima conditions. We are using SGD with ADAM, using its default parameters. Here, we use  batch_size of 128 samples. That means we will update the weights after calculating the error on these 128 samples. It is a sufficient batch size for our total data population. We are going to train our network for 20 epochs for the time being. Here, one epoch means one complete training cycle of all mini-batches. Now, let's start training our network: #Here we will be compiling the sequential model model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') # Start training the model and saving metrics in history history = model.fit(X_train, Y_train, batch_size=128, epochs=20, verbose=2, validation_data=(X_test, Y_test)) We will save our trained model on disk so that we can use it for further fine-tuning whenever required. We will store the model in the HDF5 file format: # Saving the model on disk path2save = 'E:/PyDevWorkSpaceTest/Ensembles/Chapter_10/keras_mnist.h5' model.save(path2save) print('Saved trained model at %s ' % path2save) # Plotting the metrics fig = plt.figure() plt.subplot(2,1,1) plt.plot(history.history['acc']) plt.plot(history.history['val_acc']) plt.title('model accuracy') plt.ylabel('accuracy') plt.xlabel('epoch') plt.legend(['train', 'test'], loc='lower right') plt.subplot(2,1,2) plt.plot(history.history['loss']) plt.plot(history.history['val_loss']) plt.title('model loss') plt.ylabel('loss') plt.xlabel('epoch') plt.legend(['train', 'test'], loc='upper right') plt.tight_layout() plt.show() Let's analyze the loss with each iteration during the training of our neural network; we will also plot the accuracies for validation and test set. You should always monitor validation and training loss as it can help you know whether your model is underfitting or overfitting: Test Loss 0.0824991761778 Test Accuracy 0.9813 As you can see, we are getting almost similar performance for our training and validation sets in terms of loss and accuracy. You can see how accuracy is increasing as the number of epochs increases. This shows that our network is learning. Now, we have trained and stored our model. 
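Before moving on, here is a quick sanity check on what those hyperparameters mean in practice; this small calculation (assuming the full 60,000-sample training set and the settings used above) gives the number of mini-batches per epoch and the total number of weight updates performed during training:

import math

n_samples = 60000      # size of the MNIST training set
batch_size = 128       # samples per weight update
epochs = 20            # complete passes over all mini-batches

batches_per_epoch = math.ceil(n_samples / batch_size)   # 469 updates per epoch
total_updates = batches_per_epoch * epochs               # 9380 updates in total
print(batches_per_epoch, total_updates)

So 20 epochs correspond to roughly 9,380 mini-batch weight updates with the Adam optimizer.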
It's time to reload it and test it with the 10000 test instances: #Let's load the model for testing data path2save = 'D:/PyDevWorkspace/EnsembleMachineLearning/Chapter_10/keras_mnist.h5' mnist_model = load_model(path2save) #We will use Evaluate function loss_and_metrics = mnist_model.evaluate(X_test, Y_test, verbose=2) print("Test Loss", loss_and_metrics[0]) print("Test Accuracy", loss_and_metrics[1]) #Load the model and create predictions on the test set mnist_model = load_model(path2save) predicted_classes = mnist_model.predict_classes(X_test) #See which we predicted correctly and which not correct_indices = np.nonzero(predicted_classes == y_test)[0] incorrect_indices = np.nonzero(predicted_classes != y_test)[0] print(len(correct_indices)," classified correctly") print(len(incorrect_indices)," classified incorrectly") So, here is the performance of our model on the test set: 9813  classified correctly 187  classified incorrectly As you can see, we have misclassified 187 instances out of 10000, which I think is a very good accuracy on such a complex dataset. In the next code block, we will analyze such cases where we detect false labels: #Adapt figure size to accomodate 18 subplots plt.rcParams['figure.figsize'] = (7,14) plt.figure() # plot 9 correct predictions for i, correct in enumerate(correct_indices[:9]): plt.subplot(6,3,i+1) plt.imshow(X_test[correct].reshape(28,28), cmap='gray', interpolation='none') plt.title( "Predicted: {}, Truth: {}".format(predicted_classes[correct], y_test[correct])) plt.xticks([]) plt.yticks([]) # plot 9 incorrect predictions for i, incorrect in enumerate(incorrect_indices[:9]): plt.subplot(6,3,i+10) plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none') plt.title( "Predicted {}, Truth: {}".format(predicted_classes[incorrect], y_test[incorrect])) plt.xticks([]) plt.yticks([]) plt.show() If you look closely, our network is failing on such cases that are very difficult to identify by a human, too. So, we can say that we are getting quite a good accuracy from a very simple model. We saw how to create, train, and test a neural network to perform digit classification using Keras and TensorFlow. If you found our post  useful, do check out this book Ensemble Machine Learning to build ensemble models using TensorFlow and Python libraries such as scikit-learn and NumPy.  

How to perform predictive forecasting in SAP Analytics Cloud

Kunal Chaudhari
17 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Riaz Ahmed titled Learning SAP Analytics Cloud. This book involves features of the SAP Analytics Cloud which will help you collaborate, predict and solve business intelligence problems with cloud computing.[/box] In this article we will learn how to use predictive forecasting with the help of a trend time series chart to see revenue trends in a range of a year. Time series forecasting is only supported for planning models in SAP Analytics Cloud. So, you need planning rights and a planning license to run a predictive time-series forecast. However, you can add predictive forecast by creating a trend time series chart based on an analytical model to estimate future values. In this article, you will use a trend time series chart to view net revenue trends throughout the range of a year. A predictive time-series forecast runs an algorithm on historical data to predict future values for specific measures. For this type of chart, you can forecast a maximum of three different measures, and you have to specify the time for the prediction and the past time periods to use as historical data. Add a blank chart from the Insert toolbar. Set Data Source to the BestRun_Demo model. Select the Time Series chart from the Trend category. In the Measures section, click on the Add Measure link, and select Net Revenue. Finally, click on the Add Dimension link in the Time section, and select Date as the chart’s dimension: The output of your selections is depicted in the first view in the following screenshot. Every chart you create on your story page has its own unique elements that let you navigate and drill into details. The trend time series chart also allows you to zoom in to different time periods and scroll across the entire timeline. For example, the first figure in the following illustration provides a one-year view (A) of net revenue trends, that is from January to December 2015. Click on the six months link (B) to see the corresponding output, as illustrated in the second view. Drag the rectangle box (C) to the left or right to scroll across the entire timeline: Adding a forecast Click on the last data point representing December 2015, and select Add Forecast from the More Actions menu (D) to add a forecast: You see the Predictive Forecast panel on the right side, which displays the maximum number of forecast periods. Using the slider (E) in this section, you can reduce the number of forecast periods. By default, you see the maximum number (in the current scenario, it is seven) in the slider, which is determined by the amount of historical data you have. In the Forecast On section, you see the measure (F) you selected for the chart. If required, you can forecast a maximum of three different measures in this type of chart that you can add in the Builder panel. For the time being, click on OK to accept the default values for the forecast, as illustrated in the following screenshot: The forecast will be added to the chart. It is indicated by a highlighted area (G) and a dotted line (H). Click on the 1 year link (I) to see an output similar to the one illustrated in the following screenshot under the Modifying forecast section. As you can see, there are several data points that represent forecast. The top and bottom of the highlighted area indicate the upper and lower bounds of the prediction range, and the data points fall in the middle (on the dotted line) of the forecast range for each time period. 
Select a data point to see the Upper Confidence Bound (J) and Lower Confidence Bound (K) values. Modifying forecast You can modify a forecast using the link provided in the Forecast section at the bottom of the Builder panel. Select the chart, and scroll to the bottom of the Builder panel. Click on the Edit icon (L) to see the Predictive Forecast panel again. Review your settings, and make the required changes in this panel. For example, drag the slider toward the left to set the Forecast Periods value to 3 (M). Click on OK to save your settings. The chart should now display the forecast for three months--January, February, and March 2016 (N): Adding a time calculation If you want to display values such as year-over-year sales trends or year-to-date totals in your chart, then you can utilize the time calculation feature of SAP Analytics Cloud. The time calculation feature provides you with several calculation options. In order to use this feature, your chart must contain a time dimension with the appropriate level of granularity. For example, if you want to see quarter-over-quarter results, the time dimension must include quarterly or even monthly results. The space constraint prevents us from going through all these options. However, we will utilize the year-over-year option to compare yearly results in this article to get an idea about this feature. Execute the following instructions to first create a bar chart that shows the sold quantities of the four product categories. Then, add a time calculation to the chart to reveal the year-over-year changes in quantity sold for each category. As usual, add a blank chart to the page using the chart option on the Insert toolbar. Select the Best Run model as Data Source for the chart. Select the Bar/Column chart from the Comparison category. In the Measures section, click on the Add Measure link, and select Quantity Sold. Click on the Add Dimension link in the Dimensions section, and select Product as the chart’s dimension, as shown here: The chart appears on the page. At this stage, if you click on the More icon representing Quantity sold, you will see that the Add Time Calculation option (A) is grayed out. This is because time calculations require a time dimension to the chart, which we will add next. Click on the Add Dimension link in the Dimensions section, and select Date to add this time dimension to the chart. The chart transforms, as illustrated in the following screenshot: To display the results in the chart at the year level, you need to apply a filter as follows: Click on the filter icon in the Date dimension, and select Filter by Member. In the Set Members for Date dialog box, expand the all node, and select 2014, 2015, and 2016, individually. Once again, the chart changes to reflect the application of filter, as illustrated in the following screenshot: Now that a time dimension has been added to the chart, we can add a time calculation to it as follows: Click on the More icon in the Quantity sold measure. Select Add Time Calculation from the menu. Choose Year Over Year. New bars (A) and a corresponding legend (B) will be added to the chart, which help you compare yearly results, as shown in the following screenshot: To summarize, we provided hands-on exposure on predictive forecasting in SAP Analytics Cloud, where you learned about how to use a trend time series chart to view net revenue trends throughout the range of a year. 
If you enjoyed this excerpt, check out the book Learning SAP Analytics Cloud to get an understanding of the SAP Analytics Cloud platform and how to create better BI solutions.

How to implement In-Memory OLTP on SQL Server in Linux

Fatema Patrawala
17 Feb 2018
11 min read
[box type="note" align="" class="" width=""]Below given article is an excerpt from the book SQL Server on Linux, authored by Jasmin Azemović. This book is a handy guide to setting up and implementing your SQL Server solution on the open source Linux platform.[/box] Today we will learn about the basics of In-Memory OLTP and how to implement it on SQL Server on Linux through the following topics: Elements of performance What is In-Memory OLTP Implementation Elements of performance How do you know if you have a performance issue in your database environment? Well, let's put it in these terms. You notice it (the good), users start calling technical support and complaining about how everything is slow (the bad) or you don't know about your performance issues (the ugly). Try to never get in to the last category. The good Achieving best performance is an iterative process where you need to define a set of tasks that you will execute on a regular basics and monitor their results. Here is a list that will give you an idea and guide you through this process: Establish the baseline Define the problem Fix one thing at a time Test and re-establish the baseline Repeat everything Establishing the baseline is the critical part. In most case scenarios, it is not possible without real stress testing. Example: How many users' systems can you handle on the current configuration? The next step is to measure the processing time. Do your queries or stored procedures require milliseconds, seconds, or minutes to execute? Now you need to monitor your database server using a set of tools and correct methodologies. During that process, you notice that some queries show elements of performance degradation. This is the point that defines the problem. Let's say that frequent UPDATE and DELETE operations are resulting in index fragmentation. The next step is to fix this issue with REORGANIZE or REBUILD index operations. Test your solution in the control environment and then in the production. Results can be better, same, or worse. It depends and there is no magic answer here. Maybe now something else is creating the problem: disk, memory, CPU, network, and so on. In this step, you should re-establish the old or a new baseline. Measuring performance process is something that never ends. You should keep monitoring the system and stay alert. The bad If you are in this category, then you probably have an issue with establishing the baseline and alerting the system. So, users are becoming your alerts and that is a bad thing. The rest of the steps are the same except re-establishing the baseline. But this can be your wake-up call to move yourself in the good category. The ugly This means that you don't know or you don't want to know about performance issues. The best case scenario is a headline on some news portal, but that is the ugly thing. Every decent DBA should try to be light years away from this category. What do you need to start working with performance measuring, monitoring, and fixing? Here are some tips that can help you: Know the data and the app Know your server and its capacity Use dynamic management views—DMVs: Sys.dm_os_wait_stats Sys.dm_exec_query_stats sys.dm_db_index_operational_stats Look for top queries by reads, writes, CPU, execution count Put everything in to LibreOffice Calc or another spreadsheet application and do some basic comparative math Fortunately, there is something in the field that can make your life really easy. 
It can boost your environment to the scale of warp speed (I am a Star Trek fan). What is In-Memory OLTP? SQL Server In-Memory feature is unique in the database world. The reason is very simple; because it is built-in to the databases' engine itself. It is not a separate database solution and there are some major benefits of this. One of these benefits is that in most cases you don't have to rewrite entire SQL Server applications to see performance benefits. On average, you will see 10x more speed while you are testing the new In-Memory capabilities. Sometimes you will even see up to 50x improvement, but it all depends on the amount of business logic that is done in the database via stored procedures. The greater the logic in the database, the greater the performance increase. The more the business logic sits in the app, the less opportunity there is for performance increase. This is one of the reasons for always separating database world from the rest of the application layer. It has built-in compatibility with other non-memory tables. This way you can optimize thememory you have for the most heavily used tables and leave others on the disk. This also means you won't have to go out and buy expensive new hardware to make large InMemory databases work; you can optimize In-Memory to fit your existing hardware. In-Memory was started in SQL Server 2014. One of the first companies that has started to use this feature during the development of the 2014 version was Bwin. This is an online gaming company. With In-Memory OLTP they improved their transaction speed by 16x, without investing in new expensive hardware. The same company has achieved 1.2 Million requests/second on SQL Server 2016 with a single machine using In-Memory OLTP: https://blogs.msdn.microsoft.com/sqlcat/2016/10/26/how-bwin-is-using-sql-server-2016-in-memory-oltp-to-achieve-unprecedented-performance-and-scale/ Not every application will benefit from In-Memory OLTP. If an application is not suffering from performance problems related to concurrency, IO pressure, or blocking, it's probably not a good candidate. If the application has long-running transactions that consume large amounts of buffer space, such as ETL processing, it's probably not a good candidate either. The best applications for consideration would be those that run high volumes of small fast transactions, with repeatable query plans such as order processing, reservation systems, stock trading, and ticket processing. The biggest benefits will be seen on systems that suffer performance penalties from tables that are having concurrency issues related to a large number of users and locking/blocking. Applications that heavily use the tempdb for temporary tables could benefit from In-Memory OLTP by creating the table as memory optimized, and performing the expensive sorts, and groups, and selective queries on the tables that are memory optimized. In-Memory OLTP quick start An important thing to remember is that the databases that will contain memory-optimized tables must have a MEMORY_OPTIMIZED_DATA filegroup. This filegroup is used for storing the checkpoint needed by SQL Server to recover the memory-optimized tables. 
Here is a simple DDL SQL statement to create a database that is prepared for In-Memory tables: 1> USE master 2> GO 1> CREATE DATABASE InMemorySandbox 2> ON 3> PRIMARY (NAME = InMemorySandbox_data, 4> FILENAME = 5> '/var/opt/mssql/data/InMemorySandbox_data_data.mdf', 6> size=500MB), 7> FILEGROUP InMemorySandbox_fg 8> CONTAINS MEMORY_OPTIMIZED_DATA 9> (NAME = InMemorySandbox_dir, 10> FILENAME = 11> '/var/opt/mssql/data/InMemorySandbox_dir') 12> LOG ON (name = InMemorySandbox_log, 13> Filename= 14>'/var/opt/mssql/data/InMemorySandbox_data_data.ldf', 15> size=500MB) 16 GO   The next step is to alter the existing database and configure it to access memory-optimized tables. This part is helpful when you need to test and/or migrate current business solutions: --First, we need to check compatibility level of database. -- Minimum is 130 1> USE AdventureWorks 2> GO 3> SELECT T.compatibility_level 4> FROM sys.databases as T 5> WHERE T.name = Db_Name(); 6> GO compatibility_level ------------------- 120 (1 row(s) affected) --Change the compatibility level 1> ALTER DATABASE CURRENT 2> SET COMPATIBILITY_LEVEL = 130; 3> GO --Modify the transaction isolation level 1> ALTER DATABASE CURRENT SET 2> MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT=ON 3> GO --Finlay create memory optimized filegroup 1> ALTER DATABASE AdventureWorks 2> ADD FILEGROUP AdventureWorks_fg CONTAINS 3> MEMORY_OPTIMIZED_DATA 4> GO 1> ALTER DATABASE AdventureWorks ADD FILE 2> (NAME='AdventureWorks_mem', 3> FILENAME='/var/opt/mssql/data/AdventureWorks_mem') 4> TO FILEGROUP AdventureWorks_fg 5> GO   How to create memory-optimized table? The syntax for creating memory-optimized tables is almost the same as the syntax for creating classic disk-based tables. You will need to specify that the table is a memory-optimized table, which is done using the MEMORY_OPTIMIZED = ON clause. A memory-optimized table can be created with two DURABILITY values: SCHEMA_AND_DATA (default) SCHEMA_ONLY If you defined a memory-optimized table with DURABILITY=SCHEMA_ONLY, it means that changes to the table's data are not logged and the data is not persisted on disk. However, the schema is persisted as part of the database metadata. A side effect is that an empty table will be available after the database is recovered during a restart of SQL Server on Linux service.   The following table is a summary of key differences between those two DURABILITY Options. When you create a memory-optimized table, the database engine will generate DML routines just for accessing that table, and load them as DLLs files. SQL Server itself does not perform data manipulation, instead it calls the appropriate DLL: Now let's add some memory-optimized tables to our sample database:     1> USE InMemorySandbox 2> GO -- Create a durable memory-optimized table 1> CREATE TABLE Basket( 2> BasketID INT IDENTITY(1,1) 3> PRIMARY KEY NONCLUSTERED, 4> UserID INT NOT NULL INDEX ix_UserID 5> NONCLUSTERED HASH WITH (BUCKET_COUNT=1000000), 6> CreatedDate DATETIME2 NOT NULL,   7> TotalPrice MONEY) WITH (MEMORY_OPTIMIZED=ON) 8> GO -- Create a non-durable table. 
1> CREATE TABLE UserLogs ( 2> SessionID INT IDENTITY(1,1) 3> PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT=400000), 4> UserID int NOT NULL, 5> CreatedDate DATETIME2 NOT NULL, 6> BasketID INT, 7> INDEX ix_UserID 8> NONCLUSTERED HASH (UserID) WITH (BUCKET_COUNT=400000)) 9> WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY) 10> GO -- Add some sample records 1> INSERT INTO UserLogs VALUES 2> (432, SYSDATETIME(), 1), 3> (231, SYSDATETIME(), 7), 4> (256, SYSDATETIME(), 7), 5> (134, SYSDATETIME(), NULL), 6> (858, SYSDATETIME(), 2), 7> (965, SYSDATETIME(), NULL) 8> GO 1> INSERT INTO Basket VALUES 2> (231, SYSDATETIME(), 536), 3> (256, SYSDATETIME(), 6547), 4> (432, SYSDATETIME(), 23.6), 5> (134, SYSDATETIME(), NULL) 6> GO -- Checking the content of the tables 1> SELECT SessionID, UserID, BasketID 2> FROM UserLogs 3> GO 1> SELECT BasketID, UserID 2> FROM Basket 3> GO   What is natively compiled stored procedure? This is another great feature that comes comes within In-Memory package. In a nutshell, it is a classic SQL stored procedure, but it is compiled into machine code for blazing fast performance. They are stored as native DLLs, enabling faster data access and more efficient query execution than traditional T-SQL. Now you will create a natively compiled stored procedure to insert 1,000,000 rows into Basket: 1> USE InMemorySandbox 2> GO 1> CREATE PROCEDURE dbo.usp_BasketInsert @InsertCount int 2> WITH NATIVE_COMPILATION, SCHEMABINDING AS 3> BEGIN ATOMIC 4> WITH 5> (TRANSACTION ISOLATION LEVEL = SNAPSHOT, 6> LANGUAGE = N'us_english') 7> DECLARE @i int = 0 8> WHILE @i < @InsertCount 9> BEGIN 10> INSERT INTO dbo.Basket VALUES (1, SYSDATETIME() , NULL) 11> SET @i += 1 12> END 13> END 14> GO --Add 1000000 records 1> EXEC dbo.usp_BasketInsert 1000000 2> GO   The insert part should be blazing fast. Again, it depends on your environment (CPU, RAM, disk, and virtualization). My insert was done in less than three seconds, on an average machine. But significant improvement should be visible now. Execute the following SELECT statement and count the number of records:   1> SELECT COUNT(*) 2> FROM dbo.Basket 3> GO ----------- 1000004 (1 row(s) affected)   In my case, counting of one million records was less than one second. It is really hard to achieve this performance on any kind of disk. Let's try another query. We want to know how much time it will take to find the top 10 records where the insert time was longer than 10 microseconds:   1> SELECT TOP 10 BasketID, CreatedDate 2> FROM dbo.Basket 3> WHERE DATEDIFF 4> (MICROSECOND,'2017-05-30 15:17:20.9308732', CreatedDate) 5> >10 6> GO   Again, query execution time was less than a second. Even if you remove TOP and try to get all the records it will take less than a second (in my case scenario). Advantages of InMemory tables are more than obvious.   We learnt about the basic concepts of In-Memory OLTP and how to implement it on new and existing database. We also got to know that a memory-optimized table can be created with two DURABILITY values and finally, we created an In-Memory table. If you found this article useful, check out the book SQL Server on Linux, which covers advanced SQL Server topics, demonstrating the process of setting up SQL Server database solution in the Linux environment.        

How to use LabVIEW for data acquisition

Fatema Patrawala
17 Feb 2018
14 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Data Acquisition Using LabVIEW written by Behzad Ehsani. In this book you will learn to transform physical phenomena into computer-acceptable data using an object-oriented language.[/box] Today we will discuss basics of LabVIEW, focus on its installation with an example of a LabVIEW program which is generally known as Virtual Instrument (VI). Introduction to LabVIEW LabVIEW is a graphical developing and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while representation of a while loop in a text-based language such as C consists of several predefined, extremely compact, and sometimes extremely cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner. LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood. However, the ease of use and power of LabVIEW is somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the concept of drag and drop used in LabVIEW appears to do away with the required basics of programming concepts and classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment. While it is true that, in many higher-level development and testing environments, especially when using complicated test equipment and complex mathematical calculations or even creating embedded software, LabVIEW's approach will be a much more time-efficient and bug-free environment which otherwise would require several lines of code in a traditional text based programming environment, one must be aware of LabVIEW's strengths and possible weaknesses.   LabVIEW does not completely replace the need for traditional text based languages and, depending on the entire nature of a project, LabVIEW or another traditional text based language such as C may be the most suitable programming or test environment. Installing LabVIEW Installation of LabVIEW is very simple and it is just as routine as any modern-day program installation; that is, insert the DVD 1 and follow the onscreen guided installation steps. LabVIEW comes in one DVD for the Mac and Linux versions but in four or more DVDs for the Windows edition (depending on additional software, different licensing, and additional libraries and packages purchased). In this book, we will use the LabVIEW 2013 Professional Development version for Windows. Given the target audience of this book, we assume the user is fully capable of installing the program. Installation is also well documented by National Instruments (NI) and the mandatory 1-year support purchase with each copy of LabVIEW is a valuable source of live and e-mail help. Also, the NI website (www.ni.com) has many user support groups that are also a great source of support, example codes, discussion groups, local group events and meetings of fellow LabVIEW developers, and so on. It's worth noting for those who are new to the installation of LabVIEW that the installation DVDs include much more than what an average user would need and pay for. 
We do strongly suggest that you install additional software (beyond what has been purchased and licensed or immediately needed!). This additional software is fully functional in demo mode for 7 days, which may be extended for about a month with online registration. This is a very good opportunity to have hands-on experience with even more of the power and functionality that LabVIEW is capable of offering. The additional information gained by installing the other software available on the DVDs may help in further development of a given project. Just imagine, if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software may be possible in one location, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available and be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all the software on all the DVDs is selected: When installing a fresh version of LabVIEW, if you do decide to observe the given advice, make sure to click on the + sign next to each package you decide to install and prevent any installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to NI, is an ANSI C integrated development environment. Also note that, by default, NI device drivers are not selected to be installed. Device drivers are an essential part of any data acquisition and appropriate drivers for communications and instrument(s) control must be installed before LabVIEW can interact with external equipment. Also, note that device drivers (on Windows installations) come on a separate DVD, which means that one does not have to install device drivers at the same time that the main application and other modules are installed; they can be installed at any time later on. Almost all well-established vendors are packaging their product with LabVIEW drivers and example codes. If a driver is not readily available, NI has programmers that would do just that. But this would come at a cost to the user. VI Package Manager, now installed as a part of standard installation, is also a must these days. NI distributes third-party software and drivers and public domain packages via VI Package Manager. We are going to use examples using Arduino (http://www.arduino.cc) microcontrollers in later chapters of this book. Appropriate software and drivers for these microcontrollers are installed via VI Package Manager. You can install many public domain packages that further install many useful LabVIEW toolkits to a LabVIEW installation and can be used just as those that are delivered professionally by NI. Finally, note that the more modules, packages, and software that are selected to be installed, the longer it will take to complete the installation. This may sound like making an obvious point but, surprisingly enough, installation of all software on the three DVDs (for Windows) takes up over 5 hours! On a standard laptop or PC we used. Obviously, a more powerful PC (such as one with a solid state hard drive) may not take such long time. Basic LabVIEW VI Once the LabVIEW application is launched, by default two blank windows open simultaneously–a Front Panel and a Block Diagram window–and a VI is created: VIs are the heart and soul of LabVIEW. 
They are what separate LabVIEW from all other text-based development environments. In LabVIEW, everything is an object which is represented graphically. A VI may only consist of a few objects or hundreds of objects embedded in many subVIs. These graphical representations of a thing, be it a simple while loop, a complex mathematical concept such as Polynomial Interpolation, or simply a Boolean constant, are all graphically represented. To use an object, right-click inside the Block Diagram or Front Panel window, a pallet list appears. Follow the arrow and pick an object from the list of objects from subsequent pallet and place it on the appropriate window. The selected object can now be dragged and placed on different locations on the appropriate window and is ready to be wired. Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit objects onto the appropriate window: Each object has one (or several) wire connections going into it as input(s) and coming out as its output(s). A VI becomes functional when a minimum number of wires are appropriately connected to the input and output of one or more objects. Example 1 – counter with a gauge This is a fairly simple program with simple user interaction. Once the program has been launched, it uses a while loop to wait for the user input. This is a typical behavior of almost any user-friendly program. For example, if the user launches Microsoft Office, the program launches and waits for the user to pick a menu item, click on a button, or perform any other action that the program may provide. Similarly, this program starts execution but waits in a loop for the user to choose a command. In this case only a simple Start or Stop is available. If the Start button is clicked, the program uses a for loop function to simply count from 0 to 10 in intervals of 200 milliseconds. After each count is completed, the gauge on the Front Panel, the GUI part of the program, is updated to show the current count. The counter is then set to the zero location of the gauge and the program awaits subsequent user input. If the Start button is clicked again, this action is repeated, and, obviously, if the Stop button is clicked, the program exits. Although very simple, in this example, you can find many of the concepts that are often used in a much more elaborate program. Let's walk through the code and point out some of these concepts. The following steps not only walk the reader through the example code but are also a brief tutorial on how to use LabVIEW, how to utilize each working window, and how to wire objects. Launch LabVIEW and from the File menu, choose New VI and follow the steps:    Right-click on the Block Diagram window.    From Programming Functions, choose Structures and select While Loop.    Click (and hold) and drag the cursor to create a (resizable) rectangle. On the bottom-left corner, right-click on the wire to the stop loop and choose Create a control. Note that a Stop button appears on both the Block Diagram and Front panel windows. Inside the while loop box, right-click on the Block Diagram window and from Programming Function, choose Structures and select Case Structures. 
Click and (and hold) and drag the cursor to create a (resizable) rectangle. On the Front Panel window, next to the Stop button created, right-click and from Modern Controls, choose Boolean and select an OK button. Double-click on the text label of the OK button and replace the OK button text with Start. Note that an OK button is also created on the Block Diagram window and the text label on that button also changed when you changed the text label on the Front Panel window. On the Front Panel window, drag-and-drop the newly created Start button next to the tiny green question mark on the left-hand side of the Case Structure box, outside of the case structure but inside the while loop. Wire the Start button to the Case Structure. Inside the Case Structure box, right-click on the Block Diagram window and from Programming Function, choose Structures and select For Loop. Click and (and hold) and drag the cursor to create a (resizable) rectangle. Inside the Case Structure box, right-click on N on the top-left side of the Case Structure and choose Create Constant. An integer blue box with a value of 0 will be connected to the For Loop. This is the number of irritations the for loop is going to have. Change 0 to 11. Inside the For Loop box, right click on the Block Diagram widow and from Programming Function, choose Timing and select Wait(ms). Right-click on the Wait function created in step 10 and connect a integer value of 200 similar to step 9. On the Front Panel window, right-click and from Modern functions, choose Gauge. Note that a Gauge function will appear on the Block Diagram window too. If the function is not inside the For Loop, drag and drop it inside the For Loop. Inside the For loop, on the Block Diagram widow, connect the iteration count i to the Gauge. On the Block Diagram, right-click on the Gauge, and under the Create submenu, choose Local variable. If it is not already inside the while loop, drag and drop it inside the while loop but outside of the case structure. Right-click on the local variable created in step 15 and connect a Zero to the input of the local variable. Click on the Clean Up icon on the main menu bar on the Block Diagram window and drag and move items on the Front Panel window so that both windows look similar to the following screenshots: Creating a project is a must When LabVIEW is launched, a default screen such as in the following screenshot appears on the screen: The most common way of using LabVIEW, at least in the beginning of a small project or test program, is to create a new VI. A common rule of programming is that each function, or in this case VI, should not be larger than a page. Keep in mind that, by nature, LabVIEW will have two windows to begin with and, being a graphical programming environment only, each VI may require more screen space than the similar text based development environment. To start off development and in order to set up all devices and connections required for tasks such as data acquisition, a developer may get the job done by simply creating one, and, more likely several VIs. Speaking from experience among engineers and other developers (in other words, in situations where R&D looms more heavily on the project than collecting raw data), quick VIs are more efficient initially, but almost all projects that start in this fashion end up growing very quickly and other people and other departments will need be involved and/or be fed the gathered data. 
In most cases, within a short time from the beginning of the project, technicians from the same department or related teams may be need to be trained to use the software in development. This is why it is best to develop the habit of creating a new project from the very beginning. Note the center button on the left-hand window in the preceding screenshot. Creating a new project (as opposed to creating VIs and sub-VIs) has many advantages and it is a must if the program created will have to run as an executable on computers that do not have LabVIEW installed on them. Later versions of LabVIEW have streamlined the creation of a project and have added many templates and starting points to them. Although, for the sake of simplicity, we created our first example with the creation of a simple VI, one could almost as easily create a project and choose from many starting points, templates, and other concepts (such as architecture) in LabVIEW. The most useful starting point for a complete and user-friendly application for data acquisition would be a state machine. Throughout the book, we will create simple VIs as a quick and simple way to illustrate a point but, by the end of the book, we will collect all of the VIs, icons, drivers, and sub-VIs in one complete state machine, all collected in one complete project. From the project created, we will create a standalone application that will not need the LabVIEW environment to execute, which could run on any computer that has LabVIEW runtime engine installed on it. To summarize, we went through the basics of LabVIEW and the main functionality of each of its icons by way of an actual user-interactive example. LabVIEW is capable of developing embedded systems, fuzzy logic, and almost everything in between! If you are interested to know more about LabVIEW, check out this book Data Acquisition Using LabVIEW.    

How to install Elasticsearch in Ubuntu and Windows

Fatema Patrawala
16 Feb 2018
3 min read
[box type="note" align="" class="" width=""]This article is an extract from the book, Mastering Elastic Stack  co-authored by Ravi Kumar Gupta and Yuvraj Gupta.This book will brush you up with basic knowledge on implementing the Elastic Stack and then dives deep into complex and advanced implementations. [/box] In today’s tutorial we aim to learn Elasticsearch v5.1.1 installation for Ubuntu and Windows. Installation of Elasticsearch on Ubuntu 14.04 In order to install Elasticsearch on Ubuntu, refer to the following steps: Download Elasticsearch 5.1.1 as a debian package using terminal: wget https://artifacts.elastic.co /downloads/elasticsearch/elasticsearch-5.1.1.deb 2. Install the debian package using following command: sudo dpkg -i elasticsearch-5.1.1.deb Elasticsearch will be installed in /usr/share/elasticsearch directory. The configuration files will be present at /etc/elasticsearch. The init script will be present at /etc/init.d/elasticsearch. The log files will be present within /var/log/elasticsearch directory. 3. Configure Elasticsearch to run automatically on bootup . If you are using SysV init distribution, then run the following command: sudo update-rc.d elasticsearch defaults 95 10 The preceding command will print on screen: Adding system startup for, /etc/init.d/elasticsearch Check status of Elasticsearch using following command: sudo service elasticsearch status Run Elasticsearch as a service using following command: sudo service elasticsearch start Elasticsearch may not start if you have any plugin installed which is not supported in ES-5.0.x version onwards. As plugins have been deprecated, it is required to uninstall any plugin if exists in prior version of ES. Remove a plugin after going to ES Home using following command: bin/elasticsearch-plugin remove head Usage of Elasticsearch command: sudo service elasticsearch {start|stop|restart|force- reload|status} If you are using systemd distribution, then run following command: sudo /bin/systemctl daemon-reload sudo /bin/systemctl enable elasticsearch.service To verify elasticsearch installation open open http://localhost:9200 in browser or run the following command from command line: curl -X GET http://localhost:9200 Installation of Elasticsearch on Windows In order to install Elasticsearch on Windows, refer to the following steps: Download Elasticsearch 5.1.1 version from its site using the following link: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch -5.1.1.zip Upon opening the link, click on it and it will download the ZIP package. 2. Extract the downloaded ZIP package by unzipping it using WinRAR, 7-Zip, and other such extracting softwares (if you don't have one of these then download it). This will extract the files and folders in the directory. 3. Then click on the extracted folder and navigate the folder to reach inside the bin folder. 4. Click on the elasticsearch.bat file to run Elasticsearch. If this window is closed Elasticsearch will stop running, as the node will shut down. 5. To verify Elasticsearch installation, open http://localhost:9200 in the browser: Installation of Elasticsearch as a service After installing Elasticsearch as previously mentioned, open Command Prompt after navigating to the bin folder and use the following command: elasticsearch-service.bat install Usage: elasticsearch-service.bat install | remove | start | stop | manager To summarize, we learnt installation of Elasticsearch on Ubuntu and Windows. 
If you are keen to know more about how to work with the Elastic Stack in a production environment, you can grab our comprehensive guide Mastering Elastic Stack.  

How to build and enable the Jenkins Mesos plugin

Vijin Boricha
16 Feb 2018
4 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by David Blomquist and Tomasz Janiszewski titled Apache Mesos Cookbook. From this book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In today’s tutorial, we will learn about building and enabling the Jenkins Mesos plugin. Building the Jenkins Mesos plugin By default, Jenkins uses statically created agents and runs jobs on them. We can extend this behavior with a plugin that will make Jenkins use Mesos as a resource manager. Jenkins will register as a Mesos framework and accept offers when it needs to run a job. How to do it The Jenkins Mesos plugin installation is a little bit harder than Marathon. There are no official binary packages for it, so it must be installed from sources:  First of all, we need to download the source code: curl -L https://github.com/jenkinsci/mesos-plugin/archive/mesos-0.14.0.tar. gz | tar -zx cd jenkinsci-mesos-plugin-*  The plugin is written in Java and to build it we need Maven (mvn): sudo apt install maven  Finally, build the package: mvn package If everything goes smoothly, you should see information, that all tests passed and the plugin package will be placed in target/mesos.hpi. Jenkins is written in Java and presents an API for creating plugins. Plugins do not have to be written in Java, but must be compatible with those interfaces so most plugins are written in Java. The natural choice for building a Java application is Maven, although Gradle is getting more and more popular. The Jenkins Mesos plugin uses the Mesos native library to communicate with Mesos. This communication is now deprecated, so the plugin does not support all Mesos features that are available with the Mesos HTTP API. Enabling the Jenkins Mesos plugin Here  you will learn how to enable the Mesos Jenkins plugin and configure a job to be run on Mesos. How to do it... The first step is to install the Mesos Jenkins plugin. To do so, navigate to the Plugin Manager by clicking Manage Jenkins | Manage Plugins, and select the Advanced tab. You should see the following screen: Click Choose  file and select the previously built plugin to upload it. Once the plugin is installed, you have to configure it. To do so, go to the configuration (Manage Jenkins | Configure  System).  At the bottom of the page, the cloud section  should appear. Fill in all the fields with  the desired configuration values: This was the last step of the plugin installation. If you now disable Advanced On- demand framework registration, you should see the Jenkins Scheduler registered in the Mesos frameworks. Remember to configure Slave  username to the existing system user on Mesos agents. It will be used to run your jobs. By default, it will be jenkins. You can create it on slaves with the following command: adduser jenkins Be careful when providing an IP or hostnames for Mesos and Jenkins. It must match the IP used later by the scheduler for communication. By default, the Mesos native library binds to the interface that the hostname resolves to. This could lead to problems in communication, especially when receiving messages from Mesos. If you see your Jenkins is connected, but jobs are stuck and agents do not start, check if Jenkins is registered with the proper IP. 
You can set the IP used by Jenkins by adding the following line in /etc/default/jenkins (in this example, we assume Jenkins should bind to 10.10.10.10):
LIBPROCESS_IP=10.10.10.10
We learned how to build and enable the Jenkins Mesos plugin. You can learn more about how to configure and maintain Apache Mesos from the Apache Mesos Cookbook.

4 ways to implement feature selection in Python for machine learning

Sugandha Lahoti
16 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt from Ensemble Machine Learning. This book serves as a beginner's guide to combining powerful machine learning algorithms to build optimized models.[/box] In this article, we will look at different methods to select features from the dataset; and discuss types of feature selection algorithms with their implementation in Python using the Scikit-learn (sklearn) library: Univariate selection Recursive Feature Elimination (RFE) Principle Component Analysis (PCA) Choosing important features (feature importance) We have explained first three algorithms and their implementation in short. Further we will discuss Choosing important features (feature importance) part in detail as it is widely used technique in the data science community. Univariate selection Statistical tests can be used to select those features that have the strongest relationships with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. The following example uses the chi squared (chi^2) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset: #Feature Extraction with Univariate Statistical Tests (Chi-squared for classification) #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import SelectKBest #Import chi2 for performing chi square test from sklearn.feature_selection import chi2 #URL for loading the dataset url ="https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #We will select the features using chi square test = SelectKBest(score_func=chi2, k=4) #Fit the function for ranking the features by score fit = test.fit(X, Y) #Summarize scores numpy.set_printoptions(precision=3) print(fit.scores_) #Apply the transformation on to dataset features = fit.transform(X) #Summarize selected features print(features[0:5,:]) You can see the scores for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age. Scores for each feature: [111.52   1411.887 17.605 53.108  2175.565   127.669 5.393 181.304] Selected Features: [[148. 0. 33.6 50. ] [85. 0. 26.6 31. ] [183. 0. 23.3 32. ] [89. 94. 28.1 21. ] [137. 168. 43.1 33. ]] Recursive Feature Elimination RFE works by recursively removing attributes and building a model on attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation. The following example uses RFE with the logistic regression algorithm to select the top three features. 
The choice of algorithm does not matter too much as long as it is skillful and consistent:

#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
#Import LogisticRegression as the estimator used for ranking the features
from sklearn.linear_model import LogisticRegression

#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

#Feature extraction
model = LogisticRegression()
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

After execution, we will get:

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]

You can see that RFE chose the top three features as preg, mass, and pedi. These are marked True in the support_ array and marked with a 1 in the ranking_ array.

Principal Component Analysis

PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the following example, we use PCA and select three principal components:

#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's PCA algorithm
from sklearn.decomposition import PCA

#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

#Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
#Summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

You can see that the transformed dataset (three principal components) bears little resemblance to the source data:

Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
    9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]

Choosing important features (feature importance)

Feature importance is a technique for selecting features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's understand it in detail.
Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or information gain/entropy, and for regression trees, it is the variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged, and the features can be ranked according to this measure.

Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset, which is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). You can download the training dataset, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory.

This dataset describes more than 61,000 products with 93 obfuscated features, grouped into 10 product categories (for example, fashion, electronics, and so on). Input attributes are the counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).

We will start by importing all of the libraries:

#Import the supporting libraries
#Import pandas to load the dataset from csv file
from pandas import read_csv
#Import numpy for array based operations and calculations
import numpy as np
#Import Random Forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
#Import the SelectFromModel feature selector class of sklearn
from sklearn.feature_selection import SelectFromModel

np.random.seed(1)

Let's define a method to split our dataset into training and testing data; we will train our model on the training part and use the testing part for evaluation of the trained model:

#Function to create Train and Test set from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    training = []
    testing = []
    np.random.shuffle(dataset)
    shape = np.shape(dataset)
    trainlength = np.uint16(np.floor(split*shape[0]))
    for i in range(trainlength):
        training.append(dataset[i])
    for i in range(trainlength, shape[0]):
        testing.append(dataset[i])
    training = np.array(training)
    testing = np.array(testing)
    return training, testing

We also need a function to evaluate the accuracy of the model; it takes the predicted and actual outputs as input to calculate the percentage accuracy:

#Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count)/len(ytest)
    return acc

Now it is time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances.
We will use 50,000 instances for our example, of which 35,000 will be used to train the classifier and 15,000 to test its performance:

#Load dataset as pandas data frame
data = read_csv('train.csv')
#Extract attribute names from the data frame
feat = data.keys()
feat_labels = feat.get_values()
#Extract data values from the data frame
dataset = data.values
#Shuffle the dataset
np.random.shuffle(dataset)
#We will select 50000 instances to train the classifier
inst = 50000
#Extract 50000 instances from the dataset
dataset = dataset[0:inst,:]
#Create Training and Testing data for performance evaluation
train, test = getTrainTestData(dataset, 0.7)
#Split data into input and output variables
Xtrain = train[:,0:94]
ytrain = train[:,94]
shape = np.shape(Xtrain)
print("Shape of the dataset ", shape)
#Print the size of Data in MBs
print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6))

Let's take note of the data size here; the training set contains 35,000 instances with 94 attributes, so it is quite large:

Shape of the dataset (35000, 94)
Size of Data set before feature selection: 26.32 MB

As you can see, we have 35,000 rows and 94 columns in our dataset, which is more than 26 MB of data.

In the next code block, we will configure our random forest classifier; we will use 250 trees with a maximum depth of 30, and the number of random features will be 7. The other hyperparameters will be the sklearn defaults:

#Select the test data for model evaluation purposes
Xtest = test[:,0:94]
ytest = test[:,94]

#Create a random forest classifier with the following parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2

clf = RandomForestClassifier(n_estimators=trees,
                             max_features=max_feat,
                             max_depth=max_depth,
                             min_samples_split=min_sample,
                             random_state=0,
                             n_jobs=-1)

#Train the classifier and calculate the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()

#Note down the model training time
print("Execution time for building the Tree is: %f"%(float(end)-float(start)))
pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

Execution time for building the Tree is: 2.913641

#Evaluate the model performance for the test data
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f"%(100*acc))

The accuracy of our model is:

Accuracy of model before feature selection is 98.82

As you can see, we are getting very good accuracy, classifying almost 99% of the test data into the correct categories. This means we are classifying about 14,823 instances out of 15,000 correctly. So, now the question is: should we go for further improvement? Why not? We should definitely go for more improvements if we can; here, we will use feature importance to select features. As you know, in the tree building process we use an impurity measurement for node selection: the attribute value that has the lowest impurity is chosen as the node in the tree. We can use similar criteria for feature selection, giving more importance to features that have less impurity, and this can be done using the feature_importances_ attribute of the trained sklearn classifier.
Let's find out the importance of each feature:

#Once we have trained the model we will rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

('id', 0.33346650420175183)
('feat_1', 0.0036186958628801214)
('feat_2', 0.0037243050888530957)
('feat_3', 0.011579217472062748)
('feat_4', 0.010297382675187445)
('feat_5', 0.0010359139416194116)
('feat_6', 0.00038171336038056165)
('feat_7', 0.0024867672489765021)
('feat_8', 0.0096689721610546085)
('feat_9', 0.007906150362995093)
('feat_10', 0.0022342480802130366)

As you can see, each feature has a different importance based on its contribution to the final prediction. We will use these importance scores to rank our features; in the following part, we will select only those features that have a feature importance of more than 0.01 for model training:

#Select features which have a higher contribution to the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)

Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset and then check the size and shape of the new dataset:

#Transform input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)

#Let's see the size and shape of the new dataset
print("Size of Data set before feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ", shape)

Size of Data set before feature selection: 5.60 MB
Shape of the dataset (35000, 20)

Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26.32 MB to 5.60 MB; that is roughly an 80% reduction from the original dataset.

In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset. Let's see what accuracy we get after modifying the training set:

#Model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f"%(float(end)-float(start)))

#Let's evaluate the model on the test data
pre = clf.predict(Xtest_1)
count = 0
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f"%(100*acc2))

Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97

Can you see that? We have got 99.97 percent accuracy with the modified dataset, which means we are classifying 14,996 instances into correct classes, while previously we were classifying only 14,823 instances correctly. This is a huge improvement from the feature selection process; we can summarize all the results in the following table:

Evaluation criteria    Before feature selection    After feature selection
Number of features     94                          20
Size of dataset        26.32 MB                    5.60 MB
Training time          2.91 seconds                1.71 seconds
Accuracy               98.82 percent               99.97 percent

The preceding table shows the practical advantages of feature selection. We have reduced the number of features significantly, which reduces the model complexity and the dimensions of the dataset. Training time is shorter after the reduction in dimensions, and, at the end, we have overcome the overfitting issue, getting higher accuracy than before. To summarize, in this article we explored four ways of doing feature selection in machine learning.
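As an added illustration (not part of the original excerpt), a selection step like the ones above can also be dropped into a scikit-learn Pipeline so that it is re-fitted inside cross-validation. A minimal sketch reusing the X and Y arrays from the Pima Indians example earlier (assuming they are still in scope) might look like this:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

#Chain feature selection and classification so that selection is
#re-fitted on each training fold and cannot leak test information
pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=4)),   #keep the 4 best features
    ("clf", LogisticRegression())                    #fit the classifier on them
])

#5-fold cross-validation of the whole pipeline
scores = cross_val_score(pipe, X, Y, cv=5)
print("Mean accuracy: %.3f" % scores.mean())

The k=4 and cv=5 values here are illustrative; the point is simply that wrapping selection and modeling together keeps the evaluation honest.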
If you found this post useful, do check out the book Ensemble Machine Learning to learn more about stacked generalization, among other techniques.


How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

Savia Lobo
16 Feb 2018
6 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled Natural Language Processing with Python Cookbook. This book gives unique recipes covering various aspects of performing Natural Language Processing with NLTK, a leading Python platform for NLP.[/box] Today we will learn to use deep recurrent neural networks (RNN) to predict the next character given a sentence of a fixed length. A model trained this way can generate text continuously and, with enough training epochs, imitate the writing style of the original writer.

Getting ready...

The Project Gutenberg eBook of the complete works of William Shakespeare is used as the dataset to train the network for automated text generation. The raw file used for training can be downloaded from http://www.gutenberg.org/:

>>> from __future__ import print_function
>>> import numpy as np
>>> import random
>>> import sys

The following code is used to create a dictionary mapping characters to indices, and the reverse mapping, which we will use to convert text into indices at later stages. This is because deep learning models cannot understand English; everything needs to be mapped into indices to train these models:

>>> path = 'C:\\Users\\prata\\Documents\\book_codes\\NLP_DL\\shakespeare_final.txt'
>>> text = open(path).read().lower()
>>> characters = sorted(list(set(text)))
>>> print('corpus length:', len(text))
>>> print('total chars:', len(characters))
>>> char2indices = dict((c, i) for i, c in enumerate(characters))
>>> indices2char = dict((i, c) for i, c in enumerate(characters))

How to do it…

Before training the model, various preprocessing steps are involved to make it work. The following are the major steps:

Preprocessing: Prepare the X and Y data from the given story text file and convert them into an indexed, vectorized format.
Deep learning model training and validation: Train and validate the deep learning model.
Text generation: Generate text with the trained model.

How it works...

The following lines of code describe the entire modeling process of generating text from Shakespeare's writings. Here we have chosen a character window of length 40 to determine the next best single character, which seems a fair choice. The extraction process also jumps three characters at a time, which reduces the overlap between two consecutive extractions and creates a less redundant dataset:

# cut the text in semi-redundant sequences of maxlen characters
>>> maxlen = 40
>>> step = 3
>>> sentences = []
>>> next_chars = []
>>> for i in range(0, len(text) - maxlen, step):
...     sentences.append(text[i: i + maxlen])
...     next_chars.append(text[i + maxlen])
>>> print('nb sequences:', len(sentences))

The total number of sequences extracted is 193,798, which is enough data for text generation.

The next code block converts the data into a vectorized format for feeding into the deep learning model, as the model cannot understand anything about text, words, sentences, and so on. Initially, arrays of the required dimensions are created with all zeros in NumPy and then filled at the relevant positions using the dictionary mappings:

# Converting indices into vectorized format
>>> X = np.zeros((len(sentences), maxlen, len(characters)), dtype=np.bool)
>>> y = np.zeros((len(sentences), len(characters)), dtype=np.bool)
>>> for i, sentence in enumerate(sentences):
...     for t, char in enumerate(sentence):
...         X[i, t, char2indices[char]] = 1
...     y[i, char2indices[next_chars[i]]] = 1
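To make the shapes of X and y concrete, here is a small added sketch (not from the original recipe) that applies the same slicing and one-hot encoding to a toy string; all toy_* names are illustrative only:

import numpy as np

# A toy corpus and the same character-to-index dictionary built above
toy_text = "to be or not to be"
toy_chars = sorted(list(set(toy_text)))
toy_char2idx = dict((c, i) for i, c in enumerate(toy_chars))

# Cut sequences of length 5, stepping by 3, mirroring maxlen=40, step=3
toy_maxlen, toy_step = 5, 3
toy_sentences, toy_next = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlen])
    toy_next.append(toy_text[i + toy_maxlen])

# One boolean vector per character position; one row per sequence for the target
X_toy = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        X_toy[i, t, toy_char2idx[char]] = 1
    y_toy[i, toy_char2idx[toy_next[i]]] = 1

print(X_toy.shape)  # (number of sequences, 5, number of distinct characters)
print(y_toy.shape)  # (number of sequences, number of distinct characters)

The Shakespeare arrays have exactly the same structure, only with 193,798 sequences, a window of 40, and the full character set.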
>>> from keras.models import Sequential
>>> from keras.layers import Dense, LSTM, Activation, Dropout
>>> from keras.optimizers import RMSprop

The deep learning model is created with an RNN, more specifically a Long Short-Term Memory network with 128 hidden neurons, and the output layer has one unit per character. Finally, the softmax activation is used with the RMSprop optimizer. We encourage readers to try other parameter values to see how the results vary:

#Model Building
>>> model = Sequential()
>>> model.add(LSTM(128, input_shape=(maxlen, len(characters))))
>>> model.add(Dense(len(characters)))
>>> model.add(Activation('softmax'))
>>> model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01))
>>> print (model.summary())

As mentioned earlier, deep learning models train on number indices to map input to output (given 40 characters, the model predicts the next best character). The following function converts the predicted probabilities into an index, and hence a character, by applying a temperature (the metric argument) and sampling from the resulting distribution:

# Function to convert prediction into index
>>> def pred_indices(preds, metric=1.0):
...     preds = np.asarray(preds).astype('float64')
...     preds = np.log(preds) / metric
...     exp_preds = np.exp(preds)
...     preds = exp_preds/np.sum(exp_preds)
...     probs = np.random.multinomial(1, preds, 1)
...     return np.argmax(probs)

The model is trained for 29 iterations with a batch size of 128. The diversity value is also varied to see its impact on the predictions:

# Train and Evaluate the Model
>>> for iteration in range(1, 30):
...     print('-' * 40)
...     print('Iteration', iteration)
...     model.fit(X, y, batch_size=128, epochs=1)
...     start_index = random.randint(0, len(text) - maxlen - 1)
...     for diversity in [0.2, 0.7, 1.2]:
...         print('\n----- diversity:', diversity)
...         generated = ''
...         sentence = text[start_index: start_index + maxlen]
...         generated += sentence
...         print('----- Generating with seed: "' + sentence + '"')
...         sys.stdout.write(generated)
...         for i in range(400):
...             x = np.zeros((1, maxlen, len(characters)))
...             for t, char in enumerate(sentence):
...                 x[0, t, char2indices[char]] = 1.
...             preds = model.predict(x, verbose=0)[0]
...             next_index = pred_indices(preds, diversity)
...             pred_char = indices2char[next_index]
...             generated += pred_char
...             sentence = sentence[1:] + pred_char
...             sys.stdout.write(pred_char)
...             sys.stdout.flush()
...         print("\nOne combination completed\n")

Comparing the text generated after the first iteration (Iteration 1) with the text generated after the final iteration (Iteration 29), it is apparent that, with enough training, the generated text becomes much better.

Though the text generation seems magical, we have generated text using Shakespeare's writings, showing that with the right training and handling we can imitate the writing style of a particular writer.

If you found this post useful, you may check out the book Natural Language Processing with Python Cookbook to analyze sentence structure and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and other NLP techniques.

Manipulating text data using Python Regular Expressions (regex)

Sugandha Lahoti
16 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Allan Visochek titled Practical Data Wrangling. This book covers practical data wrangling techniques in Python and R to turn your noisy data into relevant, insight-ready information.[/box] In today's tutorial, we will learn how to manipulate text data using regular expressions in Python.

What is a regular expression?

A regular expression, or regex for short, is simply a sequence of characters that specifies a certain search pattern. Regular expressions have been around for quite a while and are a field of computer science in and of themselves. In Python, regular expression operations are handled using Python's built-in re module. In this section, I will walk through the basics of creating regular expressions and using them to search text.

You can implement a regular expression with the following steps:

Specify a pattern string.
Compile the pattern string to a regular expression object.
Use the regular expression object to search a string for the pattern.
Optional: Extract the matched pattern from the string.

Writing and using a regular expression

The first step to creating a regular expression in Python is to import the re module:

import re

Python regular expressions are expressed using pattern strings, which are strings that specify the desired search pattern. In its simplest form, a pattern string can consist only of letters, numbers, and spaces. The following pattern string expresses a search query for an exact sequence of characters. You can think of each character as an individual pattern. In later examples, I will discuss more sophisticated patterns:

import re

pattern_string = "this is the pattern"

The next step is to process the pattern string into an object that Python can use in order to search for the pattern. This is done using the compile() method of the re module. The compile() method takes the pattern string as an argument and returns a regex object:

import re

pattern_string = "this is the pattern"
regex = re.compile(pattern_string)

Once you have a regex object, you can use it to search within a search string for the pattern specified in the pattern string. A search string is just the name for the string in which you are looking for a pattern. To search for the pattern, you can use the search() method of the regex object as follows:

import re

pattern_string = "this is the pattern"
regex = re.compile(pattern_string)
match = regex.search("this is the pattern")

If the pattern specified in the pattern string is in the search string, the search() method will return a match object. Otherwise, it returns the None data type, which is an empty value. Since Python interprets True and False values rather loosely, the result of the search function can be used like a Boolean value in an if statement, which can be rather convenient:

....
match = regex.search("this is the pattern")
if match:
    print("this was a match!")

The search string this is the pattern should produce a match, because it matches exactly the pattern specified in the pattern string. The search function will produce a match if the pattern is found at any point in the search string, as the following demonstrates:

....
match = regex.search("this is the pattern")
if match:
    print("this was a match!")

if regex.search("*** this is the pattern ***"):
    print("this was a match!")

if not regex.search("this is not the pattern"):
    print("this was not a match!")
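The optional fourth step, extracting the matched text, is not demonstrated in the excerpt. As a small added illustration, the match object's group(), start(), and end() methods (standard in Python's re module) can be used like this:

import re

pattern_string = "the pattern"
regex = re.compile(pattern_string)

match = regex.search("somewhere in here is the pattern hiding")
if match:
    # group() returns the exact substring that matched the pattern
    print(match.group())                # prints "the pattern"
    # start() and end() give the position of the match in the search string
    print(match.start(), match.end())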
Special Characters

Regular expressions depend on the use of certain special characters in order to express patterns. Due to this, the following characters should not be used directly unless they are used for their intended purpose:

. ^ $ * + ? {} () [] |

If you do need to search for any of these characters in a pattern string, you can write the character preceded by a backslash character. This is called escaping characters. Here's an example:

pattern_string = "c\*b"   ## matches "c*b"

If you need to search for the backslash character itself, you use two backslash characters, as follows:

pattern_string = "c\\b"   ## matches "c\b"

Matching whitespace

Using \s at any point in the pattern string matches a whitespace character. This is more general than the space character, as it applies to tabs and newline characters:

....
a_space_b = re.compile("a\sb")
if a_space_b.search("a b"):
    print("'a b' is a match!")
if a_space_b.search("1234 a b 1234"):
    print("'1234 a b 1234' is a match")
if a_space_b.search("ab"):
    print("'ab' is a match")

Matching the start of a string

If the ^ character is used at the beginning of the pattern string, the regular expression will only produce a match if the pattern is found at the beginning of the search string:

....
a_at_start = re.compile("^a")
if a_at_start.search("a"):
    print("'a' is a match")
if a_at_start.search("a 1234"):
    print("'a 1234' is a match")
if a_at_start.search("1234 a"):
    print("'1234 a' is a match")

Matching the end of a string

Similarly, if the $ symbol is used at the end of the pattern string, the regular expression will only produce a match if the pattern appears at the end of the search string:

....
a_at_end = re.compile("a$")
if a_at_end.search("a"):
    print("'a' is a match")
if a_at_end.search("a 1234"):
    print("'a 1234' is a match")
if a_at_end.search("1234 a"):
    print("'1234 a' is a match")

Matching a range of characters

It is possible to match a range of characters instead of just one. This can add some flexibility to the pattern:

[A-Z] matches all capital letters
[a-z] matches all lowercase letters
[0-9] matches all digits

....
lower_case_letter = re.compile("[a-z]")
if lower_case_letter.search("a"):
    print("'a' is a match")
if lower_case_letter.search("B"):
    print("'B' is a match")
if lower_case_letter.search("123 A B 2"):
    print("'123 A B 2' is a match")

digit = re.compile("[0-9]")
if digit.search("1"):
    print("'1' is a match")
if digit.search("342"):
    print("'342' is a match")
if digit.search("asdf abcd"):
    print("'asdf abcd' is a match")

Matching any one of several patterns

If there is a fixed number of patterns that would constitute a match, they can be combined using the following syntax:

(<pattern1>|<pattern2>|<pattern3>)

The following a_or_b regular expression will match any string where there is either an a character or a b character:

....
a_or_b = re.compile("(a|b)")
if a_or_b.search("a"):
    print("'a' is a match")
if a_or_b.search("b"):
    print("'b' is a match")
if a_or_b.search("c"):
    print("'c' is a match")

Matching a sequence instead of just one character

If the + character comes after another character or pattern, the regular expression will match an arbitrarily long sequence of that pattern. This is quite useful, because it makes it easy to express something like a word or number that can be of arbitrary length.
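As a small added example (not from the original excerpt), the following sketch uses [0-9]+ to match one or more consecutive digits:

import re

# [0-9]+ matches a run of one or more digits
number = re.compile("[0-9]+")

match = number.search("invoice 20321 was paid")
if match:
    print(match.group())   # prints "20321", the whole digit sequence

# A single character still qualifies, since + means "one or more"
if number.search("7"):
    print("'7' is a match")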
Putting patterns together

More sophisticated patterns can be produced by combining pattern strings one after the other. In the following example, I've created a regular expression that searches for a number strictly followed by a word. The pattern string that generates the regular expression is composed of the following:

A pattern string that matches a sequence of digits: [0-9]+
A pattern string that matches a whitespace character: \s
A pattern string that matches a sequence of letters: [a-z]+
A pattern string that matches either the end of the string or a whitespace character: (\s|$)

....
number_then_word = re.compile("[0-9]+\s[a-z]+(\s|$)")

The regex split() function

Regex objects in Python also have a split() method. The split method splits the search string into an array of substrings. The splits occur at each location along the string where the pattern is identified, so the result is an array of the strings that occur between instances of the pattern. If the pattern occurs at the beginning or end of the search string, an empty string is included at the beginning or end of the resulting array, respectively:

....
print(a_or_b.split("123a456b789"))
print(a_or_b.split("a1b"))

If you are interested, the Python documentation has more complete coverage of regular expressions. It can be found at https://docs.python.org/3.6/library/re.html.

We have seen various ways of using regular expressions in Python. To learn more about data wrangling techniques using simple and real-world datasets, you may check out the book Practical Data Wrangling.


6 Popular Regression Techniques you must know

Amey Varangaonkar
15 Feb 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Statistics for Data Science, authored by IBM expert James D. Miller. This book gives a statistical view of building smart data models to help you get unique insights from the data.[/box] In this article, we introduce you to the concept of regression analysis, one of the most popular machine learning techniques. You will learn what regression analysis is, the different types of regression, and how to choose the right regression technique to build your data model.

What is Regression Analysis?

For starters, regression analysis, or statistical regression, is a process for estimating the relationships among variables. It encompasses numerous techniques for modeling and analyzing variables, focusing on the relationship between a dependent variable and one or more independent variables (or predictors). Regression analysis is the work done to identify and understand how the (best representative) value of a dependent variable (a variable that depends on other factors) changes when any one of the independent variables (variables that stand alone and aren't changed by the other variables) is changed while the other independent variables stay the same.

A simple example might be how the total dollars spent on marketing (an independent variable) affect the total sales dollars (a dependent variable) over a period of time (is it really as simple as more marketing equates to higher sales?). Or perhaps there is a correlation between the total marketing dollars spent (independent variable), discounting a product's price (another independent variable), and the amount of sales (a dependent variable)?

[box type="info" align="" class="" width=""]Keep in mind this key point that regression analysis is used to understand which among the independent variables are related to the dependent variable(s), not just the relationship of these variables. Also, the inference of causal relationships (between the independent and dependent variables) is an important objective. However, this can lead to illusions or false relationships, so caution is recommended![/box]

Overall, regression analysis can be thought of as estimating the conditional expectation of the dependent variable given the observed independent variables, that is, endeavoring to predict the average value of the dependent variable when the independent variables are set to certain values. I call this the lever effect: when one increases or decreases the value of one component, it directly affects the value of at least one other variable.

An alternate objective of regression analysis is the establishment of location parameters or the quantiles of a distribution. In other words, the idea is to determine values that may serve as a cutoff, dividing a range of probability distribution values.

You'll find that regression analysis can be a great tool for prediction and forecasting (not just complex machine learning applications). We'll explore some real-world examples later, but for now, let us look at some techniques for the process.

Popular regression techniques and approaches

You'll find that various techniques for carrying out regression analysis have been developed and accepted. These are:

Linear
Logistic
Polynomial
Stepwise
Ridge
Lasso

Linear regression

Linear regression is the most basic type of regression and is commonly used for predictive analysis projects. When you are working with a single predictor (variable), we call it simple linear regression, and when there are multiple predictor variables, we call it multiple linear regression. Simply put, linear regression uses linear predictor functions whose values are estimated from the data in the model.
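As an added illustration (not part of the original excerpt), a minimal simple linear regression in Python with scikit-learn, using made-up marketing-spend and sales figures along the lines of the example above, might look like the following sketch:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: marketing spend (independent variable) vs. sales (dependent variable),
# both in thousands of dollars
marketing_spend = np.array([[10], [20], [30], [40], [50]])
sales = np.array([25, 41, 58, 74, 92])

model = LinearRegression()
model.fit(marketing_spend, sales)

# The fitted line: sales is approximately intercept + coefficient * marketing_spend
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_[0])

# Predict sales for a new level of marketing spend
print("Predicted sales for 60k spend:", model.predict([[60]])[0])

The numbers are invented purely to show the mechanics; with more than one predictor column, the same code performs multiple linear regression.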
Logistic regression

Logistic regression is a regression model where the dependent variable is categorical. In the basic case, the variable has only two possible values, for example, pass/fail, win/lose, alive/dead, or healthy/sick. If the dependent variable has more than two possible values, one can use various modified logistic regression techniques, such as multinomial logistic regression, ordinal logistic regression, and so on.

Polynomial regression

When we speak of polynomial regression, the focus of this technique is on modeling the relationship between the independent variable and the dependent variable as an nth-degree polynomial. Polynomial regression is considered to be a special case of multiple linear regression. The predictors resulting from the polynomial expansion of the baseline predictors are known as interactive features.

Stepwise regression

Stepwise regression is a technique that uses an automated procedure to repeatedly execute a step of logic; that is, during each step, a variable is considered for addition to or subtraction from the set of independent variables based on some prespecified criterion.

Ridge regression

Often, predictor variables are identified as being interrelated. When this occurs, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Ridge regression is a technique where a small bias factor is added to the selected variables in order to improve this situation. Therefore, ridge regression is actually considered a remedial measure to alleviate multicollinearity amongst predictor variables.

Lasso regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is a technique where both predictor variable selection and regularization are performed in order to improve the prediction accuracy and interpretability of the result it produces.
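To make the last two techniques concrete, here is a small added sketch (not from the original excerpt) fitting ridge and lasso models with scikit-learn on made-up data containing two nearly duplicate predictors; the alpha values are illustrative only:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)

# Hypothetical data: 100 observations, 5 predictors, two of them highly correlated
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=100)   # predictor 1 nearly duplicates predictor 0
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.5, size=100)

# Ridge adds a small bias (the alpha penalty) to stabilise correlated coefficients
ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge coefficients:", ridge.coef_)

# Lasso penalises the absolute size of coefficients and can drive some of them to zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", lasso.coef_)

With a penalty like this, lasso will typically zero out the uninformative predictors, which is exactly the variable selection behaviour described above, while ridge keeps all predictors but shrinks the unstable, correlated coefficients.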
Which technique should I choose?

In addition to the aforementioned regression techniques, there are numerous others to consider, with, most likely, more to come. With so many options, it's important to choose the technique that is right for your data and your project. Rather than selecting the one "right" regression approach, it is more about selecting the most effective regression approach. Typically, you use the data to identify the regression approach you'll use: you start by establishing statistics or a profile for your data, and with this effort you identify and understand the importance of the different variables, their relationships, coefficient signs, and their effects. Overall, here is some generally good advice for choosing the right regression approach for your project:

Copy what others have done and had success with. Do the research. Incorporate the results of other projects into yours. Don't reinvent the wheel. Even if an observed approach doesn't quite fit as it was used, some simple adjustments may make it a good choice.
Keep your approach as simple as possible. Many studies show that simpler models generally produce better predictions. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers.
Check your work. As you evaluate methods, check the residual plots (more on this in the next section of this chapter) because they can help you avoid inadequate models and adjust your model for better results.
Use your subject matter expertise. No statistical method can understand the underlying process or subject area the way you do. Your knowledge is a crucial part and, most likely, the most reliable way of determining the best regression approach for your project.

Does it fit?

After selecting a model that you feel is appropriate for your data (also known as determining that the approach is the best fit), you need to validate your selection, that is, determine its fit. A well-fitting regression model produces predicted values close to the observed data values. The mean model (which uses the mean for every predicted value) would generally be used if there were no informative predictor variables. The fit of a proposed regression model should, therefore, be better than the fit of the mean model. As a data scientist, you will need to scrutinize the coefficient of determination, measure the standard error of the estimate, and analyze the significance of the regression parameters and confidence intervals.

[box type="info" align="" class="" width=""]Remember that the better the fit of a regression model, the better the precision of the results.[/box]

Finally, simpler models have repeatedly been shown to produce more accurate results! Keep this in mind when selecting an approach or a technique; even when the problem is complex, it is not always necessary to adopt a complex regression approach. Choosing the right technique, though, goes a long way in developing an accurate model.

If you found this excerpt useful, make sure to check out the book Statistics for Data Science for tips on building effective data models by leveraging the power of statistical tools and techniques.