How-To Tutorials

article-image-stack-wars-epic-struggle-who-controls-tech-stack

20 Feb 2018

4 min read

Stack Wars: The epic struggle for control of the tech stack

20 Feb 2018

The choice of tech stack for a project, team or organisation is an ongoing struggle between competing forces. Each of the players has their own logic, beliefs and drivers. Where you stand and what side you are on totally determines the way you see the struggle. Packt is on the developer team. This is how we see the struggle we’re all part of: Technology vendors are the Empire Any organisation that is selling tools, technologies or platform services is either already behaving like the Empire, or will, eventually, become the Empire. Vendors want the stack to include their tech, and if the vendor has a full stack like IBM, MS, or Oracle then they want you to live in their world. To be completely Blue or Red Stack. The economics driving this are relentless. The biggest cost for large software vendors is acquiring customers. Once you have a customer, it makes sense to keep expanding your product portfolio to sell more to each customer. The end game is when the Empire captures whole planets from the Alliance and enslaves the occupants in a move called Large Outsourcing Deals. Businesses and IT departments are the Rebel Alliance Companies and organisations build systems to try and serve their users and customers. Their underlying intentions are good. They are trying to do the right thing. They do the best they can. They have to manage within a structured organisation, co-ordinating different groups and teams. They sometimes have some cool new stuff, but often they are struggling with outdated kit, against overwhelming odds. Companies sometimes achieve great things in specific battles with heroic individuals and teams, but they also have to keep the whole show on the road. The Empire Vendors are constantly trying to bring them into their captive stack-universe, to make life “easier" with the comforting myth of the one-stop-shop. The Alliance gets new weapons and allies in the form of insurgent vendors who start out fighting the Empire, like GitHhub, Jira and AWS. However, these can be dangerous alliances. The iron law of the costs of customer acquisition will drive even the insurgent vendors to continually expand their product offer and then - BAM! – another empire wanting to lock you in. They call this the ‘Land and Expand’ strategy and every vendor has it, overtly or secretly. Even the currently much-beloved Slack will eventually try and turn itself into the Facebook of the office, and will gobble up the app ecosystem just like Facebook. They all cross over to the dark side eventually. Developers are the Jedi Devs have a deep understanding of how technologies really work in action because they have to actually build things. This knowledge can appear mystical to outsiders. It is hard to express and articulate the intuitive skills gained from actual development experience. The very best devs are 10, 100, 1000 times more productive than the implementation teams from the vendors. Devs know what vendor tools are really like under the hood, when the action starts. They know that even the Death Star has hidden yet fatal vulnerabilities, no matter how great it looks from a distance. Over the years devs have evolved their own special ways of working that is hard for outsiders to understand. These go by the names of Agile and Open Source. Agile is a semi-mysterious Way, trusting the process to migrate towards success, without being really able to say what that is before we realise we get there. Open Source is the shared network that binds developers together into a powerful network of shared power on platforms like GitHub. Devs have two forces driving them. The first is to get the very best tech stack for each project, based on their unique technical insight into how it really works. Devs always want to choose best of breed, for this problem, here and now. But devs also have personal weapons of choice, over which they have mastery, and will try and use these wherever possible. Laser swords can do a lot more than you think, but there are other, better weapons in certain circumstances. Stack Wars are never going to end. There will be more and more episodes of this eternal struggle. The Empire can never be completely defeated, any more than the Jedi can die out. The story needs all three, and ebbs and flows over time in a pattern that repeats itself but in new and different ways.

0
0
2533

How-To Tutorials

article-image-getting-know-generative-models-types

Sunith Shetty

20 Feb 2018

9 min read

Getting to know Generative Models and their types

Sunith Shetty

20 Feb 2018

9 min read

0
0
6685

article-image-introduction-device-management

Packt

20 Feb 2018

10 min read

Introduction with Device Management

Packt

20 Feb 2018

10 min read

In this article by Yatish Patil, the author of the book Microsoft Azure IOT Development Cookbook, we will look at device management using different techniques with Azure IoT Hub. We will see the following recipes: Device registry operations Device twins Device direct methods Device jobs (For more resources related to this topic, see here.) Azure IoT Hub has the capabilities that can be used by a developer to build a robust device management. There could be different use cases or scenarios across multiple industries but these device management capabilities, their patterns and the SDK code remains same, saving the significant time in developing and managing as well as maintaining the millions of devices. Device management will be the central part of any IoT solution. The IoT solution is going to help the users to manage the devices remotely, take actions from the cloud based application like disable, update data, run any command, and firmware update. In this article, we are going to perform all these tasks for device management and will start with creating the device. Device registry operations This sample application is focused on device registry operations and how it works, we will create a console application as our first IoT solution and look at the various device management techniques. Getting ready Let’s create a console application to start with IoT: Create a new project in Visual Studio: Create a Console Application Add IoT Hub connectivity extension in Visual Studio: Add the extension for IoT Hub connectivity Now right click on the Solution and go to Add a Connected Services. Select Azure IoT Hub and click Add. Now select Azure subscription and the IoT Hub created: Select IoT Hub for our application Next it will ask you to add device or you can skip this step and click Complete the configuration. How to do it... Create device identity: initialize the Azure IoT Hub registry connection: registryManager = RegistryManager.CreateFromConnectionString(connectionString); Device device = new Device(); try { device = await registryManager.AddDeviceAsync(new Device(deviceId)); success = true; } catch (DeviceAlreadyExistsException) { success = false; } Retrieve device identity by ID: Device device = new Device(); try { device = await registryManager.GetDeviceAsync(deviceId); } catch (DeviceAlreadyExistsException) { return device; } Delete device identity: Device device = new Device(); try { device = GetDevice(deviceId); await registryManager.RemoveDeviceAsync(device); success = true; } catch (Exception ex) { success = false; } List up to 1000 identities: try { var devicelist = registryManager.GetDevicesAsync(1000); return devicelist.Result; } catch (Exception ex) { // Export all identities to Azure blob storage: var blobClient = storageAccount.CreateCloudBlobClient(); string Containername = "iothubdevices"; //Get a reference to a container var container = blobClient.GetContainerReference(Containername); container.CreateIfNotExists(); //Generate a SAS token var storageUri = GetContainerSasUri(container); await registryManager.ExportDevicesAsync(storageUri, "devices1.txt", false); } Import all identities to Azure blob storage: await registryManager.ImportDevicesAsync(storageUri, OutputStorageUri); How it works... Let’s now understand the steps we performed. We initiated by creating a console application and configured it for the Azure IoT Hub solution. The idea behind this is to see the simple operation for device management. In this article, we started with simple operation for provision of the device by adding it to IoT Hub. We need to create connection to the IoT Hub followed by the created object of registry manager which is a part of devices namespace. Once we are connected we can perform operations like, add device, delete device, get device, these methods are asynchronous ones. IoT Hub also provides a way where in it connects with Azure storage blob for bulk operations like export all devices or import all devices, this works on JSON format only, the entire set of IoT devices gets exported in this way. There's more... Device identities are represented as JSON documents. It consists of properties like: deviceId: It represents the unique identification or the IoT device. ETag: A string representing a weak ETag for the device identity. symkey: A composite object containing a primary and a secondary key, stored in base64 format. status: If enabled, the device can connect. If disabled, this device cannot access any device-facing Endpoint. statusReason: A string that can be used to store the reason for the status changes. connectionState: It can be connected or disconnected. Device twins First we need to understand what device twin is and what is the purpose where we can use the device twin in any IoT solution. The device twin is a JSON formatted document that describes the metadata, properties of any device created within IoT Hub. It describes the individual device specific information. The device twin is made up of: tags, desired properties, and the reported properties. The operation that can be done by a IoT solution are basically update this the data, query for any IoT device. Tags hold the device metadata that can be accessed from IoT solution only. Desired properties are set from IoT solution and can be accessed on the device. Whereas the reported properties are set on the device and retrieved at IoT solution end. How to do it... Store device metadata: var patch = new { properties = new { desired = new { deviceConfig = new { configId = Guid.NewGuid().ToString(), DeviceOwner = "yatish", latitude = "17.5122560", longitude = "70.7760470" } }, reported = new { deviceConfig = new { configId = Guid.NewGuid().ToString(), DeviceOwner = "yatish", latitude = "17.5122560", longitude = "70.7760470" } } }, tags = new { location = new { region = "US", plant = "Redmond43" } } }; await registryManager.UpdateTwinAsync(deviceTwin.DeviceId, JsonConvert.SerializeObject(patch), deviceTwin.ETag); Query device metadata: var query = registryManager.CreateQuery("SELECT * FROM devices WHERE deviceId = '" + deviceTwin.DeviceId + "'"); Report current state of device: var results = await query.GetNextAsTwinAsync(); How it works... In this sample, we retrieved the current information of the device twin and updated the desired properties, which will be accessible on the device side. In the code, we will set the co-ordinates of the device with latitude and longitude values, also the device owner name and so on. This same value will be accessible on the device side. In the similar manner, we can set some properties on the device side which will be a part of the reported properties. While using the device twin we must always consider: Tags can be set, read, and accessed only by backend . Reported properties are set by device and can be read by backend. Desired properties are set by backend and can be read by backend. Use version and last updated properties to detect updates when necessary. Each device twin size is limited to 8 KB by default per device by IoT Hub There's more... Device twin metadata always maintains the last updated time stamp for any modifications. This is UTC time stamp maintained in the metadata. Device twin format is JSON format in which the tags, desired, and reported properties are stored, here is sample JSON with different nodes showing how it is stored: "tags": { "$etag": "1234321", "location": { "country": "India" "city": "Mumbai", "zipCode": "400001" } }, "properties": { "desired": { "latitude": 18.75, "longitude": -75.75, "status": 1, "$version": 4 }, "reported": { "latitude": 18.75, "longitude": -75.75, "status": 1, "$version": 4 } } Device direct methods Azure IoT Hub provides a fully managed bi-directional communication between the IoT solution on the backend and the IoT devices in the fields. When there is need for an immediate communication result, a direct method best suites the scenarios. Lets take example in home automation system, one needs to control the AC temperature or on/off the faucet showers. Invoke method from application: public async Task<CloudToDeviceMethodResult> InvokeDirectMethodOnDevice(string deviceId, ServiceClient serviceClient) { var methodInvocation = new CloudToDeviceMethod("WriteToMessage") { ResponseTimeout = TimeSpan.FromSeconds(300) }; methodInvocation.SetPayloadJson("'1234567890'"); var response = await serviceClient.InvokeDeviceMethodAsync(deviceId, methodInvocation); return response; } Method execution on device: deviceClient = DeviceClient.CreateFromConnectionString("", TransportType.Mqtt); deviceClient.SetMethodHandlerAsync("WriteToMessage", new DeviceSimulator().WriteToMessage, null).Wait(); deviceClient.SetMethodHandlerAsync("GetDeviceName", new DeviceSimulator().GetDeviceName, new DeviceData("DeviceClientMethodMqttSample")).Wait(); How it works... Direct method works on request-response interaction with the IoT device and backend solution. It works on timeout basis if no reply within that, it fails. These synchronous requests have by default 30 seconds of timeout, one can modify the timeout and increase up to 3600 depending on the IoT scenarios they have. The device needs to connect using the MQTT protocol whereas the backend solution can be using HTTP. The JSON data size direct method can work up to 8 KB Device jobs In a typical scenario, device administrator or operators are required to manage the devices in bulk. We look at the device twin which maintains the properties and tags. Conceptually the job is nothing but a wrapper on the possible actions which can be done in bulk. Suppose we have a scenario in which we need to update the properties for multiple devices, in that case one can schedule the job and track the progress of that job. I would like to set the frequency to send the data at every 1 hour instead of every 30 min for 1000 IoT devices. Another example could be to reboot the multiple devices at the same time. Device administrators can perform device registration in bulk using the export and import methods. How to do it... Job to update twin properties. var twin = new Twin(); twin.Properties.Desired["HighTemperature"] = "44"; twin.Properties.Desired["City"] = "Mumbai"; twin.ETag = "*"; return await jobClient.ScheduleTwinUpdateAsync(jobId, "deviceId='"+ deviceId + "'", twin, DateTime.Now, 10); Job status. var twin = new Twin(); twin.Properties.Desired["HighTemperature"] = "44"; twin.Properties.Desired["City"] = "Mumbai"; twin.ETag = "*"; return await jobClient.ScheduleTwinUpdateAsync(jobId, "deviceId='"+ deviceId + "'", twin, DateTime.Now, 10); How it works... In this example, we looked at a job updating the device twin information and we can follow up the job for its status to find out if the job was completed or failed. In this case, instead of having single API calls, a job can be created to execute on multiple IoT devices. The job client object provides the jobs available with the IoT Hub using the connection to it. Once we locate the job using its unique ID we can retrieve the status for it. The code snippet mentioned in the How to do it... preceding recipe, uses the temperature properties and updates the data. The job is scheduled to start execution immediately with 10 seconds of execution timeout set. There's more... For a job, the life cycle begins with initiation from the IoT solution. If any job is in execution, we can query to it and see the status of execution. Another most common scenario where this could be useful is the firmware update, reboot, configuration updates, and so on, apart from the device property read or write. Each device job has properties that helps us working with them. The useful properties are start and end date time, status, and lastly device job statistics which gives the job execution statistics. Summary We have learned the device management using different techniques with Azure IoT Hub in detail. We have explained, how the IoT solution is going to help the users to manage the devices remotely, take actions from the cloud based application like disable, update data, run any command, and firmware update. We also performed different tasks for device management. Resources for Article: Further resources on this subject: Device Management in Zenoss Core Network and System Monitoring: Part 1 [article] Device Management in Zenoss Core Network and System Monitoring: Part 2 [article] Managing Network Devices [article]

0
0
3512

article-image-share-insights-using-alteryx-server

Sunith Shetty

20 Feb 2018

6 min read

How to share insights using Alteryx Server

Sunith Shetty

20 Feb 2018

6 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Renato Baruti titled Learning Alteryx. In this book, you will learn how to implement efficient business intelligence solutions without writing a single line of code using Alteryx platform.[/box] In today’s tutorial, we will learn about Alteryx Server, an easiest and fastest way to deploy data intensive analytics across the organization. What is Alteryx Server? Alteryx Server provides a scalable platform for deploying and sharing analytics. This is an effective and secure establishment when deploying data rapidly. You can integrate Alteryx processes directly into other internal and external applications from the built in macros and APIs. Alteryx Server can help you speed up business decisions and enable you to get answers in hours, not weeks. You will learn about these powerful features that revolutionize data processing using Alteryx Server: Speed time-to-insight with highly scalable workloads Empower every employee to make data-driven decisions Reduce risk and downtime with analytic governance Before learning about these powerful features, let’s review the Server Structure illustration so you have a solid understanding of how the server functions: Enterprise scalability Enterprise Scalability allows you to scale your enterprise analytics that will speed time to insight. Alteryx Server will compute the data processing by scheduling and running workflows. This reliable server architecture will process data intensive workflows at your scalable fashion. Deploy Alteryx Server on a single machine or in a multi-node environment, allowing you to scale up the number of cores on your existing server or add additional server nodes for availability and improved performance as needed. Ultimate flexibility and scalability Highly complex analytics and large scale data can use a large amount of memory and processing that can take hours to run on analysts' desktops. This can lead to a delay in business answers and sharing those insights. In addition, less risk is associated with running jobs on Alteryx Server, due to system shutdowns and it being less compressible compared to running on desktop. Your IT professionals will install and maintain Alteryx Server; you can rest assured that critical workflow backups and software updates take place regularly. Alteryx Server provides a flexible server architecture with on-premise or cloud deployment to build out enterprise analytic practice for 15 users or 15,000 users. Alteryx Server can be scaled in three different ways: Scaling the Worker node for additional processing power: Increase the total number of workflows that can be processed at any given time by creating multiple Worker nodes. This will scale out the Workers. Scaling the Gallery node for additional web users: Add a load balancer to increase capacity and create multiple Gallery nodes to place behind a load balancer. This will be helpful if you have many Gallery users. Scaling the Database node for availability and redundancy: Create multiple Database nodes by scaling out the persistent databases. This is great for improving overall system performance and ensuring backups. More hardware for Alteryx Server components may need to be added and the following table provides some guidelines: Scheduling and automating workflow execution to deliver data whenever and wherever you want Maximize automation potential by utilizing built-in scheduling and automation capabilities to schedule and run analytic workflows as needed, refresh data sets on a centralized server, and generate reports so everyone can access the data, anytime, anywhere. This will allow you to focus more time on analytic problems, rather than keeping an eye on your workflows running on the desktop. Let the server manage the jobs on a schedule. You can schedule workflows, packages, or apps to run automatically through the company's Gallery, or to a controller. Also, you can schedule to your computer through Desktop Automation (Scheduler). To schedule a workflow, go to Options | Schedule Workflow and to View Schedules go to Options | View Schedules as shown in the following image: If you want to schedule to your company's Gallery, you will need to connect to your company's Gallery first. Add a Gallery if you aren't connected to one. To add a Gallery, select Options | Schedule Workflow | Add Gallery. Type the URL path to your company's Gallery and click Connect. The connection is made based on built-in authentication by adding your Gallery email and password or Windows authentication by logging in through your user name. The following screenshot shows the URL entry screen: Schedule your workflow to run on a controller. A controller is a machine that runs and manages schedules for your organization. A token is needed to connect to the controller once the Alteryx Server Administrator at your company sets up the controller. To add a controller, select Options | Schedule Workflow | Add Controller. The following illustration is where you will add the server name and the controller token to proceed with connecting to the controller: Sharing and collaboration Data Analysts spend too much time customizing existing reports and rerunning workflows for different decision-makers instead of adding business value by working on new analytics projects. Alteryx Server lets you share macros and analytic applications, empowering business users to perform their own self-service analytics. You can easily share, collaborate on, and iterate workflows with analysts throughout your organization through integrated version control for published analytic applications. The administrators and authors of analytic applications can grant access to analytic workflows and specific apps within the gallery to ensure that the right people have access to the analytics they need. The following image shows schedules on the Alteryx Analytics Gallery for easy sharing and collaboration: Analytic governance Alteryx Server provides a built-in secure repository and version control capabilities to enable effective collaboration, allowing you to store analytic applications in a centralized location and ensure users only access the data for which they have permissions. The following screenshot shows the permission types to assign for maintaining secure access and sharing deployment: The goal of managing multiple teams collaborating together and deploying enterprise self-service analytics is to reduce downtime and risk, while ensuring analytic and information governance. Many organizations have become accustomed to a data-driven culture, enabling every employee to use analytics and helping business users to leverage the analytic tools available. You can meet service-level agreements with detailed auditing, usage reporting, and logging tools, and your system administrators can rest assured that your data remains safe and secure. To summarize, we learned about Alteryx Server which has powerful abilities to schedule and deploy workflows to share it with your team. We also explored how the scheduler is used to process workflows and is helpful for running night jobs, since the server functions 24*7. To know more about workflow optimization, and to carry out efficient data preparation and blending, do checkout this book Learning Alteryx.

0
0
3160

article-image-what-makes-hadoop-so-revolutionary

Packt

20 Feb 2018

17 min read

What makes Hadoop so revolutionary?

Packt

20 Feb 2018

17 min read

In this article by Sourav Gulati and Sumit Kumar authors of book Apache Spark 2.x for Java Developers , explain in classical sense if we are to talk of Hadoop, then it comprises of two components a storage layer called HDFS and a processing layer called MapReduce. The resource management task prior to Hadoop 2.X was done using MapReduce Framework of Hadoop itself, however that changed with the introduction of YARN. In Hadoop 2.0 YARN was introduced as the third component of Hadoop to manage the resources of Hadoop Cluster and make it more Map Reduce agnostic. (For more resources related to this topic, see here.) HDFS Hadoop Distributed File System as the name suggests is a distributed file system based on the lines of Google File System written in Java. In practice HDFS resembles closely like any other UNIX file system with support for common file operations like ls, cp, rm, du, cat and so on. However what makes HDFS stand out despite its simplicity, is its mechanism to handle node failure in Hadoop cluster without effectively changing the seek time for accessing stored files. HDFS cluster consists of two major components: Data Nodes and Name Node. HDFS has a unique way of storing data on HDFS clusters (cheap commodity networked commodity computers). It splits the regular file in smaller chunks called blocks and then makes an exact number of copies of such chunks depending on the replication factor for that file. After that it copies such chunks to different Data Nodes of the Cluster. Name Node Name Node is responsible for managing the metadata of HDFS cluster such as list of files and folders that exist in a cluster, number of splits each file is divided into and their replication and storage at different Data Nodes. It also maintains and manages the namespace and file permission of all the files available in HDFS cluster. Apart from bookkeeping Name Node also has a supervisory role that keeps a watch on the replication factor of all the files and if some block goes missing then issue commands to replicate the missing block of data. It also generates reports to ascertain cluster health too. It is important to note that all the communication for supervisory task happens from Data Node to Name node that is Data Node sends reports a.k.a block reports to Name Node and it is then that Name Node responds to them by issuing different commands or instructions as the need may be. HDFS I/O A HDFS read operation from a client involves: Client requests the NameNode to determine where the actual data blocks are stored for a given file. Name Node obliges by providing the Block IDs and locations of the hosts (Data Node ) where the data can be found. The client contacts the Data Node with respective Block IDs to fetches the data from Data Node while preserving the order of the block files. A HDFS write operation from a client involves: Client contacts the Name Node to update the namespace with the file name and verify necessary permissions. If the file exists then Name Node throws an error else return the client FSDataOutputStream which points to data queue. The data queue negotiates with the NameNode to allocate new blocks on suitable DataNodes. The data is then copied to that DataNode, and as per replication strategy the data it further copied from that DataNode to rest of the DataNodes. It’s important to note that the data is never moved through the NameNode as it would have caused performance bottleneck. YARN Simplest way to understand Yet Another Resource manager (YARN) is to think of it as an operating system on a Cluster; provisioning resources, scheduling jobs & node maintenance. With Hadoop 2.x, MapReduce model of processing the data and managing the cluster (job tracker/task tracker) was divided. While data processing was still left to MapReduce, the cluster’s resource allocation (or rather, scheduling) task was assigned to a new component called YARN. Another objective that YARN met was that it made MapReduce one of the techniques to process the data rather than being the only technology to process data on HDFS as was the case in Hadoop 1.x systems. This paradigm shift opened the flood gate for the development of interesting applications around Hadoop and a new eco-system of not only classical MapReduce processing system evolved. It didn’t take much time after that for Apache Spark to break the hegemony of classical MapReduce and become arguably the most popular processing framework for parallel computing as far as active development and adoption is concerned. In order to serve Multi-tenancy, fault tolerance, and resource isolation in YARN, it developed below components to manage the cluster seamlessly. ResourceManager: It negotiates resources for different compute programmes on a Hadoop cluster while guaranteeing the following: resource isolation, data locality, fault tolerance, task prioritization and effective cluster capacity utilization. A configurable scheduler allows Resource Manager the flexibility to schedule and prioritize different applications as per the need. Tasks served by RM while serving clients: Using client or APIs user can submit or terminate an application. The user can also gather statistics on submitted application, cluster and queue information. RM also priorities ADMIN tasks higher over any other task to perform clean up or maintenance activities on a cluster like refreshing node-list, the queues configuration. Tasks served by RM while serving Cluster Nodes: Provisioning and de-provisioning of new nodes forms an important task of RM. Each node sends a heartbeat at a configured interval, default being 10 minutes. Any failure of node in doing so is treated as dead node. As a clean-up activity all the supposedly running process including containers are marked dead too. Tasks served by RM while serving Application Master: RM registers new AM while terminating the successfully executed ones. Just like Cluster Nodes if the heartbeat of AM is not received within a preconfigured duration, default value being 10 minutes, then AM is marked dead and all the associated containers too are marked dead. But since YARN is reliable as far as Application execution is concerned hence a new AM is rescheduled to try another execution on a new container until it reaches the retry configurable default count of 4. Scheduling and other miscellaneous tasks served by RM: RM maintains a list of running, submitted and executed applications along with its statistics such as execution time , status etc. Privileges of user as well as of applications are maintained and compared while serving various requests of user per application life cycle. RM scheduler oversees resource allocation for application such as memory allocation. Two common scheduling algorithms used in YARN are fair scheduling and capacity scheduling algorithms. NodeManager: NM exist per node of the cluster on a slightly similar fashion as to what slave nodes are in master slave architecture. When a NM starts it sends the information to RM for its availability to share its resources for upcoming jobs. There on NM sends periodic signal also called heartbeat to RM informing them of its status as being alive in the cluster. Primarily NM is responsible for launching containers that has been requested by AM with certain resource requirement such as memory, disk and so on. Once the containers are up and running the NM keeps a watch not on the status of the container’s task but on the resource utilization of the container and kill them if the container start utilizing more resources then it has been provisioned for. Apart from managing the life cycle of the container the NM also keeps RM informed about node’s health. ApplicationMaster: AM gets launched per submitted application and manages the life cycle of submitted application. However the first and foremost task AM does is to negotiate resources from RM to launch task specific containers at different nodes. Once containers are launched the AM keeps track of all the containers’ task status. If any node goes down or the container gets killed because of using excess resources or otherwise in such cases AM renegotiates resources from RM and launch those pending tasks again. AM also keeps reporting the status of the submitted application directly to the user and other such statistics to RM. ApplicationMaster implementation is framework specific and it is because of this reason application/framework specific code if transferred the AM , and it the AM that distributes it further across. This important feature also makes YARN technology agnostic as any framework can implement its ApplicationMaster and then utilized the resources of YARN cluster seamlessly. Container: Container in an abstract sense is a set of minimal resources such as CPU, RAM, Disk I/O, Disk space etc. that are required to run a task independently on a node. The first container after submitting the job is launched by RM to host ApplicationMaster. It is the AM which then negotiates resources from RM in the form of containers, which then gets hosted in different nodes across the Hadoop Cluster. Process flow of application submission in YARN: Step 1: Using a client or APIs the user submits the application let’s say a Spark Job jar. Resource Manager, whose primary task is to gather and report all the applications running on entire Hadoop cluster and available resources on respective Hadoop nodes, depending on the privileges of the user submitting the job accepts the newly submitted task. Step2: After this RM delegates the task to scheduler. The scheduler then searches for a container which can host the application-specific Application Master. While Scheduler does takes into consideration parameters like availability of resources, task priority, data locality etc. before scheduling or launching an Application Master, it has no role in monitoring or restarting a failed job. It is the responsibility of RM to keep track of AM and restart them in a new container when be it fails. Step 3: Once the Application Master gets launched it becomes the prerogative of AM to oversee the resources negotiation with RM for launching task specific containers. Negotiations with RM is typically over: The priority of the tasks at hand. Number of containers to be launched to complete the tasks. The resources need to execute the tasks i.e. RAM, CPU (since Hadoop 3.x). Available nodes where job containers can be launched with required resources Depending on the priority and availability of resources the RM grants containers represented by container ID and hostname of the node on which it can be launched. Step 4: The AM then request the NM of the respective hosts to launch the containers with specific ID’s and resource configuration. The NM then launches the containers but keeps a watch on the resources usage of the task. If for example the container starts utilizing more resources than it has been provisioned for then in such scenario the said containers are killed by the NM. This greatly improves the job isolation and fair sharing of resources guarantee that YARN provides as otherwise it would have impacted the execution of other containers. However, it is important to note that the job status and application status as a whole is managed by AM. It falls in the domain of AM to continuously monitor any delay or dead containers, simultaneously negotiating with RM to launch new containers to reassign the task of dead containers. Step 5: The Containers executing on different nodes sends Application specific statistics to AM at specific intervals. Step 6: AM also reports the status of the application directly to the client that submitted the specific application, in our case a Spark Job. Step 7: NM monitors the resources being utilized by all the containers on the respective nodes and keeps sending a periodic update to RM. Step 8: The AM sends periodic statistics such application status, task failure, log information to RM Overview Of MapReduce Before delving deep into MapReduce implementation in Hadoop, let’s first understand the MapReduce as a concept in parallel computing and why it is a preferred way of computing. MapReduce comprises two mutually exclusive but dependent phases each capable of running on two different machines or nodes: Map: In Map phase transformation of data takes place. It splits data into key value pair by splitting it on a keyword. Suppose we have a text file and we would want to do an analysis such as to count total number of words or even the frequency with which the word has occurred in the text file. This is the classical Word Count problem of MapReduce, now to address this problem first we will have to identify the splitting keyword so that the data can be spilt and be converted into a key value pair. Let’s begin with John Lennon's song Imagine. Sample Text: Imagine there's no heaven It's easy if you try No hell below us Above us only sky Imagine all the people living for today After running Map phase on the sampled text and splitting it over <space> it will get converted to key value pair as follows: <imagine, 1> <there's, 1> <no, 1> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <no, 1> <hell, 1> <below, 1> <us, 1> <above, 1> <us, 1> <only, 1> <sky, 1> <imagine, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>] The key here represents the word and value represents the count, also it should be noted that we have converted all the keys to lowercase to reduce any further complexity arising out of matching case sensitive keys. Reduce: Reduce phase deals with aggregation of Map phase result and hence all the key value pairs are aggregated over key. So the Map output of the text would get aggregated as follows: [<imagine, 2> <there's, 1> <no, 2> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <hell, 1> <below, 1> <us, 2> <above, 1> <only, 1> <sky, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>] As we can see both Map and Reduce phase can be run exclusively and hence can use independent nodes in cluster to process the data. This approach of separation of tasks into smaller units called Map and Reduce has revolutionized general purpose distributed/parallel computing, which we now know as MapReduce. Apache Hadoop's MapReduce has been implemented pretty much the same way as discussed except for adding extra features into how the data from Map phase of each node gets transferred to their designated Reduce phase node. Hadoop's implementation of MapReduce enriches the Map and Reduce phase by adding few more concrete steps in between to make it fault tolerant and truly distributed. We can describe MR jobs on YARN in five stages. Job Submission Stage: When a client submits a MR Job following things happen RM is requested for an application ID. Input data location is checked and if present then file split size is computed. Job's output location need to exist as well. If all the three conditions are met then the MR job jar along with its configuration ,details of input split are copied to HDFS in a directory named the application ID provided by RM. And then the job is submitted to RM to launch a job specific Application Master, MRAppMaster. MAP Stage: Once RM receives the client's request for launching MRAppMaster, a call is made to YARN scheduler for assigning a container. As per resource availability the container is granted and hence the MRAppMaster is launched at the designated node with provisioned resources. After this MRAppMaster fetches input split information from the HDFS path that was submitted by the client and computes the number of Mapper task that will be launched based on the splits. Depending on number of Mappers it also calculates the required number of Reducers as per configuration, If MRAppMaster now finds the number of Mapper ,Reducer & size of input files to be small enough to be run in the same JVM then it goes ahead in doing so, such tasks are called Uber task. However, in other scenarios MRAppMaster negotiates container resources from RM for running these tasks albeit Mapper tasks having higher order and priority. This is so as Mapper tasks must finish before sorting phase can start. Data locality is another concern for containers hosting Mappers as data local nodes are preferred over rack local, with least preference being given to remote node hosted data. But when it comes to Reduce phase no such preference of data locality exist for containers. Containers hosting Mapper function first copy mapReduce JAR & configuration files locally and then launch a class YarnChild in the JVM. The mapper then start reading the input files, process them by making key value pairs and writes them in a circular buffer. Shuffle and Sort Phase: Considering circular buffer has size constraint, after a certain percentage where default being 80, a thread gets spawned which spills the data from buffer. But before copying the spilled data to disk, it is first partitioned with respect to its Reducer then the background thread also sorts the partitioned data on key and if combiner is mentioned then combines the data too. This process optimizes the data once it is copied to their respective partitioned folder. This process is continued until all the data from circular buffer gets written to disk. A background thread again checks if the number of spilled files in each partition is within the range of configurable parameter or else the files are merged and combiner is run over them until it falls within the limit of the parameter. Map task keeps updating the status to ApplicationMaster its entire life cycle, it is only when 5 percent of Map task has been completed that the reduce task start. An auxiliary service in the NodeManager serving Reduce task starts a Netty web server that makes a request to MRAppMaster for Mapper hosts having specific Mapper partitioned files. All the partitioned files that pertain to the Reducer is copied to their respective nodes in similar fashion. Since multiple files gets copied as data from various nodes representing that reduce nodes gets collected, a background thread merges the sorted map file again sorts them and if Combiner is configured then combines the result too. Reduce Stage: It is important to note here that at this stage every input file of each reducer should have been sorted by key, this is the presumption with which Reducer starts processing these records and converts the key value pair into aggregated list. Once reducer processes the data it writes them to the output folder as was mentioned during Job submission. Clean up stage: Each Reducer sends periodic update to MRAppMaster about the task completion, once the Reduce task is over the application master starts the clean-up activity. The submitted job status is changed from running to successful, all the temporary and intermediate files and folders are deleted .The application statistics are archived to job history server. Summary In this article we saw what is HDFS and YARN along with MapReduce in which we learned different function of MapReduce and HDFS I/O. Resources for Article: Further resources on this subject: Getting Started with Apache Spark DataFrames [article] Five common questions for .NET/Java developers learning JavaScript and Node.js [article] Getting Started with Apache Hadoop and Apache Spark [article]

0
0
4193

article-image-installing-configuring-x-pack-elasticsearch-kibana

Pravin Dhandre

20 Feb 2018

6 min read

Installing and Configuring X-pack on Elasticsearch and Kibana

Pravin Dhandre

20 Feb 2018

6 min read

0
0
16098

article-image-introduction-performance-testing-and-jmeter

Packt

20 Feb 2018

11 min read

Introduction to Performance Testing and JMeter

Packt

20 Feb 2018

11 min read

In this article by Bayo Erinle, the author of the book Performance Testing with JMeter 3, will explore some of the options that make JMeter a great tool of choice for performance testing. (For more resources related to this topic, see here.) Performance testing and tuning There is a strong relationship between performance testing and tuning, in the sense that one often leads to the other. Often, end-to-end testing unveils system or application bottlenecks that are regarded unacceptable with project target goals. Once those bottlenecks are discovered, the next step for most teams is a series of tuning efforts to make the application perform adequately. Such efforts normally include, but are not limited to, the following: Configuring changes in system resources Optimizing database queries Reducing round trips in application calls, sometimes leading to redesigning and re-architecting problematic modules Scaling out application and database server capacity Reducing application resource footprint Optimizing and refactoring code, including eliminating redundancy and reducing execution time Tuning efforts may also commence if the application has reached acceptable performance but the team wants to reduce the amount of system resources being used, decrease the volume of hardware needed, or further increase system performance. After each change (or series of changes), the test is re-executed to see whether the performance has improved or declined due to the changes. The process will be continued with the performance results having reached acceptable goals. The outcome of these test-tuning circles normally produces a baseline. Baselines Baseline is a process of capturing performance metric data for the sole purpose of evaluating the efficacy of successive changes to the system or application. It is important that all characteristics and configurations, except those specifically being varied for comparison, remain the same in order to make effective comparisons as to which change (or series of changes) is driving results toward the targeted goal. Armed with such baseline results, subsequent changes can be made to the system configuration or application and testing results can be compared to see whether such changes were relevant or not. Some considerations when generating baselines include the following: They are application-specific They can be created for system, application, or modules They are metrics/results They should not be over generalized They evolve and may need to be redefined from time to time They act as a shared frame of reference They are reusable They help identify changes in performance Load and stress testing Load testing is the process of putting demand on a system and measuring its response, that is, determining how much volume the system can handle. Stress testing is the process of subjecting the system to unusually high loads far beyond its normal usage pattern to determine its responsiveness. These are different from performance testing, whose sole purpose is to determine the response and effectiveness of a system, that is, how fast the system is. Since load ultimately affects how a system responds, performance testing is always done in conjunction with stress testing. JMeter to the rescue One of the areas performance testing covers is testing tools. Which testing tool do you use to put the system and application under load? There are numerous testing tools available to perform this operation, from free to commercial solutions. However, our focus will be on Apache JMeter, a free, open source, cross-platform desktop application from the Apache Software foundation. JMeter has been around since 1998 according to historic change logs on its official site, making it a mature, robust, and reliable testing tool. Cost may also have played a role in its wide adoption. Small companies usually may not want to foot the bill for commercial end testing tools, which often place restrictions, for example, on how many concurrent users one can spin off. My first encounter with JMeter was exactly a result of this. I worked in a small shop that had paid for a commercial testing tool, but during the course of testing, we had outrun the licensing limits of how many concurrent users we needed to simulate for realistic test plans. Since JMeter was free, we explored it and were quite delighted with the offerings and the share amount of features we got for free. Here are some of its features: Performance tests of different server types, including web (HTTP and HTTPS), SOAP, database, LDAP, JMS, mail, and native commands or shell scripts Complete portability across various operating systems Full multithreading framework allowing concurrent sampling by many threads and simultaneous sampling of different functions by separate thread groups Full featured Test IDE that allows fast Test Plan recording, building, and debugging Dashboard Report for detailed analysis of application performance indexes and key transactions In-built integration with real-time reporting and analysis tools, such as Graphite, InfluxDB, and Grafana, to name a few Complete dynamic HTML reports Graphical User Interface (GUI) HTTP proxy recording server Caching and offline analysis/replaying of test results High extensibility Live view of results as testing is being conducted JMeter allows multiple concurrent users to be simulated on the application, allowing you to work toward most of the target goals obtained earlier, such as attaining baseline and identifying bottlenecks. It will help answer questions, such as the following: Will the application still be responsive if 50 users are accessing it concurrently? How reliable will it be under a load of 200 users? How much of the system resources will be consumed under a load of 250 users? What will the throughput look like with 1000 users active in the system? What will be the response time for the various components in the application under load? JMeter, however, should not be confused with a browser. It doesn't perform all the operations supported by browsers; in particular, JMeter does not execute JavaScript found in HTML pages, nor does it render HTML pages the way a browser does. However, it does give you the ability to view request responses as HTML through many of its listeners, but the timings are not included in any samples. Furthermore, there are limitations to how many users can be spun on a single machine. These vary depending on the machine specifications (for example, memory, processor speed, and so on) and the test scenarios being executed. In our experience, we have mostly been able to successfully spin off 250-450 users on a single machine with a 2.2 GHz processor and 8 GB of RAM. Up and running with JMeter Now, let's get up and running with JMeter, beginning with its installation. Installation JMeter comes as a bundled archive, so it is super easy to get started with it. Those working in corporate environments behind a firewall or machines with non-admin privileges appreciate this more. To get started, grab the latest binary release by pointing your browser to http://jmeter.apache.org/download_jmeter.cgi. At the time of writing this, the current release version is 3.1. The download site offers the bundle as both a .zip file and a .tgz file. We go with the .zip file option, but feel free to download the .tgz file if that's your preferred way of grabbing archives. Once downloaded, extract the archive to a location of your choice. The location you extracted the archive to will be referred to as JMETER_HOME. Provided you have a JDK/JRE correctly installed and a JAVA_HOME environment variable set, you are all set and ready to run! The following screenshot shows a trimmed down directory structure of a vanilla JMeter install: JMETER_HOME folder structure The following are some of the folders in Apache-JMeter-3.2, as shown in the preceding screenshot: bin: This folder contains executable scripts to run and perform other operations in JMeter docs: This folder contains a well-documented user guide extras: This folder contains miscellaneous items, including samples illustrating the usage of the Apache Ant build tool (http://ant.apache.org/) with JMeter, and bean shell scripting lib: This folder contains utility JAR files needed by JMeter (you may add additional JARs here to use from within JMeter; we will cover this in detail later) printable_docs: This is the printable documentation Installing Java JDK Follow these steps to install Java JDK: Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html. Download Java JDK (not JRE) compatible with the system that you will use to test. At the time of writing, JDK 1.8 (update 131) was the latest. Double-click on the executable and follow the onscreen instructions. On Windows systems, the default location for the JDK is under Program Files. While there is nothing wrong with this, the issue is that the folder name contains a space, which can sometimes be problematic when attempting to set PATH and run programs, such as JMeter, depending on the JDK from the command line. With this in mind, it is advisable to change the default location to something like C:toolsjdk. Setting up JAVA_HOME Here are the steps to set up the JAVA_HOME environment variable on Windows and Unix operating systems. On Windows For illustrative purposes, assume that you have installed Java JDK at C:toolsjdk: Go to Control Panel. Click on System. Click on Advance System settings. Add Environment to the following variables: Value: JAVA_HOME Path: C:toolsjdk Locate Path (under system variables, bottom half of the screen). Click on Edit. Append %JAVA_HOME%/bin to the end of the existing path value (if any). On Unix For illustrative purposes, assume that you have installed Java JDK at /opt/tools/jdk: Open up a Terminal window. Export JAVA_HOME=/opt/tools/jdk. Export PATH=$PATH:$JAVA_HOME. It is advisable to set this in your shell profile settings, such as .bash_profile (for bash users) or .zshrc (for zsh users), so that you won't have to set it for each new Terminal window you open. Running JMeter Once installed, the bin folder under the JMETER_HOME folder contains all the executable scripts that can be run. Based on the operating system that you installed JMeter on, you either execute the shell scripts (.sh file) for operating systems that are Unix/Linux flavored, or their batch (.bat file) counterparts on operating systems that are Windows flavored. JMeter files are saved as XML files with a .jmx extension. We refer to them as test scripts or JMX files. These scripts include the following: jmeter.sh: This script launches JMeter GUI (the default) jmeter-n.sh: This script launches JMeter in non-GUI mode (takes a JMX file as input) jmeter-n-r.sh: This script launches JMeter in non-GUI mode remotely jmeter-t.sh: This opens a JMX file in the GUI jmeter-server.sh: This script starts JMeter in server mode (this will be kicked off on the master node when testing with multiple machines remotely) mirror-server.sh: This script runs the mirror server for JMeter shutdown.sh: This script gracefully shuts down a running non-GUI instance stoptest.sh: This script abruptly shuts down a running non-GUI instance To start JMeter, open a Terminal shell, change to the JMETER_HOME/bin folder, and run the following command on Unix/Linux: ./jmeter.sh Alternatively, run the following command on Windows: jmeter.bat Take a moment to explore the GUI. Hover over each icon to see a short description of what it does. The Apache JMeter team has done an excellent job with the GUI. Most icons are very similar to what you are used to, which helps ease the learning curve for new adapters. Some of the icons, for example, stop and shutdown, are disabled for now till a scenario/test is being conducted. The JVM_ARGS environment variable can be used to override JVM settings in the jmeter.bat or jmeter.sh script. Consider the following example: export JVM_ARGS="-Xms1024m -Xmx1024m -Dpropname=propvalue". Command-line options To see all the options available to start JMeter, run the JMeter executable with the -? command. The options provided are as follows: . ./jmeter.sh -? -? print command line options and exit -h, --help print usage information and exit -v, --version print the version information and exit -p, --propfile <argument> the jmeter property file to use -q, --addprop <argument> additional JMeter property file(s) -t, --testfile <argument> the jmeter test(.jmx) file to run -l, --logfile <argument> the file to log samples to -j, --jmeterlogfile <argument> jmeter run log file (jmeter.log) -n, --nongui run JMeter in nongui mode ... -J, --jmeterproperty <argument>=<value> Define additional JMeter properties -G, --globalproperty <argument>=<value> Define Global properties (sent to servers) e.g. -Gport=123 or -Gglobal.properties -D, --systemproperty <argument>=<value> Define additional system properties -S, --systemPropertyFile <argument> additional system property file(s) This is a snippet (non-exhaustive list) of what you might see if you did the same. Summary In this article we have learnt relationship between performance testing and tuning, and how to install and run JMeter. Resources for Article: Further resources on this subject: Functional Testing with JMeter [article] Creating an Apache JMeter™ test workbench [article] Getting Started with Apache Spark DataFrames [article]

0
0
2106

article-image-the-great-unbundling-tech-stack-developer-learning

Dave Maclean

19 Feb 2018

5 min read

The great unbundling of the tech stack and developer learning

Dave Maclean

19 Feb 2018

5 min read

The nineties: the large software vendors dominate When I started my first tech publisher, Wrox, back in 1994, tech was dominated by a few large IT vendors: IBM, Oracle, SAP, DEC [!], CA. The insurgent company was Microsoft, and the disruptive technology was Unix. Microsoft grew out from their ownership of the desktop back into the enterprise, and had a uniquely open strategy to encourage a third party developer ecosystem. Unix was an open ecosystem from the start. Both of these created a window for start-up tech publishers like Wrox and O’Reilly to get started. No-one had ever published a profitable book on an IBM tool, say DB2. IBM had that whole world locked down with a vertically integrated model across hardware, software, services and education. The research firm IDC has at various times modelled the total value of IT software and services spending in the US at $100bn - $200bn, and IT training to be about 2-5% of this total, usually bundled in with the total package. [An old version of IDC analysis] The “hidden” market for IT skills training, both for users and for developers, bundled in with closed vendor ecosystems was an estimated $5bn - $10bn in the US alone. SAP Education claims to train 500,000 customers and implementers each year. The impact of the internet Then the internet arrived. The internet is a machine for unbundling everything. By joining everyone and everything for zero marginal cost the internet relentlessly breaks companies, industries, products and services into independent functional elements. All media business models have been unravelling for years, well-documented by the excellent Ben Thompson in his Stratechery Blog. More on the next 10 years of unbundling is mapped on the incomparable CBinsights Trends report. In parallel, unbundling is also happening to IT vendor stacks. We’re moving relentlessly away from one vendor, and all projects toward a dynamic stack mix, project by project. The internet is both the driver and enabler of this great unbundling. As every aspect of our lives moves online, every organisation is becoming a network of software systems. Nobody nails this concept better than Marc Andreesen in his classicSoftware is Eating the World. The only way to meet this demand is to break free of a vertical vendor stack. SAP might make great accounting software but who wants customer facing apps from SAP? From the point of view of software development, the internet has been a story of relentless atomisation. The internet itself grew out of the granddaddy of all open platforms, Unix. Open source exploded in the 2000s as the global community of developers built shared tools and code together. We started Packt in 2003 to publish books for open source developers. We saw that in a friction-free world, tools would become more fragmented and specialised, and would need a new business model for niche technical content. Open source tools try and do one job well, leaving developers to assemble specific solutions, case-by-case. Cloud and micro-services follow the same logic. There’s a nice timeline here from CapGemini. This podcast from a16z is a good take on the emerging world of micro services, and has a great quote from Chris Dixon, saying how many cloud and micro-services start-ups are offering one old school Unix shell command as a whole company. The impact on the way developers learn Clearly where the tech stack unbundles, so does developer learning. If your project has a dozen elements from different vendors, you’ve got two learning problems. Firstly, you have to navigate the developer support and training centres from each project and vendor. Secondly, you have to work out how the tech fits together. It’s the logic of unbundling that is creating the market for developer eLearning that big players like Pluralsight and Lynda have surged into over the last 5-8 years. Get all your learning across the stack from one place. Pluralsight are probably growing at 50% YonY and cover all vendors, all stacks in one place. Consistently our top titles at Packt show developers how to combine tools and technologies together, ensuring developer learning matches the way software is actually created. At Packt, we think that the market for developer eLearning globally is around $1-$3bn today and will double over the next 5 years as learning peels away from the closed vendor worlds. That’s the market we're playing in, along with more and more innovative competitors. But in the same way as CIOs now have to curate an open ecosystem of technology, so each developer has to curate their own ecosystem of learning. And if software is unbundling, there is a need for analysis, insight and curation to help developers and CIOs make sense of it. Packt, and its online learning platform Mapt, aim to help make sense of this fragmented landscape. There’s a big question lurking behind this: is the emergence of public clouds like AWS with a rich aggregation of micro-services a return to a closed vendor ecosystem? But that’s for another discussion...

0
0
1969

How-To Tutorials

article-image-crud-create-read-update-delete-operations-elasticsearch

Pravin Dhandre

19 Feb 2018

5 min read

CRUD (Create Read, Update and Delete) Operations with Elasticsearch

Pravin Dhandre

19 Feb 2018

5 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book is for beginners who want to start performing distributed search analytics and visualization using core functionalities of Elasticsearch, Kibana and Logstash.[/box] In this tutorial, we will look at how to perform basic CRUD operations using Elasticsearch. Elasticsearch has a very well designed REST API, and the CRUD operations are targeted at documents. To understand how to perform CRUD operations, we will cover the following APIs. These APIs fall under the category of Document APIs that deal with documents: Index API Get API Update API Delete API Index API In Elasticsearch terminology, adding (or creating) a document into a type within an index of Elasticsearch is called an indexing operation. Essentially, it involves adding the document to the index by parsing all fields within the document and building the inverted index. This is why this operation is known as an indexing operation. There are two ways we can index a document: Indexing a document by providing an ID Indexing a document without providing an ID Indexing a document by providing an ID We have already seen this version of the indexing operation. The user can provide the ID of the document using the PUT method. The format of this request is PUT /<index>/<type>/<id>, with the JSON document as the body of the request: PUT /catalog/product/1 { "sku": "SP000001", "title": "Elasticsearch for Hadoop", "description": "Elasticsearch for Hadoop", "author": "Vishal Shukla", "ISBN": "1785288997", "price": 26.99 } Indexing a document without providing an ID If you don't want to control the ID generation for the documents, you can use the POST method. The format of this request is POST /<index>/<type>, with the JSON document as the body of the request: POST /catalog/product { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } The ID in this case will be generated by Elasticsearch. It is a hash string, as highlighted in the response: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true } As per pure REST conventions, POST is used for creating a new resource and PUT is used for updating an existing resource. Here, the usage of PUT is equivalent to saying I know the ID that I want to assign, so use this ID while indexing this document. Get API The Get API is useful for retrieving a document when you already know the ID of the document. It is essentially a get by primary key operation: GET /catalog/product/AVrASKqgaBGmnAMj1SBe The format of this request is GET /<index>/<type>/<id>. The response would be as Expected: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "found": true, "_source": { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } } Update API The Update API is useful for updating the existing document by ID. The format of an update request is POST <index>/<type>/<id>/_update with a JSON request as the body: POST /catalog/product/1/_update { "doc": { "price": "28.99" } } The properties specified under the "doc" element are merged into the existing document. The previous version of this document with ID 1 had price of 26.99. This update operation just updates the price and leaves the other fields of the document unchanged. This type of update means "doc" is specified and used as a partial document to merge with an existing document; there are other types of updates supported. The response of the update request is as follows: { "_index": "catalog", "_type": "product", "_id": "1", "_version": 2, "result": "updated", "_shards": { "total": 2, "successful": 1, "failed": 0 } } Internally, Elasticsearch maintains the version of each document. Whenever a document is updated, the version number is incremented. The partial update that we have seen above will work only if the document existed beforehand. If the document with the given id did not exist, Elasticsearch will return an error saying that document is missing. Let us understand how do we do an upsert operation using the Update API. The term upsert loosely means update or insert, i.e. update the document if it exists otherwise insert new document. The parameter doc_as_upsert checks if the document with the given id already exists and merges the provided doc with the existing document. If the document with the given id doesn't exist, it inserts a new document with the given document contents. The following example uses doc_as_upsert to merge into the document with id 3 or insert a new document if it doesn't exist. POST /catalog/product/3/_update { "doc": { "author": "Albert Paro", "title": "Elasticsearch 5.0 Cookbook", "description": "Elasticsearch 5.0 Cookbook Third Edition", "price": "54.99" }, "doc_as_upsert": true } We can update the value of a field based on the existing value of that field or another field in the document. The following update uses an inline script to increase the price by two for a specific product: POST /catalog/product/AVrASKqgaBGmnAMj1SBe/_update { "script": { "inline": "ctx._source.price += params.increment", "lang": "painless", "params": { "increment": 2 } } } Scripting support allows for the reading of the existing value, incrementing the value by a variable, and storing it back in a single operation. The inline script used here is Elasticsearch's own painless scripting language. The syntax for incrementing an existing variable is similar to most other programming languages. Delete API The Delete API lets you delete a document by ID: DELETE /catalog/product/AVrASKqgaBGmnAMj1SBe The response of the delete operations is as follows: { "found": true, "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 4, "result": "deleted", "_shards": { "total": 2, "successful": 1, "failed": 0 } } This is how basic CRUD operations are performed with Elasticsearch using simple document APIs from any data source in any format securely and reliably. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 and start building end-to-end real-time data processing solutions for your enterprise analytics applications.

0
0
13098

article-image-how-to-classify-digits-using-keras-and-tensorflow

Sugandha Lahoti

19 Feb 2018

13 min read

How to Classify Digits using Keras and TensorFlow

Sugandha Lahoti

19 Feb 2018

13 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book provides a practical approach to building efficient machine learning models using ensemble techniques with real-world use cases.[/box] Today we will look at how we can create, train, and test a neural network to perform digit classification using Keras and TensorFlow. This article uses MNIST dataset with images of handwritten digits.It contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset. There have been a number of scientific papers on attempts to achieve the lowest error rate. One paper, by using a hierarchical system of CNNs, manages to get an error rate on the MNIST database of 0.23 percent. The original creators of the database keep a list of some of the methods tested on it. In their original paper, they used a support vector machine to get an error rate of 0.8 percent. Images in the dataset look like this: So let's not waste our time and start implementing our very first neural network in Python. Let’s start the code by importing the supporting projects. # Imports for array-handling and plotting import numpy as np import matplotlib import matplotlib.pyplot as plt Keras already has the MNIST dataset as a sample dataset, so we can import it as it is. Generally, it downloads the data over the internet and stores it into the database. So, if your system does not have the dataset, Internet will be required to download it: # Keras imports for the dataset and building our neural network from keras.datasets import mnist Now, we will import the Sequential and load_model classes from the keras.model class. We are working with sequential networks as all layers will be in forward sequence only. We are not using any split in the layers. The Sequential class will create a sequential model by combining the layers sequentially. The load_model class will help us to load the trained model for testing and evaluation purposes: #Import Sequential and Load model for creating and loading model from keras.models import Sequential, load_model In the next line, we will call three types of layers from the keras library. Dense layer means a fully connected layer; that is, each neuron of current layer will have a connection to the each neuron of the previous as well as next layer. The dropout layer is for reducing overfitting in our model. It randomly selects some neurons and does not use them for training for that iteration. So there are less chances that two different neurons of the same layer learn the same features from the input. By doing this, it prevents redundancy and correlation between neurons in the network, which eventually helps prevent overfitting in the network. The activation layer applies the activation function to the output of the neuron. We will use rectified linear units (ReLU) and the softmax function as the activation layer. We will discuss their operation when we use them in network creation: #We will use Dense, Drop out and Activation layers from keras.layers.core import Dense, Dropout, Activation from keras.utils import np_utils So we will start with loading our dataset by mnist.load. It will give us training and testing input and output instances. Then, we will visualize some instances so that we know what kind of data we are dealing with. We will use matplotlib to plot them. As the images have gray values, we can easily plot a histogram of the images, which can give us the pixel intensity distribution: #Let's Start by loading our dataset (X_train, y_train), (X_test, y_test) = mnist.load_data() #Plot the digits to verify plt.figure() for i in range(9): plt.subplot(3,3,i+1) plt.tight_layout() plt.imshow(X_train[i], cmap='gray', interpolation='none') plt.title("Digit: {}".format(y_train[i])) plt.xticks([]) plt.yticks([]) plt.show() When we execute our code for the preceding code block, we will get the output as: #Lets analyze histogram of the image plt.figure() plt.subplot(2,1,1) plt.imshow(X_train[0], cmap='gray', interpolation='none') plt.title("Digit: {}".format(y_train[0])) plt.xticks([]) plt.yticks([]) plt.subplot(2,1,2) plt.hist(X_train[0].reshape(784)) plt.title("Pixel Value Distribution") plt.show() The histogram of an image will look like this: # Print the shape before we reshape and normalize print("X_train shape", X_train.shape) print("y_train shape", y_train.shape) print("X_test shape", X_test.shape) print("y_test shape", y_test.shape) Currently, this is shape of the dataset we have: X_train shape (60000, 28, 28) y_train shape (60000,) X_test shape (10000, 28, 28) y_test shape (10000,) As we are working with 2D images, we cannot train them as with our neural network. For training 2D images, there are different types of neural networks available; we will discuss those in the future. To remove this data compatibility issue, we will reshape the input images into 1D vectors of 784 values (as images have size 28X28). We have 60000 such images in training data and 10000 in testing: # As we have data in image form convert it to row vectors X_train = X_train.reshape(60000, 784) X_test = X_test.reshape(10000, 784) X_train = X_train.astype('float32') X_test = X_test.astype('float32') Normalize the input data into the range of 0 to 1 so that it leads to a faster convergence of the network. The purpose of normalizing data is to transform our dataset into a bounded range; it also involves relativity between the pixel values. There are various kinds of normalizing techniques available such as mean normalization, min-max normalization, and so on: # Normalizing the data to between 0 and 1 to help with the training X_train /= 255 X_test /= 255 # Print the final input shape ready for training print("Train matrix shape", X_train.shape) print("Test matrix shape", X_test.shape) Let's print the shape of the data: Train matrix shape (60000, 784) Test matrix shape (10000, 784) Now, our training set contains output variables as discrete class values; say, for an image of number eight, the output class value is eight. But our output neurons will be able to give an output only in the range of zero to one. So, we need to convert discrete output values to categorical values so that eight can be represented as a vector of zero and one with the length equal to the number of classes. For example, for the number eight, the output class vector should be: 8 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] # One-hot encoding using keras' numpy-related utilities n_classes = 10 print("Shape before one-hot encoding: ", y_train.shape) Y_train = np_utils.to_categorical(y_train, n_classes) Y_test = np_utils.to_categorical(y_test, n_classes) print("Shape after one-hot encoding: ", Y_train.shape) After one-hot encoding of our output, the variable’s shape will be modified as: Shape before one-hot encoding: (60000,) Shape after one-hot encoding: (60000, 10) So, you can see that now we have an output variable of 10 dimensions instead of 1. Now, we are ready to define our network parameters and layer architecture. We will start creating our network by creating a Sequential class object, model. We can add different layers to this model as we have done in the following code block. We will create a network of an input layer, two hidden layers, and one output layer. As the input layer is always our data layer, it doesn't have any learning parameters. For hidden layers, we will use 512 neurons in each. At the end, for a 10-dimensional output, we will use 10 neurons in the final layer: # Here, we will create model of our ANN # Create a linear stack of layers with the sequential model model = Sequential() #Input Layer with 512 Weights model.add(Dense(512, input_shape=(784,))) #We will use relu as Activation model.add(Activation('relu')) #Put Drop out to prevent over-fitting model.add(Dropout(0.2)) #Add Hidden layer with 512 neurons with relu activation model.add(Dense(512)) model.add(Activation('relu')) model.add(Dropout(0.2)) #This is our Output layer with 10 neurons model.add(Dense(10))model.add(Activation('softmax')) After defining the preceding structure, our neural network will look something like this: The Shape field in each layer shows the shape of the data matrix in that layer, and it is quite intuitive. As we first get the multiplication of input with length of 784 values to 512 neurons, the data shape at Hidden-1 will be 784 X 512. It will be calculated similarly for the other two layers. We have used two different kinds of activation functions here. The first one is ReLU and the second one is sofmax probabilities. We will give some time to discuss these two. ReLU prevent the output of the neuron from becoming negative. The expression for relu function is: So if any neuron produces an output less than 0, it converts it to 0. We can write it in conditional form as: You just need to know that ReLU is a slightly better activation function than sigmoid. If we plot a sigmoid function, it will look like: If you look closer, the sigmoid function starts getting saturated before reaching its minimum (0) or maximum (1) values. So at the time of gradient calculation, values in the saturated region result in a very small gradient. That causes a very small change in the weight values, which is not sufficient to optimize the cost function. Now, as we go more backward during the backpropagation, that small change becomes smaller and almost reaches zero. This problem is known as the problem of vanishing gradients. So, in practical cases, we avoid sigmoid activation when our network has many stacked layers. Whereas if we see the expression of ReLU activation, it is more like a straight line: So, the gradient of the preceding function will always a non-zero value until and unless the output itself is a zero value. Thus, it prevents the problem of vanishing gradients. We have discussed the significance of the dropout layer earlier and I don’t think that it is further required. We are using 20% neuron dropout during the training time. We will not use the dropout layer during the testing time. Now, we are all set to train our very first ANN, but before starting training, we have to define the values of the network hyperparameters. We will use SGD using adaptive momentum. There are many algorithms to optimize the performance of the SGD algorithm. You just need to know that adaptive momentum is a better choice than simple gradient descent because it modifies the learning rate using previous errors created by the network. So, there are less chances of getting trapped at the local minima or missing the global minima conditions. We are using SGD with ADAM, using its default parameters. Here, we use batch_size of 128 samples. That means we will update the weights after calculating the error on these 128 samples. It is a sufficient batch size for our total data population. We are going to train our network for 20 epochs for the time being. Here, one epoch means one complete training cycle of all mini-batches. Now, let's start training our network: #Here we will be compiling the sequential model model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') # Start training the model and saving metrics in history history = model.fit(X_train, Y_train, batch_size=128, epochs=20, verbose=2, validation_data=(X_test, Y_test)) We will save our trained model on disk so that we can use it for further fine-tuning whenever required. We will store the model in the HDF5 file format: # Saving the model on disk path2save = 'E:/PyDevWorkSpaceTest/Ensembles/Chapter_10/keras_mnist.h5' model.save(path2save) print('Saved trained model at %s ' % path2save) # Plotting the metrics fig = plt.figure() plt.subplot(2,1,1) plt.plot(history.history['acc']) plt.plot(history.history['val_acc']) plt.title('model accuracy') plt.ylabel('accuracy') plt.xlabel('epoch') plt.legend(['train', 'test'], loc='lower right') plt.subplot(2,1,2) plt.plot(history.history['loss']) plt.plot(history.history['val_loss']) plt.title('model loss') plt.ylabel('loss') plt.xlabel('epoch') plt.legend(['train', 'test'], loc='upper right') plt.tight_layout() plt.show() Let's analyze the loss with each iteration during the training of our neural network; we will also plot the accuracies for validation and test set. You should always monitor validation and training loss as it can help you know whether your model is underfitting or overfitting: Test Loss 0.0824991761778 Test Accuracy 0.9813 As you can see, we are getting almost similar performance for our training and validation sets in terms of loss and accuracy. You can see how accuracy is increasing as the number of epochs increases. This shows that our network is learning. Now, we have trained and stored our model. It's time to reload it and test it with the 10000 test instances: #Let's load the model for testing data path2save = 'D:/PyDevWorkspace/EnsembleMachineLearning/Chapter_10/keras_mnist.h5' mnist_model = load_model(path2save) #We will use Evaluate function loss_and_metrics = mnist_model.evaluate(X_test, Y_test, verbose=2) print("Test Loss", loss_and_metrics[0]) print("Test Accuracy", loss_and_metrics[1]) #Load the model and create predictions on the test set mnist_model = load_model(path2save) predicted_classes = mnist_model.predict_classes(X_test) #See which we predicted correctly and which not correct_indices = np.nonzero(predicted_classes == y_test)[0] incorrect_indices = np.nonzero(predicted_classes != y_test)[0] print(len(correct_indices)," classified correctly") print(len(incorrect_indices)," classified incorrectly") So, here is the performance of our model on the test set: 9813 classified correctly 187 classified incorrectly As you can see, we have misclassified 187 instances out of 10000, which I think is a very good accuracy on such a complex dataset. In the next code block, we will analyze such cases where we detect false labels: #Adapt figure size to accomodate 18 subplots plt.rcParams['figure.figsize'] = (7,14) plt.figure() # plot 9 correct predictions for i, correct in enumerate(correct_indices[:9]): plt.subplot(6,3,i+1) plt.imshow(X_test[correct].reshape(28,28), cmap='gray', interpolation='none') plt.title( "Predicted: {}, Truth: {}".format(predicted_classes[correct], y_test[correct])) plt.xticks([]) plt.yticks([]) # plot 9 incorrect predictions for i, incorrect in enumerate(incorrect_indices[:9]): plt.subplot(6,3,i+10) plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none') plt.title( "Predicted {}, Truth: {}".format(predicted_classes[incorrect], y_test[incorrect])) plt.xticks([]) plt.yticks([]) plt.show() If you look closely, our network is failing on such cases that are very difficult to identify by a human, too. So, we can say that we are getting quite a good accuracy from a very simple model. We saw how to create, train, and test a neural network to perform digit classification using Keras and TensorFlow. If you found our post useful, do check out this book Ensemble Machine Learning to build ensemble models using TensorFlow and Python libraries such as scikit-learn and NumPy.

0
0
5076

article-image-use-labview-data-acquisition

Fatema Patrawala

17 Feb 2018

14 min read

How to use LabVIEW for data acquisition

Fatema Patrawala

17 Feb 2018

14 min read

0
0
4449

article-image-implement-memory-oltp-sql-server-linux

Fatema Patrawala

17 Feb 2018

11 min read

How to implement In-Memory OLTP on SQL Server in Linux

Fatema Patrawala

17 Feb 2018

11 min read

0
0
4412

article-image-perform-predictive-forecasting-sap-analytics-cloud

Kunal Chaudhari

17 Feb 2018

7 min read

How to perform predictive forecasting in SAP Analytics Cloud

Kunal Chaudhari

17 Feb 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Riaz Ahmed titled Learning SAP Analytics Cloud. This book involves features of the SAP Analytics Cloud which will help you collaborate, predict and solve business intelligence problems with cloud computing.[/box] In this article we will learn how to use predictive forecasting with the help of a trend time series chart to see revenue trends in a range of a year. Time series forecasting is only supported for planning models in SAP Analytics Cloud. So, you need planning rights and a planning license to run a predictive time-series forecast. However, you can add predictive forecast by creating a trend time series chart based on an analytical model to estimate future values. In this article, you will use a trend time series chart to view net revenue trends throughout the range of a year. A predictive time-series forecast runs an algorithm on historical data to predict future values for specific measures. For this type of chart, you can forecast a maximum of three different measures, and you have to specify the time for the prediction and the past time periods to use as historical data. Add a blank chart from the Insert toolbar. Set Data Source to the BestRun_Demo model. Select the Time Series chart from the Trend category. In the Measures section, click on the Add Measure link, and select Net Revenue. Finally, click on the Add Dimension link in the Time section, and select Date as the chart’s dimension: The output of your selections is depicted in the first view in the following screenshot. Every chart you create on your story page has its own unique elements that let you navigate and drill into details. The trend time series chart also allows you to zoom in to different time periods and scroll across the entire timeline. For example, the first figure in the following illustration provides a one-year view (A) of net revenue trends, that is from January to December 2015. Click on the six months link (B) to see the corresponding output, as illustrated in the second view. Drag the rectangle box (C) to the left or right to scroll across the entire timeline: Adding a forecast Click on the last data point representing December 2015, and select Add Forecast from the More Actions menu (D) to add a forecast: You see the Predictive Forecast panel on the right side, which displays the maximum number of forecast periods. Using the slider (E) in this section, you can reduce the number of forecast periods. By default, you see the maximum number (in the current scenario, it is seven) in the slider, which is determined by the amount of historical data you have. In the Forecast On section, you see the measure (F) you selected for the chart. If required, you can forecast a maximum of three different measures in this type of chart that you can add in the Builder panel. For the time being, click on OK to accept the default values for the forecast, as illustrated in the following screenshot: The forecast will be added to the chart. It is indicated by a highlighted area (G) and a dotted line (H). Click on the 1 year link (I) to see an output similar to the one illustrated in the following screenshot under the Modifying forecast section. As you can see, there are several data points that represent forecast. The top and bottom of the highlighted area indicate the upper and lower bounds of the prediction range, and the data points fall in the middle (on the dotted line) of the forecast range for each time period. Select a data point to see the Upper Confidence Bound (J) and Lower Confidence Bound (K) values. Modifying forecast You can modify a forecast using the link provided in the Forecast section at the bottom of the Builder panel. Select the chart, and scroll to the bottom of the Builder panel. Click on the Edit icon (L) to see the Predictive Forecast panel again. Review your settings, and make the required changes in this panel. For example, drag the slider toward the left to set the Forecast Periods value to 3 (M). Click on OK to save your settings. The chart should now display the forecast for three months--January, February, and March 2016 (N): Adding a time calculation If you want to display values such as year-over-year sales trends or year-to-date totals in your chart, then you can utilize the time calculation feature of SAP Analytics Cloud. The time calculation feature provides you with several calculation options. In order to use this feature, your chart must contain a time dimension with the appropriate level of granularity. For example, if you want to see quarter-over-quarter results, the time dimension must include quarterly or even monthly results. The space constraint prevents us from going through all these options. However, we will utilize the year-over-year option to compare yearly results in this article to get an idea about this feature. Execute the following instructions to first create a bar chart that shows the sold quantities of the four product categories. Then, add a time calculation to the chart to reveal the year-over-year changes in quantity sold for each category. As usual, add a blank chart to the page using the chart option on the Insert toolbar. Select the Best Run model as Data Source for the chart. Select the Bar/Column chart from the Comparison category. In the Measures section, click on the Add Measure link, and select Quantity Sold. Click on the Add Dimension link in the Dimensions section, and select Product as the chart’s dimension, as shown here: The chart appears on the page. At this stage, if you click on the More icon representing Quantity sold, you will see that the Add Time Calculation option (A) is grayed out. This is because time calculations require a time dimension to the chart, which we will add next. Click on the Add Dimension link in the Dimensions section, and select Date to add this time dimension to the chart. The chart transforms, as illustrated in the following screenshot: To display the results in the chart at the year level, you need to apply a filter as follows: Click on the filter icon in the Date dimension, and select Filter by Member. In the Set Members for Date dialog box, expand the all node, and select 2014, 2015, and 2016, individually. Once again, the chart changes to reflect the application of filter, as illustrated in the following screenshot: Now that a time dimension has been added to the chart, we can add a time calculation to it as follows: Click on the More icon in the Quantity sold measure. Select Add Time Calculation from the menu. Choose Year Over Year. New bars (A) and a corresponding legend (B) will be added to the chart, which help you compare yearly results, as shown in the following screenshot: To summarize, we provided hands-on exposure on predictive forecasting in SAP Analytics Cloud, where you learned about how to use a trend time series chart to view net revenue trends throughout the range of a year. If you enjoyed this excerpt, check out the book Learning SAP Analytics Cloud, to get an understanding of SAP Analytics Cloud platform and how to create better BI solutions.

0
0
4750

article-image-manipulating-text-data-using-python-regular-expressions-regex

Sugandha Lahoti

16 Feb 2018

8 min read

Manipulating text data using Python Regular Expressions (regex)

Sugandha Lahoti

16 Feb 2018

8 min read

0
0
5768

article-image-install-elasticsearch-ubuntu-windows

Fatema Patrawala

16 Feb 2018

3 min read

How to install Elasticsearch in Ubuntu and Windows

Fatema Patrawala

16 Feb 2018

3 min read

[box type="note" align="" class="" width=""]This article is an extract from the book, Mastering Elastic Stack co-authored by Ravi Kumar Gupta and Yuvraj Gupta.This book will brush you up with basic knowledge on implementing the Elastic Stack and then dives deep into complex and advanced implementations. [/box] In today’s tutorial we aim to learn Elasticsearch v5.1.1 installation for Ubuntu and Windows. Installation of Elasticsearch on Ubuntu 14.04 In order to install Elasticsearch on Ubuntu, refer to the following steps: Download Elasticsearch 5.1.1 as a debian package using terminal: wget https://artifacts.elastic.co /downloads/elasticsearch/elasticsearch-5.1.1.deb 2. Install the debian package using following command: sudo dpkg -i elasticsearch-5.1.1.deb Elasticsearch will be installed in /usr/share/elasticsearch directory. The configuration files will be present at /etc/elasticsearch. The init script will be present at /etc/init.d/elasticsearch. The log files will be present within /var/log/elasticsearch directory. 3. Configure Elasticsearch to run automatically on bootup . If you are using SysV init distribution, then run the following command: sudo update-rc.d elasticsearch defaults 95 10 The preceding command will print on screen: Adding system startup for, /etc/init.d/elasticsearch Check status of Elasticsearch using following command: sudo service elasticsearch status Run Elasticsearch as a service using following command: sudo service elasticsearch start Elasticsearch may not start if you have any plugin installed which is not supported in ES-5.0.x version onwards. As plugins have been deprecated, it is required to uninstall any plugin if exists in prior version of ES. Remove a plugin after going to ES Home using following command: bin/elasticsearch-plugin remove head Usage of Elasticsearch command: sudo service elasticsearch {start|stop|restart|force- reload|status} If you are using systemd distribution, then run following command: sudo /bin/systemctl daemon-reload sudo /bin/systemctl enable elasticsearch.service To verify elasticsearch installation open open http://localhost:9200 in browser or run the following command from command line: curl -X GET http://localhost:9200 Installation of Elasticsearch on Windows In order to install Elasticsearch on Windows, refer to the following steps: Download Elasticsearch 5.1.1 version from its site using the following link: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch -5.1.1.zip Upon opening the link, click on it and it will download the ZIP package. 2. Extract the downloaded ZIP package by unzipping it using WinRAR, 7-Zip, and other such extracting softwares (if you don't have one of these then download it). This will extract the files and folders in the directory. 3. Then click on the extracted folder and navigate the folder to reach inside the bin folder. 4. Click on the elasticsearch.bat file to run Elasticsearch. If this window is closed Elasticsearch will stop running, as the node will shut down. 5. To verify Elasticsearch installation, open http://localhost:9200 in the browser: Installation of Elasticsearch as a service After installing Elasticsearch as previously mentioned, open Command Prompt after navigating to the bin folder and use the following command: elasticsearch-service.bat install Usage: elasticsearch-service.bat install | remove | start | stop | manager To summarize, we learnt installation of Elasticsearch on Ubuntu and Windows. If you are keen to know more about how to work with the Elastic Stack in a production environment, you can grab our comprehensive guide Mastering Elastic Stack.

0
0
11845

Stack Wars: The epic struggle for control of the tech stack

Getting to know Generative Models and their types

Introduction with Device Management

How to share insights using Alteryx Server

What makes Hadoop so revolutionary?

Installing and Configuring X-pack on Elasticsearch and Kibana

Introduction to Performance Testing and JMeter

The great unbundling of the tech stack and developer learning

CRUD (Create Read, Update and Delete) Operations with Elasticsearch

How to Classify Digits using Keras and TensorFlow

Trending Topics

How to use LabVIEW for data acquisition

How to implement In-Memory OLTP on SQL Server in Linux

How to perform predictive forecasting in SAP Analytics Cloud

Manipulating text data using Python Regular Expressions (regex)

How to install Elasticsearch in Ubuntu and Windows