How-To Tutorials

Writing PostGIS functions in Python language [Tutorial]

Pravin Dhandre
01 Aug 2018
5 min read
In this tutorial, you will learn to write a Python function for PostGIS and PostgreSQL using the PL/Python language together with the urllib2 and simplejson libraries. You will use Python to query the http://openweathermap.org/ web services to get the weather for a PostGIS geometry from within a PostgreSQL function. This tutorial is an excerpt from a book written by Mayra Zurbaran, Pedro Wightman, Paolo Corti, Stephen Mather, Thomas Kraft and Bborie Park titled PostGIS Cookbook - Second Edition.

Adding Python support to the database

Verify that your PostgreSQL server installation has PL/Python support. On Windows, this should already be included, but it is not the default if you are using, for example, Ubuntu 16.04 LTS, so you will most likely need to install it:

```
$ sudo apt-get install postgresql-plpython-9.1
```

Install PL/Python on the database (you could consider installing it in your template1 database; that way, every newly created database will have PL/Python support by default):

```
$ psql -U me postgis_cookbook
postgis_cookbook=# CREATE EXTENSION plpythonu;
```

Alternatively, you can add PL/Python support to your database using the createlang shell command (this is the only way if you are using PostgreSQL version 9.1 or lower):

```
$ createlang plpythonu postgis_cookbook
```

How to do it...

Carry out the following steps:

In this tutorial, as with the previous one, you will use an http://openweathermap.org/ web service to get the temperature for a point from the closest weather station. The request you need to run (test it in a browser) is http://api.openweathermap.org/data/2.5/find?lat=55&lon=37&cnt=10&appid=YOURKEY. You should get the following JSON output (the closest weather station's data, from which you will read the temperature for the point with the given longitude and latitude):

```
{
  message: "",
  cod: "200",
  calctime: "",
  cnt: 1,
  list: [
    {
      id: 9191,
      dt: 1369343192,
      name: "100704-1",
      type: 2,
      coord: { lat: 13.7408, lon: 100.5478 },
      distance: 6.244,
      main: { temp: 300.37 },
      wind: { speed: 0, deg: 141 },
      rang: 30,
      rain: { 1h: 0, 24h: 3.302, today: 0 }
    }
  ]
}
```

Create the following PostgreSQL function in Python, using the PL/Python language:

```
CREATE OR REPLACE FUNCTION chp08.GetWeather(lon float, lat float)
RETURNS float AS $$
import urllib2
import simplejson as json
data = urllib2.urlopen(
    'http://api.openweathermap.org/data/2.1/find/station?lat=%s&lon=%s&cnt=1'
    % (lat, lon))
js_data = json.load(data)
temperature = None  # stays None unless the service returns a usable station
if js_data['cod'] == '200':  # only if cod is 200 did we get effective results
    if int(js_data['cnt']) > 0:  # check that we have at least one weather station
        station = js_data['list'][0]
        print 'Data from weather station %s' % station['name']
        if 'main' in station:
            if 'temp' in station['main']:
                # the service reports Kelvin; we want the temperature in Celsius
                temperature = station['main']['temp'] - 273.15
return temperature
$$ LANGUAGE plpythonu;
```

Now, test your function; for example, get the temperature from the weather station closest to Wat Pho Temple in Bangkok:

```
postgis_cookbook=# SELECT chp08.GetWeather(100.49, 13.74);
 getweather
------------
      27.22
(1 row)
```

If you want to get the temperature for the point features in a PostGIS table, you can use the coordinates of each feature's geometry:

```
postgis_cookbook=# SELECT name, temperature,
    chp08.GetWeather(ST_X(the_geom), ST_Y(the_geom)) AS temperature2
    FROM chp08.cities LIMIT 5;
    name     | temperature | temperature2
-------------+-------------+--------------
 Minneapolis |      275.15 |           15
 Saint Paul  |      274.15 |           16
 Buffalo     |      274.15 |        19.44
 New York    |      280.93 |        19.44
 Jersey City |      282.15 |        21.67
(5 rows)
```
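Before extending the function to accept geometries, you may want to sanity-check the web service from outside the database. Below is a minimal standalone sketch, not from the book, assuming Python 2 with the same urllib2 and simplejson libraries, the data/2.5 endpoint shown earlier, and a placeholder YOURKEY API key; it fetches the closest station and converts the Kelvin reading to Celsius exactly as the function does:

```python
# Standalone sketch (Python 2), assuming urllib2/simplejson as in the
# function above; YOURKEY is a placeholder for a real OpenWeatherMap key.
import urllib2
import simplejson as json

def get_weather(lon, lat, appid='YOURKEY'):
    url = ('http://api.openweathermap.org/data/2.5/find'
           '?lat=%s&lon=%s&cnt=1&appid=%s' % (lat, lon, appid))
    js_data = json.load(urllib2.urlopen(url))
    if js_data.get('cod') == '200' and int(js_data.get('cnt', 0)) > 0:
        station = js_data['list'][0]
        if 'main' in station and 'temp' in station['main']:
            # The API reports Kelvin; convert to Celsius like the function.
            return station['main']['temp'] - 273.15
    return None

if __name__ == '__main__':
    print get_weather(100.49, 13.74)
```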
Now it would be nice if our function could accept not only the coordinates of a point but also a true PostGIS geometry as an input parameter. For the temperature of a feature, you could return the temperature of the weather station closest to the centroid of the feature geometry. You can easily get this behavior using function overloading. Add a new function with the same name that accepts a PostGIS geometry directly as an input parameter. In the body of the function, call the previous function, passing the coordinates of the centroid of the geometry. Note that in this case you can write the function without using Python, in the PL/pgSQL language:

```
CREATE OR REPLACE FUNCTION chp08.GetWeather(geom geometry)
RETURNS float AS $$
BEGIN
    RETURN chp08.GetWeather(ST_X(ST_Centroid(geom)),
                            ST_Y(ST_Centroid(geom)));
END;
$$ LANGUAGE plpgsql;
```

Now, test the function, passing a PostGIS geometry to it:

```
postgis_cookbook=# SELECT chp08.GetWeather(
    ST_GeomFromText('POINT(-71.064544 42.28787)'));
 getweather
------------
      23.89
(1 row)
```

If you use the function on a PostGIS layer, you can pass the features' geometries to the function directly, using the overloaded function written in PL/pgSQL:

```
postgis_cookbook=# SELECT name, temperature,
    chp08.GetWeather(the_geom) AS temperature2
    FROM chp08.cities LIMIT 5;
    name     | temperature | temperature2
-------------+-------------+--------------
 Minneapolis |      275.15 |        17.22
 Saint Paul  |      274.15 |           16
 Buffalo     |      274.15 |        18.89
 New York    |      280.93 |        19.44
 Jersey City |      282.15 |        21.67
(5 rows)
```

In this tutorial, you wrote a Python function in PostGIS using the PL/Python language. Using Python inside PostgreSQL and PostGIS functions gives you the great advantage of being able to use any Python library you wish. You will therefore be able to write much more powerful functions than those written in the standard PL/pgSQL language. In this case, you used the urllib2 and simplejson Python libraries to query a web service from within a PostgreSQL function, which would be an impossible operation in plain PL/pgSQL. You have also seen how to overload functions in order to give the function's user a different way to access the function, using different input parameters.

To get armed with all the tools and instructions you need for managing entire spatial database systems, read PostGIS Cookbook - Second Edition.

Top 7 libraries for geospatial analysis
Learning R for Geospatial Analysis

Time for Facebook, Twitter and other social media to take responsibility or face regulation

Sugandha Lahoti
01 Aug 2018
9 min read
Of late, the world has been shaken by the rising number of data-related scandals and attacks that have overshadowed social media platforms. This shakedown was experienced on Wall Street last week, when tech stocks came crashing down after Facebook's Q2 earnings call on 25th July, and then further down after Twitter's earnings call on 27th July. Social media regulation is now at the heart of discussions across the tech sector.

The social butterfly effect is real

2018 began with the Cambridge Analytica scandal, in which the data analytics company was alleged not only to have influenced the outcome of elections in the UK and US, but also to have harvested copious amounts of data from Facebook (illegally). Then Facebook fell down the rabbit hole with Mueller's indictment report, which highlighted the role social media played in election interference in 2016. 'Fake news' on WhatsApp triggered mob violence in India, while Twitter has been plagued with fake accounts and tweets that never seem to go away.

Fake news and friends crash the tech stock party

Last week, social media stocks fell by double digits (Facebook by 20% and Twitter by 21%), bringing down the entire tech sector; a fall that continues to keep tech stocks in a bearish market and haunt tech shareholders even today. Wall Street has been a nervous wreck this week, hoping for the bad news to stop spiraling downwards and for good news from Apple to undo last week's nightmare. Amidst these reports, lawmakers, regulators, and organizations alike are under growing pressure to regulate social media platforms.

How are lawmakers proposing to regulate social media?

Even though lawmakers have started paying increased attention to social networks over the past year, there has been little progress in terms of how much they actually understand them. This could soon change, as Axios' David McCabe has published a policy paper from the office of Senator Mark Warner. The paper describes a comprehensive regulatory policy covering almost every aspect of social networks, and the proposal is designed to address three broad categories: combating misinformation, privacy and data protection, and promoting competition in the tech space.

Misinformation, disinformation, and the exploitation of technology covers ideas such as:

- Networks are to label automated bots.
- Platforms are to verify identities.
- Platforms are to make regular disclosures about how many fake accounts they've deleted.
- Platforms are to create APIs for academic research.

Privacy and data protection includes policies such as:

- Create a US version of the GDPR.
- Designate platforms as information fiduciaries with the legal responsibility of protecting users' data.
- Empower the Federal Trade Commission to make rules around data privacy.
- Create a legislative ban on dark patterns that trick users into accepting terms and conditions without reading them.
- Allow the government to audit corporate algorithms.

Promoting competition in the tech space requires:

- Tech companies to continuously disclose to consumers how their data is being used.
- Social network data to be made portable.
- Social networks to be interoperable.
- Certain products to be designated as essential facilities, with third parties getting fair access to them.
Although these proposals and others like them (a British parliamentary committee recommended imposing much stricter guidelines on social networks) remain far from becoming law, they are an assurance that lawmakers are serious about taking steps to ensure that social media platforms don't get out of hand. Measures taken by lawmakers and legal authorities are only effective, though, if the platforms themselves care about the issues and are motivated to behave in the right way. Losing a significant chunk of their user base in the EU lately seems to have provided that very incentive. Social network platforms have now started seeking ways to protect user data and to improve their platforms in general, to alleviate some of the problems they helped create or amplify.

How is Facebook planning to course correct its social media Frankenstein?

Last week, Mark Zuckerberg started the fated earnings call by saying, "I want to start by talking about all the investments we've made over the last six months to improve safety, security, and privacy across our services. This has been a lot of hard work, and it's starting to pay off." He then went on to elaborate on key areas of focus for Facebook in the coming months, the next 1.5 years to be more specific:

- Ad transparency tools: All ads can be viewed by anyone, even if the ads are not targeted at them. Facebook is also developing an archive of ads with political or issue content, which will be labeled to show who paid for them, what the budget was, and how many people viewed the ads, and which will also allow anyone to search ads by advertiser for the past 7 years.
- Disallow and report known election interference attempts: Facebook will proactively look for and eliminate fake accounts, pages, and groups that violate its policies. This could minimize election interference, says Zuckerberg.
- Fight against misinformation: Remove the financial incentives for spammers to create fake news, and stop pages that repeatedly spread false information from buying ads.
- Shift from reactive to proactive detection with AI: Use AI to prevent fake accounts that generate a lot of the problematic content from ever being created in the first place. Facebook can now remove more bad content quickly because it doesn't have to wait until the content is reported. In Q1, for example, almost 90% of the graphic violence content that Facebook removed or added a warning label to was identified using AI.
- Invest heavily in security and privacy: No further elaboration on this aspect was given on the call.

This week, Facebook reported that it had detected and removed 32 pages and fake accounts that had engaged in coordinated inauthentic behavior. These accounts and pages were part of a political influence campaign that was potentially built to disrupt the midterm elections. According to Facebook's Head of Cybersecurity, Nathaniel Gleicher, "So far, the activity encompasses eight Facebook Pages, 17 profiles and seven accounts on Instagram." Facebook's action is a change from last year, when it was widely criticized for failing to detect Russian interference in the 2016 presidential election. Although the current campaign hasn't been linked to Russia (yet), Facebook officials pointed out that some of the tools and techniques used by the accounts were similar to those used by the Russian government-linked Internet Research Agency.
How Twitter plans to make its platform a better place for real and civilized conversation

"We want people to feel safe freely expressing themselves and have launched new tools to address problem behaviors that distort and distract from the public conversation. We're also continuing to make it easier for people to find and follow breaking news and events…" said Jack Dorsey, Twitter's CEO, on the Q2 2018 earnings call. The letter to Twitter shareholders further elaborates on this point:

"We continue to invest in improving the health of the public conversation on Twitter, making the service better by integrating new behavioral signals to remove spammy and suspicious accounts and continuing to prioritize the long-term health of the platform over near-term metrics. We also acquired Smyte, a company that specializes in spam prevention, safety, and security."

Unlike Facebook's anecdotal support for the claims made, Twitter provided quantitative evidence to show the seriousness of its endeavor. Here are some key metrics from this quarter's shareholders' letter:

- Early experiments on using new tools to address behaviors that distort and distract from the public conversation show a 4% drop in abuse reports from search and 8% fewer abuse reports from conversations.
- More than 9 million potentially spammy or automated accounts are identified and challenged per week.
- There are 8k fewer average spam reports per day.
- Twitter is removing more than 2x the number of accounts for violating its spam policies than it did last year.

It is clear that Twitter has been quite active when it comes to looking for ways to eliminate toxicity from its network. In a series of tweets, CEO Jack Dorsey stated that the company did not always meet users' expectations: "We aren't proud of how people have taken advantage of our service, or our inability to address it fast enough," with the company needing a "systemic framework."

Back in March 2018, Twitter invited external experts to measure the health of the company in order to encourage more healthy conversation, debate, and critical thinking. Twitter asked them to create proposals taking inspiration from the concept of measuring conversation health defined by the non-profit Cortico. As of yesterday, Twitter has finalized its dream team of researchers, ready to take up the challenge of identifying echo chambers and unhealthy behavior on Twitter and then translating their findings into practical algorithms down the line.

With social media here to stay, both lawmakers and social media platforms are looking for new ways to regulate. Any misstep by these social media sites will have solid repercussions, including not only closer scrutiny by the government and private watchdogs but also a loss of stock value, a damaged reputation, and links to other forms of data misuse and accusations of political bias. Lastly, let's not forget the responsibility that lies with the 'social' side of these platforms. Individuals need to play their part by being proactive in reporting fake news and stories, and they also need to be more selective about the content they share on social media.

Why Wall Street unfriended Facebook: Stocks fell $120 billion in market value after Q2 2018 earnings call
Facebook must stop discriminatory advertising in the US, declares Washington AG, Ferguson
Facebook is investigating data analytics firm Crimson Hexagon over misuse of data

Using Transactions with Asynchronous Tasks in JavaEE [Tutorial]

Aaron Lazar
31 Jul 2018
5 min read
Threading is a common issue in most software projects, no matter which language or other technology is involved. When talking about enterprise applications, things become even more important and sometimes harder. Using asynchronous tasks can be a challenge: what if you need to add some spice and wrap a transaction around one? Thankfully, the Java EE environment has some great features for dealing with this challenge, and this article will show you how. This article is an extract from the book Java EE 8 Cookbook, authored by Elder Moraes.

Usually, a transaction means something like code blocking. Isn't it awkward to combine two opposing concepts? Well, it's not! They can work together nicely, as shown here.

Adding the Java EE 8 dependency

Let's first add our Java EE 8 dependency:

```
<dependency>
    <groupId>javax</groupId>
    <artifactId>javaee-api</artifactId>
    <version>8.0</version>
    <scope>provided</scope>
</dependency>
```

Now let's create a User POJO:

```
public class User {
    private Long id;
    private String name;

    public User(Long id, String name) {
        this.id = id;
        this.name = name;
    }

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "User{" + "id=" + id + ", name=" + name + '}';
    }
}
```

And here is a slow bean that will return User:

```
@Stateless
public class UserBean {

    public User getUser() {
        try {
            TimeUnit.SECONDS.sleep(5);
            long id = new Date().getTime();
            return new User(id, "User " + id);
        } catch (InterruptedException ex) {
            System.err.println(ex.getMessage());
            long id = new Date().getTime();
            return new User(id, "Error " + id);
        }
    }
}
```

Now we create a task to be executed that will return User using some transaction stuff:

```
public class AsyncTask implements Callable<User> {

    private UserTransaction userTransaction;
    private UserBean userBean;

    @Override
    public User call() throws Exception {
        performLookups();
        try {
            userTransaction.begin();
            User user = userBean.getUser();
            userTransaction.commit();
            return user;
        } catch (IllegalStateException | SecurityException |
                 HeuristicMixedException | HeuristicRollbackException |
                 NotSupportedException | RollbackException |
                 SystemException e) {
            userTransaction.rollback();
            return null;
        }
    }

    private void performLookups() throws NamingException {
        userBean = CDI.current().select(UserBean.class).get();
        userTransaction = CDI.current().select(UserTransaction.class).get();
    }
}
```

And finally, here is the service endpoint that will use the task to write the result to a response:

```
@Path("asyncService")
@RequestScoped
public class AsyncService {

    private AsyncTask asyncTask;

    @Resource(name = "LocalManagedExecutorService")
    private ManagedExecutorService executor;

    @PostConstruct
    public void init() {
        asyncTask = new AsyncTask();
    }

    @GET
    public void asyncService(@Suspended AsyncResponse response) {
        Future<User> result = executor.submit(asyncTask);

        while (!result.isDone()) {
            try {
                TimeUnit.SECONDS.sleep(1);
            } catch (InterruptedException ex) {
                System.err.println(ex.getMessage());
            }
        }

        try {
            response.resume(Response.ok(result.get()).build());
        } catch (InterruptedException | ExecutionException ex) {
            System.err.println(ex.getMessage());
            response.resume(Response.status(Response.Status.INTERNAL_SERVER_ERROR)
                    .entity(ex.getMessage()).build());
        }
    }
}
```

To try this code, just deploy it to GlassFish 5 and open this URL: http://localhost:8080/ch09-async-transaction/asyncService

How the asynchronous execution works

The magic happens in the AsyncTask class, where we
will first take a look at the performLookups method:

```
private void performLookups() throws NamingException {
    Context ctx = new InitialContext();
    userTransaction = (UserTransaction)
        ctx.lookup("java:comp/UserTransaction");
    userBean = (UserBean)
        ctx.lookup("java:global/ch09-async-transaction/UserBean");
}
```

It will give you the instances of both UserTransaction and UserBean from the application server. Then you can relax and rely on the things already instantiated for you. As our task implements Callable<User>, it needs to implement the call() method:

```
@Override
public User call() throws Exception {
    performLookups();
    try {
        userTransaction.begin();
        User user = userBean.getUser();
        userTransaction.commit();
        return user;
    } catch (IllegalStateException | SecurityException |
             HeuristicMixedException | HeuristicRollbackException |
             NotSupportedException | RollbackException |
             SystemException e) {
        userTransaction.rollback();
        return null;
    }
}
```

You can see Callable as a Runnable interface that returns a result. Our transaction code lives here:

```
userTransaction.begin();
User user = userBean.getUser();
userTransaction.commit();
```

And if anything goes wrong, we have the following:

```
} catch (IllegalStateException | SecurityException |
         HeuristicMixedException | HeuristicRollbackException |
         NotSupportedException | RollbackException |
         SystemException e) {
    userTransaction.rollback();
    return null;
}
```

Now we will look at AsyncService. First, we have some declarations:

```
private AsyncTask asyncTask;

@Resource(name = "LocalManagedExecutorService")
private ManagedExecutorService executor;

@PostConstruct
public void init() {
    asyncTask = new AsyncTask();
}
```

We are asking the container to give us an instance of ManagedExecutorService, which is responsible for executing the task in the enterprise context. Then the init() method is called once the bean is constructed (@PostConstruct), and it instantiates the task. Now we have our task execution:

```
@GET
public void asyncService(@Suspended AsyncResponse response) {
    Future<User> result = executor.submit(asyncTask);

    while (!result.isDone()) {
        try {
            TimeUnit.SECONDS.sleep(1);
        } catch (InterruptedException ex) {
            System.err.println(ex.getMessage());
        }
    }

    try {
        response.resume(Response.ok(result.get()).build());
    } catch (InterruptedException | ExecutionException ex) {
        System.err.println(ex.getMessage());
        response.resume(Response.status(Response.Status.INTERNAL_SERVER_ERROR)
                .entity(ex.getMessage()).build());
    }
}
```

Note that the executor returns Future<User>:

```
Future<User> result = executor.submit(asyncTask);
```

This means the task will be executed asynchronously. We then check its execution status until it's done:

```
while (!result.isDone()) {
    try {
        TimeUnit.SECONDS.sleep(1);
    } catch (InterruptedException ex) {
        System.err.println(ex.getMessage());
    }
}
```

And once it's done, we write it to the asynchronous response:

```
response.resume(Response.ok(result.get()).build());
```

The full source code of this recipe is on GitHub. So now, using transactions with asynchronous tasks in Java EE isn't such a daunting task, is it? If you found this tutorial helpful and would like to learn more, head on to the book Java EE 8 Cookbook.

Oracle announces a new pricing structure for Java
Design a RESTful web API with Java [Tutorial]
How to convert Java code into Kotlin

Ansible 2 for automating networking tasks on Google Cloud Platform [Tutorial]

Vijin Boricha
31 Jul 2018
8 min read
Google Cloud Platform is one of the largest and most innovative cloud providers out there. It is used by industry leaders such as Coca-Cola, Spotify, and Philips. Amazon Web Services and Google Cloud are locked in a constant price war, which benefits consumers greatly. Google Cloud Platform covers 12 geographical regions across four continents, with new regions coming up every year. In this tutorial, we will learn about Google Compute Engine and network services, and how Ansible 2 can be leveraged to automate common networking tasks. This is an excerpt from Ansible 2 Cloud Automation Cookbook, written by Aditya Patawari and Vikas Aggarwal.

Managing network and firewall rules

By default, inbound connections are not allowed to any of the instances. One way to allow traffic is to allow incoming connections to a certain port of instances carrying a particular tag. For example, we can tag all the webservers as http and allow incoming connections to ports 80 and 8080 for all the instances carrying the http tag.

How to do it…

We will create a firewall rule with a source tag using the gce_net module:

```
- name: Create Firewall Rule with Source Tags
  gce_net:
    name: my-network
    fwname: "allow-http"
    allowed: tcp:80,8080
    state: "present"
    target_tags: "http"
    subnet_region: us-west1
    service_account_email: "{{ service_account_email }}"
    project_id: "{{ project_id }}"
    credentials_file: "{{ credentials_file }}"
  tags:
    - recipe6
```

Using tags for firewalls is not possible all the time. A lot of organizations whitelist internal IP ranges or allow office IPs to reach the instances over the network. A simple way to allow a range of IP addresses is to use a source range:

```
- name: Create Firewall Rule with Source Range
  gce_net:
    name: my-network
    fwname: "allow-internal"
    state: "present"
    src_range: ['10.0.0.0/16']
    subnet_name: public-subnet
    allowed: 'tcp'
    service_account_email: "{{ service_account_email }}"
    project_id: "{{ project_id }}"
    credentials_file: "{{ credentials_file }}"
  tags:
    - recipe6
```

How it works...

In step 1, we created a firewall rule called allow-http to allow incoming requests to TCP ports 80 and 8080. Since our instance app is tagged with http, it can accept incoming traffic on ports 80 and 8080. In step 2, we allowed all the instances with IPs in 10.0.0.0/16, which is a private IP address range. Along with the connection parameters and the source IP address CIDR, we defined the network name and subnet name. We allowed all TCP connections; if we want to restrict the rule to a port or a range of ports, we can use tcp:80 or tcp:4000-5000 respectively.
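Once the allow-http rule is in place, a quick way to confirm it took effect is to probe the ports from a machine outside the network. Here is a minimal sketch in plain Python (not part of the recipe); INSTANCE_IP is a placeholder for your instance's actual public IP:

```python
# Minimal reachability check, assuming an instance tagged "http" with a
# public IP. INSTANCE_IP is a placeholder (a TEST-NET-3 address), not real.
import socket

INSTANCE_IP = '203.0.113.10'

for port in (80, 8080):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((INSTANCE_IP, port))
        print('port %d open' % port)
    except (socket.timeout, socket.error) as exc:
        print('port %d unreachable: %s' % (port, exc))
    finally:
        s.close()
```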
Managing load balancers

An important reason to use a cloud is to achieve scalability at a relatively low cost. Load balancers play a key role in scalability: we can attach multiple instances behind a load balancer to distribute traffic between the instances. Google Cloud load balancers also support health checks, which help ensure that traffic is sent to healthy instances only.

How to do it…

Let us create a load balancer and attach an instance to it:

```
- name: create load balancer and attach to instance
  gce_lb:
    name: loadbalancer1
    region: us-west1
    members: ["{{ zone }}/app"]
    httphealthcheck_name: hc
    httphealthcheck_port: 80
    httphealthcheck_path: "/"
    service_account_email: "{{ service_account_email }}"
    project_id: "{{ project_id }}"
    credentials_file: "{{ credentials_file }}"
  tags:
    - recipe7
```

For creating a load balancer, we need to supply a comma-separated list of instances. We also need to provide health check parameters, including a name, a port, and the path on which a GET request can be sent.

Managing GCE images in Ansible 2

Images are a collection of a boot loader, operating system, and root filesystem. There are public images provided by Google and various open source communities, and we can use these images to create an instance. GCE also provides us with the capability to create our own image, which we can use to boot instances.

It is important to understand the difference between an image and a snapshot. A snapshot is incremental, but it is just a disk snapshot; due to its incremental nature, it is better for creating backups. Images consist of more information, such as a boot loader, and are non-incremental in nature. However, it is possible to import images from a different cloud provider or datacenter into GCE. Another reason we recommend snapshots for backup is that taking a snapshot does not require us to shut down the instance, whereas building an image does. Why build images at all? We will discover that in subsequent sections.

How to do it…

Let us create an image for now:

```
- name: stop the instance
  gce:
    instance_names: app
    zone: "{{ zone }}"
    machine_type: f1-micro
    image: centos-7
    state: stopped
    service_account_email: "{{ service_account_email }}"
    credentials_file: "{{ credentials_file }}"
    project_id: "{{ project_id }}"
    disk_size: 15
    metadata: "{{ instance_metadata }}"
  tags:
    - recipe8

- name: create image
  gce_img:
    name: app-image
    source: app
    zone: "{{ zone }}"
    state: present
    service_account_email: "{{ service_account_email }}"
    pem_file: "{{ credentials_file }}"
    project_id: "{{ project_id }}"
  tags:
    - recipe8

- name: start the instance
  gce:
    instance_names: app
    zone: "{{ zone }}"
    machine_type: f1-micro
    image: centos-7
    state: started
    service_account_email: "{{ service_account_email }}"
    credentials_file: "{{ credentials_file }}"
    project_id: "{{ project_id }}"
    disk_size: 15
    metadata: "{{ instance_metadata }}"
  tags:
    - recipe8
```

How it works...

In these tasks, we stop the instance first and then create the image. We just need to supply the instance name while creating the image, along with the standard connection parameters. Finally, we start the instance back up. The parameters of these tasks are self-explanatory.

Creating instance templates

Instance templates define various characteristics of an instance and related attributes. Some of these attributes are:

- Machine type (f1-micro, n1-standard-1, custom)
- Image (we created one in the previous tip, app-image)
- Zone (us-west1-a)
- Tags (we have a firewall rule for the tag http)

How to do it…

Once a template is created, we can use it to create a managed instance group, which can auto-scale based on various parameters. Instance templates are typically available globally, as long as we do not specify a restrictive parameter like a specific subnet or disk:

```
- name: create instance template named app-template
  gce_instance_template:
    name: app-template
    size: f1-micro
    tags: http,http-server
    image: app-image
    state: present
    subnetwork: public-subnet
    subnetwork_region: us-west1
    service_account_email: "{{ service_account_email }}"
    credentials_file: "{{ credentials_file }}"
    project_id: "{{ project_id }}"
  tags:
    - recipe9
```

We have specified the machine type, image, subnets, and tags. This template can be used to create instance groups.

Creating managed instance groups

Traditionally, we have managed virtual machines individually.
Instance groups let us manage a group of identical virtual machines as a single entity. These virtual machines are created from an instance template, like the one we created in the previous tip. Now, if we have to make a change in instance configuration, that change is applied to all the instances in the group.

How to do it…

Perhaps the most important feature of an instance group is auto-scaling. In the event of high resource requirements, the instance group can scale up to a predefined number of instances automatically:

```
- name: create an instance group with autoscaling
  gce_mig:
    name: app-mig
    zone: "{{ zone }}"
    service_account_email: "{{ service_account_email }}"
    credentials_file: "{{ credentials_file }}"
    project_id: "{{ project_id }}"
    state: present
    size: 2
    named_ports:
      - name: http
        port: 80
    template: app-template
    autoscaling:
      enabled: yes
      name: app-autoscaler
      policy:
        min_instances: 2
        max_instances: 5
        cool_down_period: 90
        cpu_utilization:
          target: 0.6
        load_balancing_utilization:
          target: 0.8
  tags:
    - recipe10
```

How it works...

The preceding task creates an instance group with an initial size of two instances, defined by size. We have named port 80 as http; this can be used by other GCE components to route traffic. We have used the template that we created in the previous recipe and enabled autoscaling with a policy that allows scaling up to five instances; at any given point, at least two instances will be running. We are scaling on two parameters: cpu_utilization, where a target of 0.6 triggers scaling once utilization exceeds 60%, and load_balancing_utilization, where scaling triggers once 80% of the requests-per-minute capacity is reached.

Typically, when an instance is booted, it might take some time for initialization and startup, and data collected during that period might not make much sense. The cool_down_period parameter indicates that we should start collecting data from the instance only after 90 seconds and should not trigger scaling based on data gathered before that.

We learnt a few networking tricks to manage public cloud infrastructure effectively. You can learn more about building public cloud infrastructure in the book Ansible 2 Cloud Automation Cookbook.

Why choose Ansible for your automation and configuration management needs?
Getting Started with Ansible 2
Top 7 DevOps tools in 2018

What leaders at successful agile Enterprises share in common

Packt Editorial Staff
30 Jul 2018
11 min read
Adopting agile ways of working is easier said than done. Firms like Barclays, C.H. Robinson, Ericsson, Microsoft, and Spotify are considered agile enterprises and are operating entrepreneurially on a large scale. Do you think the leadership of these firms has something in common? Let us take a look in this article.

The leadership of a firm has a very high bearing on the extent of Enterprise Agility which the company can achieve. Leaders are in a position to influence just about every aspect of a business, including vision, mission, strategy, structure, governance, and processes, and, more importantly, the culture of the enterprise and the mindset of the employees. This article is an extract from the book Enterprise Agility, written by Sunil Mundra. In this article we'll explore the personal traits of leaders that are critical for Enterprise Agility. Personal traits are by definition intrinsic in nature: they enable the personal development of an individual and are also enablers for certain behaviors. We explore the various personal traits in detail.

#1 Willingness to expand mental models

Essentially, a mental model is an individual's perception of reality and of how something works in that reality. A mental model represents one way of approaching a situation and is a form of deeply-held belief. The critical point is that a mental model represents an individual's view, which may not necessarily be true. Leaders must consciously let go of mental models that are no longer relevant today. This is especially important for those leaders who have spent a significant part of their career leading enterprises based on mechanistic modelling, as these models will create impediments for Agility in "living" businesses. For example, using monetary rewards as a primary motivator may work for physical work, which is repetitive in nature. However, it does not work as a primary motivator for knowledge workers, for whom intrinsic motivators, namely autonomy, mastery, and purpose, are generally more important than money. Examining the values and assumptions underlying a mental model can help in ascertaining the relevance of that model.

#2 Self-awareness

Self-awareness helps leaders become cognizant of their strengths and weaknesses. It enables them to consciously focus on utilizing their strengths and on leveraging the strengths of their peers and teams in areas where they are not strong. Leaders should validate their view of their strengths and weaknesses by regularly seeking feedback from the people they work with. According to a survey of senior executives by Cornell's School of Industrial and Labor Relations:

"Leadership searches give short shrift to 'self-awareness,' which should actually be a top criterion. Interestingly, a high self-awareness score was the strongest predictor of overall success. This is not altogether surprising as executives who are aware of their weaknesses are often better able to hire subordinates who perform well in categories in which the leader lacks acumen. These leaders are also more able to entertain the idea that someone on their team may have an idea that is even better than their own."

Self-awareness, a mostly underrated trait, is a huge enabler for enhancing other personal traits.

#3 Creativity

Since emergence is a primary property of complexity, leaders will often be challenged to deal with unprecedented circumstances emerging from within the enterprise and from the external environment.
This implies that what may have worked in the past is less likely to work in new circumstances, and new approaches will be needed to deal with them. Hence, the ability to think creatively, that is, "out of the box," to come up with innovative approaches and solutions is critical. The creativity of an individual has its limitations, so leaders must harness the creativity of a broader group of people in the enterprise. A leader can be a huge enabler of this by ideating jointly with a group of people and by facilitating discussions, challenging the status quo, and spurring the teams to suggest improvements. Leaders can also encourage innovation through experimentation. With the fast pace of change in the external environment, and consequently the continuous evolution of businesses, leaders will often find themselves out of their comfort zone. Leaders will therefore have to get comfortable with being uncomfortable, and it will be easier for them to think more creatively once they accept this new reality.

#4 Emotional intelligence

Emotional intelligence (EI), also known as emotional quotient (EQ), is defined by Wikipedia as "the capability of individuals to recognize their own emotions and those of others, discern between different feelings and label them appropriately, use emotional information to guide thinking and behavior, and manage and/or adjust emotions to adapt to environments or achieve one's goal/s". [iii]

EI is made up of four core skills:

- Self-awareness
- Social awareness
- Self-management
- Relationship management

The importance of EI in people-centric enterprises, especially for leaders, cannot be overstated. While people in a company may be bound by purpose and by being part of a team, people are inherently different from each other in terms of personality types and emotions. This can have a significant bearing on how people in a business deal with and react to circumstances, especially adverse ones. Having high EI enables leaders to understand people "from the inside." This helps leaders build better rapport with people, thereby enabling them to bring out the best in employees and support them as needed.

#5 Courage

An innovative approach to dealing with an unprecedented circumstance will, by definition, carry some risk. The hypothesis about the appropriateness of that approach can only be validated by putting it to the test against reality. Leaders will therefore need to be courageous as they take calculated risky bets, strike hard, and own the outcomes of those bets. According to Guo Xiao, the President and CEO of ThoughtWorks, "There are many threats—and opportunities—facing businesses in this age of digital transformation: industry disruption from nimble startups, economic pressure from massive digital platforms, evolving security threats, and emerging technologies. Today's era, in which all things are possible, demands a distinct style of leadership. It calls for bold individuals who set their company's vision and charge ahead in a time of uncertainty, ambiguity, and boundless opportunity. It demands courage."

Taking risks does not mean being reckless; rather, leaders need to take calculated risks, after giving due consideration to intuition, facts, and opinions. Despite best efforts and intentions, some decisions will inevitably go wrong. Leaders must have the courage and humility to admit that a decision went wrong and to own its outcomes, and they must not let these failures deter them from taking risks in the future.
#6 Passion for learning

Learnability is the ability to upskill, reskill, and deskill. In today's highly dynamic era, it is not what one knows, or what skills one has, that matters as much as the ability to quickly adapt to a different skill set. It is about understanding what is needed to optimize success and what skills and abilities are necessary, from a leadership perspective, to make the enterprise as a whole successful. Leaders need to shed inhibitions about being seen as "novices" while they acquire and practice new skills. The fact that leaders are willing to acquire new skills can be hugely impactful in encouraging others in the enterprise to do the same. This is especially important for bringing in and encouraging a culture of learnability across the business.

#7 Awareness of cognitive biases

Cognitive biases are flaws in thinking that can lead to suboptimal decisions. Leaders need to become aware of these biases so that they can objectively assess whether their decisions are being influenced by any of them. Cognitive biases lead to shortcuts in decision-making; essentially, they are an attempt by the brain to simplify information processing. Leaders today are challenged with an overload of information and the need to make decisions quickly, and these factors can contribute to decisions and judgements being influenced by cognitive biases. Over the decades, psychologists have discovered a huge number of biases; however, the following are the most important from a decision-making perspective.

Confirmation bias

This is the tendency to selectively seek and hold onto information that reaffirms what you already believe to be true. For example, a leader believes that a recently launched product is doing well, based on the initial positive response, and has developed a bias that the product is successful. However, although the product is succeeding in attracting new customers, it is also losing existing customers. Confirmation bias makes the leader focus only on data pertaining to new customers, ignoring the data related to the loss of existing customers.

Bandwagon effect bias

Bandwagon effect bias, also known as "herd mentality," encourages doing something because others are doing it. The bias creates a feeling of not wanting to be left behind and hence can lead to irrational or badly-thought-through decisions. Enterprises launching an Agile transformation initiative without understanding the implications of the long and difficult journey ahead is an example of this bias.

"Guru" bias

Guru bias leads to blindly relying on an expert's advice. This can be detrimental, as the expert could be wrong in their assessment, and therefore the advice could also be wrong. Also, the expert might give advice that primarily furthers his or her own interests over the interests of the enterprise.

Projection bias

Projection bias leads a person to believe that other people have understood and are aligned with their thinking, while in reality this may not be true. This bias is more prevalent in enterprises where employees are fearful of admitting that they have not understood what their "bosses" have said, of asking questions to clarify, or of expressing disagreement.

Stability bias

Stability bias, also known as "status quo" bias, leads to a belief that change will lead to unfavorable outcomes, that is, that the risk of loss is greater than the possibility of benefit. It makes a person believe that stability and predictability lead to safety.
For decades, the mandate for leaders was to strive for stability, and hence many older leaders are susceptible to this bias. Leaders must encourage others in the enterprise to challenge biases, which can uncover the "blind spots" arising from them. Once decisions are made, attention should be paid to information coming from feedback.

#8 Resilience

Resilience is the capacity to quickly recover from difficulties. Given the turbulent business environment, rapidly changing priorities, and the need to take calculated risks, leaders are likely to encounter difficult and challenging situations quite often. Under such circumstances, resilience helps the leader take the knocks on the chin and keep moving forward. Resilience is also about maintaining composure when something fails, analyzing the failure with the team in an objective manner, and learning from that failure. The actions of leaders are watched by the people in the enterprise even more closely in periods of crisis and difficulty, so leaders showing resilience go a long way towards increasing resilience across the company.

#9 Responsiveness

Responsiveness, from the perspective of leadership, is the ability to quickly grasp and respond to both challenges and opportunities. Leaders must listen to feedback coming from customers and the marketplace, learn from it, and adapt accordingly. Leaders must be ready to enable the morphing of the enterprise's offerings in order to stay relevant for customers and to exploit opportunities. This implies that leaders must be willing to adjust the "pivot" of their offerings based on feedback. Examples include the journey of Amazon Web Services, which was an internal system but has now grown into a highly successful business; Twitter, which was an offshoot of Odeo, a website focused on sound and podcasting; and PayPal's move from transferring money via PalmPilots to becoming a highly robust online payment service.

We discovered that leaders are the primary catalysts for any enterprise aspiring to enhance its Agility. Leaders need specific capabilities, over and above the standard leadership capabilities, to take the business down the path of enhanced Enterprise Agility. These capabilities comprise personal traits and behaviors that are intrinsic in nature and enable leadership Agility, which is the foundation of Enterprise Agility. To know more about how an enterprise can thrive in a dynamic business environment, check out the book Enterprise Agility.

Skill Up 2017: What we learned about tech pros and developers
96% of developers believe developing soft skills is important
Soft skills every data scientist should teach their child

How does Elasticsearch work? [Tutorial]

Savia Lobo
30 Jul 2018
12 min read
Elasticsearch is much more than just a search engine; it supports complex aggregations, geo filters, and the list goes on. Best of all, you can run all your queries at a speed you have never seen before. Elasticsearch, like any other open source technology, is evolving very rapidly, but the core fundamentals that power Elasticsearch don't change. In this article, we will briefly discuss how Elasticsearch works internally and explain the basic query APIs. All the data in Elasticsearch is internally stored in Apache Lucene as an inverted index. Although data is stored in Apache Lucene, Elasticsearch is what makes it distributed and provides the easy-to-use APIs. This Elasticsearch tutorial is an excerpt taken from the book 'Learning Elasticsearch', written by Abhishek Andhavarapu.

Inverted index in Elasticsearch

The inverted index will help you understand the limitations and strengths of Elasticsearch compared with the traditional database systems out there. The inverted index at its core is how Elasticsearch differs from other NoSQL stores, such as MongoDB, Cassandra, and so on. We can compare an inverted index to an old library catalog card system: when you need some information or a book in a library, you use the card catalog, usually at the entrance of the library, to find the book. An inverted index is similar to that card catalog.

Imagine that you were to build a system like Google to search for the web pages mentioning your search keywords. We have three web pages with Yoda quotes from Star Wars, and you are searching for all the documents with the word fear:

Document1: Fear leads to anger
Document2: Anger leads to hate
Document3: Hate leads to suffering

In a library without a card catalog, to find the book you need you would have to go to every shelf row by row, look at each book title, and see whether it's the book you need. Computer-based information retrieval systems do the same: without the inverted index, the application has to go through each web page and check whether the word exists in it. An inverted index is similar to the following table. It is like a map, with the term as the key and the list of documents the term appears in as the value.

Term | Document
Fear | 1
Anger | 1,2
Hate | 2,3
Suffering | 3
Leads | 1,2,3

Once we construct an index, as shown in this table, finding all the documents with the term fear is just a lookup. And just as when a library gets a new book the book is added to the card catalog, we keep building the inverted index as we encounter new web pages.
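To make the idea concrete, here is a toy Python sketch (an illustration only, not how Lucene actually stores its index) that builds the same term-to-documents map for the three quotes above. Note that, unlike the table, it also indexes the stop word "to", since it does no stop-word removal:

```python
# Toy inverted index: map each lowercased term to the set of documents
# containing it. Illustration only; Lucene's on-disk structures differ.
docs = {
    1: "Fear leads to anger",
    2: "Anger leads to hate",
    3: "Hate leads to suffering",
}

index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

# Finding every document mentioning a term is now a single lookup.
print(sorted(index.get('fear', set())))   # [1]
print(sorted(index.get('leads', set())))  # [1, 2, 3]
```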
The preceding inverted index takes care of simple use cases, such as searching for a single term. But in reality, we query for much more complicated things, and we don't use the exact words. Now let's say we encounter a document containing the following:

Yosemite national park may be closed for the weekend due to forecast of substantial rainfall

We want to visit Yosemite National Park, and we are looking for the weather forecast in the park. But when we query for it in human language, we might type something like weather in yosemite or rain in yosemite. With the current approach, we will not be able to answer this query, as there are no common terms between the query and the document, as shown:

Document | Query
rainfall | rain

To be able to answer queries like this, and to improve the search quality, we employ various techniques, such as stemming and synonyms, discussed in the following sections.

Stemming

Stemming is the process of reducing a derived word to its root word. For example, rain, raining, rained, and rainfall have the common root word rain. When a document is indexed, the root word is stored in the index instead of the actual word. Without stemming, we would end up storing rain, raining, and rained in the index, and search relevance would be very low. The query terms also go through the stemming process, and the root words are looked up in the index. Stemming increases the likelihood of the user finding what he is looking for. When we query for rain in yosemite, even though the document originally had rainfall, the inverted index will contain the term rain. We can configure stemming in Elasticsearch using analyzers.

Synonyms

Similar to rain and raining, weekend and sunday mean the same thing. The document might not contain Sunday, but if the information retrieval system can also search for synonyms, it will significantly improve the search quality. Human language deals with a lot of things, such as tense, gender, and number. Stemming and synonyms will not only improve the search quality but also reduce the index size by removing the differences between similar words. More examples:

Pen, Pen[s] -> Pen
Eat, Eating -> Eat

Phrase search

As users, we almost always search for phrases rather than single words. The inverted index in the previous section works great for individual terms but not for phrases. Continuing the previous example, if we want to query all the documents containing the phrase anger leads to, the previous index would not be sufficient. The inverted index for the terms anger and leads is shown below:

Term | Document
Anger | 1,2
Leads | 1,2,3

From the preceding table, the words anger and leads exist in both document1 and document2. To support phrase search, along with the document, we also need to record the position of the word in the document. The inverted index with word positions is shown here:

Term | Document
Fear | 1:1
Anger | 1:3, 2:1
Hate | 2:3, 3:1
Suffering | 3:3
Leads | 1:2, 2:2, 3:2

Now, since we have the information regarding the position of each word, we can check whether a document has the terms in the same order as the query:

Term | Document
anger | 1:3, 2:1
leads | 1:2, 2:2

Since document2 has anger as the first word and leads as the second word, the same order as the query, document2 would be a better match than document1. With the inverted index, any query on the documents is just a simple lookup. This is just an introduction to the inverted index; in real life, it's much more complicated, but the fundamentals remain the same. When documents are indexed into Elasticsearch, they are processed into the inverted index.
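The position bookkeeping can also be sketched in a few lines of Python (again purely illustrative, not Lucene's actual data structure); the stop word "to" is left out to mirror the tables above. The helper returns only the documents where the phrase's terms appear in consecutive positions:

```python
# Positional inverted index: term -> {doc_id: [positions]}, matching the
# tables above (the stop word "to" is omitted). Illustration only.
index = {
    'fear':      {1: [1]},
    'anger':     {1: [3], 2: [1]},
    'hate':      {2: [3], 3: [1]},
    'suffering': {3: [3]},
    'leads':     {1: [2], 2: [2], 3: [2]},
}

def phrase_match(phrase):
    """Return doc ids where the phrase's terms occur in consecutive positions."""
    terms = phrase.lower().split()
    postings = [index.get(term, {}) for term in terms]
    if not postings:
        return []
    # Only documents containing every term are candidates.
    candidates = set(postings[0]).intersection(*postings[1:])
    hits = []
    for doc in candidates:
        for start in postings[0][doc]:
            # The i-th term must sit exactly i positions after the first one.
            if all(start + i in postings[i][doc] for i in range(1, len(terms))):
                hits.append(doc)
                break
    return sorted(hits)

# Document2 is "Anger leads to hate": anger(1) is immediately followed by leads(2).
print(phrase_match('anger leads'))  # [2]
```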
Scalability and availability in Elasticsearch

Let's say you want to index a billion documents; having just a single machine might be very challenging. Partitioning data across multiple machines allows Elasticsearch to scale beyond what a single machine can do and to support high-throughput operations. Your data is split into small parts called shards. When you create an index, you tell Elasticsearch the number of shards you want for the index, and Elasticsearch handles the rest for you. As you accumulate more data, you can scale horizontally by adding more machines. We will go into more detail in the sections below.
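If you want to try this with the official Python client, creating an index with an explicit shard and replica count looks roughly like the sketch below. It assumes the elasticsearch-py package and a cluster reachable on localhost:9200; the exact call style can vary between client versions.

```python
# Sketch: create an index with six primary shards and one replica each
# (twelve shards in total), assuming elasticsearch-py and a local cluster.
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
es.indices.create(
    index='esintroduction',
    body={
        'settings': {
            'number_of_shards': 6,
            'number_of_replicas': 1,
        }
    },
)
```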
If the node containing the primary shard goes down, the shard's replica is promoted to primary, the data is not lost, and you can continue to operate on the index. In the preceding figure, the esintroduction index has six shards split across the three nodes. The primary of shard 2 belongs to node elasticsearch 1, and the replica of shard 2 belongs to node elasticsearch 3. If the elasticsearch 1 node goes down, the replica in elasticsearch 3 is promoted to primary. This switch is completely transparent and handled by Elasticsearch.

Distributed search

One of the reasons queries executed on Elasticsearch are so fast is that they are distributed. Multiple shards act as one index. A search query on an index is executed in parallel across all the shards. Let's take an example: in the following figure, we have a cluster with two nodes, Node1 and Node2, and an index named chapter1 with two shards, S0 and S1, with one replica:

Assuming the chapter1 index has 100 documents, S1 would have 50 documents and S0 would have 50 documents. Say you want to query for all the documents that contain the word Elasticsearch. The query is executed on S0 and S1 in parallel. The results are gathered back from both shards and sent back to the client. Imagine having to query across millions of documents; using Elasticsearch, the search can be distributed. For the application I'm currently working on, a query on more than 100 million documents comes back within 50 milliseconds, which is simply not possible if the search is not distributed.

Failure handling in Elasticsearch

Elasticsearch handles failures automatically. This section describes how failures are handled internally. Let's say we have an index with two shards and one replica. In the following diagram, the shards represented with a solid line are primary shards, and the shards with a dotted line are replicas:

As shown in the preceding diagram, we initially have a cluster with two nodes. Since the index has two shards and one replica, shards are distributed across the two nodes. To ensure availability, primary and replica shards never exist on the same node. If the node containing both the primary and replica shards went down, the data could not be recovered. In the preceding diagram, you can see that the primary shard S0 belongs to Node 1 and the replica shard S0 to Node 2.

Next, just like we discussed in the Relation between node, index, and shard section, we will add two new nodes to the existing cluster, as shown here:

The cluster now contains four nodes, and the shards are automatically allocated to the new nodes. Each node in the cluster will now contain either a primary or a replica shard. Now, let's say Node2, which contains the primary shard S1, goes down, as shown here:

Since the node that holds the primary shard went down, the replica of S1, which lives on Node3, is promoted to primary. To maintain the replication factor of 1, a copy of shard S1 is made on Node1. This process is known as rebalancing of the cluster. Depending on the application, the number of shards can be configured while creating the index. The process of rebalancing the shards to other nodes is entirely transparent to the user and handled automatically by Elasticsearch.

We discussed inverted indexes, the relation between nodes, indexes, and shards, distributed search, and how failures are handled automatically in Elasticsearch. Check out the book 'Learning Elasticsearch' to learn about handling document relationships, working with geospatial data, and much more.
How to install Elasticsearch in Ubuntu and Windows
Working with Kibana in Elasticsearch 5.x
CRUD (Create, Read, Update and Delete) Operations with Elasticsearch

Firefox Nightly browser: Debugging your app is now fun with Mozilla’s new ‘time travel’ feature

Natasha Mathur
30 Jul 2018
3 min read
Earlier this month, Mozilla announced a fancy new feature called "Time Travel debugging" for its Firefox Nightly web browser at JSConf EU 2018. With time travel debugging, you can easily track the bugs in your code or app, as it lets you pause and rewind to the exact time when your app broke down.

Time travel debugging is particularly useful for local web development, where it allows you to pause and step forward or backward, pause and rewind to a previous state, rewind to the time a console message was logged, and rewind to the time when an element had a certain style. It is also great for times when you might want to save user recordings or view a test recording when testing fails.

With time travel debugging, you can record a tab in your browser and later replay it using WebReplay, an experimental project that allows you to record, rewind, and replay processes for the web. According to Jason Laster, a Senior Software Engineer at Mozilla, "with time travel, we have a full recording of time, you can jump to any point in the path and see it immediately, you don't have to refresh or re-click or pause or look at logs". There is a video of Jason Laster talking about the potential of time travel debugging at JSConf EU.

He also mentioned that time travel is "not a new thing" and that he was inspired by Dan Abramov, the creator of Redux, who showcased time travel at JSConf EU as a way to replay his actions over time. With Redux, you get a slider that shows you all the actions over time, and as you move it, you see the UI update as well. In fact, Mozilla rebuilt its debugger using React and Redux for the time travel feature. The debugger comes equipped with Redux dev tools, which show a list of all the actions for the debugger, along with the state of the app, the sources, and the pause data.

Finally, Laster added that "this is just the beginning" and that "they hope to pull this off well in the future". To use the new time travel debugging feature, you must first install the Firefox Nightly browser. For more details on the new feature, check out the official documentation.

Mozilla is building a bridge between Rust and JavaScript
Firefox has made a password manager for your iPhone
Firefox 61 builds on Firefox Quantum, adds Tab Warming, WebExtensions, and TLS 1.3

Setting Gradle properties to build a project [Tutorial]

Savia Lobo
30 Jul 2018
10 min read
A Gradle script is a program. We use a Groovy DSL to express our build logic. Gradle has several useful built-in methods to handle files and directories, as we often deal with files and directories in our build logic. In today's post, we will take a look at how to set Gradle properties in a project build. We will also see how to use the Gradle Wrapper task to distribute a configurable Gradle with our build scripts. This article is an excerpt taken from 'Gradle Effective Implementations Guide - Second Edition', written by Hubert Klein Ikkink.

Setting Gradle project properties

In a Gradle build file, we can access several properties that are defined by Gradle, but we can also create our own properties. We can set the value of our custom properties directly in the build script, and we can also do this by passing values via the command line. The default properties that we can access in a Gradle build are displayed in the following table:

Name         Type        Default value
project      Project     The project instance.
name         String      The name of the project directory. The name is read-only.
path         String      The absolute path of the project.
description  String      The description of the project.
projectDir   File        The directory containing the build script. The value is read-only.
buildDir     File        The directory with the build name in the directory containing the build script.
rootDir      File        The directory of the project at the root of a project structure.
group        Object      Not specified.
version      Object      Not specified.
ant          AntBuilder  An AntBuilder instance.

The following build file has a task of showing the value of the properties:

version = '1.0'
group = 'Sample'
description = 'Sample build file to show project properties'

task defaultProperties << {
    println "Project: $project"
    println "Name: $name"
    println "Path: $path"
    println "Project directory: $projectDir"
    println "Build directory: $buildDir"
    println "Version: $version"
    println "Group: $project.group"
    println "Description: $project.description"
    println "AntBuilder: $ant"
}

When we run the build, we get the following output:

$ gradle defaultProperties
:defaultProperties
Project: root project 'props'
Name: defaultProperties
Path: :defaultProperties
Project directory: /Users/mrhaki/gradle-book/Code_Files/props
Build directory: /Users/mrhaki/gradle-book/Code_Files/props/build
Version: 1.0
Group: Sample
Description: Sample build file to show project properties
AntBuilder: org.gradle.api.internal.project.DefaultAntBuilder@3c95cbbd

BUILD SUCCESSFUL

Total time: 1.458 secs

Defining custom properties in script

To add our own properties, we have to define them in an ext{} script block in a build file. Prefixing the property name with ext. is another way to set the value. To read the value of the property, we don't have to use the ext. prefix; we can simply refer to the name of the property. The property is automatically added to the internal project property as well. In the following script, we add a customProperty property with a String value of custom. In the showProperties task, we show the value of the property:

// Define new property.
ext.customProperty = 'custom'

// Or we can use ext{} script block.
ext {
    anotherCustomProperty = 'custom'
}

task showProperties {
    ext {
        customProperty = 'override'
    }
    doLast {
        // We can refer to the property
        // in different ways:
        println customProperty
        println project.ext.customProperty
        println project.customProperty
    }
}

After running the script, we get the following output:

$ gradle showProperties
:showProperties
override
custom
custom

BUILD SUCCESSFUL

Total time: 1.469 secs

Defining properties using an external file

We can also set the properties for our project in an external file. The file needs to be named gradle.properties, and it should be a plain text file with the name of the property and its value on separate lines. We can place the file in the project directory or in the Gradle user home directory. The default Gradle user home directory is $USER_HOME/.gradle. A property defined in the properties file in the Gradle user home directory overrides the property values defined in a properties file in the project directory.

We will now create a gradle.properties file in our project directory, with the following contents. We use our build file to show the property values:

task showProperties {
    doLast {
        println "Version: $version"
        println "Custom property: $customProperty"
    }
}

If we run the build file, we don't have to pass any command-line options; Gradle will use gradle.properties to get the values of the properties:

$ gradle showProperties
:showProperties
Version: 4.0
Custom property: Property value from gradle.properties

BUILD SUCCESSFUL

Total time: 1.676 secs

Passing properties via the command line

Instead of defining the property directly in the build script or an external file, we can use the -P command-line option to add an extra property to a build. We can also use the -P command-line option to set a value for an existing property. If we define a property using the -P command-line option, we can override a property with the same name defined in the external gradle.properties file. The following build script has a showProperties task that shows the value of an existing property and a new property:

task showProperties {
    doLast {
        println "Version: $version"
        println "Custom property: $customProperty"
    }
}

Let's run our script and pass the values for the existing version property and the non-existent customProperty:

$ gradle -Pversion=1.1 -PcustomProperty=custom showProperties
:showProperties
Version: 1.1
Custom property: custom

BUILD SUCCESSFUL

Total time: 1.412 secs

Defining properties via system properties

We can also use Java system properties to define properties for our Gradle build. We use the -D command-line option just like in a normal Java application. The name of the system property must start with org.gradle.project, followed by the name of the property we want to set and then the value. We can use the same build script that we created before:

task showProperties {
    doLast {
        println "Version: $version"
        println "Custom property: $customProperty"
    }
}

However, this time we use different command-line options to get a result:

$ gradle -Dorg.gradle.project.version=2.0 -Dorg.gradle.project.customProperty=custom showProperties
:showProperties
Version: 2.0
Custom property: custom

BUILD SUCCESSFUL

Total time: 1.218 secs

Adding properties via environment variables

Using the command-line options provides much flexibility; however, sometimes we cannot use the command-line options because of environment restrictions, or because we don't want to retype the complete command-line options each time we invoke the Gradle build.
Gradle can also use environment variables set in the operating system to pass properties to a Gradle build. The environment variable name starts with ORG_GRADLE_PROJECT_ and is followed by the property name. We use our build file to show the properties:

task showProperties {
    doLast {
        println "Version: $version"
        println "Custom property: $customProperty"
    }
}

Firstly, we set the ORG_GRADLE_PROJECT_version and ORG_GRADLE_PROJECT_customProperty environment variables, then we run our showProperties task, as follows:

$ ORG_GRADLE_PROJECT_version=3.1 ORG_GRADLE_PROJECT_customProperty="Set by environment variable" gradle showProp
:showProperties
Version: 3.1
Custom property: Set by environment variable

BUILD SUCCESSFUL

Total time: 1.373 secs

Using the Gradle Wrapper

Normally, if we want to run a Gradle build, we must have Gradle installed on our computer. Also, if we distribute our project to others and they want to build the project, they must have Gradle installed on their computers. The Gradle Wrapper can be used to allow others to build our project even if they don't have Gradle installed on their computers. The wrapper is a batch script on Microsoft Windows operating systems, or a shell script on other operating systems, that will download Gradle and run the build using the downloaded Gradle. By using the wrapper, we can make sure that the correct Gradle version for the project is used. We can define the Gradle version and, if we run the build via the wrapper script file, the version of Gradle that we defined is used.

Creating wrapper scripts

To create the Gradle Wrapper batch and shell scripts, we can invoke the built-in wrapper task. This task is already available if we have installed Gradle on our computer. Let's invoke the wrapper task from the command line:

$ gradle wrapper
:wrapper

BUILD SUCCESSFUL

Total time: 0.61 secs

After the execution of the task, we have two script files—gradlew.bat and gradlew—in the root of our project directory. These scripts contain all the logic needed to run Gradle. If Gradle is not downloaded yet, the Gradle distribution will be downloaded and installed locally.

In the gradle/wrapper directory, relative to our project directory, we find the gradle-wrapper.jar and gradle-wrapper.properties files. The gradle-wrapper.jar file contains a couple of class files necessary to download and invoke Gradle. The gradle-wrapper.properties file contains settings, such as the URL, to download Gradle. The gradle-wrapper.properties file also contains the Gradle version number. If a new Gradle version is released, we only have to change the version in the gradle-wrapper.properties file, and the Gradle Wrapper will download the new version so that we can use it to build our project.

All the generated files are now part of our project. If we use a version control system, then we must add these files to version control. Other people who check out our project can use the gradlew scripts to execute tasks from the project. The specified Gradle version is downloaded and used to run the build file.

If we want to use another Gradle version, we can invoke the wrapper task with the --gradle-version option. We must specify the Gradle version that the Wrapper files are generated for. By default, the Gradle version that is used to invoke the wrapper task is the Gradle version used by the wrapper files. To specify a different download location for the Gradle installation file, we must use the --gradle-distribution-url option of the wrapper task.
For example, we could have a customized Gradle installation on our local intranet, and with this option, we can generate the Wrapper files that will use the Gradle distribution on our intranet. In the following example, we generate the wrapper files for Gradle 2.12 explicitly:

$ gradle wrapper --gradle-version=2.12
:wrapper

BUILD SUCCESSFUL

Total time: 0.61 secs

Customizing the Gradle Wrapper

If we want to customize properties of the built-in wrapper task, we must add a new task to our Gradle build file with the org.gradle.api.tasks.wrapper.Wrapper type. We will not change the default wrapper task, but create a new task with the new settings that we want to apply. We need to use our new task to generate the Gradle Wrapper shell scripts and support files. We can change the names of the script files that are generated with the scriptFile property of the Wrapper task. To change the name and location of the generated JAR and properties files, we can change the jarFile property:

task createWrapper(type: Wrapper) {
    // Set Gradle version for wrapper files.
    gradleVersion = '2.12'
    // Rename shell scripts name to
    // startGradle instead of default gradlew.
    scriptFile = 'startGradle'
    // Change location and name of JAR file
    // with wrapper bootstrap code and
    // accompanying properties files.
    jarFile = "${projectDir}/gradle-bin/gradle-bootstrap.jar"
}

If we run the createWrapper task, we get a Windows batch file, a shell script, and the Wrapper bootstrap JAR file with the properties file stored in the gradle-bin directory:

$ gradle createWrapper
:createWrapper

BUILD SUCCESSFUL

Total time: 0.605 secs

$ tree .
.
├── gradle-bin
│   ├── gradle-bootstrap.jar
│   └── gradle-bootstrap.properties
├── startGradle
├── startGradle.bat
└── build.gradle

2 directories, 5 files

To change the URL from where the Gradle version must be downloaded, we can alter the distributionUrl property. For example, we could publish a fixed Gradle version on our company intranet and use the distributionUrl property to reference a download URL on our intranet. This way, we can make sure that all developers in the company use the same Gradle version:

task createWrapper(type: Wrapper) {
    // Set URL with custom Gradle distribution.
    distributionUrl = 'http://intranet/gradle/dist/gradle-custom-2.12.zip'
}

We discussed the Gradle properties and how to use the Gradle Wrapper to allow users to build our projects even if they don't have Gradle installed. We also discussed how to customize the Wrapper to download a specific version of Gradle and use it to run our build. If you've enjoyed reading this post, do check out the book 'Gradle Effective Implementations Guide - Second Edition' to learn more about how to use Gradle for Java projects.

Top 7 Python programming books you need to read
4 operator overloading techniques in Kotlin you need to know
5 Things you need to know about Java 10

DeepCube: A new deep reinforcement learning approach solves the Rubik’s cube with no human help

Savia Lobo
29 Jul 2018
4 min read
Humans have long excelled at gameplay of most kinds, indoors or outdoors. Over recent years, however, we have increasingly seen machines play and win popular board games such as Go and chess against humans, using machine learning algorithms. If you think machines are only good at solving the black and whites, you are wrong. DeepCube is the recent achievement of a machine solving a more colorful, complex game: the Rubik's cube.

The Rubik's cube is a challenging puzzle that has captivated people since childhood. Solving it is a brag-worthy accomplishment for most adults. A group of UC Irvine researchers has now developed a new algorithm (used by DeepCube), known as Autodidactic Iteration, which can solve a Rubik's cube with no human assistance.

The Erno Rubik's cube conundrum

The Rubik's cube, a popular three-dimensional puzzle, was developed by Erno Rubik in 1974. Rubik worked for a month to figure out the first algorithm to solve the cube. Researchers at UC Irvine state that "Since then, the Rubik's Cube has gained worldwide popularity and many human-oriented algorithms for solving it have been discovered. These algorithms are simple to memorize and teach humans how to solve the cube in a structured, step-by-step manner."

After the cube became popular among mathematicians and computer scientists, questions around how to solve the cube in the fewest possible turns became mainstream. In 2014, it was proved that the least number of steps needed to solve the cube is 26. More recently, computer scientists have tried to find ways for machines to solve the Rubik's cube. As a first step, they tested the same approach that had succeeded in Go and chess. However, this approach did not work well for the Rubik's cube.

The approach: Rubik vs Chess and Go

Algorithms used in Go and chess are fed the rules of the game and then play against themselves. The deep learning machine is rewarded based on its performance at every step it takes. The reward process is important, as it helps the machine distinguish between good and bad moves. Following this, the machine starts playing well; that is, it learns how to play well.

The rewards in the case of the Rubik's cube, on the other hand, are very hard to determine. The cube's turns look random, and it is hard to judge whether a new configuration is any closer to a solution. The sequence of turns can be unlimited, so earning an end-state reward is very rare. Both chess and Go have a large search space, but each move can be evaluated and rewarded accordingly. This isn't the case for the Rubik's cube! The UC Irvine researchers have found a way for the machine to create its own set of rewards in the Autodidactic Iteration method used by DeepCube.

Autodidactic Iteration: Solving the Rubik's cube without human knowledge

DeepCube's Autodidactic Iteration (ADI) is a form of deep learning known as deep reinforcement learning (DRL). It combines classic reinforcement learning, deep learning, and Monte Carlo Tree Search (MCTS). When DeepCube gets an unsolved cube, it decides whether a specific move is an improvement on the existing configuration. To do this, it must be able to evaluate the move. Autodidactic Iteration starts with the finished cube and works backwards to find a configuration that is similar to the proposed move. Although this process is imperfect, deep learning helps the system figure out which moves are generally better than others.
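To make the backward-from-solved idea concrete, here is a toy Python sketch of how training states can be generated by scrambling outward from the solved configuration, so every sample carries a built-in notion of distance from the goal. This illustrates only the sampling scheme; the state and move representations are placeholders, and the real system learns value and policy estimates with a deep network combined with MCTS:

import random

SOLVED = "solved"                           # placeholder for a real cube state
MOVES = ["U", "U'", "R", "R'", "F", "F'"]   # a subset of cube moves

def apply(state, move):
    # Placeholder transition function; a real implementation would
    # permute the cube's stickers according to the move.
    return f"{state}|{move}"

def training_samples(num_scrambles, depth):
    samples = []
    for _ in range(num_scrambles):
        state = SOLVED
        for k in range(1, depth + 1):
            state = apply(state, random.choice(MOVES))
            # k, the scramble depth, acts as a crude proxy for how far
            # the state is from solved when weighting training updates.
            samples.append((state, k))
    return samples

for state, depth in training_samples(num_scrambles=2, depth=3):
    print(depth, state)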
The researchers trained a network using ADI for 2,000,000 iterations. They further reported, "The network witnessed approximately 8 billion cubes, including repeats, and it trained for a period of 44 hours. Our training machine was a 32-core Intel Xeon E5-2620 server with three NVIDIA Titan XP GPUs."

After training, the network uses a standard search tree to hunt for suggested moves for each configuration. In their paper, the researchers said, "Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves — less than or equal to solvers that employ human domain knowledge." They also wrote, "DeepCube is able to teach itself how to reason in order to solve a complex environment with only one reward state using pure reinforcement learning."

Furthermore, this approach has the potential to provide approximate solutions to a broad class of combinatorial optimization problems. To explore deep reinforcement learning, check out our latest releases, Hands-On Reinforcement Learning with Python and Deep Reinforcement Learning Hands-On.

How greedy algorithms work
Creating a reference generator for a job portal using Breadth First Search (BFS) algorithm
Anatomy of an automated machine learning algorithm (AutoML)

Wireshark for analyzing issues and malicious emails in POP, IMAP, and SMTP [Tutorial]

Vijin Boricha
29 Jul 2018
10 min read
One of the contributing factors in the evolution of digital marketing and business is email. Email allows users to exchange real-time messages and other digital information, such as files and images, over the internet in an efficient manner. Each user is required to have a human-readable email address in the form of username@domainname.com. There are various email providers available on the internet, and any user can register to get a free email address. There are different email application-layer protocols available for sending and receiving mails, and the combination of these protocols enables end-to-end email exchange between users in the same or different mail domains.

In this article, we will look at the normal operation of email protocols and how to use Wireshark for basic analysis and troubleshooting. This article is an excerpt from Network Analysis using Wireshark 2 Cookbook - Second Edition, written by Nagendra Kumar Nainar, Yogesh Ramdoss, and Yoram Orzach.

The three most commonly used application-layer protocols are POP3, IMAP, and SMTP:

POP3: Post Office Protocol 3 (POP3) is an application-layer protocol used by email systems to retrieve mail from email servers. The email client uses POP3 commands such as LOGIN, LIST, RETR, DELE, and QUIT to access and manipulate (retrieve or delete) the email on the server. POP3 uses TCP port 110 and wipes the mail from the server once it is downloaded to the local client.

IMAP: Internet Mail Access Protocol (IMAP) is another application-layer protocol used to retrieve mail from the email server. Unlike POP3, IMAP allows the user to read and access the mail concurrently from more than one client device. With current trends, it is very common to see users with more than one device for accessing email (laptop, smartphone, and so on), and the use of IMAP allows the user to access mail at any time, from any device. The current version of IMAP is 4, and it uses TCP port 143.

SMTP: Simple Mail Transfer Protocol (SMTP) is an application-layer protocol that is used to send email from the client to the mail server. When the sender and receiver are in different email domains, SMTP helps to exchange the mail between servers in different domains. It uses TCP port 25.

As shown in the preceding diagram, SMTP is used by the email client to send the mail to the mail server, and POP3 or IMAP is used to retrieve the email from the server. The email server uses SMTP to exchange mail between different domains.

In order to maintain the privacy of end users, most email servers use different encryption mechanisms at the transport layer. The transport-layer port number will differ from the traditional email protocols if they are used over a secured transport layer (TLS). For example, POP3 over TLS uses TCP port 995, IMAP4 over TLS uses TCP port 993, and SMTP over TLS uses port 465.

Normal operation of mail protocols

As we saw above, the common mail protocols for mail client-to-server and server-to-server communication are POP3, SMTP, and IMAP4. Another common method for accessing email is web access to mail, with common mail servers such as Gmail, Yahoo!, and Hotmail. Examples include Outlook Web Access (OWA) and RPC over HTTPS for the Outlook web client from Microsoft. In this recipe, we will talk about the most common client-server and server-server protocols, POP3 and SMTP, and the normal operation of each protocol.

Getting ready

Port mirroring to capture the packets can be done either on the email client side or on the server side.

How to do it...
POP3 is usually used for client-to-server communications, while SMTP is usually used for server-to-server communications.

POP3 communications

POP3 is usually used for mail client to mail server communications. The normal operation of POP3 is as follows:

1. Open the email client and enter the username and password for login access.
2. Use pop as a display filter to list all the POP packets. It should be noted that this display filter will only list packets that use TCP port 110. If TLS is used, the filter will not list the POP packets; we may need to use tcp.port == 995 to list the POP3 packets over TLS.
3. Check that authentication has passed correctly. In the following screenshot, you can see a session opened with a username that starts with doronn@ (all IDs were deleted) and a password that starts with u6F. To see the TCP stream shown in the following screenshot, right-click on one of the packets in the stream and choose Follow TCP Stream from the drop-down menu.
4. Any error messages in the authentication stage will prevent communications from being established. You can see an example of this in the following screenshot, where user authentication failed. In this case, we see that when the client gets a Logon failure, it closes the TCP connection.
5. Use relevant display filters to list specific packets. For example, pop.request.command == "USER" will list the POP request packets with the username, and pop.request.command == "PASS" will list the POP packets carrying the password. A sample snapshot is as follows.
6. During the mail transfer, be aware that mail clients can easily fill a narrow-band communications line. You can check this by simply configuring the I/O graphs with a filter on POP.
7. Always check for common TCP indications: retransmissions, zero-window, window-full, and others. They can indicate a busy communication line, a slow server, and other problems coming from the communication lines or the end nodes and servers. These problems will mostly cause slow connectivity.

When the POP3 protocol uses TLS for encryption, the payload details are not visible. We explain how the SSL captures can be decrypted in the There's more... section.

IMAP communications

IMAP is similar to POP3 in that it is used by the client to retrieve mail from the server. The normal behavior of IMAP communication is as follows:

1. Open the email client and enter the username and password for the relevant account.
2. Compose a new message and send it from any email account.
3. Retrieve the email on the client that is using IMAP. Different clients may have different ways of retrieving the email; use the relevant button to trigger it.
4. Check that you received the email on your local client.

SMTP communications

SMTP is commonly used for the following purposes:

- Server-to-server communications, in which SMTP is the mail protocol that runs between the servers
- In some clients, POP3 or IMAP4 is configured for incoming messages (messages from the server to the client), while SMTP is configured for outgoing messages (messages from the client to the server)

The normal behavior of SMTP communication is as follows:

1. The local email client resolves the IP address of the configured SMTP server address. This triggers a TCP connection to port number 25 if SSL/TLS is not enabled. If SSL/TLS is enabled, a TCP connection is established over port 465.
2. The client exchanges SMTP messages to authenticate with the server. The client sends AUTH LOGIN to trigger the login authentication. Upon successful login, the client will be able to send mail.
3. The client sends SMTP messages such as "MAIL FROM:<>" and "RCPT TO:<>", carrying the sender and receiver email addresses. Upon successful queuing, we get an OK response from the SMTP server.

The following is a sample SMTP message flow between client and server.

How it works...

In this section, let's look at the normal operation of the different email protocols using Wireshark. Mail clients will mostly use POP3 for communication with the server. In some cases, they will use SMTP as well. IMAP4 is used when server manipulation is required, for example, when you need to see messages that exist on a remote server without downloading them to the client. Server-to-server communication is usually implemented by SMTP.

The difference between IMAP and POP is that in IMAP, the mail is always stored on the server; if you delete it, it will be unavailable from any other machine. In POP, deleting a downloaded email may or may not delete that email on the server.

In general, SMTP status codes are divided into three categories, which are structured in a way that helps you understand what exactly went wrong. The methods and details of SMTP status codes are discussed in the following section.

POP3

POP3 is an application-layer protocol used by mail clients to retrieve email messages from the server. A typical POP3 session will look like the following screenshot. It has the following steps:

1. The client opens a TCP connection to the server.
2. The server sends an OK message to the client (OK Messaging Multiplexor).
3. The user sends the username and password.
4. The protocol operations begin. NOOP (no operation) is a message sent to keep the connection open; STAT (status) is sent from the client to the server to query the message status. The server answers with the number of messages and their total size (in packet 1042, OK 0 0 means no messages and a total size of zero).
5. When there are no mail messages on the server, the client sends a QUIT message (1048), the server confirms it (packet 1136), and the TCP connection is closed (packets 1137, 1138, and 1227).

In an encrypted connection, the process will look nearly the same (see the following screenshot). After the establishment of a connection (1), there are several POP messages (2), TLS connection establishment (3), and then the encrypted application data.

IMAP

The normal operation of IMAP is as follows:

1. The email client resolves the IP address of the IMAP server. As shown in the preceding screenshot, the client establishes a TCP connection to port 143 when SSL/TLS is disabled. When SSL is enabled, the TCP session will be established over port 993.
2. Once the session is established, the client sends an IMAP capability message, requesting that the server send the capabilities it supports.
3. This is followed by authentication for access to the server. When the authentication is successful, the server replies with response code 3, stating that the login was a success.
4. The client now sends the IMAP FETCH command to fetch any mail from the server.
5. When the client is closed, it sends a logout message and clears the TCP session.

SMTP

The normal operation of SMTP is as follows:

1. The email client resolves the IP address of the SMTP server.
2. The client opens a TCP connection to the SMTP server on port 25 when SSL/TLS is not enabled. If SSL is enabled, the client will open the session on port 465.
3. Upon successful TCP session establishment, the client sends an AUTH LOGIN message to prompt for the account username/password.
4. The username and password are sent to the SMTP server for account verification. SMTP sends a response code of 235 if authentication is successful.
5. The client now sends the sender's email address to the SMTP server. The SMTP server responds with a response code of 250 if the sender's address is valid.
6. Upon receiving an OK response from the server, the client sends the receiver's address. The SMTP server responds with a response code of 250 if the receiver's address is valid.
7. The client now pushes the actual email message. SMTP responds with a response code of 250 and the response parameter OK: queued. The successfully queued message ensures that the mail is sent and queued for delivery to the receiver's address. (A short sketch for generating this traffic for capture follows below.)

We have learned how to analyze issues and malicious emails in POP, IMAP, and SMTP. Get to know more about DNS protocol analysis and FTP, HTTP/1, and HTTP/2 from the book Network Analysis using Wireshark 2 Cookbook - Second Edition.

What's new in Wireshark 2.6?
Analyzing enterprise application behavior with Wireshark 2
Capturing Wireshark Packets
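As a companion to the walkthrough above, here is a minimal way to generate the POP3 and SMTP exchanges so they can be captured and inspected in Wireshark with the pop and smtp display filters. The host names, ports, and credentials are placeholders; poplib, smtplib, and email are all in the Python standard library:

import poplib
import smtplib
from email.message import EmailMessage

# POP3: USER/PASS login, STAT, QUIT -- visible on TCP port 110.
pop = poplib.POP3("mail.example.com", 110)
pop.user("username")
pop.pass_("password")
print(pop.stat())        # (message count, total mailbox size)
pop.quit()

# SMTP: AUTH, MAIL FROM, RCPT TO, DATA -- visible on TCP port 25.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "receiver@example.com"
msg["Subject"] = "Wireshark test"
msg.set_content("Generated for packet capture.")

smtp = smtplib.SMTP("mail.example.com", 25)
smtp.login("username", "password")   # triggers the AUTH exchange
smtp.send_message(msg)               # MAIL FROM / RCPT TO / DATA
smtp.quit()

Run a capture on the client interface while executing this, and the command/response sequences described in the steps above appear in clear text (as long as TLS is not in use).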

Creating effective dashboards using Splunk [Tutorial]

Sunith Shetty
28 Jul 2018
10 min read
Splunk is easy to use for developing a powerful analytical dashboard with multiple panels. A dashboard with too many panels, however, will require scrolling down the page and can cause the viewer to miss crucial information. An effective dashboard should generally meet the following conditions:

- Single screen view: The dashboard fits in a single window or page, with no scrolling.
- Multiple data points: Charts and visualizations should display a number of data points.
- Crucial information highlighted: The dashboard points out the most important information, using appropriate titles, labels, legends, markers, and conditional formatting as required.
- Created with the user in mind: Data is presented in a way that is meaningful to the user.
- Loads quickly: The dashboard returns results in 10 seconds or less.
- Avoids redundancy: The display does not repeat information in multiple places.

In this tutorial, we learn to create different types of dashboards using Splunk. We will also discuss how to gather business requirements for your dashboards.

Types of Splunk dashboards

There are three kinds of dashboards typically created with Splunk:

- Dynamic form-based dashboards
- Real-time dashboards
- Dashboards as scheduled reports

Dynamic form-based dashboards allow Splunk users to modify the dashboard data without leaving the page. This is accomplished by adding data-driven input fields (such as time, radio button, textbox, checkbox, dropdown, and so on) to the dashboard. Updating these inputs changes the data based on the selections. Dynamic form-based dashboards have existed in traditional business intelligence tools for decades, so users who frequently use them will be familiar with changing prompt values on the fly to update the dashboard data.

Real-time dashboards are often kept on a big panel screen for constant viewing, simply because they are so useful. You see these dashboards in data centers, network operations centers (NOCs), or security operations centers (SOCs), with the format constant and the data changing in real time. The dashboard will also have indicators and alerts for operators to easily identify and act on a problem. Dashboards like this typically show the current state of security, network, or business systems, using indicators for web performance and traffic, revenue flow, login failures, and other important measures.

Dashboards as scheduled reports may not be exposed for viewing; instead, the dashboard view will generally be saved as a PDF file and sent to email recipients at scheduled times. This format is ideal when you need to send information updates to multiple recipients at regular intervals and don't want to force them to log in to Splunk to capture the information themselves.

We will create the first two types of dashboards, and you will learn how to use the Splunk dashboard editor to develop advanced visualizations along the way.

Gathering business requirements

One of a Splunk administrator's most important responsibilities is stewardship of the data. As a custodian of data, a Splunk admin has significant influence over how information is interpreted and presented to users. It is common for the administrator to create the first few dashboards. A more mature implementation, however, requires collaboration to create an output that benefits a variety of user requirements, and may be completed by a Splunk development resource with limited administrative rights.
Make it a habit to consistently request users' input regarding the Splunk-delivered dashboards and reports and what makes them useful. Sit down with day-to-day users and lay out, on a drawing board for example, the business process flows or system diagrams, to understand how the underlying processes and systems you're trying to measure really work. Look for key phrases like these, which signify what data is most important to the business:

- If this is broken, we lose tons of revenue...
- This is a constant point of failure...
- We don't know what's going on here...
- If only I could see the trend, it would make my work easier...
- This is what my boss wants to see...

Splunk dashboard users may come from many areas of the business. You want to talk to all the different users, no matter where they are on the organizational chart. When you make friends with the architects, developers, business analysts, and management, you will end up building dashboards that benefit the organization, not just individuals. With an initial dashboard version, ask for users' thoughts as you observe them using it in their work, and ask what can be improved upon, added, or changed.

We hope that at this point you realize the importance of dashboards and are ready to get started creating some, as we will do in the following sections.

Dynamic form-based dashboard

In this section, we will create a dynamic form-based dashboard in our Destinations app that allows users to change input values and rerun the dashboard, presenting updated data. Here is a screenshot of the final output of this dynamic form-based dashboard:

Let's begin by creating the dashboard itself and then generating the panels:

1. Go to the search bar in the Destinations app.
2. Run this search command:

SPL> index=main status_type="*" http_uri="*" server_ip="*" | top status_type, status_description, http_uri, server_ip

Be careful when copying commands with quotation marks. It is best to type in the entire search command to avoid problems.

3. Go to Save As | Dashboard Panel.
4. Fill in the information based on the following screenshot:
5. Click on Save.
6. Close the pop-up window that appears (indicating that the dashboard panel was created) by clicking on the X in the top-right corner of the window.

Creating a Status Distribution panel

We will return to the dashboard after all the panel searches have been generated. Let's go ahead and create the second panel:

1. In the search window, type in the following search command:

SPL> index=main status_type="*" http_uri=* server_ip=* | top status_type

2. You will save this as a dashboard panel in the newly created dashboard. In the Dashboard option, click on the Existing button and look for the new dashboard, as seen here. Don't forget to fill in the Panel Title as Status Distribution:
3. Click on Save when you are done, and again close the pop-up window signaling the addition of the panel to the dashboard.

Creating the Status Types Over Time panel

Now, we'll move on to create the third panel:

1. Type in the following search command and be sure to run it so that it is the active search:

SPL> index=main status_type="*" http_uri=* server_ip=* | timechart count by http_status_code

2. You will save this as a dynamic form-based dashboard panel as well. Type in Status Types Over Time in the Panel Title field:
3. Click on Save and close the pop-up window signaling the addition of the panel to the dashboard.

Creating the Hits vs Response Time panel

Now, on to the final panel.
1. Run the following search command:

SPL> index=main status_type="*" http_uri=* server_ip=* | timechart count, avg(http_response_time) as response_time

2. Save this dashboard panel as Hits vs Response Time:

Arrange the dashboard

We'll move on to look at the dashboard we've created and make a few changes:

1. Click on the View Dashboard button. If you missed the View Dashboard button, you can find your dashboard by clicking on Dashboards in the main navigation bar.
2. Let's edit the panel arrangement. Click on the Edit button.
3. Move the Status Distribution panel to the upper-right row.
4. Move the Hits vs Response Time panel to the lower-right row.
5. Click on Save to save your layout changes.

Look at the following screenshot. The dashboard framework you've created should now look much like this. The dashboard probably looks a little plainer than you expected it to, but don't worry; we will improve the dashboard visuals one panel at a time:

Panel options in dashboards

In this section, we will learn how to alter the look of our panels and create visualizations.

Go to the edit dashboard mode by clicking on the Edit button. Each dashboard panel has three setting options to work with: edit search, select visualization, and visualization format options. They are represented by three drop-down icons:

The Edit Search window allows you to modify the search string, change the time modifier for the search, add auto-refresh and progress bar options, and convert the panel into a report:

The Select Visualization dropdown allows you to change the type of visualization to use for the panel, as shown in the following screenshot:

Finally, the Visualization Options dropdown gives you the ability to fine-tune your visualization. These options change depending on the visualization you select. For a normal statistics table, this is how it looks:

Pie chart – Status Distribution

Go ahead and change the Status Distribution visualization panel to a pie chart. You do this by clicking on the Select Visualization icon and selecting the Pie icon. Once done, the panel will look like the following screenshot:

Stacked area chart – Status Types Over Time

We will change the view of the Status Types Over Time panel to an area chart. However, by default, area charts are not stacked. We will update this by adjusting the visualization options:

1. Change the Status Types Over Time panel to an Area Chart, using the same Select Visualization button as in the prior pie chart exercise.
2. Make the area chart stacked using the Format Visualization icon. In the Stack Mode section, click on Stacked. For Null Values, select Zero. Use the chart that follows for guidance:
3. Click on Apply. The panel will change right away.
4. Remove the _time label, as it is already implied. You can do this in the X-Axis section by setting the Title to None.
5. Close the Format Visualization window by clicking on the X in the upper-right corner.

Here is the new stacked area chart panel:

Column with overlay combination chart – Hits vs Response Time

When representing two or more kinds of data with different ranges, using a combination chart—in this case combining a column and a line—can tell a bigger story than one metric and scale alone.
We'll use the Hits vs Response Time panel to explore the combination charting options:

1. In the Hits vs Response Time panel, change the chart panel visualization to Column.
2. In the Visualization Options window, click on Chart Overlay.
3. In the Overlay selection box, select response_time.
4. Turn on View as Axis.
5. Click on X-Axis from the list of options on the left of the window and change the Title to None.
6. Click on Legend from the list of options on the left.
7. Change the Legend Position to Bottom.
8. Click on the X in the upper-right corner to close the Visualization Options window.

The new panel will now look similar to the following screenshot. From this and the prior screenshot, you can see there was clearly an outage in the overnight hours:

Click on Done to save all the changes you made and exit Edit mode. The dashboard has now come to life. This is how it should look now:

To summarize, we saw how to create different types of dashboards. To learn more about core Splunk functionality for transforming machine data into powerful insights, check out the book Splunk 7 Essentials, Third Edition.

Splunk leverages AI in its monitoring tools
Splunk Industrial Asset Intelligence (Splunk IAI) targets Industrial IoT marketplace
Create a data model in Splunk to enable interactive reports and dashboards
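As a side note, the SPL searches that drive these panels can also be run outside the UI, which is handy for verifying results while building dashboards. The following is a sketch against Splunk's documented REST search endpoint on the management port; the host, port (8089 is the Splunk default), and credentials are placeholders, and the requests library is assumed to be available:

import requests

resp = requests.post(
    "https://localhost:8089/services/search/jobs/export",
    auth=("admin", "changeme"),      # placeholder credentials
    verify=False,                    # self-signed certs are common on dev instances
    data={
        "search": 'search index=main status_type="*" http_uri=* '
                  'server_ip=* | top status_type',
        "output_mode": "json",
    },
)
for line in resp.iter_lines():
    if line:
        print(line.decode())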

23andMe shares 5mn client genetic data with GSK for drug target discovery, a machine learning application in genetics research

Sugandha Lahoti
28 Jul 2018
3 min read
Genetics company 23andMe, which uses machine learning algorithms for human genome analysis, has entered into a four-year collaboration with pharmaceutical giant GlaxoSmithKline (GSK). It will now share its 5 million clients' genetic data with GSK to advance research into treatments of diseases. The collaboration will be used to identify novel drug targets, tackle new subsets of disease, and enable rapid progression of clinical programs. The 12-year-old firm has already published more than 100 scientific papers based on its customers' data. All activities within the collaboration will initially be co-funded, with either company having certain rights to reduce its funding share.

"The goal of the collaboration is to gather insights and discover novel drug targets driving disease progression and develop therapies," GlaxoSmithKline said in a press release. GSK is also reported to have invested $300 million in 23andMe.

During the four-year collaboration, GSK will use 23andMe's database and statistical analytics for drug target discovery. The collaboration will support the design of GSK's LRRK2 inhibitor, which is in development as a potential treatment for Parkinson's disease. 23andMe's database of consented customers with a known LRRK2 variant status will be used to accelerate the progress of this programme. Together, GSK and 23andMe will target and recruit patients with defined LRRK2 mutations in order to reach clinical proof of concept.

23andMe has made it quite clear that participating in this program is voluntary and requires clients to affirmatively consent to participate. However, not everyone is clear on how this would work. First, the company has specified that any research involving customer data that has already been performed or published prior to receipt of a withdrawal request will not be reversed. This may have a negative effect, as people are generally not aware of all the privacy policies and often don't read the Terms of Service. Moreover, as Peter Pitts, president of the Center for Medicine in the Public Interest, notes, "If a person's DNA is used in research, that person should be compensated. Customers shouldn't be paying for the privilege of 23andMe working with a for-profit company in a for-profit research project."

Both companies have pledged to provide maximum data protection for their customers. In a blog post, they note, "The continued protection of customers' data and privacy is the highest priority for both GSK and 23andMe. Both companies have stringent security protections in place when it comes to collecting, storing and transferring information about research participants."

You can read more about the news in a blog post by 23andMe founder Anne Wojcicki.

6 use cases of Machine Learning in Healthcare
Healthcare Analytics: Logistic Regression to Reduce Patient Readmissions
NIPS 2017 Special: How machine learning for genomics is bridging the gap between research and clinical trial success by Brendan Frey

Setting up Apache Druid in Hadoop for Data visualizations [Tutorial]

Sunith Shetty
27 Jul 2018
9 min read
Apache Druid is a distributed, high-performance columnar store. Druid allows us to store both real-time and historical data that is time series in nature. It also provides fast data aggregation and flexible data exploration. The architecture supports storing trillions of data points at petabyte sizes.

In this tutorial, we will explore the Apache Druid components and how Druid can be used to visualize data in order to build the analytics that drives business decisions. We will walk through how to set up Apache Druid in Hadoop to visualize data. In order to understand more about the Druid architecture, you may refer to this white paper. This article is an excerpt from a book written by Naresh Kumar and Prashant Shindgikar titled Modern Big Data Processing with Hadoop.

Apache Druid components

Let's take a quick look at the different components of the Druid cluster:

Component          Description
Druid Broker       The nodes that are aware of where the data lies in the cluster. These nodes are contacted by the applications/clients to get the data within Druid.
Druid Coordinator  The nodes that manage the data (they load, drop, and load-balance it) on the historical nodes.
Druid Overlord     The component responsible for accepting tasks and returning the statuses of the tasks.
Druid Router       The nodes needed when the data volume is in the terabyte or higher range. These nodes route the requests to the brokers.
Druid Historical   The nodes that store immutable segments and are the backbone of the Druid cluster. They load segments, drop segments, and serve queries on segments.

Other required components

The following table presents a couple of other required components:

Component         Description
Zookeeper         Apache Zookeeper is a highly reliable distributed coordination service.
Metadata Storage  MySQL and PostgreSQL are the popular RDBMSes used to keep track of all segments, supervisors, tasks, and configurations.

Apache Druid installation

Apache Druid can be installed either in standalone mode or as part of a Hadoop cluster. In this section, we will see how to install Druid via Apache Ambari.

Add service

First, we invoke the Actions drop-down below the list of services in the Hadoop cluster. The screen looks like this:

Select Druid and Superset

In this setup, we will install both Druid and Superset at the same time. Superset is the visualization application that we will learn about in the next step. The selection screen looks like this:

Click on Next when both services are selected.

Service placement on servers

In this step, we will be given a choice to select the servers on which the application has to be installed. I have selected node 3 for this purpose; you can select any node you wish. The screen looks something like this:

Click on Next when the changes are done.

Choose Slaves and Clients

Here, we are given a choice to select the nodes on which we need the Slaves and Clients for the installed components. I have left the options that were already selected for me:

Service configurations

In this step, we need to select the databases, usernames, and passwords for the metadata store used by the Druid and Superset applications. Feel free to choose the defaults. I have given MySQL as the backend store for both of them. The screen looks like this:

Once the changes look good, click on the Next button at the bottom of the screen.

Service installation

In this step, the applications will be installed automatically, and the status will be shown at the end of the plan.
Click on Next once the installation is complete.

Installation summary

Once everything is successfully completed, we are shown a summary of what has been done. Click on Complete when done.

Sample data ingestion into Druid

Once we have all the Druid-related applications running in our Hadoop cluster, we need a sample dataset that we must load in order to run some analytics tasks. Let's see how to load sample data:

1. Download the Druid archive from the internet:

[druid@node-3 ~]$ curl -O http://static.druid.io/artifacts/releases/druid-0.12.0-bin.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  222M  100  222M    0     0  1500k      0  0:02:32  0:02:32 --:--:--  594k

2. Extract the archive:

[druid@node-3 ~]$ tar -xzf druid-0.12.0-bin.tar.gz

3. Copy the sample Wikipedia data to Hadoop:

[druid@node-3 ~]$ cd druid-0.12.0
[druid@node-3 ~/druid-0.12.0]$ hadoop fs -mkdir /user/druid/quickstart
[druid@node-3 ~/druid-0.12.0]$ hadoop fs -put quickstart/wikiticker-2015-09-12-sampled.json.gz /user/druid/quickstart/

4. Submit the import request:

[druid@node-3 druid-0.12.0]$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-index.json localhost:8090/druid/indexer/v1/task;echo
{"task":"index_hadoop_wikiticker_2018-03-16T04:54:38.979Z"}

After this step, Druid will automatically import the data into the Druid cluster, and the progress can be seen in the overlord console. The interface is accessible via http://<overlord-ip>:8090/console.html. The screen looks like this:

Once the ingestion is complete, we will see the status of the job as SUCCESS. In case of FAILED imports, please make sure that the backend configured to store the metadata for the Druid cluster is up and running. Even though Druid works well with an OpenJDK installation, I have faced a problem with a few classes not being available at runtime. In order to overcome this, I have had to use Oracle Java version 1.8 to run all Druid applications.

Now we are ready to start using Druid for our visualization tasks.

MySQL database with Apache Druid

We will use a MySQL database to store the data. Apache Druid allows us to read the data present in an RDBMS system such as MySQL.

Sample database

The employees database is a standard dataset that has a sample organization with its employee, salary, and department data. We will see how to set it up for our tasks. This section assumes that the MySQL database is already configured and running.

Download the sample dataset

Download the sample dataset from GitHub with the following commands on any server that has access to the MySQL database:

[user@master ~]$ sudo yum install git -y
[user@master ~]$ git clone https://github.com/datacharmer/test_db
Cloning into 'test_db'...
remote: Counting objects: 98, done.
remote: Total 98 (delta 0), reused 0 (delta 0), pack-reused 98
Unpacking objects: 100% (98/98), done.

Copy the data to MySQL

In this step, we will import the contents of the data files into the MySQL database:

[user@master test_db]$ mysql -u root < employees.sql
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries
data_load_time_diff
NULL

Verify integrity of the tables

This is an important step, just to make sure that all of the data we have imported is correctly stored in the database.
MySQL database with Apache Druid

We will use a MySQL database to store the data. Apache Druid allows us to read data present in an RDBMS system such as MySQL.

Sample database

The employees database is a standard dataset that contains a sample organization and its employee, salary, and department data. We will see how to set it up for our tasks. This section assumes that the MySQL database is already configured and running.

Download the sample dataset

Download the sample dataset from GitHub with the following commands on any server that has access to the MySQL database:

[user@master ~]$ sudo yum install git -y
[user@master ~]$ git clone https://github.com/datacharmer/test_db
Cloning into 'test_db'...
remote: Counting objects: 98, done.
remote: Total 98 (delta 0), reused 0 (delta 0), pack-reused 98
Unpacking objects: 100% (98/98), done.

Copy the data to MySQL

In this step, we import the contents of the data files into the MySQL database:

[user@master test_db]$ mysql -u root < employees.sql
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries
data_load_time_diff
NULL

Verify integrity of the tables

This is an important step, just to make sure that all of the data we have imported is correctly stored in the database. The summary of the integrity check is shown as the verification happens:

[user@master test_db]$ mysql -u root -t < test_employees_sha.sql
+----------------------+
| INFO                 |
+----------------------+
| TESTING INSTALLATION |
+----------------------+
+--------------+------------------+------------------------------------------+
| table_name   | expected_records | expected_crc                             |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | d95ab9fe07df0865f592574b3b33b9c741d9fd1b |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name   | found_records    | found_crc                                |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | d95ab9fe07df0865f592574b3b33b9c741d9fd1b |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name   | records_match | crc_match |
+--------------+---------------+-----------+
| employees    | OK            | ok        |
| departments  | OK            | ok        |
| dept_manager | OK            | ok        |
| dept_emp     | OK            | ok        |
| titles       | OK            | ok        |
| salaries     | OK            | ok        |
+--------------+---------------+-----------+
+------------------+
| computation_time |
+------------------+
| 00:00:11         |
+------------------+
+---------+--------+
| summary | result |
+---------+--------+
| CRC     | OK     |
| count   | OK     |
+---------+--------+

Now the data is correctly loaded in the MySQL database called employees.

Single Normalized Table

In data warehouses, it is a standard practice to have a single flat table rather than many small related tables. Let's create such a table containing the details of employees, salaries, and departments:

MariaDB [employees]> create table employee_norm as
    select e.emp_no, e.birth_date,
           CONCAT_WS(' ', e.first_name, e.last_name) full_name,
           e.gender, e.hire_date, s.salary, s.from_date, s.to_date,
           d.dept_name, t.title
    from employees e, salaries s, departments d, dept_emp de, titles t
    where e.emp_no = t.emp_no
      and e.emp_no = s.emp_no
      and d.dept_no = de.dept_no
      and e.emp_no = de.emp_no
      and s.to_date < de.to_date
      and s.to_date < t.to_date
    order by emp_no, s.from_date;
Query OK, 3721923 rows affected (1 min 7.14 sec)
Records: 3721923  Duplicates: 0  Warnings: 0

MariaDB [employees]> select * from employee_norm limit 1\G
*************************** 1. row ***************************
    emp_no: 10001
birth_date: 1953-09-02
 full_name: Georgi Facello
    gender: M
 hire_date: 1986-06-26
    salary: 60117
 from_date: 1986-06-26
   to_date: 1987-06-26
 dept_name: Development
     title: Senior Engineer
1 row in set (0.00 sec)

Once we have this flattened data in place, we can use it to generate rich visualizations; a sample aggregate query is sketched below.
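To illustrate the kind of question the flat employee_norm table can answer in a single pass, here is a sample aggregate of our own devising (not from the original text); it is exactly the sort of query a visualization layer would issue against this table:

-- Average salary and headcount by department and title,
-- read straight off the single flattened table
SELECT dept_name,
       title,
       AVG(salary)            AS avg_salary,
       COUNT(DISTINCT emp_no) AS headcount
FROM employee_norm
GROUP BY dept_name, title
ORDER BY avg_salary DESC
LIMIT 5;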
To summarize, we walked through a Hadoop application, Apache Druid, that is used to visualize data, and learned how to use it with RDBMSes such as MySQL. We also loaded a sample database to help us understand the application better. To learn how to visualize data using Apache Superset, and how to use it with data in RDBMSes such as MySQL, do check out the book Modern Big Data Processing with Hadoop.

What makes Hadoop so revolutionary?
Top 8 ways to improve your data visualizations
What is Seaborn and why should you use it for data visualization?
Use AutoML for building simple to complex machine learning pipelines [Tutorial]

Sunith Shetty
27 Jul 2018
15 min read
Many moving parts have to be tied together for an ML model to execute and produce results successfully. This process of tying together different pieces of the ML process is known as a pipeline. A pipeline is a generalized, but very important, concept for a data scientist. In software engineering, people build pipelines to develop software that is exercised from source code to deployment. Similarly, in ML, a pipeline is created to allow data to flow from its raw format to some useful information. It also provides a mechanism to construct a multi-ML parallel pipeline system in order to compare the results of several ML methods.

In this tutorial, we will see how to create our own AutoML pipelines and understand how to build pipelines that handle the model-building process. Each stage of a pipeline is fed processed data from its preceding stage; that is, the output of a processing unit is supplied as the input to its next step. The data flows through the pipeline just as water flows in a pipe. Mastering the pipeline concept is a powerful way to create error-free ML models, and pipelines form a crucial element of building an AutoML system.

The code files for this article are available on GitHub. This article is an excerpt from a book written by Sibanjan Das and Umit Mert Cakmak titled Hands-On Automated Machine Learning.

Getting to know machine learning pipelines

Usually, an ML algorithm needs clean data to detect patterns in the data and make predictions over a new dataset. However, in real-world applications, the data is often not ready to be fed directly into an ML algorithm. Similarly, the output from an ML model is just numbers or characters that need to be processed for performing some action in the real world. To accomplish that, the ML model has to be deployed in a production environment. This entire framework of converting raw data to usable information is performed using an ML pipeline. We can break down the blocks of a high-level ML pipeline as follows:

Data Ingestion: This is the process of obtaining and importing data for use. Data can be sourced from multiple systems, such as Enterprise Resource Planning (ERP) software, Customer Relationship Management (CRM) software, and web applications. The extraction can happen in real time or in batches. Acquiring the data is sometimes a tricky part and one of the most challenging steps, as we need good business and data understanding abilities.

Data Preparation: There are several methods for preprocessing the data into a suitable form for building models. Real-world data is often skewed, has missing values, and is sometimes noisy. It is therefore necessary to preprocess the data to make it clean and transformed, so it is ready to be run through the ML algorithms.

ML model training: This involves the use of various ML techniques to understand essential features in the data, make predictions, or derive insights from it. Often, the ML algorithms are already coded and available as APIs or programming interfaces. The most important responsibility we need to take on is tuning the hyperparameters. The use of hyperparameters and optimizing them to create a best-fitting model is the most critical and complicated part of the model-training phase.

Model Evaluation: There are various criteria by which a model can be evaluated. It is a combination of statistical methods and business rules.
In an AutoML pipeline, the evaluation is mostly based on various statistical and mathematical measures. If an AutoML system is developed for a specific business domain or use case, then business rules can also be embedded into the system to evaluate the correctness of a model.

Retraining: The first model that we create for a use case is often not the best model. It is considered a baseline model, and we try to improve its accuracy by training it repetitively.

Deployment: The final step is to deploy the model, which involves applying and migrating the model to business operations for use. The deployment stage is highly dependent on the IT infrastructure and software capabilities an organization has.

As we can see, there are several stages that we need to perform to get results out of an ML model. scikit-learn provides a pipeline functionality that can be used to create several complex pipelines. While building an AutoML system, pipelines are going to be very complex, as many different scenarios have to be captured. However, if we know how to preprocess the data, utilize an ML algorithm, and apply various evaluation metrics, a pipeline is a matter of giving a shape to those pieces. Let's design a very simple pipeline using scikit-learn.

Simple ML pipeline

We will first import a dataset known as Iris, which is already available in scikit-learn's sample dataset library (http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). The dataset consists of four features and has 150 rows. The problem statement is to predict the species of an Iris flower using its four features. In this pipeline, we will use a MinMaxScaler method to scale the input data and logistic regression to predict the species. The model will then be evaluated based on the accuracy measure.

The first step is to import various libraries from scikit-learn that provide the methods to accomplish our task. The only addition is the Pipeline method from sklearn.pipeline, which provides the functionality needed to create an ML pipeline:

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

The next step is to load the Iris data and split it into training and test datasets. In this example, we will use 80% of the dataset to train the model and the remaining 20% to test its accuracy. We can use the shape attribute to view the dimensions of the dataset:

# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
X_train.shape

The result shows the training dataset having 4 columns and 120 rows, which equates to 80% of the Iris dataset, as expected. Next, we print the dataset to take a glance at the data:

print(X_train)
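Before wiring these two steps into a pipeline, it may help to see exactly what the pipeline will encapsulate. The following sketch is our addition, not from the original text: it chains the same scaler and classifier by hand, and shows why the scaler must be fitted on the training data only and then reused, unchanged, on the test data:

# Manual equivalent of the two pipeline stages, chained by hand
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn the scaling and apply it
lr = LogisticRegression(random_state=42)
lr.fit(X_train_scaled, y_train)                 # train on the transformed data

X_test_scaled = scaler.transform(X_test)        # reuse the learned scaling; no refitting
print(lr.score(X_test_scaled, y_test))

A pipeline bundles exactly this sequence of calls, so the steps cannot be applied out of order or fitted on the wrong split.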
The next step is to create a pipeline. The pipeline object takes the form of (key, value) pairs, where the key is a string naming a particular step and the value is the corresponding function or actual method. In the following code snippet, we have named the MinMaxScaler() method minmax and LogisticRegression(random_state=42) lr:

pipe_lr = Pipeline([('minmax', MinMaxScaler()),
                    ('lr', LogisticRegression(random_state=42))])

Then, we fit the pipeline object, pipe_lr, to the training dataset:

pipe_lr.fit(X_train, y_train)

When we execute the preceding code, we see the final structure of the fitted model. The last step is to score the model on the test dataset using the score method:

score = pipe_lr.score(X_test, y_test)
print('Logistic Regression pipeline test accuracy: %.3f' % score)

The accuracy of the model is 0.900, that is, 90%. In the preceding example, we created a pipeline constituted of two steps: minmax scaling and LogisticRegression. When we executed the fit method on the pipe_lr pipeline, the MinMaxScaler performed fit and transform on the input data, and the result was passed on to the estimator, which is the logistic regression model. These intermediate steps in a pipeline are known as transformers, and the last step is an estimator.

Transformers are used for data preprocessing and have two methods, fit and transform. The fit method is used to find parameters from the training data, and the transform method is used to apply the data preprocessing technique to the dataset. Estimators are used for creating an ML model and have two methods, fit and predict. The fit method is used to train an ML model, and the predict method is used to apply the trained model to a test or new dataset.

We have to call only the pipeline's fit method to train a model and its predict method to create predictions; the individual fit and transform calls of the intermediate steps are encapsulated within the pipeline's functionality.
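To make the transformer contract concrete, here is a minimal custom transformer of our own devising (the class name and behavior are hypothetical, not from the original text). Deriving from BaseEstimator and TransformerMixin gives it fit_transform for free and lets it slot into a Pipeline like any built-in preprocessor:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical stateful transformer: centers each column on its training mean."""
    def fit(self, X, y=None):
        # learn parameters from the training data only
        self.means_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # apply the learned transformation to any dataset
        return np.asarray(X) - self.means_

# It drops into a pipeline exactly like MinMaxScaler does, for example:
# Pipeline([('center', MeanCenterer()), ('lr', LogisticRegression())])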
Sometimes, we will need to write custom functions to perform custom transformations. The FunctionTransformer described next can assist us in implementing this functionality.

FunctionTransformer

A FunctionTransformer is used to wrap a user-defined function that consumes the data from the pipeline and returns its result to the next stage of the pipeline. This is used for stateless transformations, such as taking the square or log of numbers, defining custom scaling functions, and so on. In the following example, we will build a pipeline using a CustomLog function and the predefined preprocessing method StandardScaler.

We import all the required libraries, as we did in our previous examples. The only addition here is the FunctionTransformer method from the sklearn.preprocessing library. This method is used to execute a custom transformer function and stitch it together with the other stages of a pipeline:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler

In the following code snippet, we define a custom function that returns the log of a number X:

def CustomLog(X):
    return np.log(X)

Next, we define a data preprocessing function named PreprocData, which accepts the input data (X) and target (Y) of a dataset. For this example, the Y is not necessary, as we are not going to build a supervised model and are just demonstrating a data preprocessing pipeline. However, in the real world, we could directly use this function to create a supervised ML model.

Here, we use the make_pipeline function to create a pipeline. We used the Pipeline function in our earlier example, where we had to define names for the data preprocessing and ML functions. The advantage of using make_pipeline is that it generates the names or keys of the functions automatically:

def PreprocData(X, Y):
    pipe = make_pipeline(
        FunctionTransformer(CustomLog), StandardScaler()
    )
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
    pipe.fit(X_train, Y_train)
    return pipe.transform(X_test), Y_test

As we are ready with the pipeline, we can load the Iris dataset. We print the input data X to take a look at it:

iris = load_iris()
X, Y = iris.data, iris.target
print(X)

Next, we call the PreprocData function, passing in the Iris data. The result returned is a transformed dataset, which has been processed first using our CustomLog function and then using the StandardScaler data preprocessing method:

X_transformed, Y_transformed = PreprocData(X, Y)
print(X_transformed)
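Earlier we noted that hyperparameter tuning is the most critical part of model training, and pipelines make it straightforward because a whole pipeline can be tuned as a single estimator. The following sketch is our addition, not from the original text: it grid-searches the regularization strength of the logistic regression step from the simple pipeline, using scikit-learn's step__parameter naming convention:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe_lr = Pipeline([('minmax', MinMaxScaler()),
                    ('lr', LogisticRegression(random_state=42))])

# 'lr__C' targets the C parameter of the step named 'lr'
param_grid = {'lr__C': [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe_lr, param_grid, cv=5)
grid.fit(X_train, y_train)  # each fold rescales and refits the whole pipeline
print(grid.best_params_, grid.best_score_)

Because the scaler sits inside the searched estimator, every cross-validation fold is rescaled from scratch, which avoids leaking validation-fold statistics into training.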
We will now need to build various complex pipelines for an AutoML system. In the following section, we will create a sophisticated pipeline using several data preprocessing steps and ML algorithms.

Complex ML pipeline

In this section, we will determine the best classifier to predict the species of an Iris flower using its four features, combining four different data preprocessing techniques with four different ML algorithms. We will proceed as follows:

We start by importing the various libraries and functions that are required for the task:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn import tree
from sklearn.pipeline import Pipeline

Next, we load the Iris dataset and split it into train and test datasets. The X_train and y_train datasets will be used for training the different models, and X_test and y_test will be used for testing the trained models:

# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

Next, we create four different pipelines, one for each model. In the pipeline for the SVM model, pipe_svm, we first scale the numeric inputs using StandardScaler and then create the principal components using Principal Component Analysis (PCA); finally, a Support Vector Machine (SVM) model is built on this preprocessed dataset. Similarly, we construct a pipeline to create the KNN model, named pipe_knn, where only StandardScaler is used to preprocess the data before executing KNeighborsClassifier. Then, we create a pipeline for building a decision tree model, using the StandardScaler and MinMaxScaler methods to preprocess the data for the DecisionTreeClassifier method. The last model created using a pipeline is the random forest model, where StandardScaler and PCA are used to preprocess the data for the RandomForestClassifier method. The following code snippet creates these four pipelines:

# Construct svm pipeline
pipe_svm = Pipeline([('ss1', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('svm', svm.SVC(random_state=42))])

# Construct knn pipeline
pipe_knn = Pipeline([('ss2', StandardScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors=6, metric='euclidean'))])

# Construct DT pipeline
pipe_dt = Pipeline([('ss3', StandardScaler()),
                    ('minmax', MinMaxScaler()),
                    ('dt', tree.DecisionTreeClassifier(random_state=42))])

# Construct Random Forest pipeline
num_trees = 100
max_features = 1
pipe_rf = Pipeline([('ss4', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('rf', RandomForestClassifier(n_estimators=num_trees,
                                                  max_features=max_features))])

Next, we store the names of the pipelines in a dictionary, which is used to display the results:

pipe_dic = {0: 'K Nearest Neighbours', 1: 'Decision Tree', 2: 'Random Forest', 3: 'Support Vector Machines'}

Then, we list the four pipelines so that we can execute them iteratively:

pipelines = [pipe_knn, pipe_dt, pipe_rf, pipe_svm]

Now the complex structure of the whole pipeline is ready. The only things that remain are to fit the data to the pipelines, evaluate the results, and select the best model. In the following code snippet, we fit each of the four pipelines iteratively to the training dataset:

# Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)

Once the model fitting has executed successfully, we examine the accuracy of the four models using the following code snippet:

# Compare accuracies
for idx, val in enumerate(pipelines):
    print('%s pipeline test accuracy: %.3f' % (pipe_dic[idx], val.score(X_test, y_test)))

We can note from the results that the k-nearest neighbors and decision tree models lead the pack with a perfect accuracy of 100%. This is too good to believe and might be a result of using a small dataset and/or overfitting. We can use either of the two winning models, k-nearest neighbors (KNN) or the decision tree, for deployment. We can accomplish this using the following code snippet:

best_accuracy = 0
best_classifier = 0
best_pipeline = ''
for idx, val in enumerate(pipelines):
    if val.score(X_test, y_test) > best_accuracy:
        best_accuracy = val.score(X_test, y_test)
        best_pipeline = val
        best_classifier = idx
print('%s Classifier has the best accuracy of %.2f' % (pipe_dic[best_classifier], best_accuracy))

As the accuracies were similar for k-nearest neighbors and the decision tree, KNN was chosen as the best model because it was the first model in the pipelines list. However, at this stage, we could also use business rules or the execution cost to decide on the best model.

To summarize, we learned about building pipelines for ML systems. The concepts described in this article give you a foundation for creating pipelines. To gain a clearer understanding of the different aspects of automated machine learning, and how to incorporate automation tasks using practical datasets, do check out the book Hands-On Automated Machine Learning.

Read more
What is Automated Machine Learning (AutoML)?
5 ways Machine Learning is transforming digital marketing
How to improve interpretability of machine learning systems


Why Wall Street unfriended Facebook: Stocks fell $120 billion in market value after Q2 2018 earnings call

Natasha Mathur
27 Jul 2018
6 min read
After being found guilty of serving discriminatory advertisements on its platform earlier this week, Facebook hit yet another wall yesterday as its stock closed down 18.96% on Thursday, with shares trading at $176.26. This means the company lost around $120 billion in market value overnight, the largest one-day loss of value ever for a US-traded company; the previous record was Intel Corp's two-decade-old crash, in which Intel lost a little over $18 billion in one day. Despite 41.24% revenue growth compared to last year, this was Facebook's biggest stock market drop ever.

Facebook's market capitalization was worth $629.6 billion on Wednesday. After Facebook's earnings call concluded at the end of market trading on Thursday, its worth dropped to $510 billion at the close. As Facebook's shares continued to fall during Thursday's trading, its CEO, Mark Zuckerberg, was left with less than $70 billion, wiping out nearly $17 billion of his personal stake, according to Bloomberg. He was also demoted from third to sixth position on the Bloomberg Billionaires Index.

Active user growth starting to stagnate in mature markets

According to David Wehner, CFO at Facebook, "the Daily active users count on Facebook reached 1.47 billion, up 11% compared to last year, led by growth in India, Indonesia, and the Philippines. This number represents approximately 66% of the 2.23 billion monthly active users in Q2".

Facebook's daily active users

He also mentioned that "MAUs (monthly active users) were up 228M or 11% compared to last year. It is worth noting that MAU and DAU in Europe were both down slightly quarter-over-quarter due to the GDPR rollout, consistent with the outlook we gave on the Q1 call".

Facebook's monthly active users

In fact, Facebook has implemented several privacy policy changes in the last few months because of the European Union's General Data Protection Regulation (GDPR), and the company's earnings report revealed the effects of the GDPR rules.

Revenue growth rate is falling too

Speaking of revenue expectations, Wehner gave investors a heads-up that revenue growth rates will decline in the third and fourth quarters. He stated that the company's "total revenue growth rate decelerated approximately 7 percentage points in Q2 compared to Q1. Our total revenue growth rates will continue to decelerate in the second half of 2018, and we expect our revenue growth rates to decline by high single-digit percentages from prior quarters sequentially in both Q3 and Q4." Facebook further reiterated that these numbers won't get better anytime soon.

Facebook's Q2 2018 revenue

Wehner further explained the reasons for the decline in revenue: "There are several factors contributing to that deceleration..we expect the currency to be a slight headwind in the second half ...we plan to grow and promote certain engaging experiences like Stories that currently have lower levels of monetization. We are also giving people who use our services more choices around data privacy which may have an impact on our revenue growth".

Let's look at other performance indicators

Other financial highlights of Q2 2018 are as follows:

Mobile advertising revenue represented 91% of advertising revenue for Q2 2018, up from approximately 87% in Q2 2017.
Capital expenditures for Q2 2018 were $3.46 billion, up from $1.4 billion in Q2 2017.
Headcount was 30,275 as of June 30, an increase of 47% year-over-year.
Cash, cash equivalents, and marketable securities were $42.3 billion at the end of Q2 2018, up from $35.45 billion at the end of Q2 2017.

Wehner also mentioned that the company "continue to expect that full-year 2018 total expenses will grow in the range of 50-60% compared to last year. In addition to increases in core product development and infrastructure -- growth is driven by increasing investments -- safety & security, AR/VR, marketing, and content acquisition".

Another reason for the overall loss is that Facebook has been dealing with criticism for quite some time now over its content policies, its issues regarding users' private data, and its changing rules for advertisers. In fact, it is currently investigating the data analytics firm Crimson Hexagon over misuse of data. Mark Zuckerberg also said on a conference call with financial analysts that Facebook has been investing heavily in "safety, security, and privacy", and that this investment in security "will start to impact our profitability, we're starting to see that this quarter - we run this company for the long term, not for the next quarter".

Here's what the public feels about the recent wipe-out:

https://twitter.com/TeaPainUSA/status/1022586648155054081
https://twitter.com/alistairmilne/status/1022550933014753280

So, why did Facebook's stock crash?

As we can see, Facebook's performance in Q2 2018 was better than its performance in the same quarter last year as far as revenue goes. Ironically, scandals and lawsuits have had little impact on Facebook's growth. For example, Facebook recovered fully from the Cambridge Analytica scandal within two months as far as share prices are concerned. The Mueller indictment report released earlier this month managed to arrest growth for merely a couple of days before the company bounced back. The discriminatory advertising verdict against Facebook had no impact on its bullish growth earlier this week.

This brings us to conclude that public sentiment and market reactions to Facebook have very different underlying reasons. The market's strong reaction is mainly due to concerns over the slowdown in active user growth, the lack of monetization opportunities on the more popular Instagram platform, and Facebook's perceived inability to evolve successfully under new political and regulatory regimes such as the GDPR.

Wall Street has been indifferent to Facebook's long list of scandals, in some ways enabling the company's 'move fast and break things' approach. In his earnings call on Thursday, Zuckerberg hinted that Facebook may no longer be keen on 'growth at all costs' by saying things like "we're investing so much in security that it will significantly impact our profitability", with Wehner adding, "Looking beyond 2018, we anticipate that total expense growth will exceed revenue growth in 2019." And that has got Wall Street unfriending Facebook with just a click of a button!

Is Facebook planning to spy on you through your mobile's microphones?
Facebook to launch AR ads on its news feed to let you try on products virtually
Decoding the reasons behind Alphabet's record high earnings in Q2 2018