
How-To Tutorials

6719 Articles

Getting Started with Data Storytelling

Aaron Lazar
28 Jan 2018
11 min read
[box type="note" align="" class="" width=""]This article has been taken from the book Principles of Data Science, written by Sinan Ozdemir. It aims to practically introduce you to the different ways in which you can communicate or visualize your data to tell stories effectively.[/box] Communication matters Being able to conduct experiments and manipulate data in a coding language is not enough to conduct practical and applied data science. This is because data science is, generally, only as good as how it is used in practice. For instance, a medical data scientist might be able to predict the chance of a tourist contracting Malaria in developing countries with >98% accuracy, however, if these results are published in a poorly marketed journal and online mentions of the study are minimal, their groundbreaking results that could potentially prevent deaths would never see the true light of day. For this reason, communication of results through data storytelling is arguably as important as the results themselves. A famous example of poor management of distribution of results is the case of Gregor Mendel. Mendel is widely recognized as one of the founders of modern genetics. However, his results (including data and charts) were not well adopted until after his death. Mendel even sent them to Charles Darwin, who largely ignored Mendel's papers, which were written in unknown Moravian journals. Generally, there are two ways of presenting results: verbal and visual. Of course, both the verbal and visual forms of communication can be broken down into dozens of subcategories, including slide decks, charts, journal papers, and even university lectures. However, we can find common elements of data presentation that can make anyone in the field more aware and effective in their communication skills. Let's dive right into effective (and ineffective) forms of communication, starting with visuals. We’ll look at four basic types of graphs: scatter plots, line graphs, bar charts, histograms, and box plots. Scatter plots A scatter plot is probably one of the simplest graphs to create. It is made by creating two quantitative axes and using data points to represent observations. The main goal of a scatter plot is to highlight relationships between two variables and, if possible, reveal a correlation. For example, we can look at two variables: average hours of TV watched in a day and a 0-100 scale of work performance (0 being very poor performance and 100 being excellent performance). The goal here is to find a relationship (if it exists) between watching TV and average work performance. The following code simulates a survey of a few people, in which they revealed the amount of television they watched, on an average, in a day against a company-standard work performance metric: import pandas as pd hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5] This line of code is creating 14 sample survey results of people answering the question of how many hours of TV they watch in a day. work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72] This line of code is creating 14 new sample survey results of the same people being rated on their work performance on a scale from 0 to 100. 
For example, the first person watched 0 hours of TV a day and was rated 87/100 on their work, while the last person watched, on average, 5 hours of TV a day and was rated 72/100:

df = pd.DataFrame({'hours_tv_watched': hours_tv_watched, 'work_performance': work_performance})

Here, we are creating a DataFrame in order to ease our exploratory data analysis and make it easier to make a scatter plot:

df.plot(x='hours_tv_watched', y='work_performance', kind='scatter')

Now we are actually making our scatter plot. In the following plot, we can see that our axes represent the number of hours of TV watched in a day and the person's work performance metric. Each point on a scatter plot represents a single observation (in this case a person), and its location is a result of where the observation stands on each variable.

This scatter plot does seem to show a relationship, which implies that watching more TV in the day seems to affect our work performance. Of course, as we are now experts in statistics from the last two chapters, we know that this might not be causational. A scatter plot can only reveal a correlation or an association between two variables, not a causation. Advanced statistical tests, such as the ones we saw in Chapter 8, Advanced Statistics, might work to reveal causation. Later on in this chapter, we will see the damaging effects that trusting correlation might have.

Line graphs

Line graphs are perhaps one of the most widely used graphs in data communication. A line graph simply uses lines to connect data points and usually represents time on the x axis. Line graphs are a popular way to show changes in variables over time. The line graph, like the scatter plot, is used to plot quantitative variables.

As a great example, many of us wonder about the possible links between what we see on TV and our behavior in the world. A friend of mine once took this thought to an extreme: he wondered if he could find a relationship between the TV show The X-Files and the number of UFO sightings in the U.S. He found the number of sightings of UFOs per year and plotted them over time, and then added a quick graphic to ensure that readers would be able to identify the point in time when The X-Files premiered. It appears clear that right after 1993, the year of The X-Files premiere, the number of UFO sightings started to climb drastically. This graphic, albeit light-hearted, is an excellent example of a simple line graph. We are told what each axis measures, we can quickly see a general trend in the data, and we can identify the author's intent, which is to show a relationship between the number of UFO sightings and The X-Files premiere.

On the other hand, the following is a less impressive line chart. This line graph attempts to highlight the change in the price of gas by plotting three points in time. At first glance, it is not much different from the previous graph: we have time on the bottom x axis and a quantitative value on the vertical y axis. The (not so) subtle difference here is that the three points are equally spaced out on the x axis; however, if we read their actual time indications, they are not equally spaced out in time. A year separates the first two points, whereas a mere 7 days separates the last two points.

Bar charts

We generally turn to bar charts when trying to compare variables across different groups. For example, we can plot the number of countries per continent using a bar chart.
Note how the x axis does not represent a quantitative variable; in fact, when using a bar chart, the x axis is generally a categorical variable, while the y axis is quantitative. Note that, for this code, I am using the World Health Organization's report on alcohol consumption around the world by country (the plt calls assume matplotlib has been imported as matplotlib.pyplot):

import matplotlib.pyplot as plt

drinks = pd.read_csv('data/drinks.csv')
drinks.continent.value_counts().plot(kind='bar', title='Countries per Continent')
plt.xlabel('Continent')
plt.ylabel('Count')

The following graph shows us a count of the number of countries in each continent. We can see the continent code at the bottom of the bars, and the bar height represents the number of countries we have in each continent. For example, we see that Africa has the most countries represented in our survey, while South America has the least. In addition to the count of countries, we can also plot the average beer servings per continent using a bar chart, as shown:

drinks.groupby('continent').beer_servings.mean().plot(kind='bar')

Note how a scatter plot or a line graph would not be able to support this data because they can only handle quantitative variables; bar graphs have the ability to demonstrate categorical values. We can also use bar charts to graph variables that change over time, like a line graph.

Histograms

Histograms show the frequency distribution of a single quantitative variable by splitting up the data, by range, into equidistant bins and plotting the raw count of observations in each bin. A histogram is effectively a bar chart where the x axis is a bin (subrange) of values and the y axis is a count. As an example, I will import a store's daily number of unique customers, as shown:

rossmann_sales = pd.read_csv('data/rossmann.csv')
rossmann_sales.head()

Note how we have data for multiple stores (see the first column, Store). Let's subset this data for only the first store, as shown:

first_rossmann_sales = rossmann_sales[rossmann_sales['Store']==1]

Now, let's plot a histogram of the first store's customer count:

first_rossmann_sales['Customers'].hist(bins=20)
plt.xlabel('Customer Bins')
plt.ylabel('Count')

The x axis is now categorical in the sense that each category is a selected range of values; for example, 600-620 customers could be a category. The y axis, as in a bar chart, plots the number of observations in each category. In this graph, for example, one might take away the fact that, most of the time, the number of customers on any given day falls between 500 and 700. Altogether, histograms are used to visualize the distribution of values that a quantitative variable can take on.

Box plots

Box plots are also used to show a distribution of values. They are created by plotting the five-number summary, as follows:

The minimum value
The first quartile (the number that separates the lowest 25% of values from the rest)
The median
The third quartile (the number that separates the highest 25% of values from the rest)
The maximum value

In pandas, when we create box plots, the red line denotes the median, the top of the box (or the right edge if it is horizontal) is the third quartile, and the bottom (or left edge) of the box is the first quartile. The following is a series of box plots showing the distribution of beer consumption by continent:

drinks.boxplot(column='beer_servings', by='continent')

Now we can clearly see the distribution of beer consumption across the continents and how they differ. Africa and Asia have a much lower median of beer consumption than Europe or North America.
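The drinks and Rossmann CSV files referenced above are not bundled with this excerpt, so the sketch below uses simulated daily customer counts to reproduce the histogram-versus-box-plot comparison discussed next; the numbers are made up for illustration only:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated daily customer counts for a single store: mostly ordinary trading days,
# plus a block of zero-customer days standing in for days the store was closed
rng = np.random.default_rng(0)
customers = pd.Series(np.concatenate([
    rng.normal(580, 90, 780).clip(min=0),  # ordinary days
    np.zeros(160),                         # closed days
])).round()

fig, (ax_hist, ax_box) = plt.subplots(2, 1, figsize=(8, 6))
customers.hist(bins=20, ax=ax_hist)                 # shows the spread and the big zero bin
customers.plot(kind='box', vert=False, ax=ax_box)   # shows the five-number summary

print(customers.describe())  # min, quartiles (25%, 50%, 75%), and max
plt.show()
```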
Box plots also have the added bonus of being able to show outliers much better than a histogram, because the minimum and maximum are part of the box plot. Getting back to the customer data, let's look at the same store's customer numbers, but using a box plot:

first_rossmann_sales.boxplot(column='Customers', vert=False)

This is the exact same data as plotted earlier in the histogram; however, now it is shown as a box plot. For the purpose of comparison, I will show you both the graphs one after the other. Note how the x axis for each graph is the same, ranging from 0 to 1,200. The box plot is much quicker at giving us the center of the data (the red line is the median), while the histogram works much better at showing us how spread out the data is and where the biggest bins are. For example, the histogram reveals that there is a very large bin of zero people. This means that, for a little over 150 days of data, there were zero customers.

Note that we can get the exact numbers needed to construct a box plot using the describe method in pandas, as shown:

first_rossmann_sales['Customers'].describe()

min        0.000000
25%      463.000000
50%      529.000000
75%      598.750000
max     1130.000000

There we have it! We just learned data storytelling through various techniques: scatter plots, line graphs, bar charts, histograms, and box plots. Now you have the power to be creative in the way you tell the tales of your data! If you found our article useful, you can check out Principles of Data Science for more interesting data science tips and techniques.


How to integrate SharePoint with SQL Server Reporting Services

Kunal Chaudhari
27 Jan 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Dinesh Priyankara and Robert C. Cain, titled SQL Server 2016 Reporting Services Cookbook.This book will help you get up and running with the latest enhancements and advanced query and reporting feature in SQL Server 2016.[/box] Today we will learn the steps to integrate SharePoint in the SQL Server Reporting services. We will create a Reporting Services SharePoint application, and set it up in a way that we are able to view reports when they are uploaded to SharePoint. Getting ready For this, all you'll need is a SharePoint instance you can work with. Do make sure you have an administrative access to the SharePoint site. If you have an Azure account, free or paid, you could set up a test instance of SharePoint and use it to follow the instructions in this article. Note the setup of such an Azure instance is outside the scope of this article. In this article, we assume you are using an on premise SharePoint installation. How to do it… Open the SharePoint 2016 Central Administration web page. Click on Manage service applications under the Application Management area: 3. The Service Applications tab now appears at the top of the page. Click on the New menu: 4. In the menu, find and click on the option for SQL Server Reporting Services Service Application: 5. You'll now need to fill out the information for the service application. Start at the top by giving it a good name, here we are using SSRS_SharePoint. 6. Presumably this is a new install, so you'll have to take the Create new application pool option. Give it an appropriate name; in this example, we used SSRS_SharePoint_Pool. 7. Select a security account to run under. Here we selected an account set up by our Active Directory administrator, which has permissions to SQL Server where SSRS is installed. 8. Enter the name of the server which has SQL Server 2016 Reporting Services installed. In this example, our machine is ACSrv. 9. By default, SharePoint will create a name for the database that includes a GUID (a long string of letters and numbers). You should absolutely rename this to eliminate the GUID, but ensure the database name will be unique. In this example, we used ReportingService_SharePoint. 10. Review the information so that it resembles the following figure, but don't hit OK quite yet as there are few more pieces of information to fill out. Scroll down in the dialog to continue: 11. After the database name, you'll need to indicate the authentication method. Assuming the credentials you entered for the security account came from your Active Directory administrator, you can take the default of Windows authentication. 12. Place a check mark beside the instance of SharePoint to associate this SSRS application with. Here there is only one, SharePoint – 80. 13. Click OK to continue. Assuming all goes well, you should see the following confirmation dialog. If so, click OK to proceed: 14. Now that SharePoint is configured, you'll now need to provide additional information to SQL Server. That is the purpose of this final screen, Provision Subscriptions and Alerts. Select the Download Script button, and save the generated SQL file: 15. Pass the SQL file to a database administrator to execute, or open it in SSMS and execute it yourself, assuming you have administrative rights on the SQL Server. SharePoint uses the concept of Service Applications to manage items which run under the hood of SharePoint. 
SQL Server Reporting Services is one such service application. By integrating it as a service application, end users can upload, modify, and view SSRS reports right within SharePoint.

We began by generating a new Service Application and picking Reporting Services from the list. We then needed to tell SharePoint which SQL Server would host the database, as well as have a copy of Reporting Services for SharePoint installed. In addition, we also needed to provide security credentials for SharePoint to use to communicate with SQL Server.

As the final step, we needed to configure SQL Server to work with SharePoint. This was the purpose of the Provision Subscriptions and Alerts screen. Note that there is an option to fill out a user name and credential; clicking OK would then have immediately executed scripts against the target SQL Server. In most mid- to large-size corporations, however, there will be controls in place to prevent this type of thing. Most companies will require a DBA to review scripts, or at the very least you'll want to keep a copy of the script in your source control system to be able to track what changes were made to a SQL Server. Hence, we suggest taking the action laid out in this article, namely downloading the script and executing it manually in SQL Server Management Studio.

To test your setup, we suggest creating a new report with embedded data sources and datasets. Upload that report to the server and attempt to execute it; it should display correctly if your install went well.

If you enjoyed this excerpt, check out the book SQL Server 2016 Reporting Services Cookbook to learn more about handling security and configuring email with SharePoint using Reporting Services.


How to build a Gaussian Mixture Model

Gebin George
27 Jan 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Osvaldo Martin titled Bayesian Analysis with Python. This book will help you implement Bayesian analysis in your application and will guide you to build complex statistical problems using Python.[/box] Our article teaches you to build an end to end gaussian mixture model with a practical example. The general idea when building a finite mixture model is that we have a certain number of subpopulations, each one represented by some distribution, and we have data points that belong to those distribution but we do not know to which distribution each point belongs. Thus we need to assign the points properly. We can do that by building a hierarchical model. At the top level of the model, we have a random variable, often referred as a latent variable, which is a variable that is not really observable. The function of this latent variable is to specify to which component distribution a particular observation is assigned to. That is, the latent variable decides which component distribution we are going to use to model a given data point. In the literature, people often use the letter z to indicate latent variables. Let us start building mixture models with a very simple example. We have a dataset that we want to describe as being composed of three Gaussians. clusters = 3 n_cluster = [90, 50, 75] n_total = sum(n_cluster) means = [9, 21, 35] std_devs = [2, 2, 2] mix = np.random.normal(np.repeat(means, n_cluster),  np.repeat(std_devs, n_cluster)) sns.kdeplot(np.array(mix)) plt.xlabel('$x$', fontsize=14) In many real situations, when we wish to build models, it is often more easy, effective and productive to begin with simpler models and then add complexity, even if we know from the beginning that we need something more complex. This approach has several advantages, such as getting familiar with the data and problem, developing intuition, and avoiding choking us with complex models/codes that are difficult to debug. So, we are going to begin by supposing that we know that our data can be described using three Gaussians (or in general, k-Gaussians), maybe because we have enough previous experimental or theoretical knowledge to reasonably assume this, or maybe we come to that conclusion by eyeballing the data. We are also going to assume we know the mean and standard deviation of each Gaussian. Given this assumptions the problem is reduced to assigning each point to one of the three possible known Gaussians. There are many methods to solve this task. We of course are going to take the Bayesian track and we are going to build a probabilistic model. To develop our model, we can get ideas from the coin-flipping problem. Remember that we have had two possible outcomes and we used the Bernoulli distribution to describe them. Since we did not know the probability of getting heads or tails, we use a beta prior distribution. Our current problem with the Gaussians mixtures is similar, except that we now have k-Gaussian outcomes. The generalization of the Bernoulli distribution to k-outcomes is the categorical distribution and the generalization of the beta distribution is the Dirichlet distribution. This distribution may look a little bit weird at first because it lives in the simplex, which is like an n-dimensional triangle; a 1-simplex is a line, a 2-simplex is a triangle, a 3-simplex a tetrahedron, and so on. Why a simplex? 
Intuitively, because the output of this distribution is a k-length vector, whose elements are restricted to be positive and sum up to one. To understand how the Dirichlet generalizes the beta, let us first refresh a couple of features of the beta distribution. We use the beta for 2-outcome problems, one with probability p and the other 1-p. In this sense, we can think that the beta returns a two-element vector, [p, 1-p]. Of course, in practice, we omit 1-p because it is fully determined by p. Another feature of the beta distribution is that it is parameterized using two scalars, alpha and beta.

How do these features compare to the Dirichlet distribution? Let us think of the simplest Dirichlet distribution, one we could use to model a three-outcome problem. We get a Dirichlet distribution that returns a three-element vector [p, q, r], where r = 1 - (p + q). We could use three scalars to parameterize such a Dirichlet; however, that does not scale well to higher dimensions, so we just use a vector of concentration parameters, alpha, with length k, where k is the number of outcomes. Note that we can think of the beta and Dirichlet as distributions over probabilities. To get an idea about this distribution, pay attention to the following figure and try to relate each triangular subplot to a beta distribution with similar parameters. The preceding figure is the output of code written by Thomas Boggs, with just a few minor tweaks. You can find the code in the accompanying text; also check the Keep reading sections for details.

Now that we have a better grasp of the Dirichlet distribution, we have all the elements to build our mixture model. One way to visualize it is as a k-sided coin-flip model on top of a Gaussian estimation model; of course, instead of k-sided coins, we have k Gaussians. The rounded-corner box in the model diagram indicates that we have k Gaussian likelihoods (with their corresponding priors), and the categorical variables decide which of them we use to describe a given data point. Remember, we are assuming we know the means and standard deviations of the Gaussians; we just need to assign each data point to one Gaussian. One detail of the following model is that we have used two samplers, Metropolis and ElemwiseCategorical, the latter of which is specially designed to sample discrete variables (PyMC3 is assumed to be imported as pm):

import pymc3 as pm

with pm.Model() as model_kg:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.math.constant([10, 20, 35])
    y = pm.Normal('y', mu=means[category], sd=2, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[p])
    trace_kg = pm.sample(10000, step=[step1, step2])

chain_kg = trace_kg[1000:]
varnames_kg = ['p']
pm.traceplot(chain_kg, varnames_kg)

Now that we know the skeleton of a Gaussian mixture model, we are going to add a complexity layer and estimate the parameters of the Gaussians. We are going to assume three different means and a single shared standard deviation. As usual, the model translates easily to PyMC3 syntax.
with pm.Model() as model_ug:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.Normal('means', mu=[10, 20, 35], sd=2, shape=clusters)
    sd = pm.HalfCauchy('sd', 5)
    y = pm.Normal('y', mu=means[category], sd=sd, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[means, sd, p])
    trace_ug = pm.sample(10000, step=[step1, step2])

Now we explore the trace we got:

chain = trace_ug[1000:]
varnames = ['means', 'sd', 'p']
pm.traceplot(chain, varnames)

And a tabulated summary of the inference:

pm.df_summary(chain, varnames)

          mean       sd        mc_error  hpd_2.5    hpd_97.5
means__0  21.053935  0.310447  0.012280  20.495889  21.735211
means__1  35.291631  0.246817  0.008159  34.831048  35.781825
means__2   8.956950  0.235121  0.005993   8.516094   9.429345
sd         2.156459  0.107277  0.002710   1.948067   2.368482
p__0       0.235553  0.030201  0.000793   0.179247   0.297747
p__1       0.349896  0.033905  0.000957   0.281977   0.412592
p__2       0.347436  0.032414  0.000942   0.286669   0.410189

Now we are going to do a posterior predictive check to see what our model learned from the data:

ppc = pm.sample_ppc(chain, 50, model_ug)
for i in ppc['y']:
    sns.kdeplot(i, alpha=0.1, color='b')
sns.kdeplot(np.array(mix), lw=2, color='k')
plt.xlabel('$x$', fontsize=14)

Notice how the uncertainty, represented by the lighter blue lines, is smaller for the smaller and larger values of x and is higher around the central Gaussian. This makes intuitive sense, since the regions of higher uncertainty correspond to the regions where the Gaussians overlap, and hence it is harder to tell whether a point belongs to one or the other Gaussian. I agree that this is a very simple problem and not that much of a challenge, but it is a problem that contributes to our intuition and a model that can be easily applied or extended to more complex problems.

We saw how to build a Gaussian mixture model using a very basic model as an example, which can be applied to solve more complex problems. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to understand the Bayesian framework and solve complex statistical problems using Python.


Working with Kibana in Elasticsearch 5.x

Savia Lobo
26 Jan 2018
9 min read
[box type="note" align="" class="" width=""]Below given post is a book excerpt from Mastering Elasticsearch 5.x written by  Bharvi Dixit. This book introduces you to the new features of Elasticsearch 5.[/box] The following article showcases Kibana, a tool belongs to the Elastic Stack, and used for visualization and exploration of data residing in Elasticsearch. One can install Kibana and start to explore Elasticsearch indices in minutes — no code, no additional infrastructure required. If you have been using an older version of Kibana, you will notice that it has transformed altogether in terms of functionality. Note: This URL has all the latest changes done in Kibana 5.0: https://www.elastic.co/guide/en/kibana/current/breaking-changes- 5.0.html. Installing Kibana Similar to other Elastic Stack tools, you can visit the following URL to download Kibana 5.0.0, as per your operating system distribution: https://www.elastic.co/downloads/past-releases/kibana-5-0-0 An example of downloading and installing Kibana from the Debian package. First of all, download the package: https://artifacts.elastic.co/downloads/kibana/kibana-5.0.0-amd64.deb Then install it using the following command: sudo dpkg -i kibana-5.0.0-amd64.deb Kibana configuration Once installed, you can find the Kibana configuration file, kibana.yml, inside the/etc/kibana/ directory. All the settings related to Kibana are done only in this file. There is a big list of configuration options available inside the Kibana settings which you can learn about here: https://www.elastic.co/guide/en/kibana/current/settings.html. Starting Kibana Kibana can be started using the following command and it will be started on port 5601 bounded on localhost by default: sudo service kibana start Exploring and visualizing data on Kibana Now all the components of Elastic Stack are installed and configured, we can start exploring the awesomeness of Kibana visualizations. Kibana 5.x is supported on almost all of the latest major web browsers, including Internet Explorer 11+. To load Kibana, you just need to type localhost:5601 in your web browser. You will see different options available in the left panel of the screen, as shown in following figure: These different options are used for the following purposes: Discover: Used for data exploration where you get the access of each field along with a default time. Visualize: Used for creating visualizations of the data in your Elasticsearch indices. You can then build dashboards that display related visualizations. Dashboard: Used to display a collection of saved visualizations. Timelion: A time series data visualizer that enables you to combine totally independent data sources within a single visualization. It is based on simple expression  language. Management: A place where you perform your runtime configuration of Kibana, including both the initial setup and ongoing configuration of index patterns, advanced settings that tweak the behaviors of Kibana itself and saved objects. Dev Tools: Contains the console which is based on the Sense plugin and allows you to write Elasticsearch commands in one tab and see the responses of those commands in the other tab. 
Understanding the Kibana Management screen

The Management screen has three tabs available:

Index Patterns: For selecting and configuring index names
Saved Objects: Where all of your saved visualizations, searches, and dashboards are located
Advanced Settings: Contains advanced settings of Kibana

As you can see on the Management screen, the very first tab is for Index Patterns. Kibana asks you to configure an index pattern so that it can load all the mappings and settings from the defined index. It defaults to logstash-*; you can add as many index patterns or absolute index names as you want and can select them while creating a visualization. Since we do have an index already available with the logstash-* pattern, when you click on the Time-field name drop-down list, you will find that it shows you two fields, @timestamp and received_at, which are of the date type, as shown in the following screenshot.

We will select the @timestamp field and hit the Create button. As soon as you do so, the following screen appears. In this screenshot, you can see that Kibana has loaded all the mappings from our Logstash index. In addition, you can see three labels: blue (for marking this index as the default), yellow (for reloading the mappings; this is needed if you have updated the mapping after selecting the index pattern), and red (for deleting this index pattern altogether from Kibana).

The second tab on the Management screen is about Saved Objects, which contains all of your saved visualizations, searches, and dashboards, as you can see in the following screenshot. Please note that you can see here the dashboards and visualizations imported from Metricbeat, which we did a while ago. The third option is for Advanced Settings, and you should not play with the settings shown on this page if you are not aware of their tweaking effects.

Discovering data on Kibana

When you move to the Discover page of Kibana, you will see a screen similar to the following.

Setting the time range and auto-refresh interval

Please note that Kibana by default loads the data of the last 15 minutes, which you can change by clicking on the clock sign in the top-right corner of the screen and selecting the desired time range, as shown in the following screenshot. One more thing to look out for is that, after clicking on this clock sign, apart from the time-based settings, you will see one more option in the top corner with the name Auto-refresh. This setting tells Kibana how often it needs to query Elasticsearch. When you click on this setting, you will get the option either to turn off auto-refresh completely or to select the desired time interval.

Adding fields for exploration and using the search panel

As you can see in the following screenshot, you have all your fields available inside your index. On the Visualization screen, by default Kibana shows the timestamp and _source fields, but you can add your selected fields from the left panel by just moving the cursor over them and then clicking Add. Similarly, if you want to remove a field from the column, just move the cursor to the field's name on the column heading and click on the cross icon. In addition, Kibana also provides you with a search panel in which you can write queries. For example, in the following screenshot, I have searched for the logstash keyword inside the syslog_message field.
When you hit the search button, the search text gets highlighted inside the rendered responses.

Exploring more options on the Visualization page

On Kibana, you will see lots of small arrow signs for opening or collapsing sections/settings. You will see one of these arrows in the following image, in the bottom-left corner (I have also added custom text on the image just beside the arrow). When you click on this arrow, the time series histogram gets hidden and you get to see the following screen, which contains multiple properties: Table, which contains the histogram data in tabular format; Request, which contains the actual JSON query sent to Elasticsearch; Response, which contains the JSON response returned from Elasticsearch; and Statistics, which shows the query execution time and the number of hits matching the query.

Using the Dashboard screen to create/load dashboards

When you click on the Dashboard panel, you first get a blank screen with options such as New, for creating a dashboard, and Open, for opening an existing dashboard, along with a few more. If you are creating a dashboard from scratch, you will have to add previously built visualizations onto it and then save it under some name. But since we already have a dashboard available, which we imported using Metricbeat, we will click Open, and you will see something similar to the following screenshot on your Kibana page. Please note that if you do not have Apache installed on your system, selecting the first option, Metricbeat – Apache HTTPD server status, will load a blank dashboard. You can select any other title; for example, if you select the second option, you will see a dashboard similar to the following.

Editing an existing visualization

When you move the cursor over the visualizations presented on the dashboard, you will notice that a pencil sign appears, as shown in the following screenshot. When you click on that pencil sign, it will open that particular visualization inside the visualization editor panel, as shown in the following screenshot. Here you can edit the properties and either override the same visualization or save it under some other name. Please note that if you want to create a visualization from scratch, just click on the Visualize option on the left-hand side and it will guide you through the steps of creating the visualization. Kibana provides almost 10 types of visualizations. To get the details about working with each type of visualization, please follow the official documentation of Kibana at this link: https://www.elastic.co/guide/en/kibana/master/createvis.html.

Using Sense

Inside the Dev Tools option, you can find the console for Kibana, which was previously known as the Sense editor. This is one of the most wonderful tools to help you speed up the learning curve of Elasticsearch, since it provides auto-suggestions for all the endpoints and queries, as shown in the following screenshot. You will see that the Kibana console is divided into two parts; the left part is where you write your queries/requests, and after clicking the green arrow, the response from Elasticsearch is rendered inside the right-hand panel.

To summarize, we explained how to work with the Kibana tool in Elasticsearch 5.x. We covered the installation and configuration of Kibana, and then moved ahead with exploring and visualizing data using Kibana.
If you enjoyed this excerpt and want to get an understanding of how you can scale your Elasticsearch cluster, contextualize it, and improve its performance, check out the book Mastering Elasticsearch 5.x.


What are discriminative and generative models and when to use which?

Gebin George
26 Jan 2018
5 min read
[box type="note" align="" class="" width=""]Our article is a book excerpt from Bayesian Analysis with Python written Osvaldo Martin. This book covers the bayesian framework and the fundamental concepts of bayesian analysis in detail. It will help you solve complex statistical problems by leveraging the power of bayesian framework and Python.[/box] From this article you will explore the fundamentals and implementation of two strong machine learning models - discriminative and generative models. We have also included examples to help you understand the difference between these models and how they operate. In general cases, we try to directly compute p(|), that is, the probability of a given class knowing, which is some feature we measured to members of that class. In other words, we try to directly model the mapping from the independent variables to the dependent ones and then use a threshold to turn the (continuous) computed probability into a boundary that allows us to assign classes.This approach is not unique. One alternative is to model first p(|), that is, the distribution of  for each class, and then assign the classes. This kind of model is called a generative classifier because we are creating a model from which we can generate samples from each class. On the contrary, logistic regression is a type of discriminative classifier since it tries to classify by discriminating classes but we cannot generate examples from each class. We are not going to go into much detail here about generative models for classification, but we are going to see one example that illustrates the core of this type of model for classification. We are going to do it for two classes and only one feature, exactly as the first model we built in this chapter, using the same data. Following is a PyMC3 implementation of a generative classifier. From the code, you can see that now the boundary decision is defined as the average between both estimated Gaussian means. This is the correct boundary decision when the distributions are normal and their standard deviations are equal. These are the assumptions made by a model known as linear discriminant analysis (LDA). Despite its name, the LDA model is generative: with pm.Model() as lda:     mus = pm.Normal('mus', mu=0, sd=10, shape=2)    sigmas = pm.Uniform('sigmas', 0, 10)  setosa = pm.Normal('setosa', mu=mus[0], sd=sigmas[0], observed=x_0[:50])      versicolor = pm.Normal('setosa', mu=mus[1], sd=sigmas[1], observed=x_0[50:])      bd = pm.Deterministic('bd', (mus[0]+mus[1])/2)    start = pm.find_MAP() step = pm.NUTS() trace = pm.sample(5000, step, start) Now we are going to plot a figure showing the two classes (setosa = 0 and versicolor = 1) against the values for sepal length, and also the boundary decision as a red line and the 95% HPD interval for it as a semitransparent red band. As you may have noticed, the preceding figure is pretty similar to the one we plotted at the beginning of this chapter. Also check the values of the boundary decision in the following summary: pm.df_summary(trace_lda): mean sd mc_error hpd_2.5 hpd_97.5 mus__0 5.01 0.06 8.16e-04 4.88 5.13 mus__1 5.93 0.06 6.28e-04 5.81 6.06 sigma 0.45 0.03 1.52e-03 0.38 0.51 bd 5.47 0.05 5.36e-04 5.38 5.56 Both the LDA model and the logistic regression gave similar results: The linear discriminant model can be extended to more than one feature by modeling the classes as multivariate Gaussians. 
Also, it is possible to relax the assumption of the classes sharing a common variance (or common covariance matrices when working with more than one feature). This leads to a model known as quadratic discriminant analysis (QDA), since now the decision boundary is not linear but quadratic. In general, an LDA or QDA model will work better than a logistic regression when the features we are using are more or less Gaussian distributed, and the logistic regression will perform better in the opposite case. One advantage of the generative model for classification is that it may be easier or more natural to incorporate prior information; for example, we may have information about the mean and variance of the data to incorporate into the model.

It is important to note that the boundary decisions of LDA and QDA are known in closed form and hence are usually used in that way. To use LDA for two classes and one feature, we just need to compute the mean of each distribution and average those two values to get the boundary decision. Notice that in the preceding model we did just that, but in a more Bayesian way: we estimated the parameters of the two Gaussians and then plugged those estimates into a formula. Where do such formulae come from? Well, without entering into details, to obtain that formula we must assume that the data is Gaussian distributed, and hence such a formula will only work if the data does not deviate drastically from normality. Of course, we may hit a problem where we want to relax the normality assumption, for example by using a Student's t-distribution (or a multivariate Student's t-distribution, or something else). In such a case, we can no longer use the closed form for LDA (or QDA); nevertheless, we can still compute a decision boundary numerically using PyMC3.

To sum up, we saw the basic idea behind generative and discriminative models and their practical use cases. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to solve complex statistical problems with the Bayesian framework and Python.


How to execute a search query in ElasticSearch

Sugandha Lahoti
25 Jan 2018
9 min read
[box type="note" align="" class="" width=""]This post is an excerpt from a book authored by Alberto Paro, titled Elasticsearch 5.x Cookbook. It has over 170 advance recipes to search, analyze, deploy, manage, and monitor data effectively with Elasticsearch 5.x[/box] In this article we see how to execute and view a search operation in ElasticSearch. Elasticsearch was born as a search engine. It’s main purpose is to process queries and give results. In this article, we'll see that a search in Elasticsearch is not only limited to matching documents, but it can also calculate additional information required to improve the search quality. All the codes in this article are available on PacktPub or GitHub. These are the scripts to initialize all the required data. Getting ready You will need an up-and-running Elasticsearch installation. To execute curl via a command line, you will also need to install curl for your operating system. To correctly execute the following commands you will need an index populated with the chapter_05/populate_query.sh script available in the online code. The mapping used in all the article queries and searches is the following: { "mappings": { "test-type": { "properties": { "pos": { "type": "integer", "store": "yes" }, "uuid": { "store": "yes", "type": "keyword" }, "parsedtext": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text" }, "name": { "term_vector": "with_positions_offsets", "store": "yes", "fielddata": true, "type": "text", "fields": { "raw": { "type": "keyword" } } }, "title": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text", "fielddata": true, "fields": { "raw": { "type": "keyword" } } } } }, "test-type2": { "_parent": { "type": "test-type" } } } } How to do it To execute the search and view the results, we will perform the following steps: From the command line, we can execute a search as follows: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{"query":{"match_all":{}}}' In this case, we have used a match_all query that means return all the documents.    If everything works, the command will return the following: { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 1.0, "hits" : [ { "_index" : "test-index", "_type" : "test-type", "_id" : "1", "_score" : 1.0, "_source" : {"position": 1, "parsedtext": "Joe Testere nice guy", "name": "Joe Tester", "uuid": "11111"} }, { "_index" : "test-index", "_type" : "test-type", "_id" : "2", "_score" : 1.0, "_source" : {"position": 2, "parsedtext": "Bill Testere nice guy", "name": "Bill Baloney", "uuid": "22222"} }, { "_index" : "test-index", "_type" : "test-type", "_id" : "3", "_score" : 1.0, "_source" : {"position": 3, "parsedtext": "Bill is notn nice guy", "name": "Bill Clinton", "uuid": "33333"} } ] } }    These results contain a lot of information: took is the milliseconds of time required to execute the query. time_out indicates whether a timeout occurred during the search. This is related to the timeout parameter of the search. If a timeout occurs, you will get partial or no results. _shards is the status of shards divided into: total, which is the number of shards. successful, which is the number of shards in which the query was successful. failed, which is the number of shards in which the query failed, because some error or exception occurred during the query. 
hits are the results, which are composed of the following:

total is the number of documents that match the query.
max_score is the match score of the first document. It is usually one if no match scoring was computed, for example when sorting or filtering.
hits is the list of result documents.

The resulting documents have a lot of fields that are always available and others that depend on the search parameters. The most important fields are as follows:

_index: The index that contains the document
_type: The type of the document
_id: This is the ID of the document
_source (this is the default field returned, but it can be disabled): The document source
_score: This is the query score of the document
sort: If the document is sorted, the values that were used for sorting
highlight: Highlighted segments, if highlighting was requested
fields: Some fields can be retrieved without needing to fetch the whole source object

How it works

The HTTP method to execute a search is GET (although POST also works); the REST endpoints are as follows:

http://<server>/_search
http://<server>/<index_name(s)>/_search
http://<server>/<index_name(s)>/<type_name(s)>/_search

Note: Not all HTTP clients allow you to send data via a GET call, so the best practice, if you need to send body data, is to use the POST call.

Multiple indices and types are comma-separated. If an index or a type is defined, the search is limited only to them. One or more aliases can be used as index names. The core query is usually contained in the body of the GET/POST call, but a lot of options can also be expressed as URI query parameters, such as the following:

q: This is the query string for simple string queries, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=uuid:11111'
df: This is the default field to be used within the query, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?df=uuid&q=11111'
from (the default value is 0): The start index of the hits.
size (the default value is 10): The number of hits to be returned.
analyzer: The default analyzer to be used.
default_operator (the default value is OR): This can be set to AND or OR.
explain: This allows the user to return information about how the score is calculated, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&explain=true'
stored_fields: This allows the user to define the fields that must be returned, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&stored_fields=name'
sort (the default value is score): This allows the user to change the order of the documents. Sort is ascending by default; if you need to change the order, add desc to the field, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?sort=name.raw:desc'
timeout (not active by default): This defines the timeout for the search. Elasticsearch tries to collect results until the timeout. If the timeout is fired, all the hits accumulated so far are returned.
search_type: This defines the search strategy. A reference is available in the online Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html.
track_scores (the default value is false): If true, this tracks the score and allows it to be returned with the hits. It's used in conjunction with sort, because sorting by default prevents the return of a match score.
pretty (the default value is false): If true, the results will be pretty printed.
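If you prefer driving these searches from Python rather than curl, the same URI parameters can be passed with the requests library. This is just an illustrative sketch against the test-index/test-type index used above, not code from the book:

```python
import requests

# Query-string search using URI parameters, mirroring the curl examples above
resp = requests.get(
    'http://127.0.0.1:9200/test-index/test-type/_search',
    params={
        'q': 'parsedtext:joe',     # simple query-string query
        'size': 5,                 # return at most five hits
        'sort': 'name.raw:desc',   # sort on the keyword sub-field, descending
        'pretty': 'true',
    },
    timeout=10,
)

hits = resp.json()['hits']
print(hits['total'])
print([h['_source']['name'] for h in hits['hits']])
```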
Generally, the query contained in the body of the search is a JSON object. The body of the search is the core of Elasticsearch's search functionality; the list of search capabilities extends with every release. For the current version (5.x) of Elasticsearch, the available parameters are as follows:

query: This contains the query to be executed. Later in this chapter, we will see how to create different kinds of queries to cover several scenarios.
from: This allows the user to control pagination. The from parameter defines the start position of the hits to be returned (the default is 0) and size the number of hits (the default is 10).

Note: Pagination is applied to the currently returned search results. Firing the same query can bring different results if a lot of records have the same score or a new document is ingested. If you need to process all the result documents without repetition, you need to execute scan or scroll queries.

sort: This allows the user to change the order of the matched documents.
post_filter: This allows the user to filter out the query results without affecting the aggregation count. It's usually used for filtering by facet values.
_source: This allows the user to control the returned source. It can be disabled (false), partially returned (obj.*), or use multiple exclude/include rules. This functionality can be used instead of fields to return values (for complete coverage of this, take a look at the online Elasticsearch reference at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-source-filtering.html).
fielddata_fields: This allows the user to return a field data representation of the field.
stored_fields: This controls the fields to be returned.

Note: Returning only the required fields reduces network and memory usage, improving performance. The suggested way to retrieve custom fields is to use the _source filtering function, because it doesn't need to use Elasticsearch's extra resources.

aggregations/aggs: These control the aggregation layer analytics. These will be discussed in the next chapter.
index_boost: This allows the user to define a per-index boost value. It is used to increase/decrease the score of results in boosted indices.
highlighting: This allows the user to define fields and settings to be used for calculating a query abstract.
version (the default value is false): This adds the version of a document to the results.
rescore: This allows the user to define an extra query to be used in the score to improve the quality of the results. The rescore query is executed on the hits that match the first query and filter.
min_score: If this is given, all the result documents that have a score lower than this value are rejected.
explain: This returns information on how the TF/IDF score is calculated for a particular document.
script_fields: This defines a script that computes extra fields via scripting to be returned with a hit.
suggest: If given a query and a field, this returns the most significant terms related to this query. This parameter allows the user to implement a Google-like "did you mean" functionality.
search_type: This defines how Elasticsearch should process a query.
scroll: This controls the scrolling in scroll/scan queries. The scroll allows the user to have an Elasticsearch equivalent of a DBMS cursor.
_name: This returns, for every hit, the names of the named queries it matches. It's very useful if you have a Boolean query and you want to know which named query matched.
search_after: This allows the user to skip results, using the most efficient way of scrolling.
preference: This allows the user to select which shard(s) to use for executing the query.

We saw how to execute a search in Elasticsearch and also learned how it works. To learn more about performing other operations in Elasticsearch, check out the book Elasticsearch 5.x Cookbook.

Implementing Principal Component Analysis with R

Amey Varangaonkar
24 Jan 2018
6 min read
[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Mastering Text Mining with R, written by Ashish Kumar and Avinash Paul. This book gives a comprehensive view of the text mining process and how you can leverage the power of R to analyze textual data and get unique insights out of it.[/box] In this article, we aim to explain the concept of dimensionality reduction, or variable reduction, using Principal Component Analysis. Principal Component Analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space into a smaller subset to decrease computational cost. PCA helps in computing new features, which are called principal components; these principal components are uncorrelated linear combinations of the original features projected in the direction of higher variability. The important point is to map the set of features into a matrix, M, and compute the eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping. Eigenvalues provide the length and magnitude of eigenvectors where such transformations occur. Eigenvectors with greater eigenvalues are selected in the new feature space because they enclose more information than eigenvectors with lower eigenvalues for a data distribution. The first principle component has the greatest possible variance, that is, the largest eigenvalues compared with the next principal component uncorrelated, relative to the first PC. The nth PC is the linear combination of the maximum variance that is uncorrelated with all previous PCs. PCA comprises of the following steps: Compute the n-dimensional mean of the given dataset. Compute the covariance matrix of the features. Compute the eigenvectors and eigenvalues of the covariance matrix. Rank/sort the eigenvectors by descending eigenvalue. Choose x eigenvectors with the largest eigenvalues. Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space. PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis. Using R for PCA R also has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). These two functions expect the dataset to be organized with variables in columns and observations in rows and has a structure like a data frame. They also return the new data in the form of a data frame, and the principal components are given in columns. prcomp() and princomp() are similar functions used for accomplishing PCA; they have a slightly different implementation for computing PCA. Internally, the princomp() function performs PCA using eigenvectors. The prcomp() function uses a similar technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. Each function returns a list whose class is prcomp() or princomp(). 
Using R for PCA

R has two inbuilt functions for performing PCA: prcomp() and princomp(). Both functions expect the dataset to be organized with variables in columns and observations in rows, in a structure like a data frame. They also return the new data in the form of a data frame, with the principal components given in columns. prcomp() and princomp() are similar functions with slightly different implementations: internally, princomp() performs PCA using eigendecomposition, while prcomp() uses a related technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. Each function returns a list whose class is prcomp or princomp, respectively. The information returned by the two functions, and the terminology each uses, differ slightly (a comparison table is given in the book). Here is a list of the functions available in different R packages for performing PCA:

- PCA(): FactoMineR package
- acp(): amap package
- prcomp(): stats package
- princomp(): stats package
- dudi.pca(): ade4 package
- pcaMethods: this package from Bioconductor has various convenient methods to compute PCA

Understanding the FactoMineR package

FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package deal not only with quantitative data but also with categorical data. Apart from PCA, correspondence and multiple correspondence analyses can also be performed using this package:

library(FactoMineR)
data <- replicate(10, rnorm(1000))
result.pca = PCA(data[,1:9], scale.unit=TRUE, graph=T)
print(result.pca)

The analysis was performed on 1,000 individuals, described by nine variables. The results are available in the returned object, which includes, for each component, the eigenvalue, the percentage of variance, and the cumulative percentage of variance.

Amap package

Amap is another package in the R environment that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which performs PCA on a data frame. This function is akin to princomp() and prcomp(), except that it has a slightly different graphical representation. For more intricate details, refer to the CRAN resource page: https://cran.r-project.org/web/packages/amap/amap.pdf

library(amap)
acp(data, center=TRUE, reduce=TRUE)

Additionally, weight vectors can also be provided as an argument. We can perform a robust PCA by using the acpgen function in the amap package:

acpgen(data, h1, h2, center=TRUE, reduce=TRUE, kernel="gaussien")
K(u, kernel="gaussien")
W(x, h, D=NULL, kernel="gaussien")
acprob(x, h, center=TRUE, reduce=TRUE, kernel="gaussien")

Proportion of variance

We want to construct components and choose, from among them, the minimum number of components that explains the variance in the data with high confidence. R provides the prcomp() function to estimate principal components. Let's learn how to use this function to estimate the proportion of variance explained by each component:

pca_base <- prcomp(data)
print(pca_base)

The pca_base object contains the standard deviations and rotations of the vectors. The rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains:

pr_variance <- (pca_base$sdev^2/sum(pca_base$sdev^2))*100
pr_variance
[1] 11.678126 11.301480 10.846161 10.482861 10.176036  9.605907  9.498072
[8]  9.218186  8.762572  8.430598

pr_variance signifies the proportion of variance explained by each component in descending order of magnitude. Let's calculate the cumulative proportion of variance for the components:

cumsum(pr_variance)
[1]  11.67813  22.97961  33.82577  44.30863  54.48467  64.09057  73.58864
[8]  82.80683  91.56940 100.00000

Components 1-8 together explain about 82% of the variance in the data.

Scree plot

If you wish to plot the variances against the number of components, you can use the screeplot function on the fitted model:

screeplot(pca_base)

To summarize, we saw how easy it is to implement PCA using the rich functionality offered by different R packages. If this article has caught your interest, make sure to check out Mastering Text Mining with R, which contains many interesting techniques for text mining and natural language processing using R.
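One last practical note, added to this excerpt rather than taken from it: assuming the same pca_base object fitted above, the proportions computed by hand can also be read directly from base R helpers:

summary(pca_base)                     # standard deviation, proportion of variance and cumulative proportion per component
screeplot(pca_base, type = "lines")   # the same scree plot drawn as a line chart
biplot(pca_base)                      # observations and variable loadings on the first two components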


Implement Named Entity Recognition (NER) using OpenNLP and Java

Pravin Dhandre
22 Jan 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Richard M. Reese and Jennifer L. Reese titled Java for Data Science. This book provides in-depth understanding of important tools and proven techniques used across data science projects in a Java environment.[/box] In this article, we are going to show a Java implementation of an Information Extraction (IE) task that identifies what a document is all about. With this task you will learn how to enhance search retrieval and boost the ranking of your document in search results.

To begin with, let's understand what Named Entity Recognition (NER) is all about. It refers to classifying elements of a document or a text, such as finding people, locations, and things. Given a text segment, we may want to identify all the names of people present. However, this is not always easy, because a name such as Rob may also be used as a verb. In this section, we will demonstrate how to use OpenNLP's TokenNameFinderModel class to find names and locations in text. While there are other entities we may want to find, this example will demonstrate the basics of the technique.

We begin with names. Most names occur within a single line. We do not want to use multiple lines because an entity such as a state might inadvertently be identified incorrectly. Consider the following sentences:

Jim headed north. Dakota headed south.

If we ignored the period, then the state of North Dakota might be identified as a location, when in fact it is not present.

Using OpenNLP to perform NER

We start our example with a try-catch block to handle exceptions. OpenNLP uses models that have been trained on different sets of data. In this example, the en-token.bin and en-ner-person.bin files contain the models for the tokenization of English text and for English name elements, respectively. These files can be downloaded from http://opennlp.sourceforge.net/models-1.5/. However, the IO stream used here is standard Java:

try (InputStream tokenStream = new FileInputStream(new File("en-token.bin"));
     InputStream personModelStream = new FileInputStream(new File("en-ner-person.bin"))) {
    ...
} catch (Exception ex) {
    // Handle exceptions
}

An instance of the TokenizerModel class is initialized using the token stream. This instance is then used to create the actual TokenizerME tokenizer. We will use this instance to tokenize our sentence:

TokenizerModel tm = new TokenizerModel(tokenStream);
TokenizerME tokenizer = new TokenizerME(tm);

The TokenNameFinderModel class is used to hold a model for name entities. It is initialized using the person model stream. An instance of the NameFinderME class is created using this model since we are looking for names:

TokenNameFinderModel tnfm = new TokenNameFinderModel(personModelStream);
NameFinderME nf = new NameFinderME(tnfm);

To demonstrate the process, we will use the following sentence. We then convert it to a series of tokens using the tokenizer's tokenize method:

String sentence = "Mrs. Wilson went to Mary's house for dinner.";
String[] tokens = tokenizer.tokenize(sentence);

The Span class holds information regarding the positions of entities. The find method will return the position information, as shown here:

Span[] spans = nf.find(tokens);

This array holds information about person entities found in the sentence. We then display this information as shown here:

for (int i = 0; i < spans.length; i++) {
    out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
}

The output for this sequence is as follows.
Notice that it identifies the last name of Mrs. Wilson but not the "Mrs.":

[1..2) person - Wilson
[4..5) person - Mary

Once these entities have been extracted, we can use them for specialized analysis.

Identifying location entities

We can also find other types of entities, such as dates and locations. In the following example, we find locations in a sentence. It is very similar to the previous person example, except that an en-ner-location.bin file is used for the model:

try (InputStream tokenStream = new FileInputStream("en-token.bin");
     InputStream locationModelStream = new FileInputStream(new File("en-ner-location.bin"))) {
    TokenizerModel tm = new TokenizerModel(tokenStream);
    TokenizerME tokenizer = new TokenizerME(tm);
    TokenNameFinderModel tnfm = new TokenNameFinderModel(locationModelStream);
    NameFinderME nf = new NameFinderME(tnfm);
    sentence = "Enid is located north of Oklahoma City.";
    String tokens[] = tokenizer.tokenize(sentence);
    Span spans[] = nf.find(tokens);
    for (int i = 0; i < spans.length; i++) {
        out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
    }
} catch (Exception ex) {
    // Handle exceptions
}

With the sentence defined previously, the model was only able to find the second city, as shown here. This is likely due to the confusion that arises with the name Enid, which is both the name of a city and a person's name:

[5..7) location - Oklahoma

Suppose we use the following sentence:

sentence = "Pond Creek is located north of Oklahoma City.";

Then we get this output:

[1..2) location - Creek
[6..8) location - Oklahoma

Unfortunately, it has missed the town of Pond Creek. NER is a useful tool for many applications, but like many techniques, it is not always foolproof. The accuracy of the NER approach presented, and of many of the other NLP examples, will vary depending on factors such as the accuracy of the model, the language being used, and the type of entity.

With this, we successfully learnt one of the core tasks of natural language processing using Java and Apache OpenNLP. To know what else you can do with Java in the exciting domain of Data Science, check out this book Java for Data Science.
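One final refinement to the examples above, added here rather than taken from the book: the loops print only the first token of each span, so a multi-token entity such as "Oklahoma City" is reported simply as "Oklahoma". A minimal sketch that joins every token covered by a span, assuming the same tokens and spans arrays as above, looks like this:

for (Span span : spans) {
    StringBuilder entity = new StringBuilder();
    // A Span's start index is inclusive and its end index is exclusive
    for (int i = span.getStart(); i < span.getEnd(); i++) {
        if (entity.length() > 0) {
            entity.append(" ");
        }
        entity.append(tokens[i]);
    }
    out.println(span + " - " + entity);
}

With the "Pond Creek" sentence, this would print the full "Oklahoma City" span rather than just its first token.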


Why has Vue.js become so popular?

Amit Kothari
19 Jan 2018
5 min read
The JavaScript ecosystem is full of choices, with many good web development frameworks and libraries to choose from. One of these frameworks is Vue.js, which is gaining a lot of popularity these days. In this post, we'll explore why you should use Vue.js, and what makes it an attractive option for your next web project. For the latest Vue.js eBooks and videos, visit our Vue.js page.

What is Vue.js?

Vue.js is a JavaScript framework for building web interfaces. Vue has been gaining a lot of popularity recently, and it ranks number one among the 5 web development tools that will matter in 2018. If you take a look at its GitHub page you can see just how popular it has become: the community has grown at an impressive rate.

As a modern web framework, Vue ticks a lot of boxes. It uses a virtual DOM for better performance. A virtual DOM is an abstraction of the real DOM; this means it is lightweight and faster to work with. Vue is also reactive and declarative. This is useful because declarative rendering allows you to create visual elements that update automatically based on state/data changes. One of the most exciting things about Vue is that it supports a component-based approach to building web applications. Its single file components, which are independent and loosely coupled, allow better reuse and faster development. It's a tool that can significantly impact how you do things.

What are the benefits of using Vue.js?

Every modern web framework has strong benefits; if they didn't, no one would use them after all. Here are some of the reasons why Vue.js is a good web framework that can help you tackle many of today's development challenges. Check out this post to know more about how to install and use Vue.js for web development.

Good documentation. One of the things that is important when starting with a new framework is its documentation. Vue.js documentation is very well maintained; it includes a simple but comprehensive guide and well-documented APIs.

Learning curve. Another thing to look for when picking a new framework is the learning curve involved. Compared to many other frameworks, Vue's concepts and APIs are much simpler and easier to understand. It is also built on top of classic web technologies like JavaScript, HTML, and CSS. This results in a much gentler learning curve. Unlike other frameworks, which require knowledge of additional technologies (Angular requires TypeScript, for example, and React uses JSX), with Vue we can build a sophisticated app using HTML-based templates, plain JavaScript, and CSS.

Less opinionated, more flexible. Vue is also pretty flexible compared to other popular web frameworks. The core library focuses on the 'view' part, using a modular approach that allows you to pick your own solution for other concerns. While we can use other libraries for things like state management and routing, Vue offers officially supported companion libraries, which are kept up to date with the core library. These include Vuex, an Elm-, Flux-, and Redux-inspired state management solution, and vue-router, Vue's official routing library, which is powerful and incredibly easy to use with Vue.js. But because Vue is so flexible, if you wanted to use Redux instead of Vuex, you can do just that. Vue even supports JSX and TypeScript. And if you like taking a CSS-in-JS approach, many other popular libraries also support Vue.

Performance. One of the main reasons many teams are using Vue is because of its performance.
Vue is small and, even with minimal optimization effort, performs better than many other frameworks. This is largely due to its lightweight virtual DOM implementation. Check out the JavaScript frameworks performance benchmark for a useful performance comparison.

Tools. Along with a number of companion libraries, Vue also offers really good tools that provide a great development experience. Vue-CLI is Vue's command line tool. Simple yet powerful, it provides different templates, allows project customization, and makes starting a new Vue project incredibly easy. Vue also provides its own dev tools for Chrome (vue-devtools), which allow you to inspect the component tree and Vuex state, view events, and even time travel. This makes the debugging process pretty easy. Vue also supports hot reload. Hot reload is great because instead of needing to reload a whole page, it allows you to simply reload only the updated component while maintaining the app's current state.

Community. No framework can succeed without community support and, as we've seen already, Vue has a very active and constantly growing community. The framework has already been adopted by many big companies, and its growth is only going to continue. While it is a great option for web development, Vue is also collaborating with Weex, a platform for building cross-platform mobile apps. Weex is backed by the Alibaba Group, which is one of the largest e-commerce businesses in the world. Although Weex is not as mature as other app frameworks like React Native, it does allow you to build a UI with Vue, which can be rendered natively on iOS and Android.

Vue.js offers plenty of benefits. It performs well and is very easy to learn. However, it is, of course, important to pick the right tool for the job, and one framework may work better than another based on the project requirements and personal preferences. With this in mind, it's worth comparing Vue.js with other frameworks. Are you considering using Vue.js? Do you already use it? Tell us about your experience! You can get started with building your first Vue.js 2 web application from this post.


GitLab's new DevOps solution

Erik Kappelman
17 Jan 2018
5 min read
Can it be real? The complete DevOps toolchain integrated into one tool, one UI and one process? GitLab seems to think so. GitLab has already made huge strides in terms of centralizing the DevOps process into a single tool. Up until now, most of the focus has been on creating a seamless development system, and operations have not been as important. What's new is the extension of the tool to include the operating side of DevOps as well as the development side.

Let's talk a little bit about what DevOps is in order to fully appreciate the advances offered by GitLab. DevOps is basically a holistic approach to software development, quality assurance, and operations. While each of these elements of software creation is distinct, they are all heavily reliant on the other elements to be effective. The DevOps approach is to acknowledge this interdependence and then try to leverage it to increase productivity and to enhance the final user experience. Two of the most talked-about elements of DevOps are continuous integration and continuous deployment.

Continuous integration and deployment

Continuous integration and deployment are aimed at continuously integrating changes to a codebase, potentially from multiple sources, and then continuously deploying these changes into production. These tools require a pretty sophisticated automation and testing framework in order to be really effective. There are plenty of tools for one or the other, but the notion behind GitLab is essentially that if you can drive both of these processes from the same UI, they will be that much more efficient. GitLab has shown this to be true.

There is also the human side to consider, that is, coming up with what tasks need to be performed, assigning these tasks to developers, and monitoring their progress. GitLab offers tools that help streamline this process as well. You can track issues, create issue boards to organize workflow, and these issue boards can be sliced a number of different ways so that most imaginable human organizational needs can be met.

Monitoring and delivery

So far, we've seen that DevOps is about bringing everything together into a smooth process, and GitLab wants that process to occur in one place. GitLab can help you from planning to deployment and everywhere in between. But GitLab isn't satisfied with stopping at deployment, and they shouldn't be. When we think about the three legs of DevOps, development, operations, and quality assurance and testing, what I've said about GitLab really only applies to the development leg. This is an unfortunately common problem with DevOps tools and organizational strategies. They seem to cater to developers and basically no one else. Maybe devs complain the most, I don't know. GitLab has basically solved the DevOps problems between planning and deployment and, naturally, wants to move on to the monitoring and delivery of applications.

This is a really exciting direction. After all, software is ultimately about making things happen. Sometimes it's easy to lose sight of this and only focus on the tools that make the software. It is sometimes tempting to view software development as being inherently important, but it's really not; it's a process of making stuff for people to use. If you get too far away from that truth, things can get sticky. I think this is part of the reason the Ops side of DevOps is often overlooked. Operations is concerned with managing the software out there in the wild.
This includes dealing with network and hardware considerations and end users. GitLab wants operations to take place using the same UI as development. And why not? It's the same application, isn't it? And in addition to technical performance, what about how the users are interacting with the application? If the application is somehow monetized, why shouldn't that information also be available in the same UI as everything else having to do with this application? Again, it's still the same application.

One tool to rule them all

If you take a minute to step back and appreciate the vision of GitLab's current direction, I think you can see why this is so exciting. If GitLab is successful in the long term at extending its reach into every element of an application's lifecycle, including user interactions, productivity would skyrocket. This idea isn't really new. The 'one tool to rule them all' isn't even that imaginative a concept. It's just that no one has ever really created this 'one tool.' I believe we are about to enter, or have already entered, a DevOps space race. I believe GitLab is comfortably leading the pack, but they will need to keep working hard if they want it to stay that way. I believe we will be getting the one tool to rule them all, and I believe it is going to be soon. The way things are looking, GitLab is going to be the one to bring it to us, but only time will tell.

Erik Kappelman wears many hats including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for the Department of Transportation as a transportation demand modeler.

Running Parallel Data Operations using Java Streams

Pravin Dhandre
15 Jan 2018
8 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Richard M. Reese and Jennifer L. Reese, titled Java for Data Science. This book provides in-depth understanding of important tools and techniques used across data science projects in a Java environment.[/box] This article will show you the advantage of using Java 8 for solving complex and math-intensive problems on larger datasets with Java streams and lambda expressions. You will explore short demonstrations for performing matrix multiplication and map-reduce using Java 8.

The release of Java 8 came with a number of important enhancements to the language. The two enhancements of interest to us are lambda expressions and streams. A lambda expression is essentially an anonymous function that adds a functional programming dimension to Java. The concept of streams, as introduced in Java 8, does not refer to IO streams. Instead, you can think of a stream as a sequence of objects that can be generated and manipulated using a fluent style of programming. This style will be demonstrated shortly.

As with most APIs, programmers must be careful to consider the actual execution performance of their code using realistic test cases and environments. If not used properly, streams may not actually provide performance improvements. In particular, parallel streams, if not crafted carefully, can produce incorrect results. We will start with a quick introduction to lambda expressions and streams. If you are familiar with these concepts you may want to skip over the next section.

Understanding Java 8 lambda expressions and streams

A lambda expression can be expressed in several different forms. The following illustrates a simple lambda expression where the symbol, ->, is the lambda operator. This will take some value, e, and return the value multiplied by two. There is nothing special about the name e. Any valid Java variable name can be used:

e -> 2 * e

It can also be expressed in other forms, such as the following:

(int e) -> 2 * e
(double e) -> 2 * e
(int e) -> {return 2 * e;}

The form used depends on the intended type of e. Lambda expressions are frequently used as arguments to a method, as we will see shortly.

A stream can be created using a number of techniques. In the following example, a stream is created from an array. The IntStream interface is a type of stream that uses integers. The Arrays class' stream method converts an array into a stream:

IntStream stream = Arrays.stream(numbers);

We can then apply various stream methods to perform an operation. In the following statement, the forEach method will simply display each integer in the stream:

stream.forEach(e -> out.printf("%d ", e));

There are a variety of stream methods that can be applied to a stream. In the following example, the mapToDouble method will take an integer, multiply it by 2, and then return it as a double. The forEach method will then display these values:

stream
    .mapToDouble(e -> 2 * e)
    .forEach(e -> out.printf("%.4f ", e));

The cascading of method invocations is referred to as fluent programming.
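To make the earlier caution about parallel streams concrete, here is a small illustrative sketch that is not from the book: accumulating into shared mutable state from a parallel forEach is a race condition and gives unpredictable results, whereas letting the stream perform the reduction stays correct. It assumes the same statically imported out used in the article's examples.

import java.util.Arrays;
import java.util.stream.IntStream;

int[] numbers = IntStream.rangeClosed(1, 1000).toArray();

// Incorrect: a shared mutable accumulator updated from a parallel forEach
int[] badTotal = {0};
Arrays.stream(numbers).parallel().forEach(e -> badTotal[0] += e); // result varies from run to run
// Correct: let the stream perform the reduction itself
int goodTotal = Arrays.stream(numbers).parallel().sum();          // always 500500
out.println(badTotal[0] + " versus " + goodTotal);

The sum, reduce, and collect methods are designed for this kind of aggregation and remain safe under parallel execution.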
Using Java 8 to perform matrix multiplication

Here, we will illustrate how streams can be used to perform matrix multiplication. The definitions of the A, B, and C matrices are the same as declared in the Implementing basic matrix operations section. They are duplicated here for your convenience:

double A[][] = {
    {0.1950, 0.0311},
    {0.3588, 0.2203},
    {0.1716, 0.5931},
    {0.2105, 0.3242}};
double B[][] = {
    {0.0502, 0.9823, 0.9472},
    {0.5732, 0.2694, 0.916}};
double C[][] = new double[n][p];

The following sequence is a stream implementation of matrix multiplication. A detailed explanation of the code follows:

C = Arrays.stream(A)
    .parallel()
    .map(AMatrixRow -> IntStream.range(0, B[0].length)
        .mapToDouble(i -> IntStream.range(0, B.length)
            .mapToDouble(j -> AMatrixRow[j] * B[j][i])
            .sum()
        ).toArray()).toArray(double[][]::new);

The first map method, shown as follows, creates a stream of double vectors representing the 4 rows of the A matrix. The range method will return a stream of elements ranging from its first argument up to, but not including, its second argument.

.map(AMatrixRow -> IntStream.range(0, B[0].length)

The variable i corresponds to the numbers generated by the first range method, which matches the number of columns in the B matrix (3). The variable j corresponds to the numbers generated by the second, inner range method, which matches the number of rows of the B matrix (2). At the heart of the statement is the matrix multiplication, where the sum method calculates the sum:

.mapToDouble(j -> AMatrixRow[j] * B[j][i])
.sum()

The last part of the expression creates the two-dimensional array for the C matrix. The operator, ::new, is called a method reference and is a shorter way of invoking the new operator to create a new object:

).toArray()).toArray(double[][]::new);

The displayResult method is as follows:

public void displayResult() {
    out.println("Result");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            out.printf("%.4f ", C[i][j]);
        }
        out.println();
    }
}

The output of this sequence follows:

Result
0.0276 0.1999 0.2132
0.1443 0.4118 0.5417
0.3486 0.3283 0.7058
0.1964 0.2941 0.4964
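As a quick sanity check, which is an addition rather than part of the excerpt, the same multiplication can be written with plain nested loops; comparing its output against the stream version is an easy way to confirm that the fluent pipeline above performs ordinary matrix multiplication. It assumes the same A, B, n, and p declared earlier.

double[][] expected = new double[n][p];
for (int i = 0; i < n; i++) {            // rows of A
    for (int k = 0; k < p; k++) {        // columns of B
        for (int j = 0; j < B.length; j++) {
            expected[i][k] += A[i][j] * B[j][k];
        }
    }
}
// expected should match C element for element, up to floating-point rounding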
Using Java 8 to perform map-reduce

In this section, we will use Java 8 streams to perform a map-reduce operation. In this example, we will use a Stream of Book objects. We will then demonstrate how to use the Java 8 reduce and average methods to get our total page count and average page count. Rather than begin with a text file, as we did in the Hadoop example, we have created a Book class with title, author, and page-count fields. In the main method of the driver class, we have created new instances of Book and added them to an ArrayList called books. We have also created a double value average to hold our average, and initialized our variable totalPg to zero:

ArrayList<Book> books = new ArrayList<>();
double average;
int totalPg = 0;
books.add(new Book("Moby Dick", "Herman Melville", 822));
books.add(new Book("Charlotte's Web", "E.B. White", 189));
books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));

Next, we perform a map and reduce operation to calculate the total number of pages in our set of books. To accomplish this in a parallel manner, we use the stream and parallel methods. We then use the map method with a lambda expression to accumulate all of the page counts from each Book object. Finally, we use the reduce method to merge our page counts into one final value, which is to be assigned to totalPg:

totalPg = books
    .stream()
    .parallel()
    .map((b) -> b.pgCnt)
    .reduce(totalPg, (accumulator, _item) -> {
        out.println(accumulator + " " + _item);
        return accumulator + _item;
    });

Notice in the preceding reduce method we have chosen to print out information about the reduction operation's cumulative value and individual items. The accumulator represents the aggregation of our page counts. The _item represents the individual task within the map-reduce process undergoing reduction at any given moment. In the output that follows, we will first see the accumulator value stay at zero as each individual book item is processed. Gradually, the accumulator value increases. The final operation is the reduction of the values 1223 and 2279. The sum of these two numbers is 3502, or the total page count for all of our books:

0 822
0 189
0 299
0 673
0 212
299 673
0 1032
0 275
1032 275
972 1307
189 212
822 401
1223 2279

Next, we will add code to calculate the average page count of our set of books. We multiply our totalPg value, determined using map-reduce, by 1.0 to prevent truncation when we divide by the integer returned by the size method. We then print out average:

average = 1.0 * totalPg / books.size();
out.printf("Average Page Count: %.4f\n", average);

Our output is as follows:

Average Page Count: 500.2857

We could have used Java 8 streams to calculate the average directly using the map method. Add the following code to the main method. We use parallelStream with our map method to simultaneously get the page count for each of our books. We then use mapToDouble to ensure our data is of the correct type to calculate our average. Finally, we use the average and getAsDouble methods to calculate our average page count:

average = books
    .parallelStream()
    .map(b -> b.pgCnt)
    .mapToDouble(s -> s)
    .average()
    .getAsDouble();
out.printf("Average Page Count: %.4f\n", average);

Then we print out our average. Our output, identical to our previous example, is as follows:

Average Page Count: 500.2857

The above techniques leveraged Java 8 capabilities on the map-reduce framework to solve numeric problems. This type of process can also be applied to other types of data, including text-based data. The true benefit is seen when these processes handle extremely large datasets, where they bring a significant reduction in processing time. To know various other mathematical and parallel techniques in Java for building a complete data analysis application, you may read through the book Java for Data Science to get a better integrated approach.
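As a closing aside that is not part of the excerpt, IntStream can also gather the count, sum, minimum, maximum, and average in a single pass via summaryStatistics(), which avoids separate reduce and average steps when several aggregates are needed at once. This sketch assumes the same books list and pgCnt field used above:

import java.util.IntSummaryStatistics;

IntSummaryStatistics stats = books
    .parallelStream()
    .mapToInt(b -> b.pgCnt)
    .summaryStatistics();
out.printf("Total pages: %d, Average Page Count: %.4f\n",
    stats.getSum(), stats.getAverage());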


How to create a standard Java HTTP Client in ElasticSearch

Sugandha Lahoti
12 Jan 2018
6 min read
[box type="note" align="" class="" width=""]This is an excerpt from a book written by Alberto Paro, titled Elasticsearch 5.x Cookbook. This book is your one-stop guide to mastering the complete ElasticSearch ecosystem with comprehensive recipes on what's new in Elasticsearch 5.x.[/box] In this article we see how to create a standard Java HTTP client in ElasticSearch. All the code used in this article is available on GitHub. There are scripts to initialize all the required data.

An HTTP client is one of the easiest clients to create. It's very handy because it allows for the calling, not only of the internal methods as the native protocol does, but also of third-party calls implemented in plugins that can only be called via HTTP.

Getting Ready

You need an up-and-running Elasticsearch installation. You will also need Maven, or an IDE that natively supports it for Java programming, such as Eclipse or IntelliJ IDEA. The code for this recipe is in the chapter_14/http_java_client directory.

How to do it

For creating an HTTP client, we will perform the following steps:

For these examples, we have chosen Apache HttpComponents, which is one of the most widely used libraries for executing HTTP calls. This library is available in the main Maven repository, search.maven.org. To enable the compilation in your Maven pom.xml project just add the following code:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.5.2</version>
</dependency>

If we want to instantiate a client and fetch a document with a get method, the code will look like the following:

import org.apache.http.*;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.*;

public class App {
    private static String wsUrl = "http://127.0.0.1:9200";

    public static void main(String[] args) {
        CloseableHttpClient client = HttpClients.custom()
            .setRetryHandler(new MyRequestRetryHandler()).build();
        HttpGet method = new HttpGet(wsUrl + "/test-index/test-type/1");
        // Execute the method.
        try {
            CloseableHttpResponse response = client.execute(method);
            if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + response.getStatusLine());
            } else {
                HttpEntity entity = response.getEntity();
                String responseBody = EntityUtils.toString(entity);
                System.out.println(responseBody);
            }
        } catch (IOException e) {
            System.err.println("Fatal transport error: " + e.getMessage());
            e.printStackTrace();
        } finally {
            // Release the connection.
            method.releaseConnection();
        }
    }
}

The result, if the document exists, will be:

{"_index":"test-index","_type":"test-type","_id":"1","_version":1,"exists":true, "_source" : {...}}

How it works

We perform the previous steps to create and use an HTTP client:

The first step is to initialize the HTTP client object. In the previous code this is done via the following code:

CloseableHttpClient client = HttpClients.custom()
    .setRetryHandler(new MyRequestRetryHandler()).build();

Before using the client, it is a good practice to customize it; in general the client can be modified to provide extra functionalities such as retry support. Retry support is very important for designing robust applications; the IP network protocol is never 100% reliable, so the client should automatically retry an action if something goes wrong (HTTP connection closed, server overload, and so on). In the previous code, we defined an HttpRequestRetryHandler, which monitors the execution and repeats it three times before raising an error.
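The MyRequestRetryHandler class is referenced but not listed in this excerpt; the following is only a minimal sketch of what such a handler could look like, based on the standard HttpRequestRetryHandler interface, and is not the book's implementation:

import java.io.IOException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.protocol.HttpContext;

public class MyRequestRetryHandler implements HttpRequestRetryHandler {
    @Override
    public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
        // Retry the request up to three times, then give up and let the error propagate
        return executionCount <= 3;
    }
}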
After having set up the client we can define the method call. In the previous example we want to execute the GET REST call. The method used will be HttpGet and the URL will be item index/type/id. To initialize the method, the code is:

HttpGet method = new HttpGet(wsUrl + "/test-index/test-type/1");

To improve the quality of our REST call it's a good practice to add extra controls to the method, such as authentication and custom headers. The Elasticsearch server by default doesn't require authentication, so we need to provide some security layer at the top of our architecture. A typical scenario is using your HTTP client with the search guard plugin or the shield plugin, which is part of X-Pack, which allows the Elasticsearch REST to be extended with authentication and SSL. After one of these plugins is installed and configured on the server, the following code adds a host entry that allows the credentials to be provided only if context calls are targeting that host. The authentication is simply basicAuth, but works very well for non-complex deployments:

HttpHost targetHost = new HttpHost("localhost", 9200, "http");
CredentialsProvider credsProvider = new BasicCredentialsProvider();
credsProvider.setCredentials(
    new AuthScope(targetHost.getHostName(), targetHost.getPort()),
    new UsernamePasswordCredentials("username", "password"));
// Create AuthCache instance
AuthCache authCache = new BasicAuthCache();
// Generate BASIC scheme object and add it to local auth cache
BasicScheme basicAuth = new BasicScheme();
authCache.put(targetHost, basicAuth);
// Add AuthCache to the execution context
HttpClientContext context = HttpClientContext.create();
context.setCredentialsProvider(credsProvider);

The created context must be used when executing the call:

response = client.execute(method, context);

Custom headers allow for passing extra information to the server for executing a call. Some examples could be API keys, or hints about supported formats. A typical example is using gzip data compression over HTTP to reduce bandwidth usage. To do that, we can add a custom header to the call informing the server that our client accepts encoding: Accept-Encoding, gzip:

method.addHeader("Accept-Encoding", "gzip");

After configuring the call with all the parameters, we can fire up the request:

response = client.execute(method, context);

Every response object must be validated on its return status: if the call is OK, the return status should be 200. In the previous code the check is done in the if statement:

if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK)

If the call was OK and the status code of the response is 200, we can read the answer:

HttpEntity entity = response.getEntity();
String responseBody = EntityUtils.toString(entity);

The response is wrapped in HttpEntity, which is a stream. The HTTP client library provides a helper method, EntityUtils.toString, that reads all the content of HttpEntity as a string. Otherwise we'd need to write some code to read from the stream and build the string ourselves. Obviously, all the read parts of the call are wrapped in a try-catch block to collect all possible errors due to networking errors.
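Building on the same client, documents can also be indexed over HTTP with a PUT call. The following is a minimal sketch under the same setup; it assumes the client, wsUrl, and the test-index/test-type names used above, and is an addition to the recipe rather than part of it:

import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;

HttpPut putMethod = new HttpPut(wsUrl + "/test-index/test-type/2");
String jsonBody = "{\"title\":\"Second document\",\"indexed\":true}";
putMethod.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
try (CloseableHttpResponse putResponse = client.execute(putMethod)) {
    // 201 Created for a new document, 200 OK when an existing one is updated
    System.out.println(putResponse.getStatusLine());
    System.out.println(EntityUtils.toString(putResponse.getEntity()));
} finally {
    putMethod.releaseConnection();
}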
See Also

- The Apache HttpComponents site at http://hc.apache.org/ for a complete reference and more examples about this library
- The search guard plugin to provide authenticated Elasticsearch access at https://github.com/floragunncom/search-guard, or the Elasticsearch official shield plugin at https://www.elastic.co/products/x-pack

We saw a simple recipe to create a standard Java HTTP client in Elasticsearch. If you enjoyed this excerpt, check out the book Elasticsearch 5.x Cookbook to learn how to create an HTTP Elasticsearch client, a native client and perform other operations in ElasticSearch.


Getting Started with SOA and WSO2

Packt
11 Jan 2018
11 min read
In this article by Fidel Prieto Estrada and Ramón Garrido, authors of the book WSO2: Developer's Guide, we will discuss the facts and problems that large companies with huge IT systems had to face, and that finally gave rise to the SOA approach. Once we know what we are talking about, we will introduce the WSO2 technology and describe the role it plays in SOA, which will be followed by the installation and configuration of the WSO2 products we will use. So, in this article, we will learn the basics of SOA.

Service-oriented architecture (SOA) is a style, an approach to designing software in a different way from the standard. SOA is not a technology; it is a paradigm, a design style. There comes a time when a company grows and grows, which means that its IT system also becomes bigger and bigger, fetching a huge amount of data that it has to share with other companies. This typical data may be, for example, any of the following:

- Sales data
- Employee data
- Customer data
- Business information

In this environment, each information need of the company's applications is satisfied by a direct link to the system that owns the required information. So, when a company becomes a large corporation, with many departments and complex business logic, the IT system becomes a spaghetti dish:

[Figure: Spaghetti dish]

The spaghetti dish is a comparison widely used to describe how complex the integration links between applications may become in a large corporation. In this comparison, each strand of spaghetti represents the link between two applications used to share some kind of information. Thus, when the number of applications needed for our business rises, the amount of information shared grows larger as well. So, if we draw the map that represents all the links between the whole set of applications, the image will be quite similar to a spaghetti dish. Take a look at the following diagram:

[Figure: Spaghetti integrations by Oracle (https://image.slidesharecdn.com/2012-09-20-aspire-oraclesoawebinar-finalversion-160109031240/95/maximizing-oracle-apps-capabilities-using-oracle-soa-7-638.jpg?cb=1452309418)]

The preceding diagram represents an environment that is closed, monolithic, and inefficient, with the following features:

- The architecture is split into blocks divided by business areas.
- Each area is closed off from the rest of the areas, so interaction between them is quite difficult.
- These isolated blocks are hard to maintain.
- Each block was managed by just one provider, which knew that business area deeply.
- It is difficult for the company to change the provider that manages each business area due to the risk involved.
- The company cannot protect itself against the abuses of the provider. The provider may commit many abuses, such as raising the fee for the provided service, violating the service level agreement (SLA), breaching the schedule, and many others we can imagine. In these situations, the company lacks instruments to fight them, because if the business area managed by the provider stops working, the impact on the company's profits is much larger than the cost of putting up with those abuses.
- The provider has deeper knowledge of the customer's business than the customer itself.
- The maintenance cost is high due to the complexity of the network, for many reasons; consider the following examples:
  - It is difficult to perform impact analysis when a new functionality is needed, which means a high cost and a long time to evaluate any fix, and a higher cost for each fix in turn.
  - The complex interconnection network is difficult to know in depth, so finding the cause of a failure or malfunction may become quite a task.
  - When a system is down, most of the others may be down as well.
- A business process typically involves different databases and applications. Thus, when a user has to run a business process in the company, he needs to use different applications, access different networks, and log in with different credentials in each one; this makes the business quite inefficient, making simple tasks take too much time.
- When a system in your puzzle uses an obsolete technology, which is quite common with legacy systems, you will always be tied to it and to its incompatibility issues with brand new technologies.
- Managing a fine-grained security policy that controls who has access to each piece of data is simply a utopia.

Something had to be done to face all these problems, and SOA is the approach that puts this in order. SOA is the final approach, after previous attempts, to try to tidy up this chaos. We can take a look at the origin of SOA in the white paper, The 25-year history of SOA, by Erik Townsend (http://www.eriktownsend.com/white-papers/technology). It is quite an interesting read, where Erik traces the origin of SOA back to the manufacturing industry. I agree with that idea, and it is easy to see how improvements in the manufacturing industry, or other industries, are applied to the IT world; take these examples:

- Hardware buses in motherboards have been used for decades, and now we can also find a software bus, the Enterprise Service Bus (ESB), in a company. The hardware bus connects hardware devices such as the microprocessor, memory, or hard drive; the software bus connects applications.
- A hardware router in a network routes small fragments of data between different networks to lead these packets to the destination network. The message router software, which implements the message router enterprise integration pattern, routes data objects between applications.
- We create software factories to develop software using the same paradigm as a manufacturing industry.
- Lean IT is a trending topic nowadays. It tries, roughly speaking, to optimize IT processes by removing the muda (a Japanese word meaning wastefulness or uselessness). It is based on the benefits of the lean manufacturing applied by Toyota in the '70s, after the oil crisis, which led it to the top position in the car manufacturing industry.

We can also find an analogy between what object-oriented languages mean to programming and what SOA represents to system integration. We can find analogies between ITILv3 and SOA as well. The way ITILv3 manages the company's services can be applied to manage SOA services at many points. ITILv3 deals with the services that a company offers and how to manage them, and SOA deals with the services that a company offers to expose data from one system to the rest of them. Both conceptions are quite similar if we think of the ITILv3 company as the IT department and of the company's service as the SOA service.

There is another quite interesting read, Note on Distributed Computing, from Sun Microsystems Laboratories, published in 1994.
In this paper, four members of Sun Microsystems discuss the problems that a company faces when it expands, and when the systems that make up the IT core of the company need to share information. You can find this paper at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48.7969&rep=rep1&type=pdf.

In the early '90s, when companies were starting to computerize, they needed to share information from one system to another, which was not an easy task at all. There was a discussion on how to handle local and remote information, as well as which technology to use to share that information. The Network File System (NFS), by Sun Microsystems, was a good attempt to share that information, but there was still a lot of work left to do. After NFS, other approaches came, such as CORBA and Microsoft DCOM, but they still kept dependencies between the whole set of connected applications. Refer to the following diagram:

[Figure: The SOA approach versus CORBA and DCOM]

Finally, with the SOA approach, by the end of the '90s, independent applications were able to share their data while avoiding dependencies. This data interchange is done using services. An SOA service is a data interchange need between different systems that complies with certain rules. These rules are the so-called SOA principles that we will explain as we move on.

SOA Principles

The SOA principles are the rules that we always have to keep in mind when taking any kind of decision in an SOA organization, such as the following:

- Analyzing proposals for services
- Deciding whether to add a new functionality to a service or to split it into two services
- Solving performance issues
- Designing new services

There is no industry-wide agreement about the SOA principles, and different organizations publish their own. Now, we will go through the principles that will help us in understanding their importance:

Service standardization: Services must comply with the communication and design agreements defined for the catalog they belong to. These include both high-level specifications and low-level details, such as those mentioned here:

- Service name
- Functional details
- Input data
- Output data
- Protocols
- Security

Service loose coupling: Services in the catalog must be independent from each other. The only thing a service should know about the rest of the services in the catalog is that they exist. The way to achieve this is by defining service contracts, so that when a service needs to use another one, it just has to use that service contract.

Service abstraction: The service should be a black box defined only by its contracts. The contract specifies the input and output parameters with no information at all about how the process is performed. This reduces the coupling with other services to a minimum.

Service reusability: This is the most important principle, and it means that services must be conceived to be reused by the maximum number of consumers. The service must be reusable in any context and by any consumer, not only by the application that originated the need for the service. Other applications in the company must be able to consume that service, and even systems outside the company in case the service is published, for example, to the general public. To achieve this, the service must obviously be independent from any technology and must not be coupled to a specific business process.
If we have a service working in one context and it is needed in a wider context, the right choice is to modify the service so that it can be consumed in both contexts.

Service autonomy: A service must have a high degree of control over its runtime environment and over the logic it represents. The more control a service has over the underlying resources, the fewer dependencies it has and the more predictable and reliable it is. Resources may be hardware or software resources; for example, the network is a hardware resource, and a database table is a software resource. It would be ideal to have a service with exclusive ownership over its resources, but with an equilibrated amount of control that allows it to minimize the dependencies on shared resources.

Service statelessness: Services must have no state, that is, a service does not retain information about the data processed. All the data needed comes from the input parameters every time it is consumed. The information needed during the process dies when the process ends. Managing the whole amount of state information would put the service's availability in serious trouble.

Service discovery: With the goal of maximizing reuse, services must be discoverable. Everyone should know the service list and its detailed information. To achieve that aim, services will have metadata to describe them, which will be stored in a repository or catalog. This metadata information must be accessible easily and automatically (programmatically) using, for example, Universal Description, Discovery, and Integration (UDDI). Thus, we avoid building or asking for a new service when we already have a service, or several of them, providing that information by composition.

Service composability: A service with more complex requirements must use other existing services to achieve its aim instead of implementing the same logic that is already available in other services.

Service granularity: Services must offer a relevant piece of business. The functionality of the service must not be so simple that its output always needs to be complemented with another service's functionality. Likewise, the functionality of the service must not be so complex that none of the services in the company uses the whole set of information returned by it.

Service normalization: As in other areas, such as database design, services must be decomposed, avoiding redundant logic. This principle may be omitted in some cases due to, for example, performance issues, where the priority is a quick response for the business.

Vendor independence: As we discussed earlier, services must not be attached to any technology. The service definition must be technology independent, and any vendor-specific feature must not affect the design of the service.

Summary

In this article, we discussed the issues that gave rise to SOA, described its main principles, and explained how to make our standard organization an SOA organization. In order to achieve this aim, we named the WSO2 product we need: WSO2 Enterprise Integrator. Finally, we learned how to install, configure, and start it up.

Working with Spark’s graph processing library, GraphFrames

Pravin Dhandre
11 Jan 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rajanarayanan Thottuvaikkatumana titled, Apache Spark 2 for Beginners. The author presents a learner's guide for Python and Scala developers to develop large-scale and distributed data processing applications in the business environment.[/box] In this post we will see how a Spark user can work with Spark's most popular graph processing package, GraphFrames. We will additionally explore how you can benefit from running queries and finding insightful patterns through graphs.

The Spark GraphX library is the graph processing library that has the least programming language support. Scala is the only programming language supported by the Spark GraphX library. GraphFrames is a new graph processing library available as an external Spark package developed by Databricks, University of California, Berkeley, and Massachusetts Institute of Technology, built on top of Spark DataFrames. Since it is built on top of DataFrames, all the operations that can be done on DataFrames are potentially possible on GraphFrames, with support for programming languages such as Scala, Java, Python, and R with a uniform API. Since GraphFrames is built on top of DataFrames, the persistence of data, support for numerous data sources, and powerful graph queries in Spark SQL are additional benefits users get for free.

Just like the Spark GraphX library, in GraphFrames the data is stored in vertices and edges. The vertices and edges use DataFrames as the data structure. The first use case covered in the beginning of this chapter is used again to elucidate GraphFrames-based graph processing. Please make a note that GraphFrames is an external Spark package. It has some incompatibility with Spark 2.0. Because of that, the following code snippets will not work with Spark 2.0. They work with Spark 1.6. Refer to their website to check Spark 2.0 support.

At the Scala REPL prompt of Spark 1.6, try the following statements. Since GraphFrames is an external Spark package, while bringing up the appropriate REPL, the library has to be imported and the following command is used in the terminal prompt to fire up the REPL and make sure that the library is loaded without any error messages:

$ cd $SPARK_1.6_HOME
$ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
Ivy Default Cache set to: /Users/RajT/.ivy2/cache
The jars for the packages stored in: /Users/RajT/.ivy2/jars
:: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
  confs: [default]
  found graphframes#graphframes;0.1.0-spark1.6 in list
:: resolution report :: resolve 153ms :: artifacts dl 2ms
  :: modules in use:
  graphframes#graphframes;0.1.0-spark1.6 from list in [default]
  ---------------------------------------------------------------------
  |                  |            modules            ||   artifacts   |
  |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
  ---------------------------------------------------------------------
  |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
  ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
  confs: [default]
  0 artifacts copied, 1 already retrieved (0kB/5ms)
16/07/31 09:22:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> import org.graphframes._
import org.graphframes._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._
scala> //Create a DataFrame of users containing tuple values with a mandatory Long and another String type as the property of the vertex
scala> val users = sqlContext.createDataFrame(List((1L, "Thomas"),(2L, "Krish"),(3L, "Mathew"))).toDF("id", "name")
users: org.apache.spark.sql.DataFrame = [id: bigint, name: string]
scala> //Created a DataFrame for Edge with String type as the property of the edge
scala> val userRelationships = sqlContext.createDataFrame(List((1L, 2L, "Follows"),(1L, 2L, "Son"),(2L, 3L, "Follows"))).toDF("src", "dst", "relationship")
userRelationships: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, relationship: string]
scala> val userGraph = GraphFrame(users, userRelationships)
userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, name: string], e:[src: bigint, dst: bigint, relationship: string])
scala> // Vertices in the graph
scala> userGraph.vertices.show()
+---+------+
| id|  name|
+---+------+
|  1|Thomas|
|  2| Krish|
|  3|Mathew|
+---+------+
scala> // Edges in the graph
scala> userGraph.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  1|  2|     Follows|
|  1|  2|         Son|
|  2|  3|     Follows|
+---+---+------------+
scala> //Number of edges in the graph
scala> val edgeCount = userGraph.edges.count()
edgeCount: Long = 3
scala> //Number of vertices in the graph
scala> val vertexCount = userGraph.vertices.count()
vertexCount: Long = 3
scala> //Number of edges coming to each of the vertex.
scala> userGraph.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
|  2|       2|
|  3|       1|
+---+--------+
scala> //Number of edges going out of each of the vertex.
scala> userGraph.outDegrees.show()
+---+---------+
| id|outDegree|
+---+---------+
|  1|        2|
|  2|        1|
+---+---------+
scala> //Total number of edges coming in and going out of each vertex.
scala> userGraph.degrees.show()
+---+------+
| id|degree|
+---+------+
|  1|     2|
|  2|     3|
|  3|     1|
+---+------+
scala> //Get the triplets of the graph
scala> userGraph.triplets.show()
+-------------+----------+----------+
|         edge|       src|       dst|
+-------------+----------+----------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|
|    [1,2,Son]|[1,Thomas]| [2,Krish]|
|[2,3,Follows]| [2,Krish]|[3,Mathew]|
+-------------+----------+----------+
scala> //Using the DataFrame API, apply a filter and select only the needed edges
scala> val numFollows = userGraph.edges.filter("relationship = 'Follows'").count()
numFollows: Long = 2
scala> //Create an RDD of users containing tuple values with a mandatory Long and another String type as the property of the vertex
scala> val usersRDD: RDD[(Long, String)] = sc.parallelize(Array((1L, "Thomas"), (2L, "Krish"),(3L, "Mathew")))
usersRDD: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[54] at parallelize at <console>:35
scala> //Create an RDD of Edge type with a String type as the property of the edge
scala> val userRelationshipsRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Follows"), Edge(1L, 2L, "Son"),Edge(2L, 3L, "Follows")))
userRelationshipsRDD: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[55] at parallelize at <console>:35
scala> //Create a graph containing the vertex and edge RDDs as created before
scala> val userGraphXFromRDD = Graph(usersRDD, userRelationshipsRDD)
userGraphXFromRDD: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@77a3c614
scala> //Create the GraphFrame-based graph from the Spark GraphX-based graph
scala> val userGraphFrameFromGraphX: GraphFrame = GraphFrame.fromGraphX(userGraphXFromRDD)
userGraphFrameFromGraphX: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, attr: string], e:[src: bigint, dst: bigint, attr: string])
scala> userGraphFrameFromGraphX.triplets.show()
+-------------+----------+----------+
|         edge|       src|       dst|
+-------------+----------+----------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|
|    [1,2,Son]|[1,Thomas]| [2,Krish]|
|[2,3,Follows]| [2,Krish]|[3,Mathew]|
+-------------+----------+----------+
scala> // Convert the GraphFrame-based graph to a Spark GraphX-based graph
scala> val userGraphXFromGraphFrame: Graph[Row, Row] = userGraphFrameFromGraphX.toGraphX
userGraphXFromGraphFrame: org.apache.spark.graphx.Graph[org.apache.spark.sql.Row,org.apache.spark.sql.Row] = org.apache.spark.graphx.impl.GraphImpl@238d6aa2

When creating DataFrames for a GraphFrame, the only thing to keep in mind is that there are some mandatory columns for the vertices and the edges. In the DataFrame of vertices, the id column is mandatory. In the DataFrame of edges, the src and dst columns are mandatory. Apart from that, any number of arbitrary columns can be stored with both the vertices and the edges of a GraphFrame. In the Spark GraphX library, the vertex identifier must be a long integer, but GraphFrames has no such limitation and any type is supported as the vertex identifier. Readers should already be familiar with DataFrames; any operation that can be done on a DataFrame can be done on the vertices and edges of a GraphFrame. All the graph processing algorithms supported by Spark GraphX are supported by GraphFrames as well. The Python version of GraphFrames has fewer features.
Since Python is not a supported programming language for the Spark GraphX library, GraphFrame-to-GraphX and GraphX-to-GraphFrame conversions are not supported in Python. Since readers are already familiar with the creation of DataFrames in Spark using Python, the Python example is omitted here. Moreover, there are some pending defects in the GraphFrames API for Python, and at the time of writing not all the features demonstrated previously in Scala work properly in Python.

Understanding GraphFrames queries

The Spark GraphX library is an RDD-based graph processing library, whereas GraphFrames is a Spark DataFrame-based graph processing library available as an external package. Spark GraphX supports many graph processing algorithms, but GraphFrames supports not only graph processing algorithms but also graph queries. The major difference between the two is that graph processing algorithms are used to process the data hidden in a graph data structure, while graph queries are used to search for patterns in that data. In GraphFrames parlance, graph queries are also known as motif finding. Motif finding has tremendous applications in genetics and other biological sciences that deal with sequence motifs.

From a use case perspective, take the example of users following each other in a social media application. Users have relationships between them, and in the previous sections these relationships were modeled as graphs. In real-world use cases, such graphs can become really huge. If there is a need to find users with relationships between them in both directions, that requirement can be expressed as a pattern in a graph query, and the matching relationships can be found using simple programmatic constructs. The following demonstration models the relationships between users in a GraphFrame, and a pattern search is done using it.

At the Scala REPL prompt of Spark 1.6, try the following statements:

$ cd $SPARK_1.6_HOME
$ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
Ivy Default Cache set to: /Users/RajT/.ivy2/cache
The jars for the packages stored in: /Users/RajT/.ivy2/jars
:: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
   confs: [default]
   found graphframes#graphframes;0.1.0-spark1.6 in list
:: resolution report :: resolve 145ms :: artifacts dl 2ms
   :: modules in use:
   graphframes#graphframes;0.1.0-spark1.6 from list in [default]
   ---------------------------------------------------------------------
   |                  |            modules            ||   artifacts   |
   |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
   ---------------------------------------------------------------------
   |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
   ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
   confs: [default]
   0 artifacts copied, 1 already retrieved (0kB/5ms)
16/07/29 07:09:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> import org.graphframes._
import org.graphframes._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._
scala> //Create a DataFrame of users containing tuple values with a mandatory String field as id and another String type as the property of the vertex. Here it can be seen that the vertex identifier is no longer a long integer.
scala> val users = sqlContext.createDataFrame(List(("1", "Thomas"),("2", "Krish"),("3", "Mathew"))).toDF("id", "name")
users: org.apache.spark.sql.DataFrame = [id: string, name: string]
scala> //Create a DataFrame for the edges with a String type as the property of the edge
scala> val userRelationships = sqlContext.createDataFrame(List(("1", "2", "Follows"),("2", "1", "Follows"),("2", "3", "Follows"))).toDF("src", "dst", "relationship")
userRelationships: org.apache.spark.sql.DataFrame = [src: string, dst: string, relationship: string]
scala> //Create the GraphFrame
scala> val userGraph = GraphFrame(users, userRelationships)
userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string], e:[src: string, dst: string, relationship: string])
scala> // Search for pairs of users who are following each other
scala> // In other words, the query can be read like this: find the list of users having a pattern such that user u1 is related to user u2 using the edge e1, and user u2 is related to user u1 using the edge e2. When a query is formed like this, the result will be listed with the columns u1, u2, e1 and e2. When modelling real-world use cases, more meaningful variable names suitable for the use case can be used.
scala> val graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)")
graphQuery: org.apache.spark.sql.DataFrame = [e1: struct<src:string,dst:string,relationship:string>, u1: struct<id:string,name:string>, u2: struct<id:string,name:string>, e2: struct<src:string,dst:string,relationship:string>]
scala> graphQuery.show()
+-------------+----------+----------+-------------+
|           e1|        u1|        u2|           e2|
+-------------+----------+----------+-------------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|[2,1,Follows]|
|[2,1,Follows]| [2,Krish]|[1,Thomas]|[1,2,Follows]|
+-------------+----------+----------+-------------+

Note that the columns in the graph query result are formed from the elements given in the search pattern, and there is no limit to the way the patterns can be formed. Also note the data type of the graph query result: it is a DataFrame object, which brings great flexibility in processing the query results using the familiar Spark SQL library, as the short sketch at the end of this article illustrates. The biggest limitation of the Spark GraphX library is that its API is not supported in popular programming languages such as Python and R. Since GraphFrames is a DataFrame-based library, once it matures, it will enable graph processing in all the programming languages supported by DataFrames. This external Spark package is definitely a potential candidate to be included as part of Spark itself.

To know more about the design and development of data processing applications using Spark and the family of libraries built on top of it, do check out the book Apache Spark 2 for Beginners.
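Because the motif-finding result is an ordinary DataFrame, it can be refined further with the usual DataFrame operations. The following is a minimal sketch of that idea, not taken from the book: it assumes the userGraph value created above and the Spark 1.6 / GraphFrames 0.1.0 setup used in this article, and the filter conditions and column aliases are illustrative choices of this post.

// Sketch: refine the motif-finding result with plain DataFrame operations.
// Assumes the userGraph value defined above (Spark 1.6, GraphFrames 0.1.0);
// run it via the REPL's :paste mode or from a compiled application.
val mutualQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)")

// Keep only pairs where both directions are 'Follows' edges, and drop the
// mirror-image duplicate rows by ordering the pair on the vertex id.
val mutualFollows = mutualQuery
  .filter("e1.relationship = 'Follows' AND e2.relationship = 'Follows'")
  .filter("u1.id < u2.id")
  .selectExpr("u1.name AS user1", "u2.name AS user2")

mutualFollows.show()

The u1.id < u2.id condition is just one way to deduplicate the symmetric matches; on the sample data above it should leave a single Thomas-Krish row.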

Why R is perfect for Statistical Analysis

Amarabha Banerjee
11 Jan 2018
7 min read
[box type="note" align="" class="" width=""]This article is taken from Machine Learning with R  written by Brett Lantz. This book will help you learn specialized machine learning techniques for text mining, social network data, and big data.[/box] In this post we will explore different statistical analysis techniques and how they can be implemented using R language easily and efficiently. Introduction The R language, as the descendent of the statistics language, S, has become the preferred computing language in the field of statistics. Moreover, due to its status as an active contributor in the field, if a new statistical method is discovered, it is very likely that this method will first be implemented in the R language. As such, a large quantity of statistical methods can be fulfilled by applying the R language. To apply statistical methods in R, the user can categorize the method of implementation into descriptive statistics and inferential statistics: Descriptive statistics: These are used to summarize the characteristics of the data. The user can use mean and standard deviation to describe numerical data, and use frequency and percentages to describe categorical data Inferential statistics: Based on the pattern within a sample data, the user can infer the characteristics of the population. The methods related to inferential statistics are for hypothesis testing, data estimation, data correlation, and relationship modeling. Inference can be further extended to forecasting, prediction, and estimation of unobserved values either in or associated with the population being studied. In the following recipes, we will discuss examples of data sampling, probability distribution, univariate descriptive statistics, correlations and multivariate analysis, linear regression and multivariate analysis, Exact Binomial Test, student's t-test, Kolmogorov-Smirnov test, Wilcoxon Rank Sum and Signed Rank test, Pearson's Chi-squared Test, One-way ANOVA, and Two-way ANOVA. Data sampling with R Sampling is a method to select a subset of data from a statistical population, which can use the characteristics of the population to estimate the whole population. The following recipe will demonstrate how to generate samples in R. Perform the following steps to understand data sampling in R: To generate random samples of a given population, the user can simply use the sample function: > sample(1:10) R and Statistics [ 111 ] To specify the number of items returned, the user can set the assigned value to the size argument: > sample(1:10, size = 5) Moreover, the sample can also generate Bernoulli trials by specifying replace = TRUE (default is FALSE): > sample(c(0,1), 10, replace = TRUE) If we want to do a coin flipping trail, where the outcome is Head or Tail, we can use: > outcome <- c("Head","Tail") > sample(outcome, size=1) To generate result for 100 times, we can use: > sample(outcome, size=100, replace=TRUE) The sample can be useful when we want to select random data from datasets, selecting 10 observations from AirPassengers: > sample(AirPassengers, size=10) How it works As we saw in the preceding demonstration, the sample function can generate random samples from a specified population. The returned number from records can be designated by the user simply by specifying the argument of size. By assigning the replace argument as TRUE, you can generate Bernoulli trials (a population with 0 and 1 only). 
Operating a probability distribution in R

Probability distributions and statistical analysis are closely related. In statistical analysis, analysts make predictions based on a certain population, which usually follows a probability distribution. Therefore, if you find that the data selected for a prediction does not follow the probability distribution assumed in the experiment design, the resulting conclusions can be refuted. In other words, probability provides the justification for statistics. The following examples demonstrate how to work with probability distributions in R. Perform the following steps:

For a normal distribution, the user can use dnorm, which will return the height of the normal curve at 0:
> dnorm(0)
Output:
[1] 0.3989423

Then, the user can change the mean and the standard deviation in the arguments:
> dnorm(0,mean=3,sd=5)
Output:
[1] 0.06664492

Next, plot the graph of a normal distribution with the curve function:
> curve(dnorm,-3,3)

In contrast to dnorm, which returns the height of the normal curve, the pnorm function returns the area under the curve below a given value:
> pnorm(1.5)
Output:
[1] 0.9331928

Alternatively, to get the area above a certain value, you can set the lower.tail option to FALSE:
> pnorm(1.5, lower.tail=FALSE)
Output:
[1] 0.0668072

To plot the graph of pnorm, the user can employ the curve function:
> curve(pnorm(x), -3,3)

To calculate the quantiles of a specific distribution, you can use qnorm. The qnorm function can be treated as the inverse of pnorm; it returns the Z-score of a given probability:
> qnorm(0.5)
Output:
[1] 0
> qnorm(pnorm(0))
Output:
[1] 0

To generate random numbers from a normal distribution, one can use the rnorm function and specify the number of generated values. Optional arguments, such as the mean and standard deviation, can also be supplied:
> set.seed(50)
> x = rnorm(100,mean=3,sd=5)
> hist(x)

For the uniform distribution, the runif function generates random numbers, and the user can specify the range of the generated numbers through the minimum and maximum. In the following example, the user generates 100 random values from 0 to 5:
> set.seed(50)
> y = runif(100,0,5)
> hist(y)

Lastly, if you would like to test the normality of the data, the most widely used test for this is the Shapiro-Wilk test. Here, we demonstrate how to perform a test of normality on samples from the normal and uniform distributions, respectively:
> shapiro.test(x)
Output:
Shapiro-Wilk normality test
data: x
W = 0.9938, p-value = 0.9319
> shapiro.test(y)
Shapiro-Wilk normality test
data: y
W = 0.9563, p-value = 0.002221

How it works

In this recipe, we first introduce dnorm, the probability density function, which returns the height of the normal curve. When only a single input is specified, that input value is called a standard score or z-score, and it is assumed that the normal distribution with a mean of zero and a standard deviation of 1 is in use. We then show how to draw the normal distribution with the curve function. After this, we introduce pnorm, the cumulative distribution function, which gives the area under the curve below a given value. In addition, pnorm can also be used to calculate the p-value from a normal distribution; one can get the upper-tail p-value by subtracting the returned value from 1, or by setting the lower.tail option to FALSE. Similarly, one can use the curve function to plot the cumulative density.
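For reference (these are standard definitions rather than part of the recipe), dnorm and pnorm correspond to the normal density f and the cumulative distribution function F, which for mean \mu and standard deviation \sigma are

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad F(x) = \int_{-\infty}^{x} f(t)\, dt

With the defaults \mu = 0 and \sigma = 1, dnorm(0) evaluates to 1/\sqrt{2\pi} \approx 0.3989, matching the output above, and qnorm is simply the inverse of F.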
In contrast to pnorm, qnorm returns the z-score of a given probability. Therefore, as the example shows, applying qnorm to the result of pnorm reproduces the exact input value. Next, we show how to use the rnorm function to generate random samples from a normal distribution, and the runif function to generate random samples from a uniform distribution. With rnorm, one has to specify the number of generated values and may also add optional arguments, such as the mean and standard deviation. Then, by using the hist function, one should be able to see a bell curve in the resulting histogram. For the runif function, with the minimum and maximum specified, one gets a list of sample values between the two, and we can again use the hist function to plot the samples. This time the resulting histogram is not bell-shaped, which indicates that the sample does not come from a normal distribution.

Finally, we demonstrate how to test data normality with the Shapiro-Wilk test. Here, we conduct the normality test on the normal and uniform distribution samples, respectively, and in both outputs one can find the p-value of the test. The p-value reflects how consistent the sample is with a normal distribution: if the p-value is higher than 0.05, we can conclude that the sample comes from a normal distribution, while if it is lower than 0.05, we conclude that it does not.

We have shown how you can use the R language to perform statistical analysis easily and efficiently in some of its simplest forms. If you liked this article, please be sure to check out Machine Learning with R, which covers many useful machine learning techniques with R.