
How-To Tutorials - Data

1204 Articles

Training RNNs for Time Series Forecasting

Aaron Lazar
22 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This tutorial has been taken from the book Practical Time Series Analysis by Dr. PKS Prakash and Dr. Avishek Pal.[/box] RNNs are notoriously difficult to be trained. Vanilla RNNs suffer from vanishing and exploding gradients that give erratic results during training. As a result, RNNs have difficulty in learning long-range dependencies. For time series forecasting, going too many timesteps back in the past would be problematic. To address this problem, Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU), which are special types of RNNs, have been introduced. We will use LSTM and GRU to develop the time series forecasting models. Before that, let's review how RNNs are trained using Backpropagation Through Time (BPTT), a variant of the backpropagation algorithm. We will find out how vanishing and exploding gradients arise during BPTT. Let's consider the computational graph of the RNN that we have been using for the time series forecasting. The gradient computation is shown in the following figure: Figure: Back-Propagation Through Time for deep recurrent neural network with p timesteps For weights U, there is one path to compute the partial derivative, which is given as follows: However, due to the sequential structure of the RNN, there are multiple paths connecting the weights and the loss and the loss. Hence, the partial derivative is the sum of partial derivatives along the individual paths that start at the loss node and ends at every timestep node in the computational graph: The technique of computing the gradients for the weights by summing over paths connecting the loss node and every timestep node is backpropagation through time, which is a special case of the original backpropagation algorithm. The problem of vanishing gradients in long-range RNNs is due to the multiplicative terms in the BPTT gradient computations. Now let's examine a multiplicative term from one of the preceding equations. The gradients along the computation path connecting the loss node and ith timestep is This chain of gradient multiplication is ominously long to model long-range dependencies and this is where the problem of vanishing gradient arises. The activation function for the internal state [0,1) is either a tanh or sigmoid. The first derivative of tanh is which is bound in (0,¼]. In the sigmoid function, the first-order derivative is which is bound in si . Hence, the gradients W  are positive fractions. For long-range timesteps, multiplying these fractional gradients diminishes the final product to zero and there is no gradient flow from a long-range timestep. Due to the negligibly low values of the gradients, the weights do not update and hence the neurons are said to be saturated. It is noteworthy that U, ∂si/∂s(i-1), and ∂si/∂W are matrices and therefore the partial derivatives t and st are computed on matrices. The final output is computed through matrix multiplications and additions. The first derivative on matrices is called Jacobian. If any one element of the Jacobian matrix is a fraction, then for a long-range RNN, we would see a vanishing gradient. On the other hand, if an element of the Jacobian is greater than one, the training process suffers from exploding gradients. Solving the long-range dependency problem We have seen in the previous section that it is difficult for vanilla RNNs to effectively learn long-range dependencies due to vanishing and exploding gradients. 
To address this issue, the Long Short Term Memory (LSTM) network was developed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. The Gated Recurrent Unit (GRU), introduced in 2014, is a simpler version of LSTM. Let's review how LSTM and GRU solve the problem of learning long-range dependencies.

Long Short Term Memory (LSTM)

LSTM introduces additional computations in each timestep. However, it can still be treated as a black box unit that, for timestep t, returns one state (h_t), which is forwarded to the next timestep. Internally, these vectors are computed differently. LSTM introduces three new gates: the input (i_t), forget (f_t), and output (o_t) gates. Every timestep also has an internal hidden state (g_t) and an internal memory (c_t). These new units are computed as follows (in standard notation, where σ is the sigmoid function and ∘ denotes elementwise multiplication):

i_t = σ(W_i x_t + U_i h_{t-1})
f_t = σ(W_f x_t + U_f h_{t-1})
o_t = σ(W_o x_t + U_o h_{t-1})
g_t = tanh(W_g x_t + U_g h_{t-1})
c_t = c_{t-1} ∘ f_t + g_t ∘ i_t
h_t = tanh(c_t) ∘ o_t

Now, let's understand these computations. The gates are generated through sigmoid activations, which limit their values to the range (0, 1). Hence, they act as gates by letting out only a fraction of a value when multiplied by another variable. The input gate i_t controls the fraction of the newly computed input to keep. The forget gate f_t determines the effect of the previous timestep, and the output gate o_t controls how much of the internal state to let out. The internal hidden state g_t is calculated from the input to the current timestep and the output of the previous timestep; note that this is the same as computing the internal state in a vanilla RNN. c_t is the internal memory unit of the current timestep: it takes the memory from the previous step, downsized by the forget gate f_t, and adds the effect of the internal hidden state, mixed in by the input gate i_t. Finally, h_t is passed to the next timestep and is computed from the current internal memory and the output gate o_t. The input, forget, and output gates are used to selectively include the previous memory and the current hidden state, which is computed in the same manner as in vanilla RNNs. This gating mechanism allows memory to be transferred over long-range timesteps.

Gated Recurrent Units (GRU)

GRU is simpler than LSTM and has only two internal gates, namely the update gate (z_t) and the reset gate (r_t). The computations of the update and reset gates are as follows:

z_t = σ(W_z x_t + U_z s_{t-1})
r_t = σ(W_r x_t + U_r s_{t-1})

The state s_t of timestep t is computed using the input x_t, the state s_{t-1} from the previous timestep, and the update and reset gates (shown here in one common formulation, with h_t as the candidate state):

h_t = tanh(W x_t + U (s_{t-1} ∘ r_t))
s_t = (1 - z_t) ∘ h_t + z_t ∘ s_{t-1}

The update gate, computed by a sigmoid function, determines how much of the previous step's memory is to be retained in the current timestep. The reset gate controls how to combine the previous memory with the current step's input. Compared to LSTM, which has three gates, GRU has two: it has no output gate and no separate internal memory, both of which are present in LSTM. The update gate in GRU determines how to combine the previous memory with the current memory, combining the functionality achieved by the input and forget gates of LSTM. The reset gate, which combines the effect of the previous memory and the current input, is applied directly to the previous memory. Despite a few differences in how memory is transmitted along the sequence, the gating mechanisms in both LSTM and GRU are meant to learn long-range dependencies in data.

So, which one do we use - LSTM or GRU?

Both LSTM and GRU are capable of handling memory over long sequences, but a common question is which one to use.
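Before weighing the options, it helps to see the size difference concretely. The sketch below (a minimal illustration, not from the book; the layer width of 32 and the 20-step univariate input window are arbitrary assumptions) builds one small forecasting model with each cell type and compares trainable weight counts with Keras:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense

def make_forecaster(cell):
    # One recurrent layer over windows of 20 timesteps of a univariate
    # series, followed by a single-unit forecast head.
    model = Sequential([cell(32, input_shape=(20, 1)), Dense(1)])
    model.compile(loss="mse", optimizer="adam")
    return model

# The LSTM carries four weight sets (three gates plus the candidate state);
# the GRU carries three, so it ends up with roughly a quarter fewer weights.
print("LSTM parameters:", make_forecaster(LSTM).count_params())
print("GRU parameters: ", make_forecaster(GRU).count_params())
```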
LSTM has long been preferred as the first choice for language models, which is evident from its extensive use in language translation, text generation, and sentiment classification. GRU has the distinct advantage of fewer trainable weights compared to LSTM, and it has been applied to tasks where LSTM previously dominated. However, empirical studies show that neither approach outperforms the other on all tasks. Tuning model hyperparameters, such as the dimensionality of the hidden units, improves the predictions of both. A common rule of thumb is to use GRU when there is less training data, as it requires fewer trainable weights, while LSTM proves effective for large datasets, such as the ones used to develop language translation models. This article is an excerpt from the book Practical Time Series Analysis. For more techniques, grab the book now!

Implementing the k-nearest neighbors algorithm in Python

Aaron Lazar
17 Nov 2017
7 min read
[box type="note" align="" class="" width=""]The following is an excerpt from Dávid Natingga's Data Science Algorithms in a Week. [/box] The nearest neighbor algorithm classifies a data instance based on its neighbors. The class of a data instance determined by the k-nearest neighbor algorithm is the class with the highest representation among the k-closest neighbors. In this short tutorial, we will cover the basics of the k-NN algorithm - understanding it and its implementation with a simple example: Mary and her temperature preferences. So let’s get right to it, shall we? Mary and her temperature preferences As an example, if we know that our friend Mary feels cold when it is 10 degrees Celsius, but warm when it is 25 degrees Celsius, then in a room where it is 22 degrees Celsius, the nearest neighbor algorithm would guess that our friend would feel warm, because 22 is closer to 25 than to 10. Suppose we would like to know when Mary feels warm and when she feels cold, as in the previous example, but in addition, wind speed data is also available when Mary was asked if she felt warm or cold: We could represent the data in a graph, as follows: Now, suppose we would like to find out how Mary feels at the temperature 16 degrees Celsius with a wind speed of 3km/h using the 1-NN algorithm: For simplicity, we will use a Manhattan metric to measure the distance between the neighbors on the grid. The Manhattan distance dMan of a neighbor N1=(x1,y1) from the neighbor N2=(x2,y2) is defined to be dMan=|x1- x2|+|y1- y2|. Let us label the grid with distances around the neighbors to see which neighbor with a known class is closest to the point we would like to classify: We can see that the closest neighbor with a known class is the one with the temperature 15 (blue) degrees Celsius and the wind speed 5km/h. Its distance from the questioned point is three units. Its class is blue (cold). The closest red (warm) neighbor is away four units from the questioned point. Since we are using the 1-nearest neighbor algorithm, we just look at the closest neighbor and, therefore, the class of the questioned point should be blue (cold). By applying this procedure to every data point, we can complete the graph as follows: Note that sometimes a data point can be distanced from two known classes with the same distance: for example, 20 degrees Celsius and 6km/h. In such situations, we would prefer one class over the other or ignore these boundary cases. The actual result depends on the specific implementation of an algorithm. Implementation of k-nearest neighbors algorithm in Python We’ll implement the k-NN algorithm in Python to find Mary's temperature preference: # source_code/1/mary_and_temperature_preferences/knn_to_data.py # Applies the knn algorithm to the input data. # The input text file is assumed to be of the format with one line per # every data entry consisting of the temperature in degrees Celsius, # wind speed and then the classification cold/warm. import sys sys.path.append('..') sys.path.append('../../common') import knn # noqa import common # noqa # Program start # E.g. "mary_and_temperature_preferences.data" input_file = sys.argv[1] # E.g. 
"mary_and_temperature_preferences_completed.data" output_file = sys.argv[2] k = int(sys.argv[3]) x_from = int(sys.argv[4]) x_to = int(sys.argv[5]) y_from = int(sys.argv[6]) y_to = int(sys.argv[7]) data = common.load_3row_data_to_dic(input_file) new_data = knn.knn_to_2d_data(data, x_from, x_to, y_from, y_to, k) common.save_3row_data_from_dic(output_file, new_data) # source_code/common/common.py # ***Library with common routines and functions*** def dic_inc(dic, key): if key is None: Pass if dic.get(key, None) is None: dic[key] = 1 Else: dic[key] = dic[key] + 1 # source_code/1/knn.py # ***Library implementing knn algorihtm*** def info_reset(info): info['nbhd_count'] = 0 info['class_count'] = {} # Find the class of a neighbor with the coordinates x,y. # If the class is known count that neighbor. def info_add(info, data, x, y): group = data.get((x, y), None) common.dic_inc(info['class_count'], group) info['nbhd_count'] += int(group is not None) # Apply knn algorithm to the 2d data using the k-nearest neighbors with # the Manhattan distance. # The dictionary data comes in the form with keys being 2d coordinates # and the values being the class. # x,y are integer coordinates for the 2d data with the range # [x_from,x_to] x [y_from,y_to]. def knn_to_2d_data(data, x_from, x_to, y_from, y_to, k): new_data = {} info = {} # Go through every point in an integer coordinate system. for y in range(y_from, y_to + 1): for x in range(x_from, x_to + 1): info_reset(info) # Count the number of neighbors for each class group for # every distance dist starting at 0 until at least k # neighbors with known classes are found. for dist in range(0, x_to - x_from + y_to - y_from): # Count all neighbors that are distanced dist from # the point [x,y]. if dist == 0: info_add(info, data, x, y) Else: for i in range(0, dist + 1): info_add(info, data, x - i, y + dist - i) info_add(info, data, x + dist - i, y - i) for i in range(1, dist): info_add(info, data, x + i, y + dist - i) info_add(info, data, x - dist + i, y - i) # There could be more than k-closest neighbors if the # distance of more of them is the same from the point # [x,y]. But immediately when we have at least k of # them, we break from the loop. if info['nbhd_count'] >= k: Break class_max_count = None # Choose the class with the highest count of the neighbors # from among the k-closest neighbors. for group, count in info['class_count'].items(): if group is not None and (class_max_count is None or count > info['class_count'][class_max_count]): class_max_count = group new_data[x, y] = class_max_count return new_data Input: The program above will use the file below as the source of the input data. The file contains the table with the known data about Mary's temperature preferences: # source_code/1/mary_and_temperature_preferences/ marry_and_temperature_preferences.data 10 0 cold 25 0 warm 15 5 cold 20 3 warm 18 7 cold 20 10 cold 22 5 warm 24 6 warm Output: We run the implementation above on the input file mary_and_temperature_preferences.data using the k-NN algorithm for k=1 neighbors. The algorithm classifies all the points with the integer coordinates in the rectangle with a size of (30-5=25) by (10-0=10), so with the a of (25+1) * (10+1) = 286 integer points (adding one to count points on boundaries). Using the wc command, we find out that the output file contains exactly 286 lines - one data item per point. Using the head command, we display the first 10 lines from the output file. 
We visualize all the data from the output file in the next section:

```
$ python knn_to_data.py mary_and_temperature_preferences.data mary_and_temperature_preferences_completed.data 1 5 30 0 10

$ wc -l mary_and_temperature_preferences_completed.data
286 mary_and_temperature_preferences_completed.data

$ head -10 mary_and_temperature_preferences_completed.data
7 3 cold
6 9 cold
12 1 cold
16 6 cold
16 9 cold
14 4 cold
13 4 cold
19 4 warm
18 4 cold
15 1 cold
```

So, there you have it! The k-nearest neighbors algorithm explained and implemented in Python. I hope you enjoyed this tutorial and found it interesting. If you want more, go ahead and purchase Dávid Natingga's Data Science Algorithms in a Week, from which this tutorial has been extracted.
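As a quick cross-check (not part of the book's code), the tutorial's 1-NN answer for the point (16, 3) can be reproduced with scikit-learn's Manhattan-metric classifier:

```python
from sklearn.neighbors import KNeighborsClassifier

# Mary's known preferences: (temperature in C, wind speed in km/h) -> class
X = [[10, 0], [25, 0], [15, 5], [20, 3],
     [18, 7], [20, 10], [22, 5], [24, 6]]
y = ["cold", "warm", "cold", "warm", "cold", "cold", "warm", "warm"]

clf = KNeighborsClassifier(n_neighbors=1, metric="manhattan").fit(X, y)
print(clf.predict([[16, 3]]))  # ['cold'], matching the tutorial's result
```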

Visualizing 3D plots in Matplotlib 2.0

Sugandha Lahoti
16 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim titled Matplotlib 2.x By Example.[/box] By transitioning to the three-dimensional space, you may enjoy greater creative freedom when creating visualizations. The extra dimension can also accommodate more information in a single plot. However, some may argue that 3D is nothing more than a visual gimmick when projected to a 2D surface (such as paper) as it would obfuscate the interpretation of data points. In Matplotlib version 2, despite significant developments in the 3D API, annoying bugs or glitches still exist. We will discuss some workarounds toward the end of this article. More powerful Python 3D visualization packages do exist (such as MayaVi2, Plotly, and VisPy), but it's good to use Matplotlib's 3D plotting functions if you want to use the same package for both 2D and 3D plots, or you would like to maintain the aesthetics of its 2D plots. For the most part, 3D plots in Matplotlib have similar structures to 2D plots. As such, we will not go through every 3D plot type in this section. We will put our focus on 3D scatter plots and bar charts. 3D scatter plot Let's try to create a 3D scatter plot. Before doing that, we need some data points in three dimensions (x, y, z): import pandas as pd source = "https://raw.githubusercontent.com/PointCloudLibrary/data/master/tutorials/ ism_train_cat.pcd" cat_df = pd.read_csv(source, skiprows=11, delimiter=" ", names=["x","y","z"], encoding='latin_1') cat_df.head() To declare a 3D plot, we first need to import the Axes3D object from the mplot3d extension in mpl_toolkits, which is responsible for rendering 3D plots in a 2D plane. After that, we need to specify projection='3d' when we create subplots: from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z) plt.show() Behold, the mighty sCATter plot in 3D. Cats are currently taking over the internet. According to the New York Times, cats are "the essential building block of the Internet" (https://www.nytimes.com/2014/07/23/upshot/what-the-internet-can-see-from-your-cat-pictures.html). Undoubtedly, they deserve a place in this chapter as well. Contrary to the 2D version of scatter(), we need to provide X, Y, and Z coordinates when we are creating a 3D scatter plot. Yet the parameters that are supported in 2D scatter() can be applied to 3D scatter() as well: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') # Change the size, shape and color of markers ax.scatter(cat_df.x, cat_df.y, cat_df.z, s=4, c="g", marker="o") plt.show() To change the viewing angle and elevation of the 3D plot, we can make use of view_init(). The azim parameter specifies the azimuth angle in the X-Y plane, while elev specifies the elevation angle. When the azimuth angle is 0, the X-Y plane would appear to the north from you. Meanwhile, an azimuth angle of 180 would show you the south side of the X-Y plane: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z,s=4, c="g", marker="o") # elev stores the elevation angle in the z plane azim stores the # azimuth angle in the x,y plane ax.view_init(azim=180, elev=10) plt.show() 3D bar chart We introduced candlestick plots for showing Open-High-Low-Close (OHLC) financial data. In addition, a 3D bar chart can be employed to show OHLC across time. 
The next figure shows a typical example of plotting a 5-day OHLC bar chart:

```python
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D

# Get 1 and every fifth row for the 5-day AAPL OHLC data
ohlc_5d = stock_df[stock_df["Company"] == "AAPL"].iloc[1::5, :]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Create one color-coded bar chart for Open, High, Low and Close prices.
for color, col, z in zip(['r', 'g', 'b', 'y'],
                         ["Open", "High", "Low", "Close"],
                         [30, 20, 10, 0]):
    xs = np.arange(ohlc_5d.shape[0])
    ys = ohlc_5d[col]
    # Assign color to the bars
    colors = [color] * len(xs)
    ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8, width=5)

plt.show()
```

The method for setting ticks and labels is similar to other Matplotlib plotting functions:

```python
fig = plt.figure(figsize=(9, 7))
ax = fig.add_subplot(111, projection='3d')

# Create one color-coded bar chart for Open, High, Low and Close prices.
for color, col, z in zip(['r', 'g', 'b', 'y'],
                         ["Open", "High", "Low", "Close"],
                         [30, 20, 10, 0]):
    xs = np.arange(ohlc_5d.shape[0])
    ys = ohlc_5d[col]
    # Assign color to the bars
    colors = [color] * len(xs)
    ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8)

# Manually assign the ticks and tick labels
ax.set_xticks(np.arange(ohlc_5d.shape[0]))
ax.set_xticklabels(ohlc_5d["Date"], rotation=20,
                   verticalalignment='baseline',
                   horizontalalignment='right',
                   fontsize='8')
ax.set_yticks([30, 20, 10, 0])
ax.set_yticklabels(["Open", "High", "Low", "Close"])

# Set the z-axis label
ax.set_zlabel('Price (US $)')

# Rotate the viewport
ax.view_init(azim=-42, elev=31)
plt.tight_layout()
plt.show()
```

Caveats to consider while visualizing 3D plots in Matplotlib

Due to the lack of a true 3D graphical rendering backend (such as OpenGL) and a proper algorithm for detecting 3D objects' intersections, the 3D plotting capabilities of Matplotlib are not great, but merely adequate for typical applications. In the official Matplotlib FAQ (https://matplotlib.org/mpl_toolkits/mplot3d/faq.html), the author noted that 3D plots may not look right at certain angles. Besides, we also reported that mplot3d fails to clip bar charts if zlim is set (https://github.com/matplotlib/matplotlib/issues/8902; see also https://github.com/matplotlib/matplotlib/issues/209). Without improvements in the 3D rendering backend, these issues are hard to fix.

To better illustrate the latter issue, let's try to add ax.set_zlim3d(bottom=110, top=150) right above plt.tight_layout() in the previous 3D bar chart. Clearly, something goes wrong, as the bars overshoot the lower boundary of the axes. We will try to address the issue through the following workaround:

```python
from matplotlib.ticker import FuncFormatter  # import needed for the formatter below

# FuncFormatter to add 110 to the tick labels
def major_formatter(x, pos):
    return "{}".format(x + 110)

fig = plt.figure(figsize=(9, 7))
ax = fig.add_subplot(111, projection='3d')

# Create one color-coded bar chart for Open, High, Low and Close prices.
for color, col, z in zip(['r', 'g', 'b', 'y'],
                         ["Open", "High", "Low", "Close"],
                         [30, 20, 10, 0]):
    xs = np.arange(ohlc_5d.shape[0])
    ys = ohlc_5d[col]
    # Assign color to the bars
    colors = [color] * len(xs)
    # Truncate the y-values by 110
    ax.bar(xs, ys - 110, zs=z, zdir='y', color=colors, alpha=0.8)

# Manually assign the ticks and tick labels
ax.set_xticks(np.arange(ohlc_5d.shape[0]))
ax.set_xticklabels(ohlc_5d["Date"], rotation=20,
                   verticalalignment='baseline',
                   horizontalalignment='right',
                   fontsize='8')
ax.set_yticks([30, 20, 10, 0])
ax.set_yticklabels(["Open", "High", "Low", "Close"])
ax.zaxis.set_major_formatter(FuncFormatter(major_formatter))
ax.set_zlabel('Price (US $)')

# Rotate the viewport
ax.view_init(azim=-42, elev=31)
plt.tight_layout()
plt.show()
```

Basically, we truncated the y values by 110 and then used a tick formatter (major_formatter) to shift the tick values back to the original range. For 3D scatter plots, we can simply remove the data points that exceed the boundary of set_zlim3d() in order to generate a proper figure. However, these workarounds may not work for every 3D plot type.

Conclusion

We didn't go into too much detail about the 3D plotting capability of Matplotlib, as it is yet to be polished. For simple 3D plots, Matplotlib already suffices, and the learning curve is reduced if we use the same package for both 2D and 3D plots. You are advised to take a look at MayaVi2, Plotly, and VisPy if you require more powerful 3D plotting functions. If you enjoyed this excerpt, be sure to check out the book it is from.

Visualizing univariate distribution in Seaborn

Sugandha Lahoti
16 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim titled Matplotlib 2.x By Example. [/box] Seaborn by Michael Waskom is a statistical visualization library that is built on top of Matplotlib. It comes with handy functions for visualizing categorical variables, univariate distributions, and bivariate distributions. In this article, we will visualize univariate distribution in Seaborn. Visualizing univariate distribution Seaborn makes the task of visualizing the distribution of a dataset much easier. In this example, we are going to use the annual population summary published by the Department of Economic and Social Affairs, United Nations, in 2015. Projected population figures towards 2100 were also included in the dataset. Let's see how it distributes among different countries in 2017 by plotting a bar plot: import seaborn as sns import matplotlib.pyplot as plt # Extract USA population data in 2017 current_population = population_df[(population_df.Location == 'United States of America') & (population_df.Time == 2017) & (population_df.Sex != 'Both')] # Population Bar chart sns.barplot(x="AgeGrp",y="Value", hue="Sex", data = current_population) # Use Matplotlib functions to label axes rotate tick labels ax = plt.gca() ax.set(xlabel="Age Group", ylabel="Population (thousands)") ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=45) plt.title("Population Barchart (USA)") # Show the figure plt.show() Bar chart in Seaborn The seaborn.barplot() function shows a series of data points as rectangular bars. If multiple points per group are available, confidence intervals will be shown on top of the bars to indicate the uncertainty of the point estimates. Like most other Seaborn functions, various input data formats are supported, such as Python lists, Numpy arrays, pandas Series, and pandas DataFrame. A more traditional way to show the population structure is through the use of a population pyramid. So what is a population pyramid? As its name suggests, it is a pyramid-shaped plot that shows the age distribution of a population. It can be roughly classified into three classes, namely constrictive, stationary, and expansive for populations that are undergoing negative, stable, and rapid growth respectively. For instance, constrictive populations have a lower proportion of young people, so the pyramid base appears to be constricted. Stable populations have a more or less similar number of young and middle-aged groups. Expansive populations, on the other hand, have a large proportion of youngsters, thus resulting in pyramids with enlarged bases. 
We can build a population pyramid by plotting two bar charts on two subplots with a shared y axis:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Extract USA population data in 2017
current_population = population_df[(population_df.Location == 'United States of America') &
                                   (population_df.Time == 2017) &
                                   (population_df.Sex != 'Both')]

# Change the age group to descending order
current_population = current_population.iloc[::-1]

# Create two subplots with shared y-axis
fig, axes = plt.subplots(ncols=2, sharey=True)

# Bar chart for male
sns.barplot(x="Value", y="AgeGrp", color="darkblue", ax=axes[0],
            data=current_population[(current_population.Sex == 'Male')])
# Bar chart for female
sns.barplot(x="Value", y="AgeGrp", color="darkred", ax=axes[1],
            data=current_population[(current_population.Sex == 'Female')])

# Use Matplotlib function to invert the first chart
axes[0].invert_xaxis()

# Use Matplotlib function to show tick labels in the middle
axes[0].yaxis.tick_right()

# Use Matplotlib functions to label the axes and titles
axes[0].set_title("Male")
axes[1].set_title("Female")
axes[0].set(xlabel="Population (thousands)", ylabel="Age Group")
axes[1].set(xlabel="Population (thousands)", ylabel="")
fig.suptitle("Population Pyramid (USA)")

# Show the figure
plt.show()
```

Since Seaborn is built on top of the solid foundations of Matplotlib, we can customize the plot easily using built-in functions of Matplotlib. In the preceding example, we used matplotlib.axes.Axes.invert_xaxis() to flip the male population plot horizontally, followed by changing the location of the tick labels to the right-hand side using matplotlib.axis.YAxis.tick_right(). We further customized the titles and axis labels for the plot using a combination of matplotlib.axes.Axes.set_title(), matplotlib.axes.Axes.set(), and matplotlib.figure.Figure.suptitle().

Let's try to plot the population pyramids for Cambodia and Japan as well by changing the line population_df.Location == 'United States of America' to population_df.Location == 'Cambodia' or population_df.Location == 'Japan'. Can you classify the pyramids into one of the three population pyramid classes?

To see how Seaborn simplifies the code for relatively complex plots, let's see how a similar plot can be achieved using vanilla Matplotlib.
First, like the previous Seaborn-based example, we create two subplots with a shared y axis:

```python
fig, axes = plt.subplots(ncols=2, sharey=True)
```

Next, we plot horizontal bar charts using matplotlib.pyplot.barh(), set the location and labels of the ticks, and then adjust the subplot spacing:

```python
# Get a list of tick positions according to the data bins
y_pos = range(len(current_population.AgeGrp.unique()))

# Horizontal barchart for male
axes[0].barh(y_pos,
             current_population[(current_population.Sex == 'Male')].Value,
             color="darkblue")
# Horizontal barchart for female
axes[1].barh(y_pos,
             current_population[(current_population.Sex == 'Female')].Value,
             color="darkred")

# Show a tick for each data point, and label with the age group
axes[0].set_yticks(y_pos)
axes[0].set_yticklabels(current_population.AgeGrp.unique())

# Increase spacing between subplots to avoid clipping of ytick labels
plt.subplots_adjust(wspace=0.3)
```

Finally, we use the same code to further customize the look and feel of the figure:

```python
# Invert the first chart
axes[0].invert_xaxis()

# Show tick labels in the middle
axes[0].yaxis.tick_right()

# Label the axes and titles
axes[0].set_title("Male")
axes[1].set_title("Female")
axes[0].set(xlabel="Population (thousands)", ylabel="Age Group")
axes[1].set(xlabel="Population (thousands)", ylabel="")
fig.suptitle("Population Pyramid (USA)")

# Show the figure
plt.show()
```

When compared to the Seaborn-based code, the pure Matplotlib implementation requires extra lines to define the tick positions, tick labels, and subplot spacing. For some other Seaborn plot types that include extra statistical calculations, such as linear regression and Pearson correlation, the code reduction is even more dramatic. Therefore, Seaborn is a "batteries-included" statistical visualization package that allows its users to write less verbose code.

Histogram and distribution fitting in Seaborn

In the population example, the raw data was already binned into different age groups. What if the data is not binned (for example, the Big Mac Index data)? It turns out seaborn.distplot can help us process the data into bins and show us a histogram as a result. Let's look at this example:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Get the BigMac index in 2017
current_bigmac = bigmac_df[(bigmac_df.Date == "2017-01-31")]

# Plot the histogram
ax = sns.distplot(current_bigmac.dollar_price)
plt.show()
```

The seaborn.distplot function expects either a pandas Series, a single-dimensional numpy.array, or a Python list as input. It then determines the size of the bins according to the Freedman-Diaconis rule, and finally fits a kernel density estimate (KDE) over the histogram. KDE is a non-parametric method used to estimate the distribution of a variable. We can also supply a parametric distribution, such as a beta, gamma, or normal distribution, to the fit argument. In this example, we are going to fit the normal distribution from the scipy.stats package over the Big Mac Index dataset:

```python
from scipy import stats

ax = sns.distplot(current_bigmac.dollar_price, kde=False, fit=stats.norm)
plt.show()
```

You have now equipped yourself with the knowledge to visualize univariate data in Seaborn as bar charts, histograms, and distribution fits. To have more fun visualizing data with Seaborn and Matplotlib, check out the book this snippet appears in.
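A note for readers on newer Seaborn releases: distplot was deprecated in Seaborn 0.11, and histplot is the closest replacement. A roughly equivalent call (a sketch assuming the same current_bigmac DataFrame as above) would be:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# histplot replaces the deprecated distplot; kde=True overlays a kernel
# density estimate, and stat="density" puts the bars on the same scale.
ax = sns.histplot(current_bigmac.dollar_price, kde=True, stat="density")
plt.show()
```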

Using R to implement Kriging - A Spatial Interpolation technique for Geostatistics data

Guest Contributor
15 Nov 2017
7 min read
The Kriging interpolation technique is being increasingly used in geostatistics these days. But how does Kriging work to create a prediction, after all? To start with, Kriging is a method where the distance and direction between the sample data points indicate a spatial correlation. This correlation is then used to explain the variation in the surface. In cases where the distance and direction give appropriate spatial correlation, Kriging is able to predict surface variations most effectively. As such, we often see Kriging being used in geology and soil science.

Kriging generates an optimal output surface for prediction, which it estimates based on a scattered set of points with z-values. The procedure involves investigating the spatial behavior of the z-values in an 'interactive' manner, where advanced statistical relationships are measured (autocorrelation). Mathematically speaking, Kriging is somewhat similar to regression analysis: its whole idea is to predict the unknown value of a function at a given point by calculating the weighted average of the known values of the function in the neighborhood of that point. To get the output value for a location, we take the weighted sum of the already measured values in the surroundings (all the points that we intend to consider around a specific radius), using a formula such as the following (written here in standard geostatistical notation):

Ẑ(s_0) = Σ_{i=1..N} λ_i · Z(s_i)

where Z(s_i) is the measured value at the ith location, λ_i is its weight, and s_0 is the prediction location. In a regression equation, λ_i would represent the weights based on how far the points are from the prediction location. In Kriging, however, λ_i represents not just how far the measured points are from the prediction location, but also how the measured points are arranged spatially around it. First, variograms and covariance functions are generated to model the spatial autocorrelation of the data. Then, that model is used to make predictions. Thus, unlike deterministic interpolation techniques like Inverse Distance Weighted (IDW) and Spline interpolation, Kriging goes beyond just estimating a prediction surface: it also brings an element of certainty to that prediction surface. That is why experts rate Kriging so highly for strong predictions. Instead of a weather report forecasting 2 mm of rain on a certain Saturday, Kriging also tells you the "probability" of 2 mm of rain on that Saturday.

We hope you enjoy this simple R tutorial on Kriging by Berry Boessenkool.
Geostatistics: Kriging - spatial interpolation between points, using semivariance

We will be covering the following sections in our tutorial, with supporting illustrations: packages, reading a shapefile, the variogram, Kriging, and plotting.

Kriging: packages

```r
install.packages("rgeos")
install.packages("sf")
install.packages("geoR")
library(sf)   # for st_read (read shapefiles),
              # st_centroid, st_area, st_union
library(geoR) # as.geodata, variog, variofit,
              # krige.control, krige.conv, legend.krige
## Warning: package 'sf' was built under R version 3.4.1
```

Kriging: read shapefile / few points for demonstration

```r
x <- c(1,1,2,2,3,3,3,4,4,5,6,6,6)
y <- c(4,7,3,6,2,4,6,2,6,5,1,5,7)
z <- c(5,9,2,6,3,5,9,4,8,8,3,6,7)
plot(x, y, pch="+", cex=z/4)
```

Kriging: read shapefile II

```r
GEODATA <- as.geodata(cbind(x,y,z))
plot(GEODATA)
```

Kriging: Variogram I

```r
EMP_VARIOGRAM <- variog(GEODATA)
## variog: computing omnidirectional variogram
FIT_VARIOGRAM <- variofit(EMP_VARIOGRAM)
## variofit: covariance model used is matern
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## Warning in variofit(EMP_VARIOGRAM): initial values not provided - running the default search
## variofit: searching for best initial value ... selected values:
##               sigmasq phi    tausq kappa
## initial.value "9.19"  "3.65" "0"   "0.5"
## status        "est"   "est"  "est" "fix"
## loss value: 401.578968904954
```

Kriging: Variogram II

```r
plot(EMP_VARIOGRAM)
lines(FIT_VARIOGRAM)
```

Kriging: Kriging

```r
res <- 0.1
grid <- expand.grid(seq(min(x), max(x), res), seq(min(y), max(y), res))
krico <- krige.control(type.krige="OK", obj.model=FIT_VARIOGRAM)
krobj <- krige.conv(GEODATA, locations=grid, krige=krico)
## krige.conv: model with constant mean
## krige.conv: Kriging performed using global neighbourhood
# KrigingObjekt
```

Kriging: Plotting I

```r
image(krobj, col=rainbow2(100))
legend.krige(col=rainbow2(100), x.leg=c(6.2,6.7), y.leg=c(2,6),
             vert=T, off=-0.5, values=krobj$predict)
contour(krobj, add=T)
colPoints(x, y, z, col=rainbow2(100), legend=F)
points(x, y)
```

Kriging: Plotting II

```r
library("berryFunctions") # scatterpoints by color
colPoints(x, y, z, add=F, cex=2, legargs=list(y1=0.8, y2=1))
```

Kriging: Plotting III

```r
colPoints(grid[,1], grid[,2], krobj$predict, add=F, cex=2,
          col2=NA, legargs=list(y1=0.8, y2=1))
```

Time for a real dataset

Precipitation from ca. 250 gauges in Brandenburg, as Thiessen polygons with steep gradients at the edges.

Exercise 41: Kriging

1. Load and plot the shapefile in PrecBrandenburg.zip with sf::st_read.
2. With colPoints from the package berryFunctions, add the precipitation values at the centroids of the polygons.
3. Calculate the variogram and fit a semivariance curve.
4. Perform kriging on a grid with a useful resolution (keep in mind that computing time rises exponentially with grid size).
5. Plot the interpolated values with image or an equivalent (Rclick 4.15) and add contour lines.

What went wrong? (If you used the defaults, the result will be dissatisfying.) How can you fix it?
Solution for exercise 41.1-2: Kriging data

```r
# Shapefile:
p <- sf::st_read("data/PrecBrandenburg/niederschlag.shp", quiet=TRUE)
# Plot prep
pcol <- colorRampPalette(c("red","yellow","blue"))(50)
clss <- berryFunctions::classify(p$P1, breaks=50)$index
# Plot
par(mar = c(0,0,1.2,0))
plot(p, col=pcol[clss], max.plot=1) # P1: Precipitation
# kriging coordinates
cent <- sf::st_centroid(p)
berryFunctions::colPoints(cent$x, cent$y, p$P1, add=T, cex=0.7,
                          legargs=list(y1=0.8, y2=1), col=pcol)
points(cent$x, cent$y, cex=0.7)
```

Solution for exercise 41.3: Variogram

```r
library(geoR)
# Semivariance:
geoprec <- as.geodata(cbind(cent$x, cent$y, p$P1))
vario <- variog(geoprec, max.dist=130000)
## variog: computing omnidirectional variogram
fit <- variofit(vario)
## Warning in variofit(vario): initial values not provided - running the default search
## variofit: searching for best initial value ... selected values:
##               sigmasq   phi        tausq kappa
## initial.value "1326.72" "19999.93" "0"   "0.5"
## status        "est"     "est"      "est" "fix"
## loss value: 107266266.76371
plot(vario); lines(fit)

# distance to closest other point:
d <- sapply(1:nrow(cent), function(i) min(berryFunctions::distance(
       cent$x[i], cent$y[i], cent$x[-i], cent$y[-i])))
hist(d/1000, breaks=20, main="distance to closest gauge [km]")
mean(d) # 8 km
## [1] 8165.633
```

Solution for exercise 41.4-5: Kriging

```r
# Kriging:
res <- 1000 # 1 km, since stations are 8 km apart on average
grid <- expand.grid(seq(min(cent$x), max(cent$x), res),
                    seq(min(cent$y), max(cent$y), res))
krico <- krige.control(type.krige="OK", obj.model=fit)
krobj <- krige.conv(geoprec, locations=grid, krige=krico)
## krige.conv: model with constant mean
## krige.conv: Kriging performed using global neighbourhood
# Set values outside of Brandenburg to NA:
grid_sf <- sf::st_as_sf(grid, coords=1:2, crs=sf::st_crs(p))
isinp <- sapply(sf::st_within(grid_sf, p), length) > 0
krobj2 <- krobj
krobj2$predict[!isinp] <- NA
```

Solution for exercise 41.5: Kriging visualization

```r
geoR:::image.kriging(krobj2, col=pcol)
colPoints(cent$x, cent$y, p$P1, col=pcol, zlab="Prec", cex=0.7,
          legargs=list(y1=0.1, y2=0.8, x1=0.78, x2=0.87, horiz=F))
plot(p, add=T, col=NA, border=8) #; points(cent$x, cent$y, cex=0.7)
```

About the author: Berry started working with R in 2010 during his studies of Geoecology at Potsdam University, Germany. He has since given a number of R programming workshops and tutorials, including full-week workshops in Kyrgyzstan and Kazakhstan. He left the department for environmental science in the summer of 2017 to focus more on software development and teaching in the data science industry. Please follow the GitHub link for detailed explanations of Berry's R courses.

Implementing K-Means Clustering in Python

Aaron Lazar
09 Nov 2017
9 min read
This article is an adaptation of content from the book Data Science Algorithms in a Week, by Dávid Natingga. I've modified it a bit and turned it into a sequence from a thriller, starring Agents Hobbs and O'Connor from the FBI. The idea is to practically show you how to implement k-means clustering in your friendly neighborhood language, Python.

Agent Hobbs: Agent… Agent O'Connor… O'Connor!
Agent O'Connor: Blimey! Uh.. Ohh.. Sorry, sir!
Hobbs: 'Tat's abou' the fifth time oive' caught you sleeping on duty, young man!
O'Connor: Umm. Apologies, sir. I just arrived here, and didn't have much to…
Hobbs: Cut the bull, agent! There's an important case oime workin' on and oi' need information on this righ' awai'! Here's the list of missing persons kidnapped so far by the suspects. The suspects now taunt us with open clues abou' their next target! Based on the information, we've narrowed their target list down to Miss Gibbons and Mr. Hudson.
Hobbs throws a file across O'Connor's desk and says as he storms out the door: You 'ave an hour to find out who needs the special security, so better get working.
O'Connor: Yes, sir! Bloody hell, that was close!

Here's the information O'Connor has: he needs to find the probability that the 11th person, with a height of 172 cm, a weight of 60 kg, and long hair, is a man. O'Connor gets to work. To simplify matters, he removes the Hair length column as well as the Gender column, since he would like to cluster the people in the table based on their height and weight. To find out whether the 11th person in the table is more likely to be a man or a woman, he uses clustering.

Analysis

O'Connor could apply scaling to the initial data, but to simplify matters, he uses the unscaled data in the algorithm. He clusters the data into two clusters, since there are two possibilities for gender: male or female. Then he aims to classify a person with a height of 172 cm and a weight of 60 kg as more likely to be a man if and only if there are more men in that cluster. The clustering algorithm is a very efficient technique, so classifying this way is very fast, especially when there are a large number of features to classify. So he goes on to apply the k-means clustering algorithm to the data he has.

First, he picks the initial centroids. He lets the first centroid be, for example, a person with a height of 180 cm and a weight of 75 kg, denoted as a vector (180,75). Then the point that is furthest away from (180,75) is (155,46), so that becomes the second centroid. The points that are closer to the first centroid (180,75) by Euclidean distance are (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), and (172,60), so these points are in the first cluster. The points that are closer to the second centroid (155,46) are (155,46), (164,53), (162,52), and (166,55), so these points are in the second cluster. He displays the current situation of these two clusters in the image below.

Figure: Clustering of people by their height and weight

He then recomputes the centroids of the clusters. The blue cluster with the features (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60) will have the centroid ((180+174+184+168+178+170+172)/7, (75+71+83+63+70+59+60)/7) ≈ (175.14, 68.71). The red cluster with the features (155,46), (164,53), (162,52), (166,55) will have the centroid ((155+164+162+166)/4, (46+53+52+55)/4) = (161.75, 51.5). Reclassifying the points using the new centroids, the classes of the points do not change.
The blue cluster will have the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60). The red cluster will have the points (155,46), (164,53), (162,52), (166,55). Therefore, the clustering algorithm terminates with the clusters displayed in the following image:

Figure: Clustering of people by their height and weight

Now he classifies the instance (172,60) as male or female. The instance (172,60) is in the blue cluster, so it is similar to the features in the blue cluster. Are the remaining features in the blue cluster more likely male or female? 5 out of the 6 features are male, and only 1 is female. Since the majority of the features in the blue cluster are male, and the person (172,60) is in the blue cluster as well, he classifies the person with a height of 172 cm and a weight of 60 kg as male.

Implementing K-Means clustering in Python

O'Connor implements the k-means clustering algorithm in Python. It takes as input a CSV file with one data item per line. A data item is converted to a point. The algorithm classifies these points into the specified number of clusters. In the end, the clusters are visualized on a graph using the matplotlib library (note that the book's code is written for Python 2):

```python
# source_code/5/k-means_clustering.py
import math
import sys
import matplotlib.pyplot as plt
import matplotlib
sys.path.append('../common')
import common  # noqa

matplotlib.style.use('ggplot')

# Returns k initial centroids for the given points.
def choose_init_centroids(points, k):
    centroids = []
    centroids.append(points[0])
    while len(centroids) < k:
        # Find the point with the greatest possible distance
        # to the closest already chosen centroid.
        candidate = points[0]
        candidate_dist = min_dist(points[0], centroids)
        for point in points:
            dist = min_dist(point, centroids)
            if dist > candidate_dist:
                candidate = point
                candidate_dist = dist
        centroids.append(candidate)
    return centroids

# Returns the distance of a point from the closest point in points.
def min_dist(point, points):
    min_dist = euclidean_dist(point, points[0])
    for point2 in points:
        dist = euclidean_dist(point, point2)
        if dist < min_dist:
            min_dist = dist
    return min_dist

# Returns the Euclidean distance of two 2-dimensional points.
# (Tuple parameter unpacking in the signature is Python 2 syntax.)
def euclidean_dist((x1, y1), (x2, y2)):
    return math.sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2))

# PointGroup is a tuple that contains in the first coordinate a 2d point
# and in the second coordinate a group which a point is classified to.
def choose_centroids(point_groups, k):
    centroid_xs = [0] * k
    centroid_ys = [0] * k
    group_counts = [0] * k
    for ((x, y), group) in point_groups:
        centroid_xs[group] += x
        centroid_ys[group] += y
        group_counts[group] += 1
    centroids = []
    for group in range(0, k):
        centroids.append((
            float(centroid_xs[group]) / group_counts[group],
            float(centroid_ys[group]) / group_counts[group]))
    return centroids

# Returns the number of the centroid which is closest to the point.
# This number of the centroid is the number of the group where
# the point belongs to.
def closest_group(point, centroids):
    selected_group = 0
    selected_dist = euclidean_dist(point, centroids[0])
    for i in range(1, len(centroids)):
        dist = euclidean_dist(point, centroids[i])
        if dist < selected_dist:
            selected_group = i
            selected_dist = dist
    return selected_group

# Reassigns the groups to the points according to which centroid
# a point is closest to.
def assign_groups(point_groups, centroids):
    new_point_groups = []
    for (point, group) in point_groups:
        new_point_groups.append(
            (point, closest_group(point, centroids)))
    return new_point_groups

# Returns a list of pointgroups given a list of points.
def points_to_point_groups(points):
    point_groups = []
    for point in points:
        point_groups.append((point, 0))
    return point_groups

# Clusters points into the k groups, adding every stage
# of the algorithm to the history, which is returned.
def cluster_with_history(points, k):
    history = []
    centroids = choose_init_centroids(points, k)
    point_groups = points_to_point_groups(points)
    while True:
        point_groups = assign_groups(point_groups, centroids)
        history.append((point_groups, centroids))
        new_centroids = choose_centroids(point_groups, k)
        done = True
        for i in range(0, len(centroids)):
            if centroids[i] != new_centroids[i]:
                done = False
                break
        if done:
            return history
        centroids = new_centroids

# Program start
csv_file = sys.argv[1]
k = int(sys.argv[2])
everything = False
# The third argument sys.argv[3] represents the number of the step of the
# algorithm starting from 0 to be shown, or "last" for displaying the last
# step and the number of the steps.
if sys.argv[3] == "last":
    everything = True
else:
    step = int(sys.argv[3])

# data_to_points, print_cluster_history and draw are helper functions
# defined elsewhere in the book's source code.
data = common.csv_file_to_list(csv_file)
points = data_to_points(data)  # Represent every data item by a point.
history = cluster_with_history(points, k)
if everything:
    print "The total number of steps:", len(history)
    print "The history of the algorithm:"
    (point_groups, centroids) = history[len(history) - 1]
    # Print all the history.
    print_cluster_history(history)
    # But display the situation graphically at the last step only.
    draw(point_groups, centroids)
else:
    (point_groups, centroids) = history[step]
    print "Data for the step number", step, ":"
    print point_groups, centroids
    draw(point_groups, centroids)
```

Input data from gender classification

He saves the data from the classification into a CSV file:

```
# source_code/5/persons_by_height_and_weight.csv
180,75
174,71
184,83
168,63
178,70
170,59
164,53
155,46
162,52
166,55
172,60
```

Program output for the classification data

O'Connor runs the program implementing the k-means clustering algorithm on the data from the classification. The numerical argument 2 means that he would like to cluster the data into 2 clusters:

```
$ python k-means_clustering.py persons_by_height_weight.csv 2 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0),
((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0),
((170.0, 59.0), 0), ((164.0, 53.0), 1), ((155.0, 46.0), 1),
((162.0, 52.0), 1), ((166.0, 55.0), 1), ((172.0, 60.0), 0)]
centroids = [(180.0, 75.0), (155.0, 46.0)]
Step number 1: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0),
((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0),
((170.0, 59.0), 0), ((164.0, 53.0), 1), ((155.0, 46.0), 1),
((162.0, 52.0), 1), ((166.0, 55.0), 1), ((172.0, 60.0), 0)]
centroids = [(175.14285714285714, 68.71428571428571), (161.75, 51.5)]
```

The program also outputs the graph visible in the second image. The parameter last means that O'Connor would like the program to do the clustering until the last step. If he wants to display only the first step (step 0), he can change last to 0 and run:

```
$ python k-means_clustering.py persons_by_height_weight.csv 2 0
```

Upon execution of the program, O'Connor gets the graph of the clusters and their centroids at the initial step, as in the first image. He heaves a sigh of relief.
Hobbs returns just then: Oye there O'Connor, not snoozing again now O'are ya?
O'Connor: Not at all, sir. I think we need to provide Mr. Hudson with special protection, because it looks like he's the next target.
Hobbs raises an eyebrow as he adjusts the gun in its holster: Emm, O'are ya sure, agent?
O'Connor replies with a smile: 83.33% confident, sir!
Hobbs: Wha' are we waiting for then, eh? Let's go get 'em!

If you liked reading this mystery, go ahead and buy the book it was inspired by: Data Science Algorithms in a Week, by Dávid Natingga.
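If you would rather lean on a library than on O'Connor's hand-rolled loop, the same two-cluster split can be reproduced with scikit-learn (a minimal sketch, not part of the book's code):

```python
import numpy as np
from sklearn.cluster import KMeans

# The eleven (height, weight) pairs from the CSV file above.
people = np.array([
    [180, 75], [174, 71], [184, 83], [168, 63], [178, 70], [170, 59],
    [164, 53], [155, 46], [162, 52], [166, 55], [172, 60],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(people)
print(km.cluster_centers_)  # should land near (175.14, 68.71) and (161.75, 51.5)
# The 11th person (172, 60) should fall in the same cluster as (180, 75):
print(km.labels_[-1] == km.labels_[0])
```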

How to Integrate Keras and TensorFlow with R

Amey Varangaonkar
08 Nov 2017
6 min read
[box type="info" align="" class="" width=""]The following is an excerpt from the book Neural Networks with R, Chapter 7, Use Cases of Neural Networks - Advanced Topics, written by Giuseppe Ciaburro and Balaji Venkateswaran. In this post, we see how to integrate popular deep learning libraries and frameworks like TensorFlow with R for effective neural network modeling.[/box] TensorFlow is an open source numerical computing library provided by Google for machine intelligence. It hides all of the programming required to build deep learning models and gives the developers a black box interface to program. The Keras API for TensorFlow provides a high-level interface for neural networks. Python is the de facto programming language for deep learning, but R is catching up. Deep learning libraries are now available with R and a developer can easily download TensorFlow or Keras similar to other R libraries and use them. In TensorFlow, nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. TensorFlow was originally developed by the Google Brain Team within Google's machine intelligence research for machine learning and deep neural networks research, but it is now available in the public domain. TensorFlow exploits GPU processing when configured appropriately. The generic use cases for TensorFlow are as follows: Image recognition Computer vision Voice/sound recognition Time series analysis Language detection Language translation Text-based processing Handwriting Recognition (HWR) Many others Integrating Tensorflow with R In this section, we will see how we can bring TensorFlow libraries into R. This will open up a huge number of possibilities with deep learning using TensorFlow with R. In order to use TensorFlow, we must first install Python. If you don't have a Python installation on your machine, it's time to get it. Python is a dynamic Object-Oriented Programming (OOP) language that can be used for many types of software development. It offers strong support for integration with other languages and programs, is provided with a large standard library, and can be learned within a few days. Many Python programmers can confirm a substantial increase in productivity and feel that it encourages the development of higher quality code and maintainability. Python runs on Windows, Linux/Unix, macOS X, OS/2, Amiga, Palm Handhelds, and Nokia phones. It also works on Java and .NET virtual machines. Python is licensed under the OSI-approved open source license; its use is free, including for commercial products. [box type="shadow" align="" class="" width=""]If you do not know which version to use, there is a document that could help you choose. In principle, if you have to start from scratch, we recommend choosing Python 3, and if you need to use third-party software packages that may not be compatible with Python 3, we recommend using Python 2.7. All information about the available versions and how to install Python is given at https://www.python.org/[/box] After properly installing the Python version of our machine, we have to worry about installing TensorFlow. We can retrieve all library information and available versions of the operating system from the following link: https://www.tensorflow.org/ . Also, in the install section, we can find a series of guides that explain how to install a version of TensorFlow that allows us to write applications in Python. 
Guides are available for the following operating systems:

- Installing TensorFlow on Ubuntu
- Installing TensorFlow on macOS X
- Installing TensorFlow on Windows
- Installing TensorFlow from sources

For example, to install TensorFlow on Windows, we must choose one of the following types:

- TensorFlow with CPU support only
- TensorFlow with GPU support

To install TensorFlow, start a terminal with administrator privileges, then issue the appropriate pip3 install command in that terminal. To install the CPU-only version, enter the following command:

```
C:\> pip3 install --upgrade tensorflow
```

A series of lines will be displayed to keep us informed of the progress of the installation procedure, as shown in the following figure. (Note: the output screenshot has been cropped for clarity purposes.)

At this point, we can return to our favorite environment; I am referring to the R development environment. We will need to install the R interface to TensorFlow. The R interface to TensorFlow lets you work productively using the high-level Keras and Estimator APIs, and when you need more control, it provides full access to the core TensorFlow API. To install the R interface to TensorFlow, follow these steps:

1. First, install the tensorflow R package from CRAN as follows:

```r
install.packages("tensorflow")
```

2. Then, use the install_tensorflow() function to install TensorFlow (for a proper installation procedure, you must have administrator privileges):

```r
library(tensorflow)
install_tensorflow()
```

3. We can confirm that the installation succeeded:

```r
sess = tf$Session()
hello <- tf$constant('Hello, TensorFlow!')
sess$run(hello)
```

This provides a default installation of TensorFlow suitable for use with the tensorflow R package. Read on if you want to learn about additional installation options.

4. If you want to install a version of TensorFlow that takes advantage of NVIDIA GPUs, you need to have the correct CUDA libraries installed. In the following code, we can check the success of the installation:

```r
> library(tensorflow)
> sess = tf$Session()
> hello <- tf$constant('Hello, TensorFlow!')
> sess$run(hello)
b'Hello, TensorFlow!'
```

Integrating Keras with R

Keras is a set of open source neural network libraries coded in Python. It is capable of running on top of MXNet, TensorFlow, or Theano. The steps to install Keras in RStudio are very simple. The following code snippet gives the steps for installation, and we can check whether Keras is working by loading the MNIST dataset. By default, RStudio loads the CPU version of TensorFlow. Once Keras is loaded, we have a powerful set of deep learning libraries that R programmers can utilize to execute neural networks and deep learning. To install Keras for R, follow these steps:

1. Run the following code:

```r
install.packages("devtools")
devtools::install_github("rstudio/keras")
```

2. At this point, we load the keras library:

```r
library(keras)
```

3. Finally, we check whether keras is installed correctly by loading the MNIST dataset:

```r
> data = dataset_mnist()
```

If you found this excerpt useful, make sure you check out the book Neural Networks with R, which contains an interesting coverage of many more such useful and insightful topics.
If you found this excerpt useful, make sure you check out the book Neural Networks with R, containing an interesting coverage of many such useful and insightful topics.


Ensemble Methods to Optimize Machine Learning Models

Guest Contributor
07 Nov 2017
8 min read
[box type="info" align="" class="" width=""]We are happy to bring you an elegant guest post on ensemble methods by Benjamin Rogojan, popularly known as The Seattle Data Guy.[/box] How do data scientists improve their algorithm’s accuracy or improve the robustness of a model? A method that is tried and tested is ensemble learning. It is a must know topic if you claim to be a data scientist and/or a machine learning engineer. Especially, if you are planning to go in for a data science/machine learning interview. Essentially, ensemble learning stays true to the meaning of the word ‘ensemble’. Rather than having several people who are singing at different octaves to create one beautiful harmony (each voice filling in the void of the other), ensemble learning uses hundreds to thousands of models of the same algorithm that work together to find the correct classification. Another way to think about ensemble learning is the fable of the blind men and the elephant. Each blind man in the story seeks to identify the elephant in front of them. However, they work separately and come up with their own conclusions about the animal. Had they worked in unison, they might have been able to eventually figure out what they were looking at. Similarly, ensemble learning utilizes the workings of different algorithms and combines them for a successful and optimal classification. Ensemble methods such as Boosting and Bagging have led to an increased robustness of statistical models with decreased variance. Before we begin with explaining the various ensemble methods, let us have a glance at the common bond between them, Bootstrapping. Bootstrap: The common glue Explaining Bootstrapping can occasionally be missed by many data scientists. However, an understanding of bootstrapping is essential as both the ensemble methods, Boosting and Bagging, are based on the concept of bootstrapping. Figure 1: Bootstrapping In machine learning terms, bootstrap method refers to random sampling with replacement. This sample, after replacement, is referred as a resample. This allows the model or algorithm to get a better understanding of the various biases, variances, and features that exist in the resample. Taking a sample of the data allows the resample to contain different characteristics which the sample might have contained. This would, in turn, affect the overall mean, standard deviation, and other descriptive metrics of a data set. Ultimately, leading to the development of more robust models. The above diagram depicts each sample population having different and non-identical pieces. Bootstrapping is also great for small size data sets that may have a tendency to overfit. In fact, we recommended this to one company who was concerned because their data sets were far from “Big Data”. Bootstrapping can be a solution in this case because algorithms that utilize bootstrapping are more robust and can handle new datasets depending on the methodology chosen (boosting or bagging). The bootstrap method can also test the stability of a solution. By using multiple sample data sets and then testing multiple models, it can increase robustness.  In certain cases, one sample data set may have a larger mean than another or a different standard deviation. This might break a model that was overfitted and not tested using data sets with different variations. One of the many reasons bootstrapping has become so common is because of the increase in computing power. This allows multiple permutations to be done with different resamples. 
Let us now move on to the two most prominent ensemble methods: bagging and boosting.

Ensemble Method 1: Bagging

Bagging actually stands for Bootstrap Aggregating. Most papers or posts that explain bagging refer to Leo Breiman's 1996 paper "Bagging Predictors", in which he describes bagging as follows:

"Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor."

Bagging helps reduce variance in models that are accurate only on the data they were trained on. This problem is known as overfitting. Overfitting happens when a function fits the data too well, typically because the fitted equation is complicated enough to accommodate every data point, outliers included.

Figure 2: Overfitting

Another example of an algorithm that overfits easily is a decision tree. Models developed using decision trees are built from very simple heuristics: a set of if-else statements applied in a specific order. Thus, if the data set is changed to a new one with some bias or a different spread in the underlying features, the model will fail to be as accurate as before, because the new data will not fit the model well.

Bagging gets around the overfitting problem by creating its own variance in the data: it samples with replacement while testing multiple hypotheses (models). In turn, this reduces noise by utilizing multiple resamples that will likely differ in their attributes (median, average, and so on). Once each model has developed a hypothesis, the models use voting for classification or averaging for regression. This is where the "Aggregating" in "Bootstrap Aggregating" comes into play. As the figure below shows, each hypothesis carries the same weight as all the others. (When we later discuss boosting, this is one of the places where the two methodologies differ.)

Figure 3: Bagging

Essentially, all these models run at the same time and vote on which hypothesis is the most accurate. This helps to decrease variance, that is, reduce the overfit.

Ensemble Method 2: Boosting

Boosting refers to a group of algorithms that utilize weighted averages to turn weak learners into stronger learners. Unlike bagging, which runs each model independently and aggregates the outputs at the end without preference for any model, boosting is all about "teamwork": each model that runs dictates which features the next model will focus on.

Boosting also requires bootstrapping, but with an important difference: boosting weights each sample of data, which means some samples will be run more often than others. Why put weights on the samples of data?

Figure 4: Boosting

As boosting runs each model, it tracks which data samples are classified successfully and which are not. The data sets with the most misclassified outputs are given heavier weights, because such data sets are considered to have more complexity, so more iterations are required to train the model properly. During the actual classification stage, boosting also tracks each model's error rate to ensure that better models are given better weights. That way, when the "voting" occurs, as in bagging, the models with better outcomes have a stronger pull on the final output.

Which of these ensemble methods is right for me?

Ensemble methods generally outperform a single model. The sketch below shows both methods applied side by side.
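To make the contrast concrete, here is a minimal scikit-learn sketch (the generated dataset and hyperparameters are purely illustrative) that fits a bagged ensemble and a boosted ensemble of decision trees on the same data:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=13)

# Bagging: each tree sees a bootstrap resample; votes are weighted equally
bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=100, random_state=13)

# Boosting: shallow trees are fit sequentially; misclassified samples
# receive heavier weights in later rounds
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=100, random_state=13)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())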
Because ensembles generally outperform single models, many Kaggle winners have utilized ensemble methodologies. Another important ensemble methodology, not discussed here, is stacking.

Boosting and bagging are both great techniques for decreasing variance, but they won't fix every problem, and they come with issues of their own. There are different reasons why you would use one over the other. Bagging is great for decreasing variance when a model is overfitted. However, boosting is often the better pick of the two methods, because it is also great at decreasing bias in an underfit model. On the other hand, boosting can suffer from performance issues of its own. This is where experience and subject matter expertise come in! It may seem easy to jump on the first model that works, but it is important to analyze the algorithm and all the features it selects. For instance, a decision tree that sets specific leaves shouldn't be implemented if it can't be supported with other data points and visuals. It is not just about trying AdaBoost or random forests on various datasets; the final choice of algorithm should be driven by the results it produces and the support that can be provided for them.

[author title="About the Author"]
Benjamin Rogojan
Ben has spent his career focused on healthcare data. He has developed algorithms to detect fraud, reduce patient readmission, and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben consults privately on data science and engineering problems, both solo and with a company called Acheron Analytics. He has experience both working hands-on with technical problems and helping leadership teams develop strategies to maximize their data.
[/author]


Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib

Wilson D'souza
07 Nov 2017
7 min read
[box type="note" align="" class="" width=""]In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei from their book Apache Spark 2.x Machine Learning Cookbook, we look at how to implement Naïve Bayes classification algorithm with Spark 2.0 MLlib. The associated code and exercise are available at the end of the article.[/box] How to implement Naive Bayes with Spark MLlib Naïve Bayes is one of the most widely used classification algorithms which can be trained and optimized quite efficiently. Spark’s machine learning library, MLlib, primarily focuses on simplifying machine learning and has great support for multinomial naïve Bayes and Bernoulli naïve Bayes. Here we use the famous Iris dataset and use Apache Spark API NaiveBayes() to classify/predict which of the three classes of flower a given set of observations belongs to. This is an example of a multi-class classifier and requires multi-class metrics for measurements of fit. Let’s have a look at the steps to achieve this: For the Naive Bayes exercise, we use a famous dataset called iris.data, which can be obtained from UCI. The dataset was originally introduced in the 1930s by R. Fisher. The set is a multivariate dataset with flower attribute measurements classified into three groups. In short, by measuring four columns, we attempt to classify a species into one of the three classes of Iris flower (that is, Iris Setosa, Iris Versicolour, Iris Virginica).We can download the data from here: https://archive.ics.uci.edu/ml/datasets/Iris/  The column definition is as follows: Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm  Class: -- Iris Setosa => Replace it with 0 -- Iris Versicolour => Replace it with 1 -- Iris Virginica => Replace it with 2 The steps/actions we need to perform on the data are as follows: Download and then replace column five (that is, the label or classification classes) with a numerical value, thus producing the iris.data.prepared data file. The Naïve Bayes call requires numerical labels and not text, which is very common with most tools. Remove the extra lines at the end of the file. Remove duplicates within the program by using the distinct() call. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included. Set up the package location where the program will reside: package spark.ml.cookbook.chapter6 Import the necessary packages for SparkSession to gain access to the cluster and Log4j.Logger to reduce the amount of output produced by Spark:  import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel} import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics, MultilabelMetrics, binary} import org.apache.spark.sql.{SQLContext, SparkSession} import org.apache.log4j.Logger import org.apache.log4j.Level Initialize a SparkSession specifying configurations with the builder pattern, thus making an entry point available for the Spark cluster: val spark = SparkSession .builder .master("local[4]") .appName("myNaiveBayes08") .config("spark.sql.warehouse.dir", ".") .getOrCreate() val data = sc.textFile("../data/sparkml2/chapter6/iris.data.prepared.txt") Parse the data using map() and then build a LabeledPoint data structure. In this case, the last column is the Label and the first four columns are the features. 
Again, we replace the text in the last column (that is, the class of Iris) with the numeric values (that is, 0, 1, and 2) accordingly:

val NaiveBayesDataSet = data.map { line =>
  val columns = line.split(',')
  LabeledPoint(columns(4).toDouble,
    Vectors.dense(columns(0).toDouble, columns(1).toDouble, columns(2).toDouble, columns(3).toDouble))
}

7. Then we make sure that the file does not contain any redundant rows. In this case, it has three redundant rows. We will use the distinct dataset going forward:

println(" Total number of data vectors =", NaiveBayesDataSet.count())
val distinctNaiveBayesData = NaiveBayesDataSet.distinct()
println("Distinct number of data vectors = ", distinctNaiveBayesData.count())

Output:
(Total number of data vectors =,150)
(Distinct number of data vectors = ,147)

8. We inspect the data by examining the output:

distinctNaiveBayesData.collect().take(10).foreach(println(_))

Output:
(2.0,[6.3,2.9,5.6,1.8])
(2.0,[7.6,3.0,6.6,2.1])
(1.0,[4.9,2.4,3.3,1.0])
(0.0,[5.1,3.7,1.5,0.4])
(0.0,[5.5,3.5,1.3,0.2])
(0.0,[4.8,3.1,1.6,0.2])
(0.0,[5.0,3.6,1.4,0.2])
(2.0,[7.2,3.6,6.1,2.5])
..............

9. Split the data into training and test sets using a 30% to 70% ratio. The 13L in this case is simply a seeding number (L stands for the long data type) to make sure the result does not change from run to run when using the randomSplit() method:

val allDistinctData = distinctNaiveBayesData.randomSplit(Array(.30, .70), 13L)
val trainingDataSet = allDistinctData(0)
val testingDataSet = allDistinctData(1)

10. Print the count for each set:

println("number of training data =", trainingDataSet.count())
println("number of test data =", testingDataSet.count())

Output:
(number of training data =,44)
(number of test data =,103)

11. Build the model using train() and the training dataset:

val myNaiveBayesModel = NaiveBayes.train(trainingDataSet)

12. Use the test dataset plus the map() and predict() methods to classify the flowers based on their features:

val predictedClassification = testingDataSet.map(x =>
  (myNaiveBayesModel.predict(x.features), x.label))

13. Examine the predictions via the output:

predictedClassification.collect().foreach(println(_))

(2.0,2.0)
(1.0,1.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(2.0,2.0)
.......

14. Use MulticlassMetrics() to create metrics for the multi-class classifier. As a reminder, this is different from the previous recipe, in which we used BinaryClassificationMetrics():

val metrics = new MulticlassMetrics(predictedClassification)

15. Use the commonly used confusion matrix to evaluate the model:

val confusionMatrix = metrics.confusionMatrix
println("Confusion Matrix= \n", confusionMatrix)

Output:
(Confusion Matrix= ,
35.0  0.0   0.0
0.0   34.0  0.0
0.0   14.0  20.0 )

16. We examine other properties to evaluate the model:

val myModelStat = Seq(metrics.precision, metrics.fMeasure, metrics.recall)
myModelStat.foreach(println(_))

Output:
0.8640776699029126
0.8640776699029126
0.8640776699029126

How it works...

We used the Iris dataset for this recipe, but we prepared the data ahead of time and then selected the distinct rows using the NaiveBayesDataSet.distinct() API. We then proceeded to train the model using the NaiveBayes.train() API. In the last step, we predicted using predict() and then evaluated the model performance via MulticlassMetrics() by outputting the confusion matrix, precision, and F-measure metrics. The idea here was to classify the observations, based on a selected feature set (that is, feature engineering), into classes that correspond to the left-hand label.
The difference here is that we apply joint probability, given conditional probability, to the classification. This concept is known as Bayes' theorem, originally proposed by Thomas Bayes in the 18th century. There is a strong assumption of independence that must hold true for the underlying features for a Bayes classifier to work properly.

At a high level, the way we achieved this method of classification was simply to apply Bayes' rule to our dataset. As a refresher from basic statistics, Bayes' rule can be written as follows:

P(A|B) = P(B|A) * P(A) / P(B)

The formula states that the probability of A, given that B is true, equals the probability of B given that A is true, times the probability of A being true, divided by the probability of B being true. It is a mouthful, but if we step back and think about it, it makes sense.

The Bayes classifier is a simple yet powerful one that allows the user to take the entire probability feature space into consideration. To appreciate its simplicity, one must remember that probability and frequency are two sides of the same coin. The Bayes classifier belongs to the class of incremental learners, which update themselves upon encountering a new sample. This allows the model to update itself on the fly as new observations arrive, rather than operating only in batch mode.

We evaluated the model with different metrics. Since this is a multi-class classifier, we have to use MulticlassMetrics() to examine model accuracy. A small numeric illustration of Bayes' rule follows.
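To see the rule in action with concrete numbers, here is a short Python sketch (the values below are invented purely for illustration and are not derived from the Iris data):

# Hypothetical numbers, purely to illustrate Bayes' rule
p_a = 0.4           # P(A): prior probability of a class
p_b_given_a = 0.9   # P(B|A): probability of observing a feature given the class
p_b = 0.4           # P(B): overall probability of observing the feature

# P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # prints 0.9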
[box type="download" align="" class="" width=""]Download exercise and code files here. Exercise Files_Implementing Naive Bayes algorithm with Spark MLlib[/box]

For more information on MulticlassMetrics, please see the following link:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.evaluation.MulticlassMetrics

Documentation for the constructor can be found here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.classification.NaiveBayes

If you enjoyed this article, you should have a look at Apache Spark 2.x Machine Learning Cookbook, which contains this excerpt.


Implementing Object detection with Go using TensorFlow

Kunal Parikh
07 Nov 2017
10 min read
[box type="note" align="" class="" width=""]The following is an excerpt from the book Machine Learning with Go, Chapter 8, Neural Networks and Deep Learning, written by Daniel Whitenack. The associated code bundle is available at the end of the article.[/box] Deep learning models are powerful! Especially for tasks like computer vision. However, you should also keep in mind that complicated combinations of these neural net components are also extremely hard to interpret. That is, determining why the model made a certain prediction can be near impossible. This can be a problem when you need to maintain compliance in certain industries and jurisdictions, and it also might inhibit debugging or maintenance of your applications. That being said, there are some major efforts to improve the interpretability of deep learning models. Notable among these efforts is the LIME project: Deep learning with Go There are a variety of options when you are looking to build or utilize deep learning models from Go. This, as with deep learning itself, is an ever-changing landscape. However, the options for building, training and utilizing deep learning models in Go are generally as follows: Use a Go package: There are Go packages that allow you to use Go as your main interface to build and train deep learning models. The most features and developed of these packages is Gorgonia. It treats Go as a first-class citizen and is written in Go, even if it does make significant usage of cgo to interface with numerical libraries. Use an API or Go client for a non-Go DL framework: You can interface with popular deep learning services and frameworks from Go including TensorFlow, MachineBox, H2O, and the various cloud providers or third-party API offerings (such as IBM Watson). TensorFlow and Machine Box actually have Go bindings or SDKs, which are continually improving. For the other services, you may need to interact via REST or even call binaries using exec. Use cgo: Of course, Go can talk to and integrate with C/C++ libraries for deep learning, including the TensorFlow libraries and various libraries from Intel. However, this is a difficult road, and it is only recommended when absolutely necessary. As TensorFlow is by far the most popular framework for deep learning (at the moment), we will briefly explore the second category listed here. However, the Tensorflow Go bindings are under active development and some functionality is quite crude at the moment. The TensorFlow team recommends that if you are going to use a TensorFlow model in Go, you first train and export this model using Python. That pre-trained model can then be utilized from Go, as we will demonstrate in the next section. There are a number of members of the community working very hard to make Go more of a first-class citizen for TensorFlow. As such, it is likely that the rough edges of the TensorFlow bindings will be smoothed over the coming year. Setting up TensorFlow for use with Go The TensorFlow team has provided some good docs to install TensorFlow and get it ready for usage with Go. These docs can be found here. 
There are a couple of preliminary steps, but once you have the TensorFlow C libraries installed, you can get the following Go package:

$ go get github.com/tensorflow/tensorflow/tensorflow/go

Everything should be good to go if you were able to get github.com/tensorflow/tensorflow/tensorflow/go without error, but you can make sure that you are ready to use TensorFlow by executing the following tests:

$ go test github.com/tensorflow/tensorflow/tensorflow/go
ok      github.com/tensorflow/tensorflow/tensorflow/go 0.045s

Retrieving and calling a pretrained TensorFlow model

The model that we are going to use is a Google model for object recognition in images called Inception. The model can be retrieved as follows:

$ mkdir model
$ cd model
$ wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip
--2017-09-09 18:29:03-- https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.6.112, 2607:f8b0:4009:812::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.6.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49937555 (48M) [application/zip]
Saving to: ‘inception5h.zip’
inception5h.zip 100%[=========================>] 47.62M 19.0MB/s in 2.5s
2017-09-09 18:29:06 (19.0 MB/s) - ‘inception5h.zip’ saved [49937555/49937555]
$ unzip inception5h.zip
Archive: inception5h.zip
inflating: imagenet_comp_graph_label_strings.txt
inflating: tensorflow_inception_graph.pb
inflating: LICENSE

After unzipping the compressed model, you should see a *.pb file. This is a protobuf file that represents a frozen state of the model. Think back to our simple neural network: the network was fully defined by a series of weights and biases. Although more complicated, this model is defined in a similar way, and those definitions are stored in this protobuf file.

To call this model, we will use the example code from the TensorFlow Go bindings docs. This code loads the model and uses it to detect and label the contents of a *.jpg image. As the code is included in the TensorFlow docs, I will spare the details and just highlight a couple of snippets. To load the model, we perform the following:

// Load the serialized GraphDef from a file.
modelfile, labelsfile, err := modelFiles(*modeldir)
if err != nil {
    log.Fatal(err)
}
model, err := ioutil.ReadFile(modelfile)
if err != nil {
    log.Fatal(err)
}

Then we load the graph definition of the deep learning model and create a new TensorFlow session with the graph, as shown in the following code:

// Construct an in-memory graph from the serialized form.
graph := tf.NewGraph()
if err := graph.Import(model, ""); err != nil {
    log.Fatal(err)
}

// Create a session for inference over graph.
session, err := tf.NewSession(graph, nil)
if err != nil {
    log.Fatal(err)
}
defer session.Close()

Finally, we can make an inference using the model as follows:

// Run inference on *imageFile.
// For multiple images, session.Run() can be called in a loop (and
// concurrently). Alternatively, images can be batched since the model
// accepts batches of image data as input.
tensor, err := makeTensorFromImage(*imagefile)
if err != nil {
    log.Fatal(err)
}
output, err := session.Run(
    map[tf.Output]*tf.Tensor{
        graph.Operation("input").Output(0): tensor,
    },
    []tf.Output{
        graph.Operation("output").Output(0),
    },
    nil)
if err != nil {
    log.Fatal(err)
}

// output[0].Value() is a vector containing probabilities of
// labels for each image in the "batch". The batch size was 1.
// Find the most probable label index.
probabilities := output[0].Value().([][]float32)[0]
printBestLabel(probabilities, labelsfile)

Object detection with Go using TensorFlow

The Go program for object detection, as specified in the TensorFlow GoDocs, can be called as follows:

$ ./myprogram -dir=<path/to/the/model/dir> -image=<path/to/a/jpg/image>

When the program is called, it will utilize the pretrained and loaded model to infer the contents of the specified image. It will then output the most likely contents of that image along with its calculated probability. To illustrate this, let's try performing object detection on the following image of an airplane, saved as airplane.jpg:

Running the TensorFlow model from Go gives the following results:

$ go build
$ ./myprogram -dir=model -image=airplane.jpg
2017-09-09 20:17:30.655757: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
(similar warnings for SSE4.2, AVX, AVX2, and FMA follow)
BEST MATCH: (86% likely) airliner

After some suggestions about speeding up CPU computations, we get a result: airliner. Wow! That's pretty cool. We just performed object recognition with TensorFlow right from our Go program!

Let's try another one. This time, we will use pug.jpg, which looks like the following:

Running our program again with this image gives the following:

$ ./myprogram -dir=model -image=pug.jpg
(the same CPU feature warnings appear again)
BEST MATCH: (84% likely) pug

Success! Not only did the model detect that there was a dog in the picture, it correctly identified that the dog in the picture was a pug.

Let's try just one more. As this is a Go article, we cannot resist trying gopher.jpg, which looks like the following (huge thanks to Renee French, the artist behind the Go gopher):

Running the model gives the following result:

$ ./myprogram -dir=model -image=gopher.jpg
(the same CPU feature warnings appear again)
BEST MATCH: (12% likely) safety pin

Well, I guess we can't win them all. Looks like we need to refactor our model to be able to recognize Go gophers. More specifically, we should probably add a bunch of Go gophers to our training dataset, because a Go gopher is definitely not a safety pin!

[box type="download" align="" class="" width=""]The code for this exercise is available here.[/box]

Summary

Congratulations! We have gone from parsing data with Go to calling deep learning models from Go. You now know the basics of neural networks and can implement and utilize them in your Go programs. In the next chapter, we will discuss how to get these models and applications off of your laptop and run them at production scale in data pipelines.

If you enjoyed the above excerpt from the book Machine Learning with Go, check out the book to learn how to build machine learning apps with Go.

Pattern mining using Spark MLlib - Part 2

Aarthi Kumaraswamy
06 Nov 2017
15 min read
[box type="note" align="" class="" width=""]The following is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. [/box] In part 1 of the tutorial, we motivated and introduced three pattern mining problems along with the necessary notation to properly talk about them. In part 2, we will now discuss how each of these problems can be solved with an algorithm available in Spark MLlib. As is often the case, actually applying the algorithms themselves is fairly simple due to Spark MLlib's convenient run method available for most algorithms. What is more challenging is to understand the algorithms and the intricacies that come with them. To this end, we will explain the three pattern mining algorithms one by one, and study how they are implemented and how to use them on toy examples. Only after having done all this will we apply these algorithms to a real-life data set of click events retrieved from http:/ / MSNBC. com. The documentation for the pattern mining algorithms in Spark can be found at https:/ / spark. apache. org/ docs/ 2. 1. 0/ mllib- frequent- pattern- mining. html. It provides a good entry point with examples for users who want to dive right in. Frequent pattern mining with FP-growth When we introduced the frequent pattern mining problem, we also quickly discussed a strategy to address it based on the apriori principle. The approach was based on scanning the whole transaction database again and again to expensively generate pattern candidates of growing length and checking their support. We indicated that this strategy may not be feasible for very large data. The so-called FP-growth algorithm, where FP stands for frequent pattern, provides an interesting solution to this data mining problem. The algorithm was originally described in Mining Frequent Patterns without Candidate Generation, available at https:/ /www. cs. sfu. ca/~jpei/ publications/sigmod00. pdf. We will start by explaining the basics of this algorithm and then move on to discussing its distributed version, parallel FP-growth, which has been introduced in PFP: Parallel FP-Growth for Query Recommendation, found at https:/ /static.googleusercontent.com/ media/research. google. com/en/ /pubs/ archive/ 34668. pdf. While Spark's implementation is based on the latter paper, it is best to first understand the baseline algorithm and extend from there. The core idea of FP-growth is to scan the transaction database D of interest precisely once in the beginning, find all the frequent patterns of length 1, and build a special tree structure called FP-tree from these patterns. Once this step is done, instead of working with D, we only do recursive computations on the usually much smaller FP-tree. This step is called the FP-growth step of the algorithm, since it recursively constructs trees from the subtrees of the original tree to identify patterns. We will call this procedure fragment pattern growth, which does not require us to generate candidates but is rather built on a divide-and-conquer strategy that heavily reduces the workload in each recursion step. To be more precise, let's first define what an FP-tree is and what it looks like in an example. Recall the example database we used in the last section, shown in Table 1. Our item set consisted of the following 15 grocery items, represented by their first letter: b, c, a, e, d, f, p, m, i, l, o, h, j, k, s. 
We also discussed the frequent items, that is, patterns of length 1: for a minimum support threshold of t = 0.6, they were given by {f, c, b, a, m, p}. In FP-growth, we first use the fact that the ordering of items does not matter for the frequent pattern mining problem; that is, we can choose the order in which to present the frequent items. We do so by ordering them by decreasing frequency. To summarize the situation, let's have a look at the following table:

Transaction ID | Transaction              | Ordered frequent items
1              | a, c, d, f, g, i, m, p   | f, c, a, m, p
2              | a, b, c, f, l, m, o      | f, c, a, b, m
3              | b, f, h, j, o            | f, b
4              | b, c, k, s, p            | c, b, p
5              | a, c, e, f, l, m, n, p   | f, c, a, m, p

Table 3: Continuation of the example started with Table 1, augmented by ordered frequent items.

As we can see, ordering the frequent items like this already helps us identify some structure. For instance, we see that the item set {f, c, a, m, p} occurs twice and is slightly altered once as {f, c, a, b, m}. The key idea of FP-growth is to use this representation to build a tree from the ordered frequent items that reflects the structure and interdependencies of the items in the third column of Table 3. Every FP-tree has a so-called root node that serves as a base for connecting the ordered frequent items as they are inserted. On the right of the following diagram, we see what is meant by this:

Figure 1: FP-tree and header table for our frequent pattern mining running example.

The left-hand side of Figure 1 shows a header table that we will explain and formalize in a bit, while the right-hand side shows the actual FP-tree. For each of the ordered frequent item sets in our example, there is a directed path starting from the root, thereby representing it. Each node of the tree keeps track of not only the frequent item itself but also the number of paths traversing this node. For instance, four of the five ordered frequent item sets start with the letter f and one with c; thus, in the FP-tree, we see f: 4 and c: 1 at the top level. Another interpretation of this fact is that f is a prefix for four item sets and c for one. For another example of this sort of reasoning, let's turn our attention to the lower left of the tree, that is, to the leaf node p: 2. The count of two for p tells us that precisely two identical paths end here, which we already know: {f, c, a, m, p} is represented twice. This observation is interesting, as it already hints at a technique used in FP-growth: starting at the leaf nodes of the tree, or the suffixes of the item sets, we can trace back each frequent item set, and the union of all these distinct root node paths yields all the paths, an important idea for parallelization.

The header table you see on the left of Figure 1 is a smart way of storing items. Note that, by the construction of the tree, a node is not the same as a frequent item; rather, items can and usually do occur multiple times, namely once for each distinct path they are part of. To keep track of items and how they relate, the header table is essentially a linked list of items; that is, each item occurrence is linked to the next by means of this table. We indicated the links for each frequent item by horizontal dashed lines in Figure 1 for illustration purposes.

With this example in mind, let's now give a formal definition of an FP-tree. An FP-tree T is a tree that consists of a root node together with frequent item prefix subtrees starting at the root and a frequent item header table.
Each node of the tree consists of a triple, namely the item name, its occurrence count, and a node link referring to the next node of the same name, or null if there is no such node.

To quickly recap: to build T, we start by computing the frequent items for the given minimum support threshold t and then, starting from the root, insert each path represented by the sorted frequent pattern list of a transaction into the tree. Now, what do we gain from this? The most important property to consider is that all the information needed to solve the frequent pattern mining problem is encoded in the FP-tree T, because we effectively encode all co-occurrences of frequent items with repetition. Since T can also have at most as many nodes as there are occurrences of frequent items, T is usually much smaller than our original database D. This means that we have mapped the mining problem to a problem on a smaller data set, which in itself reduces the computational complexity compared with the naive approach sketched earlier.

Next, we'll discuss how to grow patterns recursively from fragments obtained from the constructed FP-tree. To do so, let's make the following observation: for any given frequent item x, we can obtain all the patterns involving x by following the node links for x, starting from the header table entry for x, and analyzing the respective subtrees. To explain how exactly, we further study our example and, starting at the bottom of the header table, analyze patterns containing p. From our FP-tree T, it is clear that p occurs in two paths: (f:4, c:3, a:3, m:3, p:2) and (c:1, b:1, p:1), following the node links for p. Now, in the first path, p occurs only twice, that is, there can be at most two total occurrences of the pattern {f, c, a, m, p} in the original database D. So, conditional on p being present, the paths involving p actually read as follows: (f:2, c:2, a:2, m:2, p:2) and (c:1, b:1, p:1). In fact, since we know we want to analyze patterns given p, we can shorten the notation a little and simply write (f:2, c:2, a:2, m:2) and (c:1, b:1). This is what we call the conditional pattern base for p. Going one step further, we can construct a new FP-tree from this conditional database. Conditioning on three occurrences of p, this new tree consists of only a single node, namely (c:3). This means that we end up with {c, p} as the single pattern involving p, apart from p itself. To have a better means of talking about this situation, we introduce the following notation: the conditional FP-tree for p is denoted by {(c:3)}|p.

To gain more intuition, let's consider one more frequent item and discuss its conditional pattern base. Continuing bottom to top and analyzing m, we again see two relevant paths: (f:4, c:3, a:3, m:2) and (f:4, c:3, a:3, b:1, m:1). Note that in the first path, we discard the p:2 at the end, since we have already covered the case of p. Following the same logic of reducing all other counts to the count of the item in question and conditioning on m, we end up with the conditional pattern base {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)}. The conditional FP-tree in this situation is thus given by {f:3, c:3, a:3}|m. It is now easy to see that actually every possible combination of m with each of f, c, and a forms a frequent pattern. The full set of patterns, given m, is thus {m}, {am}, {cm}, {fm}, {cam}, {fam}, {fcm}, and {fcam}. A small sketch of this conditioning step is shown below.
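To make the conditioning step tangible, here is a minimal Python sketch (purely illustrative, not how Spark implements it) that computes conditional pattern bases directly from the ordered frequent item lists of Table 3:

# Ordered frequent item lists from Table 3
transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_base(item):
    # For each transaction containing `item`, keep the prefix preceding it
    base = []
    for t in transactions:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:
                base.append(prefix)
    return base

# Two copies of ['f', 'c', 'a', 'm'] and one ['c', 'b'], i.e.
# (f:2, c:2, a:2, m:2) and (c:1, b:1)
print(conditional_pattern_base("p"))

# ['f', 'c', 'a'] twice and ['f', 'c', 'a', 'b'] once, i.e.
# (f:2, c:2, a:2) and (f:1, c:1, a:1, b:1)
print(conditional_pattern_base("m"))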
By now, it should be clear how to continue; we will not carry out this exercise in full but rather summarize its outcome in the following table:

Frequent pattern | Conditional pattern base                  | Conditional FP-tree
p                | {(f:2, c:2, a:2, m:2), (c:1, b:1)}        | {(c:3)}|p
m                | {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)}   | {f:3, c:3, a:3}|m
b                | {(f:1, c:1, a:1), (f:1), (c:1)}           | null
a                | {(f:3, c:3)}                              | {(f:3, c:3)}|a
c                | {(f:3)}                                   | {(f:3)}|c
f                | null                                      | null

Table 4: The complete list of conditional FP-trees and conditional pattern bases for our running example.

As this derivation required a lot of attention to detail, let's take a step back and summarize the situation so far. Starting from the original FP-tree T, we iterated through all the items using node links. For each item x, we constructed its conditional pattern base and its conditional FP-tree. Doing so, we used the following two properties:

- We discarded all the items following x in each potential pattern, that is, we only kept the prefix of x.
- We modified the item counts in the conditional pattern base to match the count of x.

A path modified using these two properties is called the transformed prefix path of x.

To finally state the FP-growth step of the algorithm, we need two more fundamental observations that we have already implicitly used in the example. Firstly, the support of an item in a conditional pattern base is the same as that of its representation in the original database. Secondly, starting from a frequent pattern x in the original database and an arbitrary set of items y, we know that xy is a frequent pattern if and only if y is. These two facts can easily be derived in general, but they should be clearly demonstrated by the preceding example. What this means is that we can completely focus on finding patterns in conditional pattern bases, as joining them with frequent patterns yields a pattern again, and this way, we can find all the patterns.

This mechanism of recursively growing patterns by computing conditional pattern bases is called pattern growth, which is why FP-growth bears its name. With all this in mind, we can now summarize the FP-growth procedure in pseudocode, as follows:

def fpGrowth(tree: FPTree, i: Item):
    if tree consists of a single path P:
        compute transformed prefix path P' of P
        return all combinations p in P' joined with i
    else:
        for each item in tree:
            newI = i joined with item
            construct conditional pattern base and conditional FP-tree newTree
            call fpGrowth(newTree, newI)

With this procedure, we can summarize our description of the complete FP-growth algorithm as follows:

1. Compute the frequent items from D and compute the original FP-tree T from them (FP-tree computation).
2. Run fpGrowth(T, null) (FP-growth computation).

Having understood the base construction, we can now proceed to discuss a parallel extension of base FP-growth, that is, the basis of Spark's implementation. Parallel FP-growth, or PFP for short, is a natural evolution of FP-growth for parallel computing engines such as Spark. It addresses the following problems with the baseline algorithm:

- Distributed storage: For frequent pattern mining, our database D may not fit into memory, which can already render FP-growth in its original form inapplicable. Spark does help in this regard for obvious reasons.
- Distributed computing: With distributed storage in place, we have to take care of suitably parallelizing all the steps of the algorithm as well, and PFP does precisely this.
- Adequate support values: When dealing with finding frequent patterns, we usually do not want to set the minimum support threshold t too high, so as to find interesting patterns in the long tail. However, a small t might prevent the FP-tree from fitting into memory for a sufficiently large D, which would force us to increase t. PFP successfully addresses this problem as well, as we will see.

The basic outline of PFP, with Spark in mind for the implementation, is as follows:

1. Sharding: Instead of storing our database D on a single machine, we distribute it to multiple partitions. Regardless of the particular storage layer, using Spark we can, for instance, create an RDD to load D.
2. Parallel frequent item count: The first step of computing the frequent items of D can be naturally performed as a map-reduce operation on an RDD.
3. Building groups of frequent items: The set of frequent items is divided into a number of groups, each with a unique group ID.
4. Parallel FP-growth: The FP-growth step is split into two phases to leverage parallelism:
   - Map phase: The output of a mapper is a pair comprising the group ID and the corresponding transaction.
   - Reduce phase: Reducers collect data according to the group ID and carry out FP-growth on these group-dependent transactions.
5. Aggregation: The final step in the algorithm is the aggregation of results over group IDs.

In light of having already spent a lot of time on FP-growth itself, instead of going into too many implementation details of PFP in Spark, let's instead see how to use the actual algorithm on the toy example that we have used throughout:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val transactions: RDD[Array[String]] = sc.parallelize(Array(
  Array("a", "c", "d", "f", "g", "i", "m", "p"),
  Array("a", "b", "c", "f", "l", "m", "o"),
  Array("b", "f", "h", "j", "o"),
  Array("b", "c", "k", "s", "p"),
  Array("a", "c", "e", "f", "l", "m", "n", "p")
))

val fpGrowth = new FPGrowth()
  .setMinSupport(0.6)
  .setNumPartitions(5)
val model = fpGrowth.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}

The code is straightforward. We load the data into transactions and initialize Spark's FPGrowth implementation with a minimum support value of 0.6 and 5 partitions. This returns a model that we can run on the transactions constructed earlier. Doing so gives us access to the patterns, or frequent item sets, for the specified minimum support by calling freqItemsets, which, printed in a formatted way, yields 18 patterns in total (the exact ordering may vary between runs):

[f], 4
[c], 4
[a], 3
[b], 3
[m], 3
[p], 3
[f,c], 3
[f,a], 3
[f,m], 3
[c,a], 3
[c,m], 3
[c,p], 3
[a,m], 3
[f,c,a], 3
[f,c,m], 3
[f,a,m], 3
[c,a,m], 3
[f,c,a,m], 3

[box type="info" align="" class="" width=""]Recall that we have defined transactions as sets, and we often call them item sets. This means that within such an item set, a particular item can only occur once, and FPGrowth depends on this. If we were to replace, for instance, the third transaction in the preceding example with Array("b", "b", "h", "j", "o"), calling run on these transactions would throw an error message. We will see later on how to deal with such situations.[/box]

For readers working in Python, an equivalent sketch using the PySpark API follows.
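The same toy example can be run through Spark's Python API. The following is a sketch assuming a live SparkContext named sc, as in a PySpark shell:

from pyspark.mllib.fpm import FPGrowth

transactions = sc.parallelize([
    ["a", "c", "d", "f", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "c", "e", "f", "l", "m", "n", "p"],
])

# Same minimum support and partition count as the Scala example above
model = FPGrowth.train(transactions, minSupport=0.6, numPartitions=5)
for itemset in model.freqItemsets().collect():
    print(itemset.items, itemset.freq)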
The above is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. To learn how to fully implement and deploy pattern mining applications in Spark, among other machine learning tasks using Spark, check out the book.


Pattern Mining using Spark MLlib - Part 1

Aarthi Kumaraswamy
03 Nov 2017
15 min read
[box type="note" align="" class="" width=""]The following two-part tutorial is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. [/box] When collecting real-world data between individual measures or events, there are usually very intricate and highly complex relationships to observe. The guiding example for this tutorial is the observation of click events that users generate on a website and its subdomains. Such data is both interesting and challenging to investigate. It is interesting, as there are usually many patterns that groups of users show in their browsing behavior and certain rules they might follow. Gaining insights about user groups, in general, is of interest, at least for the company running the website and might be the focus of their data science team. Methodology aside, putting a production system in place that can detect patterns in real time, for instance, to find malicious behavior, can be very challenging technically. It is immensely valuable to be able to understand and implement both the algorithmic and technical sides. In this tutorial, we will look into doing pattern mining in Spark. The tutorial is split up into two main sections. In the first, we will first introduce the three available pattern mining algorithms that Spark currently comes with and then apply them to an interesting dataset. In particular, you will learn the following from this two-part tutorial: The basic principles of frequent pattern mining. Useful and relevant data formats for applications. Understanding and comparing three pattern mining algorithms available in Spark, namely FP-growth, association rules, and prefix span. Frequent pattern mining When presented with a new data set, a natural sequence of questions is: What kind of data do we look at; that is, what structure does it have? Which observations in the data can be found frequently; that is, which patterns or rules can we identify within the data? How do we assess what is frequent; that is, what are the good measures of relevance and how do we test for it? On a very high level, frequent pattern mining addresses precisely these questions. While it's very easy to dive head first into more advanced machine learning techniques, these pattern mining algorithms can be quite informative and help build an intuition about the data. To introduce some of the key notions of frequent pattern mining, let's first consider a somewhat prototypical example for such cases, namely shopping carts. The study of customers being interested in and buying certain products has been of prime interest to marketers around the globe for a very long time. While online shops certainly do help in further analyzing customer behavior, for instance, by tracking the browsing data within a shopping session, the question of what items have been bought and what patterns in buying behavior can be found applies to purely offline scenarios as well. We will see a more involved example of clickstream data accumulated on a website soon; for now, we will work under the assumption that only the events we can track are the actual payment transactions of an item. Just this given data, for instance, for groceries shopping carts in supermarkets or online, leads to quite a few interesting questions, and we will focus mainly on the following three: Which items are frequently bought together? For instance, there is anecdotal evidence suggesting that beer and diapers are often brought together in one shopping session. 
Finding patterns of products that often go together may, for instance, allow a shop to physically place these products closer to each other for an improved shopping experience or promotional value, even if they don't seem to belong together at first sight. In the case of an online shop, this sort of analysis might form the basis of a simple recommender system.

2. Based on the previous question, are there any interesting implications or rules to observe in shopping behavior? Continuing with the shopping cart example, can we establish associations such as: if bread and butter have been bought, we also often find cheese in the shopping cart? Finding such association rules can be of great interest, but it requires clarification of what we consider to be often, that is, what frequent means.

3. Note that, so far, our shopping carts were simply considered a bag of items without additional structure. At least in the online shopping scenario, we can endow the data with more information. One aspect we will focus on is the sequentiality of items; that is, we will take note of the order in which the products have been placed into the cart. With this in mind, similar to the first question, one might ask: which sequences of items can often be found in our transaction data? For instance, the purchase of a large electronic device might be followed up by purchases of additional utility items.

The reason we focus on these three questions in particular is that Spark MLlib comes with precisely three pattern mining algorithms that roughly correspond to the aforementioned questions in their ability to answer them. Specifically, we will carefully introduce FP-growth, association rules, and prefix span, in that order, to address these problems and show how to solve them using Spark. Before doing so, let's take a step back and formally introduce the concepts we have been motivating so far, alongside a running example. We will refer to the preceding three questions throughout the following subsection.

Pattern mining terminology

We will start with a set of items I = {a1, ..., an}, which serves as the base for all the following concepts. A transaction T is a set of items in I, and we say that T is a transaction of length l if it contains l items. A transaction database D is a database of transaction IDs and their corresponding transactions.

To give a concrete example, consider the following situation. Assume that the full item set to shop from is given by I = {bread, cheese, ananas, eggs, donuts, fish, pork, milk, garlic, ice cream, lemon, oil, honey, jam, kale, salt}. Since we will look at a lot of item subsets, to make things more readable later on, we will simply abbreviate these items by their first letters, that is, we'll write I = {b, c, a, e, d, f, p, m, g, i, l, o, h, j, k, s}. Given these items, a small transaction database D could look as follows:

Transaction ID | Transaction
1              | a, c, d, f, g, i, m, p
2              | a, b, c, f, l, m, o
3              | b, f, h, j, o
4              | b, c, k, s, p
5              | a, c, e, f, l, m, n, p

Table 1: A small shopping cart database with five transactions

Frequent pattern mining problem

Given the definition of a transaction database, a pattern P is an item set contained in the transactions of D, and the support, supp(P), of the pattern is the number of transactions for which this is true, divided or normalized by the number of transactions in D:

supp(P) = suppD(P) = |{ T ∈ D | P < T }| / |D|

We use the < symbol to denote P as a subpattern of T or, conversely, call T a superpattern of P.
Note that, in the literature, you will sometimes also find a slightly different version of support that does not normalize the value. For example, the pattern {a, c, f} can be found in transactions 1, 2, and 5. This means that {a, c, f} is a pattern of support 0.6 in our database D of five transactions.

Support is an important notion, as it gives us a first example of measuring the frequency of a pattern, which, in the end, is what we are after. In this context, for a given minimum support threshold t, we say P is a frequent pattern if and only if supp(P) is at least t. In our running example, the frequent patterns of length 1 with minimum support 0.6 are {a}, {b}, {c}, {p}, and {m} with support 0.6, and {f} with support 0.8. In what follows, we will often drop the brackets for items or patterns and write f instead of {f}, for instance.

Given a minimum support threshold, the problem of finding all the frequent patterns is called the frequent pattern mining problem; it is, in fact, the formalized version of the first question asked earlier. Continuing with our example, we have already found all the frequent patterns of length 1 for t = 0.6. How do we find longer patterns? On a theoretical level, given unlimited resources, this is not much of a problem, since all we need to do is count the occurrences of items. On a practical level, however, we need to be smart about how we count to keep the computation efficient. Especially for databases large enough for Spark to come in handy, addressing the frequent pattern mining problem can be very computationally intense. One intuitive way to go about this is as follows:

1. Find all the frequent patterns of length 1, which requires one full database scan. This is how we started in our preceding example.
2. For patterns of length 2, generate all the combinations of frequent 1-patterns, the so-called candidates, and test whether they exceed the minimum support by doing another scan of D. Importantly, we do not have to consider combinations of infrequent patterns, since patterns containing infrequent patterns cannot become frequent. This rationale is called the apriori principle.
3. For longer patterns, continue this procedure iteratively until there are no more patterns left to combine.

This algorithm, using a generate-and-test approach to pattern mining and utilizing the apriori principle to bound combinations, is called the apriori algorithm; a naive sketch of it is shown below. There are many variations of this baseline algorithm, all of which share similar drawbacks in terms of scalability. For instance, multiple full database scans are necessary to carry out the iterations, which might already be prohibitively expensive for huge datasets. On top of that, generating the candidates themselves is already expensive, and computing their combinations might simply be infeasible. In the next section, we will see how a parallel version of an algorithm called FP-growth, available in Spark, can overcome most of the problems just discussed.
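As a minimal illustration of the generate-and-test idea (a deliberately naive Python sketch, not an efficient implementation), the following snippet mines the frequent patterns of Table 1:

# The five transactions of Table 1
D = [set("acdfgimp"), set("abcflmo"), set("bfhjo"),
     set("bcksp"), set("aceflmnp")]

def supp(pattern, database):
    # Fraction of transactions containing the pattern
    return sum(1 for T in database if pattern <= T) / len(database)

def apriori(database, t):
    items = {i for T in database for i in T}
    level = [frozenset([i]) for i in items if supp(frozenset([i]), database) >= t]
    frequent = []
    while level:
        frequent.extend(level)
        # Candidates: unions of frequent patterns that are one item longer
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if supp(c, database) >= t]
    return frequent

for p in sorted(apriori(D, 0.6), key=len):
    print(sorted(p), supp(p, D))

This prints the six frequent singletons found above, followed by the longer frequent patterns built from them.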
The association rule mining problem

To advance our general introduction of concepts, let's next turn to association rules, as first introduced in Mining Association Rules between Sets of Items in Large Databases, available at http://arbor.ee.ntu.edu.tw/~chyun/dmpaper/agrama93.pdf. In contrast to solely counting the occurrences of items in our database, we now want to understand the rules or implications of patterns. That is, given a pattern P1 and another pattern P2, we want to know whether P2 is frequently present whenever P1 can be found in D, and we denote this by writing P1 ⇒ P2. To make this more precise, we need a concept of rule frequency similar to that of support for patterns, namely confidence. For a rule P1 ⇒ P2, confidence is defined as follows:

conf(P1 ⇒ P2) = supp(P1 ∪ P2) / supp(P1)

This can be interpreted as the conditional support of P2 given P1; that is, if we were to restrict D to all the transactions supporting P1, the support of P2 in this restricted database would be equal to conf(P1 ⇒ P2). We call P1 ⇒ P2 a rule in D if it exceeds a minimum confidence threshold t, just as in the case of frequent patterns. Finding all the rules for a given confidence threshold represents the formal answer to the second question, association rule mining. Moreover, in this situation, we call P1 the antecedent and P2 the consequent of the rule. In general, there is no restriction imposed on the structure of either the antecedent or the consequent. However, in what follows, we will assume that the consequent has length 1, for simplicity.

In our running example, the pattern {f, m} occurs three times, while {f, m, p} is present in just two cases, which means that the rule {f, m} ⇒ {p} has confidence 2/3. If we set the minimum confidence threshold to t = 0.6, we can easily check that the following association rules with an antecedent and consequent of length 1 are valid for our case:

{a} ⇒ {c}, {a} ⇒ {f}, {a} ⇒ {m}, {a} ⇒ {p}
{c} ⇒ {a}, {c} ⇒ {f}, {c} ⇒ {m}, {c} ⇒ {p}
{f} ⇒ {a}, {f} ⇒ {c}, {f} ⇒ {m}
{m} ⇒ {a}, {m} ⇒ {c}, {m} ⇒ {f}, {m} ⇒ {p}
{p} ⇒ {a}, {p} ⇒ {c}, {p} ⇒ {f}, {p} ⇒ {m}

Note that these rules are derived from the frequent patterns computed before, which is why, for instance, {b} ⇒ {f} does not appear: its confidence is 2/3, but the combined pattern {b, f} only has support 0.4 and is therefore not frequent. From the preceding definition of confidence, it should now be clear that it is relatively straightforward to compute the association rules once we have the support values of all the frequent patterns. In fact, as we will soon see, Spark's implementation of association rules is based on calculating frequent patterns upfront.

[box type="info" align="" class="" width=""]At this point, it should be noted that while we will restrict ourselves to the measures of support and confidence, there are many other interesting criteria available that we can't discuss in this book; for instance, the concepts of conviction, leverage, or lift. For an in-depth comparison of the other measures, refer to http://www.cse.msu.edu/~ptan/papers/IS.pdf.[/box]
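As a small illustration of this, Spark can derive such rules directly from a trained FP-growth model via generateAssociationRules, which takes the minimum confidence threshold as its argument. The following sketch assumes the model value from the previous FP-growth snippet is still in scope:

// Derive association rules with minimum confidence 0.6 from the
// FPGrowthModel trained in the previous sketch.
val rules = model.generateAssociationRules(0.6)

rules.collect().foreach { rule =>
  println(rule.antecedent.mkString("{", ", ", "}") + " => " +
    rule.consequent.mkString("{", ", ", "}") + ": " + rule.confidence)
}

On the toy database, this should reproduce the list of rules given above, since the rules are generated from exactly the frequent itemsets found before.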
The sequential pattern mining problem

Let's move on to formalizing the third and last pattern mining question we tackle in this chapter by looking at sequences in more detail. A sequence differs from the transactions we looked at before in that the order of items now matters. For a given item set I, a sequence s in I of length l is defined as follows:

s = <s1, s2, ..., sl>

Here, each individual si is a concatenation of items, that is, si = (ai1 ... aim), where aij is an item in I. Note that we care about the order of the sequence items si, but not about the internal order of the individual aij within si. A sequence database S consists of pairs of sequence IDs and sequences, analogous to what we had before. An example of such a database can be found in the following table, in which the letters represent the same items as in our previous shopping cart example:

Sequence ID | Sequence
1 | <a(abc)(ac)d(cf)>
2 | <(ad)c(bc)(ae)>
3 | <(ef)(ab)(df)cb>
4 | <eg(af)cbc>

Table 2: A small sequence database with four short sequences

In the example sequences, note the round brackets that group individual items into a sequence item. Also note that we drop these redundant braces if the sequence item consists of a single item. Importantly, the notion of a subsequence requires a little more care than for unordered structures. We call u = <u1, ..., un> a subsequence of s = <s1, ..., sl> and write u < s if there are indices 1 ≤ i1 < i2 < ... < in ≤ l so that we have the following:

u1 < si1, ..., un < sin

Here, the < signs in the last line mean that uj is a subpattern of sij. Roughly speaking, u is a subsequence of s if all the elements of u are subpatterns of elements of s, in the given order. Equivalently, we call s a supersequence of u. In the preceding example, we see that <a(ab)ac> and <a(cb)(ac)dc> are examples of subsequences of <a(abc)(ac)d(cf)>, and that <(fa)c> is an example of a subsequence of <eg(af)cbc>.

With the help of the notion of supersequences, we can now define the support of a sequence s in a given sequence database S as follows:

suppS(s) = supp(s) = |{ s' ∈ S | s < s' }| / |S|

Note that, structurally, this is the same definition as for plain unordered patterns, but the < symbol now denotes the subsequence relation. As before, we drop the database subscript in the notation of support if the database is clear from the context. Equipped with a notion of support, the definition of sequential patterns follows completely analogously: given a minimum support threshold t, a sequence s in S is said to be a sequential pattern if supp(s) is greater than or equal to t. The formalization of the third question is called the sequential pattern mining problem; that is, find the full set of sequences that are sequential patterns in S for a given threshold t.

Even in our little example with just four sequences, it can already be challenging to manually inspect all the sequential patterns. To give just one example of a sequential pattern with support 1.0, the sequence <ac> of length 2 is a subsequence of all four sequences. Finding all the sequential patterns is an interesting problem, and we will learn about the so-called prefix span algorithm that Spark employs to address it in the following section; a minimal preview of its API follows below.
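To give a first taste of what is coming, here is a minimal sketch of how the sequence database from Table 2 could be handed to Spark MLlib's PrefixSpan implementation. As before, this is a sketch under assumptions: a SparkSession named spark is assumed to be in scope, and a sequence is represented as an array of itemsets, each itself an array of items:

import org.apache.spark.mllib.fpm.PrefixSpan

// The four sequences from Table 2; round brackets become inner arrays.
val sequences = spark.sparkContext.parallelize(Seq(
  Array(Array("a"), Array("a", "b", "c"), Array("a", "c"), Array("d"), Array("c", "f")),
  Array(Array("a", "d"), Array("c"), Array("b", "c"), Array("a", "e")),
  Array(Array("e", "f"), Array("a", "b"), Array("d", "f"), Array("c"), Array("b")),
  Array(Array("e"), Array("g"), Array("a", "f"), Array("c"), Array("b"), Array("c"))
), 2).cache()

val prefixSpan = new PrefixSpan()
  .setMinSupport(1.0)     // keep only patterns supported by every sequence
  .setMaxPatternLength(5) // upper bound on the length of mined patterns

val psModel = prefixSpan.run(sequences)

// Each frequent sequence comes with its absolute frequency; the pattern
// <ac> from the text shows up with frequency 4, that is, support 1.0.
psModel.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("(", "", ")")).mkString("<", "", ">") + ": " + fs.freq)
}

Next time, in part 2 of the tutorial, we will see how to use Spark to solve the above three pattern mining problems using the algorithms introduced. If you enjoyed this tutorial, which is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla, and Michal Malohlava, check out the book for more.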
Building a classification system with Decision Trees in Apache Spark 2.0

Wilson D'souza
02 Nov 2017
9 min read
[box type="note" align="" class="" width=""]In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei from their book Apache Spark 2.x Machine Learning Cookbook we shall explore how to build a classification system with decision trees using Spark MLlib library. The code and data files are available at the end of the article.[/box] A decision tree in Spark is a parallel algorithm designed to fit and grow a single tree into a dataset that can be categorical (classification) or continuous (regression). It is a greedy algorithm based on stumping (binary split, and so on) that partitions the solution space recursively while attempting to select the best split among all possible splits using Information Gain Maximization (entropy based). Apache Spark provides a good mix of decision tree based algorithms fully capable of taking advantage of parallelism in Spark. The implementation ranges from the straightforward Single Decision Tree (the CART type algorithm) to Ensemble Trees, such as Random Forest Trees and GBT (Gradient Boosted Tree). They all have both the variant flavors to facilitate classification (for example, categorical, such as height = short/tall) or regression (for example, continuous, such as height = 2.5 meters). Getting and preparing real-world medical data for exploring Decision Trees in Spark 2.0 To explore the real power of decision trees, we use a medical dataset that exhibits real life non-linearity with a complex error surface. The Wisconsin Breast Cancer dataset was obtained from the University of Wisconsin Hospital from Dr. William H Wolberg. The dataset was gained periodically as Dr. Wolberg reported his clinical cases. The dataset can be retrieved from multiple sources, and is available directly from the University of California Irvine's webserver http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wi sconsin/breast-cancer-wisconsin.data The data is also available from the University of Wisconsin's web Server: ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/cancer1/ datacum The dataset currently contains clinical cases from 1989 to 1991. It has 699 instances, with 458 classified as benign tumors and 241 as malignant cases. Each instance is described by nine attributes with an integer value in the range of 1 to 10 and a binary class label. Out of the 699 instances, there are 16 instances that are missing some attributes. We will remove these 16 instances from the memory and process the rest (in total, 683 instances) for the model calculations. The sample raw data looks like the following: 1000025,5,1,1,1,2,1,3,1,1,2 1002945,5,4,4,5,7,10,3,2,1,2 1015425,3,1,1,1,2,2,3,1,1,2 1016277,6,8,8,1,3,4,3,7,1,2 1017023,4,1,1,3,2,1,3,1,1,2 1017122,8,10,10,8,7,10,9,7,1,4 ... 
The attribute information is as follows:

#  | Attribute                   | Domain
1  | Sample code number          | ID number
2  | Clump Thickness             | 1 - 10
3  | Uniformity of Cell Size     | 1 - 10
4  | Uniformity of Cell Shape    | 1 - 10
5  | Marginal Adhesion           | 1 - 10
6  | Single Epithelial Cell Size | 1 - 10
7  | Bare Nuclei                 | 1 - 10
8  | Bland Chromatin             | 1 - 10
9  | Normal Nucleoli             | 1 - 10
10 | Mitoses                     | 1 - 10
11 | Class                       | 2 for benign, 4 for malignant

Presented in the correct columns (ID Number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class), the data looks like the following:

1000025 5 1 1 1 2 1 3 1 1 2
1002945 5 4 4 5 7 10 3 2 1 2
1015425 3 1 1 1 2 2 3 1 1 2
1016277 6 8 8 1 3 4 3 7 1 2
1017023 4 1 1 3 2 1 3 1 1 2
1017122 8 10 10 8 7 10 9 7 1 4
1018099 1 1 1 1 2 10 3 1 1 2
1018561 2 1 2 1 2 1 3 1 1 2
1033078 2 1 1 1 2 1 1 1 5 2
1033078 4 2 1 1 2 1 2 1 1 2
1035283 1 1 1 1 1 1 3 1 1 2
1036172 2 1 1 1 2 1 2 1 1 2
1041801 5 3 3 3 2 3 4 4 1 4
1043999 1 1 1 1 2 3 3 1 1 2
1044572 8 7 5 10 7 9 5 5 4 4
...

We will now use the breast cancer data to demonstrate, via classification, the Decision Tree implementation in Spark. We will use both information gain and Gini impurity to show how to use the facilities already provided by Spark and avoid redundant coding. This exercise attempts to fit a single tree, using binary classification, to train and predict the label (benign (0.0) or malignant (1.0)) for the dataset.

Implementing Decision Trees in Apache Spark 2.0

1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.

2. Set up the package location where the program will reside:

package spark.ml.cookbook.chapter10

3. Import the necessary packages for the Spark context to get access to the cluster, and Log4j.Logger to reduce the amount of output produced by Spark:

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

4. Create Spark's configuration and the Spark session so we can have access to the cluster:

Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("MyDecisionTreeClassification")
  .config("spark.sql.warehouse.dir", ".")
  .getOrCreate()

5. We read in the original raw data file:

val rawData = spark.sparkContext.textFile("../data/sparkml2/chapter10/breast-cancer-wisconsin.data")

6. We pre-process the dataset:

val data = rawData.map(_.trim)
  .filter(text => !(text.isEmpty || text.startsWith("#") || text.indexOf("?") > -1))
  .map { line =>
    val values = line.split(',').map(_.toDouble)
    val slicedValues = values.slice(1, values.size)
    val featureVector = Vectors.dense(slicedValues.init)
    val label = values.last / 2 - 1
    LabeledPoint(label, featureVector)
  }

First, we trim the line and remove any empty spaces. Once the line is ready for the next step, we remove it if it's empty or if it contains missing values ("?"). After this step, the 16 rows with missing data are removed from the dataset in memory. We then read the comma-separated values into an RDD. Since the first column in the dataset only contains the instance's ID number, it is better to remove this column from the real calculation.
We slice out the ID column with the following command, which removes the first column from the RDD:

val slicedValues = values.slice(1, values.size)

We then put the rest of the numbers into a dense vector. Since the Wisconsin Breast Cancer dataset's classifier is either a benign case (last column value = 2) or a malignant case (last column value = 4), we convert that value using the following command:

val label = values.last / 2 - 1

So the benign case value 2 is converted to 0, and the malignant case value 4 is converted to 1, which makes the later calculations much easier. We then put the preceding row into a labeled point:

Raw data: 1000025,5,1,1,1,2,1,3,1,1,2
Processed data: 5,1,1,1,2,1,3,1,1,0
Labeled point: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0])

7. We verify the raw data count and the processed data count:

println(rawData.count())
println(data.count())

And you will see the following on the console:

699
683

8. We split the whole dataset into training data (70%) and test data (30%) randomly. Please note that the random split will generate around 211 test data points; that is approximately, but not exactly, 30% of the dataset:

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

9. We define a metrics calculation function, which utilizes Spark's MulticlassMetrics:

def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]): MulticlassMetrics = {
  val predictionsAndLabels = data.map(example =>
    (model.predict(example.features), example.label)
  )
  new MulticlassMetrics(predictionsAndLabels)
}

This function reads in the model and the test dataset, and creates a metrics object which contains the confusion matrix mentioned earlier. It also contains the model accuracy, which is one of the indicators for a classification model; note that metrics.precision without arguments is the overall precision across all labels, which for MulticlassMetrics coincides with the model accuracy.

10. We define an evaluate function, which takes some tunable parameters for the Decision Tree model and trains it on the dataset:

def evaluate(
  trainingData: RDD[LabeledPoint],
  testData: RDD[LabeledPoint],
  numClasses: Int,
  categoricalFeaturesInfo: Map[Int, Int],
  impurity: String,
  maxDepth: Int,
  maxBins: Int): Unit = {
  val model = DecisionTree.trainClassifier(trainingData, numClasses,
    categoricalFeaturesInfo, impurity, maxDepth, maxBins)
  val metrics = getMetrics(model, testData)
  println("Using Impurity :" + impurity)
  println("Confusion Matrix :")
  println(metrics.confusionMatrix)
  println("Decision Tree Accuracy: " + metrics.precision)
  println("Decision Tree Error: " + (1 - metrics.precision))
}

The evaluate function reads in several parameters, including the impurity type (Gini or entropy for the model), and generates the metrics for evaluation.

11. We set the following parameters:

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val maxDepth = 5
val maxBins = 32

Since we only have benign (0.0) and malignant (1.0) cases, we set numClasses to 2. The other parameters are tunable, and some of them are algorithm stopping criteria.
12. We evaluate with the Gini impurity first:

evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "gini", maxDepth, maxBins)

From the console output:

Using Impurity :gini
Confusion Matrix :
115.0  5.0
3.0    88.0
Decision Tree Accuracy: 0.9620853080568721
Decision Tree Error: 0.03791469194312791

To interpret the above confusion matrix: accuracy is equal to (115 + 88) / 211 over all test cases, and error is equal to 1 - accuracy.

13. We evaluate with the entropy impurity:

evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "entropy", maxDepth, maxBins)

From the console output:

Using Impurity :entropy
Confusion Matrix :
116.0  4.0
9.0    82.0
Decision Tree Accuracy: 0.9383886255924171
Decision Tree Error: 0.06161137440758291

To interpret the preceding confusion matrix: accuracy is equal to (116 + 82) / 211 over all test cases, and error is equal to 1 - accuracy.

14. We then close the program by stopping the Spark session:

spark.stop()

How it works...

The dataset is a bit more complex than usual, but apart from some extra steps, parsing it remains the same as in other recipes presented in previous chapters. The parsing takes the data in its raw form and turns it into an intermediate format which ends up as a LabeledPoint data structure, which is common in Spark ML schemes:

Raw data: 1000025,5,1,1,1,2,1,3,1,1,2
Processed data: 5,1,1,1,2,1,3,1,1,0
Labeled point: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0])

We use DecisionTree.trainClassifier() to train the classifier tree on the training set. We follow that by examining the various impurity and confusion matrix measurements to demonstrate how to measure the effectiveness of a tree model. The reader is encouraged to look at the output and consult additional machine learning books to understand the concepts of the confusion matrix and impurity measurement to master Decision Trees and their variations in Spark.

There's more...

To visualize it better, we included a sample decision tree workflow in Spark, which reads the data into Spark first. In our case, we create the RDD from the file. We then split the dataset into training data and test data using a random sampling function. After the dataset is split, we use the training dataset to train the model, followed by the test data to test the accuracy of the model. A good model should have a meaningful accuracy value (close to 1). The following figure depicts the workflow:

[Figure: A sample decision tree workflow in Spark]

A sample tree was generated based on the Wisconsin Breast Cancer dataset. The red spots represent malignant cases, and the blue ones the benign cases. We can examine the tree visually in the following figure:

[Figure: A sample decision tree generated from the Wisconsin Breast Cancer dataset]
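If no plotting facility is at hand, the trained tree can also be inspected directly from the console. The following minimal sketch, which is not part of the recipe above, trains a single Gini tree with the same parameters and prints its split structure via DecisionTreeModel's toDebugString:

// Train one tree outside the evaluate() helper so we can hold on to it.
val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, "gini", maxDepth, maxBins)

// toDebugString renders the learned splits and leaf predictions as
// indented text, such as "If (feature 1 <= ...) ... Predict: 0.0".
println(model.toDebugString)

[box type="download" align="" class="" width=""]Download the code and data files here: classification system with Decision Trees in Apache Spark_excercise files[/box]

If you liked this article, please be sure to check out Apache Spark 2.x Machine Learning Cookbook, which consists of this article and many more useful techniques for implementing machine learning solutions with the MLlib library in Apache Spark 2.0.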
Building Motion Charts with Tableau

Ashwin Nair
31 Oct 2017
4 min read
[box type="info" align="" class="" width=""]The following is an excerpt from the book Tableau 10 Bootcamp, Chapter 2, Interactivity – written by Joshua N. Milligan and Donabel Santos. It offers intensive training on Data Visualization and Dashboarding with Tableau 10. In this article, we will learn how to build motion charts with Tableau.[/box] Tableau is an amazing platform for achieving incredible data discovery, analysis, and Storytelling. It allows you to build fully interactive dashboards and stories with your visualizations and insights so that you can share the data story with others. Creating Motion Charts with Tableau Let`s learn how to build motion charts with Tableau. A motion chart, as its name suggests, is a chart that displays the entire trail of changes in data over time by showing movement using the X and Y-axes. It is very much similar to the doodles in our notebooks which seem to come to life after flipping through the pages. It is amazing to see the same kind of movement in action in Tableau using the Pagesshelf. It is work that feels like play. On the Pages shelf, when you drop a field, Tableau creates a sequence of pages that filters the view for each value in that field. Tableau's page control allows us to flip pages, enabling us to see our view come to life. With three predefined speed settings, we can control the speed of the flip. The three settings include one that relates to the slowest speed, the others to the fastest speed. We can also format the marks and show the marks or trails, or both, using page control. In our viz, we have used a circle for marking each year. The circle that moves to a new position each year represents the specific country's new population value. These circles are all connected by trail lines that enable us to simulate a moving time series graph by setting the  mark and trail histories both to show in page control: Let's create an animated motion chart showing the population change over the years for a selected few countries: Open the Motion Chart worksheet and connect to the CO2 (Worldbank) data Source: Open Dimensions and drag Year to the Columns shelf. Open Measures and drag CO2 Emission to the Rows shelf. Right-click on the CO2 Emission axis, and change the title to CO2 Emission (metric tons per capita): In the Marks card, click on the dropdown to change the mark from Automatic to Circle. Open Dimensions and drag Country Name to Color in the Marks card. Also, drag Country Name to the Filter shelf from Dimensions Under the General tab of the Filter window, while the Select from list radio button is selected, select None. Select the Custom value list radio button, still under the General tab, and add China, Trinidad and Tobago, and United States: Click OK when done. This should close the Filter window. Open Dimensions and drag Year to Pages for adding a page control to the view. Click on the Show history checkbox to select it. Click on the drop-down beside Show history and perform the following steps: Select All for Marks to show history for Select Both for Show Using the Year page control, click on the forward arrow to play. This shows the change in the population of the three selected countries over the years. 
[box type="info" align="" class="" width=""]Tip -  In case you ever want to loopback the animation, you can click on the dropdown on the top-right of your page control card, and select Loop Playback:[/box] Note that Tableau Server does not support the animation effect that you see when working on motion charts with Tableau Desktop. Tableau strives for zero footprints when serving the charts and dashboards on the server so that there is no additional download to enable the functionalities. So, the play control does not work the same. No need to fret though. You can click manually on the slider and have a similar effect.  If you liked the above excerpt from the book Tableau 10 Bootcamp, check out the book to learn more data visualization techniques.
(13*3)+ Halloween costume ideas for Data science nerds

Packt Editorial Staff
31 Oct 2017
14 min read
Are you a data scientist, a machine learning engineer, an AI researcher or simply a data enthusiast? Channel the inner data science nerd within you with these geeky ideas for your Halloween costumes! The Data Science Spectrum Don't know what to go as to this evening's party because you've been busy cleaning that terrifying data? Don’t worry, here are some easy-to-put-together Halloween costume ideas just for you. [dropcap]1[/dropcap] Big Data Go as Baymax, the healthcare robot, (who can also turn into battle mode when required). Grab all white clothes that you have. Stuff your tummy with some pillows and wear a white mask with cutouts for eyes. You are all ready to save the world. In fact, convince a friend or your brother to go as Hiro! [dropcap]2[/dropcap] A.I. agent Enter as Agent Smith, the AI antagonist, this Halloween. Lure everyone with your bold black suit paired with a white shirt and a black tie. A pair of polarized sunglasses would replicate you as the AI agent. Capture the crowd by being the most intelligent and cold-hearted personality of all. [dropcap]3[/dropcap] Data Miner Put on your dungaree with a tee. Fix a flashlight atop your cap. Grab a pickaxe from the gardening toolkit, if you have one. Stripe some mud onto your face. Enter the party wheeling with loads of data boxes that you have freshly mined. You’ll definitely grab some traffic for data. Unstructured data anyone? [dropcap]4[/dropcap] Data Lake Go as a Data lake this Halloween. Simply grab any blue item from your closet. Draw some fishes, crabs, and weeds. (Use a child’s marker for that). After all, it represents the data you have. And you’re all set. [dropcap]5[/dropcap] Dark Data Unleash the darkness within your soul! Just kidding. You don’t actually have to turn to the evil side. Just coming up with your favorite black-costume character would do. Looking for inspiration? Maybe, a witch, The dark knight, or The Darth Vader. [dropcap]6[/dropcap] Cloud A fluffy, white cloud is what you need to be this Halloween. Raid your nearby drug store for loads of cotton balls. Better still, tear up that old pillow you have been meaning to throw away for a while. Use the fiber inside to glue onto an unused tee. You will be the cutest cloud ever seen. Don’t forget to carry an umbrella in case you turn grey! [dropcap]7[/dropcap] Predictive Analytics Make your own paper wizard hat with silver stars and moons pasted on it. If you can arrange for an advocate gown, it would be great. Else you could use a long black bed sheet as a cape. And most importantly, a crystal ball to show off some prediction stunts at the Halloween. [dropcap]8[/dropcap] Gradient boosting Enter Halloween as the energy booster. Wear what you want. Grab loads of empty energy drink tetra packs and stick it all over you. Place one on your head too. Wear a nameplate that says “ G-booster Energy drink”. Fuel up some weak models this Halloween. [dropcap]9[/dropcap] Cryptocurrency Wear head to toe black. In fact, paint your face black as well, like the Grim reaper. Then grab a cardboard piece. Cut out a circle, paint it orange, and then draw a gold B symbol, just like you see in a bitcoin. This Halloween costume will definitely grab you the much-needed attention just as this popular cryptocurrency. [dropcap]10[/dropcap] IoT Are you a fan of IoT and the massive popularity it has gained? Then you should definitely dress up as your web-slinging, friendly neighborhood Spiderman. Just grab a spiderman costume from any costume store and attach some handmade web slings. 
Remember to connect with people by displaying your IoT knowledge. [dropcap]11[/dropcap] Self-driving car Choose a mono-color outfit of your choice (P.S. The color you would choose for your car). Cut out four wheels and paste two on your lower calves and two on your arms. Cut out headlights too. Put on a wiper goggle. And yes you do not need a steering wheel or the brakes, clutch and the accelerator. Enter the Halloween at your own pace, go self-driving this Halloween. Bonus point: You can call yourself Bumblebee or Optimus Prime. Machine Learning and Deep learning Frameworks If machine learning or deep learning is your forte, here are some fresh Halloween costume ideas based on some of the popular frameworks in that space. [dropcap]12[/dropcap] Torch Flame up the party with a costume inspired by the fantastic four superhero, Johnny Storm a.k.a The Human Torch. Wear a yellow tee and orange slacks. Draw some orange flames on your tee. And finally, wear a flame-inspired headband. Someone is a hot machine learning library! [dropcap]13[/dropcap] TensorFlow No efforts for this one. Just arrange for a pumpkin costume, paste a paper cut-out of the TensorFlow logo and wear it as a crown. Go as the most powerful and widely popular deep learning library. You will be the star of the Halloween as you are a Google Kid. [dropcap]14[/dropcap] Caffe Go as your favorite Starbucks coffee this Halloween. Wear any of your brown dress/ tee. Draw or stick a Starbucks logo. And then add frothing to the top by bunching up a cream-colored sheet. Mamma Mia! [dropcap]15[/dropcap] Pandas Go as a Panda this Halloween! Better still go as a group of Pandas. The best option is to buy a panda costume. But if you don’t want that, wear a white tee, black slacks, black goggles and some cardboard cutouts for ears. This will make you not only the cutest animal in the party but also a top data manipulation library. Good luck finding your python in the party by the way. [dropcap]16[/dropcap] Jupyter Notebook Go as a top trending open-source web application by dressing up as the largest planet in our solar system. People would surely be intimidated by your mass and also by your computing power. [dropcap]17[/dropcap] H2O Go to Halloween as a world famous open source deep learning platform. No, no, you don’t have to go as the platform itself. Instead go as the chemical alter-ego, water. Wear all blue and then grab some leftover asymmetric, blue cloth pieces to stick at your sides. Thirsty anyone? Data Viz & Analytics Tools If you’re all about analytics and visualization, grab the attention of every data geek in your party by dressing up as your favorite data insight tools. [dropcap]18[/dropcap] Excel Grab an old white tee and paint some green horizontal stripes. You’re all ready to go as the most widely used spreadsheet. The simplest of costumes, yet the most useful - a timeless classic that never goes out of fashion. [dropcap]19[/dropcap] MatLab If you have seriously run out of all costume ideas, going out as MatLab is your only solution. Just grab a blue tablecloth. Stick or sew it with some orange curtain and throw it over your head. You’re all ready to go as the multi-paradigm numerical computing environment. [dropcap]20[/dropcap] Weka Wear a brown overall, a brown wig, and paint your face brown. Make an orange beak out of a chart paper, and wear a pair orange stockings/ socks with your trousers tucked in. You are all set to enter as a data mining bird with ML algorithms and Java under your wings. 
[dropcap]21[/dropcap] Shiny Go all Shimmery!! Get some glitter powder and put it all over you. (You’ll have a tough time removing it though). Else choose a glittery outfit, with glittery shoes, and touch-up with some glitter on your face. Let the party see the bling of R that you bring. You will be the attractive storyteller out there. [dropcap]22[/dropcap] Bokeh A colorful polka-dotted outfit and some dim lights to do the magic. You are all ready to grab the show with such a dazzle. Make sure you enter the party gates with Python. An eye-catching beauty with the beast pair. [dropcap]23[/dropcap] Tableau Enter the Halloween as one of your favorite characters from history. But there is a term and condition for this: You cannot talk or move. Enjoy your Halloween by being still. Weird, but you’ll definitely grab everyone’s eye. [dropcap]24[/dropcap] Microsoft Power BI Power up your Halloween party by entering as a data insights superhero. Wear a yellow turtleneck, a stylish black leather jacket, black pants, some mid-thigh high boots and a slick attitude. You’re ready to save your party! Data Science oriented Programming languages These hand-picked Halloween costume ideas are for you if you consider yourself a top coder. By a top coder we mean you’re all about learning new programming languages in your spare and, well, your not so spare time.   [dropcap]25[/dropcap] Python Easy peasy as the language looks, the reptile is not that easy to handle. A pair of python-printed shirt and trousers would do the job. You could be getting more people giving you candies some out of fear, other out of the ease. Definitely, go as a top trending and a go-to language which everyone loves! And yes, don’t forget the fangs. [dropcap]26[/dropcap] R Grab an eye patch and your favorite leather pants. Wear a loose white shirt with some rugged waistcoat and a sword. Here you are all decked up as a pirate for your next loot. You’ll surely thank me for giving you a brilliant Halloween idea. But yes! Don’t forget to make that Arrrr (R) noise! [dropcap]27[/dropcap] Java Go as a freshly roasted coffee bean! People in your Halloween party would be allured by your aroma. They would definitely compliment your unique idea and also the fact that you’re the most popular programming language. [dropcap]28[/dropcap] SAS March in your Halloween party up as a Special Airforce Service (SAS) agent. You would be disciplined, accurate, precise and smart. Just like the advanced software suite that goes by the same name. You would need a full black military costume, with a gas mask, some fake ammunition from a nearby toy store, and some attitude of course! [dropcap]29[/dropcap] SQL If you pride yourself on being very organized or are a stickler for the rules, you should go as SQL this Halloween. Prep-up yourself with an overall blue outfit. Spike up your hair and spray some temporary green hair color. Cut out bold letters S, Q, and L from a plain white paper and stick them on your chest. You are now ready to enter the Halloween party as the most popular database of all times. Sink in all the data that you collect this Halloween. [dropcap]30[/dropcap] Scala If Scala is your favorite programming language, add a spring to your Halloween by going as, well, a spring! Wear the brightest red that you have. Using a marker, draw some swirls around your body (You can ask your mom to help). Just remember to elucidate a 3D picture. And you’re all set. 
[dropcap]31[/dropcap] Julia If you want to make a red carpet entrance to your Halloween party, go as the Academy award-winning actress, Julia Roberts. You can even take up inspiration from her character in the 90s hit film Pretty Woman. For extra oomph, wear a pink, red, and purple necklace to highlight the Julia programming language [dropcap]32[/dropcap] Ruby Act pricey this Halloween. Be the elegant, dynamic yet simple programming language. Go blood red, wear on your brightest red lipstick, red pumps, dazzle up with all the red accessories that you have. You’ll definitely gather some secret admirers around the hall. [dropcap]33[/dropcap] Go Go as the mascot of Go, the top trending programming language. All you need is a blue mouse costume. Fear not if you don’t have one. Just wear a powder blue jumpsuit, grab a baby pink nose, and clip on a fake single, large front tooth. Ready for the party! [dropcap]34[/dropcap] Octave Go as a numerically competent programming language. And if that doesn’t sound very trendy, go as piano keys depicting an octave. You simply need to wear all white and divide your space into 8 sections. Then draw 5 horizontal black stripes. You won’t be able to do that vertically, well, because they are a big number. Here you go, you’re all set to fill the party with your melody. Fancy an AI system inspired Halloween costume? This is for you if you love the way AI works and the enigma that it has thrown around the world. This is for you if you are spellbound with AI magic. You should go dressed as one of these at your Halloween party this season. Just pick up the AI you want to look like and follow as advised. [dropcap]35[/dropcap] IBM Watson Wear a dark blue hat, a matching long overcoat, a vest and a pale blue shirt with a dark tie tucked into the vest. Complement it with a mustache and a brooding look. You are now ready to be IBM Watson at your Halloween party. [dropcap]36[/dropcap] Apple Siri If you want to be all cool and sophisticated like the Apple’s Siri, wear an alluring black turtleneck dress. Don’t forget to carry your latest iPhone and air pods. Be sure you don’t have a sore throat, in case someone needs your assistance. [dropcap]37[/dropcap] Microsoft Cortana If Microsoft Cortana is your choice of voice assistant, dress up as Cortana, the fictional synthetic intelligence character in the Halo video game series. Wear a blue bodysuit. Get a bob if you’re daring. (A wig would also do). Paint some dark blue robot like designs over your body and well, your face. And you’re all set. [dropcap]38[/dropcap] Salesforce Einstein Dress up as the world’s most famous physicist and also an AI-powered CRM. How? Just grab a white shirt, a blue pullover and a blue tie (Salesforce colors). Finish your look with a brown tweed coat, brown pants and shoes, a rugged white wig and mustache, and a deep thought on your face. [dropcap]39[/dropcap] Facebook Jarvis Get inspired by the Iron man’s Jarvis, the coolest A.I. in the Marvel universe. Just grab a plexiglass, draw some holograms and technological symbols over it with a neon marker. (Try to keep the color palette in shades of blues and reds). And fix this plexiglass in a curved fashion in front of your face by a headband. Do practice saying “Hello Mr. Stark.”  [dropcap]40[/dropcap] Amazon Echo This is also an easy one. Grab a long, black chart paper. Roll it around in a tube form around your body. Draw the Amazon symbol at the bottom with some glittery, silver sketch pen, color your hair blue, and there you go. 
If you have a girlfriend, convince her to go as Amazon Alexa. [dropcap]41[/dropcap] SAP Leonardo Put on a hat, wear a long cloak, some fake overgrown mustache, and beard. Accessorize with a color palette and a paintbrush. You will be the Leonardo da Vinci of the Halloween party. Wait a minute, don’t forget to cut out SAP initials and stick them on your cap. After all, you are entering as SAP’s very own digital revolution system. [dropcap]42[/dropcap] Intel Neon Deck the Halloween hall with a Harley Quinn costume. For some extra dramatization, roll up some neon blue lights around your head. Create an Intel logo out of some blue neon lights and wear it as your neckpiece. [dropcap]43[/dropcap] Microsoft Brainwave This one will require a DIY task. Arrange for a red and green t-shirt, cut them into a vertical half. Stitch it in such a way that the green is on the left and the red on the right. Similarly, do that with your blue and yellow pants; with yellow on the left and blue on the right. You will look like the most powerful Microsoft’s logo. Wear a skullcap with wires protruding out and a Hololens like eyewear to go with. And so, you are all ready to enter the Halloween party as Microsoft’s deep learning acceleration platform for real-time AI. [dropcap]44[/dropcap] Sophia, the humanoid Enter with all the confidence and a top-to-toe professional attire. Be ready to answer any question thrown at you with grace and without a stroke of skepticism. And to top it off, sport a clean shaved head. And there, you are all ready to blow off everyone’s mind with a mix of beauty with super intelligent brains.   Happy Halloween folks!