
Mathematical Imaging

Packt
05 Oct 2015
17 min read
 In this article by Francisco J. Blanco-Silva, author of the book Mastering SciPy, you will learn about image editing and the purpose of editing is the alteration of digital images, usually to perform improvement of its properties or to turn them into an intermediate step for further processing. Let's examine different examples of editing: Transformations of the domain Intensity adjustment Image restoration (For more resources related to this topic, see here.) Transformations of the domain In this setting, we address transformations to images by first changing the location of pixels, rotations, compressions, stretching, swirls, cropping, perspective control, and so on. Once the transformation to the pixels in the domain of the original is performed, we observe the size of the output. If there are more pixels in this image than in the original, the extra locations are filled with numerical values obtained by interpolating the given data. We do have some control over the kind of interpolation performed, of course. To better illustrate these techniques, we will pair an actual image (say, Lena) with a representation of its domain as a checkerboard: In [1]: import numpy as np, matplotlib.pyplot as plt In [2]: from scipy.misc import lena; ...: from skimage.data import checkerboard In [3]: image = lena().astype('float') ...: domain = checkerboard() In [4]: print image.shape, domain.shape Out[4]: (512, 512) (200, 200) Rescale and resize Before we proceed with the pairing of image and domain, we have to make sure that they both have the same size. One quick fix is to rescale both objects, or simply resize one of the images to match the size of the other. Let's go with the first choice, so that we can illustrate the usage of the two functions available for this task in the module skimage.transform to resize and rescale: In [5]: from skimage.transform import rescale, resize In [6]: domain = rescale(domain, scale=1024./200); ...: image = resize(image, output_shape=(1024, 1024), order=3) Observe how, in the resizing operation, we requested a bicubic interpolation. Swirl To perform a swirl, we call the function swirl from the module skimage.transform: In all the examples of this section, we will present the results visually after performing the requested computations. In all cases, the syntax of the call to offer the images is the same. For a given operation mapping, we issue the command display (mapping, image, and domain) where the routine display is defined as follows: def display(mapping, image, domain): plt.figure() plt.subplot(121) plt.imshow(mapping(image)) plt.gray() plt.subplot(122) plt.imshow(mapping(domain)) plt.show() For the sake of brevity, we will not include this command in the following code, but assume it is called every time: In [7]: from skimage.transform import swirl In [8]: mapping = lambda img: swirl(img, strength=6, ...: radius=512) Geometric transformations A simple rotation around any location (no matter whether inside or outside the domain of the image) can be achieved with the function rotate from either module scipy.ndimage or skimage.transform. They are basically the same under the hood, but the syntax of the function in the scikit-image toolkit is more user-friendly: In [10]: from skimage.transform import rotate In [11]: mapping = lambda img: rotate(img, angle=30, resize=True, ....: center=None) This gives a counter-clockwise rotation of 30 degrees (angle=30) around the center of the image (center=None). 
The size of the output image is expanded to guarantee that all the original pixels are present in the output (resize=True): Rotations are a special case of what we call an affine transformation—a combination of rotation with scales (one for each dimension), shear, and translation. Affine transformations are in turn a special case of a homography (a projective transformation). Rather than learning a different set of functions, one for each kind of geometric transformation, the library skimage.transform allows a very comfortable setting. There is a common function (warp) that gets called with the requested geometric transformation and performs the computations. Each suitable geometric transformation is previously initialized with an appropriate python class. For example, to perform an affine transformation with a counter-clockwise rotation angle of 30 degrees about the point with coordinates (512, -2048), and scale factors of 2 and 3 units, respectively for the x and y coordinates, we issue the following command: In [13]: from skimage.transform import warp, AffineTransform In [14]: operation = AffineTransform(scale=(2,3), rotation=np.pi/6, ....: translation = (512, -2048)) In [15]: mapping = lambda img: warp(img, operation) Observe how all the lines in the transformed checkerboard are either parallel or perpendicular—affine transformations preserve angles. The following illustrates the effect of a homography: In [17]: from skimage.transform import ProjectiveTransform In [18]: generator = np.matrix('1,0,10; 0,1,20; -0.0007,0.0005,1'); ....: homography = ProjectiveTransform(matrix=generator); ....: mapping = lambda img: warp(img, homography) Observe how, unlike in the case of an affine transformation, the lines cease to be all parallel or perpendicular. All vertical lines are now incident at a point at infinity. All horizontal lines are also incident at a different point at infinity. The real usefulness of homographies arises, for example, when we need to perform perspective control. For instance, the image skimage.data.text is clearly slanted. By choosing the four corners of what we believe is a perfect rectangle (we estimate such a rectangle by visual inspection), we can compute a homography that transforms the given image into one that is devoid of any perspective. The Python classes representing geometric transformations allow us to perform this estimation very easily, as the following example shows: In [20]: from skimage.data import text In [21]: text().shape Out[21]: (172, 448) In [22]: source = np.array(((155, 15), (65, 40), ....: (260, 130), (360, 95), ....: (155, 15))) In [23]: mapping = ProjectiveTransform() Let's estimate the homography that transforms the given set of points into a perfect rectangle of the size 48 x 256 centered in an output image of size 512 x 512. The choice of size of the output image is determined by the length of the diagonal of the original image (about 479 pixels). This way, after the homography is computed, the output is likely to contain all the pixels from the original. Observe that we have included one of vertices in the source, twice. This is not strictly necessary for the following computations, but will make the visualization of rectangles much easier to code. We use the same trick for the target rectangle. 
In [24]: target = np.array(((256-128, 256-24), (256-128, 256+24), ....: (256+128, 256+24), (256+128, 256-24), ....: (256-128, 256-24))) In [25]: mapping.estimate(target, source) Out[25]: True In [26]: plt.figure(); ....: plt.subplot(121); ....: plt.imshow(text()); ....: plt.gray(); ....: plt.plot(source[:,0], source[:,1],'-', lw=1, color='red'); ....: plt.xlim(0, 448); ....: plt.ylim(172, 0); ....: plt.subplot(122); ....: plt.imshow(warp(text(), mapping, output_shape=(512, 512))); ....: plt.plot(target[:,0], target[:,1],'-', lw=1, color='red'); ....: plt.xlim(0, 512); ....: plt.ylim(512, 0); ....: plt.show() Other, more involved geometric operations are needed, for example, to fix vignetting and some of the other kinds of distortion produced by photographic lenses. Traditionally, once we acquire an image, we assume that all these distortions are present. By knowing the technical specifications of the equipment used to take the photographs, we can automatically rectify these defects. With this purpose in mind, in the SciPy stack we have access to the lensfun library (http://lensfun.sourceforge.net/) through the package lensfunpy (https://pypi.python.org/pypi/lensfunpy). For examples of usage and documentation, an excellent resource is the API reference of lensfunpy at http://pythonhosted.org/lensfunpy/api/. Intensity adjustment In this category, we have operations that only modify the intensity of an image obeying a global formula. All these operations can therefore be easily coded using purely NumPy operations, by creating vectorized functions that adapt the requested formulas. The applications of these operations can be explained in terms of exposure in black and white photography, for example. For this reason, all the examples in this section are applied on gray-scale images. We have mainly three approaches to enhancing images by working on their intensity: Histogram equalization Intensity clipping/resizing Contrast adjustment Histogram equalization The starting point in this setting is always the concept of the intensity histogram (or, more precisely, the histogram of pixel intensity values)—a function that indicates the number of pixels in an image at each intensity value found in that image. For instance, for the original version of Lena, we could issue the following command: In [27]: plt.figure(); ....: plt.hist(lena().flatten(), 256); ....: plt.show() The operations of histogram equalization improve the contrast of images by modifying the histogram in such a way that most of the relevant intensities have the same impact. We can accomplish this enhancement by calling, from the module skimage.exposure, either of the functions equalize_hist (pure histogram equalization) or equalize_adapthist (contrast limited adaptive histogram equalization, or CLAHE). Note the obvious improvement after applying histogram equalization to the image skimage.data.moon. In the following examples, we include the corresponding histogram below all relevant images for comparison.
Suitable code to perform this visualization could be as follows: def display(image, transform, bins): target = transform(image) plt.figure() plt.subplot(221) plt.imshow(image) plt.gray() plt.subplot(222) plt.imshow(target) plt.subplot(223) plt.hist(image.flatten(), bins) plt.subplot(224) plt.hist(target.flatten(), bins) plt.show() In [28]: from skimage.exposure import equalize_hist; ....: from skimage.data import moon In [29]: display(moon(), equalize_hist, 256) Intensity clipping/resizing A peak in the histogram indicates the presence of a particular intensity that is remarkably more predominant than its neighboring ones. If we desire to isolate intensities around a peak, we can do so using purely NumPy masking/clipping operations on the original image. If storing the result is not needed, we can request a quick visualization of the result by employing the command clim in the library matplotlib.pyplot. For instance, to isolate intensities around the highest peak of Lena (roughly, these are between 150 and 160), we could issue the following command: In [30]: plt.figure(); ....: plt.imshow(lena()); ....: plt.clim(vmin=150, vmax=160); ....: plt.show() Note how this operation, in spite of having reduced the representative range of intensities from 256 to 10, offers us a new image that keeps sufficient information to recreate the original one. Naturally, we can also regard this operation as a lossy compression method. Contrast enhancement An obvious drawback of clipping intensities is the loss of perceived lightness contrast. To overcome this loss, it is preferable to employ formulas that do not reduce the size of the range. Among the many available formulas conforming to this mathematical property, the most successful ones are those that replicate an optical property of the acquisition method. We explore the following three cases: Gamma correction: Human vision follows a power function, with greater sensitivity to relative differences between darker tones than between lighter ones. Each original image, as captured by the acquisition device, might allocate too many bits or too much bandwidth to highlights that humans cannot actually differentiate. Similarly, too few bits/bandwidth could be allocated to the darker areas of the image. By manipulating this power function, we can allocate the right amount of bits and bandwidth. Sigmoid correction: Independently of the amount of bits and bandwidth, it is often desirable to maintain the perceived lightness contrast. Sigmoidal remapping functions were then designed based on an empirical contrast enhancement model developed from the results of psychophysical adjustment experiments. Logarithmic correction: This is a purely mathematical process designed to spread the range of naturally low-contrast images by transforming to a logarithmic range. To perform gamma correction on images, we could employ the function adjust_gamma in the module skimage.exposure. The equivalent mathematical operation is the power-law relationship output = gain * input^gamma. Observe the great improvement in definition of the brighter areas of a stained micrograph of colonic glands, when we choose the exponent gamma=2.5 and unit gain (gain=1.0): In [31]: from skimage.exposure import adjust_gamma; ....: from skimage.color import rgb2gray; ....: from skimage.data import immunohistochemistry In [32]: image = rgb2gray(immunohistochemistry()) In [33]: correction = lambda img: adjust_gamma(img, gamma=2.5, ....: gain=1.)
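As noted above, these intensity adjustments are plain point-wise formulas, so they can also be reproduced with a few lines of NumPy. The following is only a rough sketch of the same power law, assuming a floating-point image already scaled to the [0, 1] range (as returned by rgb2gray):

    import numpy as np

    def manual_gamma(img, gamma=2.5, gain=1.0):
        # point-wise power-law adjustment: output = gain * input ** gamma
        return gain * np.asarray(img, dtype=float) ** gamma

    # for such float images this should closely match adjust_gamma(img, gamma=2.5, gain=1.)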
Note the huge improvement in contrast in the lower-right section of the gamma-corrected micrograph, allowing a better description and differentiation of the observed objects. (Figure caption: Immunohistochemical staining with hematoxylin counterstaining; image acquired at the Center for Microscopy And Molecular Imaging (CMMI).) To perform sigmoid correction with given gain and cutoff coefficients, according to the formula output = 1/(1 + exp(gain*(cutoff - input))), we employ the function adjust_sigmoid in skimage.exposure. For example, with gain=10.0 and cutoff=0.5 (the default values), we obtain the following enhancement: In [35]: from skimage.exposure import adjust_sigmoid In [36]: display(image[:256, :256], adjust_sigmoid, 256) Note the improvement in the definition of the walls of cells in the enhanced image. We have already explored logarithmic correction in the previous section, when enhancing the visualization of the frequency of an image. This is equivalent to applying a vectorized version of np.log1p to the intensities. The corresponding function in the scikit-image toolkit is adjust_log in the submodule exposure. Image restoration In this category of image editing, the purpose is to repair the damage produced by the pre- or post-processing of the image, or even to remove distortion produced by the acquisition device. We explore two classic situations: Noise reduction Sharpening and blurring Noise reduction In mathematical imaging, noise is by definition a random variation of the intensity (or the color) produced by the acquisition device. Among all the possible types of noise, we acknowledge four key cases: Gaussian noise: We add to each pixel a value obtained from a random variable with normal distribution and a fixed mean. We usually allow the same variance on each pixel of the image, but it is feasible to change the variance depending on the location. Poisson noise: To each pixel, we add a value obtained from a random variable with Poisson distribution. Salt and pepper: This is a replacement noise, where some pixels are substituted by zeros (black or pepper), and some pixels are substituted by ones (white or salt). Speckle: This is a multiplicative kind of noise, where each pixel gets modified by the formula output = input + n * input. The value of the modifier n is obtained from a random variable with uniform distribution of fixed mean and variance. To emulate all these kinds of noise, we employ the utility random_noise from the module skimage.util. Let's illustrate the possibilities in a common image: In [37]: from skimage.data import camera; ....: from skimage.util import random_noise In [38]: gaussian_noise = random_noise(camera(), 'gaussian', ....: var=0.025); ....: poisson_noise = random_noise(camera(), 'poisson'); ....: saltpepper_noise = random_noise(camera(), 's&p', ....: salt_vs_pepper=0.45); ....: speckle_noise = random_noise(camera(), 'speckle', var=0.02) In [39]: variance_generator = lambda i,j: 0.25*(i+j)/1022. + 0.001; ....: variances = np.fromfunction(variance_generator,(512,512)); ....: lclvr_noise = random_noise(camera(), 'localvar', ....: local_vars=variances) In the last example, we have created a function that assigns a variance between 0.001 and 0.026 depending on the distance from the upper-left corner of the image. When we visualize the corresponding noisy version of skimage.data.camera, we see that the level of degradation gets stronger as we get closer to the lower-right corner of the picture.
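Since the noise models above are defined by simple formulas, it can be instructive to write a couple of them by hand. The sketch below follows the descriptions given in the text (including the uniform multiplier for speckle) and assumes float images in the [0, 1] range; random_noise itself performs extra bookkeeping, such as clipping and seeding, that is omitted here:

    import numpy as np

    def add_gaussian(img, mean=0.0, var=0.025):
        # additive noise drawn from a normal distribution
        return img + np.random.normal(mean, np.sqrt(var), img.shape)

    def add_salt_pepper(img, amount=0.05, salt_vs_pepper=0.45):
        # replacement noise: a fraction of pixels becomes 0 (pepper) or 1 (salt)
        out = img.copy()
        noisy = np.random.rand(*img.shape) < amount
        salt = np.random.rand(*img.shape) < salt_vs_pepper
        out[noisy & salt] = 1.0
        out[noisy & ~salt] = 0.0
        return out

    def add_speckle(img, var=0.02):
        # multiplicative noise: output = input + n * input, with n uniform of the given variance
        half_width = np.sqrt(3 * var)
        n = np.random.uniform(-half_width, half_width, img.shape)
        return img + n * img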
Visualizing the corresponding noisy images side by side makes the differences clear. The purpose of noise reduction is to remove as much of this unwanted signal as possible, so that we obtain an image as close to the original as possible. The trick, of course, is to do so without any previous knowledge of the properties of the noise. The most basic methods of denoising are the application of either a Gaussian or a median filter. We explored them both in the previous section. The former was presented as a smoothing filter (gaussian_filter), and the latter was discussed when we explored statistical filters (median_filter). They both offer decent noise removal, but they introduce unwanted artifacts as well. For example, the Gaussian filter does not preserve edges in images. The application of any of these methods is also not recommended if preserving texture information is needed. We have a few more advanced methods in the module skimage.restoration, able to tailor denoising to images with specific properties: denoise_bilateral: This is the bilateral filter. It is useful when preserving edges is important. denoise_tv_bregman, denoise_tv_chambolle: We will use these if we require a denoised image with small total variation. nl_means_denoising: The so-called non-local means denoising. This method gives the best results when denoising areas of the image that present texture. wiener, unsupervised_wiener: This is the Wiener-Hunt deconvolution. It is useful when we have knowledge of the point-spread function at acquisition time. Let's show, by example, the performance of one of these methods on some of the noisy images we computed earlier: In [40]: from skimage.restoration import nl_means_denoising as dnoise In [41]: images = [gaussian_noise, poisson_noise, ....: saltpepper_noise, speckle_noise]; ....: names = ['Gaussian', 'Poisson', 'Salt & Pepper', 'Speckle'] In [42]: plt.figure() Out[42]: <matplotlib.figure.Figure at 0x118301490> In [43]: for index, image in enumerate(images): ....: output = dnoise(image, patch_size=5, patch_distance=7) ....: plt.subplot(2, 4, index+1) ....: plt.imshow(image) ....: plt.gray() ....: plt.title(names[index]) ....: plt.subplot(2, 4, index+5) ....: plt.imshow(output) ....: In [44]: plt.show() Under each noisy image, we present the corresponding result after applying non-local means denoising. It is also possible to perform denoising by thresholding coefficients, provided we represent images with a transform.
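A small caveat if you try this with a recent scikit-image: some of these routines have since been renamed (for instance, nl_means_denoising is now exposed as denoise_nl_means). The following sketch uses the newer names, so adjust the imports to whatever your installed version provides:

    from skimage.data import camera
    from skimage.util import random_noise
    from skimage.restoration import denoise_nl_means, denoise_tv_chambolle

    noisy = random_noise(camera(), 'gaussian', var=0.025)
    smooth_nl = denoise_nl_means(noisy, patch_size=5, patch_distance=7)  # non-local means
    smooth_tv = denoise_tv_chambolle(noisy, weight=0.1)                  # total-variation denoising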
As an example of this transform-domain approach, to apply soft thresholding with biorthogonal 2.8 (bior2.8) wavelets, we will use the package PyWavelets: In [45]: import pywt In [46]: def dnoise(image, wavelet, noise_var): ....: levels = int(np.floor(np.log2(image.shape[0]))) ....: coeffs = pywt.wavedec2(image, wavelet, level=levels) ....: value = noise_var * np.sqrt(2 * np.log2(image.size)) ....: threshold = lambda x: pywt.thresholding.soft(x, value) ....: coeffs = map(threshold, coeffs) ....: return pywt.waverec2(coeffs, wavelet) ....: In [47]: plt.figure() Out[47]: <matplotlib.figure.Figure at 0x10e5ed790> In [48]: for index, image in enumerate(images): ....: output = dnoise(image, pywt.Wavelet('bior2.8'), ....: noise_var=0.02) ....: plt.subplot(2, 4, index+1) ....: plt.imshow(image) ....: plt.gray() ....: plt.title(names[index]) ....: plt.subplot(2, 4, index+5) ....: plt.imshow(output) ....: In [49]: plt.show() Observe that the results are of comparable quality to those obtained with the previous method. Sharpening and blurring There are many situations that produce blurred images: Incorrect focus at acquisition Movement of the imaging system The point-spread function of the imaging device (like in electron microscopy) Graphic-art effects For blurring images, we could replicate the effect of a point-spread function by performing convolution of the image with the corresponding kernel. The Gaussian filter that we used for denoising performs blurring in this fashion. In the general case, convolution with a given kernel can be done with the routine convolve from the module scipy.ndimage. For instance, for a constant kernel supported on a 10 x 10 square, we could do as follows: In [50]: from scipy.ndimage import convolve; ....: from skimage.data import page In [51]: kernel = np.ones((10, 10))/100.; ....: blurred = convolve(page(), kernel) To emulate the blurring produced by movement, too, we could convolve with a kernel as created here: In [52]: from skimage.draw import polygon In [53]: x_coords = np.array([14, 14, 24, 26, 24, 18, 18]); ....: y_coords = np.array([ 2, 18, 26, 24, 22, 18, 2]); ....: kernel_2 = np.zeros((32, 32)); ....: kernel_2[polygon(x_coords, y_coords)] = 1.; ....: kernel_2 /= kernel_2.sum() In [54]: blurred_motion = convolve(page(), kernel_2) In order to reverse the effects of convolution when we have knowledge of the degradation process, we perform deconvolution. For example, if we have knowledge of the point-spread function, in the module skimage.restoration we have implementations of the Wiener filter (wiener), the unsupervised Wiener filter (unsupervised_wiener), and Lucy-Richardson deconvolution (richardson_lucy). We perform deconvolution on blurred images by applying a Wiener filter. Note the huge improvement in readability of the text: In [55]: from skimage.restoration import wiener In [56]: deconv = wiener(blurred, kernel, balance=0.025, clip=False) Summary In this article, we have seen how the SciPy stack helps us solve many problems in mathematical imaging, from the effective representation of digital images to modifying, restoring, or analyzing them. This article, of course, merely scratches the surface of this challenging field of engineering.

Object Detection Using Image Features in JavaScript

Packt
05 Oct 2015
16 min read
In this article by Foat Akhmadeev, author of the book Computer Vision for the Web, we will discuss how we can detect an object on an image using several JavaScript libraries. In particular, we will see techniques such as FAST feature detection, and BRIEF and ORB descriptor matching. Finally, an object detection example will be presented. There are many ways to detect an object on an image. Color object detection—detecting an object by changes in the intensity or color of an image—is just one simple computer vision method; these are the sort of fundamental things that every computer vision enthusiast should know. The libraries we use here are: JSFeat (http://inspirit.github.io/jsfeat/) tracking.js (http://trackingjs.com) (For more resources related to this topic, see here.) Detecting key points What information do we get when we see an object on an image? An object usually consists of some regular parts or unique points, which represent this particular object. Of course, we can compare each pixel of an image, but that is not a good idea in terms of computational speed. We could instead take points randomly, thus reducing the computation cost significantly, but we will still not get much information from random points. Using all the information, we get too much noise and lose important parts of the object representation. In short, both ideas—taking all pixels and selecting random pixels—are really bad. So, what can we do in that case? We are working with a grayscale image and we need to get unique points of the image, so we need to focus on the intensity information—for example, getting object edges with the Canny edge detector or the Sobel filter. We are closer to the solution! But still not close enough. What if we have a long edge? Don't you think that it is a bit bad that we have too many unique pixels that lie on this edge? An edge of an object has end points or corners; if we reduce our edge to those corners, we will get enough unique pixels and remove unnecessary information. There are various methods of getting keypoints from an image, many of which extract corners as those keypoints. To get them, we will use the FAST (Features from Accelerated Segment Test) algorithm. It is really simple and you can easily implement it by yourself if you want. But you do not need to: the algorithm implementation is provided by both the tracking.js and JSFeat libraries. The idea of the FAST algorithm is as follows: suppose we want to check whether the pixel P is a corner. We will check 16 pixels around it. If at least 9 pixels in an arc around P are much darker or brighter than the value of P, then we say that P is a corner. How much darker or brighter should those pixels be? The decision is made by applying a threshold to the difference between the value of P and the value of the pixels around P. A practical example First, we will start with an example of FAST corner detection for the tracking.js library. Before we do anything else, we can set the detector threshold. The threshold defines the minimum difference between a tested corner and the points around it: tracking.Fast.THRESHOLD = 30; It is usually a good practice to apply a Gaussian blur on an image before we start the method.
It significantly reduces the noise of an image: var imageData = context.getImageData(0, 0, cols, rows); var gray = tracking.Image.grayscale(imageData.data, cols, rows, true); var blurred4 = tracking.Image.blur(gray, cols, rows, 3); Remember that the blur function returns a 4 channel array—RGBA. In that case, we need to convert it to 1-channel. Since we can easily skip other channels, it should not be a problem: var blurred1 = new Array(blurred4.length / 4); for (var i = 0, j = 0; i < blurred4.length; i += 4, ++j) { blurred1[j] = blurred4[i]; } Next, we run a corner detection function on our image array: var corners = tracking.Fast.findCorners(blurred1, cols, rows); The result returns an array with its length twice the length of the corner's number. The array is returned in the format [x0,y0,x1,y1,...]. Where [xn, yn] are coordinates of a detected corner. To print the result on a canvas, we will use the fillRect function: for (i = 0; i < corners.length; i += 2) { context.fillStyle = '#0f0'; context.fillRect(corners[i], corners[i + 1], 3, 3); } Let's see an example with the JSFeat library,. for which the steps are very similar to that of tracking.js. First, we set the global threshold with a function: jsfeat.fast_corners.set_threshold(30); Then, we apply a Gaussian blur to an image matrix and run the corner detection: jsfeat.imgproc.gaussian_blur(matGray, matBlurred, 3); We need to preallocate keypoints for a corners result. The keypoint_t function is just a new type which is useful for keypoints of an image. The first two parameters represent coordinates of a point and the other parameters set: point score (which checks whether the point is good enough to be a key point), point level (which you can use it in an image pyramid, for example), and point angle (which is usually used for the gradient orientation): var corners = []; var i = cols * rows; while (--i >= 0) { corners[i] = new jsfeat.keypoint_t(0, 0, 0, 0, -1); } After all this, we execute the FAST corner detection method. As a last parameter of detection function, we define a border size. The border is used to constrain circles around each possible corner. For example, you cannot precisely say whether the point is a corner for the [0,0] pixel. There is no [0, -3] pixel in our matrix: var count = jsfeat.fast_corners.detect(matBlurred, corners, 3); Since we preallocated the corners, the function returns the number of calculated corners for us. The result returns an array of structures with the x and y fields, so we can print it using those fields: for (var i = 0; i < count; i++) { context.fillStyle = '#0f0'; context.fillRect(corners[i].x, corners[i].y, 3, 3); } The result is nearly the same for both algorithms. The difference is in some parts of realization. Let's look at the following example: From left to right: tracking.js without blur, JSFeat without blur, tracking.js and JSFeat with blur. If you look closely, you can see the difference between tracking.js and JSFeat results, but it is not easy to spot it. Look at how much noise was reduced by applying just a small 3 x 3 Gaussian filter! A lot of noisy points were removed from the background. And now the algorithm can focus on points that represent flowers and the pattern of the vase. We have extracted key points from our image, and we successfully reached the goal of reducing the number of keypoints and focusing on the unique points of an image. Now, we need to compare or match those points somehow. How we can do that? 
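Before moving on to matching, here is the detection pipeline from the snippets above gathered into one place. Treat it as a rough sketch rather than a drop-in solution: the element ids ('input' and 'output') and the size handling are purely illustrative assumptions.

    // hypothetical markup: <img id="input"> and a same-sized <canvas id="output">
    var img = document.getElementById('input');
    var canvas = document.getElementById('output');
    var context = canvas.getContext('2d');
    var cols = canvas.width = img.width;
    var rows = canvas.height = img.height;
    context.drawImage(img, 0, 0);

    tracking.Fast.THRESHOLD = 30;
    var imageData = context.getImageData(0, 0, cols, rows);
    var gray = tracking.Image.grayscale(imageData.data, cols, rows, true);
    var blurred4 = tracking.Image.blur(gray, cols, rows, 3);

    // keep only the first of the four RGBA channels
    var blurred1 = new Array(blurred4.length / 4);
    for (var i = 0, j = 0; i < blurred4.length; i += 4, ++j) {
      blurred1[j] = blurred4[i];
    }

    var corners = tracking.Fast.findCorners(blurred1, cols, rows);
    for (var k = 0; k < corners.length; k += 2) {
      context.fillStyle = '#0f0';
      context.fillRect(corners[k], corners[k + 1], 3, 3);
    }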
Descriptors and object matching Image features by themselves are a bit useless. Yes, we have found unique points on an image. But what did we get? Only values of pixels and that's it. If we try to compare these values, it will not give us much information. Moreover, if we change the overall image brightness, we will not find the same keypoints on the same image! Taking into account all of this, we need the information that surrounds our key points. Moreover, we need a method to efficiently compare this information. First, we need to describe the image features, which comes from image descriptors. In this part, we will see how these descriptors can be extracted and matched. The tracking.js and JSFeat libraries provide different methods for image descriptors. We will discuss both. BRIEF and ORB descriptors The descriptors theory is focused on changes in image pixels' intensities. The tracking.js library provides the BRIEF (Binary Robust Independent Elementary Features) descriptors and its JSFeat extension—ORB (Oriented FAST and Rotated BRIEF). As we can see from the ORB naming, it is rotation invariant. This means that even if you rotate an object, the algorithm can still detect it. Moreover, the authors of the JSFeat library provide an example using the image pyramid, which is scale invariant too. Let's start by explaining BRIEF, since it is the source for ORB descriptors. As a first step, the algorithm takes computed image features, and it takes the unique pairs of elements around each feature. Based on these pairs' intensities it forms a binary string. For example, if we have a pair of positions i and j, and if I(i) < I(j) (where I(pos) means image value at the position pos), then the result is 1, else 0. We add this result to the binary string. We do that for N pairs, where N is taken as a power of 2 (128, 256, 512). Since descriptors are just binary strings, we can compare them in an efficient manner. To match these strings, the Hamming distance is usually used. It shows the minimum number of substitutions required to change one string to another. For example, we have two binary strings: 10011 and 11001. The Hamming distance between them is 2, since we need to change 2 bits of information to change the first string to the second. The JSFeat library provides the functionality to apply ORB descriptors. The core idea is very similar to BRIEF. However, there are two major differences: The implementation is scale invariant, since the descriptors are computed for an image pyramid. The descriptors are rotation invariant; the direction is computed using intensity of the patch around a feature. Using this orientation, ORB manages to compute the BRIEF descriptor in a rotation-invariant manner. Implementation of descriptors implementation and their matching Our goal is to find an object from a template on a scene image. We can do that by finding features and descriptors on both images and matching descriptors from a template to an image. We start from the tracking.js library and BRIEF descriptors. The first thing that we can do is set the number of location pairs: tracking.Brief.N = 512 By default, it is 256, but you can choose a higher value. The larger the value, the more information you will get and the more the memory and computational cost it requires. Before starting the computation, do not forget to apply the Gaussian blur to reduce the image noise. Next, we find the FAST corners and compute descriptors on both images. 
Here and in the next example, we use the suffix Object for a template image and Scene for a scene image: var cornersObject = tracking.Fast.findCorners(grayObject, colsObject, rowsObject); var cornersScene = tracking.Fast.findCorners(grayScene, colsScene, rowsScene); var descriptorsObject = tracking.Brief.getDescriptors(grayObject, colsObject, cornersObject); var descriptorsScene = tracking.Brief.getDescriptors(grayScene, colsScene, cornersScene); Then we do the matching: var matches = tracking.Brief.reciprocalMatch(cornersObject, descriptorsObject, cornersScene, descriptorsScene); We need to pass information of both corners and descriptors to the function, since it returns coordinate information as a result. Next, we print both images on one canvas. To draw the matches using this trick, we need to shift our scene keypoints for the width of a template image as a keypoint1 matching returns a point on a template and keypoint2 returns a point on a scene image. The keypoint1 and keypoint2 are arrays with x and y coordinates at 0 and 1 indexes, respectively: for (var i = 0; i < matches.length; i++) { var color = '#' + Math.floor(Math.random() * 16777215).toString(16); context.fillStyle = color; context.strokeStyle = color; context.fillRect(matches[i].keypoint1[0], matches[i].keypoint1[1], 5, 5); context.fillRect(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1], 5, 5); context.beginPath(); context.moveTo(matches[i].keypoint1[0], matches[i].keypoint1[1]); context.lineTo(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1]); context.stroke(); } The JSFeat library provides most of the code for pyramids and scale invariant features not in the library, but in the examples, which are available on https://github.com/inspirit/jsfeat/blob/gh-pages/sample_orb.html. We will not provide the full code here, because it requires too much space. But do not worry, we will highlight main topics here. Let's start from functions that are included in the library. First, we need to preallocate the descriptors matrix, where 32 is the length of a descriptor and 500 is the maximum number of descriptors. Again, 32 is a power of two: var descriptors = new jsfeat.matrix_t(32, 500, jsfeat.U8C1_t); Then, we compute the ORB descriptors for each corner, we need to do that for both template and scene images: jsfeat.orb.describe(matBlurred, corners, num_corners, descriptors); The function uses global variables, which mainly define input descriptors and output matching: function match_pattern() The result match_t contains the following fields: screen_idx: This is the index of a scene descriptor pattern_lev: This is the index of a pyramid level pattern_idx: This is the index of a template descriptor Since ORB works with the image pyramid, it returns corners and matches for each level: var s_kp = screen_corners[m.screen_idx]; var p_kp = pattern_corners[m.pattern_lev][m.pattern_idx]; We can print each matching as shown here. Again, we use Shift, since we computed descriptors on separate images, but print the result on one canvas: context.fillRect(p_kp.x, p_kp.y, 4, 4); context.fillRect(s_kp.x + shift, s_kp.y, 4, 4); Working with a perspective Let's take a step away. Sometimes, an object you want to detect is affected by a perspective distortion. In that case, you may want to rectify an object plane. For example, a building wall: Looks good, doesn't it? How do we do that? 
Let's look at the code: var imgRectified = new jsfeat.matrix_t(mat.cols, mat.rows, jsfeat.U8_t | jsfeat.C1_t); var transform = new jsfeat.matrix_t(3, 3, jsfeat.F32_t | jsfeat.C1_t); jsfeat.math.perspective_4point_transform(transform, 0, 0, 0, 0, // first pair x1_src, y1_src, x1_dst, y1_dst 640, 0, 640, 0, // x2_src, y2_src, x2_dst, y2_dst and so on. 640, 480, 640, 480, 0, 480, 180, 480); jsfeat.matmath.invert_3x3(transform, transform); jsfeat.imgproc.warp_perspective(mat, imgRectified, transform, 255); Primarily, as we did earlier, we define a result matrix object. Next, we assign a matrix for image perspective transformation. We calculate it based on four pairs of corresponding points. For example, the last, that is, the fourth point of the original image, which is [0, 480], should be projected to the point [180, 480] on the rectified image. Here, the first coordinate refers to x and the second to y. Then, we invert the transform matrix to be able to apply it to the original image—the mat variable. We pick the background color as white (255 for an unsigned byte). As a result, we get a nice image without any perspective distortion. Finding an object location Returning to our primary goal, we found a match. That is great. But what we did not do is finding an object location. There is no function for that in the tracking.js library, but JSFeat provides such functionality in the examples section. First, we need to compute a perspective transform matrix. This is why we discussed the example of such transformation previously. We have points from two images but we do not have a transformation for the whole image. First, we define a transform matrix: var homo3x3 = new jsfeat.matrix_t(3, 3, jsfeat.F32C1_t); To compute the homography, we need only four points. But after the matching, we get too many. In addition, there are can be noisy points, which we will need to skip somehow. For that, we use a RANSAC (Random sample consensus) algorithm. It is an iterative method for estimating a mathematical model from a dataset that contains outliers (noise). It estimates outliers and generates a model that is computed without the noisy data. Before we start, we need to define the algorithm parameters. The first parameter is a match mask, where all mathces will be marked as good (1) or bad (0): var match_mask = new jsfeat.matrix_t(500, 1, jsfeat.U8C1_t); Our mathematical model to find: var mm_kernel = new jsfeat.motion_model.homography2d(); Minimum number of points to estimate a model (4 points to get a homography): var num_model_points = 4; Maximum threshold to classify a data point as an inlier or a good match: var reproj_threshold = 3; Finally, the variable that holds main parameters and the last two arguments define the maximum ratio of outliers and probability of success when the algorithm stops at the point where the number of inliers is 99 percent: var ransac_param = new jsfeat.ransac_params_t(num_model_points, reproj_threshold, 0.5, 0.99); Then, we run the RANSAC algorithm. The last parameter represents the number of maximum iterations for the algorithm: jsfeat.motion_estimator.ransac(ransac_param, mm_kernel, object_xy, screen_xy, count, homo3x3, match_mask, 1000); The shape finding can be applied for both tracking.js and JSFeat libraries, you just need to set matches as object_xy and screen_xy, where those arguments must hold an array of objects with the x and y fields. 
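For the tracking.js matches computed earlier, building those two arrays is just a reshaping exercise. A minimal sketch (reusing the variable names from the previous snippets) might look like this:

    // convert reciprocalMatch results into arrays of {x, y} objects
    var object_xy = [], screen_xy = [];
    for (var i = 0; i < matches.length; i++) {
      object_xy.push({x: matches[i].keypoint1[0], y: matches[i].keypoint1[1]});
      screen_xy.push({x: matches[i].keypoint2[0], y: matches[i].keypoint2[1]});
    }
    var count = matches.length;

    // estimate the homography, marking inliers in match_mask
    jsfeat.motion_estimator.ransac(ransac_param, mm_kernel,
      object_xy, screen_xy, count, homo3x3, match_mask, 1000);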
After we find the transformation matrix, we compute the projected shape of the object on the new image: var shape_pts = tCorners(homo3x3.data, colsObject, rowsObject); After the computation is done, we draw the computed shapes on our images. As we can see, our program successfully found the object in both cases. In practice, the two methods can perform differently; the result depends mainly on the thresholds you set. Summary Image features and descriptor matching are powerful tools for object detection. Both the JSFeat and tracking.js libraries provide different functionalities to match objects using these features. In addition to this, the JSFeat project contains algorithms for object finding. These methods can be useful for tasks such as unique object detection, face tracking, and creating a human interface by tracking various objects very efficiently.

Data Around Us

Packt
29 Sep 2015
25 min read
In this article by Gergely Daróczi, author of the book Mastering Data Analysis with R, we will discuss spatial data, also known as geospatial data, which identifies geographic locations, such as natural or constructed features around us. Although all observations have some spatial content, such as the location of the observation, this is often out of most data analysis tools' range due to the complex nature of spatial information; alternatively, the spatiality might not be that interesting (at first sight) for the given research topic. On the other hand, analyzing spatial data can reveal some very important underlying structures of the data, and it is well worth spending time visualizing the differences and similarities between close or far data points. In this article, we are going to help with this and will use a variety of R packages to: Retrieve geospatial information from the Internet Visualize points and polygons on a map (For more resources related to this topic, see here.) Geocoding We will use the hflights dataset to demonstrate how one can deal with data bearing spatial information. To this end, let's aggregate our dataset; but instead of generating daily data, let's look at the aggregated characteristics of the airports. For the sake of performance, we will use the data.table package: > library(hflights) > library(data.table) > dt <- data.table(hflights)[, list( + N = .N, + Cancelled = sum(Cancelled), + Distance = Distance[1], + TimeVar = sd(ActualElapsedTime, na.rm = TRUE), + ArrDelay = mean(ArrDelay, na.rm = TRUE)) , by = Dest] So we have loaded and then immediately transformed the hflights dataset to a data.table object. At the same time, we aggregated by the destination of the flights to compute: The number of rows The number of cancelled flights The distance The standard deviation of the elapsed time of the flights The arithmetic mean of the delays The resulting R object looks like this: > str(dt) Classes 'data.table' and 'data.frame': 116 obs. of 6 variables: $ Dest : chr "DFW" "MIA" "SEA" "JFK" ... $ N : int 6653 2463 2615 695 402 6823 4893 5022 6064 ... $ Cancelled: int 153 24 4 18 1 40 40 27 33 28 ... $ Distance : int 224 964 1874 1428 3904 305 191 140 1379 862 ... $ TimeVar : num 10 12.4 16.5 19.2 15.3 ... $ ArrDelay : num 5.961 0.649 9.652 9.859 10.927 ... - attr(*, ".internal.selfref")=<externalptr> So we have 116 observations all around the world and five variables describing those. Although this seems to be a spatial dataset, we have no geospatial identifiers that a computer can understand per se, so let's fetch the geocodes of these airports from the Google Maps API via the ggmap package. First, let's see how it works when we are looking for the geo-coordinates of Houston: > library(ggmap) > (h <- geocode('Houston, TX')) Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Houston,+TX&sensor=false lon lat 1 -95.3698 29.76043 So the geocode function can return the matched latitude and longitude of the string we sent to Google. Now let's do the very same thing for all flight destinations: > dt[, c('lon', 'lat') := geocode(Dest)] Well, this took some time as we had to make 116 separate queries to the Google Maps API. Please note that Google limits you to 2,500 queries a day without authentication, so do not run this on a large dataset. There is a helper function in the package, called geocodeQueryCheck, which can be used to check the remaining number of free queries for the day.
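Because of that daily limit, it may be worth caching the geocoding results on disk so that re-running the script does not repeat the 116 queries. A minimal sketch (the file name is arbitrary) could be:

    if (file.exists('geocodes.rds')) {
        dt[, c('lon', 'lat') := readRDS('geocodes.rds')]
    } else {
        print(geocodeQueryCheck())          # how many free queries are left today?
        coords <- geocode(dt$Dest)          # one query per destination
        saveRDS(coords, 'geocodes.rds')
        dt[, c('lon', 'lat') := coords]
    }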
Some of the methods and functions we plan to use in some later sections of this article do not support data.table, so let's fall back to the traditional data.frame format and also print the structure of the current object: > str(setDF(dt)) 'data.frame': 116 obs. of 8 variables: $ Dest : chr "DFW" "MIA" "SEA" "JFK" ... $ N : int 6653 2463 2615 695 402 6823 4893 5022 6064 ... $ Cancelled: int 153 24 4 18 1 40 40 27 33 28 ... $ Distance : int 224 964 1874 1428 3904 305 191 140 1379 862 ... $ TimeVar : num 10 12.4 16.5 19.2 15.3 ... $ ArrDelay : num 5.961 0.649 9.652 9.859 10.927 ... $ lon : num -97 136.5 -122.3 -73.8 -157.9 ... $ lat : num 32.9 34.7 47.5 40.6 21.3 ... This was pretty quick and easy, wasn't it? Now that we have the longitude and latitude values of all the airports, we can try to show these points on a map. Visualizing point data in space For the first time, let's keep it simple and load some package-bundled polygons as the base map. To this end, we will use the maps package. After loading it, we use the map function to render the polygons of the United States of America, add a title, and then some points for the airports and also for Houston with a slightly modified symbol: > library(maps) > map('state') > title('Flight destinations from Houston,TX') > points(h$lon, h$lat, col = 'blue', pch = 13) > points(dt$lon, dt$lat, col = 'red', pch = 19) And showing the airport names on the plot is pretty easy as well: we can use the well-known functions from the base graphics package. Let's pass the three character names as labels to the text function with a slightly increased y value to shift the preceding text the previously rendered data points: > text(dt$lon, dt$lat + 1, labels = dt$Dest, cex = 0.7) Now we can also specify the color of the points to be rendered. This feature can be used to plot our first meaningful map to highlight the number of flights in 2011 to different parts of the USA: > map('state') > title('Frequent flight destinations from Houston,TX') > points(h$lon, h$lat, col = 'blue', pch = 13) > points(dt$lon, dt$lat, pch = 19, + col = rgb(1, 0, 0, dt$N / max(dt$N))) > legend('bottomright', legend = round(quantile(dt$N)), pch = 19, + col = rgb(1, 0, 0, quantile(dt$N) / max(dt$N)), box.col = NA) So the intensity of red shows the number of flights to the given points (airports); the values range from 1 to almost 10,000. Probably it would be more meaningful to compute these values on a state level, as there are many airports, very close to each other, that might be better aggregated at a higher administrative area level. To this end, we load the polygon of the states, match the points of interest (airports) with the overlaying polygons (states), and render the polygons as a thematic map instead of points, like we did on the previous pages. Finding polygon overlays of point data We already have all the data we need to identify the parent state of each airport. The dt dataset includes the geo-coordinates of the locations, and we managed to render the states as polygons with the map function. Actually, this latter function can return the underlying dataset without rendering a plot: > str(map_data <- map('state', plot = FALSE, fill = TRUE)) List of 4 $ x : num [1:15599] -87.5 -87.5 -87.5 -87.5 -87.6 ... $ y : num [1:15599] 30.4 30.4 30.4 30.3 30.3 ... $ range: num [1:4] -124.7 -67 25.1 49.4 $ names: chr [1:63] "alabama" "arizona" "arkansas" "california" ... 
- attr(*, "class")= chr "map" So we have around 16,000 points describing the boundaries of the US states, but this map data is more detailed than we actually need (see for example the name of the polygons starting with Washington): > grep('^washington', map_data$names, value = TRUE) [1] "washington:san juan island" "washington:lopez island" [3] "washington:orcas island" "washington:whidbey island" [5] "washington:main" In short, the non-connecting parts of a state are defined as separate polygons. To this end, let's save a list of the state names without the string after the colon: > states <- sapply(strsplit(map_data$names, ':'), '[[', 1) We will use this list as the basis of aggregation from now on. Let's transform this map dataset into another class of object, so that we can use the powerful features of the sp package. We will use the maptools package to do this transformation: > library(maptools) > us <- map2SpatialPolygons(map_data, IDs = states, + proj4string = CRS("+proj=longlat +datum=WGS84")) An alternative way of getting the state polygons might be to directly load those instead of transforming from other data formats as described earlier. To this end, you may find the raster package especially useful to download free map shapefiles from gadm.org via the getData function. Although these maps are way too detailed for such a simple task, you can always simplify those—for example, with the gSimplify function of the rgeos package. So we have just created an object called us, which includes the polygons of map_data for each state with the given projection. This object can be shown on a map just like we did previously, although you should use the general plot method instead of the map function: > plot(us) Besides this, however, the sp package supports so many powerful features! For example, it's very easy to identify the overlay polygons of the provided points via the over function. As this function name conflicts with the one found in the grDevices package, it's better to refer to the function along with the namespace using a double colon: > library(sp) > dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], dt, + proj4string = CRS("+proj=longlat +datum=WGS84")) > str(sp::over(us, dtp)) 'data.frame': 49 obs. of 8 variables: $ Dest : chr "BHM" "PHX" "XNA" "LAX" ... $ N : int 2736 5096 1172 6064 164 NA NA 2699 3085 7886 ... $ Cancelled: int 39 29 34 33 1 NA NA 35 11 141 ... $ Distance : int 562 1009 438 1379 926 NA NA 1208 787 689 ... $ TimeVar : num 10.1 13.61 9.47 15.16 13.82 ... $ ArrDelay : num 8.696 2.166 6.896 8.321 -0.451 ... $ lon : num -86.8 -112.1 -94.3 -118.4 -107.9 ... $ lat : num 33.6 33.4 36.3 33.9 38.5 ... What happened here? First, we passed the coordinates and the whole dataset to the SpatialPointsDataFrame function, which stored our data as spatial points with the given longitude and latitude values. Next we called the over function to left-join the values of dtp to the US states. An alternative way of identifying the state of a given airport is to ask for more detailed information from the Google Maps API. By changing the default output argument of the geocode function, we can get all address components for the matched spatial object, which of course includes the state as well. Look for example at the following code snippet: geocode('LAX','all')$results[[1]]$address_components Based on this, you might want to get a similar output for all airports and filter the list for the short name of the state. 
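As a rough illustration, and assuming the response keeps Google's usual address_components structure, such filtering could look like the following sketch; the exact shape of the nested list depends on what the API returns:

    components <- geocode('LAX', output = 'all')$results[[1]]$address_components
    state <- sapply(components, function(x)
        if ('administrative_area_level_1' %in% unlist(x$types)) x$short_name else NA)
    na.omit(unlist(state))[1]    # expected to yield something like "CA"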
The rlist package would be extremely useful in this task, as it offers some very convenient ways of manipulating lists in R. The only problem here is that we matched only one airport to the states, which is definitely not okay. See for example the fourth column in the earlier output: it shows LAX as the matched airport for California (returned by states[4]), although there are many others there as well. To overcome this issue, we can do at least two things. First, we can use the returnList argument of the over function to return all matched rows of dtp, and we will then post-process that data: > str(sapply(sp::over(us, dtp, returnList = TRUE), + function(x) sum(x$Cancelled))) Named int [1:49] 51 44 34 97 23 0 0 35 66 149 ... - attr(*, "names")= chr [1:49] "alabama" "arizona" "arkansas" ... So we created and called an anonymous function that will sum up the Cancelled values of the data.frame in each element of the list returned by over. Another, probably cleaner, approach is to redefine dtp to only include the related values and pass a function to over to do the summary: > dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], + dt[, 'Cancelled', drop = FALSE], + proj4string = CRS("+proj=longlat +datum=WGS84")) > str(cancels <- sp::over(us, dtp, fn = sum)) 'data.frame': 49 obs. of 1 variable: $ Cancelled: int 51 44 34 97 23 NA NA 35 66 149 ... Either way, we have a vector to merge back to the US state names: > val <- cancels$Cancelled[match(states, row.names(cancels))] And to update all missing values to zero (as the number of cancelled flights in a state without any airport is not missing data, but exactly zero for sure): > val[is.na(val)] <- 0 Plotting thematic maps Now we have everything to create our first thematic map. Let's pass the val vector to the previously used map function (or plot it using the us object), specify a plot title, add a blue point for Houston, and then create a legend, which shows the quantiles of the overall number of cancelled flights as a reference: > map("state", col = rgb(1, 0, 0, sqrt(val/max(val))), fill = TRUE) > title('Number of cancelled flights from Houston to US states') > points(h$lon, h$lat, col = 'blue', pch = 13) > legend('bottomright', legend = round(quantile(val)), + fill = rgb(1, 0, 0, sqrt(quantile(val)/max(val))), box.col = NA) Please note that, instead of a linear scale, we decided to compute the square root of the relative values to define the intensity of the fill color, so that we can visually highlight the differences between the states. This was necessary as most flight cancellations happened in Texas (748), and there were no more than 150 cancelled flights in any other state (with the average being around 45). You can also easily load ESRI shape files or other geospatial vector data formats into R as points or polygons with a bunch of packages already discussed and a few others as well, such as the maptools, rgdal, dismo, raster, or shapefile packages. Another, probably easier, way to generate country-level thematic maps, especially choropleth maps, is to load the rworldmap package made by Andy South, and rely on the convenient mapCountryData function. Rendering polygons around points Besides thematic maps, another really useful way of presenting spatial data is to draw artificial polygons around the data points based on the data values. This is especially useful if there is no available polygon shape file to be used to generate a thematic map. 
A level plot, contour plot, or isopleths, might be an already familiar design from tourist maps, where the altitude of the mountains is represented by a line drawn around the center of the hill at the very same levels. This is a very smart approach having maps present the height of hills—projecting this third dimension onto a 2-dimensional image. Now let's try to replicate this design by considering our data points as mountains on the otherwise flat map. We already know the heights and exact geo-coordinates of the geometric centers of these hills (airports); the only challenge here is to draw the actual shape of these objects. In other words: Are these mountains connected? How steep are the hillsides? Should we consider any underlying spatial effects in the data? In other words, can we actually render these as mountains with a 3D shape instead of plotting independent points in space? If the answer for the last question is positive, then we can start trying to answer the other questions by fine-tuning the plot parameters. For now, let's simply suppose that there is a spatial effect in the underlying data, and it makes sense to visualize the data in such a way. Later, we will have the chance to disprove or support this statement either by analyzing the generated plots, or by building some geo-spatial models—some of these will be discussed later, in the Spatial Statistics section. Contour lines First, let's expand our data points into a matrix with the fields package. The size of the resulting R object is defined arbitrarily but, for the given number of rows and columns, which should be a lot higher to generate higher resolution images, 256 is a good start: > library(fields) > out <- as.image(dt$ArrDelay, x = dt[, c('lon', 'lat')], + nrow = 256, ncol = 256) The as.image function generates a special R object, which in short includes a 3‑dimensional matrix-like data structure, where the x and y axes represent the longitude and latitude ranges of the original data respectively. To simplify this even more, we have a matrix with 256 rows and 256 columns, where each of those represents a discrete value evenly distributed between the lowest and highest values of the latitude and longitude. And on the z axis, we have the ArrDelay values—which are in most cases of course missing: > table(is.na(out$z)) FALSE TRUE 112 65424 What does this matrix look like? It's better to see what we have at the moment: > image(out) Well, this does not seem to be useful at all. What is shown there? We rendered the x and y dimensions of the matrix with z colors here, and most tiles of this map are empty due to the high amount of missing values in z. Also, it's pretty straightforward now that the dataset included many airports outside the USA as well. How does it look if we focus only on the USA? > image(out, xlim = base::range(map_data$x, na.rm = TRUE), + ylim = base::range(map_data$y, na.rm = TRUE)) An alternative and more elegant approach to rendering only the US part of the matrix would be to drop the non-US airports from the database before actually creating the out R object. Although we will continue with this example for didactic purposes, with real data make sure you concentrate on the target subset of your data instead of trying to smooth and model unrelated data points as well. A lot better! So we have our data points as a tile, now let's try to identify the slope of these mountain peaks, to be able to render them on a future map. 
This can be done by smoothing the matrix: > look <- image.smooth(out, theta = .5) > table(is.na(look$z)) FALSE TRUE 14470 51066 As can be seen in the preceding table, this algorithm successfully eliminated many missing values from the matrix. The image.smooth function basically reused our initial data point values in the neighboring tiles, and computed some kind of average for the conflicting overrides. This smoothing algorithm results in the following arbitrary map, which does not respect any political or geographical boundaries: > image(look) It would be really nice to plot these artificial polygons along with the administrative boundaries, so let's clear out all cells that do not belong to the territory of the USA. We will use the point.in.polygon function from the sp package to do so: > usa_data <- map('usa', plot = FALSE, region = 'main') > p <- expand.grid(look$x, look$y) > library(sp) > n <- which(point.in.polygon(p$Var1, p$Var2, + usa_data$x, usa_data$y) == 0) > look$z[n] <- NA In a nutshell, we have loaded the main polygon of the USA without any sub-administrative areas, and verified our cells in the look object, if those are overlapping the polygon. Then we simply reset the value of the cell, if not. The next step is to render the boundaries of the USA, plot our smoothed contour plot, then add some eye-candy in the means of the US states and, the main point of interest, the airport: > map("usa") > image(look, add = TRUE) > map("state", lwd = 3, add = TRUE) > title('Arrival delays of flights from Houston') > points(dt$lon, dt$lat, pch = 19, cex = .5) > points(h$lon, h$lat, pch = 13) Now this is pretty neat, isn't it? Voronoi diagrams An alternative way of visualizing point data with polygons is to generate Voronoi cells between them. In short, the Voronoi map partitions the space into regions around the data points by aligning all parts of the map to one of the regions to minimize the distance from the central data points. This is extremely easy to interpret, and also to implement in R. The deldir package provides a function with the very same name for Delaunay triangulation: > library(deldir) > map("usa") > plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, + pch = 19, col = c('red', 'darkgray'), add = TRUE) Here, we represented the airports with red dots, as we did before, but also added the Dirichlet tessellation (Voronoi cells) rendered as dark-gray dashed lines. For more options on how to fine-tune the results, see the plot.deldir method. In the next section, let's see how to improve this plot by adding a more detailed background map to it. Satellite maps There are many R packages on CRAN that can fetch data from Google Maps, Stamen, Bing, or OpenStreetMap—even some of the packages we previously used in this article, like the ggmap package, can do this. Similarly, the dismo package also comes with both geo-coding and Google Maps API integration capabilities, and there are some other packages focused on that latter, such as the RgoogleMaps package. Now we will use the OpenStreetMap package, mainly because it supports not only the awesome OpenStreetMap database back-end, but also a bunch of other formats as well. 
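As a short aside before fetching the background tiles: the theta argument of image.smooth used above sets the bandwidth of the smoothing kernel, so it is worth experimenting with it. A quick sketch comparing two arbitrary settings side by side could look like this:
> look_tight <- image.smooth(out, theta = .25)
> look_wide <- image.smooth(out, theta = 1)
> # plot the two smoothed surfaces next to each other
> par(mfrow = c(1, 2))
> image(look_tight); title('theta = 0.25')
> image(look_wide); title('theta = 1')
> par(mfrow = c(1, 1))
Larger theta values spread each airport's delay over more neighboring cells, which gives a smoother but blurrier surface. Now, back to the background map.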
For example, we can render really nice terrain maps via Stamen: > library(OpenStreetMap) > map <- openmap(c(max(map_data$y, na.rm = TRUE), + min(map_data$x, na.rm = TRUE)), + c(min(map_data$y, na.rm = TRUE), + max(map_data$x, na.rm = TRUE)), + type = 'stamen-terrain') So we defined the upper-left and lower-right corners of the map we need, and also specified the map style to be a terrain map. As the data by default arrives from the remote servers in the Mercator projection, we first have to transform it to WGS84 (which we used previously), so that we can render the points and polygons on top of the fetched map: > map <- openproj(map, + projection = '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs') And now, showtime at last: > plot(map) > plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, + col = c('red', 'black'), pch = 19, cex = 0.5, add = TRUE) This looks a lot better than the outline map we created previously. Now you can try some other map styles as well, such as mapquest-aerial, or some of the really nice-looking CloudMade designs. Interactive maps Besides being able to use web services to download map tiles for the background of the maps created in R, we can also rely on some of them to generate truly interactive maps. One of the best-known related services is the Google Visualization API, which provides a platform for hosting visualizations made by the community; you can also use it to share maps you have created with others. Querying Google Maps In R, you can access this API via the googleVis package, written and maintained by Markus Gesmann and Diego de Castillo. Most functions of the package generate HTML and JavaScript code that we can directly view in a web browser as an SVG object with the base plot function; alternatively, we can integrate them in a web page, for example via the IFRAME HTML tag. The gvisIntensityMap function takes a data.frame with country ISO or USA state codes and the actual data to create a simple intensity map; here, however, we will use the closely related gvisGeoChart function, which renders a region-level geo chart. We will use the cancels dataset we created in the Finding Polygon Overlays of Point Data section but, before that, we have to do some data transformations. Let's add the state name as a new column to the data.frame, and replace the missing values with zero: > cancels$state <- rownames(cancels) > cancels$Cancelled[is.na(cancels$Cancelled)] <- 0 Now it's time to load the package and pass the data along with a few extra parameters, signifying that we want to generate a state-level US map: > library(googleVis) > plot(gvisGeoChart(cancels, 'state', 'Cancelled', + options = list( + region = 'US', + displayMode = 'regions', + resolution = 'provinces'))) The package also offers the opportunity to query the Google Maps API via the gvisMap function. We will use this feature to render the airports from the dt dataset as points on a Google Map with an auto-generated tooltip of the variables. But first, as usual, we have to do some data transformations again. The location argument of the gvisMap function takes the latitude and longitude values separated by a colon: > dt$LatLong <- paste(dt$lat, dt$lon, sep = ':') We also have to generate the tooltips as a new variable, which can be done easily with an apply call.
We will concatenate the variable names and actual values separated by a HTML line break: > dt$tip <- apply(dt, 1, function(x) + paste(names(dt), x, collapse = '<br/ >')) And now we just pass these arguments to the function for an instant interactive map: > plot(gvisMap(dt, 'LatLong', tipvar = 'tip')) Another nifty feature of the googleVis package is that you can easily merge the different visualizations into one by using the gvisMerge function. The use of this function is quite simple: specify any two gvis objects you want to merge, and also whether they are to be placed horizontally or vertically. JavaScript mapping libraries The great success of the trending JavaScript data visualization libraries is only partly due to their great design. I suspect other factors also contribute to the general spread of such tools: it's very easy to create and deploy full-blown data models, especially since the release and on-going development of Mike Bostock's D3.js. Although there are also many really useful and smart R packages to interact directly with D3 and topojson (see for example my R user activity compilation at http://bit.ly/countRies). Now we will only focus on how to use Leaflet— probably the most used JavaScript library for interactive maps. What I truly love in R is that there are many packages wrapping other tools, so that R users can rely on only one programming language, and we can easily use C++ programs and Hadoop MapReduce jobs or build JavaScript-powered dashboards without actually knowing anything about the underlying technology. This is especially true when it comes to Leaflet! There are at least two very nice packages that can generate a Leaflet plot from the R console, without a single line of JavaScript. The Leaflet reference class of the rCharts package was developed by Ramnath Vaidyanathan, and includes some methods to create a new object, set the viewport and zoom level, add some points or polygons to the map, and then render or print the generated HTML and JavaScript code to the console or to a file. Unfortunately, this package is not on CRAN yet, so you have to install it from GitHub: > devtools::install_github('ramnathv/rCharts') As a quick example, let's generate a Leaflet map of the airports with some tooltips, like we did with the Google Maps API in the previous section. As the setView method expects numeric geo-coordinates as the center of the map, we will use Kansas City's airport as a reference: > library(rCharts) > map <- Leaflet$new() > map$setView(as.numeric(dt[which(dt$Dest == 'MCI'), + c('lat', 'lon')]), zoom = 4) > for (i in 1:nrow(dt)) + map$marker(c(dt$lat[i], dt$lon[i]), bindPopup = dt$tip[i]) > map$show() Similarly, RStudio's leaflet package and the more general htmlwidgets package also provide some easy ways to generate JavaScript-powered data visualizations. Let's load the library and define the steps one by one using the pipe operator from the magrittr package, which is pretty standard for all packages created or inspired by RStudio or Hadley Wickham: > library(leaflet) > leaflet(us) %>% + addProviderTiles("Acetate.terrain") %>% + addPolygons() %>% + addMarkers(lng = dt$lon, lat = dt$lat, popup = dt$tip) I especially like this latter map, as we can load a third-party satellite map in the background, then render the states as polygons; we also added the original data points along with some useful tooltips on the very same map with literally a one-line R command. 
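Returning for a moment to the gvisMerge function mentioned in the googleVis section above, a minimal hedged sketch of stacking the state-level geo chart and the airport map from that section into a single page might look like the following; the geo and airports object names are ours, and the call assumes the standard gvisMerge signature with two gvis objects and a horizontal flag:
> geo <- gvisGeoChart(cancels, 'state', 'Cancelled',
+     options = list(region = 'US', displayMode = 'regions',
+         resolution = 'provinces'))
> airports <- gvisMap(dt, 'LatLong', tipvar = 'tip')
> # stack the two gvis objects vertically on one page
> plot(gvisMerge(geo, airports, horizontal = FALSE))
And now, back to the Leaflet map we have just rendered.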
We could even color the state polygons based on the aggregated results we computed in the previous sections! Ever tried to do the same in Java? Alternative map designs Besides being able to use some third-party tools, another main reason why I tend to use R for all my data analysis tasks is that R is extremely powerful in creating custom data exploration, visualization, and modeling designs. As an example, let's create a flow-map based on our data, where we will highlight the flights from Houston based on the number of actual and cancelled flights. We will use lines and circles to render these two variables on a 2-dimensional map, and we will also add a contour plot in the background based on the average time delay. But, as usual, let's do some data transformations first! To keep the number of flows at a minimal level, let's get rid of the airports outside the USA at last: > dt <- dt[point.in.polygon(dt$lon, dt$lat, + usa_data$x, usa_data$y) == 1, ] We will need the diagram package (to render curved arrows from Houston to the destination airports) and the scales package to create transparent colors: > library(diagram) > library(scales) Then let's render the contour map described in the Contour Lines section: > map("usa") > title('Number of flights, cancellations and delays from Houston') > image(look, add = TRUE) > map("state", lwd = 3, add = TRUE) And then add a curved line from Houston to each of the destination airports, where the width of the line represents the number of cancelled flights and the diameter of the target circles shows the number of actual flights: > for (i in 1:nrow(dt)) { + curvedarrow( + from = rev(as.numeric(h)), + to = as.numeric(dt[i, c('lon', 'lat')]), + arr.pos = 1, + arr.type = 'circle', + curve = 0.1, + arr.col = alpha('black', dt$N[i] / max(dt$N)), + arr.length = dt$N[i] / max(dt$N), + lwd = dt$Cancelled[i] / max(dt$Cancelled) * 25, + lcol = alpha('black', + dt$Cancelled[i] / max(dt$Cancelled))) + } Well, this article ended up being about visualizing spatial data, and not really about analyzing spatial data by fitting models and filtering raw data. Summary In case you are interested in knowing other R-related books that Packt has in store for you, here is the link: R for Data Science Practical Data Science Cookbook Resources for Article: Further resources on this subject: R ─ Classification and Regression Trees[article] An overview of common machine learning tasks[article] Reduction with Principal Component Analysis [article]
The Dashboard Design – Best Practices

Packt
25 Sep 2015
11 min read
 In this article by Julian Villafuerte, author of the book Creating Stunning Dashboards with QlikView you know more about the best practices for the dashboard design. (For more resources related to this topic, see here.) Data visualization is a field that is constantly evolving. However, some concepts have proven their value time and again through the years and have become what we call best practices. These notions should not be seen as strict rules that must be applied without any further consideration but as a series of tips that will help you create better applications. If you are a beginner, try to stick to them as much as you can. These best practices will save you a lot of trouble and will greatly enhance your first endeavors. On the other hand, if you are an advanced developer, combine them with your personal experiences in order to build the ultimate dashboard. Some guidelines in this article come from the widely known characters in the field of data visualization, such as Stephen Few, Edward Tufte, John Tukey, Alberto Cairo, and Nathan Yau. So, if a concept strikes your attention, I strongly recommend you to read more about it in their books. Throughout this article, we will review some useful recommendations that will help you create not only engaging, but also effective and user-friendly dashboards. Remember that they may apply differently depending on the information displayed and the audience you are working with. Nevertheless, they are great guidelines to the field of data visualization, so do not hesitate to consider them in all of your developments. Gestalt principles In the early 1900s, the Gestalt school of psychology conducted a series of studies on human perception in order to understand how our brain interprets forms and recognizes patterns. Understanding these principles may help you create a better structure for your dashboard and make your charts easier to interpret: Proximity: When we see multiple elements located near one another, we tend to see them as groups. For example, we can visually distinguish clusters in a scatter plot by grouping the dots according to their position. Similarity: Our brain associates the elements that are similar to each other (in terms of shape, size, color, or orientation). For example, in color-coded bar charts, we can associate the bars that share the same color even if they are not grouped. Enclosure: If a border surrounds a series of objects, we perceive them as part of a group. For example, if a scatter plot has reference lines that wrap the elements between 20 and 30 percent, we will automatically see them as a cluster. Closure: When we detect a figure that looks incomplete, we tend to perceive it as a closed structure. For example, even if we discard the borders of a bar chart, the axes will form a region that our brain will isolate without needing the extra lines. Continuity: If a number of objects are aligned, we will perceive them as a continuous body. For example, the different blocks of code when you indent QlikView script are percieved as one continuous code. Connection: If objects are connected by a line, we will see them as a group. For example, we tend to associate the dots connected by lines on a scatter plot with lines and symbols. Giving context to the data When it comes to analyzing data, context is everything. If you present isolated figures, the users will have a hard time trying to find the story hidden behind them. 
For example, if I told you that the gross margin of our company was 16.5 percent during the first quarter of 2015, would you evaluate it as a positive or negative sign? This is pretty difficult, right? However, what if we added some extra information to complement this KPI? Then, the following image would make a lot more sense: As you can see, adding context to the data can make the landscape look quite different. Now, it is easy to see that even though the gross margin has substantially improved during the last year, our company has some work to do in order to be competitive and surpass the industry standard. The appropriate references may change depending on the KPI you are dealing with and the goals of the organization, but some common examples are as follows: Last year's performance The quota, budget, or objective Comparison with the closest competitor, product, or employee The market share The industry standards Another good tip in this regard is to anticipate the comparisons. If you display figures regarding the monthly quota and the actual sales, you can save the users the mental calculations by including complementary indicators, such as the gap between them and the percentage of completion. Data-Ink Ratio One of the most interesting principles in the field of data visualization is Data-Ink Ratio, introduced by Edward R. Tufte in his book, The Visual Display of Quantitative Information, which must be read by every designer. In this publication, he states that there are two different types of ink (or in our case, pixels) in any chart, as follows: Data-ink: This includes all the nonerasable portions of graphic that are used to represent the actual data. These pixels are at the core of the visualization and cannot be removed without losing some of its content. Non-data-ink: This includes any element that is not directly related to the data or doesn't convey anything meaningful to the reader. Based on these concepts, he defined the Data Ink Ratio as the proportion of the graphic's ink that is devoted to the nonredundant display of data information: Data Ink Ratio = Data Ink / Total Ink As you can imagine, our goal is to maximize this number by decreasing the non-data-ink used in our dashboards. For example, the chart to the left has a low data-ink ratio due to the usage of 3D effects, shadows, backgrounds, and multiple grid lines. On the contrary, the chart to the right presents a higher ratio as most of the pixels are data-related. Avoiding chart junk Chart junk is another term coined by Tufte that refers to all the elements that distract the viewer from the actual information in a graphic. Evidently, chart junk is considered as non-data-ink and comprises of features such as heavy gridlines, frames, redundant labels, ornamental axes, backgrounds, overly complex fonts, shadows, images, or other effects included only as decoration. Take for instance the following charts: As you can see, by removing all the unnecessary elements in a chart, it becomes easier to interpret and looks much more elegant. Balance Colors, icons, reference lines, and other visual cues can be very useful to help the users focus on the most important elements in a dashboard. However, misusing or overusing these features can be a real hazard, so try to find the adequate balance for each of them. Excessive precision QlikView applications should use the appropriate language for each audience. When designing, think about whether precise figures will be useful or if they are going to become a distraction. 
Most of the time, dashboards show high-level KPIs, so it may be more comfortable for certain users to see rounded numbers, as in the following image: 3D charts One of Microsoft Excel's greatest wrongdoings is making everyone believe that 3D charts are good for data analysis. For some reason, people seem to love them; but, believe me, they are a real threat to business analysts. Despite their visual charm, these representations can easily hide some parts of the information and convey wrong perceptions depending on their usage of colors, shadows, and axis inclination. I strongly recommend you to avoid them in any context. Sorting Whether you are working with a list box, a bar chart, or a straight table, sorting an object is always advisable, as it adds context to the data. It can help you find the most commonly selected items in a list box, distinguish which slice is bigger on a pie chart when the sizes are similar, or easily spot the outliners in other graphic representations. Alignment and distribution Most of my colleagues argue that I am on the verge of an obsessive-compulsive disorder, but I cannot stand an application with unaligned objects. (Actually, I am still struggling with the fact that the paragraphs in this book are not justified, but anyway...). The design toolbar offers useful options in this regard, so there is no excuse for not having a tidy dashboard. If you take care of the quadrature of all the charts and filters, your interface will display a clean and professional look that every user will appreciate: Animations I have a rule of thumb regarding chart animation in QlikView—If you are Hans Rosling, go ahead. If not, better think it over twice. Even though they can be very illustrative, chart animations end up being a distraction rather than a tool to help us visualize data most of the time, so be conservative about their use. For those of you who do not know him, Hans Rosling is a Swedish professor of international health who works in Stockholm. However, he is best known for his amazing way of presenting data with GapMinder, a simple piece of software that allows him to animate a scatter plot. If you are a data enthusiast, you ought to watch his appearances in TED Talks. Avoiding scroll bars Throughout his work, Stephen Few emphasizes that all the information in a dashboard must fit on a single screen. Whilst I believe that there is no harm in splitting the data in multiple sheets, it is undeniable that scroll bars reduce the overall usability of an application. If the user has to continuously scroll right and left to read all the figures in a table, or if she must go up and down to see the filter panel, she will end up getting tired and eventually discard your dashboard. Consistency If you want to create an easy way to navigate your dashboard, you cannot forget about consistency. Locating standard objects (such as Current Selections Box, Search Object, and Filter Panels) in the same area in every tab will help the users easily find all the items they need. In addition, applying the same style, fonts, and color palettes in all your charts will make your dashboard look more elegant and professional. White space The space between charts, tables, and filters is often referred to as white space, and even though you may not notice it, it is a vital part of any dashboard. Displaying dozens of objects without letting them breathe makes your interface look cluttered and, therefore, harder to understand. 
Some of the benefits of using white space adequately are: The improvement in readability It focuses and emphasizes the important objects It guides the users' eyes, creating a sense of hierarchy in the dashboard It fosters a balanced layout, making your interface look clear and sophisticated Applying makeup Every now and then, you stumble upon delicate situations where some business users try their best to hide certain parts of the data. Whether it is about low sales or the insane amount of defective products, they often ask you to remove a few charts or avoid visual cues so that those numbers go unnoticed. Needless to say, dashboards are tools intended to inform and guide the decisions of the viewers, so avoid presenting misleading visualizations. Meaningless variety As a designer, you will often hesitate to use the same chart type multiple times in your application fearing that the users will get bored of it. Though this may be a haunting perception, if you present valuable data in an adequate format, there is no need to add new types of charts just for variety's sake. We want to keep the users engaged with great analyses, not just with pretty graphics. Summary In this article, you learned all about the best practices to be followed in Qlikview. Resources for Article: Further resources on this subject: Analyzing Financial Data in QlikView[article] Securing QlikView Documents[article] Common QlikView script errors [article]
Integration with Spark SQL

Packt
24 Sep 2015
11 min read
 In this article by Sumit Gupta, the author of the book Learning Real-time Processing with Spark Streaming, we will discuss the integration of Spark Streaming with various other advance Spark libraries such as Spark SQL. (For more resources related to this topic, see here.) No single software in today's world can fulfill the varied, versatile, and complex demands/needs of the enterprises, and to be honest, neither should it! Software are made to fulfill specific needs arising out of the enterprises at a particular point in time, which may change in future due to many other factors. These factors may or may not be controlled like government policies, business/market dynamics, and many more. Considering all these factors integration and interoperability of any software system with internal/external systems/software's is pivotal in fulfilling the enterprise needs. Integration and interoperability are categorized as nonfunctional requirements, which are always implicit and may or may not be explicitly stated by the end users. Over the period of time, architects have realized the importance of these implicit requirements in modern enterprises, and now, all enterprise architectures provide support due diligence and provisions in fulfillment of these requirements. Even the enterprise architecture frameworks such as The Open Group Architecture Framework (TOGAF) defines the specific set of procedures and guidelines for defining and establishing interoperability and integration requirements of modern enterprises. Spark community realized the importance of both these factors and provided a versatile and scalable framework with certain hooks for integration and interoperability with the different systems/libraries; for example; data consumed and processed via Spark streams can also be loaded into the structured (table: rows/columns) format and can be further queried using SQL. Even the data can be stored in the form of Hive tables in HDFS as persistent tables, which will exist even after our Spark program has restarted. In this article, we will discuss querying streaming data in real time using Spark SQL. Querying streaming data in real time Spark Streaming is developed on the principle of integration and interoperability where it not only provides a framework for consuming data in near real time from varied data sources, but at the same time, it also provides the integration with Spark SQL where existing DStreams can be converted into structured data format for querying using standard SQL constructs. There are many such use cases where SQL on streaming data is a much needed feature; for example, in our distributed log analysis use case, we may need to combine the precomputed datasets with the streaming data for performing exploratory analysis using interactive SQL queries, which is difficult to implement only with streaming operators as they are not designed for introducing new datasets and perform ad hoc queries. Moreover SQL's success at expressing complex data transformations derives from the fact that it is based on a set of very powerful data processing primitives that do filtering, merging, correlation, and aggregation, which is not available in the low-level programming languages such as Java/ C++ and may result in long development cycles and high maintenance costs. Let's move forward and first understand few things about Spark SQL, and then, we will also see the process of converting existing DStreams into the Structured formats. 
Understanding Spark SQL Spark SQL is one of the modules developed over the Spark framework for processing structured data, which is stored in the form of rows and columns. At a very high level, it is similar to the data residing in RDBMS in the form rows and columns, and then SQL queries are executed for performing analysis, but Spark SQL is much more versatile and flexible as compared to RDBMS. Spark SQL provides distributed processing of SQL queries and can be compared to frameworks Hive/Impala or Drill. Here are the few notable features of Spark SQL: Spark SQL is capable of loading data from variety of data sources such as text files, JSON, Hive, HDFS, Parquet format, and of course RDBMS too so that we can consume/join and process datasets from different and varied data sources. It supports static and dynamic schema definition for the data loaded from various sources, which helps in defining schema for known data structures/types, and also for those datasets where the columns and their types are not known until runtime. It can work as a distributed query engine using the thrift JDBC/ODBC server or command-line interface where end users or applications can interact with Spark SQL directly to run SQL queries. Spark SQL provides integration with Spark Streaming where DStreams can be transformed into the structured format and further SQL Queries can be executed. It is capable of caching tables using an in-memory columnar format for faster reads and in-memory data processing. It supports Schema evolution so that new columns can be added/deleted to the existing schema, and Spark SQL still maintains the compatibility between all versions of the schema. Spark SQL defines the higher level of programming abstraction called DataFrames, which is also an extension to the existing RDD API. Data frames are the distributed collection of the objects in the form of rows and named columns, which is similar to tables in the RDBMS, but with much richer functionality containing all the previously defined features. The DataFrame API is inspired by the concepts of data frames in R (http://www.r-tutor.com/r-introduction/data-frame) and Python (http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe). Let's move ahead and understand how Spark SQL works with the help of an example: As a first step, let's create sample JSON data about the basic information about the company's departments such as Name, Employees, and so on, and save this data into the file company.json. The JSON file would look like this: [ { "Name":"DEPT_A", "No_Of_Emp":10, "No_Of_Supervisors":2 }, { "Name":"DEPT_B", "No_Of_Emp":12, "No_Of_Supervisors":2 }, { "Name":"DEPT_C", "No_Of_Emp":14, "No_Of_Supervisors":3 }, { "Name":"DEPT_D", "No_Of_Emp":10, "No_Of_Supervisors":1 }, { "Name":"DEPT_E", "No_Of_Emp":20, "No_Of_Supervisors":5 } ] You can use any online JSON editor such as http://codebeautify.org/online-json-editor to see and edit data defined in the preceding JSON code. Next, let's extend our Spark-Examples project and create a new package by the name chapter.six, and within this new package, create a new Scala object and name it as ScalaFirstSparkSQL.scala. 
Next, add the following import statements just below the package declaration: import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql._ import org.apache.spark.sql.functions._ Further, in your main method, add following set of statements to create SQLContext from SparkContext: //Creating Spark Configuration val conf = new SparkConf() //Setting Application/ Job Name conf.setAppName("My First Spark SQL") // Define Spark Context which we will use to initialize our SQL Context val sparkCtx = new SparkContext(conf) //Creating SQL Context val sqlCtx = new SQLContext(sparkCtx) SQLContext or any of its descendants such as HiveContext—for working with Hive tables or CassandraSQLContext—for working with Cassandra tables is the main entry point for accessing all functionalities of Spark SQL. It allows the creation of data frames, and also provides functionality to fire SQL queries over data frames. Next, we will define the following code to load the JSON file (company.json) using the SQLContext, and further, we will also create a data frame: //Define path of your JSON File (company.json) which needs to be processed val path = "/home/softwares/spark/data/company.json"; //Use SQLCOntext and Load the JSON file. //This will return the DataFrame which can be further Queried using SQL queries. val dataFrame = sqlCtx.jsonFile(path) In the preceding piece of code, we used the jsonFile(…) method for loading the JSON data. There are other utility method defined by SQLContext for reading raw data from filesystem or creating data frames from the existing RDD and many more. Spark SQL supports two different methods for converting the existing RDDs into data frames. The first method uses reflection to infer the schema of an RDD from the given data. This approach leads to more concise code and helps in instances where we already know the schema while writing Spark application. We have used the same approach in our example. The second method is through a programmatic interface that allows to construct a schema. Then, apply it to an existing RDD and finally generate a data frame. This method is more verbose, but provides flexibility and helps in those instances where columns and data types are not known until the data is received at runtime. Refer to https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.SQLContext for a complete list of methods exposed by SQLContext. Once the DataFrame is created, we need to register DataFrame as a temporary table within the SQL context so that we can execute SQL queries over the registered table. Let's add the following piece of code for registering our DataFrame with our SQL context and name it company: //Register the data as a temporary table within SQL Context //Temporary table is destroyed as soon as SQL Context is destroyed. dataFrame.registerTempTable("company"); And we are done… Our JSON data is automatically organized into the table (rows/column) and is ready to accept the SQL queries. Even the data types is inferred from the type of data entered within the JSON file itself. Now, we will start executing the SQL queries on our table, but before that let's see the schema being created/defined by SQLContext: //Printing the Schema of the Data loaded in the Data Frame dataFrame.printSchema(); The execution of the preceding statement will provide results similar to mentioned illustration: The preceding illustration shows the schema of the JSON data loaded by Spark SQL. Pretty simple and straight, isn't it? 
Spark SQL has automatically created our schema based on the data defined in our company.json file. It has also defined the data type of each of the columns. We can also define the schema using reflection (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection) or can also programmatically define the schema (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection). Next, let's execute some SQL queries to see the data stored in DataFrame, so the first SQL would be to print all records: //Executing SQL Queries to Print all records in the DataFrame println("Printing All records") sqlCtx.sql("Select * from company").collect().foreach(print) The execution of the preceding statement will produce the following results on the console where the driver is executed: Next, let's also select only few columns instead of all records and print the same on console: //Executing SQL Queries to Print Name and Employees //in each Department println("n Printing Number of Employees in All Departments") sqlCtx.sql("Select Name, No_Of_Emp from company").collect().foreach(println) The execution of the preceding statement will produce the following results on the Console where the driver is executed: Now, finally let's do some aggregation and count the total number of all employees across the departments: //Using the aggregate function (agg) to print the //total number of employees in the Company println("n Printing Total Number of Employees in Company_X") val allRec = sqlCtx.sql("Select * from company").agg(Map("No_Of_Emp"->"sum")) allRec.collect.foreach ( println ) In the preceding piece of code, we used the agg(…) function and performed the sum of all employees across the departments, where sum can be replaced by avg, max, min, or count. The execution of the preceding statement will produce the following results on the console where the driver is executed: The preceding images shows the results of executing the aggregation on our company.json data. Refer to the Data Frame API at https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.DataFrame for further information on the available functions for performing aggregation. As a last step, we will stop our Spark SQL context by invoking the stop() function on SparkContext—sparkCtx.stop(). This is required so that your application can notify master or resource manager to release all resources allocated to the Spark job. It also ensures the graceful shutdown of the job and avoids any resource leakage, which may happen otherwise. Also, as of now, there can be only one Spark context active per JVM, and we need to stop() the active SparkContext class before creating a new one. Summary In this article, we have seen the step-by-step process of using Spark SQL as a standalone program. Though we have considered JSON files as an example, but we can also leverage Spark SQL with Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md) or MongoDB (https://github.com/Stratio/spark-mongodb) or Elasticsearch (http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html). Resources for Article: Further resources on this subject: Getting Started with Apache Spark DataFrames[article] Sabermetrics with Apache Spark[article] Getting Started with Apache Spark [article]
Find Friends on Facebook

Packt
22 Sep 2015
13 min read
 In this article by the authors, Vikram Garg and Sharan Kumar Ravindran, of the book, Mastering Social Media Mining with R, we learn about data mining using Facebook as our resource. (For more resources related to this topic, see here.) We will see how to use the R package Rfacebook, which provides access to the Facebook Graph API from R. It includes a series of functions that allow us to extract various data about our network such as friends, likes, comments, followers, newsfeeds, and much more. We will discuss how to visualize our Facebook network and we will see some methodologies to make use of the available data to implement business cases. Rfacebook package installation and authentication The Rfacebook package is authored and maintained by Pablo Barbera and Michael Piccirilli. It provides an interface to the Facebook API. It needs Version 2.12.0 or later of R and it is dependent on a few other packages, such as httr, rjson, and httpuv. Before starting, make sure those packages are installed. It is preferred to have Version 0.6 of the httr package installed. Installation We will now install the Rfacebook packages. We can download and install the latest package from GitHub using the following code and load the package using the library function. On the other hand, we will also install the Rfacebook package from the CRAN network. One prerequisite for installing the package using the function install_github is to have the package devtools loaded into the R environment. The code is as follows: library(devtools) install_github("Rfacebook", "pablobarbera", subdir="Rfacebook") library(Rfacebook) After installing the Rfacebook package for connecting to the API, make an authentication request. This can be done via two different methods. The first method is by using the access token generated for the app, which is short-lived (valid for two hours); on the other hand, we can create a long-lasting token using the OAuth function. Let's first create a temporary token. Go to https://developers.facebook.com/tools/explorer, click on Get Token, and select the required user data permissions. The Facebook Graph API explorer will open with an access token. This access token will be valid for two hours. The status of the access token as well as the scope can be checked by clicking on the Debug button. Once the tokens expire, we can regenerate a new token. Now, we can access the data from R using the following code. The access token generated using the link should be copied and passed to the token variable. The use of username in the function getUsers is deprecated in the latest Graph API; hence, we are passing the ID of a user. You can get your ID from the same link that was used for token generation. This function can be used to pull the details of any user, provided the generated token has the access. Usually, access is limited to a few users with a public setting or those who use your app. It is also based on the items selected in the user data permission check page during token generation. In the following code, paste your token inside the double quotes, so that it can be reused across the functions without explicitly mentioning the actual token. token<- "XXXXXXXXX" A closer look at how the package works The getUsers function using the token will hit the Facebook Graph API. Facebook will be able to uniquely identify the users as well as the permissions to access information. If all the check conditions are satisfied, we will be able to get the required data. 
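Before moving on, here is a small hedged sketch for verifying that the pasted token actually works; the Facebook ID below is only a placeholder for your own ID from the Graph API explorer, and the tryCatch wrapper is a plain base R convenience rather than part of the Rfacebook API:
token <- "XXXXXXXXX"       # the two-hour token from the Graph API explorer
my_id <- "000000000000000" # placeholder: replace with your own numeric Facebook ID
check <- tryCatch(getUsers(my_id, token), error = function(e) e)
if (inherits(check, "error")) {
  message("Token rejected: ", conditionMessage(check))
} else {
  message("Token accepted, authenticated user: ", check$name)
}
If the token has expired or lacks the required permissions, the error message returned by the Graph API will show up here instead of the user details.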
Copy the token from the mentioned URL and paste it within the double quotes. Remember that the token generated will be active only for two hours. Use the getUsers function to get the details of the user. Earlier, the getUsers function used to work based on the Facebook friend's name as well as ID; in API Version 2.0, we cannot access the data using the name. Consider the following code for example: token<- "XXXXXXXXX" me<- getUsers("778278022196130", token, private_info = TRUE) Then, the details of the user, such as name and hometown, can be retrieved using the following code: me$name The output is also mentioned for your reference: [1] "Sharan Kumar R" For the following code: me$hometown The output is as follows: [1] "Chennai, Tamil Nadu" Now, let's see how to create a long-lasting token. Open your Facebook app page by going to https://developers.facebook.com/apps/ and choosing your app. On theDashboard tab, you will be able to see the App ID and Secret Code values. Use those in the following code. require("Rfacebook") fb_oauth<- fbOAuth(app_id="11",app_secret="XX",extended_permissions = TRUE) On executing the preceding statements, you will find the following message in your console: Copy and paste into Site URL on Facebook App Settings: http://localhost:1410/ When done, press any key to continue... Copy the URL displayed and open your Facebook app; on the Settings tab, click on the Add Platform button and paste the copied URL in the Site URL text box. Make sure to save the changes. Then, return to the R console and press any key to continue, you will be prompted to enter your Facebook username and password. On completing that, you will return to the R console. If you find the following message, it means your long-lived token is ready to use. When you get the completion status, you might not be able to access any of the information. It is advisable to use the OAuth function a few minutes after creation of the Facebook application. Authentication complete. Authentication successful. After successfully authenticating, we can save it and load on demand using the following code: save(fb_oauth, file="fb_oauth") load("fb_oauth") When it is required to automate a few things or to use Rfacebook extensively, it will be very difficult as the tokens should be generated quite often. Hence, it is advisable to create a long-lasting token to authenticate the user, and then save it. Whenever required, we can just load it from a local file. Note that Facebook authentication might take several minutes. Hence, if your authentication fails on the retry, please wait for some time before pressing any key and check whether you have installed the httr package Version 0.6. If you continue to experience any issues in generating the token, then it's not a problem. We are good to go with the temporary token. Exercise Create an app in Facebook and authenticate by any one of the methods discussed. A basic analysis of your network In this section, we will discuss how to extract Facebook network of friends and some more information about the people in our network. After completing the app creation and authentication steps, let's move forward and learn to pull some basic network data from Facebook. First, let's find out which friends we have access to, using the following command in R. 
Let's use the temporary token for accessing the data: token<- "XXXXXXXXX" friends<- getFriends(token, simplify = TRUE) head(friends) # To see few of your friends The preceding function will return all our Facebook friends whose data is accessible. Version 1 of the API would allow us to download all the friends' data by default. But in the new version, we have limited access. Since we have set simplify as TRUE, we will pull only the username and their Facebook ID. By setting the same parameter to FALSE, we will be able to access additional data such as gender, location, hometown, profile picture, relationship status, and full name. We can use the function getUsers to get additional information about a particular user. The following information is available by default: gender, location, and language. We can, however, get some additional information such as relationship status, birthday, and the current location by setting the parameter private_info to TRUE: friends_data<- getUsers(friends$id, token, private_info = TRUE) table(friends_data$gender) The output is as follows: female male 5 21 We can also find out the language, location, and relationship status. The commands to generate the details as well as the respective outputs are given here for your reference: #Language table(substr(friends_data$locale, 1, 2)) The output is as follows: en 26 The code to find the location is as follows: # Location (Country) table(substr(friends_data$locale, 4, 5)) The output is as follows: GB US 1 25 Here's the code to find the relationship status: # Relationship Status table(friends_data$relationship_status) Here's the output: Engaged Married Single 1 1 3 Now, let's see what things were liked by us in Facebook. We can use the function getLikes to get the like data. In order to know about your likes data, specify user as me. The same function can be used to extract information about our friends, in which case we should pass the user's Facebook ID. This function will provide us with a list of Facebook pages liked by the user, their ID, name, and the website associated with the page. We can even restrict the number of results retrieved by setting a value to the parameter n. The same function will be used to get the likes of people in our network; instead of the keyword me, we should give the Facebook ID of those users. Remember we can only access data of people with accessibility from our app. The code is as follows: likes<- getLikes(user="me", token=token) head(likes) After exploring the use of functions to pull data, let's see how to use the Facebook Query Language using the function getFQL, which can be used to pass the queries. The following query will get you the list of friends in your network: friends<- getFQL("SELECT uid2 FROM friend WHERE uid1=me()", token=token) In order to get the complete details of your friends, the following query can be used. The query will return the username, Facebook ID, and the link to their profile picture. Note that we might not be able to access the complete network of friends' data, since access to data of all your friends are deprecated with Version 2.0. The code is as follows: # Details about friends Friends_details<- getFQL("SELECT uid, name, pic_square FROM user WHERE uid = me() OR uid IN (SELECT uid2 FROM friend WHERE uid1 = me())", token=token) In order to know more about the Facebook Query Language, check out the following link. This method of extracting the information might be preferred by people familiar with query language. 
It can also help extract data satisfying only specific conditions (https://developers.facebook.com/docs/technical-guides/fql). Exercise Download your Facebook network and do an exploration analysis on the languages your friends speak, places where they live, the total number of pages they have liked, and their marital status. Try all these with the Facebook Query Language as well. Network analysis and visualization So far, we used a few functions to get the details about our Facebook profile as well as friends' data. Let's see how to get to know more about our network. Before learning to get the network data, let's understand what a network is as well as a few important concepts about the network. Anything connected to a few other things could be a network. Everything in real life is connected to each other, for example, people, machines, events, and so on. It would make a lot of sense if we analyzed them as a network. Let's consider a network of people; here, people will be the nodes in the network and the relationship between them would be the edges (lines connecting them). Social network analysis The technique to study/analyze the network is called social network analysis. We will see how to create a simple plot of friends in our network in this section. To understand the nodes (people/places/etc) in a network in social network analysis, we need to evaluate the position of the nodes. We can evaluate the nodes using centrality. Centrality can be measured using different methods like degree, betweenness, and closeness. Let's first get our Facebook network and then get to know the centrality measures in detail. We use the function getNetwork to download our Facebook network. We need to mention how we would like to format the data. When the parameter format is set to adj.matrix, it will produce the data in matrix format where the people in the network would become the row names and column names of the matrix and if they are connected to each other, then the corresponding cell in the matrix will hold a value. The command is as follows: network<- getNetwork(token, format="adj.matrix") We now have our Facebook network downloaded. Let's visualize our network before getting to understand the centrality concept one by one with our own network. To visualize the network, we need to use the package called igraph in R. Since we downloaded our network in the adjacency matrix format, we will use the same function in igraph. We use the layout function to determine the placement of vertices in the network for drawing the graph and then we use the plot function to draw the network. In order to explore various other functionalities in these parameters, you can execute the ?<function_name> function in RStudio and the help window will have the description of the function. Let’s use the following code to load the package igraph into R. require(igraph) We will now build the graph using the function graph.adjacency; this function helps in creating a network graph using the adjacency matrix. In order to build a force-directed graph, we will use the function layout.drl. The force-directed graph will help in making the graph more readable. The commands are as follows: social_graph<- graph.adjacency(network) layout<- layout.drl(social_graph, options=list(simmer.attraction=0)) At last, we will use the plot function with various built in parameters to make the graph more readable. 
For example, we can name the nodes in our network, we can set the size of the nodes as well as the edges in the network, and we can color the graph and the components of the graph. Use the following code to see what the network looks like. The output that was plotted can be saved locally using the function dev.copy, and the size of the image as well as the type can be passed as a parameter to the function: plot(social_graph, vertex.size=10, vertex.color="green", vertex.label=NA, vertex.label.cex=0.5, edge.arrow.size=0, edge.curved=TRUE, layout=layout.fruchterman.reingold) dev.copy(png,filename= "C:/Users/Sharan/Desktop/3973-03-community.png", width=600, height=600); dev.off (); With the preceding plot function, my network will look like the following one. In the following network, the node labels (name of the people) have been disabled. They can be enabled by removing the vertex.label parameter. Summary In this article, we discussed how to use the various functions implemented in the Rfacebook package, analyze the network. This article covers the important techniques that helps in performing vital network analysis and also enlightens us about the wide range of business problems that could be addressed with the Facebook data. It gives us a glimpse of the great potential for implementation of various analyses. Resources for Article: Further resources on this subject: Using App Directory in HootSuite[article] Supervised learning[article] Warming Up [article]
Implementing Decision Trees

Packt
22 Sep 2015
4 min read
 In this article by the author, Sunila Gollapudi, of the book Practical Machine Learning, we will outline a business problem that can be addressed by building a decision tree-based model, and see how it can be implemented in Apache Mahout, R, Julia, Apache Spark, and Python. (For more resources related to this topic, see here.) Implementing decision trees Here, we will explore implementing decision trees using various frameworks and tools. The R example We will use the rpart package and the ctree function from the party package in R to build decision tree-based models. The steps are as follows: import the packages for data import and the decision tree libraries; start the data manipulation by creating a categorical variable on Sales and appending it to the existing dataset; split the data into training and testing datasets using random sampling; fit the tree model with the training data and check how the model performs on the testing data by measuring the error; prune the tree; and finally plot the pruned tree. The Spark example A Java-based example using MLlib is shown here: import java.util.HashMap; import java.util.Map; import scala.Tuple2; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; import org.apache.spark.api.java.function.PairFunction; import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.tree.DecisionTree; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.util.MLUtils; import org.apache.spark.SparkConf; SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree"); JavaSparkContext sc = new JavaSparkContext(sparkConf); // Load and parse the data file. String datapath = "data/mllib/sales.txt"; JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD(); // Split the data into training and test sets (30% held out for testing) JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3}); JavaRDD<LabeledPoint> trainingData = splits[0]; JavaRDD<LabeledPoint> testData = splits[1]; // Set parameters. // Empty categoricalFeaturesInfo indicates all features are continuous. Integer numClasses = 2; Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>(); String impurity = "gini"; Integer maxDepth = 5; Integer maxBins = 32; // Train a DecisionTree model for classification. 
final DecisionTreeModel model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins); // Evaluate model on test instances and compute test error JavaPairRDD<Double, Double> predictionAndLabel = testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { @Override public Tuple2<Double, Double> call(LabeledPoint p) { return new Tuple2<Double, Double>(model.predict(p.features()), p.label()); } }); Double testErr = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { @Override public Boolean call(Tuple2<Double, Double> pl) { return !pl._1().equals(pl._2()); } }).count() / testData.count(); System.out.println("Test Error: " + testErr); System.out.println("Learned classification tree model:n" + model.toDebugString()); The Julia example We will use the DecisionTree package in Julia as shown here; julia> Pkg.add("DecisionTree")julia> using DecisionTree We will use the RDatasets package to load the dataset for the example in context; julia> Pkg.add("RDatasets"); using RDatasets julia> sales = data("datasets", "sales"); julia> features = array(sales[:, 1:4]); # use matrix() for Julia v0.2 julia> labels = array(sales[:, 5]); # use vector() for Julia v0.2 julia> stump = build_stump(labels, features); julia> print_tree(stump) Feature 3, Threshold 3.0 L-> price : 50/50 R-> shelvelock : 50/100 Pruning the tree julia> length(tree) 11 julia> pruned = prune_tree(tree, 0.9); julia> length(pruned) 9 Summary In this article, we implemented decision trees using R, Spark, and Julia. Resources for Article: Further resources on this subject: An overview of common machine learning tasks[article] How to do Machine Learning with Python[article] Modeling complex functions with artificial neural networks [article]


Stata as Data Analytics Software

Packt
22 Sep 2015
11 min read
In this article by Prasad Kothari, the author of the book Data Analysis with STATA, the overall goal is to cover Stata-related topics such as data management, graphs and visualization, and programming in Stata. The article begins with an introduction to Stata and data analytics, then discusses Stata programming and data management, data visualization, and the important statistical tests in Stata. It then covers linear and logistic regression in Stata and, finally, walks through analyses such as survey analysis, time series analysis, and survival analysis, along with the statistical modeling techniques behind them and how to implement them in Stata. (For more resources related to this topic, see here.)

These days, many people use Stata for econometric and medical research purposes, among other things. Many others use different packages, such as Statistical Package for the Social Sciences (SPSS), EViews, Micro, RATS/CATS (used by time series experts), or R and Matlab/Gauss/Fortran (used for hardcore analysis). One should know the usage of Stata and then apply it in their relative fields.

Stata is a command-driven language; there are over 500 different commands and menu options, and each has a particular syntax required to invoke any of the various options. Learning these commands is a time-consuming process, but it is not hard. At the end of each class, your do-file will contain all the commands that we have covered, but there is no way we will cover all of these commands in this short introductory course.

Stata is a combined statistical analytical tool that is intended for use by research scholars and analytics practitioners. Stata has many strengths, but we are going to talk about the most important one: managing, adjusting, and arranging large sets of data. Stata has many versions, and with every version it keeps on improving; for example, in Stata versions 11 to 14, there has been progress in computing speed, capabilities and functionalities, and flexible graphic capabilities. Over time, Stata keeps changing and updating as per users' suggestions. In addition, Stata can be extended through nonstandard, user-written features, which means that you can easily get help from the Web if another person has written a program that can be integrated with the software for the purpose of analysis.

The following topics will be covered in this article:

Introducing data analytics
Introducing the Stata interface and basic techniques

Introducing data analytics

We analyze data every day for various reasons. Predicting an event or forecasting key indicators, such as the revenue of a given organization, is fast becoming a major requirement in the industry. There are various types of techniques and tools that can be leveraged to analyze data. Here are the techniques that will be covered in this article using Stata as a tool:

Stata programming and data management: Before predicting anything, we need to manage and massage the data in order to make it good enough to derive insights from. The programming aspect helps in creating new variables to treat the data in such a way that finding patterns in historical data or predicting the outcome of a given event becomes much easier.
Data visualization: After the data preparation, we need to visualize the data for the following: to view what patterns in the data look like, to check whether there are any outliers in the data, to understand the data better, and to draw preliminary insights from the data.
Important statistical tests in Stata: After data visualization, based on observations, you can try to come up with various hypotheses about the data. We need to test these hypotheses on the datasets to check whether they are statistically significant and whether we can depend on and apply them in future situations as well.
Linear regression in Stata: Once done with hypothesis testing, there is always a business need to predict one of the variables, such as what the revenue of a financial organization will be under specific conditions. These predictions about continuous variables, such as revenue, the default amount on a credit card, or the number of items sold in a given store, come through linear regression. Linear regression is the most basic and widely used prediction methodology. We will go into the details of linear regression in a later chapter.
Logistic regression in Stata: When you need to predict the outcome of a particular event along with its probability, logistic regression is the best and most acknowledged method by far. Predicting which team will win a football or cricket match, or whether a customer will default on a loan payment, can be decided through the probabilities given by logistic regression.
Survey analysis in Stata: Understanding customer sentiment and consumer experience is one of the biggest requirements of the retail industry. The research industry also needs data about people's opinions in order to derive the effect of a certain event or the sentiments of the affected people. All of this can be achieved by conducting and analyzing surveys. Survey analysis has various subtechniques, such as factor analysis, principal component analysis, panel data analysis, and so on.
Time series analysis in Stata: When you try to forecast a time-dependent variable with reasonably cyclic or seasonal behavior, time series analysis comes in handy. There are many techniques of time series analysis, but we will talk about a couple of them: Autoregressive Integrated Moving Average (ARIMA) and Box-Jenkins. Forecasting the amount of rainfall based on the rainfall of the past 5 years is a classic time series problem.
Survival analysis in Stata: These days, lots of customers attrite from telecom plans, healthcare plans, and so on and join competitors. When you need to develop a churn or attrition model to check who will attrite, survival analysis is the best model.

The Stata interface

Let's discuss the location and layout of Stata. It is very easy to locate Stata on a computer or laptop; after installing the software, go to the start menu, go to the search menu, and type Stata. You can find the path where the file is saved, which depends on the version that has been installed. Another way to find Stata on a computer is through the quick launch button as well as through the start programs menu. The preceding diagram represents the Stata layout.

The four flavors of Stata are the multiprocessor version (two or four processors), the special edition (SE), Intercooled Stata, and Small Stata. The multiprocessor version is the most efficient one.
Though all flavors function in a similar fashion, they differ mainly in the number of variables and observations they can handle. At present, Stata version 11 is in demand and is being used on various computers. It is a type of software that runs on commands. In the new versions of Stata, new ways of working, such as searchable menus, have come onto the market; however, typing a command is the simplest and quickest way to learn Stata. The more you leverage the functionality of typing commands, the better your learning is. Through typing, programming becomes easy and simple for analytics. Sometimes, it is difficult to find the exact syntax of a command; in that case, it is advisable to use the menus and then copy the generated command for further use.

There are three ways to enter commands, as follows:

Use the do-file program. This is a type of program in which one has to inform the computer (through a command) that it needs to use the do-file type.
Type the command manually.
Enter the command interactively; just click on the menu screen.

Though all three types discussed in the preceding bullets are used, the do-file type is the most frequently used one. The reason is that, for a bigger file, it is faster than manual typing. Secondly, it stores the commands and keeps them in the same format in which they were written. Suppose you make a mistake and want to rectify it; what would you do? In this case, the do-file is useful: one can correct it and run the program once again. Generally, an interactive command is used to find out the problem and, later on, a do-file is used to solve it. The following is an example of an interactive command:

Data-storing techniques in Stata

Stata is a multipurpose program, which can handle not only its own data, but also data in other simple formats, for example, ASCII. Regardless of the source format (Excel or another statistical package), data can be exported to an ASCII file, which means that it can then easily be imported into Stata. The data entered in Stata is held in different types of variables, such as vectors with individual observations in every row; it also holds string and numeric values. Every row holds the observations for an individual, country, firm, or whatever unit's information is entered in Stata. As the data is stored in variables, this makes Stata a very efficient way to store information. Sometimes, it is better to save the data in a different storage form, such as the following:

Matrices
Macros

Matrices should be used carefully as they consume more memory than variables, so there might be a possibility of running low on memory before work is started. Another form is macros; these are similar to variables in other programming languages and are named containers, which means they can contain information of any type. There are two flavors of macros: local/temporary and global. Global macros are flexible and easy to manage; once they are defined, they can be accessed from all commands. On the other hand, local macros are temporary objects that are formed for a particular environment and cannot be used in another area. For example, if you use a local macro in a do-file, it will only exist within that particular do-file's environment.

Directories and folders in Stata

Stata has a tree-style structure to organize directories as well as folders, similar to other operating systems, such as Windows, Linux, Unix, and Mac OS.
This makes it easy to organize your work and retrieve it later at a convenient time. For example, a data folder can be used to save entire datasets, with subfolders for every single dataset, and so on. In Stata, commands in the style of the following systems can be leveraged:

Dos
Linux
Unix

For example, if you need to change the directory, you can use the cd command:

cd "C:\Statafolder"

You can also generate a new directory inside the current directory you have been using, for example:

mkdir "newstata"

You can leverage the dir command to get the details of the directory. If you need the current directory name along with its path, you can utilize the pwd or cd command.

The use of paths in Stata depends on the type of data; usually, there are two kinds of path: absolute and relative. The absolute path contains the full address denoting the folder. In the command you saw in the earlier example, we leveraged the cd command using an absolute path. On the contrary, the relative path provides the location of a file relative to the current directory. The following example of mkdir uses a relative path:

mkdir "EStata\Stata1"

Using relative paths is beneficial, especially when working on different devices, such as a PC at home, a library machine, or a server. To separate folders, Windows and DOS use a backslash (\), whereas Linux and Unix use a slash (/). Sometimes, these conventions might be troublesome when working on a server where Stata is installed. As a general rule, it is advisable to use slashes in relative paths, as Stata understands the slash as a separator on all platforms. The following is an example of this:

mkdir "/Stata1/Data"

This is how you create a new folder for your Stata work.

Summary

In this article, we discussed lots of basic commands that can be leveraged while performing Stata programming. Read Data Analysis with Stata to gain detailed knowledge of the different data management techniques and programming in detail. As you learn more about Stata, you will understand the various commands and functions and their business applications.

Resources for Article: Further resources on this subject: Big Data Analysis (R and Hadoop) [article] Financial Management with Microsoft Dynamics AX 2012 R3 [article] Taming Big Data using HDInsight [article]


Getting Started with Apache Spark DataFrames

Packt
22 Sep 2015
5 min read
In this article about Arun Manivannan's book Scala Data Analysis Cookbook, we will cover the following recipes:

Getting Apache Spark ML – a framework for large-scale machine learning
Creating a data frame from CSV

(For more resources related to this topic, see here.)

Getting started with Apache Spark

Breeze is the building block of Spark MLlib, the machine learning library for Apache Spark. In this recipe, we'll see how to bring Spark into our project (using SBT) and look at how it works internally. The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/build.sbt.

How to do it...

Pulling Spark ML into our project is just a matter of adding a few dependencies to our build.sbt file: spark-core, spark-sql, and spark-mllib:

Under a brand new folder (which will be our project root), we create a new file called build.sbt.
Next, let's add the Spark libraries to the project dependencies:

organization := "com.packt"

name := "chapter1-spark-csv"

scalaVersion := "2.10.4"

val sparkVersion="1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)

resolvers ++= Seq(
  "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
  "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/"
)

How it works...

Spark has four major higher-level tools built on top of the Spark Core: Spark Streaming, Spark MLlib (machine learning), Spark SQL (an SQL interface for accessing data), and GraphX (for graph processing). The Spark Core is the heart of Spark, providing higher-level abstractions in various languages for data representation, serialization, scheduling, metrics, and so on. For this recipe, we skipped Streaming and GraphX and added the remaining three libraries.

There's more…

Apache Spark is a cluster computing platform that claims to run about 100 times faster than Hadoop (that's a mouthful). In our terms, we could consider that as a means to run our complex logic over a massive amount of data at a blazingly high speed. The other good thing about Spark is that the programs we write are much smaller than the typical MapReduce classes that we write for Hadoop. So, not only do our programs run faster, but it also takes less time to write them in the first place.

Creating a data frame from CSV

In this recipe, we'll look at how to create a new data frame from a Delimiter Separated Values (DSV) file. The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter1-spark-csv in the DataFrameCSV class.

How to do it...

CSV support isn't first-class in Spark but is available through an external library from Databricks.
So, let's go ahead and add that to build.sbt. After adding the spark-csv dependency, our complete build.sbt looks as follows:

organization := "com.packt"

name := "chapter1-spark-csv"

scalaVersion := "2.10.4"

val sparkVersion="1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion,
  "com.databricks" %% "spark-csv" % "1.0.3"
)

resolvers ++= Seq(
  "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
  "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/"
)

fork := true

Before we create the actual data frame, there are three steps that we ought to do: create the Spark configuration, create the Spark context, and create the SQL context. SparkConf holds all of the information for running this Spark cluster. For this recipe, we are running locally, and we intend to use only two cores in the machine—local[2]:

val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
val sc = new SparkContext(conf)        // create the Spark context
val sqlContext = new SQLContext(sc)    // create the SQL context

For this recipe, we'll be running Spark in standalone mode. Now let's load our pipe-separated file:

// requires: import com.databricks.spark.csv._
val students: org.apache.spark.sql.DataFrame = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')

How it works...

The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the CSV has a header, then the useHeader flag will read the first row as column names. The delimiter flag, as expected, defaults to a comma, but you can override the character as needed.

Instead of using the csvFile function, you can also use the load function available in the SQL context. The load function accepts the format of the file (in our case, it is CSV) and options as a map. We can specify the same parameters that we specified earlier using a Map, like this:

val options = Map("header" -> "true", "path" -> "ModifiedStudent.csv")
val newStudents = sqlContext.load("com.databricks.spark.csv", options)

Summary

In this article, you learned about Apache Spark ML, a framework for large-scale machine learning. Then we saw the creation of a data frame from CSV with the help of example code.

Resources for Article: Further resources on this subject: Integrating Scala, Groovy, and Flex Development with Apache Maven [article] Ridge Regression [article] Reactive Data Streams [article]


R – Classification and Regression Trees

Packt
22 Sep 2015
16 min read
"The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret), achieves 94.1 percent of the maximum accuracy overcoming 90 percent in the 84.3 percent of the data sets."                                                                          – Fernández-Delgado et al (2014) "You can't see the forest for the trees!"                                                                                                     – An old saying (For more resources related to this topic, see here.) In this article by Cory Lesmeister, the author of Mastering Machine Learning with R, the first item of discussion is the basic decision tree, which is both simple to build and understand. However, the single decision tree method does not perform as well as the other methods such as support vector machines or neural networks. Therefore, we will discuss the creation of multiple, sometimes hundreds of, different trees with their individual results combined, leading to a single overall prediction. The first quote written above is from Fernández-Delgado et al in the Journal of Machine Learning Research and is meant to set the stage that the techniques in this article are quite powerful, particularly when used for the classification problems. Certainly, they are not always the best solution, but they do provide a good starting point. Regression trees For an understanding of the tree-based methods, it is probably easier to start with a quantitative outcome and then move on to how it works on a classification problem. The essence of a tree is that the features are partitioned, starting with the first split that improves the residual sum of squares the most. These binary splits continue until the termination of the tree. Each subsequent split/partition is not done on the entire dataset but only on the portion of the prior split that it falls under. This top-down process is referred as recursive partitioning. It is also a process that is greedy, a term you may stumble on in reading about the machine learning methods. Greedy means that in each split in the process, the algorithm looks for the greatest reduction in the residual sum of squares without a regard to how well it will perform in the later partitions. The result is that you may end up with a full tree of unnecessary branches, leading to a low bias but high variance. To control this effect, you need to appropriately prune the tree to an optimal size after building a full tree. The following figure provides a visual of the technique in action. The data is hypothetical with 30 observations, a response ranging from 1 to 10, and two predictor features, both ranging in value from 0 to 10 named X1 and X2. The tree has three splits that lead to four terminal nodes. Each split is basically an if or then statement or uses an R syntax, ifelse(). In the first split, if X1 < 3.5, then the response is split into 4 observations with an average value of 2.4 and the remaining 26 observations. This left branch of 4 observations is a terminal node as any further splits would not substantially improve the residual sum of squares. The predicted value for the 4 observations in that partition of the tree becomes the average. The next split is at X2 < 4 and finally X1 < 7.5. An advantage of this method is that it can handle the highly nonlinear relationships; but can you see a couple of potential problems? The first issue is that an observation is given the average of the terminal node that it falls under. 
This can hurt the overall predictive performance (high bias). Conversely, if you keep partitioning the data further and further to achieve a low bias, high variance can become an issue. As with the other methods, you can use cross-validation to select the appropriate tree size.

Regression tree with 3 splits and 4 terminal nodes and the corresponding node average and number of observations.

Classification trees

Classification trees operate under the same principle as regression trees, except that the splits are not determined by the residual sum of squares but by an error rate. The error rate used is not what you would expect, where the calculation is simply the misclassified observations divided by the total observations. As it turns out, when it comes to tree splitting, the misclassification rate by itself may lead to a situation where you can gain information with a further split but not improve the misclassification rate. Let's look at an example.

Suppose we have a node—let's call it N0—where you have 7 observations labeled No and 3 observations labeled Yes, so the misclassification rate is 30 percent. With this in mind, let's calculate a common alternative error measure called the Gini index. The formula for a single node's Gini index is as follows:

Gini = 1 – (probability of Class 1)^2 – (probability of Class 2)^2

For N0, the Gini index is 1 - (.7)^2 - (.3)^2, which is equal to 0.42, versus the misclassification rate of 30 percent.

Taking this example further, we will now create an N1 node with 3 observations of Class 1 and none of Class 2, along with N2, which has 4 observations from Class 1 and 3 from Class 2. Now, the overall misclassification rate for this branch of the tree is still 30 percent, but look at the following to see how the overall Gini index has improved:

Gini(N1) = 1 – (3/3)^2 – (0/3)^2 = 0
Gini(N2) = 1 – (4/7)^2 – (3/7)^2 = 0.49
New Gini index = (proportion of N1 x Gini(N1)) + (proportion of N2 x Gini(N2)) = (.3 x 0) + (.7 x 0.49) = 0.343

By doing a split on this surrogate error rate, we actually improved our model impurity by reducing it from 0.42 to 0.343, whereas the misclassification rate did not change. This is the methodology used by the rpart() package.
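The same arithmetic can be checked quickly in Python; this is a small numerical illustration only, not part of the R workflow used in the rest of this article:

def gini(counts):
    """Gini impurity of a node, given its class counts."""
    total = float(sum(counts))
    return 1 - sum((c / total) ** 2 for c in counts)

parent = gini([7, 3])                 # 0.42 for the parent node (7 No, 3 Yes)
n1, n2 = gini([3, 0]), gini([4, 3])   # 0.0 and roughly 0.49 for the children
weighted = 0.3 * n1 + 0.7 * n2        # roughly 0.343, an improvement over 0.42
print(parent, n1, n2, weighted)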
Random forest

To greatly improve our model's predictive ability, we can produce numerous trees and combine the results. The random forest technique does this by applying two different tricks in the model development. The first is the use of bootstrap aggregation, or bagging, as it is called. In bagging, an individual tree is built on a sample of the dataset, roughly two-thirds of the total observations. It is important to note that the remaining one-third is referred to as Out of Bag (OOB). This is repeated dozens or hundreds of times and the results are averaged. Each of these trees is grown and not pruned based on any error measure, and this means that the variance of each of these individual trees is high. However, by averaging the results, you can reduce the variance without increasing the bias.

The next thing that the random forest brings to the table is that, concurrently with the random sample of the data, it also takes a random sample of the input features at each split. In the randomForest package, we will use the default random number of sampled predictors, which is the square root of the total predictors for classification problems and the total predictors divided by 3 for regression. The number of predictors that the algorithm randomly chooses at each split can be changed via the model tuning process.

By doing this random sampling of the features at each split and incorporating it into the methodology, you mitigate the effect of a highly correlated predictor becoming the main driver in all of your bootstrapped trees and preventing you from reducing the variance that you hoped to achieve with bagging. The subsequent averaging of trees that are less correlated with each other than if you had only performed bagging is more generalizable and more robust to outliers.

Gradient boosting

The boosting methods can become extremely complicated to learn and understand, but you should keep in mind what is fundamentally happening behind the curtain. The main idea is to build an initial model of some kind (linear, spline, tree, and so on) called the base learner, examine the residuals, and fit a model based on these residuals around the so-called loss function. A loss function is merely the function that measures the discrepancy between the model and the desired prediction, for example, a squared error for regression or the logistic function for classification. The process continues until it reaches some specified stopping criterion. This is like the student who takes a practice exam and gets 30 out of 100 questions wrong and, as a result, studies only those 30 questions that were missed. On the next practice exam, they get 10 out of these 30 wrong and so only focus on those 10 questions, and so on. If you would like to explore the theory behind this further, a great resource is available in Frontiers in Neurorobotics, Gradient boosting machines, a tutorial, Natekin A., Knoll A., (2013), at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/.

As previously mentioned, boosting can be applied to many different base learners, but here we will only focus on the specifics of tree-based learning. Each tree iteration is small, and we will determine how small it is with one of the tuning parameters, referred to as interaction depth. In fact, it may be as small as one split, which is referred to as a stump. Trees are sequentially fit to the residuals according to the loss function up to the number of trees that we specified (our stopping criterion). There is another tuning parameter that we will need to identify, and that is shrinkage. You can think of shrinkage as the rate at which your model is learning generally and, specifically, as the contribution of each tree or stump to the model. This learning rate acts as a regularization parameter. The other thing about our boosting algorithm is that it is stochastic, meaning that it adds randomness by taking a random sample of the data at each tree. Introducing some randomness to a boosted model usually improves the accuracy and speed and reduces overfitting (Friedman 2002). As you may have guessed, tuning these parameters can be quite a challenge. These parameters can interact with each other and, if you just tinker with one without considering the others, your model may actually perform worse. The caret package will help us in this endeavor.
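To make the idea of sequentially fitting small trees to residuals with shrinkage concrete, here is a minimal sketch of the core boosting loop in Python with NumPy and scikit-learn. It is for intuition only and is not the gbm/caret implementation used later in this article; the toy data, shrinkage value, and number of trees are arbitrary choices:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

shrinkage = 0.1                       # learning rate: contribution of each stump
n_trees = 100                         # stopping criterion
pred = np.full(y.shape, y.mean())     # base learner: a constant prediction
ensemble = []

for _ in range(n_trees):
    residuals = y - pred                        # squared-error loss => fit the residuals
    stump = DecisionTreeRegressor(max_depth=1)  # interaction depth of one split
    stump.fit(X, residuals)
    pred += shrinkage * stump.predict(X)        # each stump nudges the prediction a little
    ensemble.append(stump)

print('training MSE: %.4f' % np.mean((y - pred) ** 2))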
Business case

The overall business objective in this situation is to see if we can improve the predictive ability for some of the cases. For regression, we will visit the prostate cancer data. For classification purposes, we will utilize both the breast cancer biopsy data and the Pima Indian diabetes data. Both random forests and boosting will be applied to all three datasets. The simple tree method will be used only on the breast and prostate cancer sets.

Regression tree

We will jump right into the prostate data set, but first let's load the necessary R packages, as follows:

> library(rpart) #classification and regression trees
> library(partykit) #treeplots
> library(MASS) #breast and pima indian data
> library(ElemStatLearn) #prostate data
> library(randomForest) #random forests
> library(gbm) #gradient boosting
> library(caret) #tune hyper-parameters

First, we will do regression on the prostate data. This involves calling the dataset, coding the gleason score as an indicator variable using the ifelse() function, and creating the test and training sets. The training set will be pros.train and the test set will be pros.test, as follows:

> data(prostate)
> prostate$gleason = ifelse(prostate$gleason == 6, 0, 1)
> pros.train = subset(prostate, train==TRUE)[,1:9]
> pros.test = subset(prostate, train==FALSE)[,1:9]

To build a regression tree on the training data, we will use the rpart() function from R's rpart package. The syntax is quite similar to what we used in the other modeling techniques:

> tree.pros <- rpart(lpsa~., data=pros.train)

We can call this object using the print() function and cptable and then examine the error per split to determine the optimal number of splits in the tree, as follows:

> print(tree.pros$cptable)
          CP nsplit rel error    xerror      xstd
1 0.35852251      0 1.0000000 1.0364016 0.1822698
2 0.12295687      1 0.6414775 0.8395071 0.1214181
3 0.11639953      2 0.5185206 0.7255295 0.1015424
4 0.05350873      3 0.4021211 0.7608289 0.1109777
5 0.01032838      4 0.3486124 0.6911426 0.1061507
6 0.01000000      5 0.3382840 0.7102030 0.1093327

This is a very important table to analyze. The first column, labeled CP, is the cost complexity parameter. The second column, nsplit, is the number of splits in the tree. The rel error column stands for relative error and is the residual sum of squares for that number of splits divided by the residual sum of squares for no splits (RSS(k)/RSS(0)). Both xerror and xstd are based on a ten-fold cross-validation, with xerror being the average error and xstd being the standard deviation of the cross-validation process. We can see that four splits produced a slightly lower error using cross-validation, while five splits produced the lowest error on the full dataset. You can examine this using the plotcp() function, as follows:

> plotcp(tree.pros)

The following is the output of the preceding command:

The plot shows us the relative error by tree size with the corresponding error bars. The horizontal line on the plot is the upper limit of the lowest standard error. Selecting tree size 5, which is four splits, we can build a new tree object where xerror is minimized by pruning our tree accordingly—first creating an object for cp associated with the pruned tree from the table. Then the prune() function handles the rest, as follows:

> cp = min(tree.pros$cptable[5,])
> prune.tree.pros <- prune(tree.pros, cp = cp)

With this done, you can plot and compare the full and pruned trees. The tree plots produced by the partykit package are much better than those produced by the party package. You can simply use the as.party() function as a wrapper in the plot() function:

> plot(as.party(tree.pros))

The output of the preceding command is as follows:

> plot(as.party(prune.tree.pros))

The following is the output of the preceding command:

Note that the splits are exactly the same in the two trees, with the exception of the last split, which includes the age variable for the full tree.
Interestingly, both the first and second splits in the tree are related to the log of cancer volume, lcavol. These plots are quite informative as they show the splits, nodes, observations per node, and box plots of the outcome that we are trying to predict. Let's see how well the pruned tree performs on the test data. What we will do is create an object of the predicted values using the predict() function by incorporating the test data. Then, we will calculate the errors as the predicted values minus the actual values, and finally the mean of the squared errors, as follows:

> party.pros.test <- predict(prune.tree.pros, newdata=pros.test)
> rpart.resid = party.pros.test - pros.test$lpsa #calculate residuals
> mean(rpart.resid^2) #calculate MSE
[1] 0.5267748

One can look at the tree plots that we produced and easily explain what the primary drivers behind the response are. As mentioned in the introduction, trees are easy to interpret and explain, which, in many cases, may be more important than accuracy.

Classification tree

For the classification problem, we will prepare the breast cancer data. After loading the data, you will delete the patient ID, rename the features, eliminate the few missing values, and then create the train/test datasets, as follows:

> data(biopsy)
> biopsy <- biopsy[,-1] #delete ID
> names(biopsy) = c("thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "class") #change the feature names
> biopsy.v2 = na.omit(biopsy) #delete the observations with missing values
> set.seed(123) #random number generator
> ind = sample(2, nrow(biopsy.v2), replace=TRUE, prob=c(0.7, 0.3))
> biop.train = biopsy.v2[ind==1,] #the training data set
> biop.test = biopsy.v2[ind==2,] #the test data set

With the data set up appropriately, we will use the same syntax style for a classification problem as we did previously for a regression problem, but before creating a classification tree, we need to ensure that the outcome is a factor, which can be checked using the str() function, as follows:

> str(biop.test[,10])
Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 2 1 1 …

First, we will create the tree:

> set.seed(123)
> tree.biop <- rpart(class~., data=biop.train)

Then, examine the table for the optimal number of splits:

> print(tree.biop$cptable)
          CP nsplit rel error    xerror       xstd
1 0.79651163      0 1.0000000 1.0000000 0.06086254
2 0.07558140      1 0.2034884 0.2674419 0.03746996
3 0.01162791      2 0.1279070 0.1453488 0.02829278
4 0.01000000      3 0.1162791 0.1744186 0.03082013

The cross-validation error is at a minimum with only two splits (row 3). We can now prune the tree, plot the full and pruned trees, and see how the pruned tree performs on the test set, as follows:

> cp = min(tree.biop$cptable[3,])
> prune.tree.biop <- prune(tree.biop, cp = cp)
> plot(as.party(tree.biop))
> plot(as.party(prune.tree.biop))

An examination of the tree plots shows that the uniformity of the cell size is the first split, then bare nuclei. The full tree had an additional split at the cell thickness. We can predict the test observations using type="class" in the predict() function:

> rparty.test <- predict(prune.tree.biop, newdata=biop.test, type="class")
> table(rparty.test, biop.test$class)
rparty.test benign malignant
  benign       136         3
  malignant      6        64
> (136+64)/209
[1] 0.9569378

The basic tree with just two splits gets us almost 96 percent accuracy. This still falls short but should encourage us to believe that we can improve on it with the upcoming methods, starting with random forests.
Summary

In this article, we learned about both the power and limitations of tree-based learning methods for classification and regression problems. To improve on their predictive ability, we have the tools of random forests and gradient boosted trees at our disposal.

Resources for Article: Further resources on this subject: Big Data Analysis (R and Hadoop) [article] Using R for Statistics, Research, and Graphics [article] First steps with R [article]

Modeling complex functions with artificial neural networks

Packt
21 Sep 2015
13 min read
In this article by Sebastian Raschka, the author of Python Machine Learning, we will take a look at the concept of multilayer artificial neural networks, which was inspired by hypotheses and models of how the human brain works to solve complex problem tasks. (For more resources related to this topic, see here.)

Although artificial neural networks have gained a lot of popularity in recent years, early studies of neural networks go back to the 1940s, when Warren McCulloch and Walter Pitt first described the concept of how neurons may work. The decades that followed saw the first implementation of the McCulloch-Pitt neuron model, Rosenblatt's perceptron, in the 1950s. However, many researchers and machine learning practitioners slowly began to lose interest in neural networks, since no one had a good solution for the training of a neural network with multiple layers. Eventually, interest in neural networks was rekindled in 1986 when D.E. Rumelhart, G.E. Hinton, and R.J. Williams were involved in the discovery and popularization of the backpropagation algorithm to train neural networks more efficiently (Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). Learning representations by back-propagating errors. Nature 323 (6088): 533–536). During the last decade, many more major breakthroughs have been made, known as deep learning algorithms. These can be used to create so-called feature detectors from unlabeled data to pre-train deep neural networks—neural networks that are composed of many layers.

Neural networks are a hot topic not only in academic research but also in big technology companies such as Facebook, Microsoft, and Google, which invest heavily in artificial neural networks and deep learning research. Today, complex neural networks powered by deep learning algorithms are considered state of the art when it comes to solving complex problems, such as image and voice recognition.

Introducing the multilayer neural network architecture

In this section, we will connect multiple single neurons to a multilayer feed-forward neural network; this type of network is also called a multilayer perceptron (MLP). The following figure illustrates the concept of an MLP consisting of three layers: one input layer, one hidden layer, and one output layer. The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer, respectively.

As shown in the preceding diagram, we denote the ith activation unit in the jth layer as a_i^(j), and the activation units a_0^(1) and a_0^(2) are the bias units, which we set equal to 1. The activation of the units in the input layer is just the input plus the bias unit:

a^(1) = [a_0^(1), a_1^(1), ..., a_m^(1)]^T = [1, x_1, ..., x_m]^T

Each unit in layer j is connected to all units in layer j + 1 via a weight coefficient; for example, the connection between unit a in layer j and unit b in layer j + 1 would be written as w_{a,b}^(j). Note that the superscript i in x^(i) stands for the ith sample, not the ith layer; in the following paragraphs, we will often omit the superscript i for clarity.

Activating a neural network via forward propagation

In this section, we will describe the process of forward propagation to calculate the output of an MLP model. To understand how it fits into the context of learning an MLP model, let's summarize the MLP learning procedure in three simple steps:

Starting at the input layer, we forward propagate the patterns of the training data through the network to generate an output.
Based on the network's output, we calculate the error we want to minimize using a cost function, which we will describe later.
We then backpropagate the error, find its derivative with respect to each weight in the network, and update the model.

Finally, after we have repeated steps 1-3 for many epochs and learned the weights of the MLP, we use forward propagation to calculate the network output and apply a threshold function to obtain the predicted class labels in the one-hot representation, which we described in the previous section.

Now, let's walk through the individual steps of forward propagation to generate an output from the patterns in the training data. Since each unit in the hidden layer is connected to all units in the input layer, we first calculate the activation a_1^(2) as follows:

z_1^(2) = a_0^(1) w_{0,1}^(1) + a_1^(1) w_{1,1}^(1) + ... + a_m^(1) w_{m,1}^(1)
a_1^(2) = φ(z_1^(2))

Here, z_1^(2) is the net input and φ(·) is the activation function, which has to be differentiable so as to learn the weights that connect the neurons using a gradient-based approach. To be able to solve complex problems such as image classification, we need non-linear activation functions in our MLP model, for example, the sigmoid (logistic) activation function:

φ(z) = 1 / (1 + e^(-z))

The sigmoid function is an "S"-shaped curve that maps the net input z onto a logistic distribution in the range 0 to 1, taking the value 0.5 at z = 0, as shown in the following graph:

Intuitively, we can think of the neurons in the MLP as logistic regression units that return values in the continuous range between 0 and 1. For purposes of code efficiency and readability, we will now write the activation in a more compact form using the concepts of basic linear algebra, which will allow us to vectorize our code implementation via NumPy rather than writing multiple nested and expensive Python for-loops:

z^(2) = W^(1) a^(1)
a^(2) = φ(z^(2))

Here, a^(1) is our [m + 1] x 1 dimensional feature vector for a sample plus the bias unit, and W^(1) is the h x [m + 1] dimensional weight matrix, where h is the number of hidden units in our neural network. After matrix-vector multiplication, we obtain the h x 1 dimensional net input vector z^(2). Furthermore, we can generalize this computation to all n samples in the training set:

Z^(2) = W^(1) [A^(1)]^T

A^(1) is now an n x [m + 1] matrix, and the matrix-matrix multiplication will result in an h x n dimensional net input matrix Z^(2). Finally, we apply the activation function φ to each value in the net input matrix to get the h x n activation matrix A^(2) for the next layer (here, the output layer):

A^(2) = φ(Z^(2))

Similarly, we can rewrite the activation of the output layer in the vectorized form:

Z^(3) = W^(2) A^(2)

Here, we multiply the t x h matrix W^(2) (t is the number of output class labels) by the h x n dimensional matrix A^(2) to obtain the t x n dimensional matrix Z^(3) (the columns in this matrix represent the outputs for each sample). Lastly, we apply the sigmoid activation function to obtain the continuous-valued output of our network:

A^(3) = φ(Z^(3))
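As a small illustration of this vectorized forward pass, the following NumPy sketch computes A^(2) and A^(3) for randomly initialized weights. The layer sizes and the weight initialization here are arbitrary placeholders, and for brevity the sketch omits the hidden-layer bias unit, which the NeuralNetMLP implementation used below handles along with the weight updates:

import numpy as np

def sigmoid(z):
    # logistic activation: maps the net input onto the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

n, m, h, t = 5, 784, 50, 10           # samples, features, hidden units, output units
rng = np.random.RandomState(1)

X = rng.rand(n, m)                    # n x m input samples
A1 = np.hstack((np.ones((n, 1)), X))  # add the bias unit -> n x (m + 1)

W1 = rng.uniform(-1, 1, size=(h, m + 1))  # h x (m + 1) weight matrix
W2 = rng.uniform(-1, 1, size=(t, h))      # t x h weight matrix

Z2 = W1.dot(A1.T)    # h x n net input of the hidden layer
A2 = sigmoid(Z2)     # h x n activation of the hidden layer
Z3 = W2.dot(A2)      # t x n net input of the output layer
A3 = sigmoid(Z3)     # t x n continuous-valued network output

print(A3.shape)      # (10, 5): one column of class activations per sample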
Classifying handwritten digits

In this section, we will train our first multilayer neural network to classify handwritten digits from the popular MNIST dataset (Mixed National Institute of Standards and Technology database), which has been constructed by Yann LeCun and others (Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998) and serves as a popular benchmark dataset for machine learning algorithms.

Obtaining the MNIST dataset

The MNIST dataset is publicly available at http://yann.lecun.com/exdb/mnist/ and consists of these four parts:

Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, 60,000 samples)
Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, 60,000 labels)
Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB unzipped, 10,000 samples)
Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, 10,000 labels)

In this section, we will only be working with a subset of MNIST. Thus, we only need to download the training set images and training set labels. After downloading the files, I recommend that you unzip them using the Unix/Linux gzip tool from the terminal for efficiency, for example, using the following command in your local MNIST download directory (or, alternatively, your favorite unarchiver tool if you are working on a Microsoft Windows machine):

gzip *ubyte.gz -d

The images are stored in byte form, and using the following function, we will read them into NumPy arrays, which we will use to train our MLP:

>>> import os
>>> import struct
>>> import numpy as np

>>> def load_mnist(path):
...     labels_path = os.path.join(path, 'train-labels-idx1-ubyte')
...     images_path = os.path.join(path, 'train-images-idx3-ubyte')
...     with open(labels_path, 'rb') as lbpath:
...         magic, n = struct.unpack('>II', lbpath.read(8))
...         labels = np.fromfile(lbpath, dtype=np.uint8)
...     with open(images_path, 'rb') as imgpath:
...         magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
...         images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)
...     return images, labels

The load_mnist function returns an n x m dimensional NumPy array (images), where n is the number of samples (60,000) and m is the number of features. The images in the MNIST dataset consist of 28 x 28 pixels, and each pixel is represented by a grayscale intensity value. Here, we unroll the 28 x 28 pixels into 1D row vectors, which represent the rows in our images array (784 per row or image). The load_mnist function returns a second array, labels, which contains the 60,000 class labels (integers 0-9) of the handwritten digits.

The way we read in the images might seem a little strange at first:

magic, n = struct.unpack('>II', lbpath.read(8))
labels = np.fromfile(lbpath, dtype=np.uint8)

To understand how these two lines of code work, let's take a look at the dataset description from the MNIST website:

[offset] [type]          [value]           [description]
0000     32 bit integer  0x00000801(2049)  magic number (MSB first)
0004     32 bit integer  60000             number of items
0008     unsigned byte   ??                label
0009     unsigned byte   ??                label
........
xxxx     unsigned byte   ??                label

Using the two lines of the preceding code, we first read in the "magic number," which is a description of the file protocol, as well as the "number of items" (n), from the file buffer, before we read the following bytes into a NumPy array using the fromfile method.
The fmt parameter value >II that we passed as an argument to struct.unpack can be decomposed into two parts:

>: Big-endian (defines the order in which a sequence of bytes is stored)
I: Unsigned integer

After executing the following code, we should have a label vector of 60,000 instances as well as a 60,000 × 784 dimensional image matrix:

>>> X, y = load_mnist('mnist')
>>> print('Rows: %d, columns: %d' % (X.shape[0], X.shape[1]))
Rows: 60000, columns: 784

To get an idea of what those images in MNIST look like, let's define a function that reshapes a 784-pixel sample from our feature matrix into the original 28 × 28 image that we can plot via matplotlib's imshow function:

>>> import matplotlib.pyplot as plt
>>> def plot_digit(X, y, idx):
...     img = X[idx].reshape(28,28)
...     plt.imshow(img, cmap='Greys', interpolation='nearest')
...     plt.title('true label: %d' % y[idx])
...     plt.show()

Now let's use the plot_digit function to display an arbitrary digit (here, the fifth digit) from the dataset:

>>> plot_digit(X, y, 4)

Implementing a multilayer perceptron

In this section, we will implement the code of an MLP with one input, one hidden, and one output layer to classify the images in the MNIST dataset. I tried to keep the code as simple as possible. However, it may seem a little complicated at first. If you are not running the code from the IPython notebook, I recommend that you copy it to a Python script file in your current working directory, for example, neuralnet.py, which you can then import into your current Python session via this:

from neuralnet import NeuralNetMLP

Now, let's initialize a new 784-50-10 MLP, a neural network with 784 input units (n_features), 50 hidden units (n_hidden), and 10 output units (n_output):

>>> nn = NeuralNetMLP(n_output=10,
...                   n_features=X.shape[1],
...                   n_hidden=50,
...                   l2=0.1,
...                   l1=0.0,
...                   epochs=800,
...                   eta=0.001,
...                   alpha=0.001,
...                   decrease_const=0.00001,
...                   shuffle=True,
...                   minibatches=50,
...                   random_state=1)

l2: The λ parameter for L2 regularization. This is used to decrease the degree of overfitting; equivalently, l1 is the λ for L1 regularization.
epochs: The number of passes over the training set.
eta: The learning rate η.
alpha: A parameter for momentum learning, used to add a factor of the previous gradient to the weight update for faster learning: Δw(t) = η∇J(w(t)) + α Δw(t-1), where t is the current time step or epoch.
decrease_const: The decrease constant d for an adaptive learning rate η / (1 + t × d) that decreases over time for better convergence.
shuffle: Shuffle the training set prior to every epoch to prevent the algorithm from getting stuck in circles.
minibatches: Splitting of the training data into k mini-batches in each epoch. The gradient is computed for each mini-batch separately instead of the entire training data for faster learning.

Next, we train the MLP using 10,000 samples from the already shuffled MNIST dataset. Note that we only use 10,000 samples to keep the time for training reasonable (up to 5 minutes on standard desktop computer hardware). However, you are encouraged to use more training data for model fitting to increase the predictive accuracy:

>>> nn.fit(X[:10000], y[:10000], print_progress=True)
Epoch: 800/800

Similar to our earlier Adaline implementation, we save the cost for each epoch in a cost_ list, which we can now visualize, making sure that the optimization algorithm has reached convergence.
Here, we plot only every 50th step to account for the 50 mini-batches (50 minibatches × 800 epochs):

>>> import matplotlib.pyplot as plt
>>> plt.plot(range(len(nn.cost_)//50), nn.cost_[::50], color='red')
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs')
>>> plt.show()

As we can see, the optimization algorithm converged after approximately 700 epochs. Now let's evaluate the performance of the model by calculating the prediction accuracy:

>>> y_pred = nn.predict(X[:10000])
>>> acc = np.sum(y[:10000] == y_pred, axis=0) / 10000
>>> print('Training accuracy: %.2f%%' % (acc * 100))
Training accuracy: 97.60%

As you can see, the model gets most of the training data right. But how does it generalize to data that it hasn't seen before during training? Let's calculate the test accuracy on 5,000 images that were not included in the training set:

>>> y_pred = nn.predict(X[10000:15000])
>>> acc = np.sum(y[10000:15000] == y_pred, axis=0) / 5000
>>> print('Test accuracy: %.2f%%' % (acc * 100))
Test accuracy: 92.40%

Summary

Based on the discrepancy between the training and test accuracy, we can conclude that the model slightly overfits the training data. To decrease the degree of overfitting, we can change the number of hidden units or the values of the regularization parameters, or fit the model on more training data.

Resources for Article: Further resources on this subject: Asynchronous Programming with Python [article] The Essentials of Working with Python Collections [article] Python functions – Avoid repeating code [article]


Scraping the Data

Packt
21 Sep 2015
18 min read
In this article by Richard Lawson, author of the book Web Scraping with Python, we will first cover a browser extension called Firebug Lite to examine a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extract data from a web page using regular expressions, Beautiful Soup and lxml. Finally, the article will conclude with a comparison of these three scraping alternatives. (For more resources related to this topic, see here.) Analyzing a web page To understand how a web page is structured, we can try examining the source code. In most web browsers, the source code of a web page can be viewed by right-clicking on the page and selecting the View page source option: The data we are interested in is found in this part of the HTML: <table> <tr id="places_national_flag__row"><td class="w2p_fl"><label for="places_national_flag" id="places_national_flag__label">National Flag: </label></td><td class="w2p_fw"><img src="/places/static/images/flags/gb.png" /></td><td class="w2p_fc"></td></tr> … <tr id="places_neighbours__row"><td class="w2p_fl"><label for="places_neighbours" id="places_neighbours__label">Neighbours: </label></td><td class="w2p_fw"><div><a href="/iso/IE">IE </a></div></td><td class="w2p_fc"></td></tr></table> This lack of whitespace and formatting is not an issue for a web browser to interpret, but it is difficult for us. To help us interpret this table, we will use the Firebug Lite extension, which is available for all web browsers at https://getfirebug.com/firebuglite. Firefox users can install the full Firebug extension if preferred, but the features we will use here are included in the Lite version. Now, with Firebug Lite installed, we can right-click on the part of the web page we are interested in scraping and select Inspect with Firebug Lite from the context menu, as shown here: This will open a panel showing the surrounding HTML hierarchy of the selected element: In the preceding screenshot, the country attribute was clicked on and the Firebug panel makes it clear that the country area figure is included within a <td> element of class w2p_fw, which is the child of a <tr> element of ID places_area__row. We now have all the information needed to scrape the area data. Three approaches to scrape a web page Now that we understand the structure of this web page we will investigate three different approaches to scraping its data, firstly with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module. Regular expressions If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/2/howto/regex.html. 
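The examples in the rest of this section call a download(url) helper to fetch the page HTML; its full implementation is not reproduced here, but a minimal stand-in (assuming Python 2 and urllib2, in keeping with the rest of these examples) might look like the following:

import urllib2

def download(url, num_retries=2):
    """Download the HTML of a web page, retrying on 5xx server errors."""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            # temporary server error, so retry the download
            html = download(url, num_retries - 1)
    return html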
To scrape the area using regular expressions, we will first try matching the contents of the <td> element, as follows:

>>> import re
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = download(url)
>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
['<img src="/places/static/images/flags/gb.png" />',
 '244,820 square kilometres',
 '62,348,447',
 'GB',
 'United Kingdom',
 'London',
 '<a href="/continent/EU">EU</a>',
 '.uk',
 'GBP',
 'Pound',
 '44',
 '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',
 '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2} [A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2})|([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$',
 'en-GB,cy-GB,gd',
 '<div><a href="/iso/IE">IE </a></div>']

This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes. To isolate the area, we can select the second element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

This solution works but could easily fail if the web page is updated. Consider if the website is updated and the data we want is no longer in the second matching element. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data in the future, we want our solution to be as robust against layout changes as possible. To make this regular expression more robust, we can include the parent <tr> element, which has an ID, so it ought to be unique:

>>> re.findall('<tr id="places_area__row"><td class="w2p_fl"><label for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

This iteration is better; however, there are many other ways the web page could be updated in a way that still breaks the regular expression. For example, double quotation marks might be changed to single, extra space could be added between the <td> tags, or the area_label could be changed. Here is an improved version to try and support these various possibilities:

>>> re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]
'244,820 square kilometres'

This regular expression is more future-proof, but it is difficult to construct and becomes unreadable. Also, there are still other minor layout changes that would break it, such as if a title attribute was added to the <td> tag. From this example, it is clear that regular expressions provide a simple way to scrape data but are too brittle and will easily break when a web page is updated. Fortunately, there are better solutions.

Beautiful Soup

Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate the content. If you do not already have it installed, the latest version can be installed using this command:

pip install beautifulsoup4

The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Most web pages do not contain perfectly valid HTML, and Beautiful Soup needs to decide what is intended. For example, consider this simple web page of a list with missing attribute quotes and closing tags:

<ul class=country>
  <li>Area
  <li>Population
</ul>

If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping.
Let us see how Beautiful Soup handles this: >>> from bs4 import BeautifulSoup >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> # parse the HTML >>> soup = BeautifulSoup(broken_html, 'html.parser') >>> fixed_html = soup.prettify() >>> print fixed_html <html> <body> <ul class="country"> <li>Area</li> <li>Population</li> </ul> </body> </html> Here, BeautifulSoup was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. Now, we can navigate to the elements we want using the find() and find_all() methods: >>> ul = soup.find('ul', attrs={'class':'country'}) >>> ul.find('li') # returns just the first match <li>Area</li> >>> ul.find_all('li') # returns all matches [<li>Area</li>, <li>Population</li>] Beautiful Soup overview Here are the common methods and parameters you will use when scraping web pages with Beautiful Soup: BeautifulSoup(markup, builder): This method creates the soup object. The markup parameter can be a string or file object, and builder is the library that parses the markup parameter. find_all(name, attrs, text, **kwargs): This method returns a list of elements matching the given tag name, dictionary of attributes, and text. The contents of kwargs are used to match attributes. find(name, attrs, text, **kwargs): This method is the same as find_all(), except that it returns only the first match. If no element matches, it returns None. prettify(): This method returns the parsed HTML in an easy-to-read format with indentation and line breaks. For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. Now, using these techniques, here is a full example to extract the area from our example country: >>> from bs4 import BeautifulSoup >>> url = 'http://example.webscraping.com/places/view/ United-Kingdom-239' >>> html = download(url) >>> soup = BeautifulSoup(html) >>> # locate the area row >>> tr = soup.find(attrs={'id':'places_area__row'}) >>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag >>> area = td.text # extract the text from this tag >>> print area 244,820 square kilometres This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about problems in minor layout changes, such as extra whitespace or tag attributes. Lxml Lxml is a Python wrapper on top of the libxml2 XML parsing library written in C, which makes it faster than Beautiful Soup but also harder to install on some computers. The latest installation instructions are available at http://lxml.de/installation.html. As with Beautiful Soup, the first step is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML: >>> import lxml.html >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> tree = lxml.html.fromstring(broken_html) # parse the HTML >>> fixed_html = lxml.html.tostring(tree, pretty_print=True) >>> print fixed_html <ul class="country"> <li>Area</li> <li>Population</li> </ul> As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup. 
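For example, the area could also be extracted with an XPath query; the expression below is an assumption based on the table structure shown earlier rather than code taken from the book, but it illustrates the style:

>>> tree = lxml.html.fromstring(html)
>>> td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
>>> print td.text_content()
244,820 square kilometres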
Instead, we will use CSS selectors here and in future examples, because they are more compact. Also, some readers will already be familiar with them from their experience with jQuery selectors. Here is an example using the lxml CSS selectors to extract the area data: >>> tree = lxml.html.fromstring(html) >>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0] >>> area = td.text_content() >>> print area 244,820 square kilometres The key line with the CSS selector is highlighted. This line finds a table row element with the places_area__row ID, and then selects the child table data tag with the w2p_fw class. CSS selectors CSS selectors are patterns used for selecting elements. Here are some examples of common selectors you will need: Select any tag: * Select by tag <a>: a Select by class of "link": .link Select by tag <a> with class "link": a.link Select by tag <a> with ID "home": a#home Select by child <span> of tag <a>: a > span Select by descendant <span> of tag <a>: a span Select by tag <a> with attribute title of "Home": a[title=Home] The CSS3 specification was produced by the W3C and is available for viewing at http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. Lxml implements most of CSS3, and details on unsupported features are available at https://pythonhosted.org/cssselect/#supported-selectors. Note that, internally, lxml converts the CSS selectors into an equivalent XPath. Comparing performance To help evaluate the trade-offs of the three scraping approaches described in this article, it would help to compare their relative efficiency. Typically, a scraper would extract multiple fields from a web page. So, for a more realistic comparison, we will implement extended versions of each scraper that extract all the available data from a country's web page. To get started, we need to return to Firebug to check the format of the other country features, as shown here: Firebug shows that each table row has an ID starting with places_ and ending with __row. Then, the country data is contained within these rows in the same format as the earlier area example. 
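We can also confirm this naming pattern programmatically before writing the extended scrapers. The following sketch is an illustration rather than code from the book; it simply prints the ID of every table row so that the places_..._row convention can be verified against the page:

>>> tree = lxml.html.fromstring(html)
>>> for tr in tree.cssselect('table > tr'):
...     print tr.get('id')
...
places_national_flag__row
places_area__row
...
places_neighbours__row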
Here are implementations that use this information to extract all of the available country data: FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours') import re def re_scraper(html): results = {} for field in FIELDS: results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).groups()[0] return results from bs4 import BeautifulSoup def bs_scraper(html): soup = BeautifulSoup(html, 'html.parser') results = {} for field in FIELDS: results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text return results import lxml.html def lxml_scraper(html): tree = lxml.html.fromstring(html) results = {} for field in FIELDS: results[field] = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content() return results Scraping results Now that we have complete implementations for each scraper, we will test their relative performance with this snippet: import time NUM_ITERATIONS = 1000 # number of times to test each scraper html = download('http://example.webscraping.com/places/view/United-Kingdom-239') for name, scraper in [('Regular expressions', re_scraper), ('BeautifulSoup', bs_scraper), ('Lxml', lxml_scraper)]: # record start time of scrape start = time.time() for i in range(NUM_ITERATIONS): if scraper == re_scraper: re.purge() result = scraper(html) # check scraped result is as expected assert(result['area'] == '244,820 square kilometres') # record end time of scrape and output the total end = time.time() print '%s: %.2f seconds' % (name, end - start) This example will run each scraper 1000 times, check whether the scraped results are as expected, and then print the total time taken. Note the line calling re.purge(); by default, the regular expression module will cache searches and this cache needs to be cleared to make a fair comparison with the other scraping approaches. Here are the results from this script on my computer: $ python performance.py Regular expressions: 5.50 seconds BeautifulSoup: 42.84 seconds Lxml: 7.06 seconds The results on your computer will quite likely be different because of the different hardware used. However, the relative difference between the approaches should be similar. The results show that Beautiful Soup is over six times slower than the other two approaches when used to scrape our example web page. This result could be anticipated because lxml and the regular expression module were written in C, while BeautifulSoup is pure Python. Interestingly, lxml performed almost as well as regular expressions, even though it has the additional overhead of parsing the input into its internal format before searching for elements. When scraping many features from a web page, this initial parsing overhead is reduced and lxml becomes even more competitive. It really is an amazing module! Overview The following table summarizes the advantages and disadvantages of each approach to scraping:
Scraping approach: Performance / Ease of use / Ease to install
Regular expressions: Fast / Hard / Easy (built-in module)
Beautiful Soup: Slow / Easy / Easy (pure Python)
Lxml: Fast / Easy / Moderately difficult
If the bottleneck to your scraper is downloading web pages rather than extracting data, it would not be a problem to use a slower approach, such as Beautiful Soup.
Or, if you just need to scrape a small amount of data and want to avoid additional dependencies, regular expressions might be an appropriate choice. However, in general, lxml is the best choice for scraping, because it is fast and robust, while regular expressions and Beautiful Soup are only useful in certain niches. Adding a scrape callback to the link crawler Now that we know how to scrape the country data, we can integrate this into the link crawler. To allow reusing the same crawling code to scrape multiple websites, we will add a callback parameter to handle the scraping. A callback is a function that will be called after certain events (in this case, after a web page has been downloaded). This scrape callback will take a url and html as parameters and optionally return a list of further URLs to crawl. Here is the implementation, which is simple in Python: def link_crawler(..., scrape_callback=None): … links = [] if scrape_callback: links.extend(scrape_callback(url, html) or []) … The new code handling the scrape callback is shown in the preceding snippet. Now, this crawler can be used to scrape multiple websites by customizing the function passed to scrape_callback. Here is a modified version of the lxml example scraper that can be used for the callback function: def scrape_callback(url, html): if re.search('/view/', url): tree = lxml.html.fromstring(html) row = [tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content() for field in FIELDS] print url, row This callback function would scrape the country data and print it out. Usually, when scraping a website, we want to reuse the data, so we will extend this example to save results to a CSV spreadsheet, as follows: import csv class ScrapeCallback: def __init__(self): self.writer = csv.writer(open('countries.csv', 'w')) self.fields = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours') self.writer.writerow(self.fields) def __call__(self, url, html): if re.search('/view/', url): tree = lxml.html.fromstring(html) row = [] for field in self.fields: row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()) self.writer.writerow(row) To build this callback, a class was used instead of a function so that the state of the csv writer could be maintained. This csv writer is instantiated in the constructor, and then written to multiple times in the __call__ method. Note that __call__ is a special method that is invoked when an object is "called" as a function, which is how the scrape_callback is used in the link crawler. This means that scrape_callback(url, html) is equivalent to calling scrape_callback.__call__(url, html). For further details on Python's special class methods, refer to https://docs.python.org/2/reference/datamodel.html#special-method-names. This code shows how to pass this callback to the link crawler: link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=ScrapeCallback()) Now, when the crawler is run with this callback, it will save results to a CSV file that can be viewed in an application such as Excel or LibreOffice: Success! We have completed our first working scraper. Summary In this article, we walked through a variety of ways to scrape data from a web page.
Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, and we will use it in future examples. Resources for Article: Further resources on this subject: Scientific Computing APIs for Python [article] Bizarre Python [article] Optimization in Python [article]

OpenCV: Detecting Edges, Lines, and Shapes

Oli Huggins
17 Sep 2015
19 min read
Edges play a major role in both human and computer vision. We, as humans, can easily recognize many object types and their positions just by seeing a backlit silhouette or a rough sketch. Indeed, when art emphasizes edges and pose, it often seems to convey the idea of an archetype, such as Rodin's The Thinker or Joe Shuster's Superman. Software, too, can reason about edges, poses, and archetypes. This OpenCV tutorial has been taken from Learning OpenCV 3 Computer Vision with Python. OpenCV provides many edge-finding filters, including Laplacian(), Sobel(), and Scharr(). These filters are supposed to turn non-edge regions to black, while turning edge regions to white or saturated colors. However, they are prone to misidentifying noise as edges. This flaw can be mitigated by blurring an image before trying to find its edges. OpenCV also provides many blurring filters, including blur() (simple average), medianBlur(), and GaussianBlur(). The arguments for the edge-finding and blurring filters vary, but always include ksize, an odd whole number that represents the width and height (in pixels) of the filter's kernel. For the purpose of blurring, let's use medianBlur(), which is effective in removing digital video noise, especially in color images. For the purpose of edge-finding, let's use Laplacian(), which produces bold edge lines, especially in grayscale images. After applying medianBlur(), but before applying Laplacian(), we should convert the BGR to grayscale. Once we have the result of Laplacian(), we can invert it to get black edges on a white background. Then, we can normalize (so that its values range from 0 to 1) and multiply it with the source image to darken the edges. Let's implement this approach in filters.py: def strokeEdges(src, dst, blurKsize = 7, edgeKsize = 5): if blurKsize >= 3: blurredSrc = cv2.medianBlur(src, blurKsize) graySrc = cv2.cvtColor(blurredSrc, cv2.COLOR_BGR2GRAY) else: graySrc = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY) cv2.Laplacian(graySrc, cv2.CV_8U, graySrc, ksize = edgeKsize) normalizedInverseAlpha = (1.0 / 255) * (255 - graySrc) channels = cv2.split(src) for channel in channels: channel[:] = channel * normalizedInverseAlpha cv2.merge(channels, dst) Note that we allow kernel sizes to be specified as arguments to strokeEdges(). The blurKsize argument is used as ksize for medianBlur(), while edgeKsize is used as ksize for Laplacian(). With my webcams, I find that a blurKsize value of 7 and an edgeKsize value of 5 look best. Unfortunately, medianBlur() is expensive with a large ksize, such as 7. If you encounter performance problems when running strokeEdges(), try decreasing the blurKsize value. To turn off the blur option, set it to a value less than 3. Custom kernels – getting convoluted As we have just seen, many of OpenCV's predefined filters use a kernel. Remember that a kernel is a set of weights that determine how each output pixel is calculated from a neighborhood of input pixels. Another term for a kernel is a convolution matrix. It mixes up or convolves the pixels in a region. Similarly, a kernel-based filter may be called a convolution filter. OpenCV provides a very versatile function, filter2D(), which applies any kernel or convolution matrix that we specify. To understand how to use this function, let's first learn the format of a convolution matrix. This is a 2D array with an odd number of rows and columns.
The central element corresponds to a pixel of interest and the other elements correspond to this pixel's neighbors. Each element contains an integer or floating point value, which is a weight that gets applied to an input pixel's value. Consider this example: kernel = numpy.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]]) Here, the pixel of interest has a weight of 9 and its immediate neighbors each have a weight of -1. For the pixel of interest, the output color will be nine times its input color, minus the input colors of all eight adjacent pixels. If the pixel of interest is already a bit different from its neighbors, this difference becomes intensified. The effect is that the image looks sharper as the contrast between neighbors is increased. Continuing our example, we can apply this convolution matrix to a source and destination image, respectively, as follows: cv2.filter2D(src, -1, kernel, dst) The second argument specifies the per-channel depth of the destination image (such as cv2.CV_8U for 8 bits per channel). A negative value (as used here) means that the destination image has the same depth as the source image. For color images, note that filter2D() applies the kernel equally to each channel. To use different kernels on different channels, we would also have to use the split() and merge() functions. Based on this simple example, let's add two classes to filters.py. One class, VConvolutionFilter, will represent a convolution filter in general. A subclass, SharpenFilter, will specifically represent our sharpening filter. Let's edit filters.py to implement these two new classes as follows: class VConvolutionFilter(object): """A filter that applies a convolution to V (or all of BGR).""" def __init__(self, kernel): self._kernel = kernel def apply(self, src, dst): """Apply the filter with a BGR or gray source/destination.""" cv2.filter2D(src, -1, self._kernel, dst) class SharpenFilter(VConvolutionFilter): """A sharpen filter with a 1-pixel radius.""" def __init__(self): kernel = numpy.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]]) VConvolutionFilter.__init__(self, kernel) Note that the weights sum up to 1. This should be the case whenever we want to leave the image's overall brightness unchanged. If we modify a sharpening kernel slightly so that its weights sum up to 0 instead, then we have an edge detection kernel that turns edges white and non-edges black. For example, let's add the following edge detection filter to filters.py: class FindEdgesFilter(VConvolutionFilter): """An edge-finding filter with a 1-pixel radius.""" def __init__(self): kernel = numpy.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]) VConvolutionFilter.__init__(self, kernel) Next, let's make a blur filter. Generally, for a blur effect, the weights should sum up to 1 and should be positive throughout the neighborhood. For example, we can take a simple average of the neighborhood as follows: class BlurFilter(VConvolutionFilter): """A blur filter with a 2-pixel radius.""" def __init__(self): kernel = numpy.array([[0.04, 0.04, 0.04, 0.04, 0.04], [0.04, 0.04, 0.04, 0.04, 0.04], [0.04, 0.04, 0.04, 0.04, 0.04], [0.04, 0.04, 0.04, 0.04, 0.04], [0.04, 0.04, 0.04, 0.04, 0.04]]) VConvolutionFilter.__init__(self, kernel) Our sharpening, edge detection, and blur filters use kernels that are highly symmetric. Sometimes, though, kernels with less symmetry produce an interesting effect.
Let's consider a kernel that blurs on one side (with positive weights) and sharpens on the other (with negative weights). It will produce a ridged or embossed effect. Here is an implementation that we can add to filters.py: class EmbossFilter(VConvolutionFilter): """An emboss filter with a 1-pixel radius.""" def __init__(self): kernel = numpy.array([[-2, -1, 0], [-1, 1, 1], [ 0, 1, 2]]) VConvolutionFilter.__init__(self, kernel) This set of custom convolution filters is very basic. Indeed, it is more basic than OpenCV's ready-made set of filters. However, with a bit of experimentation, you will be able to write your own kernels that produce a unique look. Modifying an application Now that we have high-level functions and classes for several filters, it is trivial to apply any of them to the captured frames in Cameo. Let's edit cameo.py and add the lines that appear in bold face in the following excerpt: import cv2 import filters from managers import WindowManager, CaptureManager class Cameo(object): def __init__(self): self._windowManager = WindowManager('Cameo', self.onKeypress) self._captureManager = CaptureManager( cv2.VideoCapture(0), self._windowManager, True) self._curveFilter = filters.BGRPortraCurveFilter() def run(self): """Run the main loop.""" self._windowManager.createWindow() while self._windowManager.isWindowCreated: self._captureManager.enterFrame() frame = self._captureManager.frame filters.strokeEdges(frame, frame) self._curveFilter.apply(frame, frame) self._captureManager.exitFrame() self._windowManager.processEvents() Here, I have chosen to apply two effects: stroking the edges and emulating Portra film colors. Feel free to modify the code to apply any filters you like. Here is a screenshot from Cameo, with stroked edges and Portra-like colors: Edge detection with Canny OpenCV also offers a very handy function, called Canny, (after the algorithm's inventor, John F. Canny) which is very popular not only because of its effectiveness, but also the simplicity of its implementation in an OpenCV program as it is a one-liner: import cv2 import numpy as np img = cv2.imread("../images/statue_small.jpg", 0) cv2.imwrite("canny.jpg", cv2.Canny(img, 200, 300)) cv2.imshow("canny", cv2.imread("canny.jpg")) cv2.waitKey() cv2.destroyAllWindows() The result is a very clear identification of the edges: The Canny edge detection algorithm is quite complex but also interesting: it's a five-step process that denoises the image with a Gaussian filter, calculates gradients, applies nonmaximum suppression (NMS) on edges and a double threshold on all the detected edges to eliminate false positives, and, lastly, analyzes all the edges and their connection to each other to keep the real edges and discard weaker ones. Contours detection Another vital task in computer vision is contour detection, not only because of the obvious aspect of detecting contours of subjects contained in an image or video frame, but because of the derivative operations connected with identifying contours. These operations are, namely computing bounding polygons, approximating shapes, and, generally, calculating regions of interest, which considerably simplifies the interaction with image data. This is because a rectangular region with numpy is easily defined with an array slice. We will be using this technique a lot when exploring the concept of object detection (including faces) and object tracking. 
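As a quick aside on that last point, a region of interest really is just a NumPy slice of the image array; the file name and coordinates in this small sketch are arbitrary assumptions used only for illustration:

import cv2

img = cv2.imread('statue_small.jpg')
roi = img[100:200, 150:300]   # rows 100-199 and columns 150-299 of the frame
cv2.imwrite('roi.jpg', roi)   # the slice is itself a regular image and can be saved or processed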
Let's go in order and familiarize ourselves with the API first with an example: import cv2 import numpy as np img = np.zeros((200, 200), dtype=np.uint8) img[50:150, 50:150] = 255 ret, thresh = cv2.threshold(img, 127, 255, 0) image, contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) color = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR) img = cv2.drawContours(color, contours, -1, (0,255,0), 2) cv2.imshow("contours", color) cv2.waitKey() cv2.destroyAllWindows() Firstly, we create an empty black image that is 200x200 pixels size. Then, we place a white square in the center of it, utilizing ndarray's ability to assign values for a slice. We then threshold the image, and call the findContours() function. This function takes three parameters: the input image, hierarchy type, and the contour approximation method. There are a number of aspects of particular interest about this function: The function modifies the input image, so it would be advisable to use a copy of the original image (for example, by passing img.copy()). Secondly, the hierarchy tree returned by the function is quite important: cv2.RETR_TREE will retrieve the entire hierarchy of contours in the image, enabling you to establish "relationships" between contours. If you only want to retrieve the most external contours, use cv2.RETR_EXTERNAL. This is particularly useful when you want to eliminate contours that are entirely contained in other contours (for example, in a vast majority of cases, you won't need to detect an object within another object of the same type). The findContours function returns three elements: the modified image, contours, and their hierarchy. We use the contours to draw on the color version of the image (so we can draw contours in green) and eventually display it. The result is a white square, with its contour drawn in green. Spartan, but effective in demonstrating the concept! Let's move on to more meaningful examples. Contours – bounding box, minimum area rectangle and minimum enclosing circle Finding the contours of a square is a simple task; irregular, skewed, and rotated shapes bring the best out of the cv2.findContours utility function of OpenCV. Let's take a look at the following image: In a real-life application, we would be most interested in determining the bounding box of the subject, its minimum enclosing rectangle, and circle. 
The cv2.findContours function in conjunction with another few OpenCV utilities makes this very easy to accomplish: import cv2 import numpy as np img = cv2.pyrDown(cv2.imread("hammer.jpg", cv2.IMREAD_UNCHANGED)) ret, thresh = cv2.threshold(cv2.cvtColor(img.copy(), cv2.COLOR_BGR2GRAY) , 127, 255, cv2.THRESH_BINARY) image, contours, hier = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) for c in contours: # find bounding box coordinates x,y,w,h = cv2.boundingRect(c) cv2.rectangle(img, (x,y), (x+w, y+h), (0, 255, 0), 2) # find minimum area rect = cv2.minAreaRect(c) # calculate coordinates of the minimum area rectangle box = cv2.boxPoints(rect) # normalize coordinates to integers box = np.int0(box) # draw contours cv2.drawContours(img, [box], 0, (0,0, 255), 3) # calculate center and radius of minimum enclosing circle (x,y),radius = cv2.minEnclosingCircle(c) # cast to integers center = (int(x),int(y)) radius = int(radius) # draw the circle img = cv2.circle(img,center,radius,(0,255,0),2) cv2.drawContours(img, contours, -1, (255, 0, 0), 1) cv2.imshow("contours", img) After the initial imports, we load the image, and then apply a binary threshold on a grayscale version of the original image. By doing this, we operate all find-contours calculations on a grayscale copy, but we draw on the original so that we can utilize color information. Firstly, let's calculate a simple bounding box: x,y,w,h = cv2.boundingRect(c) This is a pretty straightforward conversion of contour information to x and y coordinates, plus the height and width of the rectangle. Drawing this rectangle is an easy task: cv2.rectangle(img, (x,y), (x+w, y+h), (0, 255, 0), 2) Secondly, let's calculate the minimum area enclosing the subject: rect = cv2.minAreaRect(c) box = cv2.boxPoints(rect) box = np.int0(box) The mechanism here is particularly interesting: OpenCV does not have a function to calculate the coordinates of the minimum rectangle vertexes directly from the contour information. Instead, we calculate the minimum rectangle area, and then calculate the vertexes of this rectangle. Note that the calculated vertexes are floats, but pixels are accessed with integers (you can't access a "portion" of a pixel), so we'll need to operate this conversion. Next, we draw the box, which gives us the perfect opportunity to introduce the cv2.drawContours function: cv2.drawContours(img, [box], 0, (0,0, 255), 3) Firstly, this function—like all drawing functions—modifies the original image. Secondly, it takes an array of contours in its second parameter so that you can draw a number of contours in a single operation. So, if you have a single set of points representing a contour polygon, you need to wrap this into an array, exactly like we did with our box in the preceding example. The third parameter of this function specifies the index of the contour array that we want to draw: a value of -1 will draw all contours; otherwise, a contour at the specified index in the contour array (the second parameter) will be drawn. Most drawing functions take the color of the drawing and its thickness as the last two parameters. 
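To make the index, color, and thickness parameters concrete, here is a small sketch (not taken from the book) that draws only the largest contour thickly in red and then overlays every contour as a thin blue outline:

# assumes img and contours from the preceding findContours example
largest = max(range(len(contours)), key=lambda i: cv2.contourArea(contours[i]))
cv2.drawContours(img, contours, largest, (0, 0, 255), 3)  # one contour, selected by index
cv2.drawContours(img, contours, -1, (255, 0, 0), 1)       # -1 draws all contours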
The last bounding contour we're going to examine is the minimum enclosing circle: (x,y),radius = cv2.minEnclosingCircle(c) center = (int(x),int(y)) radius = int(radius) img = cv2.circle(img,center,radius,(0,255,0),2) The only peculiarity of the cv2.minEnclosingCircle function is that it returns a two-element tuple, of which, the first element is a tuple itself, representing the coordinates of a circle's center, and the second element is the radius of this circle. After converting all these values to integers, drawing the circle is quite a trivial operation. The final result on the original image looks like this: Contours – convex contours and the Douglas-Peucker algorithm Most of the time, when working with contours, subjects will have the most diverse shapes, including convex ones. A shape is convex when there are no two points within it whose connecting line goes outside the perimeter of the shape itself. The first facility OpenCV offers to calculate the approximate bounding polygon of a shape is cv2.approxPolyDP. This function takes three parameters: A contour. An "epsilon" value representing the maximum discrepancy between the original contour and the approximated polygon (the lower the value, the closer the approximated value will be to the original contour). A boolean flag signifying that the polygon is closed. The epsilon value is of vital importance to obtain a useful contour, so let's understand what it represents. Epsilon is the maximum difference between the approximated polygon's perimeter and the perimeter of the original contour. The lower this difference is, the more the approximated polygon will be similar to the original contour. You may ask yourself why we need an approximate polygon when we have a contour that is already a precise representation. The answer is that a polygon is a set of straight lines, and the importance of being able to define polygons in a region for further manipulation and processing is paramount in many computer vision tasks. Now that we know what an epsilon is, we need to obtain contour perimeter information as a reference value; this is obtained with the cv2.arcLength function of OpenCV: epsilon = 0.01 * cv2.arcLength(cnt, True) approx = cv2.approxPolyDP(cnt, epsilon, True) Effectively, we're instructing OpenCV to calculate an approximated polygon whose perimeter can only differ from the original contour in an epsilon ratio. OpenCV also offers a cv2.convexHull function to obtain processed contour information for convex shapes, and this is a straightforward one-line expression: hull = cv2.convexHull(cnt) Let's combine the original contour, approximated polygon contour, and the convex hull in one image to observe the difference. To simplify things, I've applied the contours to a black image so that the original subject is not visible, but its contours are: As you can see, the convex hull surrounds the entire subject, the approximated polygon is the innermost polygon shape, and in between the two is the original contour, mainly composed of arcs. Detecting lines and circles Edge and contour detection are not only common and important tasks; they also constitute the basis for other, more complex, operations. Lines and shape detection walk hand in hand with edge and contour detection, so let's examine how OpenCV implements these.
The theory behind line and shape detection has its foundations in a technique called Hough transform, invented by Richard Duda and Peter Hart, extending (generalizing) the work done by Paul Hough in the early 1960s. Let's take a look at OpenCV's API for Hough transforms. Line detection First of all, let's detect some lines, which is done with the HoughLines and HoughLinesP functions. The only difference between the two functions is that one uses the standard Hough transform, and the second uses the probabilistic Hough transform (hence the P in the name). The probabilistic version is called as such because it only analyzes lines as subset of points and estimates the probability of these points to all belong to the same line. This implementation is an optimized version of the standard Hough transform, in that, it's less computationally intensive and executes faster. Let's take a look at a very simple example: import cv2 import numpy as np img = cv2.imread('lines.jpg') gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY) edges = cv2.Canny(gray,50,120) minLineLength = 20 maxLineGap = 5 lines = cv2.HoughLinesP(edges,1,np.pi/180,100,minLineLength,maxLineGap) for x1,y1,x2,y2 in lines[0]: cv2.line(img,(x1,y1),(x2,y2),(0,255,0),2) cv2.imshow("edges", edges) cv2.imshow("lines", img) cv2.waitKey() cv2.destroyAllWindows() The crucial point of this simple script—aside from the HoughLines function call—is the setting of the minimum line length (shorter lines will be discarded) and maximum line gap, which is the maximum size of a gap in a line before the two segments start being considered as separate lines. Also, note that the HoughLines function takes a single channel binary image, processed through the Canny edge detection filter. Canny is not a strict requirement, but an image that's been denoised and only represents edges is the ideal source for a Hough transform, so you will find this to be a common practice. The parameters of HoughLinesP are the image, MinLineLength and MaxLineGap, which we mentioned previously, rho and theta which refers to the geometrical representations of the lines, which are usually 1 and np.pi/180, threshold which represents the threshold below which a line is discarded. The Hough transform works with a system of bins and votes, with each bin representing a line, so any line with a minimum of <threshold> votes is retained, and the rest are discarded. Circle detection OpenCV also has a function used to detect circles, called HoughCircles. It works in a very similar fashion to HoughLines, but where minLineLength and maxLineGap were the parameters to discard or retain lines, HoughCircles has a minimum distance between the circles' centers and the minimum and maximum radius of the circles. 
Here's the obligatory example: import cv2 import numpy as np planets = cv2.imread('planet_glow.jpg') gray_img = cv2.cvtColor(planets, cv2.COLOR_BGR2GRAY) img = cv2.medianBlur(gray_img, 5) cimg = cv2.cvtColor(img,cv2.COLOR_GRAY2BGR) circles = cv2.HoughCircles(img,cv2.HOUGH_GRADIENT,1,120, param1=100,param2=30,minRadius=0,maxRadius=0) circles = np.uint16(np.around(circles)) for i in circles[0,:]: # draw the outer circle cv2.circle(planets,(i[0],i[1]),i[2],(0,255,0),2) # draw the center of the circle cv2.circle(planets,(i[0],i[1]),2,(0,0,255),3) cv2.imwrite("planets_circles.jpg", planets) cv2.imshow("HoughCirlces", planets) cv2.waitKey() cv2.destroyAllWindows() Here's a visual representation of the result: Detecting shapes The detection of shapes using the Hough transform is limited to circles; however, we've already implicitly explored the detection of shapes of any kind, specifically, when we talked about approxPolyDP. This function allows the approximation of polygons, so if your image contains polygons, they will be quite accurately detected combining the usage of cv2.findContours and cv2.approxPolyDP. Summary At this point, you should have gained a good understanding of color spaces, the Fourier transform, and several kinds of filters made available by OpenCV to process images. You should also be proficient in detecting edges, lines, circles and shapes in general, additionally you should be able to find contours and exploit the information they provide about the subjects contained in an image. These concepts will serve as the ideal background to explore the topics in the next chapter, Image Segmentation and Depth Estimation. Further resources on this subject: OpenCV: Basic Image Processing OpenCV: Camera Calibration OpenCV: Tracking Faces with Haar Cascades

Recommender Systems

Packt
16 Sep 2015
6 min read
In this article by Suresh K Gorakala and Michele Usuelli, authors of the book Building a Recommendation System with R, we will learn how to prepare relevant data by covering the following topics: Selecting the most relevant data Exploring the most relevant data Normalizing the data Binarizing the data (For more resources related to this topic, see here.) Data preparation Here, we show how to prepare the data to be used in recommender models. These are the steps: Select the relevant data. Normalize the data. Selecting the most relevant data On exploring the data, you will notice that the table contains: Movies that have been viewed only a few times; their rating might be biased because of lack of data Users that rated only a few movies; their rating might be biased We need to determine the minimum number of users per movie and vice versa. The correct solution comes from an iteration of the entire process of preparing the data, building a recommendation model, and validating it. Since we are implementing the model for the first time, we can use a rule of thumb. After having built the models, we can come back and modify the data preparation. We define ratings_movies containing the matrix that we will use. It takes the following into account: Users who have rated at least 50 movies Movies that have been watched at least 100 times The following code shows this: ratings_movies <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100] ratings_movies ## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings. ratings_movies contains about half the number of users and a fifth of the number of movies that MovieLense has. Exploring the most relevant data Let's visualize the top 2 percent of users and movies of the new matrix: # visualize the top matrix min_movies <- quantile(rowCounts(ratings_movies), 0.98) min_users <- quantile(colCounts(ratings_movies), 0.98) Let's build the heat-map: image(ratings_movies[rowCounts(ratings_movies) > min_movies, colCounts(ratings_movies) > min_users], main = "Heatmap of the top users and movies") As you have already noticed, some rows are darker than the others. This might mean that some users give higher ratings to all the movies. However, we have visualized the top movies only. In order to have an overview of all the users, let's take a look at the distribution of the average rating by users: average_ratings_per_user <- rowMeans(ratings_movies) Let's visualize the distribution: qplot(average_ratings_per_user) + stat_bin(binwidth = 0.1) + ggtitle("Distribution of the average rating per user") As suspected, the average rating varies a lot across different users. Normalizing the data Users that give high (or low) ratings to all their movies might bias the results. We can remove this effect by normalizing the data in such a way that the average rating of each user is 0. The prebuilt normalize function does it automatically: ratings_movies_norm <- normalize(ratings_movies) Let's take a look at the average rating by user. sum(rowMeans(ratings_movies_norm) > 0.00001) ## [1] 0 As expected, the mean rating of each user is 0 (apart from the approximation error). We can visualize the new matrix using an image. Let's build the heat-map: # visualize the normalised matrix image(ratings_movies_norm[rowCounts(ratings_movies_norm) > min_movies, colCounts(ratings_movies_norm) > min_users], main = "Heatmap of the top users and movies") The first difference that we can notice is the colors, and it's because the data is continuous.
Previously, the rating was an integer number between 1 and 5. After normalization, the rating can be any number between -5 and 5. There are still some lines that are more blue and some that are more red. The reason is that we are visualizing only the top movies. We already checked that the average rating is 0 for each user. Binarizing the data A few recommendation models work on binary data, so we might want to binarize our data, that is, define a table containing only 0s and 1s. The 0s will be treated as either missing values or bad ratings. In our case, we can do either of the following: Define a matrix that has 1 if the user rated the movie and 0 otherwise. In this case, we are losing the information about the rating. Define a matrix that has 1 if the rating is more than or equal to a definite threshold (for example 3) and 0 otherwise. In this case, giving a bad rating to a movie is equivalent to not rating it. Depending on the context, one choice is more appropriate than the other. The function to binarize the data is binarize. Let's apply it to our data. First, let's define a matrix equal to 1 if the movie has been watched, that is, if its rating is at least 1. ratings_movies_watched <- binarize(ratings_movies, minRating = 1) Let's take a look at the results. In this case, we will have black-and-white charts, so we can visualize a bigger portion of users and movies, for example, 5 percent. Similar to what we did earlier, let's select the 5 percent using quantile. The row and column counts are the same as the original matrix, so we can still apply rowCounts and colCounts on ratings_movies: min_movies_binary <- quantile(rowCounts(ratings_movies), 0.95) min_users_binary <- quantile(colCounts(ratings_movies), 0.95) Let's build the heat-map: image(ratings_movies_watched[rowCounts(ratings_movies) > min_movies_binary, colCounts(ratings_movies) > min_users_binary], main = "Heatmap of the top users and movies") Only a few cells contain non-watched movies. This is just because we selected the top users and movies. Let's use the same approach to compute and visualize the other binary matrix. Now, each cell is one if the rating is above a threshold, for example 3, and 0 otherwise. ratings_movies_good <- binarize(ratings_movies, minRating = 3) Let's build the heat-map: image(ratings_movies_good[rowCounts(ratings_movies) > min_movies_binary, colCounts(ratings_movies) > min_users_binary], main = "Heatmap of the top users and movies") As expected, we have more white cells now. Depending on the model, we can leave the ratings matrix as it is or normalize/binarize it. Summary In this article, you learned about data preparation and how you should select, explore, normalize, and binarize the data. Resources for Article: Further resources on this subject: Structural Equation Modeling and Confirmatory Factor Analysis [article] Warming Up [article] https://www.packtpub.com/books/content/supervised-learning [article]

DynamoDB Best Practices

Packt
15 Sep 2015
24 min read
 In this article by Tanmay Deshpande, the author of the book DynamoDB Cookbook, we will cover the following topics: Using a standalone cache for frequently accessed items Using the AWS ElastiCache for frequently accessed items Compressing large data before storing it in DynamoDB Using AWS S3 for storing large items Catching DynamoDB errors Performing auto-retries on DynamoDB errors Performing atomic transactions on DynamoDB tables Performing asynchronous requests to DynamoDB (For more resources related to this topic, see here.) Introduction We are going to talk about DynamoDB implementation best practices, which will help you improve the performance while reducing the operation cost. So let's get started. Using a standalone cache for frequently accessed items In this recipe, we will see how to use a standalone cache for frequently accessed items. Cache is a temporary data store, which will save the items in memory and will provide those from the memory itself instead of making a DynamoDB call. Make a note that this should be used for items, which you expect to not be changed frequently. Getting ready We will perform this recipe using Java libraries. So the prerequisite is that you should have performed recipes, which use the AWS SDK for Java. How to do it… Here, we will be using the AWS SDK for Java, so create a Maven project with the SDK dependency. Apart from the SDK, we will also be using one of the most widely used open source caches, that is, EhCache. To know about EhCache, refer to http://ehcache.org/. Let's use a standalone cache for frequently accessed items: To use EhCache, we need to include the following repository in pom.xml: <repositories> <repository> <id>sourceforge</id> <name>sourceforge</name> <url>https://oss.sonatype.org/content/repositories/ sourceforge-releases/</url> </repository> </repositories> We will also need to add the following dependency: <dependency> <groupId>net.sf.ehcache</groupId> <artifactId>ehcache</artifactId> <version>2.9.0</version> </dependency> Once the project setup is done, we will create a cachemanager class, which will be used in the following code: public class ProductCacheManager { // Ehcache cache manager CacheManager cacheManager = CacheManager.getInstance(); private Cache productCache; public Cache getProductCache() { return productCache; } //Create an instance of cache using cache manager public ProductCacheManager() { cacheManager.addCache("productCache"); this.productCache = cacheManager.getCache("productCache"); } public void shutdown() { cacheManager.shutdown(); } } Now, we will create another class where we will write a code to get the item from DynamoDB. Here, we will first initiate the ProductCacheManager: static ProductCacheManager cacheManager = new ProductCacheManager(); Next, we will write a method to get the item from DynamoDB. Before we fetch the data from DynamoDB, we will first check whether the item with the given key is available in cache. If it is available in cache, we will return it from cache itself. If the item is not found in cache, we will first fetch it from DynamoDB and immediately put it into cache. 
Once the item is cached, every time we need this item, we will get it from cache, unless the cached item is evicted: private static Item getItem(int id, String type) { Item product = null; if (cacheManager.getProductCache().isKeyInCache(id + ":" + type)) { Element prod = cacheManager.getProductCache().get(id + ":" + type); product = (Item) prod.getObjectValue(); System.out.println("Returning from Cache"); } else { AmazonDynamoDBClient client = new AmazonDynamoDBClient( new ProfileCredentialsProvider()); client.setRegion(Region.getRegion(Regions.US_EAST_1)); DynamoDB dynamoDB = new DynamoDB(client); Table table = dynamoDB.getTable("product"); product = table.getItem(new PrimaryKey("id", id, "type", type)); cacheManager.getProductCache().put( new Element(id + ":" + type, product)); System.out.println("Making DynamoDB Call for getting the item"); } return product; } Now we can use this method whenever needed. Here is how we can test it: Item product = getItem(10, "book"); System.out.println("First call :Item: " + product); Item product1 = getItem(10, "book"); System.out.println("Second call :Item: " + product1); cacheManager.shutdown(); How it works… EhCache is one of the most popular standalone caches used in the industry. Here, we are using EhCache to store frequently accessed items from the product table. Cache keeps all its data in memory. Here, we will save every item against its keys that are cached. We have the product table, which has the composite hash and range keys, so we will also store the items against the key of (Hash Key and Range Key). Note that caching should be used for only those tables that expect lesser updates. It should only be used for the table, which holds static data. If at all anyone uses cache for not so static tables, then you will get stale data. You can also go to the next level and implement a time-based cache, which holds the data for a certain time, and after that, it clears the cache. We can also implement algorithms, such as Least Recently Used (LRU), First In First Out (FIFO), to make the cache more efficient. Here, we will make comparatively lesser calls to DynamoDB, and ultimately, save some cost for ourselves. Using AWS ElastiCache for frequently accessed items In this recipe, we will do the same thing that we did in the previous recipe. The only thing we will change is that we will use a cloud hosted distributed caching solution instead of saving it on the local standalone cache. ElastiCache is a hosted caching solution provided by Amazon Web Services. We have two options to select which caching technology you would need. One option is Memcached and another option is Redis. Depending upon your requirements, you can decide which one to use. Here are links that will help you with more information on the two options: http://memcached.org/ http://redis.io/ Getting ready To get started with this recipe, we will need to have an ElastiCache cluster launched. If you are not aware of how to do it, you can refer to http://aws.amazon.com/elasticache/. How to do it… Here, I am using the Memcached cluster. You can choose the size of the instance as you wish. We will need a Memcached client to access the cluster. Amazon has provided a compiled version of the Memcached client, which can be downloaded from https://github.com/amazonwebservices/aws-elasticache-cluster-client-memcached-for-java. 
Once the JAR download is complete, you can add it to your Java Project class path: To start with, we will need to get the configuration endpoint of the Memcached cluster that we launched. This configuration endpoint can be found on the AWS ElastiCache console itself. Here is how we can save the configuration endpoint and port: static String configEndpoint = "my-elastic- cache.mlvymb.cfg.usw2.cache.amazonaws.com"; static Integer clusterPort = 11211; Similarly, we can instantiate the Memcached client: static MemcachedClient client; static { try { client = new MemcachedClient(new InetSocketAddress(configEndpoint, clusterPort)); } catch (IOException e) { e.printStackTrace(); } } Now, we can write the getItem method as we did for the previous recipe. Here, we will first check whether the item is present in cache; if not, we will fetch it from DynamoDB, and put it into cache. If the same request comes the next time, we will return it from the cache itself. While putting the item into cache, we are also going to put the expiry time of the item. We are going to set it to 3,600 seconds; that is, after 1 hour, the key entry will be deleted automatically: private static Item getItem(int id, String type) { Item product = null; if (null != client.get(id + ":" + type)) { System.out.println("Returning from Cache"); return (Item) client.get(id + ":" + type); } else { AmazonDynamoDBClient client = new AmazonDynamoDBClient( new ProfileCredentialsProvider()); client.setRegion(Region.getRegion(Regions.US_EAST_1)); DynamoDB dynamoDB = new DynamoDB(client); Table table = dynamoDB.getTable("product"); product = table.getItem(new PrimaryKey("id", id, "type", type)); System.out.println("Making DynamoDB Call for getting the item"); ElasticCache.client.add(id + ":" + type, 3600, product); } return product; } How it works… A distributed cache also works in the same fashion as the local one works. A standalone cache keeps the data in memory and returns it if it finds the key. In distributed cache, we have multiple nodes; here, keys are kept in a distributed manner. The distributed nature helps you divide the keys based on the hash value of the keys. So, when any request comes, it is redirected to a specified node and the value is returned from there. Note that ElastiCache will help you provide a faster retrieval of items at the additional cost of the ElastiCache cluster. Also note that the preceding code will work if you execute the application from the EC2 instance only. If you try to execute this on the local machine, you will get connection errors. Compressing large data before storing it in DynamoDB We are all aware of DynamoDB's storage limitations for the item's size. Suppose that we get into a situation where storing large attributes in an item is a must. In that case, it's always a good choice to compress these attributes, and then save them in DynamoDB. In this recipe, we are going to see how to compress large items before storing them. Getting ready To get started with this recipe, you should have your workstation ready with Eclipse or any other IDE of your choice. How to do it… There are numerous algorithms with which we can compress the large items, for example, GZIP, LZO, BZ2, and so on. Each algorithm has a trade-off between the compression time and rate. So, it's your choice whether to go with a faster algorithm or with an algorithm, which provides a higher compression rate. Consider a scenario in our e-commerce website, where we need to save the product reviews written by various users. 
For this, we created a ProductReviews table, where we will save the reviewer's name, its detailed product review, and the time when the review was submitted. Here, there are chances that the product review messages can be large, and it would not be a good idea to store them as they are. So, it is important to understand how to compress these messages before storing them. Let's see how to compress large data: First of all, we will write a method that accepts the string input and returns the compressed byte buffer. Here, we are using the GZIP algorithm for compressions. Java has a built-in support, so we don't need to use any third-party library for this: private static ByteBuffer compressString(String input) throws UnsupportedEncodingException, IOException { // Write the input as GZIP output stream using UTF-8 encoding ByteArrayOutputStream baos = new ByteArrayOutputStream(); GZIPOutputStream os = new GZIPOutputStream(baos); os.write(input.getBytes("UTF-8")); os.finish(); byte[] compressedBytes = baos.toByteArray(); // Writing bytes to byte buffer ByteBuffer buffer = ByteBuffer.allocate(compressedBytes.length); buffer.put(compressedBytes, 0, compressedBytes.length); buffer.position(0); return buffer; } Now, we can simply use this method to store the data before saving it in DynamoDB. Here is an example of how to use this method in our code: private static void putReviewItem() throws UnsupportedEncodingException, IOException { AmazonDynamoDBClient client = new AmazonDynamoDBClient( new ProfileCredentialsProvider()); client.setRegion(Region.getRegion(Regions.US_EAST_1)); DynamoDB dynamoDB = new DynamoDB(client); Table table = dynamoDB.getTable("ProductReviews"); Item product = new Item() .withPrimaryKey(new PrimaryKey("id", 10)) .withString("reviewerName", "John White") .withString("dateTime", "20-06-2015T08:09:30") .withBinary("reviewMessage", compressString("My Review Message")); PutItemOutcome outcome = table.putItem(product); System.out.println(outcome.getPutItemResult()); } In a similar way, we can write a method that decompresses the data on retrieval from DynamoDB. Here is an example: private static String uncompressString(ByteBuffer input) throws IOException { byte[] bytes = input.array(); ByteArrayInputStream bais = new ByteArrayInputStream(bytes); ByteArrayOutputStream baos = new ByteArrayOutputStream(); GZIPInputStream is = new GZIPInputStream(bais); int chunkSize = 1024; byte[] buffer = new byte[chunkSize]; int length = 0; while ((length = is.read(buffer, 0, chunkSize)) != -1) { baos.write(buffer, 0, length); } return new String(baos.toByteArray(), "UTF-8"); } How it works… Compressing data at client side has numerous advantages. Lesser size means lesser use of network and disk resources. Compression algorithms generally maintain a dictionary of words. While compressing, if they see the words getting repeated, then those words are replaced by their positions in the dictionary. In this way, the redundant data is eliminated and only their references are kept in the compressed string. While uncompressing the same data, the word references are replaced with the actual words, and we get our normal string back. Various compression algorithms contain various compression techniques. Therefore, the compression algorithm you choose will depend on your need. Using AWS S3 for storing large items Sometimes, we might get into a situation where storing data in a compressed format might not be sufficient enough. 
Using AWS S3 for storing large items

Sometimes, storing the data in a compressed format might still not be sufficient. Consider a case where we need to store large images or binaries that might exceed DynamoDB's per-item storage limit. In this case, we can use AWS S3 to store such items and only save the S3 location in our DynamoDB table.

AWS S3 (Simple Storage Service) allows us to store data in a cheap and efficient manner. To know more about AWS S3, you can visit http://aws.amazon.com/s3/.

Getting ready

To get started with this recipe, you should have your workstation ready with the Eclipse IDE.

How to do it…

Consider a case in our e-commerce website where we would like to store the product images along with the product data. We will save the images on AWS S3 and only store their locations along with the product information in the product table:

First of all, we will see how to store data in AWS S3. For this, we need to go to the AWS console and create an S3 bucket. Here, I created a bucket called e-commerce-product-images, and inside this bucket, I created folders to store the images, for example, /phone/apple/iphone6.

Now, let's write the code to upload the images to S3:

    private static void uploadFileToS3() {
      String bucketName = "e-commerce-product-images";
      String keyName = "phone/apple/iphone6/iphone.jpg";
      String uploadFileName = "C:\\tmp\\iphone.jpg";
      // Create an instance of the S3 client
      AmazonS3 s3client = new AmazonS3Client(new ProfileCredentialsProvider());
      // Start the file upload
      File file = new File(uploadFileName);
      s3client.putObject(new PutObjectRequest(bucketName, keyName, file));
    }

Once the file is uploaded, you can save its path in one of the attributes of the product table, as follows:

    private static void putItemWithS3Link() {
      AmazonDynamoDBClient client = new AmazonDynamoDBClient(
          new ProfileCredentialsProvider());
      client.setRegion(Region.getRegion(Regions.US_EAST_1));
      DynamoDB dynamoDB = new DynamoDB(client);
      Table table = dynamoDB.getTable("productTable");
      Map<String, String> features = new HashMap<String, String>();
      features.put("camera", "13MP");
      features.put("intMem", "16GB");
      features.put("processor", "Dual-Core 1.4 GHz Cyclone (ARM v8-based)");
      Set<String> imagesSet = new HashSet<String>();
      imagesSet.add("https://s3-us-west-2.amazonaws.com/e-commerce-product-images/phone/apple/iphone6/iphone.jpg");
      Item product = new Item()
          .withPrimaryKey(new PrimaryKey("id", 250, "type", "phone"))
          .withString("mnfr", "Apple").withNumber("stock", 15)
          .withString("name", "iPhone 6").withNumber("price", 45)
          .withMap("features", features)
          .withStringSet("productImages", imagesSet);
      PutItemOutcome outcome = table.putItem(product);
      System.out.println(outcome.getPutItemResult());
    }

So, whenever required, we can fetch the item by its key and fetch the actual images from S3 using the URL saved in the productImages attribute.

How it works…

AWS S3 provides storage services at very cheap rates. It's like a flat data dumping ground where we can store any type of file. So, it's always a good option to store large datasets in S3 and only keep their URL references in DynamoDB attributes. The URL reference is the connecting link between the DynamoDB item and the S3 file.

If your file is too large to be sent in one S3 client call, you may want to explore the multipart upload API, which allows you to send the file in chunks.
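For such large uploads, the AWS SDK's TransferManager is a convenient way to use the multipart API: it splits the file into parts and uploads them in parallel. Below is a minimal sketch under the assumption that we reuse the bucket from this recipe; the video file name and local path are purely hypothetical.

    import java.io.File;
    import com.amazonaws.auth.profile.ProfileCredentialsProvider;
    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.Upload;

    public class MultipartUploadSketch {
      private static void uploadLargeFileToS3() throws InterruptedException {
        String bucketName = "e-commerce-product-images";
        String keyName = "phone/apple/iphone6/iphone-promo-video.mp4"; // hypothetical object key
        File file = new File("C:\\tmp\\iphone-promo-video.mp4");        // hypothetical local file

        // TransferManager switches to multipart uploads for large files
        // and uploads the parts in parallel
        TransferManager tm = new TransferManager(new ProfileCredentialsProvider());
        Upload upload = tm.upload(bucketName, keyName, file);

        // Block until the upload finishes; upload.getProgress() can be polled instead
        upload.waitForCompletion();
        tm.shutdownNow();
      }
    }

As before, the resulting S3 URL would then be stored in a DynamoDB attribute such as productImages.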
Catching DynamoDB errors

Till now, we discussed how to perform various operations in DynamoDB and saw how to use the AWS-provided SDK to work with DynamoDB items and attributes. Amazon claims that AWS provides high availability and reliability, which is quite true considering my years of experience using their services, but we still cannot rule out the possibility that services such as DynamoDB might not perform as expected. So, it's important to have a proper error-catching mechanism in place so that the disaster recovery process can kick in. In this recipe, we are going to see how to catch such errors.

Getting ready

To get started with this recipe, you should have your workstation ready with the Eclipse IDE.

How to do it…

Catching errors in DynamoDB is quite easy. Whenever we perform any operation, we need to put it in a try block and add a couple of catch blocks in order to catch the errors. Here, we will consider a simple operation to put an item into the DynamoDB table:

    try {
      AmazonDynamoDBClient client = new AmazonDynamoDBClient(
          new ProfileCredentialsProvider());
      client.setRegion(Region.getRegion(Regions.US_EAST_1));
      DynamoDB dynamoDB = new DynamoDB(client);
      Table table = dynamoDB.getTable("productTable");
      Item product = new Item()
          .withPrimaryKey(new PrimaryKey("id", 10, "type", "mobile"))
          .withString("mnfr", "Samsung").withNumber("stock", 15)
          .withBoolean("isProductionStopped", true)
          .withNumber("price", 45);
      PutItemOutcome outcome = table.putItem(product);
      System.out.println(outcome.getPutItemResult());
    } catch (AmazonServiceException ase) {
      System.out.println("Error Message: " + ase.getMessage());
      System.out.println("HTTP Status Code: " + ase.getStatusCode());
      System.out.println("AWS Error Code: " + ase.getErrorCode());
      System.out.println("Error Type: " + ase.getErrorType());
      System.out.println("Request ID: " + ase.getRequestId());
    } catch (AmazonClientException e) {
      System.out.println("Amazon Client Exception: " + e.getMessage());
    }

We should first catch AmazonServiceException, which is raised if the service you are trying to access throws an exception. AmazonClientException extends it and should be put last in order to catch any remaining client-related exceptions.

How it works…

Amazon assigns a unique request ID to each and every request that it receives. Keeping this request ID is very important if something goes wrong: it is the only source of information about what happened, and you need to contact Amazon with it to learn more. There are two types of errors in AWS:

Client errors: These errors occur when the request we submit is incorrect, and they are reported with a status code starting with 4XX. They typically occur when there is an authentication failure, a bad request, a missing required attribute, or when the provisioned throughput is exceeded; in short, when the caller provides invalid input.

Server errors: These errors occur when something goes wrong on Amazon's side, and they occur at runtime. The only way to handle such errors is to retry; if that does not succeed, you should log the request ID and reach Amazon support with that ID to know more about the details.

You can read more about DynamoDB-specific errors at http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ErrorHandling.html.
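The client/server distinction can also be made in code, which helps decide whether a retry is worthwhile. The sketch below is one way to do it and is not part of the original recipe; ProvisionedThroughputExceededException and the AmazonServiceException.ErrorType enum come from the AWS SDK for Java, while the method and table names are illustrative.

    import com.amazonaws.AmazonClientException;
    import com.amazonaws.AmazonServiceException;
    import com.amazonaws.AmazonServiceException.ErrorType;
    import com.amazonaws.auth.profile.ProfileCredentialsProvider;
    import com.amazonaws.regions.Region;
    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.document.DynamoDB;
    import com.amazonaws.services.dynamodbv2.document.Item;
    import com.amazonaws.services.dynamodbv2.document.PrimaryKey;
    import com.amazonaws.services.dynamodbv2.document.Table;
    import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughputExceededException;

    public class ErrorTypeSketch {
      private static void putWithErrorClassification() {
        try {
          AmazonDynamoDBClient client = new AmazonDynamoDBClient(
              new ProfileCredentialsProvider());
          client.setRegion(Region.getRegion(Regions.US_EAST_1));
          DynamoDB dynamoDB = new DynamoDB(client);
          Table table = dynamoDB.getTable("productTable");
          table.putItem(new Item()
              .withPrimaryKey(new PrimaryKey("id", 10, "type", "mobile"))
              .withString("mnfr", "Samsung"));
        } catch (ProvisionedThroughputExceededException e) {
          // A 4XX error, but one that is usually worth retrying with back-off
          System.out.println("Throughput exceeded, retry later: " + e.getRequestId());
        } catch (AmazonServiceException ase) {
          if (ase.getErrorType() == ErrorType.Client) {
            // 4XX: the request itself is wrong, so retrying the same call will not help
            System.out.println("Client error, fix the request: " + ase.getErrorCode());
          } else {
            // 5XX: transient server-side problem, a retry is reasonable
            System.out.println("Server error, request ID: " + ase.getRequestId());
          }
        } catch (AmazonClientException ace) {
          // The service could not be reached at all (for example, a network problem)
          System.out.println("Client exception: " + ace.getMessage());
        }
      }
    }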
Performing auto-retries on DynamoDB errors

As mentioned in the previous recipe, we can auto-retry DynamoDB requests if we get errors. In this recipe, we are going to see how to perform auto-retries.

Getting ready

To get started with this recipe, you should have your workstation ready with the Eclipse IDE.

How to do it…

Auto-retries are required if we get any errors on the first request. We can use the Amazon client configuration to set our retry strategy. By default, the DynamoDB client auto-retries a request three times if an error is generated. If we think that this is not sufficient for us, we can define our own strategy, as follows:

First of all, we need to create a custom implementation of RetryCondition. It contains a method called shouldRetry, which we need to implement as per our needs. Here is a sample CustomRetryCondition class:

    public class CustomRetryCondition implements RetryCondition {
      public boolean shouldRetry(AmazonWebServiceRequest originalRequest,
          AmazonClientException exception, int retriesAttempted) {
        if (retriesAttempted < 3 && exception.isRetryable()) {
          return true;
        } else {
          return false;
        }
      }
    }

Similarly, we can implement CustomBackoffStrategy. The back-off strategy tells the client how long to wait before the request is retried. You can choose either a flat back-off time or an exponential back-off time:

    public class CustomBackoffStrategy implements BackoffStrategy {
      /** Base sleep time (milliseconds) **/
      private static final int SCALE_FACTOR = 25;
      /** Maximum exponential back-off time before retrying a request */
      private static final int MAX_BACKOFF_IN_MILLISECONDS = 20 * 1000;

      public long delayBeforeNextRetry(AmazonWebServiceRequest originalRequest,
          AmazonClientException exception, int retriesAttempted) {
        if (retriesAttempted < 0)
          return 0;
        long delay = (1 << retriesAttempted) * SCALE_FACTOR;
        delay = Math.min(delay, MAX_BACKOFF_IN_MILLISECONDS);
        return delay;
      }
    }

Next, we need to create an instance of RetryPolicy and set the RetryCondition and BackoffStrategy classes that we created. Apart from this, we can also set the maximum number of retries. The last parameter is honorMaxErrorRetryInClientConfig; it indicates whether this retry policy should honor the maximum error retry count set by ClientConfiguration.setMaxErrorRetry(int):

    RetryPolicy retryPolicy = new RetryPolicy(customRetryCondition,
        customBackoffStrategy, 3, false);

Now, initiate the ClientConfiguration, and set the RetryPolicy we created earlier:

    ClientConfiguration clientConfiguration = new ClientConfiguration();
    clientConfiguration.setRetryPolicy(retryPolicy);

Finally, we need to pass this client configuration when we initiate the AmazonDynamoDBClient; once done, your retry policy with a custom back-off strategy will be in place:

    AmazonDynamoDBClient client = new AmazonDynamoDBClient(
        new ProfileCredentialsProvider(), clientConfiguration);

How it works…

Auto-retries are quite handy when we receive a sudden burst of DynamoDB requests. If there are more requests than the provisioned throughput can serve, auto-retries with an exponential back-off strategy will definitely help in handling the load: if the client gets an exception, the request is retried automatically after some time, and if the load has dropped by then, your application loses nothing. The Amazon DynamoDB client internally uses HttpClient to make the calls, which is quite a popular and reliable implementation. So, if you need to handle such cases, this kind of implementation is a must.

In case of batch operations, if any failure occurs, DynamoDB does not fail the complete operation. For batch write operations, if a particular write fails, DynamoDB returns the unprocessed items, which can then be retried, as shown in the sketch that follows.
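As a concrete example of that last point, here is a minimal sketch of resubmitting the unprocessed items returned by a batch write. It is not from the original recipe: the table name, the placeholder item, and the bare retry loop (with no sleep) are illustrative, and in production you would add a back-off between iterations.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import com.amazonaws.auth.profile.ProfileCredentialsProvider;
    import com.amazonaws.regions.Region;
    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.model.AttributeValue;
    import com.amazonaws.services.dynamodbv2.model.BatchWriteItemResult;
    import com.amazonaws.services.dynamodbv2.model.PutRequest;
    import com.amazonaws.services.dynamodbv2.model.WriteRequest;

    public class BatchRetrySketch {
      private static void batchWriteWithRetry() {
        AmazonDynamoDBClient client = new AmazonDynamoDBClient(
            new ProfileCredentialsProvider());
        client.setRegion(Region.getRegion(Regions.US_EAST_1));

        // Build a single put request for the productTable (placeholder item)
        Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();
        item.put("id", new AttributeValue().withN("500"));
        item.put("type", new AttributeValue("phone"));
        WriteRequest write = new WriteRequest(new PutRequest(item));

        Map<String, List<WriteRequest>> requestItems =
            new HashMap<String, List<WriteRequest>>();
        requestItems.put("productTable", Arrays.asList(write));

        // Keep resubmitting whatever DynamoDB reports as unprocessed
        BatchWriteItemResult result = client.batchWriteItem(requestItems);
        while (!result.getUnprocessedItems().isEmpty()) {
          // In production, sleep with exponential back-off before retrying
          result = client.batchWriteItem(result.getUnprocessedItems());
        }
      }
    }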
Performing atomic transactions on DynamoDB tables

I hope we are all aware that operations in DynamoDB are eventually consistent. Considering this nature, it obviously does not support transactions the way we do in an RDBMS. A transaction is a group of operations that need to be performed in one go and handled in an atomic manner: if one operation fails, the complete transaction should be rolled back. There might be use cases where you need to perform transactions in your application. Considering this need, AWS has provided an open source, client-side transaction library, which helps us achieve atomic transactions in DynamoDB. In this recipe, we are going to see how to perform transactions on DynamoDB.

Getting ready

To get started with this recipe, you should have your workstation ready with the Eclipse IDE.

How to do it…

To get started, we will first need to download the source code of the library from GitHub and build it to generate the JAR file. You can download the code from https://github.com/awslabs/dynamodb-transactions/archive/master.zip. Next, extract the code and run the following command to generate the JAR file:

    mvn clean install -DskipTests

On a successful build, you will find the generated JAR file in the target folder. Add this JAR to your project by choosing Configure Build Path in Eclipse.

Now, let's understand how to use transactions. For this, we need to create the DynamoDB client and use it to create two helper tables. The first table is the Transactions table to store the transactions, while the second is the TransactionImages table to keep snapshots of the items modified in a transaction:

    AmazonDynamoDBClient client = new AmazonDynamoDBClient(
        new ProfileCredentialsProvider());
    client.setRegion(Region.getRegion(Regions.US_EAST_1));
    // Create the transactions table
    TransactionManager.verifyOrCreateTransactionTable(client,
        "Transactions", 10, 10, (long) (10 * 60));
    // Create the transaction images table
    TransactionManager.verifyOrCreateTransactionImagesTable(client,
        "TransactionImages", 10, 10, (long) (60 * 10));

Next, we need to create a transaction manager by providing the names of the tables we created earlier:

    TransactionManager txManager = new TransactionManager(client,
        "Transactions", "TransactionImages");

Now, we create one transaction and perform the operations we need to do in one go. Consider our product table, where we need to add two new products in one single transaction; the changes will be reflected only if both operations are successful. We can perform this using a transaction, as follows:

    Transaction t1 = txManager.newTransaction();

    Map<String, AttributeValue> product = new HashMap<String, AttributeValue>();
    AttributeValue id = new AttributeValue();
    id.setN("250");
    product.put("id", id);
    product.put("type", new AttributeValue("phone"));
    product.put("name", new AttributeValue("MI4"));
    t1.putItem(new PutItemRequest("productTable", product));

    Map<String, AttributeValue> product1 = new HashMap<String, AttributeValue>();
    // Use a fresh AttributeValue here; mutating the one above would also
    // change the id of the first item
    AttributeValue id1 = new AttributeValue();
    id1.setN("350");
    product1.put("id", id1);
    product1.put("type", new AttributeValue("phone"));
    product1.put("name", new AttributeValue("MI3"));
    t1.putItem(new PutItemRequest("productTable", product1));

    t1.commit();

Now, execute the code to see the results. If everything goes fine, you will see two new entries in the product table. In case of an error, neither of the entries will be in the table.
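If one of the operations inside the transaction throws, you will usually want to roll the whole transaction back explicitly rather than leave it for the library to clean up later. A minimal sketch of that pattern is shown below; it is not from the original recipe, it assumes the txManager created in the previous step, and the package names and the rollback() call are assumptions based on the transaction library's published API.

    import java.util.HashMap;
    import java.util.Map;
    import com.amazonaws.services.dynamodbv2.model.AttributeValue;
    import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
    import com.amazonaws.services.dynamodbv2.transactions.Transaction;
    import com.amazonaws.services.dynamodbv2.transactions.TransactionManager;

    public class TransactionRollbackSketch {
      private static void putProductAtomically(TransactionManager txManager) {
        Transaction t1 = txManager.newTransaction();
        try {
          Map<String, AttributeValue> product = new HashMap<String, AttributeValue>();
          product.put("id", new AttributeValue().withN("450"));
          product.put("type", new AttributeValue("phone"));
          product.put("name", new AttributeValue("MI5"));
          t1.putItem(new PutItemRequest("productTable", product));

          // ... any other operations belonging to the same transaction ...

          t1.commit();
        } catch (RuntimeException e) {
          // Undo any writes the transaction has already applied
          t1.rollback();
          throw e;
        }
      }
    }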
How it works…

When invoked, the transaction library first writes the changes to the Transactions table and then to the actual table. If we perform any update item operation, it keeps the old values of that item in the TransactionImages table. It also supports multi-attribute and multi-table transactions. In this way, we can use the transaction library to perform atomic writes. It also supports isolated reads. You can refer to the code and examples for more details at https://github.com/awslabs/dynamodb-transactions.

Performing asynchronous requests to DynamoDB

Till now, we have used a synchronous DynamoDB client to make requests. A synchronous request blocks the thread until the operation is completed, and due to network issues, it can sometimes take a while for an operation to complete. In such cases, we can make asynchronous client requests instead: we submit the request and carry on with other work.

Getting ready

To get started with this recipe, you should have your workstation ready with the Eclipse IDE.

How to do it…

The asynchronous client is easy to use:

First, we need to instantiate the AmazonDynamoDBAsync client:

    AmazonDynamoDBAsync dynamoDBAsync = new AmazonDynamoDBAsyncClient(
        new ProfileCredentialsProvider());

Next, we need to create the request to be performed in an asynchronous manner. Let's say we need to delete a certain item from our product table. Then, we can create the DeleteItemRequest, as shown in the following code snippet:

    Map<String, AttributeValue> key = new HashMap<String, AttributeValue>();
    AttributeValue id = new AttributeValue();
    id.setN("10");
    key.put("id", id);
    key.put("type", new AttributeValue("phone"));
    DeleteItemRequest deleteItemRequest = new DeleteItemRequest(
        "productTable", key);

Next, invoke the deleteItemAsync method to delete the item. Here, we can optionally define an AsyncHandler if we want to use the result of the request we invoked. The messages are printed with timestamps so that we can confirm the asynchronous nature of the calls:

    dynamoDBAsync.deleteItemAsync(deleteItemRequest,
        new AsyncHandler<DeleteItemRequest, DeleteItemResult>() {
          public void onSuccess(DeleteItemRequest request, DeleteItemResult result) {
            System.out.println("Item deleted successfully: "
                + System.currentTimeMillis());
          }
          public void onError(Exception exception) {
            System.out.println("Error deleting item in async way");
          }
        });
    System.out.println("Delete item initiated: " + System.currentTimeMillis());

How it works…

The asynchronous client is a wrapper on top of Java's concurrency APIs: it submits each call to a thread pool and exposes the result through a Future, which makes it quite easy to use and understand. The AsyncHandler is an optional argument you can pass in order to consume the results of asynchronous calls; alternatively, you can use the returned Java Future object to handle the response.

Summary

We have covered various recipes on the cost- and performance-efficient use of DynamoDB. Recipes such as error handling and auto-retries help readers make their applications robust. We also highlighted the use of the transaction library in order to implement atomic transactions on DynamoDB.